Abstract

Few-shot segmentation is a challenging task because only a few annotations provide limited class cues. Discovering more class cues from both known and unknown classes is essential to few-shot segmentation. Existing methods generate class cues mainly from common cues within the new classes, where the similarity between support images and query images is measured to locate the foreground regions. However, the support images are not sufficient to measure this similarity, since one or a few support masks cannot describe an object of a new class with large variations. In this paper, we capture the class cues by considering all images of the unknown classes, i.e., not only the support images but also the query images are used to capture the foreground regions. Moreover, the class-level labels of the known classes are also considered to capture the discriminative features of the new classes. The two aspects are achieved by the class activation map, which is used as an attention map to improve feature extraction. A new few-shot segmentation method based on mask transferring and class activation maps is proposed, and a feature-clustering-based refinement is proposed to expand the class activation map. The proposed method is validated on the PASCAL VOC dataset. Experimental results demonstrate the effectiveness of the proposed method with larger mIoU values.

1. Introduction

Image segmentation [1] aims to segment object regions from images, which is fundamental to many computer vision tasks [2]. With deep learning-based methods [3–7], existing segmentation models can segment objects well when sufficient annotations are given [8]. However, existing segmentation methods still have two drawbacks. First, annotation generation is time consuming; the number of annotations is usually so small that it is hard to train segmentation models from a few annotations. Second, segmentation models perform poorly on new classes, i.e., they only recognize the objects in the training dataset and cannot segment regions of unknown classes.

To address these drawbacks, few-shot segmentation [9–14] has been proposed. Given a set of images of new classes with a few annotations (the support images), the aim of few-shot segmentation is to segment the regions of the query images efficiently. However, the intuitive approach of fine-tuning the segmentation model with a few annotations has proved ineffective. Few-shot segmentation faces the challenge of discovering object cues from limited annotations. To this end, researchers have proposed many methods to enhance few-shot segmentation [15–17]. These methods can be summarized as providing segmentation cues from existing annotations of known classes, where the annotations are sufficient to train the model. A class-agnostic guidance model that transfers segmentation cues from the support mask to the query mask is trained first and is then used in the inference stage to locate the foreground regions in the query image directly. Several strategies such as mask transferring and prototype features are used, and few-shot segmentation has been improved considerably.

Meanwhile, few-shot segmentation still suffers from a lack of object priors, even though many existing annotation datasets are used. Two reasons cause this challenge. First, there is large inter-class variation, which makes knowledge transfer between known classes and new classes difficult. Second, there is large intra-class variation, so a few annotations cannot describe all the appearances of a class, which leads to poor guidance. In other words, the foreground priors are still limited under the current few-shot segmentation paradigm.

In this paper, we propose a new few-shot segmentation method that considers two aspects, namely, inter-class cues and intra-class cues, to capture more sufficient segmentation cues from known and unknown classes. The first aspect captures the semantic relationships between the existing classes and the unknown classes and is used to capture discriminative cues by comparing existing classes with unknown classes. The second aspect captures the common cues within a class, that is, the common features shared by the query and support images are captured to locate the object. The two aspects are achieved by class activation maps (CAMs). A classification model considering only class-level labels is built first. Then, the class activation map is extracted based on feedback analysis. Afterwards, since the discriminative regions are usually small, we expand the discriminative region using a feature clustering method guided by the support masks. Finally, the CAM is introduced into the few-shot segmentation network as an attention map to enhance the query image segmentation.

The contributions of the proposed method are listed as follows: (1) a new few-shot segmentation method based on inter-class and intra-class segmentation cues is proposed; (2) the class activation map is used to capture the segmentation cues, and a new attention module is proposed to add the class activation map into the few-shot segmentation network; (3) an extension method based on clustering is proposed to enlarge the class activation map.

2. Related Work

Few-shot segmentation aims to segment regions of new classes given a few annotated images, which is a fundamental task in computer vision [18, 19]. The few-shot segmentation task is usually formulated as an information guidance model, where the common knowledge that can be used in the segmentation task is learned in the support branch and transferred between the support branch and the query branch. There are two key components in existing few-shot segmentation methods: the first is a class prior extraction module in the support branch, and the second is a guidance network that transfers the extracted knowledge between the branches.

As for class prior extraction, multiple types of class prior have been proposed, which can be further categorized into weight-based methods and prototype-based methods. The weight-based methods consider the weights of a classifier as the class prior. The most representative work among the weight-based methods is OSLSM [10], which leverages a conditional branch to generate parameters for the query branch.

The current state-of-the-art methods are prototype-based. The prototype-based methods can be further divided into methods using a global prototype, a fusion of global and local prototypes, and a background prototype. The global prototype-based methods convert the deep features from the support branch into a class prototype, e.g., PANet [20] and CANet [21] learn a class-specific global prototype with a masked average pooling operation.

The second type of prototype-based method takes global and local prototypes into consideration simultaneously and can extract features with more semantic knowledge. The most representative methods are PPNet [22] and PMMs [22], where the former decomposes the holistic class representation into a set of part-aware prototypes with k-means and the latter correlates diverse image regions with multiple prototypes to enforce the prototype-based representation with the aid of the EM algorithm.

The third type of prototype-based method employs the background prototype to enhance the semantic knowledge of the foreground. The most representative methods are MLCNet [23] and SCNet [24], where the first introduces a mining branch that exploits latent novel classes via transferable subclusters and the second generates self-contrastive background prototypes directly from the query image, enabling the construction of complete sample pairs to form a complementary and auxiliary segmentation task.

As for the design of the guidance network between the support branch and the query branch, multiple types of guidance module have been proposed, which can be further categorized into feature-level guidance and parameter-level guidance. Feature-level guidance conducts similarity propagation based on the features extracted by the two branches. Representative methods include PFENet [25], which generates a prior mask based on the cosine similarity between features and then employs a feature enrichment module to propagate this similarity at multiple resolutions, and LTM [26], which proposes a nonparametric, class-agnostic transformation where the relationships of the local features are calculated in a high-dimensional metric embedding space based on cosine distance and then mapped from low-level local relationships to high-level semantic cues with the generalized inverse matrix of the annotation matrix. The parameter-level guidance network considers the model parameters of the last class-specific layer as the class prior and uses parameter transformation from the support branch to the query branch to achieve the guidance. The most representative work is CWT [27], where the guidance is conducted at the classification layer only; it proposes a Classifier Weight Transformer to dynamically adapt the support-set-trained classifier's weights to each query image in an inductive way.

3. The Proposed Method

3.1. The Pipeline of the Proposed Method

The pipeline of the proposed method is shown in Figure 1. The proposed method consists of four steps: the classification step, the CAM generation step, the mask generation step, and the mask refinement step. The classification step trains a classification network on all of the existing classes and the new classes using image-level labels only and outputs the class activation map, which represents the discriminative regions of the unknown classes, via gradient feedback. Then, since the initial CAM is usually very small, the CAM generation step expands the CAM using a clustering strategy. Afterwards, the mask generation step generates the segmentation mask in terms of soft values based on a mask transferring strategy, where the CAM generated in the second step is used as an attention map to enhance the features of the query image. Finally, the mask refinement step improves the segmentation mask based on a classical segmentation framework. We next detail the four steps.

3.2. Classification Step

The aim of the classification step is to train a classification model on both the known classes and the new classes and to extract the discriminative regions of the new classes that distinguish them from the existing classes. Therefore, a rough location of the new classes can be obtained in the query image.

Specifically, a training dataset consisting of the existing classes and the new classes is constructed first: it combines the images of all existing classes with the image set of the new classes. Based on all classes, a classification network is trained, and the class activation map is extracted using the Grad-CAM method.
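A minimal Grad-CAM sketch of this step is given below, assuming a ResNet-50 classifier trained on the image-level labels of the known and new classes; the hooked layer (layer4), the helper names, and the normalization are illustrative choices rather than the exact implementation details.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Sketch: extract a Grad-CAM map for a target class from a classifier
# trained on image-level labels of the known and new classes.
model = models.resnet50(pretrained=True)  # stand-in for the trained classifier
model.eval()

features, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(value=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(value=go[0]))

def grad_cam(image, target_class):
    """image: (1, 3, H, W) tensor; returns a CAM normalized to [0, 1]."""
    logits = model(image)                          # forward pass stores layer4 features
    model.zero_grad()
    logits[0, target_class].backward()             # backward pass stores gradients

    fmap, grad = features['value'], grads['value']         # (1, C, h, w) each
    weights = grad.mean(dim=(2, 3), keepdim=True)          # channel importance
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:],
                        mode='bilinear', align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8))[0, 0]
```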

Meanwhile, the highlighted regions are usually small because coarse image-level labels cannot cover the whole object region but only a small, highly discriminative area. The next step therefore expands the highlighted region using feature clustering.

3.3. CAM Generation Step

The CAM generation step expands the class activation map based on the idea that the regions located by the initial step can be treated as the class center, and the remaining pixels that are similar to the highlighted region can be treated as object regions. Therefore, we use a clustering method to obtain the similar pixels.

Specifically, the CAM generation step consists of three substeps: pixel clustering, cluster selection, and CAM generation. In the first substep, K-means clustering [28] is used to group the pixels into clusters based on the deep features obtained in the classification step. For each cluster, every pixel carries its activation value in the class activation map, and the mean value of the cluster is obtained by averaging these activation values. The mean value represents the importance of the cluster for the class, and it is assigned as the activation value of all pixels in the cluster. Thus, a new class activation map is obtained.
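A minimal sketch of this clustering-based expansion is given below, assuming the pixel features and the initial CAM have been upsampled to a common resolution; the cluster count and the use of scikit-learn's KMeans are illustrative choices, and the cluster selection substep is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def expand_cam(cam, feat, n_clusters=8):
    """Expand the initial CAM by feature clustering (sketch).

    cam:  (H, W) initial class activation map in [0, 1]
    feat: (H, W, C) deep features from the classification backbone,
          upsampled to the CAM resolution
    n_clusters: illustrative cluster count
    """
    h, w, c = feat.shape
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feat.reshape(-1, c))

    cam_flat = cam.reshape(-1)
    expanded = np.zeros_like(cam_flat)
    for k in range(n_clusters):
        idx = labels == k
        expanded[idx] = cam_flat[idx].mean()   # cluster-wise mean activation
    return expanded.reshape(h, w)
```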

3.4. Mask Generation Step

The mask generation step segments the foreground regions of the query image based on the class activation map. Here, a few-shot segmentation network based on mask transferring is used, and the class activation map is embedded into the network to enhance the guidance.

3.4.1. Few-Shot Segmentation Network

The few-shot segmentation network is constructed following the method in [26], of which the idea is to obtain the query mask \(Y_q\) (with size \(hw \times 1\)) based on the relationship

\[A = Y_q Y_s^{\top}, \quad (1)\]

where \(Y_q\) and \(Y_s\) are the query mask and the support mask, respectively, and both masks are reshaped into column vectors. \(A\) is the matrix product of \(Y_q\) and \(Y_s^{\top}\), with size \(hw \times hw\). A value of one in \(A\) means that the corresponding entries in \(Y_q\) and \(Y_s\) are both one; otherwise, the value in \(A\) is zero.

Once \(A\) is known, the query mask can be obtained by

\[Y_q = A\,(Y_s^{\top})^{+}, \quad (2)\]

where \((\cdot)^{+}\) denotes the generalized inverse.

Thus, the few-shot segmentation problem reduces to estimating the matrix product \(A\), which can be estimated from the feature similarity of the pixels in the support and query images, i.e., foreground pixels have similar features and a similarity of value one; otherwise, the similarity is value zero.

Based on the formulation above, the few-shot segmentation network can be constructed as a two-branch network with a guidance module based on formula (2). The network is shown in Figure 1.

Specifically, given a support image \(I_s\) with support mask \(Y_s\) and a query image \(I_q\), a two-branch network is used to extract pixel features. One branch is the support branch, which extracts the features \(F_s\) of the support image, and the other is the query branch, which extracts the features \(F_q\) of the query image. Then, the similarity matrix \(S\) of \(F_q\) and \(F_s\) is calculated with the cosine distance

\[S_{ij} = \cos\left(F_q^{i}, F_s^{j}\right), \quad (3)\]

where \(S_{ij}\) is the value at location \((i, j)\), \(F_q^{i}\) and \(F_s^{j}\) are the \(i\)th feature in \(F_q\) and the \(j\)th feature in \(F_s\), and \(\cos(\cdot,\cdot)\) is the cosine distance. Therefore, \(S\) measures the similarity of pixels, which parallels the similarity relationship of the masks, and can be used to estimate the matrix product \(A\).

Then, \(S\) is used to estimate the matrix product \(A\), which in turn is used to obtain the query mask via (2).

Note that estimating \(A\) from the pixel features alone is challenging. Thus, we use the support mask to filter the foreground regions via element-wise multiplication.
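A minimal sketch of this mask-transfer guidance is given below, assuming the support and query features share the same spatial size; the function and variable names are illustrative, and the masked cosine-similarity matrix plays the role of the estimated matrix product \(A\) in (1) and (2).

```python
import torch
import torch.nn.functional as F

def transfer_mask(feat_q, feat_s, mask_s):
    """Soft query mask via feature-similarity mask transfer (sketch).

    feat_q, feat_s: (C, h, w) query and support features
    mask_s:         (h, w) binary support mask
    """
    c, h, w = feat_q.shape
    fq = F.normalize(feat_q.reshape(c, -1), dim=0)   # (C, hw), unit-norm per pixel
    fs = F.normalize(feat_s.reshape(c, -1), dim=0)

    sim = fq.t() @ fs                                # (hw, hw) cosine similarities,
                                                     # an estimate of A in Eq. (1)
    ys = mask_s.reshape(-1).float()
    sim = sim * ys.unsqueeze(0)                      # keep support foreground columns

    yq = sim @ ys / (ys.sum() + 1e-8)                # Eq. (2): A (Y_s^T)^+
    return yq.reshape(h, w).clamp(0, 1)
```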

3.4.2. Feature Enhancement via CAM

Different from the few-shot segmentation method in [26], we introduce the class activation map, which carries the discriminative cues across all classes, to enhance the features of the query image. Specifically, as shown in Figure 1, the query image is fed into the classification network to form the initial class activation map. Then, the clustering algorithm is used to refine the class activation map. The refined class activation map is used as an attention map to re-weight the deep features of the query image, and the re-weighted features guide the segmentation of the query image.
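A minimal sketch of the CAM attention is given below, reusing the feature layout of the previous sketch; element-wise re-weighting is one simple way to realize the attention map and is an assumption rather than the only possible design.

```python
import torch
import torch.nn.functional as F

def cam_attention(feat_q, cam):
    """Re-weight query features with the expanded CAM (sketch).

    feat_q: (C, h, w) query features
    cam:    (H, W) expanded class activation map in [0, 1]
    """
    cam = torch.as_tensor(cam, dtype=feat_q.dtype, device=feat_q.device)
    cam = F.interpolate(cam[None, None], size=feat_q.shape[-2:],
                        mode='bilinear', align_corners=False)[0, 0]
    # discriminative regions of the new class get larger weights, so they
    # dominate the subsequent mask transfer
    return feat_q * cam.unsqueeze(0)
```

In this sketch, the re-weighted features would simply replace feat_q in the transfer_mask sketch above before the query mask is estimated.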

3.5. Mask Refinement Step

The output of the few-shot segmentation branch is a soft mask of the query image. To obtain a binary mask, a threshold could be applied to the soft mask; however, the result is sensitive to the choice of threshold. Therefore, a segmentation network is used to produce the final mask from the soft mask, where the soft mask serves as a foreground probability map and the segmentation network outputs the final hard segmentation mask. We use the method in [8] to implement the mask refinement.

4. Experimental Results

4.1. Dataset

We verify the proposed method on the PASCAL VOC dataset, which consists of 20 classes. As in existing few-shot segmentation methods, the 20 classes are split into two class sets: a training set used to train the few-shot segmentation network and a test set used to validate the segmentation quality of the network. To fully validate the few-shot segmentation model, four splits are used. The details are given in Table 1.

4.2. Implementation Details

We implement our method with PyTorch on a Titan XP GPU. The network is optimized by the Adam optimizer with an initial learning rate of 1e-4. Several backbones, namely VGG16, ResNet-50, and ResNet-101, are used for thorough evaluation. The backbones are pretrained on ImageNet [29].

4.3. Subjective Results

We first show some subjective results in Figure 2, where the input images, the prediction results, and the ground-truth masks are displayed. The predictions are close to the ground truth, which demonstrates that our method can segment these images of new classes successfully.

4.4. Objective Results

We objectively evaluate the proposed method with the mIoU and FB-IoU metrics that are commonly used for few-shot segmentation evaluation. The results are shown in Table 2, where 1-shot and 5-shot denote few-shot segmentation with one and five support annotations, respectively. Three backbones, VGG16, ResNet-50, and ResNet-101, are considered. ResNet-101 obtains the best results owing to its deeper architecture. The results with ResNet-50 are better than those with VGG16, which is also explained by the deeper network capturing more semantic features.
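For reference, one common way to compute the class-wise mIoU for binary few-shot masks is sketched below; the accumulation protocol (summing intersections and unions per class before averaging over the test classes) is an assumption and may differ in detail from the exact evaluation code.

```python
import numpy as np

def mean_iou(preds, gts, class_ids):
    """Class-wise mIoU for binary few-shot masks (sketch of one common protocol).

    preds, gts: lists of (H, W) binary masks over all test episodes
    class_ids:  the class label of each (pred, gt) pair
    """
    inter, union = {}, {}
    for pred, gt, c in zip(preds, gts, class_ids):
        inter[c] = inter.get(c, 0) + np.logical_and(pred, gt).sum()
        union[c] = union.get(c, 0) + np.logical_or(pred, gt).sum()
    # average the per-class IoU over all test classes
    return float(np.mean([inter[c] / (union[c] + 1e-8) for c in inter]))
```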

4.5. Comparison with Existing Methods

We also compare our method with existing state-of-the-art methods. The comparison results are displayed in Table 2. Our method outperforms these comparison methods, which demonstrates its effectiveness. In particular, our method can be considered an improvement of the method in [26], and the fact that it is better than [26] demonstrates the effectiveness of our strategy of introducing the class activation map to capture both inter-class and intra-class cues.

4.6. The Ablation Study

We next show the ablation results. The initial CAM and the improved CAM are considered for the ablation study, with ResNet-50 as the backbone. The results are shown in Table 3 in terms of mIoU values. The original CAM already leads to an improvement, and our improved CAM enhances the results further, which demonstrates that the clustering strategy is a useful way to enlarge CAM regions.

5. Discussion

Existing few-shot segmentation methods usually focus on learning a class-agnostic model, which operates at the inter-class level only. Such a class-agnostic model generalizes well to new classes but also lacks the class cues of the new classes. Based on the existing class-agnostic model, we add new segmentation cues through the discriminative cues between classes and the common cues within classes, i.e., at both the inter-class and intra-class level. Therefore, better segmentation results can be obtained by our method.

Our method builds on the method in [26] (LTM), which proposed an interesting few-shot segmentation approach that estimates the relationship matrix of the masks. However, our method differs from LTM [26]. First, our main contribution is using the class activation map to capture inter-class and intra-class segmentation cues, which is not considered in [26]. Second, an attention module is added to LTM, which injects the CAM segmentation cues to enhance the segmentation. Therefore, our method can be considered an extension of LTM [26] with better segmentation results.

6. Conclusion

This paper proposed a new few-shot segmentation method that uses the class activation map to enhance the generation of object priors by considering the common cues within classes and the discriminative cues between classes. The proposed network consists of four steps: the classification step, the CAM generation step, the mask generation step, and the mask refinement step, which are used, respectively, to generate the initial CAM via classification, to expand the CAM via feature clustering, to generate the segmentation mask, and to refine the segmentation mask. The proposed method is validated on the PASCAL VOC dataset. The experimental results demonstrate that considering the common cues within classes and the discriminative cues between classes enhances few-shot segmentation in terms of larger mIoU values.

Data Availability

The datasets used for validation are available from https://host.robots.ox.ac.uk/pascal/VOC/. The detailed results are listed in the paper. More results are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This paper was partially supported by The National Natural Science Foundation of China (51577086); Jiangsu Six Talent Peaks (TD-XNY004); and Jiangsu Major Scientific Research Projects in Colleges and Universities (19KJA510012).