Abstract

Person re-identification, which aims to identify images of the same pedestrian across disjoint camera views, is a key technique in intelligent video surveillance. Although existing methods have advanced both theory and experimental results, most of the effective ones rely on fully supervised training, which suffers badly from the small sample size (SSS) problem, especially in label-insufficient practical applications. To bridge the SSS problem and model learning with few labels, a novel semisupervised co-metric learning framework is proposed to learn a discriminative Mahalanobis-like distance matrix for label-insufficient person re-identification. Unlike a typical co-training task, whose data are multiview by nature, single-view person images are first decomposed into two pseudo views, and then metric learning models are produced and jointly and iteratively updated based on both pseudo-labels and references. Experiments carried out on three representative person re-identification datasets show that the proposed method performs better than the state of the art and has low label sensitivity.

1. Introduction

Person re-identification (re-id), namely, seeking occurrences of a query person (probe) among person candidates (gallery), is a popular and challenging topic in intelligent video surveillance [1, 2], and it also underpins many crucial multimedia applications, such as person retrieval [3, 4], long-term pedestrian tracking [5, 6], and cross-view action analysis [7]. The main challenge of re-id is that intrapersonal visual variations across camera views can be even larger than interpersonal ones, owing to significant changes in viewpoint, illumination, body pose, and background clutter (see Figure 1). Moreover, traditional biometrics such as gait and face are unreliable in uncontrolled practical environments; thus researchers usually carry out person re-identification based on body appearance characteristics. Current person re-identification methods fall mainly into two categories: feature construction and learning, and subspace/metric learning. With growing attention from the computer vision and machine learning communities in recent years, both the theory and the experimental results of person re-identification have improved greatly.

(i) Feature construction and learning aim at designing or learning discriminative appearance descriptions [8–20] that are robust for distinguishing different pedestrians across arbitrary cameras. However, hand-crafted feature construction is extremely challenging because of the miscellaneous and complicated variations. Therefore, feature learning based on salience models, deep neural networks, and so forth has become a popular approach to obtaining better feature representations.

(ii) Subspace and metric learning aim at seeking a proper subspace or distance measure by Mahalanobis-like metric learning [21–30]. Given a set of person image pairs, metric learning based methods learn an optimal positive semidefinite matrix (so that the metric is valid) that maximizes the probability that a true matching pair has a smaller distance than a wrong matching pair.

Whether based on feature learning or metric learning, state-of-the-art methods usually exploit the labelled training data as far as possible and are therefore typically fully supervised. However, labels are often insufficient in practical applications, so that the number of labelled training samples can be even smaller than the feature dimension; this is the small sample size (SSS) problem [31], a core challenge of learning based person re-identification. To address the SSS issue, many training styles have been designed for noisy learning and inadequate supervision [32], and co-training remains one of the most important paradigms and is still actively used for multiview learning [33]. Therefore, motivated by semisupervised co-training [34], we propose a novel co-metric learning framework for person re-identification to bridge inadequate labelled data and the metric learning model.

In typical co-training, the training data are used to learn classification models in two views separately, while the update of each model benefits from the other view. However, unlike applications where data are collected from multimodal sources, person re-identification datasets are commonly presented as single-view pedestrian images. The core difficulty of applying the co-training paradigm to person re-identification therefore lies in learning and updating models from a single view. Higher-dimensional features carry more useful information but also more noise, so dimension reduction is usually necessary during feature extraction. If we decompose the high-dimensional features into two views before dimension reduction, it is likely that two different but effective descriptions are obtained, which can serve as pseudo views for our co-metric learning framework. Therefore, we first present a binary-weight learning method that automatically splits the single-view representation into two pseudo views; then a metric learning model is learned in each view for matching the unlabelled training samples; finally, the metrics benefit from each other and are jointly and iteratively updated based on the ranking lists of the unlabelled samples.

The main contributions of this paper can be summarized as follows: (1) An effective co-metric learning framework is presented for semisupervised person re-identification; it can learn a discriminative Mahalanobis-like distance matrix even without adequate labelled data. (2) Two pseudo views of person data, obtained by self-adaptive feature decomposition, are used to generate the metrics. (3) Both pseudo-labels and references on the unlabelled dataset are adopted to obtain discriminative metric updates. The rest of the paper is organized as follows. Section 2 gives a brief review of related work on person re-identification. Section 3 explains our method in detail. Section 4 presents experimental results compared with the state of the art on three datasets. Section 5 concludes this paper.

2. Related Work

In this section, we give a brief review of the studies most related to the person re-identification task. Typically, current person re-identification research can be categorized into two classes: feature representation based methods and distance measure based methods.

Feature representation based methods focus on constructing discriminative visual descriptions through feature selection or learning. Gheissari et al. [8] generated salient edges with a spatial-temporal segmentation algorithm and then obtained an invariant identity signature by combining normalized color and salient edge histograms. Wang et al. [9] designed a co-occurrence matrix based appearance model to capture the spatial distribution of the appearance relative to each of the object parts. Farenzena et al. [10] combined multiple features from five body regions found by symmetry and asymmetry perceptual principles. Kviatkovsky et al. [11] found that color structure descriptors derived from different body parts are invariant under different lighting conditions. To improve the discriminative power of visual descriptions, feature selection techniques are used to pick out more robust feature weightings, dimensions, or patch salience. Gray et al. [12] transformed person re-identification into a classification problem and employed an ensemble of localized features selected by the AdaBoost algorithm. Zhao et al. [13] applied adjacency constrained patch matching to build dense correspondence between image pairs and assigned salience to each patch in an unsupervised manner. Some recent works introduce deep learning frameworks to acquire robust local feature representations and then encode them. Li et al. [14] learned a unified deep filter by introducing a patch matching layer and a max-out grouping layer for person re-identification. Ahmed et al. [15] presented a deep convolutional architecture that captures local relationships between person images based on mid-level features. In general, deep learning is used in person re-identification to learn feature representations either from deep convolutional features [14–17] or from fully connected features [18–20].

Distance measure based methods aim at finding a uniform distance measure through subspace learning or metric learning. The most successful metric learning algorithms rely on supervised learning. Hirzer et al. [21] and Dikmen et al. [22] used a classical metric learning method called LMNN to learn an optimal metric for person re-identification. Zheng et al. [23] learned a Mahalanobis distance metric with a probabilistic relative distance comparison method. Kostinger et al. [24] introduced a simpler metric function (KISSME) to fit pairwise samples based on a Gaussian distribution hypothesis, and Tao et al. [25] obtained better estimates of the covariance matrices of KISS metric learning by seamlessly integrating smoothing and regularization. Mignon et al. [26] learned a distance metric from sparse pairwise similarity/dissimilarity constraints in a high-dimensional space, a method called pairwise constrained component analysis. Pedagadi et al. [27] proposed a metric-like method that combines unsupervised PCA dimensionality reduction with Local Fisher Discriminant Analysis. Li et al. [28] proposed to learn a decision function that jointly models a distance metric and a locally adaptive thresholding rule. Deep learning, as the most popular machine learning paradigm, has also been adopted to learn the distance metric, as by Wang et al. [29]. Wang et al. [30] put forward a data-driven distance metric method that re-exploits the training data to adjust the metric for each query-gallery pair.

3. Methodology

This section presents the main procedures of our co-metric learning framework (see Figure 2), which consist of self-adaptive feature decomposition for pseudo two-view metric learning and a semisupervised metric update based on pseudo-labels and references.

3.1. Problem Formulation

In the semisupervised person re-identification setting, we consider a pair of cameras with nonoverlapping fields of view and a set of training persons. The labelled training persons are associated with both cameras: each labelled person has an image captured from each camera, and the two labelled training sets, one per camera, are indexed so that images with the same index show the same person. The unlabelled training persons likewise give one unlabelled training set per camera; however, two unlabelled images from different cameras may not show the same pedestrian even if they share the same index.

A classical supervised metric learning algorithm [21] trains a Mahalanobis-like distance function on the two labelled training sets. Given a pair of training samples $x_i$ and $x_j$, their distance can be defined as

$$d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \tag{1}$$

where M is a positive semidefinite matrix, which guarantees the validity of the metric. By performing matrix decomposition on M with $M = L^T L$, (1) can be rewritten as

$$d_M(x_i, x_j) = (x_i - x_j)^T L^T L (x_i - x_j) = \|L(x_i - x_j)\|_2^2. \tag{2}$$

It is easy to see from the above derivation that the essence of metric learning is to seek an optimal matrix M (or, equivalently, a projection matrix L) under the supervised information, which generally consists of two kinds of pairwise constraints, i.e., similar constraints and dissimilar constraints. However, labelled data are usually difficult or expensive to obtain, whereas unlabelled data are abundant and easily acquired. Therefore, learning from both labelled and unlabelled samples is not only meaningful but also pressing for practical intelligent video surveillance.
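The equivalence between the matrix M and the projection L can be checked numerically. The following is a minimal sketch, assuming the standard forms d = (x_i − x_j)^T M (x_i − x_j) and M = L^T L; the function names, the 2 × 3 matrix L, and the sample vectors are illustrative choices, not values from the paper.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis-like distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = [a - b for a, b in zip(x_i, x_j)]
    return sum(d * m for d, m in zip(diff, matvec(M, diff)))

def projected_sq(x_i, x_j, L):
    """Squared Euclidean distance after projecting the difference with L."""
    diff = [a - b for a, b in zip(x_i, x_j)]
    return sum(v * v for v in matvec(L, diff))

# Illustrative projection from 3 dimensions to 2 (assumed values).
L = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]
M = matmul(transpose(L), L)  # M = L^T L is positive semidefinite by construction

x_i = [1.0, 0.0, 2.0]
x_j = [0.0, 1.0, 1.0]
d_M = mahalanobis_sq(x_i, x_j, M)
d_L = projected_sq(x_i, x_j, L)
```

Both routes give the same value, which is why learning M and learning a projection L are interchangeable views of the same problem.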

3.2. Self-Adaptive Feature Decomposition

Given a set of single-view training samples, the goal of this step is to produce pseudo two-view representations that can be used to learn a metric model in each view. In a typical co-training task, the dataset consists of two feature views that satisfy two conditions [34]: (1) a low-error hypothesis exists on each of the two views; (2) the two views are conditionally independent.

To meet the above requirements, a binary learning method based on two binary-weight vectors is proposed. It automatically decomposes the single-view features into two disjoint but individually effective views, which can be treated as pseudo two-view features of the training samples. Each binary-weight vector has one 0/1 entry per feature dimension. To make the two views conditionally independent, a succinct way is to let each feature dimension be used by only one of them; in other words, the two weight vectors are complementary 0/1 weights, which is the constraint (3). The values of the two weight vectors are still to be determined. Therefore, they are trained together on the labelled sample set, ensuring that the feature representations generated from both perform well. For each labelled sample, positives and negatives are taken from the labelled set, and the first weight vector is trained with an objective function that makes the weighted sample as similar as possible to its positives and as dissimilar as possible to its negatives, where the normalized distance between objects is the Euclidean distance. An objective function for the second weight vector is constructed similarly. Finally, the two weight vectors are trained jointly under the constraints of (3) by minimizing the maximum of the two objective functions.
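The idea of complementary binary weights trained by a min-max criterion can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the per-view loss (squared distance to the positive minus squared distance to the negative, without normalization) and the brute-force search are hypothetical stand-ins, since the exact objective and optimizer are not given here, and the single triplet is made-up data.

```python
from itertools import product

def split_views(x, w1):
    """Split feature vector x into two pseudo views; w2 is the complement of w1."""
    view1 = [v for v, w in zip(x, w1) if w == 1]
    view2 = [v for v, w in zip(x, w1) if w == 0]
    return view1, view2

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def view_loss(triplets, w):
    """Assumed per-view loss: small when positives are near and negatives are far."""
    loss = 0
    for anchor, pos, neg in triplets:
        a = [v for v, k in zip(anchor, w) if k == 1]
        p = [v for v, k in zip(pos, w) if k == 1]
        n = [v for v, k in zip(neg, w) if k == 1]
        loss += sq_dist(a, p) - sq_dist(a, n)
    return loss

def best_split(triplets, dim):
    """Exhaustive search over binary masks, minimizing the larger of the two view losses."""
    best = None
    for w1 in product((0, 1), repeat=dim):
        if sum(w1) in (0, dim):  # both views must keep at least one dimension
            continue
        w2 = tuple(1 - w for w in w1)  # each dimension belongs to exactly one view
        score = max(view_loss(triplets, w1), view_loss(triplets, w2))
        if best is None or score < best[0]:
            best = (score, w1, w2)
    return best

# Toy triplet (anchor, positive, negative): dimensions 0 and 1 are discriminative,
# dimensions 2 and 3 are misleading.
triplets = [([0, 0, 0, 0], [0, 0, 1, 1], [2, 2, 0, 0])]
score, w1, w2 = best_split(triplets, dim=4)
view1, view2 = split_views([10, 20, 30, 40], w1)
```

On this toy data the min-max criterion places the two discriminative dimensions in different views, so that neither view is left with only weak features. Exhaustive search is feasible only for tiny dimensions; it is used here purely to make the criterion concrete.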

3.3. Semisupervised Metric Update

After acquiring the pseudo two-view representations of the person images, a Mahalanobis-like metric model is learned in each view for matching the unlabelled training samples. Consider the pairwise difference $x_{ij} = x_i - x_j$ of two samples from the person dataset: $x_{ij}$ is an intrapersonal difference if the two samples belong to the same person and an interpersonal difference otherwise. Following [24], the Mahalanobis-like metric can be learned by assuming a zero-mean Gaussian structure of the difference space, with covariance matrices $\Sigma_I$ and $\Sigma_E$ for the intrapersonal and interpersonal differences, respectively:

$$\delta(x_{ij}) = \log \frac{\mathcal{N}(x_{ij}; 0, \Sigma_E)}{\mathcal{N}(x_{ij}; 0, \Sigma_I)}. \tag{6}$$

By the log-likelihood ratio test, and after dropping constant terms and factors, the above decision function can be simplified as (7), and then the distance between $x_i$ and $x_j$ can be written as (8):

$$\delta(x_{ij}) = x_{ij}^T \left(\Sigma_I^{-1} - \Sigma_E^{-1}\right) x_{ij}, \tag{7}$$

$$d(x_i, x_j) = (x_i - x_j)^T \left(\Sigma_I^{-1} - \Sigma_E^{-1}\right) (x_i - x_j). \tag{8}$$

The original semidefinite matrix M in the Mahalanobis-like metric function is reflected by $\Sigma_I^{-1} - \Sigma_E^{-1}$. Since the ranking lists of the unlabelled training samples are calculated based on (8), the core issue becomes how to use these ranking lists for the metric update, and three observations help to answer this question. First, the co-training style lets the models in the two views teach each other; thereby the ranking list in one view should benefit the model in the other. Second, the top-n samples in a ranking list are likely to be visually more similar to the probe, whereas the bottom-m samples are likely to be much less similar to it; thus the top-n and bottom-m samples can be treated as positive and negative pseudo-labels for the iterative metric update of the other view. Third, the aim of co-training is to reach an agreement between the two views, that is, to increase the number of consensual pseudo-labels from both views. In that case, the top-k neighbours of the consensual pseudo-labels in the unlabelled sample set may also be useful for the metric update, and they can be regarded as special references. Therefore, we attempt to learn a generic model that updates the metric learning model by discovering both pseudo-labels and references.
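The second and third observations can be made concrete with a small sketch of how pseudo-labels might be read off two ranking lists. The function names, the gallery identifiers, and the values of n and m are assumptions for illustration; the ranking lists are taken as given, sorted from nearest to farthest.

```python
def pseudo_labels(ranking, n, m):
    """Top-n entries become pseudo-positives, bottom-m entries pseudo-negatives."""
    positives = ranking[:n]
    negatives = ranking[len(ranking) - m:]
    return positives, negatives

def consensus(labels_a, labels_b):
    """Pseudo-labels on which both views agree."""
    return sorted(set(labels_a) & set(labels_b))

# Hypothetical ranking lists of one probe against five unlabelled gallery samples.
ranking_view1 = ["g3", "g1", "g4", "g2", "g5"]
ranking_view2 = ["g1", "g3", "g2", "g5", "g4"]

pos1, neg1 = pseudo_labels(ranking_view1, n=2, m=1)
pos2, neg2 = pseudo_labels(ranking_view2, n=2, m=1)

# Cross-view teaching: each model is updated with the other view's pseudo-labels.
labels_for_model1 = (pos2, neg2)
labels_for_model2 = (pos1, neg1)

agreed_positives = consensus(pos1, pos2)
agreed_negatives = consensus(neg1, neg2)
```

Here both views agree on the pseudo-positives `g1` and `g3` but disagree on the pseudo-negative, so only the positives would count as consensual pseudo-labels in this toy case.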

Assume that a metric model M1 is learned in the first view on the labelled sample set, and consider an unlabelled training sample in that view. Its positive and negative pseudo-labels are defined by the metric model M2 learned in the second view, and the positive and negative references of these pseudo-labels are defined as well. First, a pseudo-label term (9) is defined to pull the pseudo-positives as close as possible to the sample and, meanwhile, to push the pseudo-negatives as far away from it as possible. Then, a reference term (10) pulls the pseudo-positives close to the referential positives and the pseudo-negatives close to the referential negatives. Finally, the metric update becomes an optimization problem with objective function (11). A gradient descent algorithm is adopted to optimize (11), and the learning procedure of metric model M2 is similar to that of M1. After the iterative update, the final M1, M2, or their combination can be used on the test dataset.
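To give a feel for a gradient-based metric update, the sketch below performs one step on a generic pull/push loss. This is not the objective (11): the loss (sum of distances to pseudo-positives minus sum of distances to pseudo-negatives), the omission of the reference term, the learning rate, and the toy vectors are all assumptions made for illustration, and no projection back onto the positive semidefinite cone is performed.

```python
def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def outer(u):
    return [[a * b for b in u] for a in u]

def dist(M, u, v):
    """Mahalanobis-like distance (u - v)^T M (u - v)."""
    d = sub(u, v)
    k = len(d)
    return sum(d[a] * M[a][b] * d[b] for a in range(k) for b in range(k))

def pull_push_gradient(x, positives, negatives):
    """Gradient with respect to M of sum_p d_M(x, p) - sum_n d_M(x, n)."""
    k = len(x)
    grad = [[0.0] * k for _ in range(k)]
    for sign, group in ((1.0, positives), (-1.0, negatives)):
        for sample in group:
            o = outer(sub(x, sample))
            for a in range(k):
                for b in range(k):
                    grad[a][b] += sign * o[a][b]
    return grad

def gradient_step(M, grad, lr):
    k = len(M)
    return [[M[a][b] - lr * grad[a][b] for b in range(k)] for a in range(k)]

# Toy example: start from the identity metric in two dimensions.
M_old = [[1.0, 0.0], [0.0, 1.0]]
probe = [0.0, 0.0]
pseudo_pos = [[1.0, 0.0]]
pseudo_neg = [[0.0, 1.0]]

grad = pull_push_gradient(probe, pseudo_pos, pseudo_neg)
M_new = gradient_step(M_old, grad, lr=0.1)
```

After the step, the pseudo-positive is closer to the probe and the pseudo-negative is farther away under `M_new` than under `M_old`. A practical implementation would also need to keep M positive semidefinite, for example by projecting after each step.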

4. Experimental Results

In this section, the proposed method is validated by comparison with state-of-the-art person re-identification approaches on three publicly available datasets: the VIPeR dataset [35], the PRID2011 dataset [43], and the PRID450s dataset [44]. The widely used VIPeR dataset contains 632 person image pairs obtained from two different cameras. Some example images are shown in Figure 3(a). All images of individuals are normalized to 128 × 48 pixels. View changes are the most significant cause of appearance change, with most of the matched image pairs containing a viewpoint change of 90 degrees. Other variations, such as illumination conditions and image quality, are also present. PRID2011 is a challenging dataset captured by two surveillance cameras with markedly different camera characteristics, as shown in Figure 3(b). Specifically, images of 385 persons come from one camera and images of 749 persons come from the other, with 200 persons appearing in both views. All images are normalized to 128 × 48 pixels. PRID450s is an extension of PRID2011; it has significant and consistent lighting changes and chromatic variation, and it contains 450 single-shot image pairs captured over two spatially disjoint camera views. All images are normalized to 168 × 80 pixels.

4.1. Implementation Details

Both hand-crafted and deeply learned features are adopted as the original single-view representations in this paper. The hand-crafted feature is the salient color name [42], and the deeply learned feature is produced by a typical Siamese convolutional neural network [45]. All quantitative results are reported as standard Cumulative Matching Characteristic (CMC) curves [9], which plot recognition performance against rank and represent the expectation of finding the correct match within the top-ranked matches. Following the evaluation protocol of [23], each dataset is randomly divided into two halves, one for training and the other for testing. However, unlike fully supervised methods, for which all training data are labelled, only one-third of the training data are labelled in this semisupervised evaluation, while the remaining training data are unlabelled, similarly to [37]. All images from camera view A are treated as probes and those from camera view B as the gallery set. For each probe image, there is exactly one matching person image in the gallery set. When comparing two different methods, we use the same configuration in each trial to obtain the ranking lists. To achieve stable statistics, we repeat the evaluation procedure 10 times.
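The evaluation protocol can be sketched in a few lines. The code below is an illustrative sketch, not the evaluation code used for the reported results: the function names, the random seed, the rounding down of the one-third labelled portion, and the 3 × 3 distance matrix are assumptions. It assumes, as stated above, that every probe has exactly one match in the gallery.

```python
import random

def split_identities(person_ids, labelled_fraction=1 / 3, seed=0):
    """Half of the identities for training (part of it labelled), half for testing."""
    ids = list(person_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    train, test = ids[:half], ids[half:]
    n_labelled = int(len(train) * labelled_fraction)
    return train[:n_labelled], train[n_labelled:], test

def cmc_curve(dist_matrix, probe_ids, gallery_ids, max_rank):
    """Fraction of probes whose true match appears within the top k, for k = 1..max_rank."""
    hits = [0] * max_rank
    for i, pid in enumerate(probe_ids):
        order = sorted(range(len(gallery_ids)), key=lambda j: dist_matrix[i][j])
        rank = next(r for r, j in enumerate(order) if gallery_ids[j] == pid)
        for k in range(rank, max_rank):
            hits[k] += 1
    return [h / len(probe_ids) for h in hits]

# Split for a dataset with 632 identities, such as VIPeR.
labelled, unlabelled, test = split_identities(range(632))

# Toy distances: row i holds the distances from probe i to each gallery image.
dist_matrix = [[0.1, 0.5, 0.9],
               [0.4, 0.6, 0.2],
               [0.8, 0.3, 0.7]]
cmc = cmc_curve(dist_matrix, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2], max_rank=3)
```

In the toy example the true matches are found at ranks 1, 3, and 2, so the curve rises from one-third at rank 1 to 1.0 at rank 3. Averaging such curves over repeated random splits gives the stable statistics mentioned above.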

4.2. Experiments on VIPeR

We compare our co-metric learning (CML) based person re-identification method with ten published unsupervised, semisupervised, and fully supervised results on the VIPeR dataset. The unsupervised/semisupervised approaches are SDALF [10], eSDC [13], TSR [36], SSCDL [37], and Null-semi [38], and the fully supervised baselines are KISSME [24], kLFDA [39], DeepNN [15], Null Space [38], and XQDA [40]. Semisupervised person re-identification usually assumes that labels are available for one-third of the training set, whereas fully supervised approaches use the whole labelled training set in the learning procedure. The quantitative comparison is summarized in Table 1, from which we make the following observations: (1) our method achieves a 32.9% rank@1 matching rate, which improves on the previous best result by over 1.1%, and its matching rates at rank@5 and rank@10 are also the highest among all unsupervised/semisupervised results. (2) Compared with the fully supervised baselines, our result is also competitive, especially at rank@1; e.g., the performances of KISSME and kLFDA are both lower than that of our CML. (3) Although there is still a long way to go to reach the best fully supervised result, our approach needs only one-third of the training data to be labelled, which is more suitable for label-insufficient practical environments.

4.3. Experiments on PRID2011

Compared with the VIPeR dataset, PRID2011 contains fewer person images, so the training sample size may be much smaller than the feature dimension; i.e., the SSS problem can be worse. We compare with the state-of-the-art semisupervised baselines kCCA [41], kLFDA [39], XQDA [40], and Null-semi [38] on PRID2011, for which implementation code is accessible, using the same LOMO features. Table 2 shows the following: (1) the matching rates of our method at rank@1 and rank@5 are the best among all baselines; the only exception is rank@10, where our method is only 0.2% below Null-semi, which performs best. (2) Owing to the small sample size, our approach and the baselines all yield much poorer results on PRID2011 than on VIPeR.

4.4. Experiments on PRID450s

Several published unsupervised/semisupervised methods, SDALF [10], eSDC [13], and TSR [36], and the fully supervised KISSME [24] and SCNCD [42] are used as baselines on PRID450s. The performance of our method is much better than that of all unsupervised/semisupervised comparisons (see Table 3). It achieves 61.8% at rank@5 and 73.8% at rank@10, improving on the previous best results by over 10%. Moreover, to verify the label sensitivity of our CML method, we test SCNCD with metric learning and our method with 1/2, 1/3, and 1/5 of the training samples labelled (see Table 4). Our results significantly exceed those of SCNCD at every labelled training size and decrease only gently as the number of labelled training samples drops, whereas SCNCD declines sharply, especially with 1/5 of the training samples. This is because the within-class scatter matrix of traditional metric learning becomes singular when the number of labels is smaller than the dimension of the feature representation. By contrast, our method combines labelled and unlabelled data in the learning procedure, which makes it more robust and less sensitive to the number of labels.
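The singularity argument can be illustrated with exact arithmetic. The sketch below is a simplified illustration, assuming a single group of samples (so the scatter matrix is taken about one mean) and made-up integer feature vectors: with n samples in d dimensions the scatter matrix has rank at most n − 1, so it cannot be inverted when n is not larger than d.

```python
from fractions import Fraction

def scatter_matrix(samples):
    """Sum of outer products of the mean-centred samples (exact rational arithmetic)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(Fraction(s[k]) for s in samples) / n for k in range(d)]
    S = [[Fraction(0)] * d for _ in range(d)]
    for s in samples:
        c = [Fraction(s[k]) - mean[k] for k in range(d)]
        for a in range(d):
            for b in range(d):
                S[a][b] += c[a] * c[b]
    return S

def matrix_rank(matrix):
    """Rank by Gaussian elimination; exact because the entries are Fractions."""
    A = [row[:] for row in matrix]
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if A[i][c] != 0), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, rows):
            f = A[i][c] / A[r][c]
            A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        r += 1
        if r == rows:
            break
    return r

# Two labelled samples in a 3-dimensional feature space: fewer samples than dimensions.
few = [[1, 0, 2], [3, 1, 0]]
# Five samples in the same space: enough to span all three dimensions.
many = [[1, 0, 2], [3, 1, 0], [0, 2, 1], [2, 2, 2], [1, 3, 0]]

rank_few = matrix_rank(scatter_matrix(few))
rank_many = matrix_rank(scatter_matrix(many))
```

With two samples the 3 × 3 scatter matrix has rank 1 and is singular, whereas with five samples it reaches full rank 3. Real feature vectors have thousands of dimensions, so the singular case is the normal one when only a fraction of the training data is labelled.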

5. Conclusions

This paper proposes a novel semisupervised co-metric learning framework for label-insufficient person re-identification. To bridge the small sample size problem and model learning with few labels, and motivated by co-training, which is commonly used for learning with insufficient or imperfect labels, we adopt binary-weight learning to decompose single-view person features into two pseudo views, which are used to learn two metric models in a co-training style; the metrics are then jointly updated by discovering both pseudo-labels and references. Experiments on three representative person re-identification datasets show that the proposed method performs better than the state of the art with a small labelled sample size and has low label sensitivity.

Data Availability

The three public datasets utilized in this work are freely acquired online, and readers can easily find the download links of datasets via searching references [35, 43, 44] on the Internet.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation Projects of China (no. 61562048 and no. 61562047).