Abstract

Variations between image pixel characteristics contain a wealth of information. Extraction of such cues can be used to describe image content. In this paper, we propose a novel descriptor, called the intensity variation descriptor (IVD), to represent variations in colour, edges, and intensity and apply it to image retrieval. The highlights of the proposed method are as follows. (1) The IVD combines the advantages of the HSV and RGB colour spaces. (2) It can simulate the lateral inhibition mechanism and orientation-selective mechanism to determine an optimal direction and spatial layout. (3) An extended weighted L1 distance metric is proposed to calculate the similarity of images. It does not require complex operations such as square or square root and leads to good performance. Comparative experiments on two Corel datasets containing 15,000 images show that the proposed method performs better than the SoC-GMM, CPV-THF, and STH methods and provides good matching of texture, colour, and shape.

1. Introduction

With the increasing number of images uploaded to the Internet, there is an urgent need to retrieve images quickly and efficiently from large-scale collections. This need has prompted researchers to propose a variety of content-based image retrieval (CBIR) methods. Most of these methods first extract primary visual features such as colours, edges, and intensities. They then design feature models such as attention models or local pattern models to extract useful information, which is used to construct a feature vector histogram for image matching.

In recent years, using computational visual attention models to extract visual features has become a popular approach to image retrieval. Liu et al. proposed a bar-shaped structure for image content analysis that combines the visual attention and orientation-selective mechanisms [1]. The bar-shaped structures emphasise the salient structural information in the primary visual features, which significantly improves image retrieval performance. However, this approach only takes into account the associations between bar-shaped structures whose three pixels are invariant in intensity. The multitrend structure descriptor [2] builds on the microstructure descriptor [3] and defines three trends in local structure: the increasing, decreasing, and invariant trends of the three pixels. The dLBP method [4] encodes the intensity variation of the three pixels into two bits, providing a total of eight kinds of variation. These methods have demonstrated excellent performance in image retrieval and object detection, but do not efficiently extract information on variations in colour, edges, and intensity. In order to extract these variations efficiently, we propose an intensity variation descriptor (IVD) based on dLBP [4], which can effectively combine the advantages of the HSV and RGB colour spaces and has the power to discriminate texture and spatial structure.

The proposed method has the following highlights. (1) The IVD can combine the advantages of the HSV and RGB colour spaces, which is more efficient than using a single colour space. (2) It can simulate the lateral inhibition mechanism and orientation-selective mechanism to determine the optimal direction and spatial layout. (3) We propose an extended weighted L1 distance, called W1 distance, to calculate the similarity between images and provide good CBIR performance.

The rest of this paper is organised as follows. In Section 2, we briefly review related works. The proposed method is presented in Section 3. We conduct CBIR experiments in Section 4, while Section 5 concludes the paper.

2. Related Works

In the field of image retrieval, various methods have been proposed to extract image features. These can be divided into two categories according to the type of features: global feature methods and local feature methods. Global feature methods extract colour, texture, shape, and other features as essential visual features. Global features are simple to calculate and have low feature dimensions. Local features are extracted from positions in the image that are stable under change. Therefore, they are robust to occlusion, changes in illumination and background, and geometric transformation [5, 6]. However, local feature methods usually rely on complex algorithms and produce high-dimensional feature vectors.

The colour histogram is one of the most commonly used global feature extraction methods. It is robust to scale, rotation, and noise. However, it lacks a spatial layout and completely different images may have the same histogram distributions. Some methods have been proposed to address such problems, such as colour correlograms [7] and colour coherence vectors [8]. Texture is an image feature with a certain spatial structure that appears repeatedly. There are some well-known texture description methods, such as GLCM [9], LBP [10], Gabor [11], and wavelet [12]. Shape is also one of the most important features used to describe an image, and human beings can directly judge a category by its shape. Classical shape descriptors include Zernike moments [13], curvature scale space (CSS), and angular radial transform [14]. Among local feature methods, SIFT [15] is one of the most classical. It can achieve scale and rotation invariance by extracting certain points with invariant features in different scale spaces of an image. Due to the complexity of the algorithm and the high feature dimensions, some improved methods have been proposed, including SURF [16] and PCA-SIFT [17]. In addition, there are other local feature methods such as that of Harris and Stephens [18] and oriented FAST and rotated BRIEF (ORB) [19]. The BOW model [20] uses SIFT to extract the local features and then clusters them into visual words via a clustering algorithm. Finally, a dictionary composed of these words retrieves the corresponding results through similarity matching.
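The limitation noted above, that a colour histogram carries no spatial layout, can be illustrated with a toy example. This is only a minimal sketch: the two 2 × 4 grids of quantized colour labels are hypothetical data, and the helper name `colour_histogram` is an illustrative choice rather than part of any cited method.

```python
from collections import Counter


def colour_histogram(pixels):
    """Count how often each colour label occurs, ignoring pixel positions."""
    return Counter(label for row in pixels for label in row)


# Two toy "images" of quantized colour labels (hypothetical data).
left_right = [[0, 0, 1, 1],
              [0, 0, 1, 1]]
checkerboard = [[0, 1, 0, 1],
                [1, 0, 1, 0]]

# The histograms match even though the spatial layouts differ.
same_histogram = colour_histogram(left_right) == colour_histogram(checkerboard)
same_layout = left_right == checkerboard
print(same_histogram, same_layout)
```

Both grids contain four pixels of each label, so their histograms are identical although one is split into halves and the other is a checkerboard.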

In recent years, the use of a single feature for image retrieval has failed to meet the requirements of large-scale image datasets. Accordingly, researchers began to propose image retrieval methods based on multifeature fusion. The most common fusion method is the combination of colour and texture. For example, a microstructure descriptor represents image local features by computing edge orientation similarities and underlying colours [3]. The structure element descriptor (SED) [21] combines colour and texture features using five structure elements denoting five directions. Liu and Yang proposed the colour difference histogram to integrate colour and edge orientation information based on perceptually uniform colour differences and used it for image retrieval [22]. Recently, many image retrieval methods based on LBP variants have been proposed [23–26]. Discrete wavelet transform, LBP, and grey level co-occurrence matrices can be used to exploit multiresolution analysis and to enhance image directional information [26–30]. Simulation of human perception and visual attention have been adopted in some multistage image retrieval frameworks [31–41]. The manifold ranking method has been used to match similar feature vectors [42, 43]. The WAS method [44] was proposed to integrate the information of adjacent images and compute their similarity. The SFW graph [35] is an efficient graph-matching method used to solve the pairwise image matching problem and reduce computational complexity.

Recently, deep learning has also been applied to image retrieval, especially in methods based on convolutional neural networks (CNNs). CNN-based methods use a pretrained or fine-tuned convolutional neural network to extract global features from the fully connected layer or local features from the intermediate layer for image retrieval [45–48]. Deep learning methods can achieve excellent retrieval results, but achieving reasonable results requires a lot of effort, including (1) knowledge and experience of network architecture, (2) huge amounts of training data, and (3) substantial computation time.

In this paper, we propose a simple yet efficient image retrieval method. Our method directly utilises low-level visual features to represent images and does not require complex operations such as modelling, training, clustering, and segmentation.

3. The Proposed Method

Colour, intensity, and edges are considered the primary visual features commonly used in CBIR. Here, we propose a novel descriptor, called the intensity variation descriptor (IVD), to represent variations in colour, edges, and intensity, and apply it to image retrieval. The flow diagram of the proposed method is shown in Figure 1. The IVD is built on the intensity variations of colour, edges, and intensity in a certain direction. With the intensity variation serving as a bridge, the IVD extracts features by simulating the lateral inhibition mechanism and orientation-selective mechanism, and it effectively integrates colour, edges, and intensity for image retrieval.

In the proposed method, the input image is first converted from the RGB colour space to the HSV colour space, and the H, S, and V components are uniformly quantized. Then, the R, G, and B components of each pixel in the RGB colour space are taken as inputs for the IVD, and the output intensity variations are used as the feature values of the corresponding HSV-quantized colours. In addition, the edge and intensity information is extracted from the V component and quantized. The IVD is used to describe the intensity variation of the quantized edges and intensities in a certain direction, which is selected by the proposed local direction detection method. Finally, the intensity variation values of colours, edges, and intensities are integrated into the feature vector. After similarity matching by the W1 distance, the 12 images most similar to the query image are returned.

3.1. The Intensity Variation Descriptor (IVD)

As mentioned before, pixel intensities can vary in a variety of ways. The bar-shaped structure considers only the case in which the intensity is invariant across three pixels [1]. Building on the bar-shaped structure [1] and microstructure [3], the multitrend structure descriptor adds two situations, in which the intensities of the three pixels increase or decrease in turn from left to right. The LBP combines the invariant and increasing situations into one, so there are four variations in LBP, as shown in Figure 2(c). The dLBP additionally considers the intensity difference variation (IDV) on top of LBP, as shown in Figure 2(d). Therefore, dLBP contains eight variations, which come from four intensity variations (IVs) and two IDVs.

In order to efficiently extract more intensity variation information, an intensity variation descriptor (IVD) is proposed to extract intensity variation and intensity difference variation information based on the bar-shaped structure [1] and dLBP [4], as shown in Figure 2(e).

Let there be three intensity values, , , and . We define the intensity difference between them as follows:

The weight of the intensity variation (IV) can be defined as follows:

The weight of the intensity difference variation (IDV) can be denoted as follows:

After combining the weights of the IV and IDV, we define the intensity variation value as follows:

3.2. Extraction of Visual Features

Human eyes are sensitive to colour and edges. Colour can provide rich information and is the most direct visual feature of images. Edges can represent the boundaries of image contents and textural structures. Therefore, colours and edges are utilised to represent image features in the IVD method.

Both the HSV and Lab colour spaces imitate human colour perception well [22, 38]. The Lab colour space is popular in calculating colour differences for image representation [22] and saliency detection [38], while the HSV colour space is more suitable for extracting colour information based on colour quantization [1, 3, 49]. Therefore, we adopted the HSV colour space for extracting colour features with the proposed method, where the H, S, and V components are uniformly quantized into 6, 3, and 3 bins, respectively. This results in a total of 6 × 3 × 3 = 54 colour bins. We define the colour bins as a colour map , where , , and .
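The 6 × 3 × 3 uniform quantization described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the use of Python's standard `colorsys` conversion, the normalisation of all three HSV components to [0, 1], and the ordering of the combined colour index are assumptions made here.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 6, 3, 3  # bin counts stated in the text


def uniform_bin(value, n_bins):
    """Map a value in [0, 1] to an integer bin in [0, n_bins - 1]."""
    return min(int(value * n_bins), n_bins - 1)


def quantize_rgb_to_hsv_index(r, g, b):
    """Convert 8-bit RGB to HSV and return one colour index in [0, 53]."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    h_bin = uniform_bin(h, H_BINS)
    s_bin = uniform_bin(s, S_BINS)
    v_bin = uniform_bin(v, V_BINS)
    # Index layout (an assumption): hue varies slowest, value fastest.
    return (h_bin * S_BINS + s_bin) * V_BINS + v_bin


n_colours = H_BINS * S_BINS * V_BINS          # 6 * 3 * 3 = 54
red_index = quantize_rgb_to_hsv_index(255, 0, 0)
grey_index = quantize_rgb_to_hsv_index(128, 128, 128)
print(n_colours, red_index, grey_index)
```

Clamping the top of each range into the last bin keeps fully saturated or fully bright pixels inside the 54-entry colour map.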

Sobel operators have good noise-suppression characteristics [50] and are simple to calculate. We used them to detect edge cues on the V component. We can obtain an edge amplitude map and edge orientation map by using uniform quantization, , , where , and , , where .
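Edge extraction with Sobel operators followed by uniform quantization can be sketched in plain Python as below. Several details are assumptions made for illustration: the V component is taken to lie in [0, 1], orientation is folded into [0, π), border pixels are left at zero, and the bin counts default to the 9 amplitude and 36 orientation bins selected later in the experiments. The helper names are hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_at(v, y, x):
    """Return the (gx, gy) Sobel responses at an interior pixel of v."""
    gx = gy = 0.0
    for dy in range(3):
        for dx in range(3):
            p = v[y + dy - 1][x + dx - 1]
            gx += SOBEL_X[dy][dx] * p
            gy += SOBEL_Y[dy][dx] * p
    return gx, gy


def quantize(value, max_value, n_bins):
    """Uniformly quantize a value in [0, max_value] into n_bins levels."""
    return min(int(value / max_value * n_bins), n_bins - 1)


def edge_maps(v, amp_bins=9, ori_bins=36):
    """Quantized edge amplitude and orientation maps (borders stay 0)."""
    h, w = len(v), len(v[0])
    # Upper bound on the Sobel magnitude when v lies in [0, 1] (assumed).
    max_amp = math.hypot(4.0, 4.0)
    amp = [[0] * w for _ in range(h)]
    ori = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx, gy = sobel_at(v, y, x)
            angle = math.atan2(gy, gx) % math.pi  # fold into [0, pi)
            amp[y][x] = quantize(math.hypot(gx, gy), max_amp, amp_bins)
            ori[y][x] = quantize(angle, math.pi, ori_bins)
    return amp, ori


# A vertical step edge: dark left half, bright right half.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
amp, ori = edge_maps(step)
print(amp[1][1], ori[1][1])
```

The same `quantize` helper could be applied directly to the V component to obtain a quantized intensity map.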

The V component is also utilised to represent intensity. An intensity map can be obtained using the above operation. We defined it as , , where .

3.3. Local Direction Detection

In the IVD method, three intensity values are input to obtain the intensity variation. As shown in Figure 3, in a neighbourhood, the intensity values of three pixels in four directions can meet the requirements of IVD. Unfortunately, the calculation of IVD in four directions is too computationally intensive. Hubel and Wiesel showed that many simple cells in the visual cortex have an orientation-selective mechanism [51, 52], which makes them respond only to lines with a certain orientation [53]. Inspired by this view, we try to select three intensity values in a certain direction as the input of IVD, whose direction is the texture direction of the neighbourhood.

Let an region in an image and its grey value be . The grey values of pixels adjacent to in the 0°, 45°, 90°, and 135° directions are denoted as , , , and , respectively. Therefore, the average grey value difference of the local neighbourhood is denoted as follows:where and are the width and height of the region, respectively. , , , and are the numbers of pixel pairs in four directions, respectively. In this paper, we set . In a certain direction, we consider as a numerator and the direction perpendicular to as a denominator. We define the ratio between them as follows:where adding one to the denominator prevents it becoming a zero value. The direction corresponding to the minimum of is considered the direction of the region.

Using the above implementation, we can determine the regional direction of the edge amplitude map , the edge orientation map , and the intensity map , where the intensity values of the three pixels in this direction are utilised as the input for the IVD.
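The direction selection described in this section can be sketched as follows, using only what the text states: a mean absolute grey-value difference per direction, a ratio against the perpendicular direction with one added to the denominator, and the minimum ratio giving the region direction. The function names, the neighbour offsets chosen for each angle, and the tie-breaking order are assumptions made here.

```python
# Neighbour offsets (dy, dx) for each direction; image rows grow downward.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def mean_abs_difference(region, angle):
    """Average |g(p) - g(neighbour of p)| over all pixel pairs along angle."""
    dy, dx = OFFSETS[angle]
    h, w = len(region), len(region[0])
    total, pairs = 0.0, 0
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                total += abs(region[y][x] - region[ny][nx])
                pairs += 1
    return total / pairs  # requires a region of at least 2 x 2 pixels


def region_direction(region):
    """Return the angle whose difference-to-perpendicular ratio is smallest."""
    diffs = {a: mean_abs_difference(region, a) for a in OFFSETS}
    # Adding one to the denominator avoids division by zero.
    ratios = {a: diffs[a] / (diffs[(a + 90) % 180] + 1.0) for a in OFFSETS}
    return min(ratios, key=ratios.get)


horizontal_stripes = [[0, 0, 0],
                      [9, 9, 9],
                      [0, 0, 0]]
vertical_stripes = [[0, 9, 0],
                    [0, 9, 0],
                    [0, 9, 0]]
print(region_direction(horizontal_stripes), region_direction(vertical_stripes))
```

Grey values change least along the stripes and most across them, so the smallest ratio falls on the stripe direction in both toy regions.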

3.4. Image Representation

In order to represent colour features efficiently, both HSV and RGB colour spaces are utilised to extract colour information. The pixels of the R, G, and B channels are utilised as the input values of the IVD, and the output of the IVD is utilised as the histogram values. The colour features are represented as follows:

In the representation of edges and intensity features, we determine the direction of a region within the edge and intensity feature map, and then use the three intensity values in the direction as the input of the IVD. Furthermore, we use the lateral inhibition formula to combine the spatial information output by the IVD [54].where is the lateral inhibition coefficient, is the output value of the IVD at the centre pixel, and , is the output value of the IVD about eight pixels adjacent to the central pixel. We define as follows:

The histogram of the edge amplitude is defined as follows:where is the edge amplitude value of the centre pixel of the region, and are the surrounding edge amplitude values of in the direction, are the edge amplitude values of the eight pixels adjacent to the central pixel , and and are the surrounding edge amplitude values of in direction .

Similarly, the histograms of edge orientation and intensity map are defined as follows:

Combining , , , and , the final histograms H can be obtained:

4. Experimental Results

In order to validate the performance of the proposed method, some state-of-the-art methods were selected for comparison: SoC-GMM [55], LDP [56], SSH [1], CPV-THF [57], and STH [58]. A distance metric and benchmark datasets were required for the comparisons. In these experiments, we use an improved distance for image matching and evaluate performance by averaging the precision, recall, and F-measure over all queries.

4.1. Datasets

Two datasets were used for CBIR. (1) One was the Corel-5K dataset, which contains 5000 images with diverse content such as bark, food textures, waves, microscopic objects, and trees. It contains 50 categories, each with 100 images sized 192 × 128 or 128 × 192 pixels in JPEG format. (2) The second was the Corel-10K dataset containing 10,000 images of the same size as those in the Corel-5K dataset. It has various contents such as flowers, horses, fish, beaches, and mountains. It contains 100 categories, with 100 images in each category. The Corel-5K dataset is a subset of the Corel-10K dataset.

4.2. Distance Metric

In the CBIR experiments, we propose an improved distance formula based on CDH [22], namely, the W1 distance, to implement image matching. It can be considered as an extended weighted L1 distance. Let and be the feature vectors of template and query images, respectively. Both are K-dimensional feature vectors: and . The W1 distance between the template and query images is simply calculated as follows:where . The W1 distance has the best performance when .

4.3. Performance Measures

In comparative CBIR experiments, deciding what kind of performance evaluation metric to use is important. In this paper, the precision, recall, and F-measure metrics were utilised to evaluate the effectiveness of the proposed method. They are defined as follows:

$$P = \frac{I_N}{N}, \qquad R = \frac{I_N}{M}, \qquad F = \frac{(1+\beta^{2})\,P\,R}{\beta^{2}P + R},$$

where $I_N$ is the number of images retrieved in the top $N$ positions that are similar to the query image, $N$ is the total number of images retrieved, and $M$ is the total number of images in the database that are similar to the query. Parameter $\beta$ allows one to weight either precision or recall more heavily.

In the Corel-5K and Corel-10K datasets, we set , , and .
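The three measures can be computed for a single query as in the sketch below. It assumes the standard weighted F-measure; the function name and the toy ranked list are hypothetical. The default of 12 retrieved images matches the 12 returned images described for the method, the 100 relevant images match the category size of the Corel datasets, and the default `beta=1.0` is an assumption.

```python
def precision_recall_f(ranked_labels, query_label, n_relevant,
                       top_n=12, beta=1.0):
    """Precision, recall, and F-measure over the top_n retrieved images."""
    retrieved = ranked_labels[:top_n]
    hits = sum(1 for label in retrieved if label == query_label)
    precision = hits / len(retrieved)
    recall = hits / n_relevant
    if precision == 0.0 and recall == 0.0:
        return 0.0, 0.0, 0.0  # avoid dividing by zero in the F-measure
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f


# Toy ranking (hypothetical): 9 of the 12 returned images match the query.
ranking = ["eagle"] * 9 + ["horse"] * 3
p, r, f = precision_recall_f(ranking, "eagle", n_relevant=100)
print(p, r, round(f, 4))
```

With only 12 images returned from a category of 100, recall is bounded by 0.12, which is why precision is the more informative figure in this setting.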

4.4. Retrieval Performance

The vector dimensionality has a very important impact on performance. In general, the retrieval performance of an algorithm improves as the number of dimensions increases. However, an excessively high dimensionality increases the computational burden. Therefore, choosing an appropriate vector dimensionality not only produces efficient results but also avoids excessive computational cost.

In the CBIR experiments, hue (H) is uniformly quantized into 6, 8, and 12 bins, and saturation (S) and value (V) are both quantized into 3 bins in the HSV colour space. Thus, the quantization numbers of colours are 6 × 3 × 3 = 54 bins, 8 × 3 × 3 = 72 bins, and 12 × 3 × 3 = 108 bins. At the same time, the quantization numbers of edge orientations are 9 bins, 18 bins, 36 bins, 45 bins, 60 bins, and 90 bins, and the quantization numbers of intensity are 16 bins, 32 bins, and 64 bins.

As shown in Figure 4, the precision did not change much as the quantization number of colours increased. Overall, precision declined with increases in the quantization number of intensity and increased with increases in the quantization number of edge orientation. When the quantization number of edge orientation increased from 18 to 36, the precision increased by more than 1%. With the quantization number of edge orientation fixed at 36 bins, quantizing colour and intensity into 54 bins and 16 bins, respectively, gave both the smallest vector dimensionality and the highest precision. Therefore, we set the quantization numbers of colour and intensity to 54 bins and 16 bins, respectively, with 36 bins for edge orientation.

We further evaluated the effects of the quantization numbers of edge amplitude. The quantization numbers of edge amplitude were 9 bins, 18 bins, 36 bins, 45 bins, 60 bins, and 90 bins. In Figure 5, the precision, recall, and F-measures decrease with increases in the quantization number of edge amplitude. According to the results, we set the quantization number of edge amplitude to 9 bins. Ultimately, the vector dimensionality of the proposed method is 54 + 36 + 16 + 9 = 115 bins.
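The dimensionality arithmetic above can be checked with a short sketch that concatenates the four histograms. The function name and the concatenation order (colour, edge orientation, intensity, edge amplitude, following the order of the sum in the text) are assumptions; the zero-filled histograms are placeholders, not real features.

```python
# Bin counts selected in the experiments described above.
COLOUR_BINS, ORIENTATION_BINS, INTENSITY_BINS, AMPLITUDE_BINS = 54, 36, 16, 9


def concatenate_histograms(colour, orientation, intensity, amplitude):
    """Join the four histograms into one feature vector, checking sizes."""
    parts = [(colour, COLOUR_BINS), (orientation, ORIENTATION_BINS),
             (intensity, INTENSITY_BINS), (amplitude, AMPLITUDE_BINS)]
    feature = []
    for histogram, expected in parts:
        if len(histogram) != expected:
            raise ValueError(f"expected {expected} bins, got {len(histogram)}")
        feature.extend(histogram)
    return feature


# Placeholder histograms filled with zeros (hypothetical data).
feature = concatenate_histograms([0.0] * 54, [0.0] * 36, [0.0] * 16, [0.0] * 9)
print(len(feature))
```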

As shown in Table 1, omitting the colour information of the RGB colour space reduced precision by at least 1% on the two datasets. Using the lateral inhibition formula improved the performance to a certain degree. Performance was better using the selected direction than using any single direction (0°, 45°, 90°, or 135°). Thus, using the selected direction and the lateral inhibition formula, and adding the RGB colour space information, improved the performance.

In order to investigate the effectiveness of the proposed W1 distance, we compared it with the L1 distance (Manhattan distance), the chi-square distance (χ² statistics), the Canberra distance, the D1 distance (weighted L1 distance), and the distance proposed in the CDH method [22]. The performances of these distances or similarity metrics are listed in Table 2. It is clear that the W1 distance performed better than the other metrics on the Corel-5K and Corel-10K datasets. Moreover, the W1 distance metric does not require complex operations such as square or square root and can be regarded as an extended weighted L1 distance [22].
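For orientation, three of the baseline metrics named above can be written in a few lines each. The sketch uses common textbook forms and is not taken from the paper: the chi-square variant shown omits the factor of one half that some definitions include, bins where both histograms are zero are skipped, and the toy vectors are hypothetical. The W1, D1, and CDH distances are not reproduced because their formulas are not restated here.

```python
def l1_distance(t, q):
    """Manhattan distance: sum of absolute bin differences."""
    return sum(abs(a - b) for a, b in zip(t, q))


def canberra_distance(t, q):
    """Canberra distance: absolute differences normalised per bin."""
    total = 0.0
    for a, b in zip(t, q):
        denom = abs(a) + abs(b)
        if denom > 0:  # skip bins that are empty in both histograms
            total += abs(a - b) / denom
    return total


def chi_square_distance(t, q):
    """Chi-square statistic for non-negative histograms (no 1/2 factor)."""
    total = 0.0
    for a, b in zip(t, q):
        denom = a + b
        if denom > 0:
            total += (a - b) ** 2 / denom
    return total


template = [1, 2, 3, 0]
query = [1, 0, 5, 0]
print(l1_distance(template, query),
      canberra_distance(template, query),
      chi_square_distance(template, query))
```

Of these three, only the chi-square statistic needs a squaring operation, which is the kind of cost the text says the W1 distance avoids.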

4.5. Performance Comparisons

Here, we compare the proposed method with state-of-the-art methods such as SSH [1], SoC-GMM [55], LDP [56], CPV-THF [57], and STH [58]. The results are listed in Table 3. The precision of the proposed method on the Corel-5K dataset was 15.11%, 3.01%, and 6.63% higher than with SoC-GMM [55], CPV-THF [57], and STH [58], respectively. The precision of the proposed method on the Corel-10K dataset was 9.63%, 4.97%, 2%, 4.6%, and 8.85% higher than with SoC-GMM [55], LDP [56], SSH [1], CPV-THF [57], and STH [58], respectively.

Natural images contain rich colours, complex textures, and various shapes. However, SoC-GMM only describes colour information. LDP, SSH, CPV-THF, and STH each represent colour information in a single colour space such as RGB, HSV, or Lab. In contrast, both the HSV and RGB colour spaces are considered in the proposed IVD method. It is worth mentioning that this does not increase the vector dimensionality. CPV-THF and STH can analyse texture features using texton templates. However, texton templates cannot adequately describe texture information because they only consider two pixels with the same intensity in square blocks. The bar-shaped structure of the SSH method only considers situations where three pixels are invariant in intensity [1]. The proposed IVD describes a total of 27 variations in the intensity of three pixels. The IVD method not only describes the intensity variation in a certain direction but can also combine the colour, edges, greyscale, and spatial information. Therefore, the proposed method contains richer information than the SoC-GMM, LDP, SSH, CPV-THF, and STH methods.

Figures 6 and 7 show two image retrieval examples from the Corel-5K and Corel-10K datasets. Each example has 12 images, with the top-left image being the query image. In Figure 6, the query is a gravel image, which has varied colours and complex textures. In Figure 7, the query is an eagle image, which has a background of blue sky and the shape of the eagle. It can be seen that all returned images were correctly ranked within the top 12 images. Furthermore, the colour, texture, and shape of the query image and all the returned images have certain similarities.

Figures 8 and 9 show two image retrieval examples from the Corel-5K and Corel-10K datasets using the SSH method [1], with the same gravel and eagle queries. Seven returned images were incorrectly ranked within the top 12 images in Figure 8. Both the IVD and the SSH methods have discriminatory power for colour, texture, shape, and spatial features, but the IVD method has stronger discriminatory power for texture and significantly outperforms the SSH method.

5. Conclusions

We have proposed a novel image representation, namely, the intensity variation descriptor, to represent image content, and applied it to image retrieval. The proposed descriptor not only extracts rich colour information by combining the HSV and RGB colour spaces but also effectively describes texture features by extracting information on edges and intensity variations. It also considers the direction and spatial structure information of the textures by simulating the orientation-selective mechanism and the lateral inhibition mechanism.

We have proposed an extended weighted L1 distance metric to improve the retrieval performance of the proposed method. Experimental performance comparisons on the Corel-5K and Corel-10K datasets demonstrate that our method outperforms some state-of-the-art methods and has the power to discriminate texture and spatial structure. The proposed IVD method has potential applications in texture recognition, trademark image retrieval, and palmprint image retrieval.

Data Availability

The data and code are available at http://www.ci.gxnu.edu.cn/cbir/Dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61866005 and in part by the project of the Guangxi Natural Science Foundation of China under Grant 2018GXNSFAA138017.