Abstract

Pedestrians, motorists, and cyclists remain the victims of the poor vision and negligence of human drivers, especially at night. Millions of people die or sustain physical injuries yearly as a result of traffic accidents. Detection and recognition of road markings play a vital role in many applications such as traffic surveillance and autonomous driving. In this study, we trained a nighttime road-marking detection model using NIR camera images. We modified the VGG-16 base network of the state-of-the-art faster R-CNN algorithm using a multilayer feature fusion technique. We also demonstrated another promising feature fusion technique of concatenating all the convolutional layers within a stage to extract image features. The modification boosts the overall detection performance of the model by exploiting the advantages of both the shallow and the deep layers of the VGG-16 network. The training samples were augmented using random rotation and translation to enhance the heterogeneity of the detection algorithm. We achieved a mean average precision (mAP) of 89.48% and 92.83% for the baseline faster R-CNN and our modified method, respectively.

1. Introduction

With the current rate of technological advancement, the number of automobile users is on the increase. Thus, it is paramount for drivers to interpret and understand road markings to make traffic safer [40]. Pedestrians, motorists, and cyclists are killed at night as a result of the poor vision and negligence of drivers who fail to promptly identify a particular road marking [41, 44]. Intelligent driver-assistance systems need to be developed to mitigate the increasing rate of road accidents and to assist the human driver in detecting and recognizing road markings [1, 5, 45, 46].

It has been reported that 20–50 million people sustain physical injuries yearly, and over a million die as a result of traffic accidents [33]. Traffic accidents are mostly caused by driver behavior at the wheel and inadequate road infrastructure. In this regard, advanced driver-assistance systems (ADAS) are being developed and incorporated into some modern automobiles [45]. Developing a driver-assistance system that complements the driver's skills by warning about impending danger or the right direction to maneuver ahead would go a long way in reducing the rate of traffic accidents [24, 31, 42]. Identifying road markings in images is demanding; this is due to the occlusion of the road markings by vehicles, variation in view angle, changes in illumination conditions, complex backgrounds, poor image quality, and shadows [4, 13, 31, 49]. Road-marking and traffic-sign detection systems are vital for driver-assistance systems and autonomous vehicles [6]. Conventionally, road markings can be of different types such as arrows, lane markings, pedestrian crossings, speed limits, and texts [23, 45, 46]. Many specific road-marking detection and recognition tasks have been carried out focusing on lane markings [13], crosswalks, and arrows [24]. The current trend in nighttime object detection uses far-infrared (FIR) and near-infrared (NIR) cameras as reliable image acquisition systems [14]. Many researchers have reported object detection using near-infrared (NIR) images. Govardha and Pati [14] used a NIR camera to detect pedestrians in nighttime vision. In their research, a combination of Haar-Cascade and Histogram of Oriented Gradients with Support Vector Machine (HOG-SVM) was used for feature extraction. Han and Song [22] utilized a near-infrared camera for nighttime detection. They used aggregated channel features and AdaBoost methods to achieve the detection task. Dai et al. [37] used nighttime near-infrared images with a convolutional neural network to detect pedestrians. The images were acquired using a car-mounted NIR camera, evaluated using convolutional neural networks, and satisfactory results were obtained. However, the current trend in deep learning detection has focused on the unsupervised detection task, since deep learning frameworks require large amounts of training data to generalize effectively on the evaluation dataset [38]. A NIR camera is one of the devices that can be used to detect road markings at night; it provides good vision and sound image quality in darkness [21, 22].

The classical computer vision approach and the deep learning-based approach are the two most commonly used techniques for traffic-sign detection. In classical methods, handcrafted features such as the Histogram of Oriented Gradients (HOG), the Scale-Invariant Feature Transform (SIFT), and Speeded Up Robust Features (SURF) are used. The classical approach is less accurate since the feature extractors are handcrafted [2, 3, 49].

In recent times, there has been a rapid advancement of deep learning technology. Convolutional neural networks, alongside some detection algorithms, have shown great promise in image detection and classification [4, 16, 50]. The successes recorded by AlexNet [17] and Zeiler and Fergus [18] inspired researchers to utilize the convolutional neural network as a feature extractor. After the success of AlexNet, several modifications of CNN classification models were presented. Road markings are detected using single-shot methods such as YOLO and SSD [12] or region proposal-based methods. The region-based method outperformed the sliding window search method by reducing the number of proposals and the searching time [8]. Many region proposal algorithms such as selective search [8] and EdgeBox [1] have yielded impressive results in object detection. The Region-based Convolutional Neural Network (R-CNN) has gained considerable popularity in road-marking detection in recent years [1]. The fast R-CNN and the faster R-CNN evolved from the R-CNN. They are both region-based detection networks, with the latter outperforming the former in terms of detection accuracy [11].

The faster R-CNN consists of a base network, usually ZF Net, VGG, or ResNet, from which feature maps are obtained, a region proposal network (RPN), a RoI pooling layer, and classification and regression networks. The RPN is a subnetwork that uses anchors at each pixel position of an image for a class-agnostic classification. The RPN and the fast R-CNN network can be trained separately or jointly. A reasonable number of RoIs are fed into the RoI pooling layer, after which the classification and detection tasks are performed [32]. VGG-16 has been widely used as a feature extractor in faster R-CNN. The network has thirteen convolutional layers, which are divided into stages. Within each stage, the feature map size is preserved, and it is downsampled by half after passing through a pooling layer. By passing through consecutive stages, the final feature maps are obtained. These feature maps are rich in semantic information but lack position information. To overcome this challenge, many researchers have presented the application of feature fusion techniques in object detection to utilize both the position and the semantic information of a feature map [27, 28]. Road-marking detection is a vital aspect of the Intelligent Transportation System (ITS), and research is still ongoing in this area. Literature exists on road-marking detection using classical computer vision. Danescu and Nedevschi [23] developed a road-marking detection and recognition system using a two-step segmentation technique. The classification accuracy obtained ranges from 80% to 95% under different scenarios. Foucher et al. [24] proposed a robust algorithm for detecting and classifying crosswalks and arrows using city images. Their findings show that 90% of crosswalks and 78% of arrows were successfully detected. Qin et al. [20] studied four different types of road markings using a machine vision approach. The marking contours were randomly extracted from the image using an image processing technique, and the extracted features were then sent to the classification and detection modules. Chen et al. [42] utilized a binarized normed gradient (BING) algorithm to speed up the recognition of the detection target, and the detected object is classified using a PCANet classifier. In these methods, the feature extractors are handcrafted. Thus, they are less accurate and not suitable for real-time road-marking detection.

Many researchers have reported their findings on the detection of road markings using deep learning [26]. Vokhidov et al. [30] presented a damaged arrow-road-marking recognition model using a convolutional neural network. Experiments using six types of arrow-road markings show that the recognition model is promising. However, the authors did not consider the detection of the road markings. Research on the detection of road markings using a convolutional neural network was also reported by Qian et al. [1]. In their research, a hybrid region proposal algorithm and a fast region-based convolutional neural network were used for detecting road markings. However, the study recommends a further search for more accurate and faster detection algorithms.

Similarly, a study on road-marking recognition using a convolutional neural network was reported by Ahmad et al. [35]. The detection of symbolic road markings was not within the scope of their study. Tian et al. [13] utilized a deep convolutional neural network based on faster R-CNN for lane-marking detection only. The experimental results show that an average precision of 68% was obtained. Wen et al. [39] reported a deep learning model for road-marking extraction, recognition, and completion from three-dimensional (3D) mobile laser scanning point clouds. With the modified U-net model, the precision, recall, and F1-score obtained were 95.97%, 87.52%, and 91.55%, respectively.

Despite these promising findings in the literature, fewer studies have addressed road-marking detection. Much attention has been paid only to the recognition of road markings using a convolutional neural network (CNN). The cited works also focused on daytime images, which have better image quality than nighttime images. Moreover, the detection of diverse classes of road markings using a near-infrared camera sensor based on a convolutional neural network (CNN) has been less studied.

There has been increasing concern about road-marking detection in adverse visual conditions such as at night [19]. Many researchers have recommended the use of a NIR camera for road-marking detection and recognition [21]. Designing a model to detect and recognize all road markings would be very demanding, because some datasets contain text-based information and different countries use different characters to convey information. Thus, we have chosen some essential, frequently occurring road markings. In our research, we used eleven (11) different kinds of road markings: (a) single markings: pedestrian crossing, forward arrow, left arrow, right arrow, turn-around arrow, and slow-down sign; (b) dual markings: forward-and-right arrow, forward-and-left arrow, left-and-right arrow, forward-and-turn-around arrow, and left-and-turn-around arrow. Therefore, the objective of this research is to evaluate a modified faster region-based convolutional neural network algorithm for nighttime road-marking detection with potential application in an advanced driver-assistance system (ADAS).

Our contributions are summarized as follows:
(i) This research is a contribution towards developing a nighttime advanced driver-assistance system (ADAS) for automotive applications.
(ii) We have created a new, real-life driving scenario nighttime road-marking dataset using a car-mounted near-infrared camera with a VIS filter and NIR supplement. We augmented the training samples using random rotation and translation to boost the heterogeneity of the detection algorithm.
(iii) We have modified the state-of-the-art faster R-CNN algorithm with a new structure that takes multilayer feature fusion into account. We have demonstrated another promising feature fusion technique of concatenating all the convolutional layers within a stage to extract image features. This modification boosts the overall detection performance of the model by utilizing the advantages of the shallow layers and the deep layers of the VGG-16 network.

The remainder of this article is organized in the following manner: Section 2 provides the methodology. In Section 3, we present the results of our experiments. In Section 4, we discuss our findings. Finally, in Section 5, we conclude based on our objectives and highlight gaps for future research, followed by the list of references.

2. Materials and Methods

The proposed road-marking detection system consists of three networks: (i) a feature extractor (VGG-16 network) with feature fusion, (ii) a detector network—the fast R-CNN detector, which consists of a RoI pooling layer and the classification and regression networks—and (iii) the region proposal network (RPN).

2.1. Feature Extractor (VGG-16)

In our proposed method, we chose VGG-16 as the feature extractor for the simplicity of its architecture. A typical VGG-16 network consists of thirteen (13) convolution layers, each followed by a Rectified Linear Unit (ReLU) activation layer, three fully connected layers, and five (5) pooling layers. The VGG-16 comprises five stages: stage one (conv1) has two convolution layers, two ReLU layers, and a pooling layer; its convolutional layers are denoted as conv1_1 and conv1_2. Stage two (conv2) has two convolution layers, two ReLU layers, and one pooling layer; its convolutional layers are denoted as conv2_1 and conv2_2. Stage three (conv3) has three convolution layers, three ReLU layers, and one pooling layer; its convolutional layers are denoted as conv3_1, conv3_2, and conv3_3. The fourth stage (conv4) has three convolution layers, three ReLU layers, and one pooling layer; its convolutional layers are denoted as conv4_1, conv4_2, and conv4_3. The last stage (conv5) has three convolution layers, three ReLU layers, and one pooling layer [41]. Within each stage, the feature map size is preserved [6, 28]. After an image passes through each stage, its feature map shrinks by half. After several convolution and pooling operations, coarse feature maps are obtained at the last convolution stage [13, 48].
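To make the stage structure concrete, the following is a minimal PyTorch sketch of the thirteen convolutional layers of VGG-16 grouped into the five stages described above. The framework choice, layer grouping, and the exposure of per-stage outputs are illustrative assumptions, not the exact implementation used in our experiments.

```python
import torch.nn as nn

def make_stage(in_ch, out_ch, num_convs):
    """One VGG-16 stage: num_convs 3x3 conv+ReLU layers followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves the feature map size
    return nn.Sequential(*layers)

class VGG16Stages(nn.Module):
    """VGG-16 backbone exposing the output of every stage (conv1..conv5)."""
    def __init__(self):
        super().__init__()
        self.conv1 = make_stage(3,   64,  2)   # conv1_1, conv1_2
        self.conv2 = make_stage(64,  128, 2)   # conv2_1, conv2_2
        self.conv3 = make_stage(128, 256, 3)   # conv3_1..conv3_3
        self.conv4 = make_stage(256, 512, 3)   # conv4_1..conv4_3
        self.conv5 = make_stage(512, 512, 3)   # conv5_1..conv5_3

    def forward(self, x):
        c1 = self.conv1(x)
        c2 = self.conv2(c1)
        c3 = self.conv3(c2)
        c4 = self.conv4(c3)
        c5 = self.conv5(c4)
        return c1, c2, c3, c4, c5   # multi-stage features available for fusion
```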

2.2. Feature Fusion

The features from the deep layers have undergone several convolution operations and downsampling. Consequently, the information from these layers becomes deeply semantic yet more abstract [4, 29, 34, 50]. Conversely, the features from the shallow layers are rich in precise positioning information for the object in the image but score low in semantic feature representation ability [41]. Thus, to take advantage of both the shallow layers and the deep layers, and to overcome the tradeoff between the spatial resolution of the lower layers and the distinctive semantic features of the deep layers, multilayer feature fusion is imperative [28, 41]. The feature maps obtained from the output of each convolution stage should carry sufficient semantic and position information for excellent model performance. The feature map size should be considered before fusion; a feature map that is too small supplies insufficient feature information, and one that is too large increases the computational complexity [28]. In previous research, various techniques were used to fuse the shallow layers and the deep layers of convolutional neural networks [27]. In most existing literature [27, 41], features are extracted only from the last convolutional layer within each stage. In our approach, we concatenate the feature maps from all the convolution layers within each stage. Since the feature map size is preserved within each convolution stage, we concatenated the feature maps without resizing. The channel dimension was then resized using a convolutional layer, which can be used to increase or reduce the number of filters; reducing the number of filters decreases the number of training parameters in the network. We then normalized the feature maps from conv1, conv3, and conv5 using batch normalization. This concept of batch normalization was followed by Yan et al. [47]. After that, the convolutional layer feature maps from conv1, conv3, and conv5 of the VGG-16 were concatenated. The feature maps from conv1 and conv3 were downsampled to the size of the conv5 feature maps. This downsampling to the last convolution stage feature map size worked well for [15] in their research application. By setting the downsampling parameters accordingly, the feature maps were converted to the same size. The proposed feature fusion module is shown in Figure 1.

The resulting same-sized feature maps were merged using element-wise addition, and a convolutional layer was appended to generate the required number of feature maps before they were fed to the RPN. In this way, the advantages of both the concatenation and the summation techniques were utilized. These merits are reflected in the mean average precision of the proposed method. The overall detection pipeline is depicted in Figure 2.
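As an illustration of the fusion described above, the following PyTorch sketch concatenates the per-layer feature maps within a stage, resizes the channel dimension with a 1 x 1 convolution, applies batch normalization, downsamples the conv1 and conv3 branches to the conv5 resolution with strided convolutions, and merges the branches by element-wise addition. The channel counts, strides, and names are assumptions made for illustration, not values taken from our implementation.

```python
import torch
import torch.nn as nn

class StageFusion(nn.Module):
    """Concatenate all conv-layer outputs of one stage, then resize channels with a 1x1 conv."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # channel resizing
        self.bn = nn.BatchNorm2d(out_channels)                              # batch normalization

    def forward(self, layer_maps):
        x = torch.cat(layer_maps, dim=1)   # same spatial size within a stage, so no resizing
        return self.bn(self.reduce(x))

class MultiLayerFusion(nn.Module):
    """Fuse conv1, conv3, and conv5 branches by element-wise addition at the conv5 resolution."""
    def __init__(self, fused_channels=512):
        super().__init__()
        # Assumed channel counts: conv1 layers have 64 channels, conv3 have 256, conv5 have 512.
        self.f1 = StageFusion(2 * 64, fused_channels)
        self.f3 = StageFusion(3 * 256, fused_channels)
        self.f5 = StageFusion(3 * 512, fused_channels)
        # Strided convolutions downsample conv1 (1/2 scale) and conv3 (1/8 scale) to conv5 scale.
        self.down1 = nn.Conv2d(fused_channels, fused_channels, kernel_size=3, stride=16, padding=1)
        self.down3 = nn.Conv2d(fused_channels, fused_channels, kernel_size=3, stride=4, padding=1)
        # Final convolution generates the feature maps fed to the RPN.
        self.out = nn.Conv2d(fused_channels, fused_channels, kernel_size=3, padding=1)

    def forward(self, conv1_maps, conv3_maps, conv5_maps):
        c1 = self.down1(self.f1(conv1_maps))
        c3 = self.down3(self.f3(conv3_maps))
        c5 = self.f5(conv5_maps)
        return self.out(c1 + c3 + c5)       # element-wise addition of same-sized maps
```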

2.3. Detector Network

Initially, R-CNN relied on the selective search algorithm to generate about 2000 regions of interest (RoIs) [4]. However, R-CNN is too slow for real-time detection tasks. An improved version of R-CNN exists, which utilizes an RPN instead of selective search to generate regions of interest (RoIs). A region proposal network (RPN) takes feature maps as input and produces thousands of proposals. The RPN is a small fully convolutional network that branches into two sibling convolutional layers for class-agnostic classification and regression of the RoIs. To generate proposals, a sliding window is moved over the last shared convolutional feature map obtained from VGG-16. At each sliding window position, there are nine (9) anchors, which translate across the feature map and from which proposals of varying sizes are obtained [9]. The RoIs are sent into the two sibling convolutional layers for the box-regression and box-classification tasks [48]. The maximum possible number of proposals for each location is denoted by k. The classification (cls) layer is assigned 2k scores (object or no object) for each proposal. The architecture of the detection system showing the detailed modifications of the VGG-16 is depicted in Figure 3. The regression layer is assigned 4k parameters (depicting the coordinate location of the proposal) [10]. The intersection over union between the ground-truth bounding box and the anchor box is used to determine whether a proposal is positive or negative [11].
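A minimal PyTorch sketch of such an RPN head is shown below: a shared 3 x 3 convolution over the feature map followed by sibling 1 x 1 convolutions producing 2k objectness scores and 4k box offsets per location. The channel width and class names are assumed values for illustration.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """RPN head: shared 3x3 conv, then sibling 1x1 convs for objectness and box regression."""
    def __init__(self, in_channels=512, num_anchors=9):   # k = 9 anchors per location
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(512, 2 * num_anchors, kernel_size=1)   # 2k scores: object / not object
        self.reg = nn.Conv2d(512, 4 * num_anchors, kernel_size=1)   # 4k box coordinates

    def forward(self, feature_map):
        x = self.relu(self.shared(feature_map))
        return self.cls(x), self.reg(x)
```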

The expression for the intersection over union is given by [1]:

$$\mathrm{IoU} = \frac{\mathrm{area}\left(B_{gt} \cap B_{a}\right)}{\mathrm{area}\left(B_{gt} \cup B_{a}\right)}, \tag{1}$$

where $B_{gt}$ is the ground-truth bounding box and $B_{a}$ is the anchor box.

When training the region proposal network, we assign one of two class labels (object or not object) to each anchor. We consider two cases for positive anchors: (i) the anchor that has the highest intersection-over-union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an intersection-over-union overlap of more than 0.7 with any ground-truth box. Similarly, we assigned a negative label to anchors with an intersection over union of less than 0.3. Any anchor with an IoU greater than 0.3 but less than 0.7 does not contribute to the training objective [11]. Anchors come in varying scales and varying aspect ratios; they can be sized accordingly to fit a particular application. The proposals generated by the RPN are fed into the region of interest (RoI) pooling layer, which produces fixed-size RoI features. The output is then fed into the classification and regression layers for classification and regression, respectively.
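A minimal NumPy sketch of equation (1) and of the labeling rule above is given here; the helper names and the assumption that boxes are given as (x1, y1, x2, y2) arrays are ours, not from the original implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors: 1 = positive, 0 = negative, -1 = ignored during training."""
    overlaps = np.array([[iou(a, g) for g in gt_boxes] for a in anchors])
    labels = np.full(len(anchors), -1, dtype=np.int64)
    max_per_anchor = overlaps.max(axis=1)
    labels[max_per_anchor < neg_thresh] = 0      # IoU < 0.3 -> negative
    labels[max_per_anchor > pos_thresh] = 1      # IoU > 0.7 -> positive
    labels[overlaps.argmax(axis=0)] = 1          # best anchor for each ground-truth box
    return labels
```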

2.4. Dataset

The dataset used for training and testing the model was captured using a car-mounted near-infrared camera with NIR supplement on the streets of Changsha City, Hunan Province, China. The specifications of the image acquisition system are given in Table 1. In order to increase the generalization ability of our model, data augmentation is necessary, since a CNN requires large amounts of data to train [36]; the default augmentation techniques in faster R-CNN are not suitable for our task. Therefore, we set horizontal flip, vertical flip, and rotate_90 to false. We introduced random translation and random rotation with bounding boxes to augment our dataset. We set the rotation range in Python such that each randomly sampled image either undergoes a rotation between −10° and 10° with probability 0.5 or remains unchanged with the same probability. A similar procedure is followed for translation. After rotation, the original rectangular image is enclosed in a larger rectangle, and the random rotation function rotates the four corner points of the ground-truth bounding boxes by the angle $\theta$. Thus, the new image width and height become

$$W' = W\left|\cos\theta\right| + H\left|\sin\theta\right|, \qquad H' = W\left|\sin\theta\right| + H\left|\cos\theta\right|,$$

where $W$ and $H$ are the original image width and height and $W'$ and $H'$ are the new image width and height, respectively.
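The following sketch, using OpenCV and NumPy, illustrates this rotation augmentation under our stated assumptions (angle drawn from −10° to 10°, boxes given as corner points); it is an illustrative reimplementation, not the exact augmentation code used in the experiments.

```python
import cv2
import numpy as np

def rotate_with_boxes(image, boxes, max_angle=10.0):
    """Randomly rotate an image and the four corner points of its bounding boxes.

    image : HxWx3 array; boxes : list of 4x2 arrays of (x, y) corner points.
    """
    if np.random.rand() < 0.5:                  # remain unchanged with probability 0.5
        return image, boxes
    angle = np.random.uniform(-max_angle, max_angle)
    h, w = image.shape[:2]
    rad = np.deg2rad(angle)
    # New canvas size: W' = W|cos| + H|sin|, H' = W|sin| + H|cos|
    new_w = int(w * abs(np.cos(rad)) + h * abs(np.sin(rad)))
    new_h = int(w * abs(np.sin(rad)) + h * abs(np.cos(rad)))
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    M[0, 2] += (new_w - w) / 2.0                # shift so the rotated image stays centered
    M[1, 2] += (new_h - h) / 2.0
    rotated = cv2.warpAffine(image, M, (new_w, new_h))
    new_boxes = []
    for corners in boxes:                       # rotate the four ground-truth corner points
        pts = np.hstack([corners, np.ones((4, 1))]) @ M.T
        new_boxes.append(pts)
    return rotated, new_boxes
```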

The captured dataset amounts to 8,536 images containing 11,981 road markings. We randomly selected 2,800 images for training and validation, while the remaining images were set aside for testing and evaluating the algorithm, following the guide of Ren et al. [11]. The ground-truth bounding boxes were subsequently labeled manually using MATLAB. Eleven (11) different road markings (12 classes, including background) were considered.

2.5. Training

Our road-marking detection system was implemented with Python 3.6 in the PyCharm 2017 environment installed on a desktop computer with the following specifications: Core i7-8700, 2.6 GHz processor with a 12-core CPU, and an NVIDIA GeForce GTX 1080 Ti with 8 GB RAM. The nighttime detection system was implemented using the source code (with modifications) published in Python [7].

We used the NIR images described in Section 2.4 to train the model with the stochastic gradient descent algorithm [13]. We jointly trained the detection algorithm with an initial learning rate of 0.0001 for 550 epochs. We set the momentum and weight decay to 0.9 and 0.0005, respectively. The VGG-16 was initialized using ImageNet pretrained weights, and the road-marking detection algorithm was then fine-tuned using the NIR images. We stopped the training process at 550 epochs of 2,000 iterations each, when the training loss remained constant. We resized the images so that the shorter side has 512 pixels. We used three anchor scales of 128², 256², and 512² and three aspect ratios of 1:1, 1:2, and 2:1. For RPN training, we consider all the anchors and randomly sample them to form a minibatch of size 256 with a 1:1 foreground-to-background sample ratio based on our IoU thresholds for positive and negative samples. The RPN computes the classification loss over the anchors using binary cross entropy, and the anchors in the minibatch identified as foreground are used to calculate the regression loss, computed using the smooth L1 loss. The nonmaximum suppression (NMS) algorithm was applied to reduce the number of redundant anchors. The IoU threshold for suppression and the number of proposals per image were set to 0.7 and 2000, respectively, from which the 300 best proposals per image were used to train our detector network, in line with Ren et al. [27]. Similarly, for the detector training, the classification loss is a multiclass cross-entropy loss over the twelve classes of road markings including background, and the smooth L1 loss was used for the prediction offset losses.
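For concreteness, a hedged sketch of the anchor generation and optimizer settings described above is shown below; the function and constant names are illustrative, and only the numerical values (scales, ratios, learning rate, momentum, weight decay, NMS threshold, proposal counts) are taken from the text.

```python
import itertools
import numpy as np
import torch

# Anchor scales (areas of 128^2, 256^2, 512^2 pixels) and aspect ratios 1:1, 1:2, 2:1.
ANCHOR_SCALES = [128, 256, 512]
ANCHOR_RATIOS = [1.0, 0.5, 2.0]

def base_anchors():
    """Nine (width, height) pairs, one per scale/ratio combination."""
    anchors = []
    for scale, ratio in itertools.product(ANCHOR_SCALES, ANCHOR_RATIOS):
        w = scale * np.sqrt(ratio)      # preserves the anchor area scale^2
        h = scale / np.sqrt(ratio)
        anchors.append((w, h))
    return anchors

def make_optimizer(model):
    """SGD with the hyperparameters reported in the text."""
    return torch.optim.SGD(model.parameters(), lr=0.0001,
                           momentum=0.9, weight_decay=0.0005)

# RPN proposal settings used during training.
NMS_IOU_THRESHOLD = 0.7     # suppression threshold
PRE_NMS_PROPOSALS = 2000    # proposals kept per image before selection
POST_NMS_PROPOSALS = 300    # best proposals fed to the detector network
```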

We evaluated our detection algorithm using the average precision and the mean average precision (mAP) with an intersection-over-union (IoU) threshold of 0.5, in line with the VOC-2012 requirement and as per Ren et al. [11]. The main objective of the experiment was to jointly train the RPN and the detector network to minimize the loss in equation (2) given by [11]. In our experiment, the total loss represents the sum of the RPN loss and the detector network loss, since the model was jointly trained. The whole detection model was trained end to end by back-propagation using the stochastic gradient descent (SGD) algorithm.

The RPN is trained to minimize the multi-task loss of Ren et al. [11],

$$L\left(\{p_i\},\{t_i\}\right) = \frac{1}{N_{cls}}\sum_i L_{cls}\left(p_i, p_i^{*}\right) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^{*} L_{reg}\left(t_i, t_i^{*}\right). \tag{3}$$

The first term in equation (3) represents the classification loss over two classes (i.e., object vs. not object). The second term represents the regression loss of the bounding boxes; this term is active only when $p_i^{*} = 1$. Here, $i$ is the index of an anchor during training of the faster R-CNN, $p_i$ is the predicted probability that anchor $i$ contains an object, and $p_i^{*}$ is the ground-truth label, which is 1 for a positive anchor and 0 for a negative anchor. $t_i$ is a vector representing the four parameterized coordinates of the predicted bounding box, and $t_i^{*}$ is the corresponding vector for the ground-truth box associated with a positive anchor [28]. $N_{cls}$ and $N_{reg}$ are normalization factors, and $\lambda$ is a weighted balancing parameter; in our approach, $N_{cls}$, $N_{reg}$, and $\lambda$ were set so that the classification and regression terms in equation (3) are approximately equally weighted. $L_{cls}$ is the classification loss, which is a log loss over the two binary classes (i.e., object vs. no object). The regression loss is given mathematically as

$$L_{reg}\left(t_i, t_i^{*}\right) = \mathrm{smooth}_{L1}\left(t_i - t_i^{*}\right), \tag{4}$$

where $\mathrm{smooth}_{L1}$ is the smooth L1 loss, usually determined as

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^{2}, & \text{if } |x| < 1, \\ |x| - 0.5, & \text{otherwise.} \end{cases} \tag{5}$$

The parameterized coordinates are mathematically expressed as

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a},$$
$$t_x^{*} = \frac{x^{*} - x_a}{w_a}, \quad t_y^{*} = \frac{y^{*} - y_a}{h_a}, \quad t_w^{*} = \log\frac{w^{*}}{w_a}, \quad t_h^{*} = \log\frac{h^{*}}{h_a}, \tag{6}$$

where $x$, $y$, $w$, and $h$ are the center coordinates, width, and height of the predicted box; $x_a$, $y_a$, $w_a$, and $h_a$ represent the anchor center, width, and height; $x^{*}$, $y^{*}$, $w^{*}$, and $h^{*}$ correspond to the ground-truth box; and $t_x$, $t_y$, $t_w$, and $t_h$ are the regression coefficients (RPN predictions).
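The following NumPy sketch of equations (4)–(6) is illustrative; the function names are ours, and it simply encodes the box parameterization and the smooth L1 loss described above.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss of equation (5), applied element-wise."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def encode_box(box, anchor):
    """Parameterized regression targets (t_x, t_y, t_w, t_h) of equation (6).

    box and anchor are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def reg_loss(pred_box, gt_box, anchor):
    """Regression loss of equation (4), summed over the four coordinates."""
    t_pred = encode_box(pred_box, anchor)
    t_star = encode_box(gt_box, anchor)
    return smooth_l1(t_pred - t_star).sum()
```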

In general, the training and detection procedure of the faster R-CNN is summarized as follows:
(i) Pass the images through the VGG-16 network for feature extraction.
(ii) Extract about 2000 region proposals per image from the region proposal network.
(iii) Generate a fixed-size feature vector from the RoI pooling layer.
(iv) Pass the required RoIs to the fully connected layers.
(v) Train the RPN and the detection network by back-propagation using the stochastic gradient descent algorithm.
(vi) Perform the classification and detection tasks to evaluate the trained model using the test images.

2.6. Evaluation Metrics

We evaluated our method based on the mean average precision (mAP) and the false positive rate, which is a metric for the false alarm probability, as proposed by [13, 25, 28, 41]. These are vital evaluation parameters, as used by [43]. The algorithm performs best when the mean average precision is high, and vice versa. The predicted bounding box was compared with the ground-truth bounding box to ascertain the validity of the prediction. The prediction is considered true if the IoU is greater than or equal to 50%, as per the VOC-2012 requirement; otherwise, it is false [29]. The IoU was computed using equation (1). The precision, recall, and false positive rate are given by [38, 43] as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN},$$

where $TP$, $FP$, $FN$, and $TN$ denote the numbers of true positives, false positives, false negatives, and true negatives, respectively.
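A minimal sketch of these metrics, assuming detections have already been matched to the ground truth at the 0.5 IoU threshold, is given below; the counting of true negatives for the false positive rate depends on how background windows are enumerated, so the example values are purely illustrative.

```python
def precision_recall_fpr(tp, fp, fn, tn):
    """Precision, recall, and false positive rate from matched detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr

# Example: 90 correct detections, 8 false alarms, 12 missed markings, 500 background windows.
print(precision_recall_fpr(tp=90, fp=8, fn=12, tn=500))
```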

3. Results

The results of our experiments are presented in Figure 4 and Table 2.

4. Discussion

First, our road-marking detection algorithm was implemented based on the faster R-CNN with VGG-16 as the base network. Then, we ran another experiment using our proposed method.

The results obtained with VGG-16 as the base network were used as a baseline for assessing the contribution of our proposed multilayer feature fusion. Figure 4(a) shows the average precisions of the forward-and-right arrow (DU1), forward-and-left arrow (DU2), left-and-turn-around arrow (DU3), left-and-right arrow (DU4), forward-and-turn-around arrow (DU5), pedestrian crossing (PD), left arrow (LT), right arrow (RT), slow down (SD), forward arrow (ST), and turn-around arrow (UT). The maximum and minimum average precisions recorded were 99.00% and 69.5%, corresponding to the turn-around marking and the pedestrian marking, respectively. Table 2 shows the mean average precision of the classes and the extent of variation between the class average precisions (standard deviation). The final training loss recorded was 0.1042 after 550 epochs. As shown in Figure 4(b), the training curve of the state-of-the-art faster R-CNN did not converge quickly. By setting the confidence score threshold between 0 and 1, different corresponding values of the recall and the false positive rate are obtained, from which the detection performance curve is derived. From Figure 4(c), the modified faster R-CNN algorithm performs better than the state-of-the-art faster R-CNN algorithm in terms of a higher detection rate and a lower false alarm rate. Thus, the proposed method outperforms the state-of-the-art faster R-CNN algorithm.

The statistical t-test between the two mean average precisions of the algorithms resulted in a p value of 0.030 at a 95% confidence interval. Since the p value is ≤ 0.05, the two detection algorithms do not perform equally, and there is a significant difference between the mean average precisions of the modified faster R-CNN and the faster R-CNN algorithms, as shown in Table 2. Essentially, the modified faster R-CNN also performed better in terms of the detection rate. Figure 4(d) shows an example of a detection result: the ground-truth bounding box is green, the detected mark is sky blue, the IoU for this example was 90.84%, and the detection accuracy was 99.99%. Examples of good detection results with multiple road markings are shown in Figures 4(e)–4(g). However, there were a few cases of missed detection where the IoU criterion was not satisfied, as seen in Figure 4(h) for the pedestrian crossing sign. The mean average precisions for the faster R-CNN and our modified model were 89.43% and 92.83%, respectively. The modification has contributed to an increase in mAP of ∼4%. The mean average precision obtained in both cases outperforms the results reported by [1, 13]. The detection time per image (even with multiple road markings), using our desktop computer (Core i7-8700, 2.6 GHz processor with a 12-core CPU, and an NVIDIA GeForce GTX 1060 with 6 GB of memory and 12 GB of RAM), was ∼0.34 s (3.0 fps) for the faster R-CNN and ∼0.30 s (3.33 fps) for the modified faster R-CNN. The mAP, the false positive rate, and the detection speed of our model show that our modified algorithm is promising for real-life applications.

5. Conclusions

This research demonstrated the application of a NIR camera as a reliable image acquisition system for a nighttime driving-assistance system. We built our dataset for the experiment from images captured on the streets of Changsha City, and we introduced random rotation and translation with bounding boxes to augment our training samples. The proposed method utilizes multilayer feature fusion to take advantage of the shallow- and deep-layer features. The proposed algorithm was trained to minimize the loss using the stochastic gradient descent algorithm. The trained model was evaluated based on the mean average precision, the false positive rate, and the detection time. We recorded an improvement of about 4% in mAP. We statistically assessed the contribution of our proposed method in terms of mAP and obtained a p value of 0.03, which shows that there is a significant difference between the mAP of our proposed method and that of the faster R-CNN. Also, our proposed method can detect the road markings in an image in ~0.30 s (3.33 fps). Thus, the model is suitable for application in driver-assistance systems. In the future, we will consider other algorithms such as YOLO and SSD for faster detection.

Data Availability

The dataset used has not yet been made publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

Authors’ Contributions

The following are the authors’ contributions: Shitu Abubakar and Junping Hu performed the conceptualization; Shitu Abubakar contributed in the methodology; Shitu Abubakar and Junping Hu prepared the software; Xiaobiao Dai and Hao Sha helped in the validation; Shitu Abubakar performed the formal analysis; Shitu Abubakar wrote and prepared the original draft; Shengjun Liu wrote, reviewed, and edited the manuscript; Junping Hu and Shengjun Liu supervised the study; Junping Hu contributed in acquiring funds.

Acknowledgments

This research was funded by the National Natural Science Foundation of China, grant number 61505264, 2016.

Supplementary Materials

The supplementary materials provide the details of the proposed feature fusion layers and the detection pipeline.