INFRARED small target detection is a key technique in infrared search and tracking (IRST) systems. IRST has excellent potential in scientific research and civil applications. With the rise and widespread use of drones in recent years, infrared sensors are increasingly being carried by drones to perform field detection, rescue, and precise positioning [1][2][3] (as shown in Figure 1). However, limited by the resolution of the infrared sensor, atmospheric scattering, background temperature noise, etc., the imaging quality of an infrared sensor is generally worse than that of a visible light sensor. Moreover, in most of these applications, the target size in an infrared image is often less than 10 pixels. Therefore, the detection of small infrared targets offers growing application prospects while posing considerable technical challenges.
This work was supported by National Natural Science Foundation of China (Grant No. 61704167, 61434004), Beijing Municipal Science and Technology Project (Z181100008918009), Youth Innovation Promotion Association Program, Chinese Academy of Sciences (No.2016107), the Strategic Priority Research Program of Chinese Academy of Science, Grant No.XDB32050200. (Corresponding author: Nanjian Wu.)
The authors are with the State Key Laboratory of Superlattices and Microstructures, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China, and also with the Center of Materials Science and Optoelectronics Engineering, University of Chinese Academy of Sciences, Beijing 100049, China, and also with the Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Beijing 100083, China (email: zhaomingxin17@semi.ac.cn; chengli17@semi.ac.cn; yangxu@semi.ac.cn; fengpeng06@semi.ac.cn; liuly@semi.ac.cn; nanjian@red.semi.ac.cn).
Fig. 1. Use infrared sensor on drones for scientific research and field rescue
Many methods have been proposed to solve the problem of small target detection. These methods can generally be classified into two categories: single-frame detection and multi-frame detection. Multi-frame detection algorithms usually consume more time than single-frame detection algorithms and generally assume that the background is static [4], which makes them unsuitable for unmanned aerial vehicle (UAV) applications. In this paper, we therefore focus on single-frame detection algorithms.
The traditional single-frame detection methods based on morphological filtering consider that the small target belongs to the high-frequency components of the image, so that the small target can be separated from the background by filtering. For example, 2-D least-mean-square (TDLMS) adaptive filtering [5], Max-Mean/Max-Median filtering [6], and Top-Hat filtering [7] are all such methods. These methods are easily affected by the clutter and noise present in the background, which reduces the robustness of the detection.
In recent years, the small target detection methods based on the human visual system (HVS) are mainly used to distinguish the target from the background by constructing different local contrast measures. The local contrast measure (LCM) [8], the improved LCM (ILCM) [9], novel local contrast measure (NLCM) [10], weighted new local contrast measure (WLCM) [11], novel weighted image entropy map (NWIE) [12], multi-scale patch-based contrast measure (MPCM) [13], high-boost-based multiscale local contrast measure (HB-MLCM) [14] are all HVS-based methods. These methods construct an internal window and its adjacent windows, or surrounding areas, in a local area to calculate the contrast between the internal window and adjacent windows or surrounding areas to enhance the local target features. The detection of the target is achieved by sliding the internal window over the entire image and finally using adaptive threshold segmentation. However, these algorithms are also susceptible to factors such as edges and noise.
Due to the great success of deep neural networks in natural image processing, some works [15], [16], [17] have begun to introduce deep learning, especially convolutional neural networks (CNN), into infrared small target detection. However, most CNN algorithms do not perform well in learning small target features [18] and take a long time to run inference. For example, Mask-RCNN [19] takes about one-fifth of a second to process one image on the GPU. Lightweight networks such as YOLO [20] and SSD [21] are compact and fast, but their detection is less effective. Moreover, all of the above algorithms perform poorly on small objects. Among the CNN methods applied to infrared small target detection, Fan et al. [15] used the MNIST dataset to train a multiscale CNN and extracted the convolution kernels to enhance small target images. Lin et al. [16] designed a seven-layer network to detect small targets by learning from synthetic data generated by oversampling. Wang et al. [17] transferred a CNN pre-trained on ILSVRC 2013 to small target datasets to learn small target features. However, objects in the real world usually contain a large amount of shape, color, and structural information, which is not available in small targets, so the effectiveness of transfer learning is limited.
In the field application scenarios of drones, the background often contains a lot of clutter such as branches, roads, and buildings. Compared with the sky, clouds, and sea surface, the background composition is more complicated. Intuitively, if a broader range of image features can be utilized, it should help to suppress these complicated interferences and reduce the false alarm rate, which is difficult to achieve with traditional methods based on local features.
Encouraged by the progress of image segmentation networks in recent years, we hope to segment the target image from the original image directly. However, using a natural image segmentation network to segment a small target can cause some problems. First, the artifact problem seriously degrades segmentation and detection performance. Second, the small target occupies a small proportion of the entire image, so the training process encounters a severe class imbalance problem [22]. In actual training, the convergence curve shown in Figure 2 is prone to appear: the network converges quickly, but the small target features cannot be adequately learned.
Although training on small images, as in [16], can alleviate the class imbalance problem, it prevents the network from learning the interference characteristics of the background over a broader range, and makes it impossible to fully exploit the large receptive field of a deeper network to suppress background interference.
Fig. 2. A training loss curve caused by the class imbalance problem.
To solve the above problems, we propose a novel segmentation convolutional neural network called TBC-Net and design the corresponding training method. Compared with traditional methods and HVS-based methods, the proposed network can achieve better detection performance. The contributions of this paper are as follows:
1) A lightweight infrared small target detection neural network called TBC-Net is proposed, which includes a target extraction module and a semantic constraint module.
2) A novel training method is proposed to solve the problem of extreme imbalance between foreground and background in small target images by adding high-level semantic constraint information of images during training.
3) TBC-Net achieves real-time detection on the NVIDIA Jetson AGX Xavier embedded development board.
The remainder of this paper is organized as follows: In Section II, we briefly review the technical background related to TBC-Net. In Section III, we introduce the network structure and detection method and analyze the storage and computational complexity of the network. In Section IV, we introduce loss function design and training methods. In Section V, we experiment with real infrared sequences. Finally, we present the conclusions of this paper in Section VI.
In this section, we briefly introduce the technical background of TBC-Net, including CNN-based image segmentation, residual learning, and semantic constraint, so that readers can better understand the design ideas of TBC-Net in subsequent sections.
A. CNN-based Segmentation
The fully convolutional neural network (FCN) [23] was proposed to use CNN for image segmentation by replacing the fully connected layer with the convolutional layer. However, FCN uses VGG-Net [24] as its feature extraction network and takes one-fifth of a second to complete an image segmentation, which makes real-time performance difficult to achieve. Noh et al. [25] used deconvolution to upsample the feature map and obtain the segmentation mask. There are also many works such as [26][27] based on deconvolution to achieve image segmentation. Nevertheless, images produced by deconvolution also contain some artifacts.
Image segmentation can also be seen as a pixel-wise classification task. Chen et al. [28] used a convolution operation called "dilated convolution" to achieve pixel-wise classification. Yu et al. [29] proposed dilated residual networks (DRN) to reduce the checkerboard artifacts caused by dilated convolution.
In order to fuse the multiscale information, Lin et al. [30] used multiscale image input and sliding pyramid pooling to improve the performance. Deeplab [28] used multiscale input images to extract features. U-Net [31] realized multiscale feature fusion by concatenating the feature maps obtained by downsampling with the upsampled feature maps of the same scale.
All the above algorithms are aimed at natural images; for small infrared targets that lack complicated features within 10 pixels, there is still no effective deep-learning-based segmentation solution.
B. Residual Learning
ResNet [32] was proposed to use residual connections to enhance the learning effect of CNN. Highway Networks [33] and DenseNet [34] have proven that residual connections are effective both in improving convergence and in enhancing network performance. Residual connections are short paths from early layers to later layers. On the one hand, these cross-layer connections can alleviate the problems that the network is difficult to converge and that the inference accuracy is reduced due to the vanishing-gradient problem. On the other hand, the shallow information of the network can be directly transmitted to the deeper layers so that the network can more effectively learn the image features contained in the shallow layers. DnCNN [35] used the idea of residual learning to construct a network for image denoising, which achieved good visual noise reduction.
C. Semantic Constraint
Liu et al. [36] proposed using a high-level vision task to enhance a denoising neural network. They cascaded a denoising CNN with various high-level vision tasks and used the joint loss to update the weights of the denoising CNN during training, thereby improving the visual quality of the output image. This training method, which incorporates image semantic constraints, significantly improves the learning of image features, such as noise, that are difficult to describe.
Following this idea, Wang et al. [37] adopted image denoising as a low-level vision task and image segmentation as a high-level vision task. Then they trained the joint pipeline using hybrid losses to improve denoising effect.
In general, an infrared small target image can be expressed by the following formula [38]:
$$f_D(x, y) = f_T(x, y) + f_B(x, y) + f_N(x, y)$$
where $f_D$, $f_T$, $f_B$, $f_N$, and (x, y) are the original infrared image, the small target image, the background, the noise, and the pixel location, respectively. In the following parts, we omit (x, y) where this causes no confusion.
A. Network Architecture
Our proposed network, as shown in Figure 3, consists of two modules: a target extraction module (TEM) and a semantic constraint module (SCM). We name the network TBC-Net, and the reason for the name will be explained at the beginning of Section IV. The TEM is a lightweight image segmentation network with compact operations and flexible structure parameters for efficient inference. The SCM is a multi-layer CNN used to perform a high-level classification task.
The input infrared image is processed by the TEM to obtain the target image. The SCM classifies the target image according to the number of targets it contains. This high-level vision task adds a semantic constraint to TBC-Net and improves the TEM performance during the training phase. When training TBC-Net, the SCM is first pre-trained on the synthetic data; then, when training the TEM, the SCM parameters remain unchanged. The quality of the target image obtained by the TEM affects the classification results of the SCM, and the constraint information brought by image semantics is then transferred to the TEM through backpropagation. The SCM thus alleviates the difficulty of learning features caused by the imbalance between target and background in small target data, so that the compact TEM can still effectively learn the small target features. To train the network, we propose a joint loss function and a corresponding training method, which are described in Section IV.
During the inference phase, only the TEM plays a role in extracting targets. Therefore, in practical applications, the inference speed of the network depends only on the complexity of the TEM.
Below we introduce the design ideas and details of TEM and SCM.
B. Target Extraction Module
Compared with natural image datasets such as ImageNet [39], Pascal VOC [40], and MS COCO [41], infrared small target images do not contain color information. Moreover, because the targets are small, they carry no category or shape information. Networks used for semantic and instance segmentation of natural images generally need to learn color, shape, and category features, which is unnecessary for infrared small target data. Therefore, to improve the network's inference efficiency, we design a compact TEM with a more lightweight network structure that exploits the characteristics of infrared small target data.
Based on the above analysis, we use compact operations to implement the downsampling and upsampling modules to form the "encoder-decoder" structure commonly used in the image segmentation field. The structure of the TEM is shown in Figure 4. We use the 2D convolutional layer and the MaxPooling layer to form the downsampling module (as shown in Figure 5a) and, following [42], use nearest neighbor interpolation and a 2D convolutional layer to form the upsampling module (as shown in Figure 5b). The upsampling features are fused with the downsampling features of the same scale by the residual connections. The upsampling modules, the downsampling modules, and the residual connections together form the TEM.
Fig. 3. The architecture of TBC-Net.
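The resampling operations described above can be illustrated with a minimal NumPy sketch. It covers only 2x2 max pooling and nearest neighbor upsampling; the convolution layers are omitted, and the additive same-scale fusion in the last line is an assumption made for illustration, not a detail confirmed by the text.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) array with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_nearest_2x(x):
    """Nearest neighbor upsampling by a factor of 2 along both axes."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

feature = np.arange(16, dtype=np.float32).reshape(4, 4)
down = max_pool_2x2(feature)        # shape (2, 2)
up = upsample_nearest_2x(down)      # back to shape (4, 4)
fused = feature + up                # same-scale fusion (assumed additive here)
```

Because nearest neighbor interpolation only copies values, it needs no multiply-add operations, which is consistent with ignoring it in the later OPs analysis.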
The TEM is used to extract the target image, and the formal expression is as follows:
$$\hat{f}_T = \mathrm{TEM}(f_D)$$
where $f_D$ is the input infrared image and $\hat{f}_T$ denotes the target image output by the TEM.
The infrared image is generally a single-channel grayscale image. The number of input channels is firstly expanded using a 2D convolution layer. The number of channels that the input layer expands is called base channels (BC). The number of downsampling operations determines how many scales the network needs to fuse, so it is also a critical network structure parameter. We name the maximum scale level produced by downsampling L. By changing BC and L, the structure of the network can be adjusted accordingly. Besides, BC and L also affect the parameter storage space and the amount of computation of the algorithm.
It is worth noting that we do not use zero padding, in order to avoid interfering edges in the target image, and we do not use deconvolution, in order to avoid checkerboard artifacts. Both of these affect the detection of small targets (as shown in Figure 6).
C. Semantic Constraint Module
In edge segmentation, vessel segmentation, and other applications, many methods, such as training on small patches or using weighted loss functions, have been proposed to solve the class imbalance problem. However, as mentioned above, training on small images prevents the network from learning the complex interference information of the background and from effectively utilizing its large receptive field. Meanwhile, the choice of weights in the loss function is task-specific and hard to optimize [22].
Motivated by [36], we believe that adding semantic information during training can enable the network to learn the features of small targets better. Nevertheless, small infrared targets often have only 2 to 10 pixels, and there are a large number of non-ideal factors in the infrared imaging process, making it difficult to describe them with high-level semantics such as shapes and categories. According to the characteristics of the small target image, we propose a direct and straightforward image semantic constraint, that is, counting the number of small targets in the target image. The method of using this semantic information is as follows:
1) Use the TEM to extract the target image.
2) Use another network to predict the number of targets contained in the target image.
We use a CNN to classify the target image; its structure is shown in Table I, where the output size of the fully connected layer equals the number of classes. We call this classification network the semantic constraint module to illustrate its guiding role at the semantic level during TEM training. This process is also illustrated in Figure 3.
TABLE I NETWORK STRUCTURE OF SEMANTIC CONSTRAINT MODULE USED IN TBC-NET (
D. Segmentation and Detection
After the TEM obtains the target image, we apply an adaptive threshold to it to obtain the binarized segmentation image. The adaptive threshold is defined as follows:
$$T = \mu + k\sigma$$
where T is the segmentation threshold, and $\mu$ and $\sigma$ are the mean and standard deviation of the TEM output image, respectively. k is an empirical parameter, which ranges from 20 to 30 in our experiments.
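As a rough illustration of this segmentation step, the sketch below thresholds an output image at its mean plus k times its standard deviation. The function name and the default k = 25 (a value inside the stated 20 to 30 range) are assumptions for the example.

```python
import numpy as np

def adaptive_threshold(output, k=25.0):
    """Binarize a TEM-like output image with the threshold T = mean + k * std."""
    mu = float(output.mean())
    sigma = float(output.std())
    t = mu + k * sigma
    return output > t, t

# Toy output image: an almost empty map with a single bright response.
output = np.zeros((64, 64), dtype=np.float64)
output[10, 20] = 1.0
mask, threshold = adaptive_threshold(output)
```

A large k is workable here because a well-suppressed output is nearly zero everywhere, so its standard deviation is very small compared with the target response.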
The complete workflow of using TBC-Net for small target detection is shown in Figure 7.
Fig. 4. Target extraction module of TBC-Net
Fig. 5. Downsample module and upsample module used in TBC-Net
Fig. 6. Problems caused by zero padding and deconvolution.
E. Storage and Computation Analysis
Generally, only the computational complexity is analyzed in the algorithm analysis. However, for the practical application scenarios of infrared small target detection, the detection algorithm generally runs on hardware devices with limited storage space and computation resources, while the parameter storage requirements of the neural network are relatively large. So, in addition to the computational complexity analysis, we also analyze the parameter storage requirements.
1) Computational Complexity Analysis: In the computational complexity analysis, we do not use the O notation, but use OPs (the number of multiply-add operations), commonly used in neural network design [43], as an indicator to measure the amount of calculation. The nearest neighbor interpolation used in upsampling modules does not require multiply-accumulate operations, so it is ignored in the calculation of OPs. The OPs of a 2D convolution layer is as follows:
$$\mathrm{OPs} = \frac{h \cdot w \cdot c_{in} \cdot c_{out} \cdot k_s^2}{s^2}$$
where h and w are the height and width of the input feature map, $c_{in}$ and $c_{out}$ are the number of input and output channels, respectively, $k_s$ is the convolution kernel size, and s is the stride size. In the 2D convolution layers of TBC-Net,
. Therefore, the OPs of TBC-Net can be obtained by accumulating the OPs of all the downsampling and upsampling modules.
where H and W are the height and width of the TBC-Net input image, respectively.
The calculation shows that the computational complexity of the algorithm is proportional to the number of pixels of the input image and to the maximum scale level L, and grows quadratically with the number of base channels BC.
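The per-layer count can be checked with a small helper. This is a sketch under stated assumptions: border effects are ignored (the output is taken as h/s by w/s), and the 3x3 kernel and channel numbers in the example calls are hypothetical values, not the TBC-Net configuration.

```python
def conv2d_ops(h, w, c_in, c_out, kernel, stride=1):
    """Multiply-add count of a 2D convolution on an h x w input feature map.

    Border effects are ignored: the output is assumed to be (h/stride) x (w/stride).
    """
    out_h, out_w = h // stride, w // stride
    return out_h * out_w * c_in * c_out * kernel * kernel

# Hypothetical layers: doubling both channel counts quadruples the cost.
ops_narrow = conv2d_ops(64, 64, 8, 8, kernel=3)
ops_wide = conv2d_ops(64, 64, 16, 16, kernel=3)
ops_input = conv2d_ops(128, 128, 1, 16, kernel=3)
```

The ratio between `ops_wide` and `ops_narrow` shows the quadratic dependence on the channel width that the analysis attributes to BC.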
2) Parameter Storage Analysis: The storage requirement consists of two parts: the feature map storage and the parameter storage.
During the inference phase of TBC-Net, the previous feature maps cannot be discarded until the first upsampling. After the first upsampling, the downsampled feature map corresponding to the upsampling scale can be discarded. So the maximum storage requirement of feature maps during the inference occurs before the first upsampling:
where is the number of pixels of the input image.
Fig. 7. Using TBC-Net to detect infrared small targets.
In terms of network weights, its storage space mainly depends on BC and L. The number of parameters of a 2-D convolution layer is , the parameter storage space is calculated as follows:
The first part is the amount of weights storage required for the downsampling and upsampling modules, and the second part is the weights storage of the input and output layer. It can be seen that the parameter storage space is approximately proportional to the square of BC and exponential with L. Furthermore, we can get the total storage requirement of TBC-Net:
TABLE II STORAGE AND COMPUTATION UNDER DIFFERENT (
As the configurations in Table II show, when BC = 16 and L = 5, TBC-Net has 3 million parameters and a maximum storage requirement of 5 million, which is suitable for real-time processing on most embedded devices with CNN accelerators.
The loss function and training method are the core of TBC-Net's ability to learn small target features better. We design a joint loss function for TBC-Net that includes target extraction loss, background suppression loss, and classification loss to overcome the shortcomings of CNN in learning small target features. This is why we name the network TBC-Net: T, B, and C are the initials of the three loss functions. Furthermore, we propose corresponding data synthesis and training methods.
A. Loss Function Design
1) Target Extraction Loss: The extracted target image $\hat{f}_T$ should be as close as possible to the ground truth $f_T$. Compared with using the $\ell_2$ norm to measure the difference between images, the $\ell_1$ norm and structural similarity (SSIM) have better effects [44][45]. However, directly using the $\ell_1$ norm brings halo artifacts. Following [44], we use the mix loss of the $\ell_1$ norm and SSIM here, expressed as follows:
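where N is the total number of images in the training data. The SSIM index is defined as follows:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ are the pixel means and standard deviations in the windows obtained by sliding a fixed-size window over the images x and y, respectively, and $\sigma_{xy}$ is the corresponding covariance. $C_1$ and $C_2$ are two constants that stabilize the division when the denominator is weak.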
In experiment, we set , and the window size to calculate
and
is
.
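To make the SSIM index concrete, the following sketch evaluates it on a single window. It is not the sliding-window implementation used for training; the constants `c1` and `c2` are common defaults for images scaled to [0, 1] and are assumptions here, since the values used in the experiment are not reproduced above.

```python
import numpy as np

def ssim_window(x, y, c1=1e-4, c2=9e-4):
    """SSIM of two equally sized windows, from their means, variances, and covariance."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
window = rng.random((11, 11))          # hypothetical 11 x 11 window
same = ssim_window(window, window)     # identical windows give the maximum score
blank = ssim_window(window, np.zeros_like(window))
```

A loss term is typically derived from this index as one minus its value, so that identical windows contribute zero loss.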
2) Background Suppression Loss: The TEM obtains an image containing only small targets, and small targets are often sparse in the infrared image. Therefore, following the practice in IPI [46], we impose a sparse constraint on the target image to further suppress the background and obtain a sparse result. The loss can be directly expressed by the $\ell_1$ norm of the TEM output $\hat{f}_T$ as follows:
$$L_B = \|\hat{f}_T\|_1$$
3) Classification Loss: As mentioned earlier, the semantic constraint module is essentially a network that classifies the target image, so we use the cross-entropy loss expressed as follows:
$$L_C = \mathrm{CE}(\mathrm{SCM}(\hat{f}_T), n)$$
where CE is the abbreviation of the cross-entropy loss, $\hat{f}_T$ is the target image output by the TEM, and $n$ denotes the ground truth classification label corresponding to the target image (that is, the number of targets included therein).
4) Total loss: From the above three loss functions, we can get the joint loss function required to train TBC-Net, as shown below:
$$L = \sum_i \lambda_i L_i$$
where $\lambda_i$ is the weight of the loss function $L_i$. In the experiment, we take
. The method to calculate the joint loss during training is illustrated in Figure 8.
Fig. 8. The method of calculating joint loss for training TBC-Net.
5) Analysis: Through analysis of the training procedure, we can explain the effectiveness of the joint loss function in small target feature learning. CNN training generally adopts a gradient descent algorithm, that is, the network weights are updated according to the following formula:
$$w_{t+1} = w_t - \eta \frac{\partial L}{\partial w_t}$$
where $L$ is the joint loss, and $w_t$, $w_{t+1}$, and $\eta$ denote the network weights in the t-th and (t+1)-th training steps and the learning rate, respectively.
With the chain rule, we can decompose the gradient of the joint loss with respect to the network weights into four partial derivatives with respect to the TEM output image, from which we can see the effect of the joint loss on the TEM output. The decomposition process is as follows:
We combine the four partial derivatives in parentheses into two parts:
Because these loss functions are combined additively, their effects on the TEM output are independent. We can analyze them separately to explain their effects.
For the first part, we focus on the joint role of the $\ell_1$ term of the target extraction loss and the background suppression loss in the target region and the background region. Let $\hat{f}_T$ denote the TEM output image and $f_T$ the ground truth. In the background region, if $\hat{f}_T$ is greater than $f_T$, the partial derivatives of both losses with respect to $\hat{f}_T$ are simultaneously equal to 1, so that the background, together with the fluctuations introduced by the gradient updates, can be suppressed. In the target region, although the same suppression occurs when $\hat{f}_T$ is greater than $f_T$, when $\hat{f}_T$ is smaller than $f_T$ the partial derivative of the $\ell_1$ term is equal to -1 and that of the background suppression loss is equal to 1, so the superposed gradient is 0. Meanwhile, the gradient of the SSIM term still exists, which makes the output $\hat{f}_T$ of the TEM gradually approach $f_T$. The combined effect of the two losses shields the gradient update caused by the $\ell_1$ term in the target region, which reduces the occurrence of artifacts (as shown in Figure 9) without losing the good suppression effect on the background region, and there is no need to worry about the target features being smoothed out. The qualitative interpretation is shown in Figure 10. In Figure 10, the orange line represents the output grayscale of the TEM, and the green line represents the grayscale of $f_T$. The gray area and light blue area show the combined effect of the gradients of the two loss functions in the target region and the background region, respectively.
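The sign argument in this paragraph can be checked numerically. The sketch below adds the element-wise derivative of an absolute-error term to that of an absolute-value sparsity term; equal weights for the two terms are assumed for illustration, since the loss weights used in the experiment are not reproduced here.

```python
import numpy as np

def l1_pair_gradient(output, truth):
    """Sum of d|output - truth|/d(output) and d|output|/d(output), element-wise."""
    grad_extract = np.sign(output - truth)   # absolute-error (target extraction) term
    grad_sparse = np.sign(output)            # sparsity (background suppression) term
    return grad_extract + grad_sparse

# Pixels 0-1 are background (truth 0), pixels 2-3 belong to a target (truth 1).
truth = np.array([0.0, 0.0, 1.0, 1.0])
output = np.array([0.2, 0.0, 0.6, 1.3])
grad = l1_pair_gradient(output, truth)
```

The residual background response and the overshooting target pixel both receive a reinforced positive gradient, while the under-estimated target pixel receives zero net gradient from these two terms, leaving it to be driven by the remaining loss terms.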
Meanwhile, by minimizing the SSIM term, we can make the TEM output and the ground truth as close as possible in both the target region and the background region.
For the second part, the classification loss is sensitive to the disappearance of small targets. Once an image with small targets is processed by the TEM into a blank image without any target, the classification loss increases, which changes the gradient and allows the network to escape the current state. In this way, the SCM imposes a semantic constraint on the output of the TEM and alleviates the problem caused by data imbalance.
B. Data Synthesis and Training
According to the previous design, calculating the loss function requires an original image $f_D$, a target image $f_T$, and a label $n$ indicating the number of targets in the original image. These three parts constitute a training tuple $(f_D, f_T, n)$ for calculating the joint loss. We have a large number of background images $f_B$, so the key is to synthesize the image $f_D$ containing several small targets and to assign the label $n$ according to the number of targets.
1) Synthesizing the Image $f_D$: Kim et al. [47] proposed a method for generating infrared images from object models based on black-body radiation, but we find that when the target size is scaled to within 10 pixels, the shape information of the target itself is weak. Compared with synthesizing data using black-body radiation, it is easier to fuse small target images with the background as in [46]. In more detail, we select one of the target images, randomly adjust its brightness, rescale it to a random size, and then use the method proposed in [46] to fuse the target into the background. The algorithm for fusing a target into the background is shown in Algorithm 1, which uses a random factor to adjust the brightness. Some examples of local structures of small targets fused into the background are shown in Figure 11.
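A minimal sketch of such a fusion step is given below. Algorithm 1 is not reproduced in the text, so the pixel-wise maximum rule, the function name, and all numeric values are assumptions chosen for illustration; random rescaling of the target is omitted.

```python
import numpy as np

def fuse_target(background, target, top, left, brightness):
    """Paste a brightness-scaled target patch into a copy of the background.

    Assumed rule: each pixel keeps the larger of the background value and the
    scaled target value, so the target never darkens the scene.
    """
    fused = background.copy()
    h, w = target.shape
    region = fused[top:top + h, left:left + w]
    fused[top:top + h, left:left + w] = np.maximum(region, brightness * target)
    return fused

background = np.full((32, 32), 0.3)
target = np.array([[0.5, 0.5, 0.5],
                   [0.5, 1.0, 0.5],
                   [0.5, 0.5, 0.5]])
fused = fuse_target(background, target, top=10, left=10, brightness=0.8)
```

In a full pipeline the brightness factor and the patch size would be drawn at random for every target, as described in the paragraph above.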
2) Label: In the process of synthesizing the image, we assign the corresponding label according to the number n of small targets successfully fused into the background. We also retain some background images that are not fused with any small target as negative samples, whose label indicates zero targets; the corresponding training tuple then consists of the background image, an empty target image, and this label. When fusing two or more targets into the background, we first generate non-overlapping locations and then fuse different targets into these locations in turn, where the number of targets to add to an image is taken from a predefined range. The process is shown in Algorithm 2, where M is the total number of images contained in the target image set. Some examples of synthesized data are shown in Table III.
Fig. 9. Halo artifacts caused by the $\ell_1$ loss. The first row shows the original images and the second row shows the output images of the TEM.
Fig. 10. Schematic diagram of the qualitative interpretation of the effects of the $\ell_1$ term and the background suppression loss in the target region and the background region (blue and red arrows represent the effects of their respective gradients on the TEM output).
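The step of generating non-overlapping locations can be sketched with simple rejection sampling. Algorithm 2 is not reproduced in the text, so this procedure, the square box size, and the function name are assumptions for illustration.

```python
import random

def sample_locations(height, width, count, size, rng, max_tries=1000):
    """Rejection-sample top-left corners of `count` size x size boxes that do not overlap."""
    boxes = []
    tries = 0
    while len(boxes) < count and tries < max_tries:
        tries += 1
        top = rng.randrange(0, height - size + 1)
        left = rng.randrange(0, width - size + 1)
        # Two equal squares are disjoint iff they are separated along at least one axis.
        if all(abs(top - t) >= size or abs(left - l) >= size for t, l in boxes):
            boxes.append((top, left))
    return boxes

rng = random.Random(0)
locations = sample_locations(128, 128, count=4, size=10, rng=rng)
label = len(locations)   # label = number of targets actually placed
```

Deriving the label from the number of locations actually placed mirrors the rule of labelling by the number of successfully fused targets.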
In the experimental part, there are 3 to 6 small targets on each synthesized image. Adding 3 to 6 targets is mainly based on the following two points. On the one hand, when the number of added targets is too small, the classification network is more sensitive to whether there are small targets in the image than to the number of targets. On the other hand, when too many small targets are added, many of the targets are too close together, which leads to a dense counting problem [48][49] that is difficult to solve with a classification network. Therefore, according to our experiments, adding 3 to 6 small targets to each image achieves better results. The trained classification network achieves 97.5% accuracy in predicting the number of targets in the target images, which is good enough for guiding the TEM training.
3) Training Method: When training, we first take the target image as input and the target-count label as the ground truth output to train the SCM. When the SCM converges, we freeze its weights, then use the synthesized image as the input of TBC-Net and use the corresponding target image and label to train the TEM. The complete training method of TBC-Net is shown in Algorithm 3.
TBC-Net is implemented in PyTorch [50]. During the training phase, a fixed input image size is used with a batch size of 128. Adam [51] is used as the optimizer. The initial learning rate is set to 0.005 and decays to one tenth of the previous learning rate at the 80th and 120th epochs, for a total of 130 training epochs. The training is done on 4 GTX 1080 GPUs, with an average training time of approximately 65 minutes for each network structure. It should be noted that, although a fixed image resolution is used in the experiment, the TEM of TBC-Net does not impose any constraint on the resolution of the input image, since all the layers used can accept input of any resolution. Therefore, TBC-Net can be applied to the detection of infrared small target images of arbitrary resolution.
Fig. 11. Local structures of small targets in synthesis data.
TABLE III EXAMPLES OF SYNTHESIZED DATA (BACKGROUND, SYNTHESIZED IMAGE, TARGET IMAGE)
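The step-wise learning rate schedule described above can be written as a small function. Whether the decay takes effect exactly at zero-based epoch indices 80 and 120 is an assumption; the text does not say which scheduler was used, although the behaviour matches a multi-step schedule such as PyTorch's `MultiStepLR`.

```python
def learning_rate(epoch, base_lr=0.005, milestones=(80, 120), gamma=0.1):
    """Learning rate at a given epoch: multiplied by `gamma` at each milestone reached."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma
    return lr

# One value per training epoch, for the 130 epochs stated in the text.
schedule = [learning_rate(epoch) for epoch in range(130)]
```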
Detailed information on the synthetic training data used to train TBC-Net in the experiment is shown in Table IV.
TABLE IV DETAILS OF SYNTHESIS TRAINING DATA.
A. Experiment Setup
1) Data Sets: Six real sequences taken by drones with infrared sensors are used to validate the effectiveness of TBC-Net. These data cover common scenarios and detection issues for UAV field applications, such as searching for wildlife and trapped people in the woods. Details about the data sets are described in Table V.
2) Baseline Methods: We choose three morphological filtering methods (Top-Hat, Max-Mean, and Max-Median) and five HVS-based methods commonly used in recent years (NWIE, MPCM, HB-MLCM, ILCM, and WLCM) as the baseline methods for experimental comparison. Since [17] does not disclose the network structure, and [15] does not use the network for small infrared target detection, we do not include the deep-learning-based methods of [17] and [15] in the comparison. However, in the following section, we compare the training method of [16] with our proposed training method. The parameter settings for these methods and TBC-Net are shown in Table VI.
3) Quantitative Metrics: The receiver operating characteristic (ROC) curves, signal-to-clutter ratio gain (SCRG), and background suppression factor (BSF) are adopted as quantitative evaluation criteria. The ROC curve reflects the trade-off between the detector's detection probability and false alarms and is an important indicator of the quality of a detector. The definitions of detection probability and false-alarm rate are as follows:
Depending on the threshold T set by the detector, the detection probability and the false-alarm rate change accordingly, thereby drawing a relationship between the detection probability and the false alarm rate.
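The threshold sweep that produces an ROC curve can be sketched as follows. The definitions used here are assumptions, not taken from the paper: detection probability is the fraction of ground-truth targets whose center pixel exceeds the threshold, and false-alarm rate is the fraction of all pixels that exceed the threshold without being a target. The function names and the toy data are hypothetical.

```python
def roc_point(response, target_pixels, threshold):
    """Return (detection probability, false-alarm rate) at one threshold.

    Assumptions (not from the paper):
      * response is a 2D list of values normalized to [0, 1];
      * target_pixels is a list of (row, col) ground-truth target centers;
      * a target is detected if its center pixel exceeds the threshold;
      * every non-target pixel above the threshold is a false alarm.
    """
    targets = set(target_pixels)
    detected = sum(1 for (r, c) in targets if response[r][c] > threshold)
    false_alarms = 0
    total = 0
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            total += 1
            if (r, c) not in targets and value > threshold:
                false_alarms += 1
    p_d = detected / len(targets) if targets else 0.0
    f_a = false_alarms / total
    return p_d, f_a


def roc_curve(response, target_pixels, thresholds):
    """Sweep the threshold and collect (false-alarm rate, detection probability)."""
    points = []
    for t in thresholds:
        p_d, f_a = roc_point(response, target_pixels, t)
        points.append((f_a, p_d))
    return points


# Toy 3x3 response map with one target at (1, 1) and one bright clutter pixel.
toy = [[0.0, 0.1, 0.0],
       [0.2, 0.9, 0.0],
       [0.0, 0.0, 0.6]]
curve = roc_curve(toy, [(1, 1)], thresholds=[0.05, 0.5, 0.8])
```

Raising the threshold removes the clutter responses first and the target last, which is the behavior a good detector should show on the curve.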
SCRG and BSF are defined as follows:

$$\mathrm{SCRG} = \frac{\mathrm{SCR}_{out}}{\mathrm{SCR}_{in}}, \qquad \mathrm{BSF} = \frac{\sigma_{in}}{\sigma_{out}}$$

where $\mathrm{SCR}_{out}$ and $\mathrm{SCR}_{in}$ denote the signal-to-clutter ratio (SCR) of the output and input images. $\mathrm{SCR} = |\mu_t - \mu_b| / \sigma_b$, where $\mu_t$ denotes the pixel value of the target, and $\mu_b$ and $\sigma_b$ denote the mean and standard deviation of the background area around the target, respectively. In this paper, we use the area of 5 pixels around the target as the background. $\sigma_{in}$ and $\sigma_{out}$ denote the standard deviation of the background of the image before and after processing, respectively. Since the standard deviation after background suppression may be 0, we follow [52] and add an adjustment coefficient $\epsilon$ to avoid infinity in the SCR and BSF calculations, and set $\epsilon$ to 0.01 in this paper.
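The SCR, SCRG, and BSF computations described above can be sketched in plain Python. Several details are assumptions made for illustration: the adjustment coefficient is added to the standard deviation in each denominator, the "area of 5 pixels around the target" is read as a square window clipped at the image border, and BSF is computed over the same local window. All function names are hypothetical.

```python
from statistics import mean, pstdev

EPS = 0.01  # adjustment coefficient; 0.01 is the value stated in the text


def local_background(image, row, col, radius=5):
    """Collect pixels within `radius` of (row, col), excluding the target pixel.

    Assumption: the background area is a square window clipped at the border.
    """
    values = []
    for r in range(max(0, row - radius), min(len(image), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(image[0]), col + radius + 1)):
            if (r, c) != (row, col):
                values.append(image[r][c])
    return values


def scr(image, row, col, eps=EPS):
    """Signal-to-clutter ratio |mu_t - mu_b| / (sigma_b + eps) at one target."""
    background = local_background(image, row, col)
    return abs(image[row][col] - mean(background)) / (pstdev(background) + eps)


def scrg(image_in, image_out, row, col):
    """SCR gain: SCR of the output image divided by SCR of the input image."""
    return scr(image_out, row, col) / scr(image_in, row, col)


def bsf(image_in, image_out, row, col, eps=EPS):
    """Background suppression factor sigma_in / (sigma_out + eps)."""
    sigma_in = pstdev(local_background(image_in, row, col))
    sigma_out = pstdev(local_background(image_out, row, col))
    return sigma_in / (sigma_out + eps)
```

With a perfectly flat output background the standard deviation is 0, so the coefficient alone bounds both metrics. Note that `scrg` still divides by the input SCR, which is zero only if the target pixel equals the local background mean.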
For convenience, in the experimental part, we normalize the detection results of different algorithms to the range [0, 1].
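One simple way to bring the outputs of different detectors onto a common scale is min-max scaling, sketched below. The paper does not state which normalization is used, so this choice is an assumption, and the function name `normalize_unit` is hypothetical.

```python
def normalize_unit(response):
    """Min-max scale a 2D response map to [0, 1].

    Assumption: min-max scaling is used. A constant map is mapped to all
    zeros to avoid dividing by zero.
    """
    flat = [v for row in response for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in response]
    return [[(v - lo) / (hi - lo) for v in row] for row in response]
```

After this step, a single threshold range can be swept across every method when plotting ROC curves.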
B. Ablation Study
In this section, we compare the network detection performance under different structures and training methods to show the impact of structural changes on the performance of TBC-Net and the effectiveness of our proposed training method.
1) Structure Comparison: In order to simplify the process of exploration, we set the network hyperparameters that need to be explored according to some practical and empirical principles. On the one hand, according to the application scenario of TBC-Net, the number of parameters should be controlled on the order of magnitude similar to the compact neural network [43][53][54]; that is, the maximum number of parameters should be within 10 million. On the other hand, since the small target size is between 2 and 10 pixels, it is necessary to fuse at least three scales of information to achieve better detection results. Based on the above considerations, we set the BC between 4 and 16, and L between 2 and 5.
By setting different BC and L values, we can get different TBC-Net variants. By testing the ROC curves of different structured TBC-Net variants on real infrared data, we analyze the effects of BC and L on network performance. All of these TBC-Net variants are trained using loss function .
With L fixed, the ROC curves obtained by varying BC are shown in Figure 12a. For both L = 4 and L = 5, the detection performance of TBC-Net improves as BC increases, but the network is more sensitive to changes in BC when L = 4. When L = 5, the difference between BC = 8 and BC = 16 is not obvious.
With BC fixed, varying L gives the ROC curves shown in Figure 12b. Similar to Figure 12a, when BC = 8, the network is more sensitive to changes in L; when BC = 16, the network performance is very close for all values of L except L = 2.
Structure analysis shows that when one of BC and L is small, the network performance is very sensitive to changes in the other. When BC = 16 and L = 5, the network performance is good enough.
2) Training Method Comparison: Six loss functions are designed to compare the detection performance of the network trained under different loss function combinations. One of them is widely used in many natural image segmentation and denoising tasks and serves as the baseline training method. The other combinations are used to demonstrate the effectiveness of our proposed method. We set BC = 16, L = 5 to reduce the exploration space.
We take an image in data set 2 as an example to illustrate the difference in the target images obtained under different loss functions, as shown in Figure 13. Data set 2 has a large amount of clutter and noise in the background, and 13b and 13c both miss the target: although the corresponding networks strongly suppress the background, they fail to enhance the target. The network of 13d can learn the features of the small target, but there are some halo artifacts around the small target. The networks of 13e and 13f can enhance the target mixed in clutter, but 13e contains many halo artifacts, and 13f is better than 13e but still has a lot of noise in the background. The network output 13g not only suppresses clutter and noise well, but also enhances the target. Besides, we can see in 13e that even if only the loss of that panel is used, the network can still learn certain target features, which shows that the SCM does have a guiding effect on the TEM during the training phase.
TABLE V DETAILS OF SIX INFRARED SMALL TARGET IMAGE DATA SETS
TABLE VI PARAMETER SETTINGS OF DIFFERENT METHODS
Fig. 12. The receiver operating characteristic (ROC) curves of different TBC-Net variants (a) Keep L and change BC (b) Keep BC and change L
We use all the images of data sets 1 to 6 to test the networks trained under the different loss functions, and plot the ROC curves as shown in Figure 14. It can be seen that the detection performance under training is the best, while the networks obtained under the training of other loss functions have serious missed detection.
In addition, the training loss curve of TBC-Net is shown in Figure 15, which plots the total loss together with its three component losses. In Figure 15, the loss functions do not decay to 0 as quickly as in Figure 2, but instead approach 0 during continuous adjustment. In fact, although one of the losses does not always approach 0, under the combined effect of several losses, the target extraction and detection is better than that of the network obtained by overfitting the background. We use the network parameters saved at different epochs of this training run to test on the real data, and use an image in data set 3 as an example to show how the small target detection effect of the network changes during training. The results are shown in Figure 16. It can be seen that at the beginning, TBC-Net is unable to distinguish the position of the small target, but as training progresses, TBC-Net learns the features of the small target more and more clearly, and it also better suppresses the cluttered background.
Fig. 13. Examples of output images obtained by networks trained under different loss functions; panels (b) to (g) show the outputs for the six loss functions.
Fig. 14. ROC curve of TBC-Net (BC = 16, L = 5) trained under different loss functions tested on all data sets.
The comparative analysis of the loss functions shows that one component of the loss helps to overcome the difficulty of learning small target features caused by the imbalance between the target and background data, while another has a better effect on suppressing background and artifacts.
Fig. 15. Loss curve of training TBC-Net.
C. Qualitative Evaluation with Baseline Methods
We select an image from each of the six data sets for qualitative evaluation. The original images and the 3D grayscale distributions of the images processed by different algorithms are shown in Figure 17a to 17f, respectively. The red rectangle shows an enlargement of the target region.
For data set 1, both MPCM and TBC-Net significantly enhance the target in the forest, but the background suppression of MPCM is worse than that of TBC-Net, and other methods respond more strongly to the building than to the target. For data set 2, ILCM, NWIE, MPCM, HB-MLCM, and TBC-Net all enhance the target in the woods, but TBC-Net shows greater suppression of the background. For data set 3, in addition to MPCM, other methods can enhance the target running to the woods, and TBC-Net can still suppress the interference of the woods well. For data set 4, there is interference at the edge of the road, and although all methods can enhance both targets, TBC-Net is less sensitive to road edges than other methods. For data set 5, most methods are disturbed by the noise of the forest edge and light, which affects the detection of the target, whereas TBC-Net can better suppress the background noise and enhance the target. For data set 6, all methods enhance the target, but NWIE is disturbed by the woods, MPCM produces a strong response on both sides of the road, and TBC-Net can better suppress the interference caused by trees and roads.
D. Quantitative Evaluation with Baseline Methods
Figure 18 shows the ROC curves achieved by different algorithms on the 6 data sets. It can be seen that, owing to its better background suppression, TBC-Net achieves a higher detection probability at a lower false alarm rate. Especially on data sets 1, 2, 4, and 6, where there is a lot of interference from branches, roads, and light noise in the background, TBC-Net is significantly better than the other detection methods. This also indicates that TBC-Net, trained on large-size images rather than local features, suppresses complicated background interference better than the traditional methods, because the network not only learns the local features of the target but also captures wide-range and high-level information from the entire image.
Tables VII and VIII show the BSF and SCRG for different algorithms, respectively. For both SCRG and BSF, the higher the value, the better the detection performance.
TBC-Net has the highest BSF on all data sets except data set 5. On data set 5, the background generated by MPCM is quite flat, so its BSF is higher than that of TBC-Net, but MPCM incorrectly enhances the background.
For SCRG, TBC-Net achieves the best results on all data sets. Therefore, considering background suppression and target enhancement together, the performance of TBC-Net is significantly better than that of the other algorithms.
Fig. 16. Output examples of TBC-Net at different training epochs.
Fig. 17. Original images of data set 1-6 and 3D gray distributions output by different algorithms. (a) Data set 1 (b) Data set 2 (c) Data set 3 (d) Data set 4 (e) Data set 5 (f) Data set 6
TABLE VII BSF VALUES OF INFRARED DATA SETS 1-6 PROCESSED BY DIFFERENT METHODS
Fig. 18. The receiver operating characteristic (ROC) curves of baseline methods and TBC-Net for six data sets. (a) Data set 1. (b) Data set 2. (c) Data set 3. (d) Data set 4 (e) Data set 5 (f) Data set 6
TABLE VIII SCRG VALUES OF INFRARED DATA SETS 1-6 PROCESSED BY DIFFERENT METHODS
E. Performance Evaluation
1) Performance Comparison with Other Methods: The NVIDIA Jetson AGX Xavier development board is an embedded AI accelerator widely used in robotics and autonomous driving. In addition to running TBC-Net on the CPU and GPU, we deploy it on the Jetson AGX Xavier development board to evaluate its performance on mobile devices.
For the baseline methods, we only measure the execution time on the CPU. For TBC-Net, we measure its execution time on the CPU, the GPU, and the AGX Xavier development board. The test PC has a 3.60-GHz Intel i9-9900K CPU and 64.0 GB of memory, the GPU is an NVIDIA RTX 2080Ti, and the development board is the Jetson AGX Xavier. All traditional methods are implemented in MATLAB 2018b, while TBC-Net is implemented using PyTorch 1.0.1.
We first read all image data into memory and only measure the time from image input to processing completion, so that data preparation time is excluded. Table IX shows the average single-frame processing time of the baseline algorithms on the CPU and of TBC-Net on the CPU, GPU, and Jetson AGX Xavier, tested on all sequences; we use C, G, and B to distinguish the TBC-Net execution time on the CPU, GPU, and development board. The Jetson AGX Xavier supports three power modes (10 W, 15 W, and 30 W) to suit different applications. We use the 30 W power mode here.
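The measurement protocol above, with frames preloaded and only the processing loop timed, can be sketched as follows. This is an illustration rather than the authors' benchmark code: the function name `average_frame_time`, the untimed warm-up call, and the stand-in detector are all hypothetical.

```python
import time


def average_frame_time(process, frames, warmup=1):
    """Average per-frame processing time in seconds.

    `frames` must already be in memory so that loading is not timed.
    The `warmup` untimed calls are an assumption added here to exclude
    one-off initialization costs; the paper does not mention them.
    """
    for frame in frames[:warmup]:
        process(frame)
    start = time.perf_counter()
    for frame in frames:
        process(frame)
    elapsed = time.perf_counter() - start
    return elapsed / len(frames)


# Hypothetical stand-in for a detector: sums the pixel values of a frame.
frames = [[[i] * 8 for _ in range(8)] for i in range(4)]
seconds_per_frame = average_frame_time(lambda f: sum(map(sum, f)), frames)
```

On a GPU, kernel launches are asynchronous, so a synchronization call (for example `torch.cuda.synchronize()` in PyTorch) is typically needed before reading the clock; whether the authors did this is not stated.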
The authors of the above methods did not disclose their source code, so we reproduce these algorithms according to the original papers. The execution efficiency of our reimplementations may therefore fall short of, or differ from, the efficiency reported in the original papers.
In terms of CPU running time, although TBC-Net is slower than algorithms such as Top-Hat, HB-MLCM, ILCM, and MPCM, its detection performance is still significantly better than that of these algorithms. Moreover, from a more practical perspective, TBC-Net runs efficiently on embedded devices with neural network acceleration hardware, whereas the traditional algorithms offer less room for parallel optimization on such devices.
TABLE IX TIME CONSUMPTION OF DIFFERENT ALGORITHMS.
2) Performance Comparison under Different Power Modes: Since most deep learning accelerators are designed for high-throughput operation, processing data in batches is faster than single-frame processing, so we also test the average single-frame processing time of TBC-Net at different batch sizes. Batch processing matters because, in some embedded scenarios, one accelerator may be required to process video images acquired by multiple sensors simultaneously. The test results are shown in Table X. With batch processing, TBC-Net achieves frame rates of up to 50, 91, and 175 frames per second in the 10 W, 15 W, and 30 W power modes, respectively, indicating that it is suitable for fast target detection in embedded low-power scenarios.
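The relation between batch latency, frame rate, and average single-frame time is simple arithmetic, sketched below. The frame rates 50, 91, and 175 are the ones quoted in the text; the helper names are hypothetical.

```python
def frames_per_second(batch_time_s, batch_size):
    """Throughput in frames per second for one batch processed in batch_time_s."""
    return batch_size / batch_time_s


def per_frame_time_ms(fps):
    """Average single-frame time in milliseconds implied by a frame rate."""
    return 1000.0 / fps


# Frame rates quoted in the text for the 10 W, 15 W, and 30 W power modes.
implied_ms = {mode: per_frame_time_ms(fps)
              for mode, fps in (("10W", 50), ("15W", 91), ("30W", 175))}
```

Under this reading, the quoted rates correspond to roughly 20 ms, 11 ms, and 5.7 ms per frame in the three power modes.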
In this paper, we presented a novel lightweight convolutional neural network, TBC-Net, which includes a target extraction module and a semantic constraint module for infrared small target detection. With the help of the semantic constraint module, the joint loss function, synthetic data, and the corresponding training method, we solved the problem caused by the extreme imbalance between targets and background when using a CNN to learn small target features on large-size images. The input image is processed by the target extraction module to directly obtain the target image, thereby achieving end-to-end small target detection. Moreover, the analysis of storage space and computational complexity shows that TBC-Net is well suited for deployment in embedded systems that support neural network acceleration. The experimental results show that TBC-Net trained on large-size images can better suppress complex background interference than detection algorithms that use only local features. Besides, TBC-Net realizes real-time detection on the NVIDIA Jetson AGX Xavier development board. These advantages give TBC-Net great potential in applications such as infrared detection and search using drones.
[1] P. Rudol and P. Doherty, “Human body detection and geolocalization for uav search and rescue missions using color and thermal imagery,” in 2008 IEEE aerospace conference. IEEE, 2008, pp. 1–8.
[2] S. Karma, E. Zorba, G. Pallis, G. Statheropoulos, I. Balta, K. Mikedi, J. Vamvakari, A. Pappa, M. Chalaris, G. Xanthopoulos et al., “Use of unmanned vehicles in search and rescue operations in forest fires: Advantages and limitations observed in a field trial,” International journal of disaster risk reduction, vol. 13, pp. 307–312, 2015.
[3] M. Silvagni, A. Tonoli, E. Zenerino, and M. Chiaberge, “Multipurpose uav for search and rescue operations in mountain avalanche events,” Geomatics, Natural Hazards and Risk, vol. 8, no. 1, pp. 18–33, 2017.
[4] Y. Qin, L. Bruzzone, C. Gao, and B. Li, “Infrared small target detection based on facet kernel and random walker,” IEEE Transactions on Geoscience and Remote Sensing, 2019.
[5] T. Soni, J. R. Zeidler, and W. H. Ku, “Performance evaluation of 2-d adaptive prediction filters for detection of small objects in image data,” IEEE Transactions on Image processing, vol. 2, no. 3, pp. 327–340, 1993.
[6] J.-F. Rivest and R. Fortin, “Detection of dim targets in digital infrared imagery by morphological image processing,” Optical Engineering, vol. 35, 1996.
[7] X. Bai and F. Zhou, “Analysis of new top-hat transformation and the application for infrared dim small target detection,” Pattern Recognition, vol. 43, no. 6, pp. 2145–2156, 2010.
[8] C. P. Chen, H. Li, Y. Wei, T. Xia, and Y. Y. Tang, “A local contrast method for small infrared target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 1, pp. 574–581, 2013.
[9] J. Han, Y. Ma, B. Zhou, F. Fan, K. Liang, and Y. Fang, “A robust infrared small target detection algorithm based on human visual system,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 12, pp. 2168–2172, 2014.
[10] Y. Qin and B. Li, “Effective infrared small target detection utilizing a novel local contrast method,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 12, pp. 1890–1894, 2016.
[11] J. Liu, Z. He, Z. Chen, and L. Shao, “Tiny and dim infrared target detection based on weighted local contrast,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 11, pp. 1780–1784, 2018.
[12] H. Deng, X. Sun, M. Liu, C. Ye, and X. Zhou, “Infrared small-target detection using multiscale gray difference weighted image entropy,” IEEE Transactions on Aerospace and Electronic Systems, vol. 52, no. 1, pp. 60–72, 2016.
[13] Y. Wei, X. You, and H. Li, “Multiscale patch-based contrast measure for small infrared target detection,” Pattern Recognition, vol. 58, pp. 216–226, 2016.
[14] Y. Shi, Y. Wei, H. Yao, D. Pan, and G. Xiao, “High-boost-based multiscale local contrast measure for infrared small target detection,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 1, pp. 33–37, 2017.
[15] Z. Fan, D. Bi, L. Xiong, S. Ma, L. He, and W. Ding, “Dim infrared image enhancement based on convolutional neural network,” Neurocomputing, vol. 272, pp. 396–404, 2018.
[16] L. Liangkui, W. Shaoyou, and T. Zhongxing, “Using deep learning to detect small targets in infrared oversampling images,” Journal of Systems Engineering and Electronics, vol. 29, no. 5, pp. 947–952, 2018.
[17] W. Wang, H. Qin, W. Cheng, C. Wang, H. Leng, and H. Zhou, “Small target detection in infrared image using convolutional neural networks,” in AOPC 2017: Optical Sensing and Imaging Technology and Applications, vol. 10462. International Society for Optics and Photonics, 2017, p. 1046250.
[18] J. Gao, Y. Guo, Z. Lin, W. An, and J. Li, “Robust infrared small target detection using multiscale gray and variance difference measures,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 12, pp. 5039–5052, 2018.
[19] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[22] Y. Xue, T. Xu, H. Zhang, L. R. Long, and X. Huang, “Segan: Adversarial network with multi-scale L1 loss for medical image segmentation,” Neuroinformatics, vol. 16, no. 3-4, pp. 383–392, 2018.
TABLE X TBC-NET INFERENCE TIME WITH DIFFERENT POWER MODE AND IMAGE BATCH SIZE.
[23] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[25] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
[26] R. Mohan, “Deep deconvolutional networks for scene parsing,” arXiv preprint arXiv:1411.4101, 2014.
[27] S. Saito, T. Li, and H. Li, “Real-time facial segmentation and performance capture from rgb input,” in European Conference on Computer Vision. Springer, 2016, pp. 244–261.
[28] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[29] F. Yu, V. Koltun, and T. Funkhouser, “Dilated residual networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 472–480.
[30] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Efficient piecewise training of deep structured models for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3194–3203.
[31] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[33] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[34] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[35] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[36] D. Liu, B. Wen, X. Liu, Z. Wang, and T. S. Huang, “When image denoising meets high-level vision tasks: A deep learning approach,” arXiv preprint arXiv:1706.04284, 2017.
[37] S. Wang, B. Wen, J. Wu, D. Tao, and Z. Wang, “Segmentation-aware image denoising without knowing true segmentation,” arXiv preprint arXiv:1905.08965, 2019.
[38] Y. Gu, C. Wang, B. Liu, and Y. Zhang, “A kernel-based nonparametric regression method for clutter removal in infrared small-target detection applications,” IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 3, pp. 469–473, 2010.
[39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
[40] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[41] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
[42] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[43] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[44] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–57, 2016.
[45] K. R. Prabhakar, V. S. Srikar, and R. V. Babu, “Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs.” in ICCV, 2017, pp. 4724–4732.
[46] C. Gao, D. Meng, Y. Yang, Y. Wang, X. Zhou, and A. G. Hauptmann, “Infrared patch-image model for small target detection in a single image,” IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4996–5009, 2013.
[47] Y.-C. Kim, T.-W. Bae, H.-J. Kwon, B.-I. Kim, and S.-H. Ahn, “Infrared (ir) image synthesis method of ir real background and modeled ir target,” Infrared Physics & Technology, vol. 63, pp. 54–61, 2014.
[48] L. Boominathan, S. S. Kruthiventi, and R. V. Babu, “Crowdnet: A deep convolutional network for dense crowd counting,” in Proceedings of the 24th ACM international conference on Multimedia. ACM, 2016, pp. 640–644.
[49] V. Ranjan, H. Le, and M. Hoai, “Iterative crowd counting,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 270–285.
[50] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
[51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[52] X. Bai and Y. Bi, “Derivative entropy-based contrast measure for infrared small-target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2452–2466, 2018.
[53] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[54] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size,” arXiv preprint arXiv:1602.07360, 2016.