and restricts the application and popularization of traditional image fusion technology, whereas human beings show strong robustness in many computer vision tasks, such as object detection, object recognition and image captioning. We therefore believe that human beings are also highly robust in the cross-modal image fusion task. From the perspective of cognitive psychology, the human visual perception system selects information when perceiving external stimuli, and the human brain fuses perceptual information non-linearly [1, 2, 3, 4]. Visual attention models based on feature selection, and convolutional neural networks based on the non-linear characteristics studied in brain neuroscience, have achieved remarkable results in many computer vision fields. We therefore also believe that these two characteristics of human visual perception can improve the robustness of the cross-modal image fusion task, as verified in Sect. 4.
based on human visual perception characteristics. Examples include a multi-scale decomposition method based on the sensitivity of human eyes to different brightness regions [5], a convolutional neural network method inspired by neurobiology [6], and a saliency method based on the human visual attention mechanism [7]. Among them, multi-scale decomposition focuses on hierarchical feature extraction from images, the convolutional neural network method focuses on learning image characteristics in a data-driven way, and the visual saliency method focuses on extracting saliency feature maps or salient objects. 1) In fusion criteria, the above methods generally use the weighted average, the maximum or the principal component method [8], and few address non-linear feature fusion. 2) In feature selection, more emphasis is placed on extracting and selecting features of the original images in the early stage, and an effective selection of the fused features is lacking. However, the human visual fusion perception system is a highly complex non-linear system. This complexity is reflected not only in feature extraction but also in how the human brain fuses image information [1, 3, 4]. In the cross-modal image fusion task, the human brain filters the perceived target features based on subjective intention, ignores uncertain signals, and fuses non-mutually exclusive features according to prior knowledge. To bring the results of cross-modal image fusion more in line with the human visual perception system and to narrow the gap between the two, we propose a non-linear and selective fusion of cross-modal images based on cognitive psychology theory. Figure 1 gives a representative example that demonstrates the advantage of our method over existing mainstream algorithms.
The example is a thermal infrared and visible image pair taken from the traffic dataset in FLIR [9]. In the fused images, a strong boundary effect is clearly visible around the glare. Current mainstream image fusion algorithms cannot effectively remove glare. Our framework removes glare regions effectively by introducing non-linear illuminance influence factors, and its fused images have higher clarity.
Fig. 1. Schematic illustration of image fusion. From left to right: IR image [9], visible image [9], JSR [10], OURS, WLS [11], DDLatLRR [12], LATLRR [13], ZCA [14], CNN [15], CVT [16], DL [17], DTCWT [18], FusionGAN [19], GF [20], GTF [21], LP [22], FEZ [23], CBF [24], CSR [25], JSRSD [26], LPSR [27], MSVD [28], RP [29], Wavelet [30]. Our method fuses highlight regions well, and its result is more consistent with the human visual perception mechanism.
image features. Our proposed framework combines high- and low-frequency information, visual saliency information, deep learning features and illuminance information of the original image. The illumination information is used as the non-linear fusion factor of the image fusion to simulate the non-linear fusion characteristics of the human visual system. The attention network is used to simulate how the human eye selects fusion features. The main contributions of our work include the following three points:
The remainder of this paper is structured as follows. Sect. 2 reviews the relevant theoretical background. Sect. 3 presents the non-linear and selective fusion of cross-modal images. Sect. 4 introduces the experimental datasets, evaluation metrics, and implementation details. Sect. 5 presents a discussion and explanation. Sect. 6 concludes the paper.
vision system, feature selection characteristics and cross-modal image fusion, so this section will review the existing work from these three aspects.
2.1. Non-linear fusion characteristic
man visual perception system indicate that human perception of external information depends more on the brightness difference between object and background. Human beings have a self-adaptive brightness adjustment function in highlight areas, and human eyes cannot detect distortion below the just noticeable distortion (JND) [1]. This adaptive brightness adjustment is one of the non-linear characteristics of the human visual system. In addition, the human brain is a highly complex non-linear system [31]; its processing of information is not a simple weighted average but involves more non-linear processing [32]. Inspired by these characteristics, we propose a cross-modal image fusion framework in which the fusion rules of the basic level and the detail level of the image are no longer based on the traditional methods (weighted average, maximum, sum, etc.) [8], but use the illumination factor as a non-linear fusion factor. Introducing this non-linear fusion factor simulates the non-linear fusion characteristics of human visual perception.
2.2. Feature selection characteristic
Zohary et al. [33] pointed out that physiological evidence shows that visual cortex cells are selective in several perceptual dimensions at the same time, which enables people to select features. Kubovy et al. likewise indicated that the brain selects specific "features" of the stimulus according to the object of interest, such as orientation, spatial frequency or the moving direction of a brightness edge. Inspired by this, researchers have made relevant achievements in many computer vision fields. Hu et al. [34] proposed a channel attention network for feature selection. Zhang et al. [35] evaluated the performance of a channel attention module for residual networks in image super-resolution. Jun et al. [36] proposed a dual attention network for image segmentation, which adds a spatial attention module on top of channel attention. All the above methods are based on the characteristics of the human visual perception system. The inherent deduction mechanism of the human visual perception system indicates that the human visual system deduces content according to prior knowledge in the brain and discards uncertain information. Inspired by this, we use a channel attention network [35] to simulate the feature selection characteristics of the human visual perception system in cross-modal image fusion tasks. The attention network learns the complex, non-mutually exclusive, non-linear relationships between different features, and assigns different weight coefficients to features that receive different degrees of attention.
2.3. Image fusion
sion algorithm and deep learning method. We will review the representative algorithms in these fusion algorithms.
The existing image fusion algorithms have the following problems. Firstly, the effective feature extraction of images is insufficient. Most algorithms fuse features directly after extracting them, without a secondary selection of features [37, 25, 17, 23, 38]. Secondly, the fusion rule is simple. Most algorithms combine the extracted features by a simple weighted average, by selecting the maximum weight, or by extracting the principal components (PCA) [8]. The non-linear relationship between features is not fully considered, which is not in line with the human visual perception mechanism. Finally, there is still a large gap between the cross-modal fusion results of mainstream algorithms and those of human visual fusion. To overcome these problems, we propose a non-linear and selective fusion of cross-modal images. Our method improves the quality of image fusion by simulating human visual characteristics. Based on the selection characteristics of human visual perception, an attention module is introduced to select the fused image features. Inspired by the brightness and contrast sensitivity of human visual perception, we use image illumination information to simulate how human eyes combine different features non-linearly, and we establish the non-linear relationship in image fusion from the illuminance information of the image. By simulating human visual perception characteristics, the image fusion quality becomes more in line with human subjective evaluation.
following four steps. Firstly, image decomposition is performed to obtain the image base layer and detail layer. Secondly, the image illuminance is modeled, and the non-linear fusion coefficient of the image fusion is obtained. Thirdly, the obtained weight map is combined with the illuminance information fusion factor for feature fusion. Finally, the fused feature map is selected by the channel attention module to obtain the final fused image.
Fig. 2. General block diagram of our framework. The yellow box represents the multi-scale decomposition of the image. The red box indicates the saliency detection module. The blue box indicates the illumination modeling module. The gray box represents the deep feature weight extraction module. The green box represents the feature selection module. M, S and L indicate the multi-scale decomposition operation, the non-linear fusion functions and highlight block detection, respectively.
3.1. Multi-scale image decomposition
puter vision, and has achieved great results in feature extraction. According to human visual perception theory, human eyes have different sensitivity to different regions of degraded images. Therefore, in the cross-modal image fusion task, we need to decompose the image at different levels. This decomposition effectively avoids the ringing effect caused by mixing high and low frequencies during image processing. In our image fusion framework, we use the two-scale decomposition method proposed in [42], which has better real-time performance than existing multi-scale decomposition methods.
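The two-scale split into a base layer and a detail layer can be sketched as follows. This is a minimal illustration, assuming the base layer is obtained with a mean (box) filter and the detail layer is the residual; the filter choice, the window size and the function names `box_filter` and `two_scale_decompose` are assumptions made for this sketch and are not taken from [42].

```python
import numpy as np


def box_filter(image, size=31):
    """Mean-filter a 2-D image with an odd size x size window (edges replicated)."""
    pad = size // 2
    padded = np.pad(image.astype(np.float64), pad, mode="edge")
    # Integral image with a leading row/column of zeros, used for fast window sums.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = image.shape
    window_sum = (integral[size:size + h, size:size + w]
                  - integral[:h, size:size + w]
                  - integral[size:size + h, :w]
                  + integral[:h, :w])
    return window_sum / (size * size)


def two_scale_decompose(image, size=31):
    """Return (base, detail): a low-frequency base layer and its residual."""
    base = box_filter(image, size)
    detail = image.astype(np.float64) - base
    return base, detail
```

By construction the two layers sum back to the input, so any information removed from the base layer is kept in the detail layer and can be fused with a separate rule.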
3.2. Visual saliency detection
detection method based on human visual perception theory has been widely used in the field of computer vision [17]. In image fusion tasks, bottom-up and top-down saliency models are usually used, both of which rely on the high contrast of a pixel relative to its surroundings. At present, cross-modal image fusion methods based on saliency detection fall mainly into two categories: one calculates the saliency weight map corresponding to the original image [23, 42], and the other extracts salient objects [18] based on saliency analysis. In this paper, we adopt the bottom-up saliency model proposed in [42], which has lower computational complexity than other algorithms. Note that the saliency detection method we use does not detect objects; it detects brightness, contrast, edges and other image attributes.
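A bottom-up saliency weight map of this kind can be sketched as below. The sketch assumes one simple contrast-based formulation, the absolute difference between a mean-filtered and a median-filtered image, followed by per-pixel normalization across the source images. The formulation, the window sizes and the names `saliency_map` and `saliency_weights` are assumptions for illustration and may differ from the exact model in [42].

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_windows(image, size):
    """Return every size x size neighbourhood of a 2-D image (edges replicated)."""
    pad = size // 2
    padded = np.pad(image.astype(np.float64), pad, mode="edge")
    return sliding_window_view(padded, (size, size))


def saliency_map(image, mean_size=5, median_size=3):
    """Assumed saliency: |mean-filtered image - median-filtered image|."""
    mean_out = local_windows(image, mean_size).mean(axis=(-2, -1))
    median_out = np.median(local_windows(image, median_size), axis=(-2, -1))
    return np.abs(mean_out - median_out)


def saliency_weights(images, eps=1e-12):
    """Normalize the saliency maps so the weights sum to 1 at every pixel."""
    maps = np.stack([saliency_map(img) for img in images]) + eps
    return maps / maps.sum(axis=0)
```

The small `eps` keeps the weights well defined in flat regions where every saliency map is zero; there, all sources receive equal weight.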
3.3. Illumination factor modeling
ence of weather, light and motion. Image quality degradation is due to the loss of high-frequency information in the image, and the regions of information loss often appear as highlights or dark regions. Figure 3 illustrates the information loss caused by car lights in visible images, together with our visual modeling analysis of the light field.
the form of a parabolic cross section. Extensive validation on the FLIR dataset shows that highlights caused by circular light sources are a universal problem. However, considering the diversity of light sources in nature (rectangular, elliptical and irregular shapes), we cannot use a fixed illumination model for illumination modeling when doing image highlight removal.
Fig. 3. Visualization analysis of high optical density images. From left to right: visible image, abnormal highlight image block, abnormal 3D optical density map, abnormal average optical density curve R1(x, y), normal image block, normal 3D optical density map, normal average optical density curve R2(x, y).
We need to establish an illuminance model based on image illuminance, and dynamically adjust the image according to different illuminance models. For the image fusion task, we evaluated the current mainstream image fusion algorithms; the results are shown in Figure 1. From Figure 1, we can clearly see that the current mainstream image fusion algorithms do not consider the image highlight problem. As a result, highlight blocks cannot be effectively removed from the fused image, and there is an obvious boundary effect, which seriously affects the quality of image fusion. We therefore propose to introduce illumination factors in image fusion to effectively eliminate the influence of highlights on image fusion quality.
mainly composed of ambient light, diffuse reflection light, specular reflection light, and reflection of the object's own light source [1]. To simplify the model, we assume that the image is composed of two parts: the incident image and the reflected image. At present, incident image estimation is mainly based on low-frequency theory, which assumes that illumination varies slowly and that the illumination image therefore consists mainly of low-frequency components. In various computer vision tasks, the illumination image is generally estimated on this basis. However, this approach has an obvious drawback: it ignores the non-smooth nature of the illumination itself, which is especially prominent at the edges of the illuminated region. Our method takes these problems into account, and the specific steps are as follows.
(iii) Modeling of light intensity density function Rj(x, y). It can be expressed as n
3.4. Image fusion
image, and the image texture information is seriously lost. The AB and CD segments represent the transition stage from the highlight image to the surrounding image, and the light intensity curve in this area is non-linear under the influence of the highlight. From A to B and from D to C, the fusion weight of the highlight block image gradually decreases. When it reaches the BC segment, the fusion weight of the highlight block image reaches its minimum value. Therefore, in the process of image fusion, we cannot simply detect the BD image highlight block for image fusion; otherwise a serious fusion boundary phenomenon will appear. In our algorithm, these problems are effectively overcome by non-linear modeling. After obtaining the non-linear illumination factor, we calculate the basic level fusion B, the detail level fusion D, and the final image fusion F. The specific definitions are as follows:
Fig. 4. Light intensity curve.
where W denotes the saliency weight map; i indexes the i-th image; the superscripts b and d denote the base layer and the detail layer, respectively; x and y denote pixel coordinates; and C denotes a pixel normalization constant.
non-linear illumination coefficient. Because the non-linear fusion factor needs to lie between 0 and 1, we use the sigmoid as the activation function. The weighted average fusion criterion and the maximum fusion criterion used at the basic level and detail level by current mainstream algorithms are special cases of our algorithm. When the illumination factor is 0.5, our rule reduces to the weighted average used in current mainstream fusion methods, and when the illumination factor takes its maximum value, it reduces to the maximum fusion criterion.
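The role of the sigmoid-bounded fusion factor can be illustrated with a small sketch. It assumes a two-image convex combination in which the factor weights the visible layer and falls smoothly as the local illuminance approaches a highlight level. The mapping in `illumination_factor`, its `threshold` and `steepness` parameters, and the function `fuse_layer` are hypothetical and only illustrate the special cases described above; they are not the paper's fusion equations.

```python
import numpy as np


def sigmoid(x):
    """Logistic function; keeps the fusion factor strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))


def illumination_factor(illuminance, threshold=0.8, steepness=20.0):
    """Hypothetical mapping from normalized illuminance in [0, 1] to a factor.

    The factor is 0.5 at the threshold and decreases in brighter regions,
    so highlight pixels of the visible image contribute less.
    """
    return sigmoid(steepness * (threshold - illuminance))


def fuse_layer(visible_layer, infrared_layer, factor):
    """Convex combination of two layers, weighted by the fusion factor."""
    return factor * visible_layer + (1.0 - factor) * infrared_layer
```

With `factor` fixed at 0.5 everywhere, `fuse_layer` is the plain average of the two layers; as the factor approaches 1 or 0, the result approaches a hard selection of one source.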
3.5. Feature selection
after previous fusion are W C. As shown in Equation 4, the global average pooling (GP) operation is performed on the feature map T to obtain the global receptive field corresponding to the feature map, so that the network can set aside the spatial relationship within each channel and focus on learning the non-linear relationship between different feature channels. The output CAM after feature selection is defined as [39, 35]:
where Tk(x, y) represents the pixel value at coordinates (x, y) of the k-th channel. After the global average pooling layer, we obtain the output of the attention module through a convolution, a ReLU activation function, a second convolution, a Sigmoid activation function, and a dot product operation. S and R represent the Sigmoid and ReLU activation functions, respectively; W1 and W2 represent the weights of the two convolutions; and FGP denotes the output of the GP operation applied to the input.
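The sequence of operations described above (global average pooling, convolution, ReLU, convolution, Sigmoid, channel-wise multiplication) can be sketched in NumPy as follows. The sketch assumes that both convolutions are 1x1, so that they act on the pooled vector as matrix products, and that biases are omitted; the function name `channel_attention` and the weight shapes are illustrative assumptions.

```python
import numpy as np


def channel_attention(features, w1, w2):
    """Rescale each channel of a (C, H, W) feature map T by a learned gate.

    w1 has shape (C_reduced, C) and w2 has shape (C, C_reduced); both stand in
    for 1x1 convolutions applied to the pooled channel descriptor (assumed).
    """
    pooled = features.mean(axis=(1, 2))            # F_GP: one value per channel
    hidden = np.maximum(w1 @ pooled, 0.0)          # R(W1 F_GP), ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # S(W2 R(W1 F_GP)), Sigmoid
    return features * gate[:, None, None]          # channel-wise product with T
```

Because each gate value lies strictly between 0 and 1, the module can only attenuate channels relative to one another; which channels are kept is determined by the learned weights.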
sult produced along with relevant explanations and discussions are presented.
4.1. Experimental setup
sented. Finally, implementation details of evaluated methods are introduced.
4.1.1. Datasets
spectral (enhanced vision, near infrared and long wave infrared or thermal) night images of different military related scenes, registered in different multi-band camera systems. There are 21 image pairs commonly used in existing image fusion algorithms.
is provided for the training and verification of neural networks for target detection. The dataset was acquired with an RGB camera and a thermal imaging camera installed on a vehicle. It contains 14452 annotated thermal images, of which 10228 are from short videos and 4224 are from a 144-second video. Unfortunately, the image pairs are not registered.
images. The relevant images are already registered.
image fusion (VIF). It aims to provide a fair and comprehensive performance comparison platform for VIF algorithms. At present, VIFB integrates 21 image pairs, 20 fusion algorithms and 13 evaluation indexes, which can be used for performance comparison. Fortunately, fused images are provided for the 20 algorithms on the 21 image pairs. Unfortunately, no specific code has been released.
Dataset | Scene | Challenge | Modality | Registration | Matching pairs
TNO [45] | Military | Illumination, noise | Infrared and visible | |
FLIR [9] | Highway | Illumination, noise | Infrared and visible | |
evaluate the robustness of our framework, we performed experimental evaluations on different image fusion task data sets.
4.1.2. Metrics
entropy (EN) [48], average gradient (AG) [49], structural similarity (SSIM) [50], mutual information (MI) [51], visual information fidelity (VIF) [52], information fidelity criterion (IFC) [53].
(iii) SSIM [50] denotes structural similarity. The image quality is evaluated from
(vi) VIF [52] represents visual information fidelity. The image quality is evaluated by
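Two of the no-reference metrics listed above, entropy (EN) and average gradient (AG), are simple enough to sketch directly. The code below assumes 8-bit grayscale input and the commonly used textbook definitions (Shannon entropy of the gray-level histogram, and the mean of the root mean square of horizontal and vertical forward differences); the exact formulas in [48] and [49] may differ in normalization.

```python
import numpy as np


def entropy(image, levels=256):
    """Shannon entropy (in bits) of the gray-level histogram of an 8-bit image."""
    hist = np.bincount(image.astype(np.uint8).ravel(), minlength=levels)
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())


def average_gradient(image):
    """Mean over pixels of sqrt((gx^2 + gy^2) / 2) using forward differences."""
    img = image.astype(np.float64)
    gx = img[:-1, 1:] - img[:-1, :-1]   # horizontal difference
    gy = img[1:, :-1] - img[:-1, :-1]   # vertical difference
    return float(np.sqrt((gx ** 2 + gy ** 2) / 2.0).mean())
```

Both values grow with local intensity variation regardless of whether that variation is genuine texture or a fusion artifact, so a high score on either metric alone does not establish good perceptual quality.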
4.1.3. Methods
rithms such as fast-zero-learning (FEZ) [23], convolutional sparse representation (CSR) [25], deep learning (DL) [17], dense fuse (DENSE) [38], generative adversarial network for image fusion (FusionGAN) [8], Laplacian pyramid (LP) [22], dual-tree complex wavelet transform (DTCWT) [55], latent low-rank representation (LATLRR) [13], multi-scale transform and sparse representation (LP-SR) [27], dense SIFT (DSIFT) [18], convolutional neural network (CNN) [26], curvelet transformation (CVT) [16], bilateral filter fusion method (CBF) [24], cross joint sparse representation (JSR) [10], joint sparse representation with saliency detection (JSRSD) [26], gradient transfer fusion (GTF) [21], weighted least square optimization (WLS) [11], ratio of low-pass pyramid (RP) [29], multi-resolution singular value decomposition (MSVD) [28], non-linear fusion (OURS), and non-linear fusion with feature selection (OURS+). The time efficiency of the different algorithms is tested on the VIFB dataset. Because the fourth dataset, VIFB, provides the fusion results of its own set of algorithms, some of them overlap with the algorithms compared here, and the others are introduced in Section 4.2.4. In Table 2, the reported time is the average time obtained on the VIFB dataset.
4.1.4. Implementation details
sequent experiments, we converted all images into grayscale images for subsequent image fusion. 2) The robustness problem in this paper is mainly verified from two aspects. On the one hand, it is verified on multiple cross-modal datasets. On the other hand, it is verified in complex environments, such as highlight and low-light images. Therefore, this paper does not test robustness separately; it is reflected in each experiment. 3) The set of compared algorithms changes slightly between experiments, and the changes are explained in the respective experimental sections. 4) The code of these algorithms is publicly available, and the algorithm parameters follow the settings in the corresponding published papers. 5) For our proposed algorithm, we also conducted a comparative experiment with and without the channel attention module. 6) Because the FLIR dataset is not registered, we used a MATLAB toolbox to align the images manually. 7) VIFB is a new image fusion benchmark proposed in 2020, but the algorithms it contains are neither the most numerous nor the most recent. Therefore, on the first three datasets, we adopt the latest image fusion algorithms together with some classic ones. More importantly, the VIFB code base is not public. 8) Because the VIFB dataset does not provide a code base, we tested only nine of its algorithms, and show only six commonly used ones in Figure 14. 9) Our experimental platform is a desktop with a 3.0 GHz i5-8500 CPU, an RTX 2070 GPU and 32 GB of memory.
Fig. 5. Visible and infrared source images with the fusion results obtained by different methods. From (a) to (v): CNN [15], CVT [16], DL [17], DTCWT [18], FEZ [23], DSIFT [18], CSR [25], DFA [38], DFL1 [38], CBF [24], WLS [11], JSR [10], JSRSD [26], LATLRR [60], FusionGAN [19], GTF [21], IFCNN [54], LPSR [27], MSVD [28], RP [29], OURS, OURS+.
4.2. Comparative experiments
and the robustness of our algorithm, we carry out comparative experiments and visual comparisons on the TNO dataset, the FLIR dataset, the medical dataset and the VIFB dataset. In subsection 4.2.5, we verify the effectiveness of feature selection [61]. In subsection 4.2.6, we analyze the validity of non-linear fusion.
4.2.1. Results on TNO dataset
pairs of infrared and visible images in the dataset using the 20 image fusion methods shown in Section 4. Figure 5 gives a qualitative analysis of the dataset. In the tree leaf window of Figure 5, there is an obvious highlight in the visible image and the image information is seriously lost, whereas in the infrared image the detailed structure of this region is relatively well preserved. Existing algorithms do not recover the information lost in the visible image well. Compared with other algorithms, our algorithm gives very high definition in the highlighted tree region. In the pedestrian window, our algorithm also allows pedestrians to keep high contrast. The images fused by our algorithm are more in line with the human visual perception mechanism.
Fig. 6. Six evaluation indicators for quantitative contrast between infrared and visible images.
analysis of the dataset. From Figure 6, we can see that the CSR and DSIFT algorithms have great advantages in the EN, AG, MI and IFC indexes. However, the fused images show that the subjective quality of these two algorithms is the worst, with many fusion boundary effects. These objective indicators can therefore mislead image quality assessment. This shows that the existing objective image quality indicators each have their own limitations and cannot evaluate image quality completely. Although our algorithm does not have an advantage in the objective indexes on this dataset, it effectively avoids the boundary effect while fully preserving the original details of the image.
4.2.2. Results on FLIR dataset
ments on the FLIR [9] traffic dataset. Subjective visual analysis is shown in Figure 7, and the relevant quantitative analysis is shown in Figure 8.
Fig. 7. Qualitative fusion results on visible and thermal infrared images by different methods.
higher image fusion quality than other algorithms in the highlight block. Our algorithm can effectively remove the highlight and avoid the boundary effect of image fusion. On the FLIR dataset, DSIFT and CSR again give the worst subjective results, although their objective indicators are very high. In addition to the boundary effect, the two algorithms seriously lose the detailed texture information of the visible image. We also examine the SSIM evaluation index together with the highlight block image. The image texture details in the fusion results of the CVT [16], DTCWT [18] and RP [29] algorithms have not been repaired at all, yet on the FLIR dataset the SSIM values of these three algorithms are generally more than one percentage point higher than that of the proposed algorithm. At the same time, these three algorithms have higher visual fidelity than other algorithms. We attribute this mainly to the brightness and contrast characteristics of the human visual system and to its visual masking characteristics. When the image is seriously degraded, the SSIM evaluation index differs significantly from the subjective evaluation, so a higher SSIM value does not necessarily mean better image quality.
Fig. 8. Six evaluation indicators for quantitative contrast between thermal infrared and visible images.
4.2.3. Results on medical dataset
medical image dataset, as shown in Figure 9 and Figure 10. The comparison shows that our fusion results and those of the CNN algorithm have better clarity. We are not the best in the objective indicators, mainly because the fusion result of the RP algorithm has an obvious fragmentation effect, which produces a very high gradient value.
Fig. 9. Qualitative fusion results on CT and MR images by different methods.
Fig. 10. Qualitative fusion results on CT and MR images by different methods.
4.2.4. Results on VIFB benchmark
Fig. 11. Exemplar infrared and visible images from the VIFB datasets.
Fig. 12. Qualitative fusion results of carLight images on VIFB dataset.
Fig. 13. Qualitative fusion results of elecbike images on VIFB dataset.
Fig. 14. Qualitative fusion results on VIFB dataset.
Fig. 15. Qualitative fusion results of manlight images on VIFB dataset.
Fig. 16. Qualitative fusion results of tricycle images on VIFB dataset.
4.2.5. Validity analysis experiment of feature selection feature
Fig. 17. Validity analysis experiment of feature selection feature. (a) Non-linear image fusion result. (b) and (c) use the same network weights, trained without the selective attention module; in (c) the attention selection module is added in the test phase. Similarly, (d) and (e) use the same network weights, trained with the selective attention module; in (e) the selective attention module is removed in the test phase. (a), (b), (c) and (d) all use the same network parameters and data for training.
tion selection module will reduce the overall indicators compared with not adding it. The information entropy, average gradient and information fidelity decrease slightly, while the mutual information, structural similarity and peak signal-to-noise ratio improve slightly. To some extent, this supports some conclusions of RCAN [35] proposed by Zhang et al., and it also supports the effectiveness of attention-based feature selection. Feature maps are given different weights according to their importance, which inevitably affects information entropy, gradient and information fidelity. However, this influence is hard to detect by subjective vision, so we conduct the following experiments to analyze visually how feature selection affects image fusion.
obvious even if there is no objective quality comparison. Even if we add the attention selection module in the training phase but remove it in the test phase, the selection mechanism still strongly influences the model weights, and the image clarity and contrast are clearly improved. Compared with the existing algorithms, the effect is significantly improved. However, if we do not add the attention selection module in the training phase and only use it in the test phase, the image quality is reduced. This experiment also supports the effectiveness of introducing a feature selection mechanism into image fusion, and to some extent shows that the human visual selection characteristic has a positive effect on image fusion.
4.2.6. Validity analysis experiment of non-linear fusion
Fig. 18. Validity analysis experiment of non-linear fusion. (a) Weighted average image fusion results. (b) Sum image fusion results. (c) Maximum image fusion results. (d) Non-linear image fusion result.
to combine different fusion criteria for experimental comparison. In this experiment, only the fusion criterion is changed; the alternatives are the maximum, weighted average and sum fusion criteria. Figure 18 shows that, compared with these three criteria, the non-linear fusion method performs better in subjective fusion quality, although this does not mean that its objective indexes are very high. The experiments also indicate the effectiveness of non-linear fusion and the robustness that human visual characteristics bring to image fusion. Although our method does not fully simulate this non-linear characteristic, the existing results support this viewpoint to a certain extent.
fusion method is more in line with the human visual system than the existing methods. We think the main reasons are as follows. Firstly, combining traditional and deep learning methods is effective in image fusion tasks. Secondly, using illumination as a non-linear factor of feature fusion is consistent with human visual perception characteristics. Finally, in the image fusion task, feature selection is not only effective in the initial stage of feature extraction, but also very important in the later stage of feature fusion.
a complex environment, the fusion results of many existing algorithms do not have the best subjective quality, yet their objective quality evaluation indexes are very high. This illustrates two problems.
the image quality according to the objective evaluation index. If the image fusion task has ground truth, this is the right approach. However, the same approach is also used for cross-modal image fusion tasks without ground truth, where our experiments show that it is not so accurate.
uating the quality of cross-modal images.
tasks, there are still some shortcomings. The human visual perception system is a very complex system. When processing image fusion tasks, human beings add understanding and cognition from the pixel level to the semantic level, which existing image quality evaluation functions cannot match. Although image fusion and image quality assessment based on deep learning and generative adversarial networks have achieved good results, their objective function modeling is seriously restricted by the difficulty of expressing this kind of cognition. Although we incorporate characteristics of the human visual perception system into the image fusion task, there is still a gap with the complete human visual perception system, which is a direction we need to study in the future.
selective of cross-modal image fusion method. Our method differs from current mainstream methods in three main ways. 1) We do not need to train a dedicated image fusion network first. 2) We introduce an illuminance fusion factor to simulate the non-linear characteristics of human visual perception, for the first time in image fusion. 3) We introduce an attention mechanism into the image fusion task to simulate the selection characteristics of human visual perception. Experiments on a large amount of data demonstrate that our method is more in line with the human visual perception system than existing mainstream methods. Although our algorithm does not fully simulate the characteristics of human visual perception, this first attempt to simulate them in image fusion tasks is consistent with the human visual perception mechanism. Although our method achieves relatively good results compared with existing algorithms, how to better learn the non-linear relationship between the features and the spatial structure will be studied in future work.
We are very grateful to Prof. Roundtree and Dr. Xiaoming Wang for their help with the language of the paper. This work was supported by the National Natural Science Foundation of China under Grant no. 61871326 and the Shaanxi Natural Science Basic Research Program under Grant no. 2018JM6116.