Abstract
Cross-modal fusion of visible and infrared images can make targets more prominent, and the interaction between multi-modal stream fusion and salient object detection can depict targets more accurately. We propose a multi-modal stream focusing salient object detection network based on visible-infrared complementary fusion, named MFCF. MFCF has two main subnetworks: an Attentional Complementary Image Fusion subnetwork for Light Perception (AComFusion) and a Multimodal Stream Focusing Contextual Salient Object Detection subnetwork (MSFCSod). To address the problem that redundant information across modalities weakens the fusion result, AComFusion uses an attention mutual information complementary module to remove redundancy and strengthen the complementary advantages of the two modalities. In addition, a light classification module adaptively classifies the lighting conditions and adjusts the contribution weight of each modality to obtain the best fusion quality under various lighting conditions. The output of AComFusion serves as a third modality stream and is fed into MSFCSod together with the visible and infrared source images. This fusion stream guides the infrared and visible streams to focus externally on salient target features. An efficient focusing amplifier module then self-focuses internally on the detected salient targets, enhancing their feature representations. Finally, a contextual fusion module integrates low-level details with high-level semantic features to refine the texture and edges of objects, thus improving the overall performance of MFCF. Experimental results on several benchmark datasets show that the proposed MFCF network achieves state-of-the-art performance. The network also shows strong potential on the subtasks of image fusion and salient object detection.