Article type: Research Article
Authors: Huang, Shuaina [a, b] | Zhang, Zhiyong [a, b, *] | Song, Bin [a, b] | Mao, Yueheng [a, b]
Affiliations: [a] College of Information Engineering, Henan University of Science and Technology, Henan Luoyang, China | [b] Henan International Joint Laboratory of Cyberspace Security Applications, Henan University of Science and Technology, Henan Luoyang, China
Correspondence: [*] Corresponding author. Zhiyong Zhang, Henan International Joint Laboratory of Cyberspace Security Applications, Henan University of Science and Technology, Henan Luoyang 471023, China. E-mail: zhangzy@haust.edu.cn.
Note: [1] This work was supported by the National Natural Science Foundation of China under Grant No. 61972133, the Project of Leading Talents in Science and Technology Innovation in Henan Province under Grant No. 204200510021, the Program for Henan Province Key Science and Technology under Grant No. 222102210177, and the Henan Province University Key Scientific Research Project under Grant No. 23A520008.
Abstract: Social network attackers leverage images and text to disseminate sensitive information associated with pornography, politics, and terrorism, causing adverse effects on society. Current sensitive information classification models do not focus on feature fusion between images and text, which greatly reduces recognition accuracy. To address this problem, we propose an attentive cross-modal fusion model (ACMF), which utilizes a mixed attention mechanism and the Contrastive Language-Image Pre-training (CLIP) model. Specifically, we employ a deep neural network with a mixed attention mechanism as the visual feature extractor, which allows us to progressively extract features at different levels. We combine these visual features with those obtained from a text feature extractor and incorporate image-text frequency-domain information at various levels to enable fine-grained modeling. Additionally, we introduce a cyclic attention mechanism and integrate the CLIP model to establish stronger connections between modalities, thereby enhancing classification performance. Experimental evaluations conducted on the collected sensitive information datasets demonstrate the superiority of our method over other baseline models. The model achieves an accuracy of 91.4% and an F1-score of 0.9145. These results validate the effectiveness of the mixed attention mechanism in enhancing the utilization of important features. Furthermore, the effective fusion of text and image features significantly improves the classification ability of the deep neural network.
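To make the architecture described in the abstract more concrete, below is a minimal, hypothetical sketch of a mixed (channel + spatial) attention block applied to a CNN feature map, followed by a simple concatenation-based image-text fusion head. The module names, tensor shapes, and fusion strategy are illustrative assumptions only; they do not reproduce the authors' actual ACMF implementation, which additionally involves frequency-domain features, cyclic attention, and the CLIP model.

```python
# Hypothetical sketch of a mixed (channel + spatial) attention block and a simple
# image-text fusion head. Names, dimensions, and the fusion strategy are
# illustrative assumptions, not the authors' actual ACMF implementation.
import torch
import torch.nn as nn


class MixedAttention(nn.Module):
    """Channel attention followed by spatial attention over a CNN feature map."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, produce per-channel weights.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                      # (B, C)
        mx = x.amax(dim=(2, 3))                       # (B, C)
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1)                   # re-weight channels
        avg_map = x.mean(dim=1, keepdim=True)         # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)         # (B, 1, H, W)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * sa                                 # re-weight spatial positions


class SimpleFusionHead(nn.Module):
    """Concatenate pooled visual features with text features and classify."""

    def __init__(self, visual_dim: int, text_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(visual_dim + text_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),
        )

    def forward(self, visual_map: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        pooled = visual_map.mean(dim=(2, 3))          # global average pooling
        return self.classifier(torch.cat([pooled, text_feat], dim=1))


if __name__ == "__main__":
    attn = MixedAttention(channels=256)
    fuse = SimpleFusionHead(visual_dim=256, text_dim=512, num_classes=3)
    feats = torch.randn(4, 256, 14, 14)               # CNN feature map
    text = torch.randn(4, 512)                        # e.g. CLIP text embeddings
    logits = fuse(attn(feats), text)
    print(logits.shape)                                # torch.Size([4, 3])
```

In a full pipeline of this kind, the text features would typically come from a pretrained text encoder (such as CLIP's), and attention blocks like the one above would be interleaved at multiple stages of the visual backbone to extract features at different levels.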
Keywords: Multi-modal, sensitive information, spatial attention mechanism, channel attention mechanism, deep learning
DOI: 10.3233/JIFS-233508
Journal: Journal of Intelligent & Fuzzy Systems, vol. 45, no. 6, pp. 12425-12437, 2023