IMF-MF: Interactive moment localization with adaptive multimodal fusion and self-attention

Singh, Pratibha; Kushwaha, Alok Kumar Singh; Varshney, Neeraj

doi:10.3233/JIFS-233071

Article type: Research Article

Authors: Singh, Pratibha^{a; *} | Kushwaha, Alok Kumar Singh^a | Varshney, Neeraj^b

Affiliations: [a] Department of Computer Science and Engineering, Guru Ghasidas Vishwavidyalaya, Bilaspur, Chhattisgarh, India | [b] GLA University, Mathura, Uttar Pradesh, India

Correspondence: [*] Corresponding author: Pratibha Singh. Email: pratibhaparihar11@gmail.com.

Abstract: Precise video moment retrieval lets users locate specific moments within a large video corpus. This paper presents Interactive Moment Localization with Multimodal Fusion (IMF-MF), a novel model that uses self-attention to achieve state-of-the-art performance. IMF-MF integrates query context with multimodal features, including visual and audio information, to localize moments of interest accurately. The model operates in two phases: feature fusion and joint representation learning. The first phase dynamically computes fusion weights that adapt how the multimodal video content is combined, so that the most relevant features are prioritized. The second phase employs bi-directional attention to tightly couple video and query features into a unified joint representation for moment localization. This joint representation captures long-range dependencies and complex patterns, enabling the model to distinguish relevant from irrelevant video segments. IMF-MF is evaluated on three benchmark datasets: TVR (closed-world TV episodes), Charades (open-world user-generated videos), and DiDeMo (an open-world, diverse video moment retrieval dataset). The empirical results indicate that the proposed approach surpasses existing state-of-the-art methods in retrieval accuracy, as measured by Recall (R1, R5, R10, and R100) and Intersection-over-Union (IoU), highlighting the benefits of its interactive moment localization approach and its use of self-attention for feature representation and attention modeling.
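
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how Recall at K and temporal IoU are commonly computed for moment localization. It is not the authors' evaluation code. The function names `temporal_iou` and `recall_at_k`, the `(start, end)` span format in seconds, and the default IoU threshold of 0.5 are all assumptions made for illustration, and the sketch ignores video identity, which corpus-level benchmarks would also check.

```python
from typing import Sequence, Tuple

# A moment is a (start_sec, end_sec) pair; this format is an assumption.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-Union of two time intervals."""
    # Overlap length, clamped at zero for disjoint intervals.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[Sequence[Span]],
    gts: Sequence[Span],
    k: int,
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the paper
) -> float:
    """Fraction of queries with at least one top-k span whose IoU meets the threshold."""
    if not gts:
        return 0.0
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        # preds is assumed to be sorted best-first for its query.
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return hits / len(gts)
```

Under these assumptions, R1, R5, R10, and R100 would correspond to calling `recall_at_k` with `k` set to 1, 5, 10, and 100 on the same ranked predictions.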

Keywords: Multimedia data retrieval, query-dependent fusion, ranking system, multimodal retrieval, video segment localization

Journal: Journal of Intelligent & Fuzzy Systems, vol. Pre-press, no. Pre-press, pp. 1-12, 2024

Published: 04 April 2024
