Learning hierarchical embedding space for image-text matching

Sun, Hao; Qin, Xiaolin; Liu, Xiaojing

doi:10.3233/IDA-230214

Article type: Research Article

Authors: Sun, Hao^* | Qin, Xiaolin | Liu, Xiaojing

Affiliations: College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, China

Correspondence: [*] Corresponding author: Sun Hao, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, China. E-mail: sunhao123@nuaa.edu.cn.

Abstract: There are currently two mainstream strategies for image-text matching. The first, termed joint embedding learning, models the semantic information of both images and sentences in a shared feature subspace. This makes semantic similarity easy to measure, but it focuses only on the global alignment relationship. The second, termed metric learning, explores the local semantic relationship more fully by learning a complex similarity function that directly outputs a score for each image-text pair. However, it incurs a significantly heavier computational burden at the retrieval stage. In this paper, we propose a hierarchical joint embedding model that incorporates the local semantic relationship into a joint embedding learning framework. The proposed method learns the shared local and global embedding spaces simultaneously, and models the joint local embedding space with respect to specific local similarity labels, which are easy to obtain from the lexical information of the corpus. Unlike methods based on metric learning, we can prepare fixed representations of both images and sentences by concatenating the normalized local and global representations, which makes efficient retrieval feasible. Experiments show that the proposed model achieves competitive performance compared with existing joint embedding learning models on two publicly available datasets, Flickr30k and MS-COCO.
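
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the feature dimensions, and the random features standing in for encoder outputs are all assumptions made for illustration, and the encoders and training objective are not shown. It only demonstrates that, once the normalized local and global representations are concatenated into one fixed vector per item, a single dot product gives the sum of the local and global cosine similarities, so candidates can be ranked without evaluating a learned similarity function per pair.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def hierarchical_embedding(local_feats, global_feats):
    """Concatenate the normalized local and global embeddings row by row."""
    return np.concatenate(
        [l2_normalize(local_feats), l2_normalize(global_feats)], axis=1
    )

def rank_sentences(image_emb, sentence_embs):
    """Return sentence indices sorted by descending similarity, plus the scores."""
    scores = sentence_embs @ image_emb  # one dot product per candidate
    return np.argsort(-scores), scores

# Hypothetical features: random values stand in for encoder outputs.
rng = np.random.default_rng(0)
img_local = rng.normal(size=(1, 4))    # assumed local dimension: 4
img_global = rng.normal(size=(1, 8))   # assumed global dimension: 8
sent_local = rng.normal(size=(5, 4))
sent_global = rng.normal(size=(5, 8))

# Fixed representations can be computed once and stored before retrieval.
image_emb = hierarchical_embedding(img_local, img_global)[0]
sentence_embs = hierarchical_embedding(sent_local, sent_global)

order, scores = rank_sentences(image_emb, sentence_embs)
print("ranking:", order)
```

Because each half of the concatenated vector has unit length, every score lies between -2 and 2, and the local and global terms contribute with equal weight in this sketch.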

Keywords: Information retrieval, cross-modal representation, hierarchical embedding, local alignment

Journal: Intelligent Data Analysis, vol. 28, no. 3, pp. 647-665, 2024

Published: 28 May 2024
