Attentive models in vision: Computing saliency maps in the deep learning era

Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita

doi:10.3233/IA-170033

Issue title: Selected papers from the 16th International Conference of the Italian Association for Artificial Intelligence

Guest editors: Stefano Ferilli and Francesca Alessandra Lisi

Article type: Research Article

Affiliations: Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia

Correspondence: [*] Corresponding author: Marcella Cornia, Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia. E-mail: marcella.cornia@unimore.it.

Abstract: Estimating the focus of attention of a person looking at an image or a video is a crucial step that can enhance many vision-based inference tasks, such as image segmentation and annotation, video captioning, and autonomous driving. The early stages of attentive behavior are typically bottom-up; reproducing this mechanism means finding the saliency embodied in images, i.e., determining which parts of an image pop out of a visual scene. This process has been studied for decades, both in neuroscience and through computational models that aim to reproduce the human cortical process. In the last few years, these early models have been replaced by deep learning architectures, which outperform all earlier approaches when evaluated on public datasets. In this paper, we discuss the effectiveness of convolutional neural network (CNN) models for saliency prediction. We present a set of deep learning architectures of our own design that can combine bottom-up cues with higher-level semantics, and that extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. We show how these deep networks closely recall the early saliency models, although they are improved by the semantics learned from human ground-truth data. Finally, we present a use case in which saliency prediction is used to improve the automatic description of images.
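The abstract mentions extracting spatio-temporal features with 3D convolutions. As a purely illustrative sketch, and not the authors' architecture (which stacks many kernels learned from human fixation data), the snippet below slides a single hand-picked 3D kernel over a tiny grayscale clip and normalizes the response into a saliency-like map. The kernel, the clip, and the function names `conv3d_valid` and `to_saliency` are hypothetical choices made only for illustration.

```python
from typing import List

Frame = List[List[float]]   # indexed as [y][x]
Volume = List[Frame]        # indexed as [t][y][x]


def conv3d_valid(clip: Volume, kernel: Volume) -> Volume:
    """Single-channel 3D cross-correlation (what deep learning libraries
    call 'convolution'), stride 1, no padding."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(T - kt + 1):
        frame: Frame = []
        for y in range(H - kh + 1):
            row: List[float] = []
            for x in range(W - kw + 1):
                # The kernel spans time as well as space, so each output
                # value mixes information from several consecutive frames.
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            frame.append(row)
        out.append(frame)
    return out


def to_saliency(frame: Frame) -> Frame:
    """Rectify (ReLU) a 2D response map and min-max normalize it to [0, 1]."""
    relu = [[max(0.0, v) for v in row] for row in frame]
    lo = min(min(row) for row in relu)
    hi = max(max(row) for row in relu)
    if hi == lo:
        return [[0.0 for _ in row] for row in relu]
    return [[(v - lo) / (hi - lo) for v in row] for row in relu]


if __name__ == "__main__":
    # Hypothetical 2-frame, 4x4 clip: one pixel lights up in the second frame.
    clip = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
    clip[1][1][2] = 1.0
    # Hypothetical 2x1x1 temporal-difference kernel: responds to brightening.
    kernel = [[[-1.0]], [[1.0]]]
    saliency = to_saliency(conv3d_valid(clip, kernel)[0])
    for row in saliency:
        print(row)  # the changing pixel at (y=1, x=2) gets value 1.0
```

The temporal-difference kernel was picked so that the effect is easy to verify by hand: only the pixel that changes between frames stands out. A learned network would instead discover many such spatio-temporal filters and combine them with higher-level semantic features.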

Keywords: Saliency, Human Attention, Neuroscience, Vision, Deep Learning

Journal: Intelligenza Artificiale, vol. 12, no. 2, pp. 161-175, 2018

Published: 29 January 2019
