Dual self-attention with co-attention networks for visual question answering. (September 2021)
- Record Type:
- Journal Article
- Title:
- Dual self-attention with co-attention networks for visual question answering. (September 2021)
- Main Title:
- Dual self-attention with co-attention networks for visual question answering
- Authors:
- Liu, Yun
Zhang, Xiaoming
Zhang, Qianyun
Li, Chaozhuo
Huang, Feiran
Tang, Xianghong
Li, Zhoujun
- Abstract:
- Highlights: A novel model based on the self-attention mechanism is proposed to learn more effective multi-modal representations. The DSACA model is proposed to capture the internal dependencies and cross-modal correlation between the image and question sentence. Extensive experiments and analysis confirm the superiority of the proposed DSACA.
Abstract: Visual Question Answering (VQA), an important task in understanding vision and language, has attracted wide interest. In previous VQA methods, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are generally used to extract visual and textual features respectively, and then the correlation between these two features is explored to infer the answer. However, CNN mainly focuses on extracting local spatial information and RNN pays more attention to exploiting sequential architecture and long-range dependencies. It is difficult for them to integrate the local features with their global dependencies to learn more effective representations of the image and question. To address this problem, we propose a novel model, i.e., Dual Self-Attention with Co-Attention networks (DSACA), for VQA. It aims to model the internal dependencies of both the spatial and sequential structure respectively by using the newly proposed self-attention mechanism. Specifically, DSACA mainly contains three submodules. The visual self-attention module selectively aggregates the visual features at each region by a weighted sum of the features at all positions. The textual self-attention module automatically emphasizes the interdependent word features by integrating associated features among the sentence words. Besides, the visual-textual co-attention module explores the close correlation between visual and textual features learned from self-attention modules. The three modules are integrated into an end-to-end framework to infer the answer. Extensive experiments performed on three generally used VQA datasets confirm the favorable performance of DSACA compared with state-of-the-art methods.
- Is Part Of:
- Pattern recognition. Volume 117(2021)
- Journal:
- Pattern recognition
- Issue:
- Volume 117(2021)
- Issue Display:
- Volume 117, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 117
- Issue:
- 2021
- Issue Sort Value:
- 2021-0117-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-09
- Subjects:
- Self-attention -- Visual-textual co-attention -- Visual question answering
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2021.107956
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 17028.xml
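
Illustrative note on the abstract: the visual self-attention module is described as aggregating the features at each region "by a weighted sum of the features at all positions". The sketch below shows that general idea only. It is not the DSACA implementation: the scaled dot-product scoring, the absence of learned projections, and the names `softmax`, `self_attention`, and `regions` are all assumptions made for illustration, since the abstract does not give the exact formulation.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Replace each feature vector by a weighted sum of all feature vectors.

    Assumption: weights come from scaled dot-product similarity between
    positions, with no learned query/key/value projections.
    """
    dim = len(features[0])
    attended = []
    for query in features:
        # Similarity of this position to every position (including itself).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        # Weighted sum over the features at all positions.
        attended.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
        )
    return attended


# Hypothetical toy input: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(regions)
```

Each output vector keeps the input dimensionality, and because the weights are non-negative and sum to 1, every output is a convex combination of the input features. The same function could be applied to word features to mimic the textual self-attention idea under the same assumptions.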