Relation-aware attention for video captioning via graph learning. (April 2023)
- Record Type:
- Journal Article
- Title:
- Relation-aware attention for video captioning via graph learning. (April 2023)
- Main Title:
- Relation-aware attention for video captioning via graph learning
- Authors:
- Tu, Yunbin
Zhou, Chang
Guo, Junjun
Li, Huafeng
Gao, Shengxiang
Yu, Zhengtao
- Abstract:
- Highlights: We improve the conventional attention mechanism to a relation-aware attention mechanism via graph learning, which aims to 1) support proper semantic alignment between target word states and attended visual features and 2) leverage the attention information from the past and future to guide the current attention process. A linguistics-to-vision heterogeneous graph (HTG) is learned to enhance the inter-relations between target word states and attended visual features. Moreover, a vision-to-vision homogeneous graph (HMG) is learned to capture the intra-relations among all the attended visual features. Experimental results on two benchmark datasets prove that our model performs much better than many state-of-the-art methods.
Abstract: Video captioning often uses an attentive encoder-decoder as the baseline model. However, the conventional attention mechanism still has two problems. First, the attended visual feature is often irrelevant to the target word state, because the attention process only uses the unidirectional flow from vision to linguistics and lacks the reverse flow. Second, each attention result is independent, because it is computed based only on the previous word states and does not consider the attention information from the past and future. This does not suit the attention habits of human beings. In this paper, we improve the conventional attention mechanism to a relation-aware attention mechanism. To this end, we propose two kinds of graph learning strategies, namely the linguistics-to-vision heterogeneous graph (HTG) and the vision-to-vision homogeneous graph (HMG). The HTG aims to enhance the inter-relation of attention by reversely modeling the relation of each word with respect to every attended visual feature, supporting proper semantic alignment between them. The HMG aims to enhance the intra-relation of attention by capturing the relations among all of the attended visual features, which can leverage the attention information from the past and future to guide the current attention process. Extensive experiments on two public datasets show that our proposed method not only significantly improves the baseline model, but also outperforms state-of-the-art methods.
- Is Part Of:
- Pattern recognition. Volume 136(2023)
- Journal:
- Pattern recognition
- Issue:
- Volume 136(2023)
- Issue Display:
- Volume 136, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 136
- Issue:
- 2023
- Issue Sort Value:
- 2023-0136-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-04
- Subjects:
- Video captioning -- Relation-aware attention -- Graph learning
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2022.109204
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 25681.xml