Towards local visual modeling for image captioning. (June 2023)
- Record Type:
- Journal Article
- Title:
- Towards local visual modeling for image captioning. (June 2023)
- Main Title:
- Towards local visual modeling for image captioning
- Authors:
- Ma, Yiwei
Ji, Jiayi
Sun, Xiaoshuai
Zhou, Yiyi
Ji, Rongrong
- Abstract:
- Highlights: Local visual modeling with grid features for image captioning. Locality-Sensitive Attention (LSA) is deployed for the intra-layer interaction via local visual modeling. Locality-Sensitive Fusion (LSF) is used for inter-layer information fusion. Locality-Sensitive Transformer Network (LSTNet) outperforms SOTA captioning models on MS-COCO. The generalization of LSTNet is also verified on the Flickr8k and Flickr30k datasets.
Abstract: In this paper, we study the local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this target, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for the intra-layer interaction in Transformer via modeling the relationship between each grid and its neighbors. It reduces the difficulty of local object recognition during captioning. LSF is used for inter-layer information fusion, which aggregates the information of different encoder layers for cross-layer semantical complementarity. With these two novel designs, the proposed LSTNet can model the local visual information of grid features to improve the captioning quality. To validate LSTNet, we conduct extensive experiments on the competitive MS-COCO benchmark. The experimental results show that LSTNet is not only capable of local visual modeling, but also outperforms a bunch of state-of-the-art captioning models on offline and online testings, i.e., 134.8 CIDEr and 136.3 CIDEr, respectively. Besides, the generalization of LSTNet is also verified on the Flickr8k and Flickr30k datasets. The source code is available on GitHub: https://www.github.com/xmu-xiaoma666/LSTNet.
- Is Part Of:
- Pattern recognition. Volume 138(2023)
- Journal:
- Pattern recognition
- Issue:
- Volume 138(2023)
- Issue Display:
- Volume 138, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 138
- Issue:
- 2023
- Issue Sort Value:
- 2023-0138-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-06
- Subjects:
- Image captioning -- Attention mechanism -- Local visual modeling
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2023.109420
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 26053.xml