Revisiting image captioning via maximum discrepancy competition. (February 2022)
- Record Type:
- Journal Article
- Title:
- Revisiting image captioning via maximum discrepancy competition. (February 2022)
- Main Title:
- Revisiting image captioning via maximum discrepancy competition
- Authors:
- Wan, Boyang
Jiang, Wenhui
Fang, Yu-Ming
Zhu, Minwei
Li, Qin
Liu, Yang
- Abstract:
- Highlights: We propose a new model comparison method that avoids an unaffordable large-scale subjective annotation experiment. A new similarity function named NGSM is proposed as a semantic distance measure to model the discrepancy between captions. With NGSM, informative images can be selected effectively from an arbitrary large-scale raw image dataset. We demonstrate quantitative results on the generalization ability of the competing ICMs and provide a detailed analysis of the key factor in improving the generalization ability of ICMs.
Abstract: Image captioning has been a hot research topic bridging computer vision and natural language processing over the past several decades. It has achieved great progress with the help of large-scale datasets and deep learning techniques. Despite the variety of image captioning models (ICMs), the performance of ICMs appears to have hit a bottleneck, judging from publicly published results. Considering the marginal performance gains brought by recent ICMs, we raise the following question: "How do recent ICMs perform on in-the-wild images?" To answer this question, we compare existing ICMs by evaluating their generalization ability. Specifically, we propose a novel method based on maximum discrepancy competition to diagnose existing ICMs. Firstly, we establish a new test set containing only informative images, selected by applying maximum discrepancy competition to the existing ICMs over an arbitrary large-scale raw image set. Secondly, a small-scale and low-cost subjective annotation experiment is conducted on the new test set. Thirdly, we rank the generalization ability of the existing ICMs by comparing their performance on the new test set. Finally, the key factors of different ICMs are identified based on a detailed analysis of the experimental results. Our analysis yields several interesting findings, including that 1) using low- and high-level object features simultaneously may be an effective way to boost the generalization ability of Transformer-based ICMs; 2) the self-attention mechanism may provide better modelling ability for inter- and intra-modal data than other attention-based mechanisms; 3) constructing an ICM with a multistage language decoder may be a promising way to improve its performance.
- Is Part Of:
- Pattern recognition. Volume 122(2022)
- Journal:
- Pattern recognition
- Issue:
- Volume 122(2022)
- Issue Display:
- Volume 122, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 122
- Issue:
- 2022
- Issue Sort Value:
- 2022-0122-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-02
- Subjects:
- Image captioning -- Model comparison -- Attention mechanism
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2021.108358
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 19718.xml
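
Illustrative note: the abstract describes selecting "informative" images by maximum discrepancy competition, that is, keeping the images on which two captioning models disagree most. The sketch below shows that selection step only, as a rough illustration rather than the authors' implementation. The NGSM similarity function is not defined in this record, so a simple word-overlap (Jaccard) score is used as a stand-in; the names `jaccard_similarity`, `select_informative_images`, the model names, and the sample captions are all hypothetical.

```python
from itertools import combinations


def jaccard_similarity(caption_a: str, caption_b: str) -> float:
    """Stand-in similarity based on word-set overlap. This is NOT NGSM."""
    words_a = set(caption_a.lower().split())
    words_b = set(caption_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty captions are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def select_informative_images(captions, top_k=1, similarity=jaccard_similarity):
    """Pick, for every pair of models, the images whose captions differ most.

    captions: {model_name: {image_id: caption}} (hypothetical layout).
    Returns {(model_a, model_b): [image_id, ...]} with the top_k least
    similar (most discrepant) images for each model pair.
    """
    selected = {}
    for model_a, model_b in combinations(sorted(captions), 2):
        # Only images captioned by both models can be compared.
        shared = sorted(set(captions[model_a]) & set(captions[model_b]))
        # Lowest similarity first, i.e. maximum discrepancy first.
        ranked = sorted(
            shared,
            key=lambda img: similarity(captions[model_a][img], captions[model_b][img]),
        )
        selected[(model_a, model_b)] = ranked[:top_k]
    return selected


# Made-up captions from two hypothetical models.
captions = {
    "model_a": {"img1": "a dog runs on grass", "img2": "a man rides a horse"},
    "model_b": {"img1": "a dog runs on grass", "img2": "two cats sleep indoors"},
}
picked = select_informative_images(captions, top_k=1)
print(picked)
```

In this toy input the two models agree on `img1` and disagree completely on `img2`, so `img2` is the image that would be passed on to the small-scale subjective annotation step described in the abstract. Any other caption similarity function with the same two-string signature could be supplied through the `similarity` parameter.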