A multimodal attention fusion network with a dynamic vocabulary for TextVQA. (February 2022)

Record Type:: Journal Article
Title:: A multimodal attention fusion network with a dynamic vocabulary for TextVQA. (February 2022)
Main Title:: A multimodal attention fusion network with a dynamic vocabulary for TextVQA
Authors:: Wu, Jiajia
Du, Jun
Wang, Fengren
Yang, Chen
Jiang, Xinzhe
Hu, Jinshui
Yin, Bing
Zhang, Jianshu
Dai, Lirong
Abstract:: Highlights: A novel encoder-decoder method for TextVQA is proposed. The proposed method uses multimodal features to improve model accuracy. An attention map loss is used to address the dynamic vocabulary problem. The method achieved first place in the ICDAR 2019 ST-VQA challenge. Abstract: Visual question answering (VQA) is a well-known problem in computer vision. Recently, text-based VQA tasks have been attracting growing attention because text information is very important for image understanding. The key to this task is to make good use of the text information in the image. In this work, we propose an attention-based encoder-decoder network that combines the multimodal information of visual, linguistic, and location features. By using the attention mechanism to focus on the features most relevant to the question, our multimodal feature fusion can provide more accurate information to improve performance. Furthermore, we present a decoder with an attention map loss, which can not only predict complex answers but also handle a dynamic vocabulary to reduce the decoding space. Compared with a softmax-based cross-entropy loss, which can only handle a fixed-length vocabulary, the attention map loss significantly improves accuracy and efficiency. Our method achieved first place in all three tasks of the ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering (ST-VQA).
Is Part Of:: Pattern recognition. Volume 122(2022)
Journal:: Pattern recognition
Issue:: Volume 122(2022)
Issue Display:: Volume 122, Issue 2022 (2022)
Year:: 2022
Volume:: 122
Issue:: 2022
Issue Sort Value:: 2022-0122-2022-0000
Page Start:
Page End:
Publication Date:: 2022-02
Subjects:: Dynamic vocabulary -- Attention map -- Multimodal fusion -- ST-VQA
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
Journal URLs:: http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
DOI:: 10.1016/j.patcog.2021.108214
Languages:: English
ISSNs:: 0031-3203
Deposit Type:: Legaldeposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 19718.xml