HVLM: Exploring Human-Like Visual Cognition and Language-Memory Network for Visual Dialog. Issue 5 (September 2022)
- Record Type:
- Journal Article
- Title:
- HVLM: Exploring Human-Like Visual Cognition and Language-Memory Network for Visual Dialog. Issue 5 (September 2022)
- Main Title:
- HVLM: Exploring Human-Like Visual Cognition and Language-Memory Network for Visual Dialog
- Authors:
- Sun, Kaili
Guo, Chi
Zhang, Huyin
Li, Yuan
- Abstract:
- Visual dialog, a visual-language task, enables an AI agent to engage in conversation with humans grounded in a given image. To generate appropriate answers for a series of questions in the dialog, the agent is required to understand the comprehensive visual content of an image and the fine-grained textual context of the dialog. However, previous studies typically utilized the object-level visual feature to represent a whole image, which only focuses on the local perspective of an image but ignores the importance of the global information in an image. In this paper, we propose a novel model, Human-Like Visual Cognitive and Language-Memory Network for Visual Dialog (HVLM), to simulate global and local dual-perspective cognitions in the human visual system and understand an image comprehensively. HVLM consists of two key modules, Local-to-Global Graph Convolutional Visual Cognition (LG-GCVC) and Question-guided Language Topic Memory (T-Mem). Specifically, in the LG-GCVC module, we design a question-guided dual-perspective reasoning to jointly learn visual contents from both local and global perspectives through a simple spectral graph convolution network. Furthermore, in the T-Mem module, we design an iterative learning strategy to gradually enhance fine-grained textual context details via an attention mechanism. Experimental results demonstrate the superiority of our proposed model, which obtains comparable performance on benchmark datasets VisDial v1.0 and VisDial v0.9.
Highlights: A novel deep neural architecture, HVLM, is proposed for Visual Dialog. A dual-perspective encoding mechanism is designed to understand an image comprehensively. An iterative learning strategy is designed to capture fine-grained semantic interactions in the dialog history. Experimental results demonstrate that our proposed model outperforms other comparable models by a significant margin on benchmark datasets.
- Is Part Of:
- Information processing & management. Volume 59:Issue 5(2022)
- Journal:
- Information processing & management
- Issue:
- Volume 59:Issue 5(2022)
- Issue Display:
- Volume 59, Issue 5 (2022)
- Year:
- 2022
- Volume:
- 59
- Issue:
- 5
- Issue Sort Value:
- 2022-0059-0005-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-09
- Subjects:
- Visual Dialog -- Visual-language understanding -- Dual-perspective reasoning -- Simple spectral graph convolution network
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2022.103008
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 23283.xml