Video question answering via grounded cross-attention network learning. Issue 4 (July 2020)
- Record Type:
- Journal Article
- Title:
- Video question answering via grounded cross-attention network learning. Issue 4 (July 2020)
- Main Title:
- Video question answering via grounded cross-attention network learning
- Authors:
- Ye, Yunan
Zhang, Shifeng
Li, Yimeng
Qian, Xufeng
Tang, Siliang
Pu, Shiliang
Xiao, Jun
- Abstract:
- Highlights: We study the problem of video question answering from the viewpoint of modeling the rough video representation and the grounded video representation. The joint question-video representation, based on the rough and grounded representations of the video, is learned for answer prediction. We propose the grounded cross-attention network learning framework, a novel hierarchical cross-attention method with a Q-O cross-attention layer and a Q-V-H cross-attention layer. The proposed GCANet adopts a novel mutual attention learning mechanism. We construct two large-scale datasets for video question answering. Extensive experiments validate the effectiveness of our method.
Abstract: Video question answering is a burgeoning and challenging task in visual information retrieval (VIR), which automatically generates the answer to a question based on referenced video content. Unlike existing visual question answering methods, which mainly focus on static image content, video question answering takes the temporal dimension into account because of the essential difference in structure between image and video. In this paper, we study the problem of video question answering from the viewpoint of grounded cross-attention network learning. Specifically, we propose a novel hierarchical cross-attention mechanism of mutual attention learning for video question answering, named GCANet. We first obtain the multi-level rough video representation from frame-level video features and clip-level video features. Then, we use a region proposal network to generate object-level grounded video features as grounded video representations. Next, the grounded question-video representation is learned by the first layer of the GCANet framework, the Q-O cross-attention layer. The second layer of the GCANet framework, the Q-V-H cross-attention layer, learns the joint question-video representation based on both the rough and the grounded representations of the video. We construct two large-scale video question answering datasets. The experimental results on the proposed datasets demonstrate the effectiveness of our model.
- Is Part Of:
- Information processing & management. Volume 57:Issue 4(2020:Jul.)
- Journal:
- Information processing & management
- Issue:
- Volume 57:Issue 4(2020:Jul.)
- Issue Display:
- Volume 57, Issue 4 (2020)
- Year:
- 2020
- Volume:
- 57
- Issue:
- 4
- Issue Sort Value:
- 2020-0057-0004-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-07
- Subjects:
- Visual information retrieval -- Video question answering -- Cross-attention
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2020.102265
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 20467.xml