Dynamic self-attention with vision synchronization networks for video question answering. (December 2022)
- Record Type:
- Journal Article
- Title:
- Dynamic self-attention with vision synchronization networks for video question answering. (December 2022)
- Main Title:
- Dynamic self-attention with vision synchronization networks for video question answering
- Authors:
- Liu, Yun
Zhang, Xiaoming
Huang, Feiran
Shen, Shixun
Tian, Peng
Li, Lang
Li, Zhoujun
- Abstract:
- Highlights: A novel token selection mechanism based on the dynamic self-attention network is proposed to automatically extract important video features. A vision synchronization network is proposed to align appearance and motion features at the time slice level. Extensive experiments and analysis confirm the superiority of the proposed model DSAVS.
Abstract: Video Question Answering (VideoQA) has gained increasing attention as an important task in understanding the rich spatio-temporal contents, i.e., the appearance and motion in the video. However, existing approaches mainly use the question to learn attentions over all the sampled appearance and motion features separately, which neglect two properties of VideoQA: (1) the answer to the question is often reflected on a few frames and video clips, and most video contents are superfluous; (2) appearance and motion features are usually concomitant and complementary to each other in time series. In this paper, we propose a novel VideoQA model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), to address these problems. Specifically, a gated token selection mechanism is proposed to dynamically select the important tokens from appearance and motion sequences. These chosen tokens are fed into a self-attention mechanism to model the internal dependencies for more effective representation learning. To capture the correlation between the appearance and motion features, a vision synchronization block is proposed to synchronize the two types of vision features at the time slice level. Then, the visual objects can be correlated with their corresponding activities and the performance is further improved. Extensive experiments conducted on three public VideoQA data sets confirm the effectivity and superiority of our model compared with state-of-the-art methods.
- Is Part Of:
- Pattern recognition. Volume 132(2022)
- Journal:
- Pattern recognition
- Issue:
- Volume 132(2022)
- Issue Display:
- Volume 132, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 132
- Issue:
- 2022
- Issue Sort Value:
- 2022-0132-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-12
- Subjects:
- Video question answering -- Dynamic self-attention -- Vision synchronization
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2022.108959
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 23281.xml