Hierarchical multimodal attention for end-to-end audio-visual scene-aware dialogue response generation. (September 2020)

Record Type:: Journal Article
Title:: Hierarchical multimodal attention for end-to-end audio-visual scene-aware dialogue response generation. (September 2020)
Main Title:: Hierarchical multimodal attention for end-to-end audio-visual scene-aware dialogue response generation
Authors:: Le, Hung
Sahoo, Doyen
Chen, Nancy F.
Hoi, Steven C.H.
Abstract:: Highlights: Hierarchical attention, including question self-attention and question-guided attention on input, helps to improve model performance. Features of multiple modalities can be fused through nonlinear approaches for better contextual representations. Input video length, question types in user queries, and turn positions affect the quality of the generated responses. Abstract: This work extends our participation in the 7th Dialogue System Technology Challenge (DSTC7), where we took part in the Audio Visual Scene-aware Dialogue System (AVSD) track. The AVSD track evaluates how dialogue systems understand video scenes and respond to users about the video's visual and audio content. We propose a hierarchical attention approach on user queries, video captions, and audio and visual features that contributes to improved evaluation results. We also apply a nonlinear feature fusion approach to combine the visual and audio features for better knowledge representation. Our proposed model shows superior performance over the baselines in both objective evaluation and human rating. In this extended work, we also provide a more extensive review of the related work, conduct additional experiments with word-level and context-level pretrained embeddings, and investigate different qualitative aspects of the generated responses.
Is Part Of:: Computer speech & language. Volume 63(2020)
Journal:: Computer speech & language
Issue:: Volume 63(2020)
Issue Display:: Volume 63, Issue 2020 (2020)
Year:: 2020
Volume:: 63
Issue:: 2020
Issue Sort Value:: 2020-0063-2020-0000
Page Start:
Page End:
Publication Date:: 2020-09
Subjects:: Dialogue system -- Audio-visual scene-aware dialogue -- Neural network -- Multimodal attention -- Response generation
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
Journal URLs:: http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
DOI:: 10.1016/j.csl.2020.101095
Languages:: English
ISSNs:: 0885-2308
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 13576.xml