Semantic similarity information discrimination for video captioning. (1st March 2023)
- Record Type:
- Journal Article
- Title:
- Semantic similarity information discrimination for video captioning. (1st March 2023)
- Main Title:
- Semantic similarity information discrimination for video captioning
- Authors:
- Du, Sen
Zhu, Hong
Xiong, Ge
Lin, Guangfeng
Wang, Dong
Shi, Jing
Wang, Jing
Xing, Nan
- Abstract:
- Video captioning is a task that aims to automatically describe objects and their actions in videos using natural language sentences. The correct understanding of vision and language information is critical for video captioning tasks. Many existing methods usually fuse different features to generate sentences. However, the sentences have many improper nouns and verbs. Inspired by the successes of fine-grained visual recognition, we treat the problem of improper words to discriminate semantic similarity information. In this paper, we designed a semantic bilinear block (SBB) to widen the gap between the probability of existing and nonexistent words, which can capture more fine-grained features to discriminate semantic information. Moreover, our designed linear attention block (LAB) implements the channelwise attention for the 1-D feature by simplifying the squeeze-and-excitation structure. Furthermore, we designed a semantic discrimination network (SDN) that integrates the LAB and SBB into video encoder and decoder to leverage successful channelwise attention and discriminate semantic similarity information for better video captioning. Experiments on two widely used datasets, MSVD and MSR-VTT, demonstrate that our proposed SDN can achieve better performance than state-of-the-art methods.
Highlights: We propose a semantic discrimination network (SDN) for video captioning. Visual tags are introduced to bridge the gap between vision and language. Build a semantic bilinear block to distinguish similar but not identical vision tag. Experimental results show that our model is superior to the state-of-the-art methods.
- Is Part Of:
- Expert systems with applications. Volume 213:Part B(2023)
- Journal:
- Expert systems with applications
- Issue:
- Volume 213:Part B(2023)
- Issue Display:
- Volume 213, Issue 2 (2023)
- Year:
- 2023
- Volume:
- 213
- Issue:
- 2
- Issue Sort Value:
- 2023-0213-0002-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-03-01
- Subjects:
- SDN Semantic Discrimination Network -- CMB Channel Mixing Block -- LAB Linear Attention Block -- SBB Semantic Bilinear Block -- S-LSTM Semantic Compositional Network Long Short-Term Memory
Video captioning -- Semantic detection -- Bilinear pooling -- Channel attention -- Natural language processing
Expert systems (Computer science) -- Periodicals
Systèmes experts (Informatique) -- Périodiques
Electronic journals
006.33
- Journal URLs:
- http://www.sciencedirect.com/science/journal/09574174
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.eswa.2022.118985
- Languages:
- English
- ISSNs:
- 0957-4174
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3842.004220
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 24510.xml