DMMAN: A two-stage audio–visual fusion framework for sound separation and event localization. (January 2021)
- Record Type:
- Journal Article
- Title:
- DMMAN: A two-stage audio–visual fusion framework for sound separation and event localization. (January 2021)
- Main Title:
- DMMAN: A two-stage audio–visual fusion framework for sound separation and event localization
- Authors:
- Hu, Ruihan
Zhou, Songbing
Tang, Zhi Ri
Chang, Sheng
Huang, Qijun
Liu, Yisen
Han, Wei
Wu, Edmond Q.
- Abstract:
- Videos are widely used as a medium through which people perceive physical changes in the world. However, viewers usually receive a mixture of sounds from multiple sound-producing objects and cannot distinguish or localize the individual sounds as separate entities in a video. To address this problem, this paper proposes the Deep Multi-Modal Attention Network (DMMAN), which models unconstrained video datasets to perform sound source separation and event localization. Built on a multi-modal separator module and a multi-modal matching classifier module, the model addresses the sound separation and modal synchronization problems through a two-stage fusion of sound and visual features. To link the two modules, regression and classification losses are combined to form the DMMAN loss function. The estimated spectrum masks and attention synchronization scores computed by the DMMAN can be easily generalized to the sound source and event localization tasks. Quantitative experimental results show that the DMMAN not only separates sound sources with high quality, as measured by the Signal-to-Distortion Ratio and Signal-to-Interference Ratio metrics, but is also suitable for mixed sound scenes that have never been heard jointly. DMMAN also achieves better classification accuracy than the comparison baselines on the event localization tasks.
- Is Part Of:
- Neural networks. Volume 133(2021)
- Journal:
- Neural networks
- Issue:
- Volume 133(2021)
- Issue Display:
- Volume 133, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 133
- Issue:
- 2021
- Issue Sort Value:
- 2021-0133-2021-0000
- Page Start:
- 229
- Page End:
- 239
- Publication Date:
- 2021-01
- Subjects:
- Two-stage fusion -- Audio–visual tasks -- Sound source separation -- Sound event localization
Neural computers -- Periodicals
Neural networks (Computer science) -- Periodicals
Neural networks (Neurobiology) -- Periodicals
Nervous System -- Periodicals
Ordinateurs neuronaux -- Périodiques
Réseaux neuronaux (Informatique) -- Périodiques
Réseaux neuronaux (Neurobiologie) -- Périodiques
Neural computers
Neural networks (Computer science)
Neural networks (Neurobiology)
Periodicals
006.32
- Journal URLs:
- http://www.sciencedirect.com/science/journal/08936080
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.neunet.2020.10.003
- Languages:
- English
- ISSNs:
- 0893-6080
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 6081.280800
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 14916.xml