Multimodal speech recognition for unmanned aerial vehicles. (March 2021)
- Record Type:
- Journal Article
- Title:
- Multimodal speech recognition for unmanned aerial vehicles. (March 2021)
- Main Title:
- Multimodal speech recognition for unmanned aerial vehicles
- Authors:
- Oneață, Dan
Cucu, Horia - Abstract:
- Abstract: Unmanned aerial vehicles (UAVs) are becoming widespread with applications ranging from film-making and journalism to rescue operations and surveillance. Research communities (speech processing, computer vision, control) are starting to explore the limits of UAVs, but their efforts remain somewhat isolated. In this paper we unify multiple modalities (speech, vision, language) into a speech interface for UAV control. Our goal is to perform unconstrained speech recognition while leveraging the visual context. To this end, we introduce a multimodal evaluation dataset, consisting of spoken commands and associated images, which represent the visual context of what the UAV "sees" when the pilot utters the command. We provide baseline results and address two main research directions. First, we investigate the robustness of the system by (i) training it with a partial list of commands, and (ii) corrupting the recordings with outdoor noise. We perform a controlled set of experiments by varying the size of the training data and the signal-to-noise ratio. Second, we look at how to incorporate visual information into our model. We show that we can incorporate visual cues in the pipeline through the language model, which we implemented using a recurrent neural network. Moreover, by using gradient activation maps the system can provide visual feedback to the pilot regarding the UAV's understanding of the command. Our conclusions are that multimodal speech recognition can beAbstract: Unmanned aerial vehicles (UAVs) are becoming widespread with applications ranging from film-making and journalism to rescue operations and surveillance. Research communities (speech processing, computer vision, control) are starting to explore the limits of UAVs, but their efforts remain somewhat isolated. In this paper we unify multiple modalities (speech, vision, language) into a speech interface for UAV control. Our goal is to perform unconstrained speech recognition while leveraging the visual context. To this end, we introduce a multimodal evaluation dataset, consisting of spoken commands and associated images, which represent the visual context of what the UAV "sees" when the pilot utters the command. We provide baseline results and address two main research directions. First, we investigate the robustness of the system by (i) training it with a partial list of commands, and (ii) corrupting the recordings with outdoor noise. We perform a controlled set of experiments by varying the size of the training data and the signal-to-noise ratio. Second, we look at how to incorporate visual information into our model. We show that we can incorporate visual cues in the pipeline through the language model, which we implemented using a recurrent neural network. Moreover, by using gradient activation maps the system can provide visual feedback to the pilot regarding the UAV's understanding of the command. Our conclusions are that multimodal speech recognition can be successfully used in this scenario and that visual information helps especially when the noise level is high. The dataset and our code are available at http://kite.speed.pub.ro . Graphical abstract: Highlights: Unmanned aerial vehicles (drones) can be controlled with voice commands. Images captured from the drone can improve speech recognition of uttered commands. Visual channel is incorporated in the speech recognizer through its language model. Explanations of transcriptions are provided by highlighting regions in the image. … (more)
- Is Part Of:
- Computers & electrical engineering. Volume 90(2021)
- Journal:
- Computers & electrical engineering
- Issue:
- Volume 90(2021)
- Issue Display:
- Volume 90, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 90
- Issue:
- 2021
- Issue Sort Value:
- 2021-0090-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-03
- Subjects:
- Automatic speech recognition -- Multimodal learning -- Domain adaptation -- Unmanned aerial vehicles
Computer engineering -- Periodicals
Electrical engineering -- Periodicals
Electrical engineering -- Data processing -- Periodicals
Ordinateurs -- Conception et construction -- Périodiques
Électrotechnique -- Périodiques
Électrotechnique -- Informatique -- Périodiques
Computer engineering
Electrical engineering
Electrical engineering -- Data processing
Periodicals
Electronic journals
621.302854 - Journal URLs:
- http://www.sciencedirect.com/science/journal/00457906/ ↗
http://www.elsevier.com/journals ↗ - DOI:
- 10.1016/j.compeleceng.2020.106943 ↗
- Languages:
- English
- ISSNs:
- 0045-7906
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 3394.680000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store - Ingest File:
- 16719.xml