Taris: An online speech recognition framework with sequence to sequence neural networks for both audio-only and audio-visual speech. (July 2022)
- Record Type:
- Journal Article
- Title:
- Taris: An online speech recognition framework with sequence to sequence neural networks for both audio-only and audio-visual speech. (July 2022)
- Main Title:
- Taris: An online speech recognition framework with sequence to sequence neural networks for both audio-only and audio-visual speech
- Authors:
- Sterpu, George
Harte, Naomi
- Abstract:
- It is widely accepted that the visual modality of speech provides complementary information to the speech recognition task, and many models have been introduced in order to make good use of the visual channel. This article develops Taris, a fully differentiable neural network model capable of decoding both audio-only and audio-visual speech in real time. We achieve this by connecting our previously proposed models AV Align and Taris, which are both end-to-end differentiable approaches to audio-visual speech integration and online speech recognition respectively. We evaluate AV Taris under the same conditions as AV Align and Taris on one of the largest publicly available audio-visual speech datasets, LRS2. Our results show that AV Taris is superior to the audio-only variant of Taris, demonstrating the utility of the visual modality to speech recognition within the real time decoding framework defined by Taris. Compared to an equivalent Transformer-based AV Align model that takes advantage of full sentences without meeting the real-time requirement, we report an absolute degradation of approximately 3% with AV Taris. As opposed to the more popular alternative for online speech recognition, namely the RNN Transducer, Taris offers a greatly simplified fully differentiable training pipeline. We speculate that AV Taris has the potential to popularise the adoption of Audio-Visual Speech Recognition (AVSR) technology and overcome the inherent limitations of the audio modality in less optimal listening conditions.
- Is Part Of:
- Computer speech & language. Volume 74(2022)
- Journal:
- Computer speech & language
- Issue:
- Volume 74(2022)
- Issue Display:
- Volume 74, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 74
- Issue:
- 2022
- Issue Sort Value:
- 2022-0074-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-07
- Subjects:
- Online speech recognition -- Audio-visual speech integration -- Learning to count words -- Multimodal speech processing -- Speech recognition
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2022.101349
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 21011.xml