MuST-C: A multilingual corpus for end-to-end speech translation. (March 2021)
- Record Type:
- Journal Article
- Title:
- MuST-C: A multilingual corpus for end-to-end speech translation. (March 2021)
- Main Title:
- MuST-C: A multilingual corpus for end-to-end speech translation
- Authors:
- Cattoni, Roldano
Di Gangi, Mattia Antonino
Bentivogli, Luisa
Negri, Matteo
Turchi, Marco
- Abstract:
- Highlights: Problem: end-to-end speech translation requires large corpora to train neural models. Contribution: MuST-C is a large multilingual corpus built from English TED Talks. Corpus content: English speech, aligned transcription/translations in 14 languages. Other key features: high topic and speaker variety, large size, free distribution. Discussion: empirical/manual quality evaluation, baseline results on all languages.
Abstract: End-to-end spoken language translation (SLT) has recently gained popularity thanks to the advancement of sequence to sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field has to confront with the scarcity of publicly available corpora to train data-hungry neural networks. Indeed, while traditional cascade solutions can build on sizable ASR and MT training data for a variety of languages, the available SLT corpora suitable for end-to-end training are few, typically small and of limited language coverage. We contribute to fill this gap by presenting MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its unique features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. Besides describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations, we present baseline results computed with strong systems on each language direction covered by MuST-C.
- Is Part Of:
- Computer speech & language. Volume 66(2021)
- Journal:
- Computer speech & language
- Issue:
- Volume 66(2021)
- Issue Display:
- Volume 66, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 66
- Issue:
- 2021
- Issue Sort Value:
- 2021-0066-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-03
- Subjects:
- Spoken language translation -- Multilingual corpus
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2020.101155
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 15413.xml