Synthesising visual speech using dynamic visemes and deep learning architectures. (May 2019)
- Record Type:
- Journal Article
- Title:
- Synthesising visual speech using dynamic visemes and deep learning architectures. (May 2019)
- Main Title:
- Synthesising visual speech using dynamic visemes and deep learning architectures
- Authors:
- Thangthai, Ausdang
Milner, Ben
Taylor, Sarah
- Abstract:
- Highlights: We propose a method to synthesise visual speech from linguistic features. Best performance is found with dynamic visemes and an LSTM many-to-many architecture. Using subjective tests to compare to other techniques, the proposed method produces more natural animations.
Abstract: This paper proposes and compares a range of methods to improve the naturalness of visual speech synthesis. A feedforward deep neural network (DNN) and many-to-one and many-to-many recurrent neural networks (RNNs) using long short-term memory (LSTM) are considered. Rather than using acoustically derived units of speech, such as phonemes, viseme representations are considered and we propose using dynamic visemes together with a deep learning framework. The input feature representation to the models is also investigated and we determine that including wide phoneme and viseme contexts is crucial for predicting realistic lip motions that are sufficiently smooth but not under-articulated. A detailed objective evaluation across a range of system configurations shows that a combined dynamic viseme-phoneme speech unit combined with a many-to-many encoder-decoder architecture models visual co-articulations effectively. Subjective preference tests reveal there to be no significant difference between animations produced using this system and using ground truth facial motion taken from the original video. Furthermore, the dynamic viseme system also outperforms significantly conventional phoneme-driven speech animation systems.
- Is Part Of:
- Computer speech & language. Volume 55(2019)
- Journal:
- Computer speech & language
- Issue:
- Volume 55(2019)
- Issue Display:
- Volume 55, Issue 2019 (2019)
- Year:
- 2019
- Volume:
- 55
- Issue:
- 2019
- Issue Sort Value:
- 2019-0055-2019-0000
- Page Start:
- 101
- Page End:
- 119
- Publication Date:
- 2019-05
- Subjects:
- Talking head -- Visual speech synthesis -- Deep neural network -- Dynamic visemes
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2018.11.003
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 9439.xml