Synthesizing Talking Faces from Text and Audio: An Autoencoder and Sequence-to-Sequence Convolutional Neural Network. (June 2020)
- Record Type:
- Journal Article
- Title:
- Synthesizing Talking Faces from Text and Audio: An Autoencoder and Sequence-to-Sequence Convolutional Neural Network. (June 2020)
- Main Title:
- Synthesizing Talking Faces from Text and Audio: An Autoencoder and Sequence-to-Sequence Convolutional Neural Network
- Authors:
- Liu, Na
Zhou, Tao
Ji, Yunfeng
Zhao, Ziyi
Wan, Lihong
- Abstract:
- Highlights: An effective landmark localization pipeline based on landmark detection, optical flow estimation, and Kalman filtering is proposed to avoid face shake. A part-based autoencoder is introduced to learn low-dimensional representations of different face regions. A sequence-to-sequence convolutional neural network with residual units is proposed to learn the mapping from phonemes to facial codes. Tests on two public audio-visual datasets and a new dataset called Chinese CCTV News demonstrate the effectiveness of the proposed method against other state-of-the-art methods.
Abstract: Synthesizing talking faces from text and audio is increasingly becoming a direction in human-machine and face-to-face interactions. Although progress has been made, several existing methods either model co-articulation unsatisfactorily or ignore relations between adjacent inputs. Moreover, some of these methods train models on shaky head videos or use linear face parameterization strategies, which further decrease synthesis quality. To address these issues, this study proposes a sequence-to-sequence convolutional neural network to automatically synthesize talking face video with accurate lip sync. First, an advanced landmark location pipeline is used to accurately locate the facial landmarks, which effectively reduces landmark shake. Then, a part-based autoencoder is presented to encode face images into a low-dimensional space and obtain compact representations. A sequence-to-sequence network is also presented to encode the relation of neighboring frames with multiple loss functions, and talking faces are synthesized through a reconstruction strategy with a decoder. Experiments on two public audio-visual datasets and a new dataset called CCTV news demonstrate the effectiveness of the proposed method against other state-of-the-art methods.
- Is Part Of:
- Pattern recognition. Volume 102(2020:Jun.)
- Journal:
- Pattern recognition
- Issue:
- Volume 102(2020:Jun.)
- Issue Display:
- Volume 102 (2020)
- Year:
- 2020
- Volume:
- 102
- Issue Sort Value:
- 2020-0102-0000-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-06
- Subjects:
- Convolutional neural network -- Autoencoder -- Regression -- Face landmark -- Face tracking -- Lip sync -- Video -- Audio
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2020.107231
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 12955.xml