Cross database audio visual speech adaptation for phonetic spoken term detection. (July 2017)
- Record Type:
- Journal Article
- Title:
- Cross database audio visual speech adaptation for phonetic spoken term detection. (July 2017)
- Main Title:
- Cross database audio visual speech adaptation for phonetic spoken term detection
- Authors:
- Kalantari, Shahram
Dean, David
Sridharan, Sridha
- Abstract:
- Highlights: We show that the use of visual information helps both phone recognition and spoken term detection accuracy. Fused HMM adaptation can be utilized to benefit from multiple databases when training audio-visual phone models. An additional audio adaptation improves cross-database training accuracy for phone recognition and spoken term detection. A post-training step can be used to update all HMM parameters and further improve phone recognition accuracy.
Abstract: Spoken term detection (STD), the process of finding all occurrences of a specified search term in a large amount of speech segments, has many applications in multimedia search and information retrieval. It is known that the use of video information in the form of lip movements can improve the performance of STD in the presence of audio noise. However, research in this direction has been hampered by the unavailability of large annotated audio-visual databases for development. We propose a novel approach to developing audio-visual spoken term detection when only a small (low-resource) audio-visual database is available for development. First, cross-database training is proposed as a novel framework using the fused hidden Markov model (HMM) technique: an audio model is trained on large, publicly available audio databases and then adapted to the visual data of the given audio-visual database. This approach is shown to perform better than the standard HMM joint-training method and also improves the performance of spoken term detection when used in the indexing stage. In another approach, the external audio models are first adapted to the audio data of the given audio-visual database and then to its visual data; this also improves both phone recognition and spoken term detection accuracy. Finally, the cross-database training technique is used as HMM initialization, and an extra parameter re-estimation step is applied to the initialized models using the Baum-Welch technique. The proposed approaches to audio-visual model training allow benefiting from both the large out-of-domain audio databases that are available and the small audio-visual database given for development, creating more accurate audio-visual models. … (more)
- Is Part Of:
- Computer speech & language. Volume 44 (2017)
- Journal:
- Computer speech & language
- Issue:
- Volume 44 (2017)
- Issue Display:
- Volume 44, Issue 2017 (2017)
- Year:
- 2017
- Volume:
- 44
- Issue:
- 2017
- Issue Sort Value:
- 2017-0044-2017-0000
- Page Start:
- 1
- Page End:
- 21
- Publication Date:
- 2017-07
- Subjects:
- Spoken term detection -- Synchronous hidden Markov model -- Cross-database training -- Phone recognition
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2016.09.001
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 10739.xml