Context-aware correction of spelling errors in Hungarian medical documents. (January 2016)
- Record Type:
- Journal Article
- Title:
- Context-aware correction of spelling errors in Hungarian medical documents. (January 2016)
- Main Title:
- Context-aware correction of spelling errors in Hungarian medical documents
- Authors:
- Siklósi, Borbála
Novák, Attila
Prószéky, Gábor
- Abstract:
- Highlights: We propose two methods to automatically correct Hungarian clinical text. Method 1 generates a ranked list of correction candidates disregarding context. Method 2 uses an SMT decoder to implement context-aware error correction. Method 1 is integrated into Method 2. Method 2 outperforms Method 1 with regard to correction accuracy.
Abstract: Owing to the growing need of acquiring medical data from clinical records, processing such documents is an important topic in natural language processing (NLP). However, for general NLP methods to work, a proper, normalized input is required. Otherwise the system is overwhelmed by the unusually high amount of noise generally characteristic of this kind of text. The different types of this noise originate from non-standard language use: short fragments instead of proper sentences, usage of Latin words, many acronyms and very frequent misspellings. In this paper, a method is described for the automated correction of spelling errors in Hungarian clinical records. First, a word-based algorithm was implemented to generate a ranked list of correction candidates for word forms regarded as incorrect. Second, the problem of spelling correction was modelled as a translation task, where the source language is the erroneous text and the target language is the corrected one. A Statistical Machine Translation (SMT) decoder performed the task of error correction. Since no orthographically correct proofread text from this domain is available, we could not use such a corpus for training the system. Instead, the word-based system was used to create translation models. In addition, a 3-gram token-based language model was used to model lexical context. Due to the high number of abbreviations and acronyms in the texts, the behaviour of these abbreviated forms was further examined both in the case of the context-unaware word-based and the SMT-decoder-based implementations. The results show that the SMT-based method outperforms the first candidate accuracy of the word-based ranking system. However, the normalization of abbreviations should be handled as a separate task.
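The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: an exhaustive search over candidate combinations stands in for the SMT decoder, candidates are ranked by plain edit distance and word frequency, the 3-gram model uses add-one smoothing, and the toy English corpus and all function names (`candidates`, `correct`, `train`) are assumptions made for illustration. Hungarian morphology and abbreviation handling are ignored.

```python
import math
from collections import Counter
from itertools import product


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def candidates(word, lexicon, max_dist=2, n=3):
    """Context-unaware step: rank lexicon words by distance, then by frequency."""
    if word in lexicon:
        return [word]
    scored = sorted((edit_distance(word, w), -freq, w) for w, freq in lexicon.items())
    ranked = [w for d, _, w in scored if d <= max_dist]
    return ranked[:n] or [word]  # keep the word itself if nothing is close


def train(sentences):
    """Collect word frequencies plus trigram and context-bigram counts."""
    lexicon, trigrams, bigrams = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        lexicon.update(tokens)
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(2, len(padded)):
            tri = tuple(padded[i - 2:i + 1])
            trigrams[tri] += 1
            bigrams[tri[:2]] += 1
    return lexicon, trigrams, bigrams


def trigram_logprob(tokens, trigrams, bigrams, vocab_size):
    """Add-one smoothed 3-gram log-probability of a token sequence."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        tri = tuple(padded[i - 2:i + 1])
        total += math.log((trigrams[tri] + 1) / (bigrams[tri[:2]] + vocab_size))
    return total


def correct(tokens, lexicon, trigrams, bigrams):
    """Context-aware step: pick the candidate combination the 3-gram model prefers.

    Exhaustive search is a stand-in for a decoder and only suits short inputs.
    """
    options = [candidates(t, lexicon) for t in tokens]
    vocab_size = len(lexicon) + 2  # words plus the two boundary symbols
    best = max(product(*options),
               key=lambda seq: trigram_logprob(seq, trigrams, bigrams, vocab_size))
    return list(best)


# Toy corpus (invented for this example, not from the paper).
lexicon, trigrams, bigrams = train([
    "left eye vision normal",
    "left ear canal clear",
    "right ear canal clear",
])

print(candidates("eyr", lexicon))  # context-unaware ranking
print(correct("left eyr vision normal".split(), lexicon, trigrams, bigrams))
```

In this toy setting the misspelling "eyr" is equally close to "ear" and "eye", so the context-unaware ranking falls back on frequency and puts "ear" first, while the 3-gram context of "vision" lets the second step choose "eye".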
- Is Part Of:
- Computer speech & language. Volume 35(2016)
- Journal:
- Computer speech & language
- Issue:
- Volume 35(2016)
- Issue Display:
- Volume 35, Issue 2016 (2016)
- Year:
- 2016
- Volume:
- 35
- Issue:
- 2016
- Issue Sort Value:
- 2016-0035-2016-0000
- Page Start:
- 219
- Page End:
- 233
- Publication Date:
- 2016-01
- Subjects:
- Spelling correction -- Medical text processing -- Agglutinating languages
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2014.09.001
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 8942.xml