Generalisation in named entity recognition: A quantitative analysis. (July 2017)
- Record Type:
- Journal Article
- Title:
- Generalisation in named entity recognition: A quantitative analysis. (July 2017)
- Main Title:
- Generalisation in named entity recognition: A quantitative analysis
- Authors:
- Augenstein, Isabelle
Derczynski, Leon
Bontcheva, Kalina
- Abstract:
- Highlights: Quantitative study of NER performance in diverse corpora of different genres, including newswire and social media. Multiple state of the art NER approaches are tested. Possible reasons for NER failure are analysed and quantified: NE diversity, unseen NEs and features, NEs changing over time. The proportion of unseen NEs is found to be the most reliable predictor of NE performance. Future NER work needs to address named entity generalisation and out-of-vocabulary lexical forms.
Abstract: Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs, in particular, play an important role, which have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres as compared to more regular ones. We also find that leading systems rely heavily on surface forms found in training data, having problems generalising beyond these, and offer explanations for this observation.
- Is Part Of:
- Computer speech & language. Volume 44(2017)
- Journal:
- Computer speech & language
- Issue:
- Volume 44(2017)
- Issue Display:
- Volume 44, Issue 2017 (2017)
- Year:
- 2017
- Volume:
- 44
- Issue:
- 2017
- Issue Sort Value:
- 2017-0044-2017-0000
- Page Start:
- 61
- Page End:
- 83
- Publication Date:
- 2017-07
- Subjects:
- Natural language processing -- Information extraction -- Named entity recognition -- Generalisation -- Entity drift -- Social media -- Quantitative study
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2017.01.012
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 371.xml