Citation segmentation from sparse & noisy data: A joint inference approach with Markov logic networks. (8th December 2014)
- Record Type:
- Journal Article
- Title:
- Citation segmentation from sparse & noisy data: A joint inference approach with Markov logic networks. (8th December 2014)
- Main Title:
- Citation segmentation from sparse & noisy data: A joint inference approach with Markov logic networks
- Authors:
- Heckmann, Dustin
Frank, Anette
Arnold, Matthias
Gietz, Peter
Roth, Christian
- Abstract:
- This article presents an approach to citation segmentation that addresses special challenges as typically found in Digital Humanities applications. We perform citation segmentation from Optical Character Recognition (OCR) input obtained from volumes of a printed bibliography, the Turkology Annual. This showcase application features serious difficulties for state-of-the-art techniques in citation segmentation: multilingual citation entries, lack of data redundancy, inconsistencies, and noise from OCR input. Our approach is based on Markov logic networks (MLN) (Richardson and Domingos, Markov logic networks. Machine Learning, 62 (1): 107–36, 2006), a framework of statistical relational learning that combines first-order logic with probabilistic modeling. Formalization in first-order logic offers high expressivity and flexibility, and makes it possible to tailor segmentation to specific conventions of a given bibliography. We show that in face of the specific difficulties found with segmenting references from a digitized bibliography, our MLN formalizations outperform state-of-the-art statistical methods. We obtain 88% F1-score for exact field match, a 24.8% increase over a conditional random fields-based system baseline. In contrast to prior work, we address a data set featuring sparse and noisy data. Our method extends Poon and Domingos (Joint Inference in information extraction. In Proceedings of the Twenty-Second National Conference on Artificial Intelligence. Vancouver, Canada: AAAI Press, 2007)'s approach by applying joint inference at the field level. By this move, we are able to cope with the lack of citation redundancy and noise in the data. Our approach can be characterized as knowledge-based and hence does not rely on annotated training data. The rule sets we designed can be adapted to other bibliographies, or further types of digitized sources, such as historical dictionaries or encyclopedias.
- Is Part Of:
- Digital scholarship in the humanities. Volume 31:Number 2(2016)
- Journal:
- Digital scholarship in the humanities
- Issue:
- Volume 31:Number 2(2016)
- Issue Display:
- Volume 31, Issue 2 (2016)
- Year:
- 2016
- Volume:
- 31
- Issue:
- 2
- Issue Sort Value:
- 2016-0031-0002-0000
- Page Start:
- 333
- Page End:
- 356
- Publication Date:
- 2014-12-08
- Subjects:
- Philology -- Data processing -- Periodicals
Computational linguistics -- Periodicals
410.285
- Journal URLs:
- http://www.oxfordjournals.org/
http://dsh.oxfordjournals.org/
- DOI:
- 10.1093/llc/fqu061
- Languages:
- English
- ISSNs:
- 2055-768X
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 12981.xml