Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. (22nd April 2019)
- Record Type:
- Journal Article
- Title:
- Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. (22nd April 2019)
- Main Title:
- Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study
- Authors:
- Hill, Mark J
Hengchen, Simon
- Abstract:
- This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.
- Is Part Of:
- Digital scholarship in the humanities. Volume 34:Number 4(2019)
- Journal:
- Digital scholarship in the humanities
- Issue:
- Volume 34:Number 4(2019)
- Issue Display:
- Volume 34, Issue 4 (2019)
- Year:
- 2019
- Volume:
- 34
- Issue:
- 4
- Issue Sort Value:
- 2019-0034-0004-0000
- Page Start:
- 825
- Page End:
- 843
- Publication Date:
- 2019-04-22
- Subjects:
- Philology -- Data processing -- Periodicals
Computational linguistics -- Periodicals
410.285
- Journal URLs:
- http://www.oxfordjournals.org/
http://dsh.oxfordjournals.org/
- DOI:
- 10.1093/llc/fqz024
- Languages:
- English
- ISSNs:
- 2055-768X
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 20860.xml