A clustering approach to extract data from HTML tables. Issue 6 (November 2021)
- Record Type:
- Journal Article
- Title:
- A clustering approach to extract data from HTML tables. Issue 6 (November 2021)
- Main Title:
- A clustering approach to extract data from HTML tables
- Authors:
- Jiménez, Patricia
Roldán, Juan C.
Corchuelo, Rafael
- Abstract:
- HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial, given the many different layouts, encodings, and formats available. In this article, we introduce Melva, an unsupervised, domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3 000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency.
Highlights: User-friendly HTML tables are a popular means to publish data. It is difficult for software agents to leverage them automatically. We present a new method to extract their data automatically. Its approach is totally unsupervised and builds on genetic clustering. It is as effective as the best supervised proposal, but far more efficient.
- Is Part Of:
- Information processing & management. Volume 58:Issue 6(2021)
- Journal:
- Information processing & management
- Issue:
- Volume 58:Issue 6(2021)
- Issue Display:
- Volume 58, Issue 6 (2021)
- Year:
- 2021
- Volume:
- 58
- Issue:
- 6
- Issue Sort Value:
- 2021-0058-0006-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-11
- Subjects:
- HTML tables -- Data extraction -- Clustering -- Genetic algorithms
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2021.102683
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 19867.xml