Stable web scraping: an approach based on neighbour zone and path similarity of page elements. (2018)
- Record Type:
- Journal Article
- Title:
- Stable web scraping: an approach based on neighbour zone and path similarity of page elements. (2018)
- Main Title:
- Stable web scraping: an approach based on neighbour zone and path similarity of page elements
- Authors:
- Gao, Peng
Han, Hao
Guo, Junxia
Saeki, Motoshi
- Abstract:
- Web scraping techniques based on XPath enable users to consistently extract information of interest from webpages that do not provide a structured interface. However, XPath-based extraction is likely to fail when encountering page variants, resulting in a high cost of repair. Countermeasures based on pattern matching or model learning often require careful pre-processing, which is not suitable for cases where the target data is frequently re-designated. In this paper, we present a new extraction method for the stable scraping of arbitrary designated data from webpages. Instead of attempting to find the desired data directly, we first determine its approximate location in the changed page, called the neighbour zone. Then we search for the precise location by ranking the path similarity of page elements within the neighbour zone. Experiments on a large set of real-world webpages show that our method is more stable for web scraping than XPath-based extraction. On the two datasets, the F1-score increased by 0.118 and 0.891, respectively.
- Is Part Of:
- International journal of Web engineering and technology. Volume 13:Number 4(2018)
- Journal:
- International journal of Web engineering and technology
- Issue:
- Volume 13:Number 4(2018)
- Issue Display:
- Volume 13, Issue 4 (2018)
- Year:
- 2018
- Volume:
- 13
- Issue:
- 4
- Issue Sort Value:
- 2018-0013-0004-0000
- Page Start:
- 301
- Page End:
- 333
- Publication Date:
- 2018
- Subjects:
- webpage -- web scraping -- semi-structured data extraction -- XPath expression -- stability -- HTML tree -- node distance -- path similarity
World Wide Web -- Periodicals
Web site development -- Periodicals
Application software -- Development -- Periodicals
006.7
- Journal URLs:
- http://www.inderscience.com/jhome.php?jcode=ijwet
http://www.inderscience.com/
- Languages:
- English
- ISSNs:
- 1476-1289
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 9320.xml
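
The two-step idea described in the abstract (first narrow the search to a neighbour zone, then rank elements inside it by path similarity) can be illustrated with a minimal sketch. This is not the authors' algorithm: the record does not give their definitions, so the sketch assumes that an element is represented by its list of tag names from the root, that the neighbour zone is the set of candidates sharing the deepest common ancestor prefix with the old path, and that path similarity is the `difflib.SequenceMatcher` ratio. The function names `neighbour_zone`, `path_similarity`, and `relocate` are hypothetical.

```python
from difflib import SequenceMatcher


def common_prefix_len(a, b):
    """Number of leading path steps that two tag paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def neighbour_zone(old_path, candidates):
    """Assumed zone: candidates sharing the deepest ancestor prefix with old_path."""
    if not candidates:
        return []
    depth = max(common_prefix_len(old_path, c) for c in candidates)
    return [c for c in candidates if common_prefix_len(old_path, c) == depth]


def path_similarity(a, b):
    """Assumed similarity: SequenceMatcher ratio over the lists of tag names."""
    return SequenceMatcher(None, a, b).ratio()


def relocate(old_path, candidates):
    """Return the most similar path inside the neighbour zone, or None if empty."""
    zone = neighbour_zone(old_path, candidates)
    return max(zone, key=lambda c: path_similarity(old_path, c), default=None)


# Path recorded on the original page.
old = ["html", "body", "div", "ul", "li", "span"]

# Element paths found on a changed page (a <section> wrapper was added).
changed = [
    ["html", "body", "div", "section", "ul", "li", "span"],
    ["html", "body", "div", "section", "ul", "li", "a"],
    ["html", "body", "footer", "span"],
]

print(relocate(old, changed))
```

In this example a fixed path built from `old` would no longer match anything on the changed page, while the sketch still selects the `span` under the new `section` wrapper, because it shares the deepest prefix with the old path and has the highest similarity among the zone's candidates.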