An effective and efficient Web content extractor for optimizing the crawling process. (27th March 2013)
- Record Type:
- Journal Article
- Title:
- An effective and efficient Web content extractor for optimizing the crawling process. (27th March 2013)
- Main Title:
- An effective and efficient Web content extractor for optimizing the crawling process
- Authors:
- Uzun, Erdinç
Serdar Güner, Edip
Kılıçaslan, Yılmaz
Yerlikaya, Tarık
Agun, Hayri Volkan - Abstract:
- <abstract abstract-type="main"> <title>SUMMARY</title> <p>Classical Web crawlers make use of only hyperlink information in the crawling process. However, focused crawlers are intended to download only Web pages that are relevant to a given topic by utilizing word information before downloading the Web page. But, Web pages contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), the backbone of which is a Web content extractor that automatically pulls content out of seven different blocks: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts from Web pages. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using the decision tree learning algorithm. Being guided by numerous sources of information, the crawler becomes considerably effective. It achieved a relatively high accuracy of 96.37% in our experiments of block extraction. In the second step, the crawler extracts content from the blocks using string matching functions. These functions along with the mapping between tags and blocks learned in the first step provide iCrawler with considerable time and storage efficiency. More specifically, iCrawler performs 14 times faster in the second step than in the first step. Furthermore, iCrawler significantly decreases storage costs by 57.10%<abstract abstract-type="main"> <title>SUMMARY</title> <p>Classical Web crawlers make use of only hyperlink information in the crawling process. However, focused crawlers are intended to download only Web pages that are relevant to a given topic by utilizing word information before downloading the Web page. But, Web pages contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), the backbone of which is a Web content extractor that automatically pulls content out of seven different blocks: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts from Web pages. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using the decision tree learning algorithm. Being guided by numerous sources of information, the crawler becomes considerably effective. It achieved a relatively high accuracy of 96.37% in our experiments of block extraction. In the second step, the crawler extracts content from the blocks using string matching functions. These functions along with the mapping between tags and blocks learned in the first step provide iCrawler with considerable time and storage efficiency. More specifically, iCrawler performs 14 times faster in the second step than in the first step. Furthermore, iCrawler significantly decreases storage costs by 57.10% when compared with the texts obtained through classical HTML stripping. Copyright © 2013 John Wiley &amp; Sons, Ltd.</p> </abstract> … (more)
- Is Part Of:
- Software, practice & experience. Volume 44:Number 10(2014)
- Journal:
- Software, practice & experience
- Issue:
- Volume 44:Number 10(2014)
- Issue Display:
- Volume 44, Issue 10 (2014)
- Year:
- 2014
- Volume:
- 44
- Issue:
- 10
- Issue Sort Value:
- 2014-0044-0010-0000
- Page Start:
- 1181
- Page End:
- 1199
- Publication Date:
- 2013-03-27
- Subjects:
- Computer software -- Periodicals
Computer programming -- Periodicals
Computer programs -- Periodicals
005.3 - Journal URLs:
- http://onlinelibrary.wiley.com/ ↗
- DOI:
- 10.1002/spe.2195 ↗
- Languages:
- English
- ISSNs:
- 0038-0644
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 8321.453000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store - Ingest File:
- 3034.xml