An efficient scheme for automatic web pages categorization using the support vector machine. Issue 3 (2nd July 2016)
- Record Type:
- Journal Article
- Title:
- An efficient scheme for automatic web pages categorization using the support vector machine. Issue 3 (2nd July 2016)
- Main Title:
- An efficient scheme for automatic web pages categorization using the support vector machine
- Authors:
- Bhalla, Vinod Kumar
Kumar, Neeraj
- Abstract:
- In the past few years, with an evolution of the Internet and related technologies, the number of the Internet users grows exponentially. These users demand access to relevant web pages from the Internet within fraction of seconds. To achieve this goal, there is a requirement of an efficient categorization of web page contents. Manual categorization of these billions of web pages to achieve high accuracy is a challenging task. Most of the existing techniques reported in the literature are semi-automatic. Using these techniques, higher level of accuracy cannot be achieved. To achieve these goals, this paper proposes an automatic web pages categorization into the domain category. The proposed scheme is based on the identification of specific and relevant features of the web pages. In the proposed scheme, first extraction and evaluation of features are done followed by filtering the feature set for categorization of domain web pages. A feature extraction tool based on the HTML document object model of the web page is developed in the proposed scheme. Feature extraction and weight assignment are based on the collection of domain-specific keyword list developed by considering various domain pages. Moreover, the keyword list is reduced on the basis of ids of keywords in keyword list. Also, stemming of keywords and tag text is done to achieve a higher accuracy. An extensive feature set is generated to develop a robust classification technique. The proposed scheme was evaluated using a machine learning method in combination with feature extraction and statistical analysis using support vector machine kernel as the classification tool. The results obtained confirm the effectiveness of the proposed scheme in terms of its accuracy in different categories of web pages.
- Is Part Of:
- New review of hypermedia and multimedia. Volume 22:Issue 3(2016)
- Journal:
- New review of hypermedia and multimedia
- Issue:
- Volume 22:Issue 3(2016)
- Issue Display:
- Volume 22, Issue 3 (2016)
- Year:
- 2016
- Volume:
- 22
- Issue:
- 3
- Issue Sort Value:
- 2016-0022-0003-0000
- Page Start:
- 223
- Page End:
- 242
- Publication Date:
- 2016-07-02
- Subjects:
- Web page categorization -- support vector machine -- machine learning -- classification and extraction
Hypertext systems -- Periodicals
Interactive multimedia -- Periodicals
Multimedia systems -- Periodicals
005.75
- Journal URLs:
- http://www.tandfonline.com/loi/tham20
http://www.tandfonline.com/
- DOI:
- 10.1080/13614568.2016.1152316
- Languages:
- English
- ISSNs:
- 1361-4568
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 6087.764530
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 1062.xml