A hadoop based platform for natural language processing of web pages and documents. (December 2015)
- Record Type:
- Journal Article
- Title:
- A hadoop based platform for natural language processing of web pages and documents. (December 2015)
- Main Title:
- A hadoop based platform for natural language processing of web pages and documents
- Authors:
- Nesi, Paolo
Pantaleo, Gianni
Sanesi, Gianmarco
- Abstract:
- The rapid and extensive pervasion of information through the web has enhanced the diffusion of a huge amount of unstructured natural language textual resources. A great interest has arisen in the last decade for discovering, accessing and sharing such a vast source of knowledge. For this reason, processing very large data volumes in a reasonable time frame is becoming a major challenge and a crucial requirement for many commercial and research fields. Distributed systems, computer clusters and parallel computing paradigms have been increasingly applied in the recent years, since they introduced significant improvements for computing performance in data-intensive contexts, such as Big Data mining and analysis. Natural Language Processing, and particularly the tasks of text annotation and key feature extraction, is an application area with high computational requirements; therefore, these tasks can significantly benefit of parallel architectures. This paper presents a distributed framework for crawling web documents and running Natural Language Processing tasks in a parallel fashion. The system is based on the Apache Hadoop ecosystem and its parallel programming paradigm, called MapReduce. In the specific, we implemented a MapReduce adaptation of a GATE application and framework (a widely used open source tool for text engineering and NLP). A validation is also offered in using the solution for extracting keywords and keyphrase from web documents in a multi-node Hadoop cluster. Evaluation of performance scalability has been conducted against a real corpus of web pages and documents.
- Is Part Of:
- Journal of visual languages & computing. Volume 31:Part B(2016)
- Journal:
- Journal of visual languages & computing
- Issue:
- Volume 31:Part B(2016)
- Issue Display:
- Volume 31, Issue 2 (2016)
- Year:
- 2016
- Volume:
- 31
- Issue:
- 2
- Issue Sort Value:
- 2016-0031-0002-0000
- Page Start:
- 130
- Page End:
- 138
- Publication Date:
- 2015-12
- Subjects:
- Natural language processing -- Hadoop -- Part-of-speech tagging -- Text parsing -- Web crawling -- Big Data Mining -- Parallel computing -- Distributed systems
Visual programming languages (Computer science) -- Periodicals
Visual programming (Computer science) -- Periodicals
Programming languages (Electronic computers) -- Semantics -- Periodicals
Langages de programmation visuelle -- Périodiques
Programmation visuelle -- Périodiques
Langages de programmation -- Sémantique -- Périodiques
Programming languages (Electronic computers) -- Semantics
Visual programming (Computer science)
Visual programming languages (Computer science)
Periodicals
Electronic journals
005
- Journal URLs:
- http://www.sciencedirect.com/science/journal/1045926X
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.jvlc.2015.10.017
- Languages:
- English
- ISSNs:
- 1045-926X
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 5072.495200
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 1260.xml