A practical guide to text mining with topic extraction. (3rd August 2015)
- Record Type:
- Journal Article
- Title:
- A practical guide to text mining with topic extraction. (3rd August 2015)
- Main Title:
- A practical guide to text mining with topic extraction
- Authors:
- Karl, Andrew
Wisnowski, James
Rushing, W. Heath
- Abstract:
- Text analytics continue to proliferate as mass volumes of unstructured but highly useful data are generated at unbounded rates. Vector space models for text data, in which documents are represented by rows and words by columns, provide a translation of this unstructured data into a format that may be analyzed with statistical and machine learning techniques. This approach gives excellent results in revealing common themes, clustering documents, clustering words, and in translating unstructured text fields (such as an open-ended survey response) to usable input variables for predictive modeling. After discussing the collection and processing of text, we explore properties and transformations of the document-term matrix (DTM). We show how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation. This latent semantic analysis (LSA) approach produces factors that are compatible with graphical exploration and advanced analytics. We also explore Latent Dirichlet Allocation for topic analysis. We reference published R packages to implement the methods and conclude with a summary of other popular open-source and commercial software packages. WIREs Comput Stat 2015, 7:326–340. doi: 10.1002/wics.1361. For further resources related to this article, please visit the WIREs website: http://wires.wiley.com/remdoi.cgi?doi=10.1002/wics.1361
- Is Part Of:
- Wiley interdisciplinary reviews. Volume 7:Number 5(2015)
- Journal:
- Wiley interdisciplinary reviews
- Issue:
- Volume 7:Number 5(2015)
- Issue Display:
- Volume 7, Issue 5 (2015)
- Year:
- 2015
- Volume:
- 7
- Issue:
- 5
- Issue Sort Value:
- 2015-0007-0005-0000
- Page Start:
- 326
- Page End:
- 340
- Publication Date:
- 2015-08-03
- Subjects:
- Mathematical statistics -- Data processing -- Periodicals
Science -- Data processing -- Periodicals
Social sciences -- Data processing -- Periodicals
Mathematical statistics -- Periodicals
519.50285
- Journal URLs:
- http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1939-0068
http://www3.interscience.wiley.com/journal/122458798/home
http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/wics.1361
- Languages:
- English
- ISSNs:
- 1939-5108
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 3210.xml