Supervised machine learning for text analysis in R. (2021)
- Record Type:
- Book
- Title:
- Supervised machine learning for text analysis in R. (2021)
- Main Title:
- Supervised machine learning for text analysis in R
- Further Information:
- Note: Emil Hvitfeldt, Julia Silge.
- Authors:
- Hvitfeldt, Emil
Silge, Julia
- Contents:
- I Natural Language Features
  1. Language and modeling: Linguistics for text analysis; A glimpse into one area: morphology; Different languages; Other ways text can vary; Summary
  2. Tokenization: What is a token?; Types of tokens; Character tokens; Word tokens; Tokenizing by n-grams; Lines, sentence, and paragraph tokens; Where does tokenization break down?; Building your own tokenizer; Tokenize to characters, only keeping letters; Allow for hyphenated words; Wrapping it in a function; Tokenization for non-Latin alphabets; Tokenization benchmark; Summary
  3. Stop words: Using premade stop word lists; Stop word removal in R; Creating your own stop words list; All stop word lists are context-specific; What happens when you remove stop words; Stop words in languages other than English; Summary
  4. Stemming: How to stem text in R; Should you use stemming at all?; Understand a stemming algorithm; Handling punctuation when stemming; Compare some stemming options; Lemmatization and stemming; Stemming and stop words; Summary
  5. Word Embeddings: Motivating embeddings for sparse, high-dimensional data; Understand word embeddings by finding them yourself; Exploring CFPB word embeddings; Use pre-trained word embeddings; Fairness and word embeddings; Using word embeddings in the real world; Summary
- II Machine Learning Methods
  Regression: A first regression model; Building our first regression model; Evaluation; Compare to the null model; Compare to a random forest model; Case study: removing stop words; Case study: varying n-grams; Case study: lemmatization; Case study: feature hashing; Text normalization; What evaluation metrics are appropriate?; The full game: regression; Preprocess the data; Specify the model; Tune the model; Evaluate the modeling; Summary
  Classification: A first classification model; Building our first classification model; Evaluation; Compare to the null model; Compare to a lasso classification model; Tuning lasso hyperparameters; Case study: sparse encoding; Two class or multiclass?; Case study: including non-text data; Case study: data censoring; Case study: custom features; Detect credit cards; Calculate percentage censoring; Detect monetary amounts; What evaluation metrics are appropriate?; The full game: classification; Feature selection; Specify the model; Evaluate the modeling; Summary
- III Deep Learning Methods
  Dense neural networks: Kickstarter data; A first deep learning model; Preprocessing for deep learning; One-hot sequence embedding of text; Simple flattened dense network; Evaluation; Using bag-of-words features; Using pre-trained word embeddings; Cross-validation for deep learning models; Compare and evaluate DNN models; Limitations of deep learning; Summary
  Long short-term memory (LSTM) networks: A first LSTM model; Building an LSTM; Evaluation; Compare to a recurrent neural network; Case study: bidirectional LSTM; Case study: stacking LSTM layers; Case study: padding; Case study: training a regression model; Case study: vocabulary size; The full game: LSTM; Preprocess the data; Specify the model; Summary
  Convolutional neural networks: What are CNNs?; Kernel; Kernel size; A first CNN model; Case study: adding more layers; Case study: byte pair encoding; Case study: explainability with LIME; Case study: hyperparameter search; The full game: CNN; Preprocess the data; Specify the model; Summary
- IV Conclusion
  Text models in the real world
- Appendix A Regular expressions: Literal characters; Meta characters; Full stop, the wildcard; Character classes; Shorthand character classes; Quantifiers; Anchors; Additional resources
- Appendix B Data: Hans Christian Andersen fairy tales; Opinions of the Supreme Court of the United States; Consumer Financial Protection Bureau (CFPB) complaints; Kickstarter campaign blurbs
- Appendix C Baseline linear classifier: Read in the data; Split into test/train and create resampling folds; Recipe for data preprocessing; Lasso regularized classification model; A model workflow; Tune the workflow … (more)
- Edition:
- 1st
- Publisher Details:
- Boca Raton : Chapman & Hall/CRC
- Publication Date:
- 2021
- Extent:
- 1 online resource, illustrations (black and white, and colour)
- Subjects:
- 006.35
Computational linguistics -- Statistical methods
Natural language processing (Computer science)
Supervised learning (Machine learning)
Predictive analytics
R (Computer program language)
- Languages:
- English
- ISBNs:
- 9781000461992
9781000461978
9781003093459
- Related ISBNs:
- 9780367554187
9780367554194
- Notes:
- Note: Includes bibliographical references and index.
Note: Description based on CIP data; resource not viewed.
- Access Rights:
- Legal Deposit; Only available on premises controlled by the deposit library and to one user at any one time; The Legal Deposit Libraries (Non-Print Works) Regulations (UK).
- Access Usage:
- Restricted: Printing from this resource is governed by The Legal Deposit Libraries (Non-Print Works) Regulations (UK) and UK copyright law currently in force.
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD.DS.644014
- Ingest File:
- 06_038.xml