The data science handbook. (2017)
- Record Type:
- Book
- Title:
- The data science handbook. (2017)
- Main Title:
- The data science handbook
- Further Information:
- Note: Field Cady.
- Authors:
- Cady, Field, 1984-
- Contents:
- Preface xvii 1 Introduction: Becoming a Unicorn 1 1.1 Aren’t Data Scientists Just Overpaid Statisticians? 2 1.2 How Is This Book Organized? 3 1.3 How to Use This Book? 3 1.4 Why Is It All in Python, Anyway? 
4 1.5 Example Code and Datasets 4 1.6 Parting Words 5 Part I The Stuff You’ll Always Use 7 2 The Data Science Road Map 9 2.1 Frame the Problem 10 2.2 Understand the Data: Basic Questions 11 2.3 Understand the Data: Data Wrangling 12 2.4 Understand the Data: Exploratory Analysis 13 2.5 Extract Features 14 2.6 Model 15 2.7 Present Results 15 2.8 Deploy Code 16 2.9 Iterating 16 2.10 Glossary 17 3 Programming Languages 19 3.1 Why Use a Programming Language? What Are the Other Options? 19 3.2 A Survey of Programming Languages for Data Science 20 3.3 Python Crash Course 22 3.4 Strings 27 3.5 Defining Functions 32 3.6 Python’s Technical Libraries 37 3.7 Other Python Resources 42 3.8 Further Reading 42 3.9 Glossary 43 3a Interlude: My Personal Toolkit 45 4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning 47 4.1 The Worst Dataset in the World 48 4.2 How to Identify Pathologies 48 4.3 Problems with Data Content 49 4.4 Formatting Issues 51 4.5 Example Formatting Script 54 4.6 Regular Expressions 55 4.7 Life in the Trenches 60 4.8 Glossary 60 5 Visualizations and Simple Metrics 61 5.1 A Note on Python’s Visualization Tools 62 5.2 Example Code 62 5.3 Pie Charts 63 5.4 Bar Charts 65 5.5 Histograms 66 5.6 Means, Standard Deviations, Medians, and Quantiles 69 5.7 Boxplots 70 5.8 Scatterplots 72 5.9 Scatterplots with Logarithmic Axes 74 5.10 Scatter Matrices 76 5.11 Heatmaps 77 5.12 Correlations 78 5.13 Anscombe’s Quartet and the Limits of Numbers 80 5.14 Time Series 80 5.15 Further Reading 85 5.16 Glossary 85 6 Machine Learning Overview 87 6.1 Historical Context 88 6.2 Supervised versus Unsupervised 89 6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting 89 6.4 Further Reading 91 6.5 Glossary 91 7 Interlude: Feature Extraction Ideas 93 7.1 Standard Features 93 7.2 Features That Involve Grouping 94 7.3 Preview of More Sophisticated Features 95 7.4 Defining the Feature You Want to Predict 95 8 Machine Learning Classification 97 8.1 What Is 
a Classifier, and What Can You Do with It? 97 8.2 A Few Practical Concerns 98 8.3 Binary versus Multiclass 99 8.4 Example Script 99 8.5 Specific Classifiers 101 8.6 Evaluating Classifiers 114 8.7 Selecting Classification Cutoffs 117 8.8 Further Reading 119 8.9 Glossary 119 9 Technical Communication and Documentation 121 9.1 Several Guiding Principles 122 9.2 Slide Decks 124 9.3 Written Reports 128 9.4 Speaking: What Has Worked for Me 130 9.5 Code Documentation 131 9.6 Further Reading 132 9.7 Glossary 132 Part II Stuff You Still Need to Know 133 10 Unsupervised Learning: Clustering and Dimensionality Reduction 135 10.1 The Curse of Dimensionality 136 10.2 Example: Eigenfaces for Dimensionality Reduction 138 10.3 Principal Component Analysis and Factor Analysis 140 10.4 Skree Plots and Understanding Dimensionality 142 10.5 Factor Analysis 143 10.6 Limitations of PCA 143 10.7 Clustering 144 10.8 Further Reading 151 10.9 Glossary 151 11 Regression 153 11.1 Example: Predicting Diabetes Progression 153 11.2 Least Squares 156 11.3 Fitting Nonlinear Curves 157 11.4 Goodness of Fit: R2 and Correlation 159 11.5 Correlation of Residuals 160 11.6 Linear Regression 161 11.7 LASSO Regression and Feature Selection 162 11.8 Further Reading 164 11.9 Glossary 164 12 Data Encodings and File Formats 165 12.1 Typical File Format Categories 165 12.2 CSV Files 167 12.3 JSON Files 168 12.4 XML Files 170 12.5 HTML Files 172 12.6 Tar Files 174 12.7 GZip Files 175 12.8 Zip Files 175 12.9 Image Files: Rasterized, Vectorized, and/or Compressed 176 12.10 It’s All Bytes at the End of the Day 177 12.11 Integers 178 12.12 Floats 179 12.13 Text Data 180 12.14 Further Reading 183 12.15 Glossary 183 13 Big Data 185 13.1 What Is Big Data? 
185 13.2 Hadoop: The File System and the Processor 187 13.3 Using HDFS 188 13.4 Example PySpark Script 189 13.5 Spark Overview 190 13.6 Spark Operations 192 13.7 Two Ways to Run PySpark 193 13.8 Configuring Spark 194 13.9 Under the Hood 195 13.10 Spark Tips and Gotchas 196 13.11 The MapReduce Paradigm 197 13.12 Performance Considerations 199 13.13 Further Reading 200 13.14 Glossary 200 14 Databases 203 14.1 Relational Databases and MySQL 204 14.2 Key-Value Stores 210 14.3 Wide Column Stores 211 14.4 Document Stores 211 14.5 Further Reading 214 14.6 Glossary 214 15 Software Engineering Best Practices 217 15.1 Coding Style 217 15.2 Version Control and Git for Data Scientists 220 15.3 Testing Code 222 15.4 Test-Driven Development 225 15.5 AGILE Methodology 225 15.6 Further Reading 226 15.7 Glossary 226 16 Natural Language Processing 229 16.1 Do I Even Need NLP? 229 16.2 The Great Divide: Language versus Statistics 230 16.3 Example: Sentiment Analysis on Stock Market Articles 230 16.4 Software and Datasets 232 16.5 Tokenization 233 16.6 Central Concept: Bag‐of‐Words 233 16.7 Word Weighting: TF‐IDF 235 16.8 n‐Grams 235 16.9 Stop Words 236 16.10 Lemmatization and Stemming 236 16.11 Synonyms 237 16.12 Part of Speech Tagging 237 16.13 Common Problems 238 16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding 240 16.15 Further Reading 241 16.16 Glossary 242 17 Time Series Analysis 243 17.1 Example: Predicting Wikipedia Page Views 244 17.2 A Typical Workflow 247 17.3 Time Series versus Time-Stamped Events 248 17.4 Resampling an Interpolation 249 17.5 Smoothing Signals 251 17.6 Logarithms and Other Transformations 252 17.7 Trends and Periodicity 252 17.8 Windowing 253 17.9 Brainstorming Simple Features 254 17.10 Better Features: Time Series as Vectors 255 17.11 Fourier Analysis: Sometimes a Magic Bullet 256 17.12 Time Series in Context: The Whole Suite of Features 259 17.13 Further Reading 259 17.14 Glossary 260 18 Probability 261 18.1 Flipping Coins: Bernoulli Random 
Variables 261 18.2 Throwing Darts: Uniform Random Variables 263 …
- Publisher Details:
- Hoboken, NJ : John Wiley & Sons, Inc
- Publication Date:
- 2017
- Extent:
- 1 online resource
- Subjects:
- 005.74
Databases -- Handbooks, manuals, etc
Statistics -- Data processing -- Handbooks, manuals, etc
Big data -- Handbooks, manuals, etc
Information theory -- Handbooks, manuals, etc
COMPUTERS / Databases / General
Big data
Databases
Information theory
Statistics -- Data processing
Electronic books
Handbooks and manuals
- Languages:
- English
- ISBNs:
- 9781119092933
1119092930
9781119092919
1119092914
9781119092926
1119092922
- Related ISBNs:
- 9781119092940
1119092949
- Notes:
- Note: Includes bibliographical references and index.
Note: Print version record.
- Access Rights:
- Legal Deposit; Only available on premises controlled by the deposit library and to one user at any one time; The Legal Deposit Libraries (Non-Print Works) Regulations (UK).
- Access Usage:
- Restricted: Printing from this resource is governed by The Legal Deposit Libraries (Non-Print Works) Regulations (UK) and UK copyright law currently in force.
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD.DS.116090
- Ingest File:
- 01_110.xml