Building a modern data platform : pursuit of modern big data systems. (2021)
- Record Type:
- Book
- Title:
- Building a modern data platform : pursuit of modern big data systems. (2021)
- Main Title:
- Building a modern data platform : pursuit of modern big data systems
- Further Information:
- Note: Yusuf Aytas.
- Authors:
- Aytas, Yusuf
- Contents:
- List of Contributors; Preface; Acknowledgments; Acronyms; Introduction
1 An Introduction: What’s a Modern Big Data Platform; 1.1 Defining Modern Big Data Platform; 1.2 Fundamentals of a Modern Big Data Platform; 1.2.1 Expectations from Data; 1.2.1.1 Ease of Access; 1.2.1.2 Security; 1.2.1.3 Quality; 1.2.1.4 Extensibility; 1.2.2 Expectations from Platform; 1.2.2.1 Storage Layer; 1.2.2.2 Resource Management; 1.2.2.3 ETL; 1.2.2.4 Discovery; 1.2.2.5 Reporting; 1.2.2.6 Monitoring; 1.2.2.7 Testing; 1.2.2.8 Lifecycle Management
2 A Bird’s Eye View on Big Data; 2.1 A Bit of History; 2.1.1 Early Uses of Big Data Term; 2.1.2 A New Era; 2.1.2.1 Word Count Problem; 2.1.2.2 Execution Steps; 2.1.3 An Open-Source Alternative; 2.1.3.1 Hadoop Distributed File System; 2.1.3.2 Hadoop MapReduce; 2.2 What Makes Big Data; 2.2.1 Volume; 2.2.2 Velocity; 2.2.3 Variety; 2.2.4 Complexity; 2.3 Components of Big Data Architecture; 2.3.1 Ingestion; 2.3.2 Storage; 2.3.3 Computation; 2.3.4 Presentation; 2.4 Making Use of Big Data; 2.4.1 Querying; 2.4.2 Reporting; 2.4.3 Alerting; 2.4.4 Searching; 2.4.5 Exploring; 2.4.6 Mining; 2.4.7 Modeling
3 A Minimal Data Processing and Management System; 3.1 Problem Definition; 3.1.1 Online Book Store; 3.1.2 User Flow Optimization; 3.2 Processing Large Data with Linux Commands; 3.2.1 Understand the Data; 3.2.2 Sample the Data; 3.2.3 Building the Shell Command; 3.2.4 Executing the Shell Command; 3.2.5 Analyzing the Results; 3.2.6 Reporting the Findings; 3.2.7 Automating the Process; 3.2.8 A Brief Review; 3.3 Processing Large Data with PostgreSQL; 3.3.1 Data Modeling; 3.3.2 Copying Data; 3.3.3 Sharding in PostgreSQL; 3.3.3.1 Setting up Foreign Data Wrapper; 3.3.3.2 Sharding Data over Multiple Nodes; 3.4 Cost of Big Data
4 Big Data Storage; 4.1 Big Data Storage Patterns; 4.1.1 Data Lakes; 4.1.2 Data Warehouses; 4.1.3 Data Marts; 4.1.4 Comparison of Storage Patterns; 4.2 On-Premise Storage Solutions; 4.2.1 Choosing Hardware; 4.2.1.1 DataNodes; 4.2.1.2 NameNodes; 4.2.1.3 Resource Managers; 4.2.1.4 Network Equipment; 4.2.2 Capacity Planning; 4.2.2.1 Overall Cluster; 4.2.2.2 Resource Sharing; 4.2.2.3 Doing the Math; 4.2.3 Deploying Hadoop Cluster; 4.2.3.1 Networking; 4.2.3.2 Operating System; 4.2.3.3 Management Tools; 4.2.3.4 Hadoop Ecosystem; 4.2.3.5 A Humble Deployment; 4.3 Cloud Storage Solutions; 4.3.1 Object Storage; 4.3.2 Data Warehouses; 4.3.2.1 Columnar Storage; 4.3.2.2 Provisioned Data Warehouses; 4.3.2.3 Serverless Data Warehouses; 4.3.2.4 Virtual Data Warehouses; 4.3.3 Archiving; 4.4 Hybrid Storage Solutions; 4.4.1 Making Use of Object Store; 4.4.1.1 Additional Capacity; 4.4.1.2 Batch Processing; 4.4.1.3 Hot Backup; 4.4.2 Making Use of Data Warehouse; 4.4.2.1 Primary Data Warehouse; 4.4.2.2 Shared Data Mart; 4.4.3 Making Use of Archiving
5 Offline Big Data Processing; 5.1 Defining Offline Data Processing; 5.2 MapReduce Technologies; 5.2.1 Apache Pig; 5.2.1.1 Pig Latin Overview; 5.2.1.2 Compilation To MapReduce; 5.2.2 Apache Hive; 5.2.2.1 Hive Database; 5.2.2.2 Hive Architecture; 5.3 Apache Spark; 5.3.1 What’s Spark; 5.3.2 Spark Constructs and Components; 5.3.2.1 Resilient Distributed Datasets; 5.3.2.2 Distributed Shared Variables; 5.3.2.3 Datasets and DataFrames; 5.3.2.4 Spark Libraries and Connectors; 5.3.3 Execution Plan; 5.3.3.1 The Logical Plan; 5.3.3.2 The Physical Plan; 5.3.4 Spark Architecture; 5.3.4.1 Inside of Spark Application; 5.3.4.2 Outside of Spark Application; 5.4 Apache Flink; 5.5 Presto; 5.5.1 Presto Architecture; 5.5.2 Presto System Design; 5.5.2.1 Execution Plan; 5.5.2.2 Scheduling; 5.5.2.3 Resource Management; 5.5.2.4 Fault Tolerance
6 Stream Big Data Processing; 6.1 The Need for Stream Processing; 6.2 Defining Stream Data Processing; 6.3 Streams via Message Brokers; 6.3.1 Apache Kafka; 6.3.1.1 Apache Samza; 6.3.1.2 Kafka Streams; 6.3.2 Apache Pulsar; 6.3.2.1 Pulsar Functions; 6.3.3 AMQP Based Brokers; 6.4 Streams via Stream Engines; 6.4.1 Apache Flink; 6.4.1.1 Flink Architecture; 6.4.1.2 System Design; 6.4.2 Apache Storm; 6.4.2.1 Storm Architecture; 6.4.2.2 System Design; 6.4.3 Apache Heron; 6.4.3.1 Storm Limitations; 6.4.3.2 Heron Architecture; 6.4.4 Spark Streaming; 6.4.4.1 Discretized Streams; 6.4.4.2 Fault-tolerance
7 Data Analytics; 7.1 Log Collection; 7.1.1 Apache Flume; 7.1.2 Fluentd; 7.1.2.1 Data Pipeline; 7.1.2.2 Fluent Bit; 7.1.2.3 Fluentd Deployment; 7.2 Transferring Big Data Sets; 7.2.1 Reloading; 7.2.2 Partition Loading; 7.2.3 Streaming; 7.2.4 Timestamping; 7.2.5 Tools; 7.2.5.1 Sqoop; 7.2.5.2 Embulk; 7.2.5.3 Spark; 7.2.5.4 Apache Gobblin; 7.3 Aggregating Big Data Sets; 7.3.1 Data Cleansing; 7.3.2 Data Transformation; 7.3.2.1 Transformation Functions; 7.3.2.2 Transformation Stages; 7.3.3 Data Retention; 7.3.4 Data Reconciliation; 7.4 Data Pipeline Scheduler; 7.4.1 Jenkins; 7.4.2 Azkaban; 7.4.2.1 Projects; 7.4.2.2 Execution Modes; 7.4.3 Airflow; 7.4.3.1 Task Execution; 7.4.3.2 Scheduling; 7.4.3.3 Executor; 7.4.3.4 Security and Monitoring; 7.4.4 Cloud; 7.5 Patterns and Practices; 7.5.1 Patterns; 7.5.1.1 Data Centralization; 7.5.1.2 Single Source of Truth; 7.5.1.3 Domain Driven Data Sets; 7.5.2 Anti-Patterns; 7.5.2.1 Data Monolith; 7.5.2.2 Data Swamp; 7.5.2.3 Technology Pollution; 7.5.3 Best Practices; 7.5.3.1 Business-Driven Approach; 7.5.3.2 Cost of Mainte …
- Edition:
- 1st
- Publisher Details:
- Hoboken : John Wiley & Sons, Inc
- Publication Date:
- 2021
- Extent:
- 1 online resource
- Subjects:
- 005.7
Big data
Data mining
- Languages:
- English
- ISBNs:
- 9781119690955
9781119690948
- Related ISBNs:
- 9781119690924
- Notes:
- Note: Description based on CIP data; resource not viewed.
- Access Rights:
- Legal Deposit; Only available on premises controlled by the deposit library and to one user at any one time; The Legal Deposit Libraries (Non-Print Works) Regulations (UK).
- Access Usage:
- Restricted: Printing from this resource is governed by The Legal Deposit Libraries (Non-Print Works) Regulations (UK) and UK copyright law currently in force.
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD.DS.641534
- Ingest File:
- 06_034.xml