Practical Enterprise Data Lake Insights : handle data-driven challenges in an Enterprise Big Data Lake. (2018)
- Record Type:
- Book
- Title:
- Practical Enterprise Data Lake Insights : handle data-driven challenges in an Enterprise Big Data Lake. (2018)
- Main Title:
- Practical Enterprise Data Lake Insights : handle data-driven challenges in an Enterprise Big Data Lake
- Further Information:
- Note: Saurabh Gupta, Venkata Giri.
- Authors:
- Gupta, Saurabh
Giri, Venkata
- Contents:
- Intro; Table of Contents; About the Authors; About the Technical Reviewer; Acknowledgments; Foreword.
Chapter 1: Introduction to Enterprise Data Lakes; Data explosion: the beginning; Big data ecosystem; Hadoop and MapReduce -- Early days; Evolution of Hadoop; History of Data Lake; Data Lake: the concept; Data lake architecture; Why Data Lake?; Data Lake Characteristics; Data lake vs. Data warehouse; How to achieve success with Data Lake?; Data governance and data operations; Data democratization with data lake; Fast Data -- Life beyond Big Data; Conclusion.
Chapter 2: Data lake ingestion strategies; What is data ingestion?; Understand the data sources; Structured vs. Semi-structured vs. Unstructured data; Data ingestion framework parameters; ETL vs. ELT; Big Data Integration with Data Lake; Hadoop Distributed File System (HDFS); Copy files directly into HDFS; Batched data ingestion; Challenges and design considerations; Design considerations; Commercial ETL tools; Real-time ingestion; CDC design considerations; Example of CDC pipeline: Databus, LinkedIn's open-source solution; Apache Sqoop; Sqoop 1; Sqoop 2; How Sqoop works?; Sqoop design considerations; Native ingestion utilities; Oracle copyToBDA; Greenplum gphdfs utility; Data transfer from Greenplum to using gpfdist; Ingest unstructured data into Hadoop; Apache Flume; Tiered architecture for convergent flow of events; Features and design considerations; Conclusion.
Chapter 3: Capture Streaming Data with Change-Data-Capture; Change Data Capture Concepts; Strategies for Data Capture; Retention and Replay; Retention Period; Types of CDC; Incremental; Bulk; Hybrid; CDC -- Trade-offs; CDC Tools; Challenges; Downstream Propagation; Use Case; Centralization of Change Data; Analyzing a Centralized Data Store; Metadata: Data about Data; Structure of Data; Privacy/Sensitivity Information; Special Fields; Data Formats; Delimited Format; Avro File Format; Consumption and Checkpointing; Simple Checkpoint Mechanism; Parallelism; Merging and Consolidation; Design Considerations for Merge and Consolidate; Data Quality; Challenges; Design Aspects; Operational Aspects; Publishing to Kafka; Schema and Data; Sample Schema; Schema Repository; Multiple Topics and Partitioning; Sizing and Scaling; Tools; Conclusion.
Chapter 4: Data Processing Strategies in Data Lakes; MapReduce Processing Framework; Motivation: Why MapReduce?; MapReduce V1 Refresher and Design Considerations; Yet Another Resource Negotiator -- YARN; YARN concepts; Hive; Hive -- Quick Refresher; Hive Components; Hive Metastore (a.k.a. HCatalog); Hive -- Design Considerations; Hive LLAP; Apache Pig; Pig Execution Architecture; Apache Spark; Why Spark?; Resilient Distributed Datasets (RDD); RDD Runtime Components; RDD Composition; Datasets and DataFrames; Bucketing, Sorting, and Partitioning; Deployment Modes of Spark Application. … (more)
- Publisher Details:
- Berkeley, CA : Apress
- Publication Date:
- 2018
- Extent:
- 1 online resource
- Subjects:
- 004.36
Computer science
Electronic data processing -- Distributed processing -- Management
Big data
Information storage and retrieval systems
Computer Science
Big Data
Computer Applications
Big Data/Analytics
Computers -- Data Processing
Business & Economics -- Industries -- Computer Industry
Information technology: general issues
Business mathematics & systems
Computers -- Database Management -- General
Databases
Electronic books
- Languages:
- English
- ISBNs:
- 9781484235225
1484235223
1484235215
9781484235218
- Related ISBNs:
- 9781484235218
- Notes:
- Note: Online resource; title from PDF title page (EBSCO, viewed July 5, 2018).
- Access Rights:
- Legal Deposit; Only available on premises controlled by the deposit library and to one user at any one time; The Legal Deposit Libraries (Non-Print Works) Regulations (UK).
- Access Usage:
- Restricted: Printing from this resource is governed by The Legal Deposit Libraries (Non-Print Works) Regulations (UK) and UK copyright law currently in force.
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD.DS.360019
- Ingest File:
- 02_340.xml