Module 01: Data engineering tasks and components
- The role of a data engineer
- Data sources versus data sinks
- Data formats
- Storage solution options on Google Cloud
- Metadata management options on Google Cloud
- Share datasets using Analytics Hub
Module 02: Data replication and migration
- Replication and migration architecture
- The gcloud command line tool
- Moving datasets
- Datastream
Module 03: The extract and load data pipeline pattern
- Extract and load architecture
- The bq command line tool
- BigQuery Data Transfer Service
- BigLake
Module 04: The extract, load, and transform data pipeline pattern
- Extract, load, and transform (ELT) architecture
- SQL scripting and scheduling with BigQuery
- Dataform
Module 05: The extract, transform, and load data pipeline pattern
- Extract, transform, and load (ETL) architecture
- Google Cloud GUI tools for ETL data pipelines
- Batch data processing using Dataproc
- Streaming data processing options
- Bigtable and data pipelines
Module 06: Automation techniques
- Automation patterns and options for pipelines
- Cloud Scheduler and Workflows
- Cloud Composer
- Cloud Run functions
- Eventarc
Module 07: Introduction to data engineering
- Data engineer’s role
- Data engineering challenges
- Introduction to BigQuery
- Data lakes and data warehouses
- Transactional databases versus data warehouses
- Effective partnership with other data teams
- Management of data access and governance
- Building of production-ready pipelines
- Google Cloud customer case study
Module 08: Build a data lake
- Introduction to data lakes
- Data storage and ETL options on Google Cloud
- Building of a data lake using Cloud Storage
- Secure Cloud Storage
- Store all sorts of data types
- Cloud SQL as your OLTP system
Module 09: Build a data warehouse
- The modern data warehouse
- Introduction to BigQuery
- Get started with BigQuery
- Loading of data into BigQuery
- Exploration of schemas
- Schema design
- Nested and repeated fields
- Optimization with partitioning and clustering
Module 10: Introduction to building batch data pipelines
- EL, ELT, ETL
- Quality considerations
- Ways of executing operations in BigQuery
- Shortcomings
- ETL to solve data quality issues
Module 11: Execute Spark on Dataproc
- The Hadoop ecosystem
- Run Hadoop on Dataproc
- Cloud Storage instead of HDFS
- Optimize Dataproc
Module 12: Serverless data processing with Dataflow
- Introduction to Dataflow
- Reasons why customers value Dataflow
- Dataflow pipelines
- Aggregating with GroupByKey and Combine
- Side inputs and windows
- Dataflow templates
Module 13: Manage data pipelines with Cloud Data Fusion and Cloud Composer
- Build batch data pipelines visually with Cloud Data Fusion
  - Components
  - Overview
  - Building a pipeline
  - Exploring data using Wrangler
- Orchestrate work between Google Cloud services with Cloud Composer
  - Apache Airflow environment
  - DAGs and operators
  - Workflow scheduling
  - Monitoring and logging
Module 14: Serverless messaging with Pub/Sub
- Introduction to Pub/Sub
- Pub/Sub push versus pull
- Publishing with Pub/Sub code
Module 16: Dataflow streaming features
- Streaming data challenges
- Dataflow windowing
Module 17: High-throughput BigQuery and Bigtable streaming features
- Streaming into BigQuery and visualizing results
- High-throughput streaming with Bigtable
- Optimizing Bigtable performance
Module 18: Advanced BigQuery functionality and performance
- Analytic window functions
- GIS functions
- Performance considerations
Exams and assessments
There is no specific certification related to this course.
Hands-on learning
There are practical labs in this course.