Updated June 2026
Practical Apache Spark for Data Pipelines
Class Duration
21 hours of live training delivered over 3-5 days.
Student Prerequisites
- Familiarity with Python (PySpark)
- Basic data processing knowledge
- Access to a GCP account with Dataproc serverless configured (provided if needed)
Target Audience
Data engineers, Python developers, and data professionals who need to develop, manage, and optimize batch and streaming data pipelines with Apache Spark 4 on GCP Dataproc serverless.
Description
The course equips participants with practical skills to develop, manage, and optimize Apache Spark 4 pipelines on GCP Dataproc serverless. By the end, attendees will understand Spark batch and streaming use cases; know Spark's execution model, the Spark Connect architecture, and its core data structures, including the VARIANT type for semi-structured data; and be able to build reusable, performance-tuned pipelines for diverse data workloads.
Learning Outcomes
- Understand use-cases and benefits of Spark Batch and Structured Streaming.
- Gain working knowledge of Spark's execution model and the Spark Connect client-server architecture.
- Develop reusable PySpark code for batch and streaming contexts.
- Build and optimize Spark pipelines on GCP Dataproc serverless.
- Master core data structures (DataFrames, ANSI-mode Spark SQL, and the VARIANT type), along with their operations and performance tuning.
Training Materials
Comprehensive courseware is distributed online at the start of class. All students receive a downloadable MP4 recording of the training.
Software Requirements
Students will need access to a GCP account with Dataproc serverless configured. If students are unable to configure access, a cloud environment can be provided.
Training Topics
Spark Overview
- Introduction to Apache Spark and its ecosystem
- What's New in Spark 4: ANSI mode, VARIANT, Spark Connect
- Spark Fundamentals Overview
- Pipeline Development Overview
- Advanced Spark and Optimization Overview
Spark Architecture and Use-Cases
- Spark topology: driver, cluster manager, worker nodes, executors
- Spark Connect client-server architecture
- Use-cases for Batch and Structured Streaming
- Spark's role in data engineering
Core Data Structures
- DataFrames and Spark SQL basics (ANSI mode by default)
- VARIANT type for semi-structured data
- Datasets and RDDs as legacy context
- Core operations: filtering, aggregations, joins
Spark Execution Model
- Partitioning
- Lazy Execution
- Fault Tolerance
- Checkpointing
- Serialization
Batch and Streaming Pipelines
- Designing Batch Pipelines
- Structured Streaming Fundamentals
- Stateful Processing with transformWithState
- Python Data Source API for Custom Connectors
- Building Reusable Code Components
Advanced Features
- Broadcast Variables
- Accumulators
- Serialization Challenges
- Resource management: memory, CPU, partitioning
- Adaptive Query Execution (AQE)
- Optimization: caching, shuffle reduction
Case Study and Wrap-up
- Discuss real-world Spark applications
- Review takeaways