This hands-on training course delivers the key concepts and expertise participants need to ingest and process data on a Hadoop cluster using up-to-date tools and techniques. Working with Hadoop ecosystem projects such as Spark (including Spark Streaming and Spark SQL), Flume, Kafka, and Sqoop, the course prepares participants for the real-world challenges faced by Hadoop developers. With Spark, developers can write sophisticated parallel applications that support faster decisions, better decisions, and interactive actions across a wide variety of use cases, architectures, and industries.

We can organize this training at your preferred date and location. Contact Us!

Prerequisites

There are no formal prerequisites for this course. The assumed background is listed under Who Should Attend.

Who Should Attend

This course is designed for developers and engineers who have programming experience. Prior knowledge of Hadoop is not required.
Apache Spark examples and hands-on exercises are presented in Scala and Python. The ability to program in one of those languages is required.
Basic familiarity with the Linux command line is assumed.
Basic knowledge of SQL is helpful.

What You Will Learn

Through expert-led discussion and interactive, hands-on exercises, participants will learn how to:
Distribute, store, and process data in a Hadoop cluster
Write, configure, and deploy Apache Spark applications on a Hadoop cluster
Use the Spark shell for interactive data analysis
Process and query structured data using Spark SQL
Use Spark Streaming to process a live data stream
Use Flume and Kafka to ingest data for Spark Streaming

Training Outline

Introduction

Introduction to Apache Hadoop and the Hadoop Ecosystem

Apache Hadoop Overview
Data Storage and Ingest
Data Processing
Data Analysis and Exploration
Other Ecosystem Tools
Introduction to the Hands-On Exercises

Apache Hadoop File Storage

Problems with Traditional Large-Scale Systems
HDFS Architecture
Using HDFS
Apache Hadoop File Formats

Data Processing on an Apache Hadoop Cluster

YARN Architecture
Working With YARN

Importing Relational Data with Apache Sqoop

Apache Sqoop Overview
Importing Data
Importing File Options
Exporting Data

Apache Spark Basics

What is Apache Spark?
Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark

Working with RDDs

Creating RDDs
Other General RDD Operations

Aggregating Data with Pair RDDs

Key-Value Pair RDDs
Map-Reduce
Other Pair RDD Operations
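
To give a flavor of the key-value aggregation covered in this module, here is a minimal word-count sketch of the map-reduce pattern in plain Python. It assumes no Spark installation. The helper names `map_phase` and `reduce_by_key` are hypothetical and are not Spark APIs; on a Spark RDD the analogous operations would be `flatMap`, `map`, and `reduceByKey`. The sample lines are made up for illustration and are not course material.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group into one value with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


# Illustrative input only (hypothetical data).
lines = [
    "spark makes hadoop faster",
    "hadoop stores data",
    "spark processes data",
]

# Sum the 1s emitted for each word to get a per-word count.
counts = reduce_by_key(map_phase(lines), lambda a, b: a + b)
print(sorted(counts.items()))
```

On a cluster, the grouping step is what triggers data movement between nodes. This single-process sketch only mirrors the logic, not the distribution.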

Writing and Running Apache Spark Applications

Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI

Configuring Apache Spark Applications

Configuring Spark Properties
Logging

Parallel Processing in Apache Spark

Review: Apache Spark on a Cluster
RDD Partitions
Partitioning of File-Based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks

RDD Persistence

RDD Lineage
RDD Persistence Overview
Distributed Persistence

Common Patterns in Apache Spark Data Processing

Common Apache Spark Use Cases
Iterative Algorithms in Apache Spark
Machine Learning
Example: k-means

DataFrames and Spark SQL

Apache Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
Saving DataFrames
DataFrames and RDDs
Comparing Apache Spark SQL, Impala, and Hive-on-Spark
Apache Spark SQL in Spark 2.x

Message Processing with Apache Kafka

What is Apache Kafka?
Apache Kafka Overview
Scaling Apache Kafka
Apache Kafka Cluster Architecture
Apache Kafka Command Line Tools

Capturing Data with Apache Flume

What is Apache Flume?
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration

Integrating Apache Flume and Apache Kafka

Overview
Use Cases
Configuration

Apache Spark Streaming: Introduction to DStreams

Apache Spark Streaming Overview
Example: Streaming Request Count
DStreams
Developing Streaming Applications

Apache Spark Streaming: Processing Multiple Batches

Multi-Batch Operations
Time Slicing
State Operations
Sliding Window Operations

Apache Spark Streaming: Data Sources

Streaming Data Source Overview
Apache Flume and Apache Kafka Data Sources
Example: Using a Kafka Direct Data Source

Conclusion

Why Choose Us

Experience Cloudera Developer for Spark and Hadoop through Bilginç IT Academy's live and interactive virtual classroom environment, accessible from your home, office, or any location. Connect with expert trainers in real time and bring the energy of classroom learning into the digital experience.

Live Instructor-Led Sessions: Join scheduled training sessions with your instructor and fellow delegates in real time.
Interactive Learning Experience: Take part in discussions, practical exercises, group activities, and Q&A sessions throughout the course.
Expert Trainer Network: Learn from experienced trainers with strong industry backgrounds and practical field expertise.
Over 30 Years of Training Expertise: Benefit from Bilginç IT Academy's long-standing experience in delivering professional training since 1995.
Flexible and Scalable Delivery: Access live virtual classrooms worldwide with flexible planning options for individual and corporate training needs.

Experience Cloudera Developer for Spark and Hadoop in a focused classroom environment designed for high engagement and effective learning. Bilginç IT Academy's carefully selected training venues provide a professional setting where delegates can interact directly with expert trainers and peers.

Experienced Trainers: Learn from specialists with extensive field experience and real-world knowledge.
Professional Training Venues: Attend courses in comfortable, well-equipped classrooms designed to support effective learning.
Focused Classroom Experience: Benefit from limited class sizes that encourage discussion, interaction, and personalized support.
Quality-Driven Learning: Develop practical skills through structured, up-to-date, and professionally designed training content.

Meet your team's training needs with Bilginç IT Academy's onsite Cloudera Developer for Spark and Hadoop solution, delivered at your office or preferred location. Align your team's development with your business goals through a training experience tailored to your organization.

Tailored Course Content: Adapt the training program to your organization's projects, team structure, and specific business requirements.
Time and Cost Efficiency: Reduce travel, accommodation, and operational costs while maximizing the value of your training investment.
Team-Focused Learning: Help your employees develop around the same knowledge base and strengthen collaboration across your organization.
Simplified Planning and Tracking: Manage the training process, participant development, and organizational requirements with greater control.
