Apache Spark

This hands-on training course delivers the key concepts and expertise developers need to develop high-performance parallel applications with Apache Spark . Participants will learn how to use Spark SQL to query structured data and Spark Streaming to perform real-time processing on streaming data from a variety of sources. Developers will also practice writing applications that use core Spark to perform ETL processing and iterative algorithms. The course covers how to work with large datasets stored in a distributed file system, and execute Spark applications on a Hadoop cluster. After taking this course, participants will be prepared to face real-world challenges and build applications to execute faster decisions, better decisions, and interactive analysis, applied to a wide variety of use cases, architectures, and industries.


 Delegates will learn how to


•         Distribute, store, and process data in a Hadoop cluster

•         Write, configure, and deploy Spark applications on a cluster

•         Use the Spark shell for interactive data analysis

•         Process and query structured data using Spark SQL

•         Use Spark Streaming to process a live data stream


 Audience


This course is designed for developers and engineers who have programming experience, but prior knowledge of Hadoop and/or Spark is not required

Introduction to Apache Hadoop and the Hadoop Ecosystem

ntroduction to Apache Hadoop and the Hadoop Ecosystem

Apache Hadoop Overview

Data Ingestion and Storage

Data Processing 

Data Analysis and Exploration

Other Ecosystem Tools

Introduction to the Hands-On Exercises


Apache Hadoop File Storage

Apache Hadoop Cluster Components

HDFS Architecture

Using HDFS 


Distributed Processing on an Apache Hadoop Cluster

YARN Architecture

Working With YARN


      Apache Spark Basics

      What is Apache Spark?

      Starting the Spark Shell

      Using the Spark Shell 

      Getting Started with Datasets and DataFrames

      DataFrame Operations


          Working with DataFrames and Schemas

          Creating DataFrames from Data Sources

          Saving DataFrames to Data Sources

          DataFrame Schemas 

          Eager and Lazy Execution


              Analyzing Data with DataFrame Queries

              Querying DataFrames Using Column Expressions

              Grouping and Aggregation Queries

              Joining DataFrames 


                  RDD Overview

                  RDD Overview 

                  RDD Data Sources

                  Creating and Saving RDDs 

                  RDD Operations


                      Transforming Data with RDDs

                      Writing and Passing Transformation Functions 

                      Transformation Execution

                      Converting Between RDDs and DataFrames 


                          Aggregating Data with Pair RDDs

                          Key-Value Pair RDDs 

                          Map-Reduce

                          Other Pair RDD Operations 


                              Querying Tables and Views with Apache Spark SQL

                              Querying Tables in Spark Using SQL 

                              Querying Files and Views

                              The Catalog API 

                              Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark


                                Working with Datasets in Scala

                                Datasets and DataFrames 

                                Creating Datasets

                                Loading and Saving Datasets 

                                Dataset Operations


                                    Writing, Configuring, and Running Apache Spark Applications

                                    Writing a Spark Application 

                                    Building and Running an Application

                                    Application Deployment Mode 

                                    The Spark Application Web UI

                                    Configuring Application Properties 


                                        Distributed Processing

                                        Review: Apache Spark on a Cluster 

                                        RDD Partitions

                                        Example: Partitioning in Queries 

                                        Stages and Tasks

                                        Job Execution Planning 

                                        Example: Catalyst Execution Plan

                                        Example: RDD Execution Plan 


                                            Distributed Data Persistence

                                            DataFrame and Dataset Persistence 

                                            Persistence Storage Levels

                                            Viewing Persisted RDDs

                                             

                                                Common Patterns in Apache Spark Data Processing

                                                Common Apache Spark Use Cases 

                                                Iterative Algorithms in Apache Spark

                                                Machine Learning 

                                                Example: k-means


                                                    Apache Spark Streaming: Introduction to DStreams

                                                    Apache Spark Streaming Overview 

                                                    Example: Streaming Request Count

                                                    DStreams 

                                                    Developing Streaming Applications


                                                      Apache Spark Streaming: Processing Multiple Batches

                                                      Multi-Batch Operations 

                                                      Time Slicing

                                                      State Operations 

                                                      Sliding Window Operations

                                                      Preview: Structured Streaming 


                                                          Apache Spark Streaming: Data Sources

                                                          Streaming Data Source Overview 

                                                          Apache Flume and Apache Kafka Data Sources

                                                          Example: Using a Kafka Direct Data Source 

                                                          Program Details
                                                          Duration 4 Days
                                                          Capacity Max 12 Persons
                                                          Training Type Classroom / Virtual Classroom


                                                          Can't find what you're looking for? Let us know if you have a query or cannot find what you are looking for.

                                                          Contact