Intended audience

The course is aimed at software engineers and data analysts who want to learn the basics of Big Data processing, that is, processing at a scale that exceeds the capabilities of traditional tools, using the Apache Spark family. It is suitable both for people who want to start working with Big Data and for people with previous experience in other Big Data systems, such as Apache Hadoop, who want to learn a new technology.

Course objective

The attendees will learn about the new problems that arise when analysing Big Data from various sources using Apache Spark family tools. The course presents a set of typical Big Data problems and their solutions in Apache Spark. Attendees will also gain an overview of the pros and cons of applying Apache Spark to their business problems, and will become familiar with the fast-moving field of Big Data processing and the novel approach to problem solving that Apache Spark represents.

Course strengths

The course is conducted by people who work on Big Data problems in their everyday practice. As a result, the material often goes beyond common textbook information, which tends to be fragmented. The content of the training is also continuously updated to follow advances in the field. After the course, graduates will have a broad view of solving Big Data problems with Apache Spark family tools for their specific business cases.


Course requirements

The course requires experience in programming in Java, Scala or Python; the preferred training language is Scala. Experience with data processing, functional programming, distributed processing and *nix systems is also useful.

Course parameters

2 working days, 2×7 working hours, groups of 8-10 people. The course combines presentations with coding workshops.

Course curriculum

  1. Introduction to Big Data
    1. Definition
    2. What is Big Data?
    3. History of Big Data
    4. Stakeholders in a Big Data project
    5. Big Data problems
    6. Big Data processing types
      • Batch
      • Stream
  2. Apache Spark
    1. Introduction
    2. History
    3. Spark vs Hadoop
    4. MapReduce paradigm
    5. Resilient Distributed Datasets (RDDs)
    6. Processing in memory vs from disk
    7. Architecture
    8. Operation variants
      • Spark built-in (standalone) cluster
      • Apache Mesos
      • Apache Hadoop YARN
    9. Administration
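The MapReduce paradigm covered in this module can be illustrated without any Spark installation. The sketch below is plain Python (all function names are hypothetical, chosen for this example) and runs a word count as the three classic phases: map, shuffle/group, reduce.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key (the word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big spark", "spark streaming"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 1, 'spark': 2, 'streaming': 1}
```

In a real cluster the map and reduce phases run in parallel on many machines and the shuffle moves data between them over the network; the structure of the computation, however, is exactly this.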
  3. Spark Core
    1. Introduction
    2. Java vs Scala vs Python
    3. Connecting to cluster
    4. Dataset distribution
    5. RDD operations
      • Transformations
      • Actions
    6. Shared variables
    7. Execution and testing
    8. Job tuning
      • Serialization
      • Memory
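The transformation/action distinction taught in this module (transformations are lazy descriptions of work; actions trigger actual computation) can be mimicked with Python's lazy iterators. This is a conceptual analogy only, not Spark code:

```python
data = range(1, 6)

# "Transformations": only describe the pipeline, nothing runs yet
# (analogous to rdd.map(...) and rdd.filter(...)).
squared = map(lambda x: x * x, data)
evens = filter(lambda x: x % 2 == 0, squared)

# "Action": forces evaluation of the whole pipeline
# (analogous to rdd.collect()).
result = list(evens)
print(result)  # [4, 16]
```

As in Spark, nothing is computed until the final call; this laziness is what lets Spark optimize and schedule a whole chain of transformations at once.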
  4. Spark SQL
    1. Introduction
    2. Spark SQL vs Hive
    3. Basic operation
    4. Data and schema
    5. Queries
    6. Hive integration
    7. Execution and testing
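The kind of query written in this module can be previewed with Python's standard-library sqlite3, used here only as a stand-in for SQL-over-structured-data; real Spark SQL runs the same style of query distributed over a cluster. Table and column names are invented for the example.

```python
import sqlite3

# In-memory table standing in for a Spark SQL table/DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ada", 3), ("bob", 5), ("ada", 2)])

# The same kind of aggregation one would submit to Spark SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('ada', 5), ('bob', 5)]
```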
  5. Spark Streaming
    1. Introduction
    2. Basic operation
    3. Streams
      • Input
      • Transformation
      • Output
    4. Execution and testing
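The input → transformation → output structure of a stream listed above can be sketched as a plain-Python generator pipeline. This is a conceptual analogy for the course material, not Spark Streaming itself (which processes micro-batches of records), and all names are invented for the example:

```python
def input_stream():
    # Input: pretend these records arrive one at a time from a source.
    for record in ["error: disk", "ok", "error: net", "ok"]:
        yield record

def transform(stream):
    # Transformation: keep only error records, normalise to upper case.
    for record in stream:
        if record.startswith("error"):
            yield record.upper()

def output(stream):
    # Output: collect results (a real sink would write to storage).
    return list(stream)

results = output(transform(input_stream()))
print(results)  # ['ERROR: DISK', 'ERROR: NET']
```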
  6. Other Apache Spark family tools
    1. MLlib
    2. GraphX

Any questions?


Phone +48 22 2035600
Fax +48 22 2035601