Course intended for:

The training course is aimed at developers who want to build systems for storing and/or analysing large data sets using the Apache Hadoop platform. It suits both beginners and programmers who already have some experience with the platform and want to extend or consolidate their knowledge.

Course objective

Participants will gain the knowledge needed to work with Apache Hadoop, including implementing efficient MapReduce algorithms and storing and importing data into the system. Design patterns and good coding practices will be presented. The course covers theoretical aspects, but the main emphasis is on practical skills.

Course strengths

The curriculum includes a general introduction to Big Data along with a detailed presentation of the Apache Hadoop tools, at a level that enables participants to start working in this environment. The training is unique because the topics it presents are not sufficiently covered in the available literature. The curriculum is constantly updated to keep up with the rapid development of these technologies. The knowledge presented is the result of the trainers' several years of practice in building systems based on the Apache Hadoop platform.


Participants are expected to have basic Java programming skills and to know the basics of databases and SQL.

Course parameters

3 × 8 hours (3 × 7 net hours) of lectures and workshops, with more workshops than lectures. During the workshops, participants will solve data processing problems by implementing their own algorithms using the MapReduce paradigm.

Course curriculum

  1. Introduction to Big Data
  2. Hadoop
    1. Introduction and history
    2. Architecture and components
    3. Running modes
    4. Introduction to the ecosystem
    5. Users and applications
  3. HDFS
    1. Introduction to the distributed file system
    2. Management from the command line
    3. Access through the web interface
    4. API usage
    5. Import and export of data
  4. Introduction to MapReduce
    1. Introduction to the MapReduce paradigm
    2. Comparison of subsequent versions of MapReduce
  5. Use of the MapReduce Java API
    1. Input and output formats, creating custom formats
    2. Built-in and custom data types
    3. Partitioner and Combiner, when and how to use them
    4. Data counters
    5. Data sorting
    6. Configuration of tasks with the use of parameters
    7. Creating custom data comparators
    8. Implementing data joins in MapReduce
    9. Task chains in MapReduce
    10. Use of compression to decrease the amount of data
    11. Optimization of tasks in MapReduce
    12. Use of DistributedCache
  6. Examples of implementing common algorithms in the MapReduce paradigm
  7. Other programming approaches
    1. Streaming – use of programs written in other programming languages
    2. Developing MapReduce algorithms with the Cascading library
  8. Good programming practices in the MapReduce paradigm
    1. Design patterns in MapReduce
    2. Unit tests in the Apache Hadoop environment
  9. Starting and monitoring of tasks in a cluster
  10. Creating MapReduce task flows
    1. Use of JobControl class
    2. Apache Oozie
  11. HBase
    1. Introduction to HBase
    2. Use of HBase with API
    3. MapReduce in HBase
    4. Unit tests in HBase
  12. Use of the Spring Framework
    1. Project setup (Java + Maven)
    2. Hadoop configuration in Spring
    3. Handling the ecosystem
    4. Testing
    5. Dependency injection in the MapReduce environment
  13. Hive
    1. Introduction
    2. Creating and running queries
    3. Use of user-defined functions (UDFs)
  14. Pig
    1. Introduction
    2. Creating and using scripts
    3. Use of user-defined functions (UDFs)
  15. Overview of selected ecosystem elements
    1. YARN
    2. Flume
    3. ZooKeeper

Any questions?

Phone +48 22 2035600
Fax +48 22 2035601