Course intended for:

This training is intended for data scientists and developers who want to create or maintain data exploration processes using Pentaho Data Mining (WEKA).

Course objective:

Participants will learn how to design, implement, monitor, run and tune data mining (DM) processes. They will also refresh their knowledge of basic statistical terms and learn the most popular DM algorithms in detail. They will be able to choose the appropriate set of tools and techniques for their real-world projects.

Course strengths:

The curriculum includes a general introduction to Data Mining and Machine Learning, as well as a comprehensive presentation of a real-world process using the Weka environment.


Participants are expected to have basic knowledge of Java programming.

Course parameters:

4 x 8 hours (4 x 7 net hours) of lectures and workshops. During workshops, in addition to simple exercises, participants will solve data exploration problems and use and tune DM algorithms. Group size: max. 8-10 people.

Course curriculum

  1. Introduction
    1. Introduction to data warehousing:
      1. OLTP, OLAP, database, data warehouse, data marts
      3. Normalization, aggregation, facts, dimensions
      4. SQL, MDX, XML/A
      5. ETL
      6. BigData, BigTable, NoSQL, non-relational data warehouses
      7. Others
    2. Pentaho BI Suite Platform
  2. Data exploration
    1. Artificial intelligence, machine learning, data exploration etc.
    2. Basics of data mining algorithms
      1. Algorithms
        • Classification
        • Clustering
        • Finding patterns and association rules
        • Transforming and reducing the space of attributes
      2. Techniques:
        • Trees and decision tables
        • Linear regression
        • Bayesian networks
        • Neural networks
        • Genetic and evolutionary algorithms
      3. Basic statistical terms
        • Minimum, Maximum
        • Mean, Median
        • Standard deviation, Variance
        • Probability
        • Correlation
        • Distance metric
        • Statistical significance
      4. Others
    3. Overview of data mining tools available on the market
  3. Pentaho Data Mining (WEKA)
    1. Architecture
    2. Weka GUI Chooser
      1. Explorer
      2. Experimenter
      3. Knowledge Flow
      4. Simple CLI
      5. Tools: ARFF Viewer, SQL Viewer etc.
      6. Weka Light, Weka Server
    3. Working with Explorer
  4. Preprocessing and working with data
    1. ARFF data format
    2. Data preprocessing
    3. Attribute selection
    4. Data filtering and types of filters in WEKA, e.g. filtering, discretization, normalization, etc.
    5. Visualization
    6. Processing of big data sets, 32-bit JVM limitations
    7. Stream processing and incremental learning
  5. Classification
    1. Classification problem definition
    2. Selecting an appropriate set of training and testing data
    3. Types of classification algorithms available in WEKA
    4. Most popular classification algorithms in detail
      1. Bayesian networks, e.g. Naive Bayes classifier
      2. Regression, e.g. linear regression
      3. Trees and decision tables
    5. Cross-validation, overfitting
    6. Interpretation of classification results
  6. Clustering
    1. Clustering problem definition
    2. Selecting an appropriate set of training and testing data
    3. Types of clustering algorithms available in WEKA
    4. Most popular clustering algorithms in detail
      1. Centroid-based, e.g. k-Means
      2. Density-based, e.g. DBSCAN
    5. Interpreting the results of clustering
  7. Association rule mining
    1. Association rule mining definition
    2. Selecting an appropriate set of training and testing data
    3. Types of association rule mining algorithms available in WEKA
    4. Most popular algorithms in detail
      1. Apriori
      2. Frequent Pattern Growth
    5. Interpreting the results
  8. Transforming and reducing the attribute space
    1. Defining the problems of: attribute selection, attribute space reduction and transformation
    2. Types of algorithms for transforming the attribute space in WEKA
    3. Most popular algorithms in detail
      1. Searching the attribute space, e.g. BestFirst, ExhaustiveSearch, GeneticSearch
      2. Principal Component Analysis (PCA)
      3. Support Vector Machines (SVM/SVMAttributeEval)
    4. Interpreting the results
  9. Other data mining algorithms and techniques available in WEKA
  10. Extending WEKA
    1. Pentaho Data Mining Plug-Ins
    2. User-defined DM algorithms in WEKA
  11. Combining Weka with other Pentaho products
    1. Knowledge Flow Plugin and Pentaho Data Integration

Any questions?

Phone +48 22 2035600
Fax +48 22 2035601