Learning Outcomes
Modern internet applications have created the need to handle huge volumes of data on a daily basis. Running algorithms on datasets much larger than the available memory cannot easily be done with traditional techniques. The course gives students the knowledge and skills needed to solve problems that involve huge datasets which cannot fit in memory. The course is split into two parts: the first part covers the architecture of distributed systems capable of handling vast amounts of data, and the second part covers the corresponding algorithmic techniques.
Course Content
1st week Lecture: Introduction to Big Data. Advanced computational and storage models.
2nd week Lecture: Introduction to advanced distributed systems.
3rd week Lecture: Distributed file systems and the MapReduce platform for parallel computing.
4th week Lab: Practical application of the Hadoop framework.
5th week Lecture: Spark architecture and implementation of algorithms with RDDs (an illustrative sketch follows the schedule).
6th week Lecture: Data processing with the Scala programming language in Spark.
7th week Lab: Practical application of the Spark framework.
8th week Lecture: Basic algorithms with MapReduce and Spark. High-level languages for data analysis.
9th week Lecture: Entity Resolution in Spark.
10th week Lecture: Resource management in distributed systems: YARN, Mesos, Kubernetes.
11th week Lecture: Managing data streams with Spark Structured Streaming (an illustrative sketch follows the schedule).
12th week: Student presentations.
13th week: Student presentations.
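
To give a flavour of the material in the 5th, 6th and 8th weeks (Spark RDDs, Scala, and basic MapReduce-style algorithms), a minimal word-count sketch in Scala is shown below. It is not part of the official course material; the application name and the HDFS input path are placeholders assumed only for the example.

    import org.apache.spark.sql.SparkSession

    // Minimal illustrative RDD job: word count over a text file in HDFS.
    // The path "hdfs:///data/input.txt" is a placeholder, not course data.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("WordCount")
          .getOrCreate()
        val sc = spark.sparkContext

        val lines = sc.textFile("hdfs:///data/input.txt")   // distributed read from HDFS
        val counts = lines
          .flatMap(_.split("\\s+"))                         // split each line into words
          .map(word => (word, 1))                           // emit (word, 1) pairs (the "map" phase)
          .reduceByKey(_ + _)                               // sum the counts per word (the "reduce" phase)

        counts.take(10).foreach(println)                    // bring a small sample back to the driver
        spark.stop()
      }
    }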
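
For the 11th-week topic, the sketch below shows a streaming word count with Spark Structured Streaming over a socket source. Again, this is only an illustration; the host and port are assumed placeholder values.

    import org.apache.spark.sql.SparkSession

    // Minimal illustrative Structured Streaming job: running word counts
    // over lines arriving on a socket (host/port are placeholders).
    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("StreamingWordCount")
          .getOrCreate()
        import spark.implicits._

        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        val counts = lines.as[String]
          .flatMap(_.split("\\s+"))      // split incoming lines into words
          .groupBy("value")              // group by the word itself
          .count()

        val query = counts.writeStream
          .outputMode("complete")        // keep the full running count per word
          .format("console")
          .start()

        query.awaitTermination()
      }
    }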