Big Data Management

Course ID: CSIS1-4
Direction: 1st, 2nd, 3rd
Semester: Spring
Type: 1st direction elective, 3rd direction elective, 2nd direction mandatory

Learning Outcomes

Modern internet applications have created the need to handle huge volumes of data on a daily basis. Running algorithms on datasets much larger than the available memory cannot easily be handled with traditional techniques. The course gives students the knowledge and skills needed to solve problems involving huge datasets that cannot be stored in memory. The course is split into two parts: the first describes the architecture of distributed systems capable of handling vast amounts of data, and the second describes the corresponding algorithmic techniques.

Course Content

1st week Lecture: Introduction to Big Data. Advanced computational and storage models.
2nd week Lecture: Introduction to advanced distributed systems.
3rd week Lecture: Distributed file systems and the MapReduce platform for parallel computing.
4th week Lab: Practical application. The Hadoop framework.
5th week Lecture: Spark architecture and implementation of algorithms with RDDs.
6th week Lecture: Data processing with the Scala programming language in Spark.
7th week Lab: Practical application. The Spark framework.
8th week Lecture: Basic algorithms with MapReduce and Spark. High-level languages for data analysis. 
9th week Lecture: Entity Resolution in Spark.
10th week Lecture: Resource management in distributed systems: YARN, Mesos, Kubernetes.
11th week Lecture: Managing Data Streams: Spark Structured Streaming.
12th week: Student presentations.
13th week: Student presentations.

General Skills

Search, analysis and synthesis of data and information using the necessary technologies

Adaptation to new situations

Decision Making

Independent work

Team work

Promoting free, creative and deductive reasoning

Learning and Teaching Methods - Evaluation

Teaching methods: On site 

Use of Information and Communication Technologies: eclass, estudies

Activity            Semester workload
Lectures            26
Lab exercises       12
Thesis              60
Independent Study   52
Total               150

Assessment

I. Final exam 50% which includes:
- Multiple choice questions
- Problem solving
- Comparative evaluation of theory elements
II. Individual assignments 30%, assessed in two stages: description of the proposed approach, then implementation.
III. Group assignments 20%: presentation of selected research papers

Literature

Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets. Cambridge University Press, 2020.
Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills. Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O’Reilly Media, 2017.
Jacek Laskowski. Apache Spark Internals. [Online], 2023.
The International Journal on Very Large Data Bases (VLDBJ)
Proceedings of the VLDB Endowment (PVLDB)
IEEE Transactions on Big Data
IEEE Transactions on Knowledge and Data Engineering (TKDE)