Technologies and Infrastructure for Big Data

Credits: 6

Semester: 2

Course: Core

Language of the course: English

Objectives

Students will learn: the basic mechanisms and algorithms for analyzing large and complex data and extracting knowledge from it; the principles of data processing, storage, and protection; ethical and legal norms for working with big data; the principles of constructing and organizing the development of modern software solutions for big data processing; and the principles of processing large volumes of data to extract knowledge using machine learning methods. Students will learn to design and develop complex data processing solutions using one or more data mining and information retrieval algorithms; to develop new algorithms based on existing ones; to apply knowledge of the norms and principles of data storage and protection when assessing scientific activity in terms of its potential impact on society; to design and develop software applications for data processing on a computational cluster using modern big data technologies and equipment; and to apply machine learning techniques to extract knowledge using modern big data processing systems.
Students will study the theoretical foundations of the applied data processing algorithms and methods of combining algorithms to achieve the best results; skills for analyzing and evaluating research activities in accordance with generally accepted international norms and ethical standards for testing and research involving human subjects; skills for working with the software interfaces of large-scale data processing systems for batch and stream processing; and skills for working with the MLlib machine learning library.

Contents

Main topics of the course:
The main stages in the development of big data processing systems, the main types of systems and their purposes, and the evolution of data processing methods.
The purpose of the HDFS distributed file system, the basic principles of its internal design, the data replication procedure, and fault tolerance mechanisms.
The history of MapReduce, the principles of organizing data processing with MapReduce, and common MapReduce design patterns.
The architecture and internal design of Apache ZooKeeper, consensus algorithms, and the Paxos algorithm.
The purpose and tasks of the cluster resource manager, the architecture and internal design of YARN, the architecture and internal design of Mesos, and centralized versus two-level approaches to scheduling.
The principles of organizing batch data processing, the architecture and internal design of Apache Spark, and data processing with Spark.
The principles of stream data processing, the architecture and internal design of Apache Kafka and Apache Flink, and stream processing with Spark Streaming and Apache Flink.
The principles of organizing interactive data processing, the Lambda and Kappa architectures, and interactive data processing with Spark SQL.
Representing graph data for batch processing and processing graph data with Spark GraphX.
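To give a concrete sense of the MapReduce model covered in the topics above, the following pure-Python sketch implements a word count with explicit map, shuffle, and reduce phases. It is an illustrative toy, not Hadoop code: the function names and data are invented for the example, and a real framework would distribute each phase across cluster nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    return chain.from_iterable(
        ((word, 1) for word in doc.split()) for doc in documents
    )

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values; here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big systems", "data processing"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'systems': 1, 'processing': 1}
```

Because the map function is applied independently to each document and the reduce function only sees values grouped under one key, both phases can be parallelized across machines with no shared state, which is the property that lets MapReduce scale to large datasets.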

Format

Lectures and laboratory work.

Assessment

Examination.