HEAL DSpace

Development of large-scale big data system with state management on serverless cloud architectures

DSpace/Manakin Repository

dc.contributor.author Nikitas, Nikolaos en
dc.contributor.author Νικήτας, Νικόλαος el
dc.date.accessioned 2022-01-28T14:35:07Z
dc.date.available 2022-01-28T14:35:07Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/54446
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.22144
dc.description National Technical University of Athens--Master's Thesis. Interdisciplinary-Interdepartmental Postgraduate Studies Programme (D.P.M.S.) "Data Science and Machine Learning" el
dc.rights Attribution-NonCommercial-NoDerivs 3.0 Greece *
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/gr/ *
dc.subject Big data analytics frameworks en
dc.subject Distributed systems en
dc.subject Cloud computing en
dc.subject Serverless architecture en
dc.subject DevOps en
dc.subject Big data processing frameworks el
dc.subject Distributed systems el
dc.subject Cloud computing el
dc.subject Serverless en
dc.subject Architecture el
dc.subject Practices el
dc.title Development of large-scale big data system with state management on serverless cloud architectures en
dc.title Ανάπτυξη Συστήματος Μεγάλης Κλίμακας Δεδομένων με Διατήρηση της Κατάστασης σε Serverless Αρχιτεκτονικές Υπολογιστικών Νεφών el
heal.type masterThesis
heal.classification Big Data Serverless Analytics en
heal.language el
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2021-10-01
heal.abstract In recent years, interest in Big Data has grown markedly, and considerable effort has gone into implementing and optimizing distributed Big Data analytics frameworks such as Apache Spark and Hadoop. In particular, industry and academia have focused on improving the all-to-all network transfer of intermediate data between MapReduce computation units, i.e., the shuffle operation between the stages of a MapReduce-like workload, which is of paramount importance and remains a serious bottleneck. Apache Spark is a widely used Big Data processing system. Nevertheless, its shuffle operation faces several challenges: the I/O bottleneck caused by the limited throughput of HDDs when small pieces of intermediate shuffle data need to be accessed; the lack of fault tolerance when a Worker node crashes or is deallocated, since any intermediate data stored on it is permanently lost; and the lack of adaptation to containerized environments with isolated, stateless processes, which are common in the cloud. In this master's thesis, our goal is to adopt the serverless paradigm in widely embraced large-scale data analytics frameworks and to handle intermediate state efficiently. To that end, we implemented a distributed disaggregated storage architecture that maintains the intermediate shuffle data produced by analytics workloads, and we use Spark to execute these workloads. Our implementation, named Cherry, is an open-source, distributed, task-aware caching shuffle service for serverless analytics. Cherry allows a de facto stateful workload to be executed seamlessly in a serverless manner by exploiting a remote storage engine, while reducing shuffle execution time through a novel task-level caching mechanism for intermediate data.
Our shuffle service is built on top of Spark and extends its existing shuffle operation, achieving both fault-tolerant and elastic execution by storing the intermediate state between stages remotely. This ensures that no shuffle data is lost when a Worker node crashes. Taking advantage of this state disaggregation, we also present a look-ahead, task-aware caching mechanism that proactively "warms up" the available caching layer with only the shuffle data that is about to be requested, avoiding workload delays due to I/O bottlenecks. We also use Kubernetes as the container orchestrator in order to approximate the realistic containerized environments of isolated processes that exist in the cloud. Cherry's architecture and implementation are analyzed thoroughly, and we show how it interacts with the Big Data processing system used in our case, demonstrating that it can be integrated seamlessly with different MapReduce-like analytics frameworks. Our implementation is tested on both synthetic and real data workloads against Apache Spark and its existing External Shuffle Service, and the results are presented. At the end of the thesis, we summarize our work and present possible future extensions. en
heal.advisorName Koziris, Nectarios en
heal.committeeMemberName Konstantinou, Ioannis en
heal.committeeMemberName Kalogeraki, Vana en
heal.academicPublisher National Technical University of Athens. School of Electrical and Computer Engineering el
heal.academicPublisherID ntua
heal.numberOfPages 78 p. el
heal.fullTextAvailability false


The following licenses are associated with this item:

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Greece.