dc.contributor.author | Nikitas, Nikolaos | en |
dc.contributor.author | Νικήτας, Νικόλαος | el |
dc.date.accessioned | 2022-01-28T14:35:07Z | |
dc.date.available | 2022-01-28T14:35:07Z | |
dc.identifier.uri | https://dspace.lib.ntua.gr/xmlui/handle/123456789/54446 | |
dc.identifier.uri | http://dx.doi.org/10.26240/heal.ntua.22144 | |
dc.description | National Technical University of Athens--Master's Thesis. Interdisciplinary-Interdepartmental Postgraduate Studies Program "Data Science and Machine Learning" | en |
dc.rights | Attribution-NonCommercial-NoDerivs 3.0 Greece | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/gr/ | * |
dc.subject | Big data analytics frameworks | en |
dc.subject | Distributed systems | en |
dc.subject | Cloud computing | en |
dc.subject | Serverless architecture | en |
dc.subject | DevOps | en |
dc.subject | Πλαίσια επεξεργασίας μεγάλων δεδομένων | el |
dc.subject | Κατανεμημένα συστήματα | el |
dc.subject | Υπολογιστικό νέφος | el |
dc.subject | Serverless | en |
dc.subject | Αρχιτεκτονική | el |
dc.subject | Πρακτικές | el |
dc.title | Development of large-scale big data system with state management on serverless cloud architectures | en |
dc.title | Ανάπτυξη Συστήματος Μεγάλης Κλίμακας Δεδομένων με Διατήρηση της Κατάστασης σε Serverless Αρχιτεκτονικές Υπολογιστικών Νεφών | el |
heal.type | masterThesis | |
heal.classification | Big Data Serverless Analytics | en |
heal.language | el | |
heal.language | en | |
heal.access | free | |
heal.recordProvider | ntua | el |
heal.publicationDate | 2021-10-01 | |
heal.abstract | In recent years, interest in Big Data has grown markedly, and considerable effort has gone into implementing and optimizing distributed Big Data analytics frameworks such as Apache Spark and Hadoop. In particular, industry and academia have focused on improving the all-to-all network transfer of intermediate data between MapReduce computation units, i.e., the shuffle operation between the stages of a MapReduce-like workload, which is of paramount importance and remains a serious bottleneck. Apache Spark is a widely used Big Data processing system. Nevertheless, its shuffle operation poses challenges that need to be addressed: the I/O bottleneck caused by the limited throughput of HDDs when small intermediate shuffle data must be accessed; the lack of fault tolerance when a Worker node crashes or is deallocated, since any intermediate data stored on it are permanently lost; and the absence of adaptation to the containerized environments of isolated, stateless processes that are common in the cloud. In this master's thesis, our goal is to adopt the serverless paradigm in widely embraced large-scale data analytics frameworks and to handle intermediate state efficiently. To that end, we implemented a distributed disaggregated storage architecture that maintains the intermediate shuffle data produced by analytics workloads, and we use Spark to execute these workloads. Our implementation, named Cherry, is an open-source, distributed, task-aware caching shuffle service for serverless analytics. Cherry allows a de facto stateful workload to be executed seamlessly in a serverless manner by exploiting a remote storage engine, while reducing shuffle execution time through a novel task-level caching mechanism for intermediate data. 
Our shuffle service is built on top of Spark and extends its existing shuffle operation, achieving both fault-tolerant and elastic execution by storing intermediate state between stages remotely. This ensures that no shuffle data are lost due to Worker node crashes. Taking advantage of this state disaggregation, we also present a look-ahead, task-aware caching mechanism that proactively warms up its caching layer with only the shuffle data that are about to be requested, avoiding workload delays due to I/O bottlenecks. We also use Kubernetes as a container orchestrator in order to approximate the realistic containerized environments of isolated processes that exist in the cloud. Cherry's architecture and implementation are thoroughly analyzed, and we show how it interacts with the Big Data processing system used in our case, demonstrating that it can be integrated seamlessly with different MapReduce-like analytics frameworks. Our implementation is tested on both synthetic and real data workloads against Apache Spark and its existing External Shuffle Service, and the results are presented. At the end of the thesis, we summarize our work and present possible future extensions. | en |
heal.advisorName | Koziris, Nectarios | en |
heal.committeeMemberName | Konstantinou, Ioannis | en |
heal.committeeMemberName | Kalogeraki, Vana | en |
heal.academicPublisher | National Technical University of Athens. School of Electrical and Computer Engineering | en |
heal.academicPublisherID | ntua | |
heal.numberOfPages | 78 p. | en |
heal.fullTextAvailability | false | |