HEAL DSpace

Load-centric data shuffling with a patch-based repartitioning algorithm exploiting the data placement and distribution



dc.contributor.author Kassela, Evdokia en
dc.contributor.author Κασσελά, Ευδοκία el
dc.date.accessioned 2024-03-11T12:36:02Z
dc.date.available 2024-03-11T12:36:02Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/58977
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.26673
dc.description National Technical University of Athens--Master's Thesis. Interdisciplinary-Interdepartmental Postgraduate Studies Programme "Data Science and Machine Learning" el
dc.rights Default License
dc.subject Data skew el
dc.subject Data repartitioning el
dc.subject Workload el
dc.subject Dataset joins el
dc.subject Distributed execution el
dc.subject Data skew en
dc.subject Data shuffling en
dc.subject Load balancing en
dc.subject Reduce-side processing en
dc.subject Distributed execution en
dc.title Load-centric data shuffling with a patch-based repartitioning algorithm exploiting the data placement and distribution en
heal.type masterThesis
heal.classification computer science en
heal.classification data science en
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2023-10-31
heal.abstract This diploma thesis aims to develop a novel data repartitioning algorithm for processing unordered, skewed data in distributed environments. In workloads involving joins and aggregations, skew in the join or group-by attribute values typically causes load imbalance, where one worker becomes a straggler. Research solutions that employ a subset-replicate partitioning methodology and rely on cost models address this problem, but they are based on custom-built execution engines or custom hardware. General-purpose distributed processing engines that are widely used in industry try to address the problem through dynamic task monitoring and partition resizing based on user-defined limits. Similarly, code-level user solutions such as key salting can be used to create more equally sized partitions. These industry approaches tackle uneven load through task management and rely on user-defined thresholds, yet they ignore the network I/O overheads induced by replicating the unskewed side. Our goal is to develop an algorithm that can be easily integrated with any distributed processing system without involving cost models or user-defined parameters, and that addresses both the load-balancing and the replication-related network overheads that arise under data skew for reduce-side operations. Our implementation uses information about the data distribution and placement on the workers and tries to minimize data movement by creating prioritized, locality-based partitions that can be processed locally with zero or minimal data movement while maintaining an even load on the workers.
We evaluate our algorithm at different levels of skew against hash-based partitioning using three evaluation metrics: the amount of data transferred over the network, the load on the workers, and the execution time estimated with a linear model. The experimental results confirm that the load balancing achieved by our algorithm is always perfectly even, and that the additional network traffic our algorithm incurs due to data replication adds minimal overhead to the execution time at low skew. At moderate to high skew levels, the overall performance of our algorithm is superior, as it proves to be affected mainly by the worker load and less by the network I/O overheads. en
heal.advisorName Koziris, Nectarios en
heal.committeeMemberName Koziris, Nectarios en
heal.committeeMemberName Goumas, Georgios en
heal.committeeMemberName Konstantinou, Ioannis en
heal.academicPublisher National Technical University of Athens. School of Electrical and Computer Engineering el
heal.academicPublisherID ntua
heal.numberOfPages 78 p. el
heal.fullTextAvailability false

