dc.contributor.author |
Kassela, Evdokia
|
en |
dc.contributor.author |
Κασσελά, Ευδοκία
|
el |
dc.date.accessioned |
2024-03-11T12:36:02Z |
|
dc.date.available |
2024-03-11T12:36:02Z |
|
dc.identifier.uri |
https://dspace.lib.ntua.gr/xmlui/handle/123456789/58977 |
|
dc.identifier.uri |
http://dx.doi.org/10.26240/heal.ntua.26673 |
|
dc.description |
National Technical University of Athens--Master's Thesis. Interdisciplinary-Interdepartmental Postgraduate Studies Programme (D.P.M.S.) "Data Science and Machine Learning" |
en |
dc.rights |
Default License |
|
dc.subject |
Λοξότητα δεδομένων |
el |
dc.subject |
Ανακατανομή δεδομένων |
el |
dc.subject |
Φόρτος εργασίας |
el |
dc.subject |
Συνένωση συνόλων δεδομένων |
el |
dc.subject |
Κατανεμημένη εκτέλεση |
el |
dc.subject |
Data skew |
en |
dc.subject |
Data shuffling |
en |
dc.subject |
Load balancing |
en |
dc.subject |
Reduce-side processing |
en |
dc.subject |
Distributed execution |
en |
dc.title |
Load-centric data shuffling with a patch-based repartitioning algorithm exploiting the data placement and distribution |
en |
heal.type |
masterThesis |
|
heal.classification |
computer science |
en |
heal.classification |
data science |
en |
heal.language |
en |
|
heal.access |
free |
|
heal.recordProvider |
ntua |
el |
heal.publicationDate |
2023-10-31 |
|
heal.abstract |
This thesis aims to develop a novel data repartitioning algorithm that can be used to process unordered skewed data in distributed environments. In workloads involving joins and aggregations, skew in the join/group-by attribute values typically causes load-balancing issues in such environments, with one worker becoming a straggler. To address this problem, research solutions have applied a subset-replicate partitioning methodology that relies on cost models; however, they are based on custom-built execution engines or custom hardware. Typical general-purpose distributed processing engines, which are widely used in industry, try to address the problem through dynamic task monitoring and partition resizing based on user-defined limits. Similarly, code-based user solutions such as key salting can be used to create more equally sized partitions. These industry approaches tackle uneven load balancing through task management and rely on user-defined thresholds, yet they ignore the network I/O overheads induced by the replication of the unskewed side. Our goal is to develop an algorithm that can be easily integrated with any distributed processing system without involving cost models or user-defined parameters, and that addresses both the load-balancing and the replication-related network overheads that arise in the presence of data skew for reduce-side operations. Our implementation uses information about the data distribution and placement on the workers and tries to minimize data movement by creating prioritized local-based partitions that can be processed locally with zero or minimal data movement while maintaining an even load on the workers.
We evaluate our algorithm at different levels of skew against the hash-based partitioning algorithm using three evaluation parameters: the size of the data transferred over the network, the load on the workers, and the execution time estimated with a linear model. The experimental results confirm that the load balancing achieved by our algorithm is always perfectly even, and that the increased network traffic our algorithm incurs due to data replication adds minimal overhead to the execution time at low skew levels. At moderate to high skew levels, the overall performance of our algorithm is superior, as it is shown to be affected mainly by the worker load and less by the network I/O overheads. |
en |
heal.advisorName |
Koziris, Nectarios |
en |
heal.committeeMemberName |
Koziris, Nectarios |
en |
heal.committeeMemberName |
Goumas, Georgios |
en |
heal.committeeMemberName |
Konstantinou, Ioannis |
en |
heal.academicPublisher |
National Technical University of Athens. School of Electrical and Computer Engineering |
en |
heal.academicPublisherID |
ntua |
|
heal.numberOfPages |
78 p. |
en |
heal.fullTextAvailability |
false |
|