ML-driven automated framework for tuning Spark applications

Νικητοπούλου, Δήμητρα; Nikitopoulou, Dimitra

dc.contributor.author	Νικητοπούλου, Δήμητρα	el
dc.contributor.author	Nikitopoulou, Dimitra	en
dc.date.accessioned	2020-10-12T10:48:54Z
dc.date.available	2020-10-12T10:48:54Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/51401
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.19099
dc.rights	Default License
dc.subject	Σπαρκ	el
dc.subject	Ρύθμιση παραμέτρων	el
dc.subject	Βελτιστοποίηση	el
dc.subject	Επίδραση παραμέτρων	el
dc.subject	Μοντελοποίηση επίδοσης	el
dc.subject	Spark	en
dc.subject	Parameter tuning	en
dc.subject	Optimization	en
dc.subject	Parameter impact	en
dc.subject	Performance modeling	en
dc.title	ML-driven automated framework for tuning Spark applications	en
heal.type	bachelorThesis
heal.classification	Machine learning	en
heal.classification	Μηχανική μάθηση	el
heal.language	el
heal.language	en
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2020-07-20
heal.abstract	Σήμερα, ένας ολοένα αυξανόμενος αριθμός από δεδομένα χαρακτηρίζει την εποχή μας, επειδή συλλέγονται με εύκολο και φθηνό τρόπο από διάφορες συσκευές που είναι συνδεδεμένες στο διαδίκτυο. Ο χειρισμός αυτών των δεδομένων απαιτεί πόρους, που παρέχονται εύκολα από το υπολογιστικό νέφος, αλλά και νέα εργαλεία που να επιταχύνουν τη διαδικασία. Προς αυτή την κατεύθυνση, η κατανεμημένη εκτέλεση των διεργασιών και η αξιοποίηση της υψηλής ταχύτητας που προσφέρει η μνήμη έναντι του σκληρού δίσκου είναι υψίστης σημασίας. Το Spark είναι ένα εργαλείο που εκμεταλλεύεται αυτές τις δύο παρατηρήσεις και μπορεί να χειρίζεται εύκολα μεγάλο όγκο δεδομένων. Η βέλτιστη εκτέλεση των προγραμμάτων του, όμως, εξαρτάται σε μεγάλο βαθμό από τη σωστή ρύθμιση μιας σειράς παραμέτρων, οι οποίες μάλιστα είναι πολλές σε αριθμό. Στην παρούσα εργασία, σχεδιάζουμε ένα σύστημα που εκτελεί αυτόματη ρύθμιση των παραμέτρων του Spark, ανάλογα με την εφαρμογή που εκτελείται και το μέγεθος των δεδομένων εισόδου αυτής. Εντοπίζουμε τις παραμέτρους που έχουν τη μεγαλύτερη επίδραση στην εκτέλεση των εφαρμογών και σκιαγραφούμε τη μεθοδολογία για τη ρύθμισή τους με σκοπό την ελαχιστοποίηση του χρόνου εκτέλεσης. Κατόπιν, ενσωματώνουμε τη λύση μας στο Spark μέσω ενός wrapper script και παρέχουμε στον χρήστη την ικανότητα να τρέχει την εντολή spark-submit που θα έτρεχε, με τη διαφορά ότι η αντίστοιχη εντολή που θα εκτελεστεί στην πραγματικότητα είναι αυτή που χρησιμοποιεί τη βέλτιστη παραμετροποίηση. Τέλος, παρουσιάζουμε την επιτάχυνση της εκτέλεσης που πετύχαμε για ένα σύνολο εφαρμογών, με χρήση των οποίων κατασκευάσαμε το σύστημα, καθώς και άλλων άγνωστων εφαρμογών για να εντοπίσουμε την ικανότητα γενίκευσης της βελτιστοποιητικής ικανότητας της μεθοδολογίας μας.	el
heal.abstract	The volume of data collected today keeps growing, because data are easy and cheap to gather from the many devices connected to the Internet. Processing big data demands resources, which the cloud provides conveniently, as well as tools that speed up the process. To this end, distributed execution and exploiting the speed of memory over the hard disk are of utmost importance. Spark is a tool that takes advantage of both observations and can easily process large volumes of data. Executing its workloads optimally, however, depends to a great extent on appropriately tuning a large number of parameters. In this thesis, we design a framework that automatically tunes Spark's parameters according to the workload and the size of its input data. We identify the parameters with the greatest impact on application execution and outline a methodology for tuning them so as to minimize execution time. Next, we integrate our solution into Spark through a wrapper script, so that the user runs the usual spark-submit command while the command actually executed is the one that uses the optimal configuration. Finally, we present the speedup achieved for the set of applications used to construct the framework, as well as for other, unseen applications, in order to evaluate how well the framework's optimization capability generalizes.	en
heal.advisorName	Soudris, Dimitrios	en
heal.advisorName	Σούντρης, Δημήτριος	el
heal.committeeMemberName	Soudris, Dimitrios	en
heal.committeeMemberName	Tsanakas, Panayiotis	en
heal.committeeMemberName	Goumas, Georgios	en
heal.academicPublisher	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών	el
heal.academicPublisherID	ntua
heal.numberOfPages	94 σ.	el
heal.fullTextAvailability	false