Ανάλυση επίδοσης και μοντελοποίηση του αλγορίθμου συσταδοποίησης k-means σε κεντρικό και κατανεμημένο περιβάλλον

Αφεντουλίδης, Γρηγόριος; Afentoulidis, Grigorios

dc.contributor.author	Αφεντουλίδης, Γρηγόριος	el
dc.contributor.author	Afentoulidis, Grigorios	en
dc.date.accessioned	2016-05-12T07:33:00Z
dc.date.available	2016-05-12T07:33:00Z
dc.date.issued	2016-05-12
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/42484
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.10706
dc.rights	Αναφορά Δημιουργού-Μη Εμπορική Χρήση-Όχι Παράγωγα Έργα 3.0 Ελλάδα	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/gr/	*
dc.subject	Αλγόριθμος k-means	el
dc.subject	Συσταδοποίηση	el
dc.subject	Συστήματα profiling	el
dc.subject	Προβλεπτική μοντελοποίηση	el
dc.subject	K-means algorithm	en
dc.subject	Clustering	en
dc.subject	Hadoop	en
dc.subject	Apache Spark	en
dc.subject	Profiling systems	en
dc.subject	Predictive modelling	en
dc.subject	Weka	en
dc.title	Ανάλυση επίδοσης και μοντελοποίηση του αλγορίθμου συσταδοποίησης k-means σε κεντρικό και κατανεμημένο περιβάλλον	el
heal.type	bachelorThesis
heal.classification	Computer science	el
heal.language	el
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2015-07-20
heal.abstract	In recent years, interest in running data analytics jobs has grown rapidly in science, technology, and business alike. This has led to the development of execution engines that are now offered as services by IaaS providers and take on the execution of such jobs. Because these engines differ in their characteristics and runtime architectures, there is a need to analyze both the computational resources they require and the execution time they achieve. The task is made considerably more complex by the fact that both are also affected by the execution parameters of the underlying algorithm. Such an analysis gives us the means to understand the advantages and disadvantages of each execution engine under specific circumstances, and lets us satisfy user policies in cloud environments that relate to the cost and time constraints of executions. For this purpose we must conduct an experimental analysis of these engines through a profiling process, in which we measure resource usage as well as overall execution time for carefully selected execution samples. The results of this process enable us to construct static predictive models that simulate the performance of the engines for varying execution parameters. In this diploma thesis we study the k-means algorithm, which is used for clustering jobs, in the centralized environment of Weka and the distributed environment of Apache Spark. We propose and implement two profiling architectures that collect metrics on the resources used and the execution time of every experimental execution. Using the results of the profiling procedure, we analyze the performance of the two engines as the execution parameters vary, and we highlight the advantages and disadvantages we observe. 
We also use the collected data to construct predictive models for every metric and every engine, and we comment on the accuracy of those models as well as their usefulness. We vary the size of the cluster in the distributed version in order to check the scalability of the algorithm, and we observe a time improvement of up to 30%. Finally, we attempt a comparison of the two engines, which shows the superiority of Spark even for very small datasets.	en
heal.abstract	Τα τελευταία χρόνια αυξάνεται ραγδαία το ενδιαφέρον για την εκτέλεση data analytics εργασιών τόσο στον επιστημονικό και τεχνολογικό όσο και στον επιχειρηματικό τομέα. Το φαινόμενο αυτό οδήγησε στην ανάπτυξη μηχανών εκτέλεσης που προσφέρονται ως υπηρεσίες σε παρόχους IaaS και αναλαμβάνουν την διεκπεραίωση τέτοιων εργασιών. Καθώς όμως οι μηχανές αυτές εισάγουν διαφορετικά χαρακτηριστικά και αρχιτεκτονικές εκτέλεσης, υπάρχει η ανάγκη να αναλύσουμε τόσο τους απαιτούμενους υπολογιστικούς πόρους που αυτές χρειάζονται, όσο και τη χρονική επίδοση που αυτές επιτυγχάνουν. Αυτό μάλιστα γίνεται ακόμα πιο περίπλοκο εφόσον επηρεάζεται και από τις παραμέτρους εκτέλεσης των αλγορίθμων που υλοποιούν τις εν λόγω εργασίες. Η αξία της ανάλυσης αυτής έγκειται στο γεγονός ότι θα προσφέρει τα εφόδια να αναγνωρίσουμε τα πλεονεκτήματα της κάθε μηχανής υπό συγκεκριμένες συνθήκες, ενώ ταυτόχρονα θα ικανοποιήσουμε πολιτικές χρηστών σε περιβάλλοντα cloud που σχετίζονται με το κόστος και την ταχύτητα εκτέλεσης. Για το σκοπό αυτό είναι αναγκαία η πειραματική ανάλυση των μηχανών εκτέλεσης μέσω μιας διαδικασίας profiling όπου θα μετρούμε τη χρήση των υπολογιστικών πόρων καθώς και τη διάρκεια εκτέλεσης για προσεκτικά επιλεγμένα δείγματα εκτέλεσης. Με τα αποτελέσματα αυτά μπορούμε να κατασκευάσουμε στατικά μοντέλα που να προσομοιώνουν την συμπεριφορά των μηχανών για μεταβολή διαφορετικών παραμέτρων εκτέλεσης. Στην παρούσα διπλωματική εργασία αναλαμβάνουμε να μελετήσουμε τον αλγόριθμο k-means που χρησιμοποιείται για εργασίες συσταδοποίησης δεδομένων, στο κεντρικό περιβάλλον Weka και στο κατανεμημένο περιβάλλον Apache Spark. Προτείνουμε και υλοποιούμε δύο αρχιτεκτονικές profiling για την ανάκτηση μετρικών που σχετίζονται με τη χρήση των υπολογιστικών πόρων και τη χρονική επίδοση κάθε πειραματικής εκτέλεσης. 
Αναλύουμε από τα αποτελέσματα των μετρήσεών μας τις συμπεριφορές εκτέλεσης των δύο μηχανών καθώς μεταβάλλουμε τις παραμέτρους εκτέλεσης και αναδεικνύουμε πλεονεκτήματα και μειονεκτήματα αυτών. Χρησιμοποιούμε τα δεδομένα που συλλέξαμε για την κατασκευή μοντέλων για κάθε μετρική και μηχανή εκτέλεσης και αποδεικνύουμε την ακρίβεια καθώς και τη χρησιμότητα αυτών ως προβλεπτικά μοντέλα. Ελέγχουμε την κλιμακωσιμότητα του αλγορίθμου στην κατανεμημένη εκδοχή για διαφορετικό μέγεθος cluster και παρατηρούμε χρονική βελτίωση που αγγίζει το 30%. Τέλος επιχειρούμε σύγκριση των δύο μηχανών εκτέλεσης που μας αναδεικνύει την υπεροχή του Spark ακόμα και για πολύ μικρά μεγέθη dataset.	el
heal.advisorName	Κοζύρης, Νεκτάριος	el
heal.committeeMemberName	Παπασπύρου, Νικόλαος	el
heal.committeeMemberName	Τσουμάκος, Δημήτριος	el
heal.academicPublisher	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών	el
heal.academicPublisherID	ntua
heal.numberOfPages	113 σ.	el
heal.fullTextAvailability	true