Τεχνικές συμπίεσης για την ανάλυση δεδομένων μεγάλης κλίμακας

Κυριτσάς, Γεώργιος; Kyritsas, Georgios

dc.contributor.author	Κυριτσάς, Γεώργιος	el
dc.contributor.author	Kyritsas, Georgios	en
dc.date.accessioned	2021-07-28T09:07:30Z
dc.date.available	2021-07-28T09:07:30Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/53714
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.21412
dc.rights	Default License
dc.subject	Big Data	en
dc.subject	Ρυθμός παραγωγής δεδομένων	el
dc.subject	In Memory Databases	en
dc.subject	Data Compression	en
dc.subject	Apache Spark	en
dc.subject	Apache Parquet	en
dc.subject	Κατανεμημένα συστήματα	el
dc.subject	Συστήματα βασισμένα στην κύρια μνήμη (IMDBs)	el
dc.subject	Επεξεργασία δεδομένων κατά στήλη	el
dc.subject	Συμπίεση δεδομένων	el
dc.title	Τεχνικές συμπίεσης για την ανάλυση δεδομένων μεγάλης κλίμακας	el
heal.type	bachelorThesis
heal.classification	Επιστήμη υπολογιστών	el
heal.classification	Computer science	en
heal.language	el
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2020-09-18
heal.abstract	Στις μέρες μας ο ρυθμός παραγωγής δεδομένων αυξάνεται με γοργούς ρυθμούς, ξεπερνώντας κατά πολύ το ρυθμό αύξησης της υπολογιστικής ισχύος. Η αξιοποίηση αυτού του όγκου δεδομένων μπορεί να οδηγήσει σε βαθύτερη κατανόηση συμπεριφορών και συστημάτων, όπως για παράδειγμα της λειτουργίας των ανθρώπινων κυττάρων ή των κινήσεων του χρηματιστηρίου. Μια ευρέως διαδεδομένη λύση για την επεξεργασία μεγάλου όγκου δεδομένων είναι αυτή των κατανεμημένων συστημάτων, δηλαδή ενός συνόλου διασυνδεδεμένων υπολογιστών, οι οποίοι λειτουργούν σαν ένα ενιαίο υπολογιστικό σύστημα αυξημένων δυνατοτήτων. Μια άλλη λύση που κερδίζει συνεχώς έδαφος είναι η χρήση συστημάτων που χρησιμοποιούν την κύρια μνήμη για την επεξεργασία των δεδομένων. Καθώς η κύρια μνήμη είναι πολύ ταχύτερη από το δίσκο, τα συστήματα αυτά μπορούν να επιτύχουν τάξεις μεγέθους καλύτερες επιδόσεις σε σχέση με τα συμβατικά. Το πρόβλημα είναι ότι η χωρητικότητα της κύριας μνήμης είναι κατά πολύ μικρότερη από αυτή ενός δίσκου. Σκοπός της παρούσας διπλωματικής είναι να εξετάσουμε τη χρήση της συμπίεσης στον τομέα της ανάλυσης δεδομένων μεγάλης κλίμακας. Εξετάζουμε τους τρόπους με τους οποίους μπορούμε να συμπιέσουμε δεδομένα, ώστε να χωρέσουν στη μνήμη, καθώς και την επίδραση της συμπίεσης στην απόδοση του συστήματος. Έχοντας τα δεδομένα στην κύρια μνήμη, εξαλείφεται ένα σημαντικό κομμάτι καθυστέρησης, αυτό της μεταφοράς δεδομένων από το δίσκο. Προκειμένου να εξετάσουμε αυτή τη προσέγγιση, δημιουργήσαμε το hybridcolumnar, ένα σύστημα συμπίεσης δεδομένων και εκτέλεσης ερωτημάτων απευθείας στη μνήμη, χωρίς να έχει προηγηθεί αποσυμπίεση τους. Στο σύστημα αυτό υλοποιήσαμε διάφορες τεχνικές συμπίεσης με σκοπό να μελετήσουμε τη συμπεριφορά τους, τόσο σε χώρο όσο και σε χρόνο, ανάλογα με τα χαρακτηριστικά του συνόλου δεδομένων. Επίσης συγκρίναμε το σύστημα που υλοποιήσαμε, με ένα από τα κυριότερα και ευρέως χρησιμοποιούμενα συστήματα στο χώρο της ανάλυσης δεδομένων, το Parquet.	el
heal.abstract	The amount of data created every year is growing far faster than computing performance, and the gap between them is expected to widen. Analyzing and exploiting this vast amount of data can yield new insight into systems and behaviors, such as the inner workings of human cells or stock market movements. Distributed systems are an effective and popular solution for taming the vast amount of data produced. A distributed system is composed of a set of common computers acting as a single computer with combined computing and storage capacity. Another option that has been gaining ground is the use of in-memory databases (IMDBs), which use main memory (RAM) as the primary means of data storage. As RAM is much faster than spinning and even solid-state disks, these systems achieve performance orders of magnitude greater than disk-based systems. The downside of this approach is that the capacity of system memory is orders of magnitude smaller than the capacity of a hard disk. The purpose of this thesis is to evaluate the use of data compression in large-scale data processing. We examine ways to compress data so that it fits in main memory, and the impact of compression on system performance. Keeping data in main memory eliminates a major bottleneck: data movement between memory and disk. To evaluate this approach, we created hybrid columnar, a system that stores and queries data directly in memory, without prior decompression. In this system we implemented various compression schemes in order to evaluate their performance in both time and space, depending on the characteristics of the dataset. We also compare the system we created with Apache Parquet, one of the most established compressed data formats in the field of large-scale data processing.	en
heal.advisorName	Κοζύρης, Νεκτάριος	el
heal.committeeMemberName	Κοζύρης, Νεκτάριος	el
heal.committeeMemberName	Γκούμας, Γεώργιος	el
heal.committeeMemberName	Τσουμάκος, Δημήτριος	el
heal.academicPublisher	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών. Εργαστήριο Υπολογιστικών Συστημάτων	el
heal.academicPublisherID	ntua
heal.numberOfPages	98 σ.	el
heal.fullTextAvailability	false