Τεχνικές ανάλυσης πολυμεταβλητών δεδομένων

Βραχασωτάκη, Όλγα; Vrachasotaki, Olga

dc.contributor.author	Βραχασωτάκη, Όλγα	el
dc.contributor.author	Vrachasotaki, Olga	en
dc.date.accessioned	2023-03-20T10:09:39Z
dc.date.available	2023-03-20T10:09:39Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/57255
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.24953
dc.description	Εθνικό Μετσόβιο Πολυτεχνείο--Μεταπτυχιακή Εργασία. Διεπιστημονικό-Διατμηματικό Πρόγραμμα Μεταπτυχιακών Σπουδών (Δ.Π.Μ.Σ.) "Επιστήμη Δεδομένων και Μηχανική Μάθηση"	el
dc.rights	Αναφορά Δημιουργού-Μη Εμπορική Χρήση-Όχι Παράγωγα Έργα 3.0 Ελλάδα	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/gr/	*
dc.subject	Multivariate Data	en
dc.subject	PCA	en
dc.subject	Clustering	en
dc.subject	Multinomial Logistic Regression	en
dc.subject	K-Means	en
dc.subject	Linear Discriminant Analysis	en
dc.subject	Latent Class Analysis	en
dc.title	Τεχνικές ανάλυσης πολυμεταβλητών δεδομένων	el
heal.type	masterThesis
heal.classification	Μηχανική Μάθηση	el
heal.classification	Ανάλυση Δεδομένων	el
heal.classification	Data Science	en
heal.classification	Machine Learning	en
heal.language	el
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2023-02-20
heal.abstract	Η παρούσα μεταπτυχιακή εργασία επικεντρώνεται στην μελέτη πολυμεταβλητών δεδομένων μέσω της χρήσης διάφορων στατιστικών μοντέλων που χρησιμοποιούνται στα πλαίσια της επιστήμης δεδομένων. Τα δεδομένα αυτά προέρχονται από τα Human Development Reports των Ηνωμένων Εθνών και περιέχουν πληροφορίες σχετικές με την θνησιμότητα του πληθυσμού για διαφορετικές ηλικιακές ομάδες, το προσδοκώμενο όριο ζωής, αλλά και τις γενικότερες δαπάνες μιας χώρας για την εκπαίδευση και την υγεία. Με βάση αυτά τα δεδομένα εξετάζονται διαφορετικές τεχνικές, οι οποίες αναλύονται σε ξεχωριστά κεφάλαια. Η πρώτη τεχνική που εφαρμόζεται είναι αυτή της PCA. Η τεχνική αυτή, ευρέως γνωστή στη βιβλιογραφία, χρησιμοποιείται για την μείωση της διάστασης του συνόλου δεδομένων. Πιο συγκεκριμένα, αναλύεται μαθηματικά το πώς και το γιατί έχουν σημασία μόνο οι ιδιοτιμές στην ανάλυσή της, το πώς γίνεται ο υπολογισμός των συνιστωσών, ενώ παρέχονται και κάποια επιπλέον αποτελέσματα σχετικά με τις ειδικές περιπτώσεις της τεχνικής. Αναλύεται, επιπλέον, το πρόβλημα της μείωσης της διαστατικότητας και το πώς γίνεται τελικά η επιλογή των σημαντικών συνιστωσών. Ακολούθως, γίνεται μία προσπάθεια ομαδοποίησης των δεδομένων με Clustering. Η συσταδοποίηση είναι μία πολύ ωφέλιμη τεχνική για την οργάνωση των σημείων σε ομάδες με κατάλληλη επιλογή μέτρου απόστασης. Απαραίτητη κρίθηκε η παρουσίαση των διαφόρων μέτρων εγγύτητας τόσο για συνεχείς, όσο και για κατηγορικές μεταβλητές, η εξέταση της ομαδοποίησης σε Ευκλείδειο και μη χώρο και οι γενικότερες στρατηγικές συσταδοποίησης που χρησιμοποιούνται. Επιπλέον, παρουσιάζεται και το Ιεραρχικό Clustering με μία συνοπτική παρουσίαση των μεθόδων που το απαρτίζουν (Centroid, Single Linkage, Complete Linkage, Average Linkage, Ward). Με αφορμή την οικονομική ομάδα στην οποία ανήκει κάθε χώρα εφαρμόζεται το Μοντέλο Λογιστικής Παλινδρόμησης πολλών κατηγοριών. Πρόκειται, ουσιαστικά, για το μοντέλο λογιστικής παλινδρόμησης, όταν η μεταβλητή απόκρισης έχει πάνω από δύο τιμές.
Στο σημείο αυτό γίνεται διαχωρισμός των μοντέλων ανάλογα με το αν οι κατηγορίες της μεταβλητής απόκρισης είναι διατεταγμένες (ordinal) ή όχι (nominal), και αναλύονται οι διαφορετικές περιπτώσεις. Ενδιάμεσα παρέχονται και ενδεικτικά παραδείγματα που έχουν εφαρμοστεί σε άλλα σύνολα δεδομένων για την καλύτερη επεξήγηση των μεθόδων. Κρατώντας και πάλι στο επίκεντρο την οικονομική κατάταξη των χωρών, μελετήθηκε σε αυτό το σημείο η μέθοδος της LDA. Η μέθοδος αυτή, αν και δεν βρίσκεται υψηλά στις προτιμήσεις των αναλυτών λόγω των περιορισμών της, παρέχει ικανοποιητικά αποτελέσματα, τα οποία ταυτίζονται σε μερικές περιπτώσεις με αυτά της Λογιστικής Παλινδρόμησης. Οι εν λόγω περιορισμοί αφορούν την κατανομή που ακολουθούν οι επεξηγηματικές μεταβλητές ως προς τις διαφορετικές κατηγορίες της μεταβλητής απόκρισης και τον πίνακα συνδιακύμανσης που μοιράζονται οι μεταβλητές. Επιπλέον, για λόγους πληρότητας αναφέρεται και η μέθοδος QDA, παρόμοια με την LDA, αλλά για τετραγωνικό όριο διαχωρισμού. Μία τελευταία τεχνική που εφαρμόζεται είναι αυτή της Λανθάνουσας Τάξης (LCA). Η τεχνική αυτή εφαρμόζεται με κατηγορικές μεταβλητές και χρησιμοποιεί τον αλγόριθμο Expectation Maximization για την μεγιστοποίηση της συνάρτησης log-likelihood. Αναλύεται το μοντέλο, οι παράμετροί του και οι εκτιμήσεις τους. Φυσικά, όπως σε κάθε τεχνική, υπάρχουν και οι αντίστοιχοι περιορισμοί σχετικά με το μέγεθος του δείγματος, το πλήθος των μεταβλητών, των κατηγοριών των μεταβλητών, αλλά και την ύπαρξη ή όχι συμμεταβλητών. Σκοπός είναι η οργάνωση των παρατηρήσεων σε ομάδες, παρόμοια με το Clustering, βασιζόμενοι στην εύρεση σχέσεων ανάμεσα στις μεταβλητές. Στο τελευταίο κεφάλαιο παρουσιάζεται και η επεξεργασία του συνόλου δεδομένων με αναλυτικά τα αποτελέσματα για κάθε μέθοδο σε ξεχωριστή ενότητα. Σε παράρτημα στο τέλος αυτού του κεφαλαίου παρατίθεται και ο κώδικας στην R με τις εντολές που χρησιμοποιήθηκαν για τα αποτελέσματα και τα γραφήματα.
Ως επιπλέον ενότητα παρατίθεται η σύγκριση των μεθόδων με βάση τα αποτελέσματα και η εξαγωγή των τελικών συμπερασμάτων για τις χώρες.	el
heal.abstract	This master's thesis studies multivariate data using a range of statistical models common in data science. The data come from the Human Development Reports of the United Nations and cover population mortality for different age groups, life expectancy, and a country's overall expenditure on education and health. Several techniques are applied to these data, each analyzed in a separate chapter. The first technique is PCA. This technique, widely known in the literature, is used to reduce the dimension of the data set. More specifically, the thesis shows mathematically how and why only the eigenvalues matter in the analysis and how the components are calculated, and it provides some additional results on special cases of the technique. It also analyzes the problem of dimensionality reduction and how the important components are finally selected. Next, an attempt is made to group the data with Clustering. Clustering is a very useful technique for organizing points into groups, given an appropriate choice of distance measure. The thesis therefore presents the various proximity measures for both continuous and categorical variables, examines clustering in Euclidean and non-Euclidean space, and reviews the more general clustering strategies in use. In addition, Hierarchical Clustering is presented, with a brief description of the methods that make it up (Centroid, Single Linkage, Complete Linkage, Average Linkage, Ward). Prompted by the economic group to which each country belongs, the Multicategory Logistic Regression Model is applied. This is essentially the logistic regression model when the response variable has more than two values.
At this point, the models are separated according to whether the categories of the response variable are ordered (ordinal) or not (nominal), and the different cases are analyzed. Illustrative examples applied to other data sets are interspersed to better explain the methods. Keeping the economic ranking of the countries in focus, the thesis then studies the method of LDA. This method, although not highly favored by analysts because of its limitations, provides satisfactory results, which in some cases coincide with those of Logistic Regression. These limitations concern the distribution followed by the explanatory variables within the different categories of the response variable and the covariance matrix shared by the variables. For the sake of completeness, the QDA method is also mentioned; it is similar to LDA but yields a quadratic separation boundary. The final technique applied is Latent Class Analysis (LCA). This technique is applied to categorical variables and uses the Expectation Maximization algorithm to maximize the log-likelihood function. The model, its parameters and their estimates are analyzed. As with every technique, there are corresponding limitations regarding the sample size, the number of variables, the number of categories of the variables, and the presence or absence of covariates. The purpose is to organize observations into groups, similar to Clustering, based on finding relationships between variables. The last chapter presents the processing of the data set, with the results for each method detailed in a separate section. An appendix at the end of this chapter lists the R code used to produce the results and graphs. An additional section compares the methods based on the results and draws the final conclusions for the countries.	en
heal.advisorName	Καρώνη, Χρυσηίς	el
heal.advisorName	Caroni, Chrysseis	en
heal.committeeMemberName	Χρυσαφίνος, Κωνσταντίνος	el
heal.committeeMemberName	Παπανικολάου, Βασίλης	el
heal.committeeMemberName	Chrysafinos, Konstantinos	en
heal.committeeMemberName	Papanikolaou, Vassilis	en
heal.academicPublisher	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών	el
heal.academicPublisherID	ntua
heal.numberOfPages	177 σ.	el
heal.fullTextAvailability	false