Στατιστικές μέθοδοι για την ανάλυση δεδομένων υψηλής διάστασης

Δρόσου, Κρυσταλλένια Π.; Drosou, Krystallenia P.

dc.contributor.advisor	Κουκουβίνος, Χρήστος	el
dc.contributor.author	Δρόσου, Κρυσταλλένια Π.	el
dc.contributor.author	Drosou, Krystallenia P.	en
dc.date.accessioned	2013-07-24T11:29:44Z
dc.date.available	2013-07-24T11:29:44Z
dc.date.copyright	2013-07-15	-
dc.date.issued	2013-07-24
dc.date.submitted	2013-07-15	-
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/8498
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.13355
dc.description	175 σ.	el
dc.description.abstract	Το πρόβλημα της στατιστικής μοντελοποίησης και του εντοπισμού των σημαντικών μεταβλητών σε μεγάλα σύνολα δεδομένων είναι ένα συνηθισμένο ζήτημα στις μέρες μας. Η εργασία αυτή ασχολείται με την στατιστική ανάλυση ενός συνόλου δεδομένων μεγάλων διαστάσεων. Διεξάγουμε μία σεισμική ανάλυση ευαισθησίας κινδύνου χρησιμοποιώντας σεισμικά δεδομένα που αποκτήθηκαν στην Ελλάδα κατά τη διάρκεια των ετών 1962-2003. Ο κύριος σκοπός της ανάλυσης είναι η εξαγωγή γνώσης υψηλού επιπέδου για τη χρήση ή τη λήψη αποφάσεων σ' αυτό τον τομέα. Οκτώ μη παραμετρικοί ταξινομητές που προέρχονται από μεθόδους εξόρυξης δεδομένων (πολυστρωματικά Perceptrons (MLP) Νευρωνικά Δίκτυα, Radial Basis Function Νευρωνικά (RBFN) δίκτυα, δίκτυα Bayes, διανυσματικές μηχανές υποστήριξης (SVMs), δέντρα ταξινόμησης και παλινδρόμησης (C&RT, CHAID, C5.0 αλγόριθμο, QUEST)) μας απασχολούν σε αυτή την εργασία, σε σύγκριση με τη Λογιστική Παλινδρόμηση και L1-νόρμα SVM όσον αφορά τη συνολική ακρίβεια ταξινόμησης, την ευαισθησία, την ειδικότητα και την περιοχή κάτω από την καμπύλη ROC (AUROC). Ο στόχος αυτής της εργασίας είναι διπλός. Αφενός να αξιολογήσει τη σημασία των διαφόρων μεταβλητών εισόδου, προκειμένου να εντοπίσει τους πιθανούς παράγοντες κινδύνου των μεγάλων σεισμών και αφετέρου να εξετάσει ποιοι ταξινομητές είναι οι πλέον κατάλληλοι για μία μεγάλων διαστάσεων ανάλυση δεδομένων, ανιχνεύοντας αποτελεσματικά τις σύνθετες μη γραμμικές σχέσεις και ενδεχομένως οδηγώντας σε πιο ακριβείς προβλέψεις. Συγκεκριμένα, το πρώτο κεφάλαιο ασχολείται με την βασική ιδέα των δεδομένων υψηλής διάστασης και το δεύτερο με τις τεχνικές εξόρυξης δεδομένων και μηχανικής μάθησης. Το κεφάλαιο 3 παρουσιάζει βασικές μεθόδους για τη μείωση των δεδομένων, όπως η παραγοντική ανάλυση και ανάλυση των κυρίων συνιστωσών. 
Στο τέταρτο κεφάλαιο αναφερόμαστε σε μεθόδους ταξινόμησης και παρουσιάζουμε, όπως έχουμε ήδη αναφέρει, οκτώ μη-παραμετρικούς ταξινομητές (Τεχνητά Νευρωνικά Δίκτυα (MLP, RBFN), Bayesian δίκτυα, Μηχανές Διανυσματικής Υποστήριξης (SVM), Δέντρα Ταξινόμησης και Παλινδρόμησης (C&RT, CHAID, QUEST, C5.0)), καθώς και τη Λογιστική Παλινδρόμηση και την L1-νόρμα SVM. Το κεφάλαιο 5 αναφέρεται στην αξιολόγηση ενός μοντέλου με τη χρήση μεθόδων, όπως η πολλαπλή επικύρωση και στην απόδοση των ταξινομητών που αναφέρονται παραπάνω. Επιπλέον, συζητούνται οι όροι της ευαισθησίας και ειδικότητας και γίνεται μια σύντομη αναφορά στην περιοχή κάτω από την καμπύλη (AUC). Στο τελευταίο κεφάλαιο, το κεφάλαιο 6, παρουσιάζουμε το λογισμικό Clementine το οποίο θα εφαρμοστεί σε σεισμικά δεδομένα και προχωράμε στην συνέχεια με την εφαρμογή και την ερμηνεία των αποτελεσμάτων.	el
dc.description.abstract	Statistical modelling and the identification of significant variables in large data sets is a common problem nowadays. This thesis deals with the statistical analysis of a high-dimensional data set: we conduct a seismic hazard sensitivity analysis using seismic data acquired in Greece during the years 1962-2003. The main purpose of the analysis is to extract high-level knowledge for the domain user or decision-maker. Eight non-parametric classifiers derived from data mining methods (Multilayer Perceptron (MLP) Neural Networks, Radial Basis Function Neural Networks (RBFN), Bayesian Networks, Support Vector Machines (SVMs), Classification and Regression Trees (C&RT), Chi-square Automatic Interaction Detection (CHAID), the C5.0 algorithm, and Quick, Unbiased, Efficient Statistical Tree (QUEST)) are employed in this work and compared with Logistic Regression and the l1-norm SVM in terms of overall classification accuracy, sensitivity, specificity, and area under the ROC curve (AUROC). The goal of this thesis is twofold: to assess the importance of several input variables in order to detect possible risk factors for large earthquakes, and to examine which classifiers are best suited to high-dimensional data analysis, effectively detecting complex nonlinear relationships and potentially leading to more accurate predictions. Specifically, Chapter 1 deals with the main concept of high-dimensional data and Chapter 2 with data mining and machine learning techniques. Chapter 3 presents basic methods for data reduction, such as factor analysis and principal component analysis. Chapter 4 covers classification methods and presents the eight non-parametric classifiers mentioned above (Artificial Neural Networks (MLP, RBFN), Bayesian Networks, Support Vector Machines, and classification and regression trees (C&RT, CHAID, QUEST, C5.0)), together with Logistic Regression and the l1-norm support vector machine. 
Chapter 5 covers model evaluation using methods such as cross-validation, and the performance of the classifiers mentioned above. In addition, the terms sensitivity and specificity are discussed and the area under the curve (AUC) is briefly presented. In the final chapter, Chapter 6, we present the Clementine software, which we applied to the seismic data, and proceed with the implementation and interpretation of the results.	en
dc.description.statementofresponsibility	Δρόσου Π. Κρυσταλλένια	el
dc.language.iso	el	en
dc.rights	ETDFree-policy.xml	en
dc.subject	Υψηλής διάστασης δεδομένα	el
dc.subject	Ταξινόμηση	el
dc.subject	Λογιστική παλινδρόμηση	el
dc.subject	Νευρωνικά δίκτυα	el
dc.subject	Μηχανές διανυσματικής υποστήριξης	el
dc.subject	Δέντρα αποφάσεων	el
dc.subject	Μπεϋζιανά δίκτυα	el
dc.subject	ROC καμπύλες	el
dc.subject	Ακρίβεια	el
dc.subject	Ευαισθησία	el
dc.subject	High dimensional data	en
dc.subject	Classification	en
dc.subject	Logistic regression	en
dc.subject	Neural network	en
dc.subject	Support vector machines	en
dc.subject	Decision trees	en
dc.subject	Bayesian networks	en
dc.subject	ROC curve	en
dc.subject	Accuracy	en
dc.subject	Sensitivity	en
dc.title	Στατιστικές μέθοδοι για την ανάλυση δεδομένων υψηλής διάστασης	el
dc.title.alternative	Statistical methods for the analysis of high dimensional data	en
dc.type	bachelorThesis	el (en)
dc.date.accepted	2013-07-05	-
dc.date.modified	2013-07-15	-
dc.contributor.advisorcommitteemember	Σπηλιώτης, Ιωάννης	el
dc.contributor.advisorcommitteemember	Βόντα, Φίλια	el
dc.contributor.committeemember	Κουκουβίνος, Χρήστος	el
dc.contributor.committeemember	Σπηλιώτης, Ιωάννης	el
dc.contributor.committeemember	Βόντα, Φίλια	el
dc.contributor.department	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Εφαρμοσμένων Μαθηματικών & Φυσικών Επιστημών. Τομέας Μαθηματικών	el
dc.date.recordmanipulation.recordcreated	2013-07-24	-
dc.date.recordmanipulation.recordmodified	2013-07-24	-