Predicting Failures in HPC systems with Data Mining and Deep Learning techniques

Τζώρτζη, Μαρία-Ιωάννα; Tzortzi, Maria-Ioanna

dc.contributor.author	Τζώρτζη, Μαρία-Ιωάννα	el
dc.contributor.author	Tzortzi, Maria-Ioanna	en
dc.date.accessioned	2020-12-02T09:12:47Z
dc.date.available	2020-12-02T09:12:47Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/52158
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.19856
dc.description	National Technical University of Athens--Master's Thesis. Interdisciplinary-Interdepartmental Postgraduate Studies Programme (D.P.M.S.)
dc.rights	Default License
dc.subject	Βαθιά Μάθηση	el
dc.subject	Supercomputers	en
dc.subject	Πρόβλεψη σφαλμάτων
dc.subject	Μηχανική Μάθηση
dc.subject	Τεχνητή Νοημοσύνη
dc.subject	Deep Learning
dc.subject	HPC
dc.subject	Failure prediction
dc.subject	Machine Learning
dc.subject	Artificial Intelligence
dc.title	Predicting Failures in HPC systems with Data Mining and Deep Learning techniques	en
dc.contributor.department	Data Science and Machine Learning	en
heal.type	masterThesis
heal.classification	Deep Learning	en
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2020-07-10
heal.abstract	While the computational capacity of HPC systems increases, their complexity leads to complex fault manifestations. Faults are frequent and are expected to increase in current and next-generation systems. Currently, substantial compute capacity and power are wasted on recovering failed components. To improve the resiliency of HPC systems, the community has proposed and implemented various solutions, such as fault-tolerant applications, recovery techniques (e.g. checkpoint-and-restart), and a more thorough understanding of system logs. An alternative, yet imperative, solution is the ability to predict failures with a known lead time and to pinpoint the node on which a failure is impending. The aim of this diploma thesis is first to study and visualize the logs of IBM's supercomputer MIRA, and then to develop a generic methodology for predicting failures. This methodology infers failure chains and ultimately tracks node IDs to pinpoint the failure location. It also makes it possible to extract the lead time. To achieve this, the logs are first visualized and analyzed using python3 and pandas. Next, LSTMs (Long Short-Term Memory networks) are used to predict failure chains that lead to the termination of applications running on specific nodes. A phrase analysis is performed on unlabeled log entries, which may or may not belong to a failure chain. The deep learning approach has three phases: first, a model is trained to predict the next phrases; second, it is re-trained only on sequences of phrases leading to failure chains, augmented with expected lead times; third, lead times are predicted during testing/deployment in order to predict which specific node will fail and in how many minutes.	en
heal.advisorName	Γκούμας, Γεώργιος
heal.committeeMemberName	Γκούμας, Γεώργιος	el
heal.committeeMemberName	Κοζύρης, Νεκτάριος	el
heal.committeeMemberName	Πνευματικάτος, Διονύσιος	el
heal.academicPublisher	National Technical University of Athens. School of Electrical and Computer Engineering	en
heal.academicPublisherID	ntua
heal.fullTextAvailability	false