Predicting Failures in HPC systems with Data Mining and Deep Learning techniques

dc.contributor.author Τζώρτζη, Μαρία-Ιωάννα el
dc.contributor.author Tzortzi, Maria-Ioanna en
dc.date.accessioned 2020-12-02T09:12:47Z
dc.date.available 2020-12-02T09:12:47Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/52158
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.19856
dc.description National Technical University of Athens--Postgraduate Thesis. Interdisciplinary-Interdepartmental Postgraduate Studies Programme (D.P.M.S.)
dc.rights Default License
dc.subject Deep Learning en
dc.subject Supercomputers en
dc.subject HPC
dc.subject Failure prediction
dc.subject Machine Learning
dc.subject Artificial Intelligence
dc.title Predicting Failures in HPC systems with Data Mining and Deep Learning techniques en
dc.contributor.department Data Science and Machine Learning en
heal.type masterThesis
heal.classification Deep Learning en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2020-07-10
heal.abstract As the computational capacity of HPC systems grows, their increasing complexity leads to complex fault manifestations. Faults are already frequent and are expected to become more so in current and next-generation systems. At present, substantial compute capacity and power are wasted recovering from failed components. To improve the resiliency of HPC systems, the community has proposed and implemented various solutions, such as fault-tolerant applications, recovery techniques (e.g. checkpoint-and-restart), and a more thorough understanding of system logs. An alternative, yet imperative, solution is to predict failures with a known lead time and to pinpoint the node on which a failure is impending. The aim of this diploma thesis is, first, to study and visualize logs from IBM's Mira supercomputer and, second, to develop a generic methodology for predicting failures. The methodology infers failure chains and tracks node IDs to pinpoint the failure location; it also yields the lead time. To this end, the logs are first visualized and analyzed using Python 3 and pandas. LSTM (Long Short-Term Memory) networks are then used to predict failure chains that lead to the termination of applications running on specific nodes. A phrase analysis is performed on unlabeled log entries, which may or may not belong to a failure chain. The deep learning approach has three phases: first, training to predict the next phrase; second, re-training only on sequences of phrases that lead to failure chains, augmented with expected lead times; and third, predicting lead times during testing and deployment, to determine which specific node will fail and in how many minutes. en
heal.advisorName Γκούμας, Γεώργιος
heal.committeeMemberName Γκούμας, Γεώργιος el
heal.committeeMemberName Κοζύρης, Νεκτάριος el
heal.committeeMemberName Πνευματικάτος, Διονύσιος el
heal.academicPublisher National Technical University of Athens. School of Electrical and Computer Engineering en
heal.academicPublisherID ntua
heal.fullTextAvailability false
