dc.contributor.author |
Τζώρτζη, Μαρία-Ιωάννα
|
el |
dc.contributor.author |
Tzortzi, Maria-Ioanna
|
en |
dc.date.accessioned |
2020-12-02T09:12:47Z |
|
dc.date.available |
2020-12-02T09:12:47Z |
|
dc.identifier.uri |
https://dspace.lib.ntua.gr/xmlui/handle/123456789/52158 |
|
dc.identifier.uri |
http://dx.doi.org/10.26240/heal.ntua.19856 |
|
dc.description |
National Technical University of Athens--Master's Thesis. Interdisciplinary-Interdepartmental Postgraduate Studies Programme (D.P.M.S.) |
|
dc.rights |
Default License |
|
dc.subject |
Βαθιά Μάθηση |
el |
dc.subject |
Υπερυπολογιστές |
el |
dc.subject |
Πρόβλεψη σφαλμάτων |
|
dc.subject |
Μηχανική Μάθηση |
|
dc.subject |
Τεχνητή Νοημοσύνη |
|
dc.subject |
Deep Learning |
|
dc.subject |
HPC |
|
dc.subject |
Failure prediction |
|
dc.subject |
Machine Learning |
|
dc.subject |
Artificial Intelligence |
|
dc.title |
Predicting Failures in HPC systems with Data Mining and Deep Learning techniques |
en |
dc.contributor.department |
Data Science and Machine Learning |
en |
heal.type |
masterThesis |
|
heal.classification |
Deep Learning |
en |
heal.access |
free |
|
heal.recordProvider |
ntua |
el |
heal.publicationDate |
2020-07-10 |
|
heal.abstract |
While the computational capacity of HPC systems keeps increasing, their growing complexity leads to complex fault manifestations. Faults are already frequent and are expected to become more frequent in current and next-generation systems. At present, substantial compute capacity and power are wasted recovering from failed components. To improve the resiliency of HPC systems, the community has proposed and implemented various solutions, such as fault-tolerant applications, recovery techniques (e.g. checkpoint-and-restart), and a more thorough analysis of system logs. An alternative, yet essential, approach is the ability to predict failures with a known lead time and to pinpoint the node on which a failure is impending.
The aim of this thesis is, first, to study and visualize the logs of IBM's supercomputer MIRA and, second, to develop a generic methodology for predicting failures. This methodology infers failure chains and tracks node IDs to pinpoint where a failure will occur. It also allows the lead time of a predicted failure to be extracted.
To this end, the logs are first visualized and analyzed using Python 3 and pandas. Next, LSTM (Long Short-Term Memory) networks are used to predict failure chains that lead to the termination of applications running on specific nodes. A phrase analysis is performed on unlabeled log entries, which may or may not belong to a failure chain. The deep learning approach has three phases: first, a model is trained to predict the next phrase; second, it is re-trained only on sequences of phrases that lead to failure chains, augmented with their expected lead times; third, at test and deployment time, it predicts lead times, that is, which specific node will fail and in how many minutes. |
en |
heal.advisorName |
Γκούμας, Γεώργιος |
|
heal.committeeMemberName |
Γκούμας, Γεώργιος |
el |
heal.committeeMemberName |
Κοζύρης, Νεκτάριος |
el |
heal.committeeMemberName |
Πνευματικάτος, Διονύσιος |
el |
heal.academicPublisher |
National Technical University of Athens. School of Electrical and Computer Engineering |
en |
heal.academicPublisherID |
ntua |
|
heal.fullTextAvailability |
false |
|
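The abstract describes extracting a lead time per node from a failure chain. The sketch below illustrates only that bookkeeping step with plain rule-based matching; it is not the LSTM-based method of the thesis. The log entries, the phrase names, the node IDs, and the function `lead_times` are all hypothetical and are not taken from the MIRA logs.

```python
from datetime import datetime

# Hypothetical log entries: (timestamp, node id, phrase). Illustrative only.
LOG = [
    ("2020-01-01T10:00:00", "node-A", "ecc_warning"),
    ("2020-01-01T10:02:00", "node-B", "link_retry"),
    ("2020-01-01T10:05:00", "node-A", "ecc_uncorrectable"),
    ("2020-01-01T10:12:00", "node-A", "node_down"),
]

# Hypothetical failure chain: ordered phrases ending in the terminal failure.
CHAIN = ["ecc_warning", "ecc_uncorrectable", "node_down"]


def lead_times(log, chain):
    """Return {node: minutes between the first chain phrase and the failure}."""
    progress = {}  # node -> index of the next chain phrase expected
    started = {}   # node -> timestamp of the first chain phrase seen
    result = {}
    for stamp, node, phrase in log:
        when = datetime.fromisoformat(stamp)
        idx = progress.get(node, 0)
        if phrase != chain[idx]:
            continue  # unrelated entry; it does not advance the chain
        if idx == 0:
            started[node] = when
        if idx + 1 == len(chain):
            # Terminal phrase reached: record the lead time and reset.
            result[node] = (when - started[node]).total_seconds() / 60
            progress[node] = 0
        else:
            progress[node] = idx + 1
    return result


print(lead_times(LOG, CHAIN))
```

Under these assumptions, entries that do not match the next expected phrase are simply skipped, which loosely mirrors the abstract's remark that log entries may or may not belong to a failure chain.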