Modeling and Mitigation of Parametric Time-Dependent Variability in Digital Systems

Rodopoulos, Dimitrios; Ροδόπουλος, Δημήτριος

dc.contributor.author	Rodopoulos, Dimitrios	en
dc.contributor.author	Ροδόπουλος, Δημήτριος	el
dc.date.accessioned	2018-02-07T09:40:50Z
dc.date.available	2018-02-07T09:40:50Z
dc.date.issued	2018-02-07
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/46413
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.2879
dc.rights	Default License
dc.subject	Reliability	en
dc.subject	Variability	el
dc.subject	Aging	el
dc.title	Modeling and Mitigation of Parametric Time-Dependent Variability in Digital Systems	en
dc.contributor.department	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών	el
heal.type	doctoralThesis
heal.classification	Integrated circuits--Very large scale integration--Computer-aided design	en
heal.classificationURI	http://id.loc.gov/authorities/subjects/sh2008104741
heal.language	en
heal.access	campus
heal.recordProvider	ntua	el
heal.publicationDate	2016-05
heal.abstract	Current and future semiconductor technology nodes bring about a variety of challenges that pertain to the reliability and dependability of digital integrated systems. Materials such as the high-κ compounds in the transistor gate stack tend to intensify the time-zero and time-dependent variability of transistors. Bias Temperature Instability (BTI) and Random Telegraph Noise (RTN) are cases in point. The use of such modern materials is also coupled with an aggressive downscaling trend, which further amplifies matter discretization within the transistor's channel. These are two major manufacturing trends that give rise to time-dependent variability in integrated digital systems. It stands to reason that, as the materials become more variable, the error rates experienced at the circuit or even system level also increase. Given the increasing functionality of integrated circuits (namely, the "More than Moore" trend), it is reasonable to expect that future digital chips will exhibit reliability profiles that vary across the chip's lifetime. The goal of the research presented in this text is to study these observations with regard to both the modeling and the mitigation of time-dependent variability. The target systems are digital integrated circuits, such as processors and their memories. The research contributes specific reductions to practice that aim to confirm existing techniques and develop novel insight into the modeling and mitigation of time-dependent variability. Accordingly, the text is broadly split into two major parts. The modeling part starts by revisiting atomistic modeling concepts, which are very useful in the analysis of phenomena like BTI and RTN, especially for deca-nanometer transistors. This review makes clear how complex atomistic models are when applied to integrated circuit aging analysis. 
To reduce the complexity of circuit reliability analysis across the system lifetime, the current research formulates the concept of the Compact Digital Waveform (CDW). This format targets regions of circuit operation that are similar from a waveform point of view (e.g. similar frequency or duty cycle) and abstracts each of them to a single point. This enables striding over circuit lifetime intervals while retaining key features of atomistic reliability models (e.g. workload dependency). Still on the modeling front, the current research also contributes towards exposing transistor variability information to the architecture level. The metric of choice in this case is the component failure probability (Pfail). System designers typically require this metric in order to provision their systems with suitable disabling or correction mechanisms. To derive the Pfail, this text uses the Most Probable Failure Point (MPFP) method, which is applied to the cases of BTI/RTN variability. Two modeling approaches are used for these aging phenomena, and observations are drawn regarding the importance of accurately capturing the standard deviation of the threshold voltage variability. A statistical reformulation of the MPFP concept is also presented, so that standard cells can be handled in addition to memory components. The failure of system components has traditionally triggered some sort of reaction from the system, at least as far as detectable errors are concerned. Academia and industry use the term Reliability, Availability and Serviceability (RAS) to refer to such techniques. Their invocation is strictly coupled to the rate of errors that appear at the circuit level and typically comes at a measurable performance cost. On the mitigation side, the starting point of the current research is a simple, rollback-based RAS technique that aims to recover a system from transient errors. 
This concept has been implemented on a research-grade many-core platform, in the data plane of which errors are injected at a user-defined rate. The rollbacks ensure that the running application is brought back to an earlier correct state, so that execution can continue. This fail/stop model, however granular, incurs a measurable penalty in the timely execution of the running application. It is therefore reasonable to explore leaving a portion of the injected errors uncorrected in order to reduce this performance overhead. Accordingly, the current work explores the trade-off between application correctness, performance and energy budget, in search of the optimal operating points. The final milestone of the current work is a generally applicable solution to the problem of dependable performance, in view of temporal RAS overheads such as the one illustrated above. To that end, the issue of dependable performance is formulated from scratch and a control-theoretic solution is proposed. More specifically, a PID controller is used to manipulate the frequency of a processor, within which RAS schemes are invoked at the price of additional clock cycles. The concept is verified within the current research, both with simple simulations and on a real processing platform.	en
heal.advisorName	Soudris, Dimitrios	en
heal.advisorName	Catthoor, Francky	en
heal.advisorName	Σούντρης, Δημήτριος	el
heal.advisorName	Πεκμεστζή, Κιαμάλ	el
heal.committeeMemberName	Soudris, Dimitrios	en
heal.committeeMemberName	Catthoor, Francky	en
heal.committeeMemberName	Pekmestzi, Kiamal	en
heal.committeeMemberName	Groeseneken, Guido	en
heal.committeeMemberName	Varvarigou, Theodora	en
heal.committeeMemberName	Sazeides, Yiannakis	en
heal.committeeMemberName	Gizopoulos, Dimitrios	en
heal.committeeMemberName	Kaczer, Ben	en
heal.academicPublisher	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών	el
heal.academicPublisherID	ntua
heal.numberOfPages	198 p.
heal.fullTextAvailability	true