Current and future semiconductor technology nodes bring about a variety of challenges that pertain to the reliability and dependability of digital integrated systems. Compounds such as high-κ materials in the transistor gate stack tend to intensify the time-zero and time-dependent variability of transistors. This is certainly the case for phenomena like Bias Temperature Instability (BTI) and Random Telegraph Noise (RTN). The use of such modern materials is also coupled with an aggressive downscaling trend, which further amplifies matter discretization within the transistor's channel. These two major manufacturing trends give rise to time-dependent variability in integrated digital systems. It stands to reason that, as the materials become more variable, the error rates experienced at the circuit or even system level also intensify. Coupled with increasing integrated circuit functionality (namely, the ``More than Moore'' trend), it is reasonable to expect that future digital chips will exhibit reliability profiles that vary across the chip's lifetime.
The goal of the research presented in this text is to study the above observations, with respect to both the modeling and the mitigation of time-dependent variability. The target systems are digital integrated circuits, such as processors and their memories. The current research contributes specific reductions to practice that aim to confirm existing techniques and to develop novel insight into the modeling and mitigation of time-dependent variability. As such, the text is broadly split into two major parts.
The modeling part starts with a review of atomistic modeling concepts, which are very useful in the analysis of phenomena like BTI and RTN, especially for deca-nanometer transistors. This review makes the complexity of atomistic models for integrated circuit aging analysis abundantly clear. In order to alleviate the complexity of circuit reliability analysis across the system lifetime, the current research has formulated the concept of the Compact Digital Waveform (CDW). This format targets regions of circuit operation that are similar from a waveform point of view (e.g. similar frequency or duty cycle) and abstracts them to a single point. This enables striding over circuit lifetime intervals, while retaining key features of atomistic reliability models (e.g. workload dependency).
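As a rough illustration of the compaction idea, the sketch below assumes the circuit stimulus has already been reduced to a sequence of per-interval frequency/duty-cycle descriptors; the class, function and tolerance names are placeholders and not the thesis' actual CDW implementation.

    from dataclasses import dataclass

    @dataclass
    class WaveformRegion:
        """One stretch of circuit operation, summarised by its key waveform features."""
        frequency_hz: float   # dominant toggling frequency of the observed node
        duty_cycle: float     # fraction of time the signal spends at logic high
        duration_s: float     # how long this behaviour persists

    def compact_waveform(regions, freq_tol=0.05, duty_tol=0.05):
        """Merge consecutive, similar-looking regions into single compact points.

        Two regions count as similar when their frequencies differ by less than
        freq_tol (relative) and their duty cycles by less than duty_tol (absolute).
        Long lifetime intervals with stable behaviour thus collapse into one point
        whose duration is the sum of the merged regions, which is what allows
        striding over them during aging analysis.
        """
        compacted = []
        for r in regions:
            if compacted:
                last = compacted[-1]
                similar_freq = abs(r.frequency_hz - last.frequency_hz) <= freq_tol * last.frequency_hz
                similar_duty = abs(r.duty_cycle - last.duty_cycle) <= duty_tol
                if similar_freq and similar_duty:
                    last.duration_s += r.duration_s   # extend the existing compact point
                    continue
            compacted.append(WaveformRegion(r.frequency_hz, r.duty_cycle, r.duration_s))
        return compacted

For instance, a long trace in which a block alternates between a few stable activity phases would be reduced to a handful of such points, each still carrying the frequency and duty-cycle information that workload-dependent aging models need.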
Still on the modeling front, the current research additionally contributes towards exposing transistor variability information to the architecture level. The metric of choice in this case is the component failure probability (Pfail), which system designers typically require in order to provision their systems with appropriate disabling or correction mechanisms. In order to derive Pfail, this text features the Most Probable Failure Point (MPFP) method, which is applied to the case of BTI/RTN variability. Two modeling approaches are used for these aging phenomena, and observations are drawn regarding the importance of accurately capturing the standard deviation of the threshold voltage variability. A statistical reformulation of the MPFP concept is also presented towards handling standard cells, in addition to memory components.
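For orientation, MPFP-style analyses commonly build on the first-order reliability approximation sketched below; the notation is generic and not necessarily the exact formulation used in the text. With u collecting the normalized (standard-normal) threshold voltage deviations of a component's transistors and g(u) <= 0 denoting component failure (e.g. a violated timing or read margin):

    \begin{align}
      u^{*} &= \arg\min_{u} \, \lVert u \rVert
               \quad \text{subject to} \quad g(u) \le 0, \\
      \beta &= \lVert u^{*} \rVert, \qquad
      P_{\mathrm{fail}} \approx \Phi(-\beta),
    \end{align}

where u* is the most probable failure point, beta its distance from the nominal design point, and Phi the standard normal cumulative distribution function. Since beta scales inversely with the assumed spread of the deviations, an inaccurate standard deviation directly distorts the estimated Pfail, which is one reason the accurate capture noted above matters.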
The failure of system components has traditionally triggered some sort of reaction from the system, at least as far as detectable errors are concerned. Academia and industry use the term Reliability, Availability and Serviceability (RAS) to refer to such techniques. Their invocation is strictly coupled to the rate of errors appearing at the circuit level and typically comes at a measurable performance cost.
On the mitigation side, the starting point of the current research is a simple, rollback-based RAS technique that aims to recover a system from transient errors. This concept has been implemented on a research-grade many-core platform, in the data plane of which errors are injected at a user-defined rate. The rollbacks ensure that the running application is brought back to an earlier correct state, so that execution can continue. This fail/stop model, however granular, creates a measurable penalty in the timely execution of the running application. It is therefore reasonable to explore whether a portion of the injected errors may be left uncorrected in order to reduce this performance overhead. In view of this, the current work explores the trade-off between application correctness, performance and energy budget, in search of optimal operating points.
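The flavour of this trade-off can be conveyed with a deliberately simplified, first-order model; the thesis' actual platform, error injection and metrics differ, and every name below is illustrative only.

    def expected_cost(base_runtime_s, error_rate_per_s, rollback_penalty_s,
                      coverage, avg_power_w):
        """Toy model of the correctness/performance/energy trade-off.

        coverage is the fraction of detected errors that actually trigger a
        rollback; the remainder are deliberately left uncorrected.  Returns the
        inflated runtime, the resulting energy, and the rate of errors that
        escape correction.
        """
        injected_errors  = error_rate_per_s * base_runtime_s
        corrected_errors = coverage * injected_errors
        runtime_s = base_runtime_s + corrected_errors * rollback_penalty_s
        energy_j  = runtime_s * avg_power_w
        escaped_per_s = (1.0 - coverage) * error_rate_per_s
        return runtime_s, energy_j, escaped_per_s

Sweeping coverage downwards from 1.0 makes the tension explicit: runtime and energy shrink, while the rate of uncorrected errors grows, and the preferred operating point depends on how much incorrectness the application can tolerate.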
The final milestone of the current work is a generally applicable solution to the problem of dependable performance, in view of temporal RAS overheads such as the one illustrated above. To this end, the issue of dependable performance is formulated from scratch and a control-theoretic solution is proposed. More specifically, a PID controller is used to manipulate the frequency of a processor, within which RAS schemes are invoked at the price of additional clock cycles. The concept is verified within the current research, both with simple simulations and on a real processing platform.
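A minimal, generic sketch of such a control loop is given below; the gains, the performance signal and the actuation limits are placeholders rather than the tuned values used in the thesis.

    class PIDController:
        """Textbook discrete PID controller; the gains carry the unit
        conversion from the performance error to a frequency correction."""
        def __init__(self, kp, ki, kd, setpoint):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint       # target throughput per control epoch
            self.integral = 0.0
            self.prev_error = 0.0

        def update(self, measured, dt):
            error = self.setpoint - measured
            self.integral += error * dt
            derivative = (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * derivative

    def control_epoch(pid, current_freq_hz, delivered_perf, dt,
                      f_min=0.5e9, f_max=2.0e9):
        """One control epoch: nudge the clock so that delivered performance
        (net of cycles spent inside RAS invocations) tracks the setpoint,
        while respecting the platform's DVFS limits."""
        new_freq = current_freq_hz + pid.update(delivered_perf, dt)
        return min(max(new_freq, f_min), f_max)

Whenever RAS invocations consume extra clock cycles, the delivered performance dips below the setpoint, the error grows, and the controller raises the frequency until the target is met again; once the error burst passes, the frequency settles back, keeping performance dependable at a bounded energy cost.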