heal.abstract |
The space industry has undergone a transformation in recent years, shifting from classic space-grade systems to mixed architectures that combine radiation-hardened components with Commercial-Off-The-Shelf (COTS) devices, including FPGAs. The advantages of COTS components are their SWaP-C (size, weight, power, and cost), processing performance, and development flexibility compared with classical computational architectures, which struggle to meet the increased data and computational demands effectively. However, commercial devices are susceptible to failures induced by ionizing radiation. In this thesis, we aim to mitigate the Single Event Upsets (SEUs) that alter the functionality of the device. All the fault-tolerance techniques are implemented on the UltraScale+ MPSoC architecture, which is commonly used in space missions. We developed a custom hardware architecture for a Fast Fourier Transform (FFT) algorithm to serve as our digital signal processing (DSP) component. The fault-mitigation architectures include both application-independent and application-specific designs, and the fault-injection and evaluation campaign we developed is used to categorize and rank the techniques by how much they reduce the downtime of the device. In addition, we explored the impact of using different FPGA blocks, i.e., DSP and LUT, on the reliability of the device. The fault-mitigation techniques implemented in the hardware accelerator are spatial (i.e., DMR and TMR), temporal, hybrid spatial-temporal (i.e., DMR Temporal), and fine-grained spatial redundancy (i.e., TMR at the stage and butterfly level). For module correction, our design includes full and partial reconfiguration (i.e., Dynamic Partial Reconfiguration) and the internal scrubber offered by Xilinx's SEM IP. Fault injection is also performed with the SEM IP, by inducing bit flips in the configuration memory. The injection campaign covered 200,000 configuration memory addresses. 
As far as the impact of different FPGA computational blocks is concerned, the LUT implementation suffered downtime increased by factors of 2.2×, 3.68×, and 2.7× compared with the DSP implementation for the 8-, 16-, and 32-point FFT, respectively. The architectures that reduced downtime the most are the temporal one, owing to its small resource utilization and error detection, and TMR, which offers error detection and correction. The best improvements in downtime were reductions of 95% and 65% for the 8- and 32-point FFT, respectively, as both architectures took advantage of all the correction methodologies (FR, PR, CMS). |
en |