heal.abstract |
FPGAs are integrated circuits containing a large amount of programmable fabric that can be configured to implement compute-intensive functions. One of the most important features of certain FPGA types, such as flash- or SRAM-based devices, is their reconfiguration capability. This capability makes them suitable for customizable designs that can also adapt to changing application requirements. FPGAs, especially SRAM-based ones, are gaining momentum in various domains, such as automotive, robotics, space and avionics. Nevertheless, in environments such as space, FPGAs are susceptible to ionizing radiation, which can cause Single Event Upsets (SEUs) in their configuration memory. These upsets may alter the user-programmed design and disrupt its normal behaviour. Industry must therefore take SEUs seriously, as critical missions cannot afford the corruption of important data processing applications. Given the trend of using Commercial Off-The-Shelf (COTS) FPGAs rather than radiation-hardened ones (for performance and flexibility reasons), research is being conducted on improving their reliability and providing fault-tolerance solutions.
In this context, this thesis presents several fault-tolerant architectures that aim to protect Zynq SoC FPGAs from SEUs. These architectures counter the effects of SEUs using the most effective mitigation techniques, such as Internal Configuration Scrubbing (via Xilinx's SEM IP), Triple Modular Redundancy (TMR), Partial Reconfiguration (PR) and reboot through a Watchdog Timer. To verify the robustness of the proposed architectures, we used the SEM IP controller and implemented an injection strategy that targets sensitive bits in the configuration SRAM of commercial Xilinx Zynq-7000 SoC boards.
The results of the injection setup are evaluated in terms of reliability, availability, error rate and mean relative error. According to our experiments on the Zybo board, each proposed variation of the SEU mitigation techniques reduces the system's sensitivity to SEUs. Sensitivity decreases further as more mitigation techniques are combined, i.e., in our hybrid fault-tolerant architectures. Comparing the architectures individually, we conclude that Internal Configuration Scrubbing significantly outperforms the other mitigation techniques. The most robust solution combines all the mitigation techniques; compared to the unmitigated design, it provides 72% more correct functionality, 72% less downtime and 98% correction capability when SEUs occur, while using fewer than 4K LUTs, 5K FFs, and 6 RAMBs. |
en |