Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures
MetadataShow full item record
PublisherThe University of Arizona.
RightsCopyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
AbstractAs the feature size scales down to deep nanometer regimes, it has enabled the designers to fabricate chips with billions of transistors. The availability of such abundant computational resources on a single chip has made it possible to design chips with multiple computational cores, resulting in the inception of Chip Multiprocessors (CMPs). The widespread use of CMPs has resulted in a paradigm shift from computation-centric architectures to communication-centric architectures. With the continuous increase in the number of cores that can be fabricated on a single chip, communication between the cores has become a crucial factor in its overall performance. Network-on-Chip (NoC) paradigm has evolved into a standard on-chip interconnection network that can efficiently handle the strict communication requirements between the cores on a chip. The components of an NoC include routers, that facilitate routing of data between multiple cores and links that provide raw bandwidth for data traversal. While diminishing feature size has made it possible to integrate billions of transistors on a chip, the advantage of multiple cores has been marred with the waning reliability of transistors. Components of an NoC are not immune to the increasing number of hard faults and soft errors emanating due to extreme miniaturization of transistor sizes. Faults in an NoC result in significant ramifications such as isolation of healthy cores, deadlock, data corruption, packet loss and increased packet latency, all of which have a severe impact on the performance of a chip. This has stimulated the need to design resilient and fault tolerant NoCs. This thesis handles the issue of fault tolerance in NoC routers. Within the NoC router, the focus is specifically on the router pipeline that is responsible for the smooth flow of packets. In this thesis we propose two different fault tolerant architectures that can continue to operate in the presence of faults. In addition to these two architectures, we also propose a new reliability metric for evaluating soft error tolerant techniques targeted towards the control logic of the NoC router pipeline. First, we present Shield, a fault tolerant NoC router architecture that is capable of handling both hard faults and soft errors in its pipeline. Shield uses techniques such as spatial redundancy, exploitation of idle resources and bypassing a faulty resource to achieve hard fault tolerance. The use of these techniques reveals that Shield is six times more reliable than baseline-unprotected router. To handle soft errors, Shield uses selective hardening technique that includes hardening specific gates of the router pipeline to increase its soft error tolerance. To quantify soft error tolerance improvement, we propose a new metric called Soft Error Improvement Factor (SEIF) and use it to show that Shield’s soft error tolerance is three times better than that of the baseline-unprotected router. Then, we present Soft Error Tolerant NoC Router (STNR), a low overhead fault tolerating NoC router architecture that can tolerate soft errors in the control logic of its pipeline. STNR achieves soft error tolerance based on the idea of dual execution, comparison and rollback. It exploits idle cycles in the router pipeline to perform redundant computation and comparison necessary for soft error detection. Upon the detection of a soft error, the pipeline is rolled back to the stage that got affected by the soft error. Salient features of STNR include high level of soft error detection, fault containment and minimum impact on latency. Simulations show that STNR has been able to detect all injected single soft errors in the router pipeline. To perform a quantitative comparison between STNR and other existing similar architectures, we propose a new reliability metric called Metric for Soft error Tolerance (MST) in this thesis. MST is unique in the aspect that it encompasses four crucial factors namely, soft error tolerance, area overhead, power overhead and pipeline latency overhead into a single metric. Analysis using MST shows that STNR provides better reliability while incurring low overhead compared to existing architectures.
Degree ProgramGraduate College
Electrical & Computer Engineering