# Device- and Temperature Dependency of Systematic Fault Injection Results in Artix-7 and iCE40 FPGAs

Christian Fibich

Dept. of Electronic Engineering, University of Applied Sciences Technikum Wien, Vienna, Austria, fibich@technikum-wien.at

Martin Horauer

Dept. of Electronic Engineering, University of Applied Sciences Technikum Wien, Vienna, Austria, horauer@technikum-wien.at

Roman Obermaisser

Chair for Embedded Systems, University of Siegen, Siegen, Germany, roman.obermaisser@uni-siegen.de

Abstract: Systematic fault injection into the configuration memory of SRAM-based FPGAs promises insight into the criticality of individual configuration bits. Current approaches implicitly assume that results obtained on one FPGA device can be generalized to all devices of that type and hence allow fault injection to be parallelized. This work, to the best of our knowledge, is the first to challenge this assumption. To that end, a synthetic test design was subjected to systematic fault injection on 16 Xilinx Artix-7 as well as 10 Lattice iCE40 FPGAs for which bitstream documentation is publicly available. The results of these experiments indicate that the derived sets of critical configuration bits vary from device to device of the same type, especially if the interconnect is targeted. Furthermore, temperature is observed to influence the fault injection results on Artix-7. Suggestions for dealing with the implications in future fault injection experiments are provided.

Index Terms—FPGA, Fault Injection, Device Variations

## I. INTRODUCTION

Fault injection into configuration memory is a widely-used method to test the robustness of SRAM-based FPGA designs. Two distinct fault injection approaches are used:

(1) Randomized fault injection aims at emulating the behavior of the Design under Test (DUT) under the influence of randomly occurring configuration bit flips, as expected to be caused by radiation.

(2) Systematic fault injection flips every bit in a region of interest (e.g., in a certain portion of a device where the DUT is placed) one-by-one. With parallel supervision of the DUT – e.g., by comparing its outputs to reference values or a "golden" design – a set of *critical bits* can be deduced. These are the bits that are necessary to remain in their correct state for a design to function as intended.
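The systematic approach can be illustrated with a minimal sketch. It assumes a purely software stand-in for the DUT (`toy_dut`) and a configuration modeled as a list of bits; both are hypothetical and only serve to show the one-bit-at-a-time loop with a golden reference comparison, not an actual FPGA flow.

```python
from typing import Callable, Iterable, List, Set


def find_critical_bits(
    golden_config: List[int],
    region: Iterable[int],
    run_dut: Callable[[List[int]], int],
) -> Set[int]:
    """Flip each bit of `region` in isolation and compare with the golden result."""
    golden_output = run_dut(golden_config)
    critical: Set[int] = set()
    for bit in region:
        faulty = list(golden_config)  # start from a fault-free copy every time
        faulty[bit] ^= 1              # inject exactly one bit flip
        if run_dut(faulty) != golden_output:
            critical.add(bit)         # the DUT deviated from the reference
    return critical


def toy_dut(config: List[int]) -> int:
    """Hypothetical DUT model: only bits 0..3 influence the observable output."""
    return sum(bit << i for i, bit in enumerate(config[:4]))


golden = [1, 0, 1, 1, 0, 0, 1, 0]
critical_bits = find_critical_bits(golden, range(len(golden)), toy_dut)
print(sorted(critical_bits))
```

In this toy model, bits 4 to 7 are configured but never influence the output, which mirrors the distinction between bits that are merely used and bits that are critical.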

Approach (2) is recommended by Xilinx for evaluating the effectiveness of Triple Modular Redundancy (TMR) introduced by their XTMR tool [1]. Xilinx Vivado furthermore helps accelerate systematic fault injection experiments by generating a list of "essential" configuration memory bits. These are bits that are used to implement the DUT, but are not necessarily critical for its correct functionality.

State-of-the-art systematic fault injection approaches assume that experiments conducted on a single FPGA device can be generalized to all devices of the same type, see Section II. As contributions, Section III presents experiments that challenge the above generalization on Xilinx Artix-7 and Lattice iCE40 FPGAs, and Section IV analyzes possible causes for device- and temperature-related effects discovered on Artix-7 FPGAs during these experiments. Suggestions for integrating the insights gained from the results of these experiments into future fault injection campaigns can be found in Section V.

## II. STATE OF THE ART

Systematic fault injection is widely used in the analysis of the fault tolerance of FPGA designs: In [2] the ratio of critical to essential configuration bits configuring a LEON3 CPU on a Xilinx Virtex-5 FPGA is estimated using systematic fault injection on a randomly chosen subset of essential bits. The authors of [3] apply systematic fault injection to a design generated by high-level synthesis to determine the effects of certain synthesis strategies on design reliability. In [4] systematic fault injection is used on Xilinx Virtex-6 FPGAs to identify critical bits that cannot be repaired by scrubbing alone because they cause the design to enter an incorrect state, requiring reset or power cycle.

TURTLE, a dual-FPGA fault injection platform based on Xilinx Artix-7 FPGAs is proposed in [5]. The authors state that by using multiple TURTLE platforms in parallel, systematic fault injection tasks may be accelerated. A system for accelerating systematic fault injection by parallelization across multiple Xilinx Zynq-7000 FPGAs of the same type is proposed in [6].

While the publications above perform one fault injection experiment per bit, a different approach is taken in [7]. In this work, the sensitivity of multiple soft CPU cores on a space-grade Xilinx Virtex-5 FPGA is evaluated. The authors observe that subsequent identical runs of fault injection resulted in different subsets of critical bits. Thus, they perform six test runs over the entire set of essential bits in order to discover more critical bits.

The works discussed above do not consider device- or temperature dependency, even though results described in [7] hint at (presumably) random processes that lead to varying criticality results in subsequent fault injection campaigns on the same design. It is implicitly assumed that a single systematic fault injection run on a single device is sufficient to determine the critical bits in a design. Furthermore, the parallel fault injection approaches proposed in [5] and [6] aim at accelerating systematic fault injection by splitting the fault injection task into chunks that are addressed on different devices. Thereby, these approaches implicitly assume that all devices show identical behavior under fault injection.

This work has been supported by the Austrian Federal Ministry for Digital and Economic Affairs (BM:DW) and the National Foundation for Research, Technology and Development as related to the Josef Ressel Center "Innovative Platforms for Electronic-Based Systems" (INES) managed by the Christian Doppler Society (CDG).

## III. DEVICE- AND TEMPERATURE DEPENDENCY IN SYSTEMATIC FAULT INJECTION

In order to put the (implicit) assumptions of previous works discussed in Section II to a test, we propose the following experiment and analyses to be performed for a given DUT and target FPGA device type:

(1) Analyze the composition of the DUT's bitstream into functional categories, e.g., bits configuring logic (LUTs, carry blocks, sequential logic) and bits configuring interconnect. This requires detailed knowledge of the target's configuration structure. This information is, for example, publicly available for Lattice's iCE40 and ECP5 families and for Xilinx's Series-7 family via the third-party SymbiFlow project<sup>1</sup>.

(2) Perform systematic fault injection experiments on multiple devices, adhering to the following constraints:

- To minimize influence of manufacturing variations, the target devices should be as similar as possible (i.e., same family, size, speed/temperature grade, date- and lot-code).
- As seen for example in [7], there may exist random processes in FPGA devices that cause different sets of critical bits to be discovered in subsequent experiments on the same device. Thus, multiple repetitions per bit should be carried out.
- Device variations may be amplified by differences in chip temperature, requiring controlled temperature conditions.

(3) To identify differences between devices, analyze the union and intersection sets of critical bits over all devices.

(4) To learn more about the device variations and temperature behavior, further analysis may be performed on bits that are critical only on a subset of devices and/or only in certain temperature ranges. This may, for example, draw attention to functional groups that are overrepresented in the respective analyses compared to their actual population in the bitstream.
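Steps (3) and (4) amount to plain set operations. The following sketch assumes that critical bits are available per device as sets of bit identifiers and that each bit has a known functional category. The board names, bit numbers, and categories are made up for illustration.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def compare_devices(
    critical: Dict[str, Set[int]]
) -> Tuple[Set[int], Set[int], Set[int]]:
    """Return (union, intersection, device-dependent bits) over all devices."""
    sets = list(critical.values())
    union = set().union(*sets)             # critical on any device
    intersection = set.intersection(*sets)  # critical on all devices
    return union, intersection, union - intersection


# Hypothetical per-device results and bit categories.
critical_per_device = {
    "board_a": {1, 2, 3, 7},
    "board_b": {1, 2, 3, 8},
    "board_c": {1, 2, 3, 7, 9},
}
category = {
    1: "logic", 2: "logic", 3: "interconnect",
    7: "interconnect", 8: "interconnect", 9: "interconnect",
}

union, intersection, device_dependent = compare_devices(critical_per_device)
by_category = Counter(category[bit] for bit in device_dependent)
print(len(union), len(intersection), dict(by_category))
```

Comparing `by_category` with the category shares of the whole bitstream indicates which functional groups are overrepresented among the device-dependent bits.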

### A. Experimental Setup

*Target Devices and Toolchains:* The experiments and analyses described above were performed on a set of Xilinx Artix-7 devices (xc7a35tcpg236-1) aboard 16 Digilent BASYS-3 development boards, and a set of Lattice iCE40-up5k devices on 10 iCEBreaker development boards. This evaluation excludes Intel/Altera devices due to the lack of publicly available bitstream documentation. Vivado 2017.2 was used for the Artix-7 platform for synthesis, P&R, and bitstream generation. For iCE40, an open-source RTL-to-bitstream toolflow, consisting of Yosys, nextpnr, and the IceStorm toolbox [8] was used.

*Design under Test:* A synthetic test design was used as a target for systematic fault injection experiments. It consists of a Linear Feedback Shift Register (LFSR) and a counter, both controlled by a state machine.

A single *run* of the DUT is timed by the counter that counts down a predefined number of clock cycles, during which the LFSR runs from a predefined initial value. A run is started by external logic, while completion is signaled by the DUT. Correct behavior of the design is characterized by both a correct latency and a correct final LFSR value. The design was chosen as its small size allows for fast complete systematic fault injection runs, while it contains basic components found in almost all FPGA designs.

*Analysis of the DUT's Bitstream:* The DUT consists of 19 flip-flops and 47 LUTs mapped to Artix-7, and 19 flip-flops, 26 LUTs, and 6 SB\_CARRY cells mapped to iCE40. Xilinx Vivado's bitstream generation tool reports 7539 essential configuration bits for the implemented circuit. An essential bits feature is not available within the iCE40 toolchain. It does, however, provide the program icebox\_explain, which describes the bits actively configured by the bitstream. Bits that may influence configured logic or add unwanted drivers to used wires have been added to the bits output by this tool, resulting in 1839 essential configuration bits on iCE40.

These essential configuration bits for both FPGAs have been classified into three categories by their functionality as shown in Table I<sup>2</sup>. Furthermore, the table shows the interconnect-related configuration bits classified by the consequences of bit flips:

On Artix-7, the interconnect resources are assumed to behave as multiplexers based on transmission gates or pass transistors, where each configuration bit controls one or more of these gates. This is substantiated by several patents of Xilinx, e.g. [9], and the layout of the interconnect-related configuration bits<sup>3</sup>. Under this assumption, a bit flip may 1) split a wire into two disconnected parts, 2) force an additional driver on a wire, or 3) create a new partial connection.

On iCE40, interconnect resources appear as simple buffers enabled by a single configuration memory bit, or as multiplexers with either 2, 3, or 4 select bits driven by configuration memory cells [10]. These bits select one of 4, 8, or 16 input wires to drive the output wire. While some of these multiplexers are combined with a dedicated buffer-enable configuration bit, others disable their driver if an all-zero configuration is present in their bits. Thus, a single bit flip may 1) disconnect a driver that should be connected, 2) force an additional driver on a wire, or 3) select a different, possibly undriven, source wire by altering a multiplexer's select bits.
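The three consequences for a multiplexer that disables its driver on an all-zero configuration can be made concrete with a deliberately simplified model. The sketch below assumes that the select bits are held in one integer and that any non-zero value selects some source. It does not reproduce the actual iCE40 bit encoding, and the function name is hypothetical.

```python
def classify_mux_flip(select_bits: int, flipped_bit: int) -> str:
    """Classify a single bit flip in a mux whose all-zero setting disables the driver.

    Simplified model (an assumption, not the real iCE40 encoding):
    0 means "driver off", every non-zero value selects one source wire.
    """
    faulty = select_bits ^ (1 << flipped_bit)
    if select_bits != 0 and faulty == 0:
        return "disconnect"         # an active driver is switched off
    if select_bits == 0 and faulty != 0:
        return "additional driver"  # an unused mux starts driving its output wire
    return "source change"          # still driving, but from a different input


print(classify_mux_flip(0b0100, 2))
print(classify_mux_flip(0b0000, 1))
print(classify_mux_flip(0b0101, 1))
```

Multiplexers with a dedicated buffer-enable bit would need a separate case for that bit; this is omitted here to keep the sketch short.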

*Fault Injection Platform & Constraints:* The DUT described above is instantiated in a fault injection platform that controls DUT execution and checks results. The actual implementation of the fault injection platform and process differs between the two development boards used:

<sup>1</sup> https://symbiflow.github.io/, Accessed 2020-11-25

<sup>2</sup> On Artix-7, all bits in CLB tiles have been classified as *logic*, despite some routing resources being present in these tiles (e.g., FF input MUXes)

<sup>3</sup> https://symbiflow.github.io/prjxray-db/artix7/tile\_int\_l.html, Accessed 2020-11-25

TABLE I

Composition of Essential Configuration Bits of DUT

| Configuration Bit Category       | Artix-7 Absolute | Artix-7 Relative | iCE40 Absolute | iCE40 Relative |
|----------------------------------|------------------|------------------|----------------|----------------|
| Other                            | 131              | 1.7%             | 32             | 1.7%           |
| Logic                            | 2952             | 39.2%            | 767            | 41.7%          |
| Interconnect                     | 4456             | 59.1%            | 1040           | 56.6%          |
| Interconnect: Disconnect         | 1352             | 17.9%            | 190            | 10.3%          |
| Interconnect: Contention         | 2802             | 37.2%            | 180            | 9.8%           |
| Interconnect: Partial Connection | 302              | 4.0%             | -              | -              |
| Interconnect: Source Change      | -                | -                | 670            | 36.4%          |
| Sum                              | 7539             | 100%             | 1839           | 100%           |

On the Artix-7 board, this platform can take advantage of the Internal Configuration Access Port (ICAP) and perform fault injection at runtime. The platform is based on a MicroBlaze CPU and Xilinx AXI-HWICAP partial reconfiguration controller. The entire design runs at 50 MHz with a worst-case setup slack of 6.326 ns reported by Vivado.

The platform implemented on the iCEBreaker board does not perform fault injection itself, as iCE40 lacks partial reconfiguration capabilities. Here, the fault injection is implemented by generating faulty bitstreams offline on the host. The design implemented on the iCEBreaker board runs at 50 MHz with a worst-case setup slack of 2.4 ns reported by nextpnr.

The DUT and the injector/controller hardware are floorplanned into different rows of clock regions (Artix-7) or tiles (iCE40) to minimize influence of injected faults on the controller itself.

*Fault Injection Sequence:* In line with the fault injection approaches discussed in Section II, faults are injected *before* the DUT starts operation. Each fault injection experiment performs the following steps:

(1) Configuration: On Artix-7, the FPGA boots from configuration flash and notifies the host that it is ready. On iCE40, the host configures the FPGA with a faulty bitstream via JTAG.

(2) Fault Injection (Artix-7 only): The host requests fault injection into a particular configuration bit identified by its frame address, word index, and bit position. Fault injection happens while the DUT's clock is stopped using a BUFGCE clock buffer, as it requires readback, modification, and update of an entire configuration frame via partial reconfiguration.

(3) The DUT is reset and started.

(4) Check Results: Once the DUT completes its run (or the timeout of twice the correct latency expires) the controller checks the correctness of latency and LFSR output. The results of this check are reported back to the host.

(5) Reboot (Artix-7 only): Finally, the controller reboots the FPGA from configuration flash to ensure a fault-free state for the next run.
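The modification part of step (2) is a read-modify-write on one frame. The sketch below shows only that step, under the assumption that a frame has already been read back into a list of 32-bit words (7-series frames hold 101 such words). The readback and write-back through the configuration port are not modeled, and the function name is hypothetical.

```python
from typing import List

WORDS_PER_FRAME = 101  # a 7-series configuration frame holds 101 32-bit words


def flip_bit_in_frame(frame: List[int], word_index: int, bit_position: int) -> List[int]:
    """Return a copy of `frame` with exactly one configuration bit inverted."""
    if len(frame) != WORDS_PER_FRAME:
        raise ValueError("unexpected frame length")
    if not 0 <= word_index < WORDS_PER_FRAME:
        raise ValueError("word index out of range")
    if not 0 <= bit_position < 32:
        raise ValueError("bit position out of range")
    modified = list(frame)                     # keep the readback data untouched
    modified[word_index] ^= 1 << bit_position  # invert the addressed bit
    return modified


frame = [0] * WORDS_PER_FRAME  # stand-in for data read back from the device
faulty = flip_bit_in_frame(frame, word_index=12, bit_position=7)
print(hex(faulty[12]))
```

Because the flip is an XOR, applying the same operation to the faulty frame restores the original contents, which is how a fault could be corrected without a full reboot.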

Fault injection experiments were repeated eight times per essential configuration bit, surpassing the value of six repetitions proposed in [7]. A configuration bit is counted as critical if during *any one* of these eight runs the controller identified either latency, LFSR output, or both as incorrect.
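The criticality rule can be written down compactly. The sketch assumes each run is reported as a pair of flags (latency correct, LFSR value correct). The bit names and results are invented for illustration.

```python
from typing import Dict, List, Tuple

RunResult = Tuple[bool, bool]  # (latency_ok, lfsr_ok) as reported by the controller


def run_failed(result: RunResult) -> bool:
    latency_ok, lfsr_ok = result
    return not (latency_ok and lfsr_ok)


def is_critical(runs: List[RunResult]) -> bool:
    """A bit counts as critical if any one repetition shows incorrect behavior."""
    return any(run_failed(r) for r in runs)


def fails_every_run(runs: List[RunResult]) -> bool:
    """True if the bit was detected as critical in all repetitions."""
    return all(run_failed(r) for r in runs)


OK = (True, True)
results: Dict[str, List[RunResult]] = {  # hypothetical results, 8 runs per bit
    "bit_a": [OK] * 8,
    "bit_b": [OK] * 7 + [(True, False)],
    "bit_c": [(False, False)] * 8,
}

critical = {bit for bit, runs in results.items() if is_critical(runs)}
consistent = {bit for bit in critical if fails_every_run(results[bit])}
print(sorted(critical), sorted(consistent))
```

Here `bit_b` would be missed by most single-run campaigns, while `bit_c` is found by any of them; the ratio of `consistent` to `critical` is one way to express repeatability.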

*Temperature Environment & Monitoring:* To ensure controlled temperature conditions and to enable systematic temperature tests, the tests were carried out in a Revolutionary Science RS-IF-203 incubator/fridge used as a temperature chamber. On Artix-7, chip temperature was continuously monitored and recorded using the *XADC* system monitor block that is provided as a hard IP in these devices, while iCE40 devices do not provide this kind of monitoring.

### B. Results

The fault injection experiments outlined in Section III-A were conducted at four temperature steps: 20 °C, 30 °C, 40 °C and 50 °C.

As a measure of basic fault injection quality as done in [7], the average share of critical bits present in all 8 repetitions per bit among all devices as well as the standard deviation was analyzed. In each of the four temperature steps, across all devices, on Artix-7 between 99.2% ( $\sigma$ =0.4%) and 99.7% ( $\sigma$ =0.2%) of total critical bits were present in all fault injection runs. On iCE40, this ranged from 95.7% ( $\sigma$ =0.2%) to 96.0% ( $\sigma$ =0.3%). For discovering these bits, a single fault injection run per bit would have sufficed with a high probability. By performing 7 additional runs per bit nevertheless, the remaining < 1% to 4% of critical bits were discovered.

The experimental results regarding device dependency are provided for Artix-7 in Table II and for iCE40 in Table III in the form of the union set (i.e., bits critical on *any* device) and intersection set (i.e., bits critical on *all* devices) sizes for the four temperature steps.

On both device types, there is a considerable difference between the union and intersection set sizes, indicating that on each device different subsets of tested configuration bits are critical. Considering the lowest temperature step (20 °C), of the 1263 bits critical on any device, only 963 (or 76.2%) are actually critical on all 16 tested Artix-7 devices. On iCE40, 841 of 955 bits (or 88.1%) are critical on all devices at the same temperature step. These initial results confirm device dependence of systematic fault injection results. At the lowest temperature step on both device types, criticality of configuration bits associated with clock routing, global configuration, and logic (tiles) is relatively consistent from device to device when compared with interconnect-associated bits. For example, on Artix-7 95.9% of bits associated with logic (tiles) critical on any device are critical on all devices (370 of 386), but only 66.7% of interconnect-associated bits (568 of 851). This effect can also be seen on iCE40: 97.3% of logic-associated bits critical on any device are critical on all devices (177 of 182) but only 85.8% of interconnect-associated bits (660 of 769).

In addition, the results on Artix-7 (Table II) appear to be temperature-dependent: When the experiments are repeated at 30 °C, 40 °C, and 50 °C it can be seen that the gaps between union and intersection sets start to close, i.e., the devices start behaving more uniformly. This especially affects the interconnect, where in the highest temperature step 89.6% of bits are critical on all devices (593 of 662) compared to 66.7% (568 of 851) in the lowest step. Especially the disparity among "disconnect" bits is reduced with higher temperature.
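The closing gap can be reproduced directly from the interconnect rows of Table II. The short sketch below only recomputes the share of union-set bits that are critical on all devices; the variable and function names are chosen for illustration.

```python
TEMPERATURES_C = (20, 30, 40, 50)

# Interconnect rows of Table II (Artix-7).
interconnect_union = (851, 799, 717, 662)
interconnect_intersection = (568, 575, 584, 593)


def share_critical_on_all(intersection: int, union: int) -> float:
    """Percentage of bits critical on any device that are critical on all devices."""
    return 100.0 * intersection / union


shares = [
    share_critical_on_all(i, u)
    for i, u in zip(interconnect_intersection, interconnect_union)
]
for temperature, share in zip(TEMPERATURES_C, shares):
    print(f"{temperature} °C: {share:.1f}%")
```

The share rises monotonically over the four steps, from 66.7% at 20 °C to 89.6% at 50 °C, driven mainly by the shrinking union set.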

In contrast, this effect is less prominent in the fault injection results for iCE40 (Table III). There appears to be some fluctuation among both intersection and union set sizes between temperature steps, but not to the extent visible on Artix-7.

TABLE II

Set Sizes of Critical Bits over 16 Artix-7 Devices

| Configuration Bit Category                     | 20 °C | 30 °C | 40 °C | 50 °C |
|------------------------------------------------|-------|-------|-------|-------|
| **Union Set (critical on any device)**         |       |       |       |       |
| Clock Routing & Other                          | 26    | 27    | 27    | 27    |
| Logic                                          | 386   | 383   | 379   | 380   |
| Interconnect                                   | 851   | 799   | 717   | 662   |
| Interconnect: Disconnect                       | 612   | 560   | 480   | 425   |
| Interconnect: Contention                       | 157   | 157   | 155   | 155   |
| Interconnect: Partial connection               | 82    | 82    | 82    | 82    |
| Sum                                            | 1263  | 1209  | 1123  | 1069  |
| **Intersection Set (critical on all devices)** |       |       |       |       |
| Clock Routing & Other                          | 25    | 25    | 25    | 25    |
| Logic                                          | 370   | 375   | 371   | 371   |
| Interconnect                                   | 568   | 575   | 584   | 593   |
| Interconnect: Disconnect                       | 361   | 366   | 378   | 388   |
| Interconnect: Contention                       | 154   | 151   | 150   | 148   |
| Interconnect: Partial connection               | 53    | 58    | 56    | 57    |
| Sum                                            | 963   | 975   | 980   | 989   |

TABLE III

Set Sizes of Critical Bits over 10 iCE40 Devices

| Configuration Bit Category                     | 20 °C | 30 °C | 40 °C | 50 °C |
|------------------------------------------------|-------|-------|-------|-------|
| **Union Set (critical on any device)**         |       |       |       |       |
| Global                                         | 4     | 4     | 4     | 4     |
| Logic                                          | 182   | 182   | 181   | 182   |
| Interconnect                                   | 769   | 770   | 768   | 770   |
| Interconnect: Disconnect                       | 163   | 163   | 163   | 163   |
| Interconnect: Contention                       | 29    | 29    | 29    | 31    |
| Interconnect: Source change                    | 577   | 578   | 576   | 576   |
| Sum                                            | 955   | 956   | 953   | 956   |
| **Intersection Set (critical on all devices)** |       |       |       |       |
| Global                                         | 4     | 4     | 4     | 4     |
| Logic                                          | 177   | 178   | 178   | 177   |
| Interconnect                                   | 660   | 668   | 670   | 667   |
| Interconnect: Disconnect                       | 137   | 137   | 137   | 134   |
| Interconnect: Contention                       | 18    | 18    | 18    | 18    |
| Interconnect: Source change                    | 505   | 513   | 515   | 515   |
| Sum                                            | 841   | 850   | 852   | 848   |

A more in-depth analysis of the temperature-dependent configuration bits on both Artix-7 and iCE40 is shown in Table IV. For this analysis, the union sets generated in each temperature step were compared. The bits that only appear in a subset of those four sets are displayed in the table, marked with a filled dot if they are critical in a given step. The bits were first divided by tile function, while the interconnect bits have further been categorized by the consequences of a bit flip.

On Artix-7, the vast majority of temperature-dependent bits *cease* to be critical with higher temperature (shown in the center three rows of the Artix-7 part). Additionally, there are some bits that *become* critical with higher temperature, and there is a single bit that behaves rather erratically with temperature.

The bits found in logic tiles have been analyzed and found to be associated with intra-CLB routing resources, while the bits subsumed under "other" are found in tiles related to clock routing. Not only are most of the temperature-related bits found in the interconnect, they also almost all share the same consequence: if these bits are flipped, they disconnect two wires that are meant to be connected.

TABLE IV

Bits Critical Only in Certain Temperature Steps (• = critical in the given step, ○ = not critical)

Artix-7:

| 20 °C | 30 °C | 40 °C | 50 °C | Sum | Interconnect: Disconn. | Interconnect: Cont. | Logic | Other |
|-------|-------|-------|-------|-----|------------------------|---------------------|-------|-------|
| ○     | ○     | ○     | •     | 2   | 0                      | 0                   | 1     | 1     |
| ○     | •     | •     | •     | 11  | 9                      | 0                   | 1     | 1     |
| •     | ○     | ○     | ○     | 65  | 61                     | 0                   | 4     | 0     |
| •     | •     | ○     | ○     | 85  | 79                     | 2                   | 4     | 0     |
| •     | •     | •     | ○     | 57  | 56                     | 0                   | 0     | 1     |
| •     | •     | ○     | •     | 1   | 1                      | 0                   | 0     | 0     |

iCE40:

| 20 °C | 30 °C | 40 °C | 50 °C | Sum | Interconnect: Cont. | Interconnect: Source chg. | Logic | Other |
|-------|-------|-------|-------|-----|---------------------|---------------------------|-------|-------|
| ○     | ○     | ○     | •     | 2   | 1                   | 1                         | 0     | 0     |
| ○     | ○     | •     | •     | 3   | 2                   | 1                         | 0     | 0     |
| ○     | •     | •     | •     | 5   | 3                   | 2                         | 0     | 0     |
| •     | ○     | ○     | ○     | 1   | 1                   | 0                         | 0     | 0     |
| •     | •     | ○     | ○     | 3   | 1                   | 2                         | 0     | 0     |
| •     | •     | •     | ○     | 4   | 1                   | 3                         | 0     | 0     |
| 6 other combinations |  |  |  | 18  | 10                  | 7                         | 1     | 0     |

The analysis described above was also performed on the iCE40 fault injection results. The iCE40 part of Table IV shows that there are some bits that are unique to almost any subset of temperature steps. This suggests that no significant temperature dependency exists in the evaluated range on iCE40. Here, too, most bits with changing criticality are interconnect bits.

Finally, Fig. 1 and Fig. 2 show the growth of the union set of critical bits for each temperature step as systematic fault injection is conducted on a growing number of devices (after initially testing on the first board). For both tested FPGA types, 1000 random sequences of adding board results (from the 16! and 10! possible sequences, respectively) were selected. On Artix-7, this shows a clear temperature dependency once again, while on iCE40 the results are similar over the entire temperature range. The horizontal line in both figures indicates 1% of the respective union set size of critical bits over all devices of each type at each temperature step.
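The union-growth analysis can be sketched as follows. The per-board sets are invented, and random shuffling is used as a stand-in for selecting board sequences (so repeated sequences are possible); whether the original analysis drew sequences this way is an assumption.

```python
import random
from typing import Dict, List, Sequence, Set


def union_growth(critical: Dict[str, Set[int]], order: Sequence[str]) -> List[int]:
    """Number of previously unseen critical bits contributed by each board in `order`."""
    seen: Set[int] = set()
    growth = []
    for board in order:
        growth.append(len(critical[board] - seen))
        seen |= critical[board]
    return growth


def mean_growth(critical: Dict[str, Set[int]], samples: int = 1000, seed: int = 0) -> List[float]:
    """Average growth per position over randomly shuffled board sequences."""
    rng = random.Random(seed)
    boards = list(critical)
    totals = [0] * len(boards)
    for _ in range(samples):
        rng.shuffle(boards)
        for position, new_bits in enumerate(union_growth(critical, boards)):
            totals[position] += new_bits
    return [total / samples for total in totals]


# Hypothetical results for four boards.
critical_per_board = {
    "b1": {1, 2, 3, 4},
    "b2": {1, 2, 3, 5},
    "b3": {1, 2, 3, 4, 6},
    "b4": {1, 2, 3},
}
averages = mean_growth(critical_per_board)
print([round(a, 2) for a in averages])
```

The first entry corresponds to the initial board; the remaining entries are the additional bits found per added board, which is the quantity plotted against the number of devices tested.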

## IV. ANALYSIS OF EFFECTS OF DISCONNECT-TYPE INTERCONNECT FAULTS

The evidence found for device- and temperature-dependency of the criticality of configuration bits on Artix-7, as shown in Section III-B, prompts further analysis for possible causes. Interconnect bits that disconnect two used wire segments when flipped stand out especially. These bits are the main contributors to differences between Artix-7 devices, and also are the most prominent bits when analyzing for temperature dependency.

In [9] Xilinx describes an interconnect multiplexer based on pass transistors where each of these transistors is controlled directly by a configuration bit. A bit flip would cause the respective transistor to open, breaking the route at that point. The behavior of the part of the route now disconnected from its driver depends on the actual circuit implementation of the interconnect multiplexer. For example, a wire may be held at a certain voltage level due to parasitic capacitance, slowly discharging below the low threshold voltage of the cell(s) driven by that wire, a process that may be temperature-dependent.

Fig. 1. Additional critical bits found vs. number of Artix-7 devices tested

Fig. 2. Additional critical bits found vs. number of iCE40 devices tested

In order to gain insight into the mechanisms behind this behavior, the following experiment is proposed: Route a wire along a known path between two LUTs (used as buffers). A fault injection platform shall be able to 1) inject faults into the configuration, 2) control the logical value of the wire's driver, and 3) monitor the wire's output value for changes and their timing relative to the fault injection operations.

The fault injection platform shall then inject faults into the configuration bits that break a single interconnect point along the route if flipped. This process shall be carried out for each configuration bit in isolation with both logic 0 and 1 driving the route. Starting with the fault injection process, all transitions of the output wire shall be timestamped and recorded.

### A. Experimental Setup

The experiment described above requires runtime access to the configuration for disconnecting routes. As iCE40 does not provide dynamic partial reconfiguration, it has only been carried out on Artix-7 in this work.



Fig. 3. Implemented Circuit for Evaluating Effects of Disconnect-Type Faults

Fig. 3 shows the circuit implemented to perform the experiment. The route under test starts at a LUT that acts as a buffer for the signal di, passes through two INT tiles, and ends at another LUT that buffers the signal and forwards it to do. The entire route is controlled by seven configuration bits (boxes 10\_07, 13\_07, etc. in Fig. 3). Disconnection experiments have been performed for each of the seven bits. For each bit, input values 0 and 1 were tested. After fault injection, the output signal was monitored for changes for $2^{20}$ clock cycles (i.e., 20.97 ms at 50 MHz). The experiments have been conducted over a range of chip temperatures between 20 °C and 65 °C by sweeping the set temperature of the temperature chamber.

### B. Results

In the runs where di\_int, the input value to the route, was set to 1 before disconnection, no changes in the output value after disconnection have been observed within  $2^{20}$  clock cycles over the entire temperature range. Thus, in these cases the consequence of the fault for subsequent logic would be stuck-at-1 behavior of the concerned signal.

Different behavior was observed in two configuration bits (10\_07, 19\_03) when the input value to the route was set to 0 before disconnection. In these cases do\_int started at 0 immediately after disconnection, but after a certain time period transitioned to 1 and remained at that level. The observation that a non-driven wire is pulled high is consistent with observations made for Xilinx Virtex-4 [11] and with SymbiFlow's bitstream documentation for Series-7 FPGAs<sup>4</sup>.

Fig. 4 plots the recorded time for do\_int to change from 0 to 1 after fault injection vs. the chip temperature. Different data markers and colors represent individual tested FPGA devices. The plots suggest that there exists a certain threshold temperature up to which the net transitions from 0 to 1 fairly quickly after disconnection. Above the threshold temperature, this time quickly rises above the $2^{20}$ clock cycles for which the net was monitored. This threshold varies both from device to device and from bit to bit within the same device. The latter can be seen in Fig. 4 when comparing the thresholds of bits 10\_07 and 19\_03 in the highlighted device 16.

Fig. 4. Time until 0-to-1 Transition after Disconnection vs. Chip Temperature

<sup>4</sup>https://symbiflow.readthedocs.io/projects/prjxray/en/latest/architecture/interconnect.html#vcc-drivers, Accessed 2020-11-25

Thus, if tests are conducted at a certain chip temperature, a single interconnect wire may exhibit pre-threshold behavior after disconnection on some devices. Depending on how soon the platform resumes the DUT after injecting the fault, the transition may even be complete by that time. For the rest of the fault injection experiment, until the fault is corrected, the wire then follows the stuck-at-1 fault model. On other devices, the same wire may exhibit transitional or post-threshold behavior, where the time until transition may extend into or beyond the experiment's runtime. The wire then either changes from stuck-at-0 to stuck-at-1 during the experiment or behaves as stuck-at-0 for the entire experiment.

A circuit sensitive to only one of these conditions may exhibit different behavior on different devices or at different temperatures under fault injection. This may serve as a partial explanation for the effects observed in Artix-7 FPGAs.
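The three cases described above can be sketched as a small classification, assuming the wire was driven to 0 before disconnection and that all times are measured from the moment of fault injection. The names `classify_wire`, `resume_delay_s`, and `runtime_s` are illustrative assumptions, not part of the fault injection platform.

```python
from enum import Enum
from typing import Optional


class WireBehavior(Enum):
    STUCK_AT_1 = "stuck-at-1 for the whole run"
    ZERO_TO_ONE_DURING_RUN = "stuck-at-0, then stuck-at-1 during the run"
    STUCK_AT_0 = "stuck-at-0 for the whole run"


def classify_wire(
    transition_s: Optional[float],  # time until 0-to-1 transition, None if never seen
    resume_delay_s: float,          # delay between fault injection and DUT resume
    runtime_s: float,               # duration of the experiment after resume
) -> WireBehavior:
    """Classify how a disconnected wire (previously driven 0) appears to the DUT."""
    if transition_s is None:
        return WireBehavior.STUCK_AT_0
    if transition_s <= resume_delay_s:
        # Transition completed before the DUT resumed (pre-threshold behavior).
        return WireBehavior.STUCK_AT_1
    if transition_s < resume_delay_s + runtime_s:
        # Transition falls inside the experiment's runtime.
        return WireBehavior.ZERO_TO_ONE_DURING_RUN
    # Transition would only occur after the experiment has ended.
    return WireBehavior.STUCK_AT_0


# Illustrative values only: 5 ms resume delay, 20 ms experiment runtime.
print(classify_wire(0.001, 0.005, 0.020).name)  # STUCK_AT_1
print(classify_wire(0.010, 0.005, 0.020).name)  # ZERO_TO_ONE_DURING_RUN
print(classify_wire(None, 0.005, 0.020).name)   # STUCK_AT_0
```

A DUT that reacts only to a stuck-at-0 wire would thus report the bit as critical in the last two cases at most, which is one way the same bit can be classified differently across devices or temperatures.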

#### V. CONCLUSION

This work evaluates device- and temperature-dependencies of systematic fault injection results in SRAM-based Xilinx Artix-7 and Lattice iCE40 FPGAs for which third-party bitstream documentation is publicly available. It is, to the best of our knowledge, the first work to challenge the implicit assumptions of temperature- and device independence of systematic fault injection approaches in the state of the art.

Results for both FPGA families as provided in Section III-B show that performing the same systematic fault injection experiment on different instances of the same FPGA type yields different sets of critical configuration bits. In addition, on Artix-7 the subsets of discovered critical bits vary with temperature, even well within the device's specified operating range of 0 °C to 85 °C. In contrast, on iCE40 no temperature-related effects were seen in the evaluated temperature range.

Possible causes for this difference might be the different feature sizes of Artix-7 (28 nm) and iCE40 (40 nm) as well as different circuit structures for the interconnect resources in these FPGA types. As shown in Section IV, on Artix-7, wires that become disconnected from their driver as a result of a bit flip take some time to transition to 1 after disconnection, provided they were driven to 0 beforehand. This time is shown to depend on device, temperature, and even specific interconnect points.

The effects described and experimentally confirmed in this work should be addressed when performing systematic fault injection experiments. In contrast to what the state of the art assumes, running experiments on a single device may not be enough to accurately predict critical bits on all devices of this type. However, as the results in Section III-B show, there may be certain categories of configuration bits that exhibit little device-to-device variation and for which a conventional approach may still be valid. For these bits, parallel fault injection also leads to accurate results. Other categories, for example interconnect bits, and especially disconnect-type bits on Artix-7, may exhibit different criticalities on different devices for specific designs. For these bits, experiments should be performed on multiple devices. The number of devices depends on the level of certainty required by the application: devices should be added until further experiments no longer reveal previously undiscovered critical bits.
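The stopping criterion and the split into device-independent and device-dependent bits can be sketched with plain set operations. The following Python example is a minimal illustration under assumed data: the device names, the bit identifiers, and both function names are hypothetical, and the result sets are made up rather than taken from the experiments.

```python
from typing import Dict, List, Set, Tuple


def accumulate_critical_bits(
    per_device: Dict[str, Set[str]],
) -> Tuple[List[Tuple[str, int]], Set[str]]:
    """Walk the devices in insertion order and count newly discovered critical bits."""
    known: Set[str] = set()
    gains: List[Tuple[str, int]] = []
    for device, bits in per_device.items():
        fresh = bits - known  # bits not seen on any earlier device
        gains.append((device, len(fresh)))
        known |= fresh
    return gains, known


def split_by_variation(per_device: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (bits critical on every device, bits critical on only some devices)."""
    sets = list(per_device.values())
    if not sets:
        return set(), set()
    common = set.intersection(*sets)
    return common, set.union(*sets) - common


# Made-up critical-bit sets for three devices (illustration only).
results = {
    "dev1": {"bitA", "bitB", "bitC"},
    "dev2": {"bitA", "bitB", "bitD"},
    "dev3": {"bitA", "bitB", "bitC"},
}
gains, all_bits = accumulate_critical_bits(results)
common, varying = split_by_variation(results)
print(gains)            # [('dev1', 3), ('dev2', 1), ('dev3', 0)]
print(sorted(common))   # ['bitA', 'bitB']
print(sorted(varying))  # ['bitC', 'bitD']
```

A run of zero gains over several additional devices can serve as a practical indication that the set has saturated, although how many such devices are needed remains an application-specific choice.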

Furthermore, temperature should be considered as a factor in future fault injection experiments. Performing tests at a single temperature may hide the criticality of interconnect bits at another temperature. Additionally, the temperature dependency may lead to different consequences of a bit flip in the interconnect depending on the value the wire carries at the moment of disconnection. Systematic fault injection experiments that only inject errors while the DUT is in one particular state (e.g., in an idle state before operation) may miss these state-dependent critical bits.

State-dependency of fault injection results as well as evaluations on more complex DUTs and other FPGA architectures will be the subject of future work.

#### REFERENCES

- [1] Xilinx, Inc., "Xilinx TMRTool User Guide TMRTool Software Version 13.2," 2017, Accessed 2020-11-25. [Online]. Available: https://www.xilinx.com/support/documentation/user\_guides/ug156-tmrtool.pdf
- [2] A. Sari and M. Psarakis, "A fault injection platform for the analysis of soft error effects in FPGA soft processors," in 2016 IEEE 19th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2016.
- [3] J. Tonfat et al., "Method to Analyze the Susceptibility of HLS Designs in SRAM-Based FPGAs Under Soft Errors," in Applied Reconfigurable Computing. Springer International Publishing, 2016, pp. 132–143.
- [4] B. Schmidt, D. Ziener, J. Teich, and C. Zöllner, "Optimizing scrubbing by netlist analysis for FPGA configuration bit classification and floorplanning," *Integration*, vol. 59, pp. 98–108, 2017.
- [5] C. Thurlow, H. Rowberry, and M. Wirthlin, "TURTLE: A Low-Cost Fault Injection Platform for SRAM-based FPGAs," in 2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Dec 2019, pp. 1–8.
- [6] S. T. Fleming and D. Thomas, "Injecting FPGA Configuration Faults in Parallel," in 2018 International Conference on Field-Programmable Technology (FPT), 2018, pp. 198–205.
- [7] N. A. Harward, M. R. Gardiner, L. W. Hsiao, and M. J. Wirthlin, "Estimating Soft Processor Soft Error Sensitivity through Fault Injection," in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, 2015, pp. 143–150.
- [8] D. Shah et al., "Yosys+nextpnr: An Open Source Framework from Verilog to Bitstream for Commercial FPGAs," in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019, pp. 1–4.
- [9] S. Jose *et al.*, "Interconnect Multiplexers And Methods Of Reducing Contention Currents In An Interconnect Multiplexer," US Patent 9509307 B1, Nov. 2016.
- [10] C. Wolf and M. Lasser, "Project IceStorm," Accessed 2020-11-25. [Online]. Available: http://bygone.clairexen.net/icestorm/
- [11] J. S. Monson, M. Wirthlin, and B. Hutchings, "A Fault Injection Analysis of Linux Operating on an FPGA-Embedded Platform," *International Journal of Reconfigurable Computing*, vol. 2012, 2012.