Welcome, dear friends of protection, control and electrical engineering. In a recent technical article, Siemens SIPROTEC describes measures against unwanted bit flips in its devices. Since the article covers not only the measures themselves but also the underlying physical phenomenon, we would like to share it unchanged.
Have fun while reading,
Your EEA TEAM
With the advancement of semiconductor technology into the nanometer range, a physical phenomenon that until now was typically observed only in aerospace engineering has also become relevant to numerical protection devices. This phenomenon is referred to below as SEU (Single Event Upset).
It affects all types of microprocessor-based devices, including numerical protection devices such as those of the SIPROTEC 5 product family.
This document is intended to provide an overview of the causes and effects of SEUs, as well as the solutions that SIEMENS uses to maintain the reliability of its SIPROTEC 5 products.
It is well known in the aerospace industry that cosmic radiation consisting of high-energy particles such as ionized hydrogen can cause temporary disturbances in microelectronic elements and can even destroy them. This radiation is caused, for example, by eruptive solar winds.
The earth is largely shielded from this radiation by its magnetic field, which is why effects at sea level are less common. Nevertheless, some particles manage to penetrate the earth's atmosphere. The result of the collision with oxygen and nitrogen atoms is a chain reaction that releases other particles such as neutrons and alpha particles.
All earthly materials have traces of natural radioactive elements that emit alpha particles. These also occur in the plastic housings and metal layers of semiconductor components such as microprocessors and memory chips. However, the effect of high-energy neutrons is ten times higher than that of alpha particles.
The increasing miniaturization in semiconductor technology goes hand in hand with ever larger integrated memories and lower operating voltages. The energy of penetrating elementary particles is now sufficient to induce bit flips, for example, in memory modules. This in turn can lead to malfunctions in the operation of numerical devices. Appropriate protective measures are therefore necessary.
Definitions and Physics
SE – Single Event
A single event is the interaction of an energetic particle such as a neutron with a semiconductor. The impacting particle deposits energy in the material; the energy deposited per unit path length is known as LET (Linear Energy Transfer). The occurrence is random and depends largely on the energy of the particle.
The effects of single events are observable and measurable errors. A distinction is made between a hard error and a soft error. In the event of a soft error, “only” data is corrupted. A hard error, on the other hand, which requires very high radiation, irreversibly destroys the semiconductor. The latter case can almost be ruled out when using numerical protection devices in their typical locations.
SEU – Single Event Upset
A single event upset is the change in a logical state (bit flip) in a writable, electronic memory cell (soft error) caused by a single event.
The different types of collision can result in a variety of secondary products (Fig. 2), which cause a current pulse at the output of the affected transistors. This can change the charge distribution and thus “switch” a p-n junction.
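To make the notion of a bit flip concrete, the following minimal Python sketch inverts a single bit in the 32-bit floating-point encoding of a stored value. The name `pickup_current` and its value are hypothetical and not taken from any SIPROTEC device; the sketch only illustrates that one flipped bit can change a number slightly or drastically, depending on its position.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 single-precision encoding inverted."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw ^= 1 << bit  # XOR inverts exactly one bit, as a single event upset would
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped


pickup_current = 1.5  # hypothetical setting value, for illustration only
for bit in (0, 22, 23):
    print(f"bit {bit}: {pickup_current} -> {flip_bit(pickup_current, bit)}")
```

In this encoding, flipping bit 22 (the top mantissa bit) turns 1.5 into 1.0, and flipping bit 23 (the lowest exponent bit) turns it into 0.75, while a flip of bit 0 changes the value only in its last digits.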
Typical components affected by SEUs are DRAMs (Dynamic Random-Access Memory), SRAMs (Static Random-Access Memory), FPGAs (Field-Programmable Gate Array) and microcontrollers with large integrated memories, as they are used today in all modern electronic devices (e.g. computers, TV sets, smartphones, car electronics, automation and protection technology).
Impact on the functionality of numerical devices
Numerical devices with highly integrated electronic components are always at some risk of being affected by SEUs. In principle, the SEU error rate increases as the chip structures get smaller and smaller. However, semiconductor manufacturers manage to effectively limit the SEU error rate by simultaneously densifying the structures on the silicon. With structure sizes of, for example, 14 nm, the trend has meanwhile reversed and the components are becoming less sensitive again. In their technical data for memory chips, semiconductor manufacturers indicate the probability of occurrence of SEUs.
A bit flip caused by an SEU can – if an affected device does not respond appropriately – lead to a device malfunction.
Can SEUs be prevented?
🌐 No, as all measures to avoid particle impacts involve extreme effort and would therefore be uneconomical.
Why are electronic components such as FPGAs used?
🌐 FPGAs (Field-Programmable Gate Array) are highly integrated components with configurable logic. This enables the implementation of complex mathematical functions that increase the performance of numerical devices. Due to their high functional density, FPGAs are potentially more at risk from SEUs than ordinary memory modules.
🌐 But it was only with the use of FPGAs that the development of high-performance, multifunctional protection and control devices with extensive communication options became possible.
🌐 Local hardware function upgrades can be carried out without affecting the overall firmware of a device. Maintenance and upgrade cycles are reduced to a minimum.
🌐 FPGAs are a firmly established technology. State-of-the-art protection and automation devices from all manufacturers use FPGAs.
Can the occurrence of SEUs be reduced?
🌐 Shielding measures on the devices themselves are not practical. However, the installation location (e.g. building with concrete ceilings) can reduce the occurrence of SEUs. With a material thickness of one meter of concrete, the penetration of neutrons is already reduced by 70%.
🌐 By using FPGAs with the smallest possible on-chip memory, the potential attack surface can be minimized.
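The 70% figure for one meter of concrete can be extrapolated to other thicknesses under one simplifying assumption that is not stated in the article: that attenuation is exponential, so each additional meter removes the same fraction of the remaining neutrons. The sketch below is a rough estimate on that basis only; real shielding depends on the neutron energy spectrum and the concrete composition.

```python
REDUCTION_AT_1M = 0.70  # from the text: 1 m of concrete reduces neutron penetration by 70%


def transmitted_fraction(thickness_m: float) -> float:
    """Fraction of neutrons that still penetrate, ASSUMING exponential attenuation."""
    return (1.0 - REDUCTION_AT_1M) ** thickness_m


for thickness in (0.25, 0.5, 1.0, 2.0):
    blocked = 1.0 - transmitted_fraction(thickness)
    print(f"{thickness:4.2f} m concrete: about {blocked:.0%} of neutrons blocked")
```

Under this assumption, two meters of concrete would let roughly 9% of the neutrons through, which shows that building structure helps but never reduces the SEU rate to zero.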
How likely is an SEU?
🌐 The occurrence of an SEU depends on the location (e.g. the height above sea level). At higher altitudes, the probability of an SEU increases due to the higher flux of high-energy particles. Typical locations for protection and automation equipment are in the range of 0 to 4,000 m (approx. 13,000 feet) above sea level. Another criterion is the geographical location. The protective effect of the earth's magnetic field decreases at the poles and at points of local field line anomalies.
🌐 Semiconductor manufacturers indicate the probability of occurrence of SEUs in the technical data of their components as key figures. A typical MTBF (Mean Time Between Failures) of 50 to 250 years for use at sea level is specified for the components used in SIPROTEC 5.
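The MTBF range quoted above can be turned into rough expectations for an installed base. The sketch below assumes a constant event rate and independent components (a common simplification that the article does not state); the fleet size of 1,000 components and the 20-year service period are hypothetical values chosen for illustration.

```python
import math


def expected_events_per_year(n_components: int, mtbf_years: float) -> float:
    """Expected number of SEUs per year across a fleet, assuming a constant rate."""
    return n_components / mtbf_years


def prob_at_least_one(mtbf_years: float, years: float) -> float:
    """Probability that one component sees at least one SEU within `years`."""
    return 1.0 - math.exp(-years / mtbf_years)


FLEET_SIZE = 1000      # hypothetical number of components in service
SERVICE_YEARS = 20     # hypothetical service period

for mtbf in (50, 250):  # MTBF range from the text, at sea level
    per_year = expected_events_per_year(FLEET_SIZE, mtbf)
    p_single = prob_at_least_one(mtbf, SERVICE_YEARS)
    print(f"MTBF {mtbf} years: {per_year:.0f} events/year in the fleet, "
          f"P(at least one in {SERVICE_YEARS} years per component) = {p_single:.2f}")
```

With these assumptions, a fleet of 1,000 components would see between 4 and 20 SEUs per year, which is why a large installed base has to treat SEUs as a routine event rather than an exceptional one.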
Can SEU-related malfunctions be prevented?
🌐 Yes! By using suitable hardware and firmware measures, stable device operation and maximum availability can be achieved.
SEU handling in SIPROTEC 5 devices
“Goal for critical applications: Limit the probability of system error propagation and/or provide detection-recovery mechanisms via failsafe strategies.”
NASA Goddard Radiation Effects and Analysis Group https://radhome.gsfc.nasa.gov
Protection and automation devices are such critical applications. In accordance with this goal, hardware and firmware measures (fail-safe strategies) have been implemented in the SIPROTEC 5 device family, which guarantee maximum availability and stability of the function:
🌐 Use of ECC (Error Correcting Code) protected DRAMs, which correct individual bit errors during operation without functional interruption.
🌐 Strict hardware design rule: Use of FPGAs with the smallest possible on-chip memory to reduce the potential attack surface.
🌐 The functionally relevant routing resources (FPGA programming) only occupy approx. 10% of the memory area concerned.
🌐 Implementation of an active SEU detection (bit flip monitoring).
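How an error-correcting code repairs a single bit flip can be shown with a small Hamming(7,4) code. This is only an illustration of the principle: the article does not specify which code the ECC-protected DRAMs use, and memory ECC typically works on much wider words. The function names are hypothetical.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (parity bits at positions 1, 2 and 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code: list[int]) -> tuple[list[int], int]:
    """Return (corrected data bits, error position); position 0 means no error found."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s4 = code[3] ^ code[4] ^ code[5] ^ code[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome is the 1-based index of the flipped bit
    fixed = list(code)
    if position:
        fixed[position - 1] ^= 1  # invert the faulty bit to restore the codeword
    return [fixed[2], fixed[4], fixed[5], fixed[6]], position


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate an SEU in bit position 5
data, position = hamming74_correct(stored)
print(data, position)
```

The corrupted word is read back as the original data bits `[1, 0, 1, 1]` with the error located at position 5, so the flip is repaired on the fly without interrupting operation. A code of this kind corrects one flipped bit per word; two flips in the same word would need additional parity to be detected reliably.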
FPGAs on IO-Boards
🌐 Autonomous system recovery: After an SEU is detected, the FPGA is re-initialized within a few milliseconds without having to restart the device. The mainboard CPU guarantees synchronized device behavior even when a single IO board is re-initialized.
🌐 Data security through CRC checksum monitoring: IO boards with analog measurement inputs provide each detected sample with a CRC checksum. Corrupt data is thus recognized by the mainboard CPU and marked as invalid.
This prevents an incorrect response by the protective functions.
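The checksum mechanism described above can be sketched as follows. The frame layout (a sequence number and a signed sample value followed by a CRC-32) and the function names are assumptions made for this example; the article does not state which CRC polynomial or data format the IO boards actually use.

```python
import struct
import zlib


def pack_sample(seq: int, value: int) -> bytes:
    """IO board side: append a CRC-32 to each sample (hypothetical frame layout)."""
    payload = struct.pack("<Ii", seq, value)
    return payload + struct.pack("<I", zlib.crc32(payload))


def unpack_sample(frame: bytes):
    """Mainboard side: return (seq, value), or None if the sample must be marked invalid."""
    payload = frame[:-4]
    (received_crc,) = struct.unpack("<I", frame[-4:])
    if zlib.crc32(payload) != received_crc:
        return None  # corrupt sample: must not reach the protection functions
    return struct.unpack("<Ii", payload)


frame = pack_sample(7, -1234)
print(unpack_sample(frame))            # intact frame

corrupted = bytearray(frame)
corrupted[5] ^= 0x08                   # simulate a bit flip inside the sample value
print(unpack_sample(bytes(corrupted))) # rejected frame
```

The intact frame is decoded to `(7, -1234)`, while the frame with the flipped bit yields `None`. The design point is that the receiver discards a doubtful sample instead of evaluating it, so a single corrupted measurement cannot trigger a protection function.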
FPGAs on Mainboard
🌐 Only in the case of an SEU on the mainboard is a quick device restart (warm start) necessary to restore a deterministic device status. The cause of the restart is reported in the operational event log.
SIPROTEC 5 Graceful Restart Strategy
The graceful restart strategy ensures the fastest possible system recovery while preserving system stability and maximum functional availability.
🌐 In the case of an SEU, the re-initialization of an IO board FPGA is reported to the mainboard CPU.
🌐 The mainboard CPU then holds the device-internal communication until the IO re-initialization is complete. Analog data (currents, voltages) are not updated during this interval. Binary input and output signals retain the status they had before the re-initialization. Protection and measurement functions go into a safe waiting state (inactive or blocked) during the entire re-initialization.
🌐 The device remains active, the life contact remains activated.
🌐 The external communication (protocols) reports the current device behavior.
🌐 Internal logs record the entire re-initialization process.
🌐 The device automatically returns to stable operation after approx. 100 ms.
🌐 Only an SEU on the FPGA of the mainboard leads to a device restart (warm start).
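The behavior listed above can be summarized as a small state machine. This is a simplified model written for this article, not Siemens firmware: the class, state and method names are hypothetical, timing is left out, and blocking the protection functions during a warm start is an assumption that the text does not state explicitly.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    RUNNING = "running"
    IO_REINIT = "io_reinit"    # an IO board FPGA is being re-initialized
    WARM_START = "warm_start"  # the whole device restarts


@dataclass
class Device:
    state: State = State.RUNNING
    protection_blocked: bool = False
    life_contact_active: bool = True
    event_log: list = field(default_factory=list)

    def on_seu(self, location: str) -> None:
        """React to a detected SEU on an IO board FPGA or on the mainboard FPGA."""
        if location == "io_board":
            # Device stays active; functions wait in a safe state during re-initialization.
            self.state = State.IO_REINIT
            self.protection_blocked = True
            self.event_log.append("SEU on IO board: FPGA re-initialization started")
        elif location == "mainboard":
            # Assumption: protection is also blocked while the device restarts.
            self.state = State.WARM_START
            self.protection_blocked = True
            self.event_log.append("SEU on mainboard: warm start")
        else:
            raise ValueError(f"unknown location: {location}")

    def on_recovered(self) -> None:
        """Return to stable operation once re-initialization or restart has finished."""
        self.state = State.RUNNING
        self.protection_blocked = False
        self.event_log.append("stable operation resumed")


device = Device()
device.on_seu("io_board")
print(device.state, device.protection_blocked, device.life_contact_active)
device.on_recovered()
print(device.state, device.event_log)
```

Modeling the two cases as separate states makes the key distinction explicit: an SEU on an IO board is handled locally while the device and its life contact remain active, whereas only an SEU on the mainboard FPGA leads to a warm start.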
SEUs as a phenomenon have been known for decades but are becoming more and more important with the advancing integration of electronic components. Siemens minimizes the susceptibility to SEUs through strict hardware design rules and achieves maximum availability and complete functional stability by means of suitable monitoring measures and the implemented graceful restart strategy.
published by Siemens Industry Inc., 100 Technology Drive Alpharetta, GA 30005 United States