Problem |
How do you deal with transient faults?
|
Context |
Fault-tolerant system software that must deal with failure
events. Failures are tied to episode counts and frequencies. One example from 1A processor systems in AT&T telecommunication products: as memory words (dynamic RAM, called "cells" in 1A processor terminology) got weak, the memory module would generate a parity error trap (a trap refresh failure). Examples include both 1A processor dynamic RAM and 1B processor static RAM.
|
Forces |
You want a hardware module to exhibit hard failures
before taking drastic action. Some failures come from the environment
and should not be blamed on the device.
|
Solution |
A failure group has a counter that is initialized to a
predetermined value when the group is initialized. The counter
is decremented for each fault or event (usually faults) and incremented
on a periodic basis; however, the count is never incremented beyond its
initial value. There are different initial values and different leak
rates for different subsystems: for example, the leak interval
is a half-hour for the 1A memory (store) subsystem. The strategy
for 1A dynamic RAM specifies that the first fault in a store
(within the timing window) causes the store to be taken out of
service, diagnosed, and then automatically restored to service.
On the second, third, and fourth failures (within the timing
window), the store is left in service. On the fifth episode within
the timing window, the store is taken out of service, diagnosed, and
left out of service.
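
A minimal sketch of the counter and the 1A store strategy described here, in Python. The names `LeakyBucketCounter`, `record_fault`, `is_sane`, and `store_action` are hypothetical, as is the initial value of 5 (chosen only because the strategy acts on the fifth episode; the source does not state the actual value). The half-hour leak interval is from the text. Time is passed in explicitly, in seconds, so the behavior is easy to trace.

```python
class LeakyBucketCounter:
    """Counter that drops on each fault and leaks back up over time."""

    def __init__(self, initial, leak_interval, now=0.0):
        self.initial = initial              # predetermined starting value
        self.leak_interval = leak_interval  # seconds per periodic increment
        self.count = initial
        self._last_leak = now

    def _leak(self, now):
        # Apply one increment per whole interval elapsed,
        # never exceeding the initial value.
        ticks = int((now - self._last_leak) // self.leak_interval)
        if ticks > 0:
            self.count = min(self.initial, self.count + ticks)
            self._last_leak += ticks * self.leak_interval

    def record_fault(self, now):
        """Decrement for a fault; return the episode number in the window."""
        self._leak(now)
        self.count = max(0, self.count - 1)
        return self.initial - self.count

    def is_sane(self, now):
        """True once the counter has climbed back to its initial value."""
        self._leak(now)
        return self.count == self.initial


def store_action(episode):
    """Map an episode number to the 1A store action (labels are made up)."""
    if episode == 1:
        return "diagnose_and_restore"
    if episode < 5:
        return "leave_in_service"
    return "diagnose_and_leave_out"


# Assumed values: initial count 5, leak interval 30 minutes.
store = LeakyBucketCounter(initial=5, leak_interval=1800.0)
actions = [store_action(store.record_fault(t)) for t in (0, 60, 120, 180, 240)]
```

With five faults inside a few minutes, `actions` ends with the store being left out of service. A single fault followed by a quiet half-hour lets the counter return to its initial value, so the next fault counts as a first episode again.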
|
Resulting context |
A system where errors are isolated and handled (by taking devices
out of service), but where transient errors (e.g., those caused by
room humidity) don't cause unnecessary out-of-service actions.
|
Rationale |
Periodically increasing the count on the resource
creates a sliding time window. The resource is considered sane
when the counter (re-)attains its initialized value. Humidity,
heat, and other environmental problems can cause transient
errors, which should be treated differently from hard failures
(i.e., pulling the card does no good).
|
[Source: James Coplien, "Pattern Mining", C++ Report, Oct 95, p84] |