Leaky bucket counter

Problem How do you deal with transient faults?

Context Fault-tolerant system software that must deal with failure events. Failures are tied to episode counts and frequencies.

One example from 1A processor systems in AT&T telecommunication products: as memory words (dynamic RAM, called "cells" in 1A processor terminology) got weak, the memory module would generate a parity error trap (a trap refresh failure). Examples include both 1A processor dynamic RAM and 1B processor static RAM.

Forces You want a hardware module to exhibit hard failures before you take drastic action against it. Some failures come from the environment and should not be blamed on the device.

Solution A failure group has a counter that is set to a predetermined value when the group is initialized. The counter is decremented for each fault or event (usually faults) and incremented on a periodic basis; however, the count is never incremented beyond its initial value. Different subsystems have different initial values and different leak rates: for example, the leak interval is a half-hour for the 1A memory (store) subsystem. The strategy for 1A dynamic RAM specifies that the first fault in a store (within the timing window) causes the store to be taken out of service, diagnosed, and then automatically restored to service. On the second, third, and fourth failures (within the timing window), the store is simply left in service. On the fifth episode within the timing window, the unit is taken out of service, diagnosed, and left out.
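
The counter and the 1A store strategy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the 1A implementation: the names LeakyBucketCounter, record_fault, leak, and action_for are hypothetical, the initial value of 5 is assumed from the "fifth episode" rule, and the episode number is taken to be the number of decrements not yet leaked back. The periodic timer that would call leak() (every half-hour for the 1A store) is left to the caller.

```python
class LeakyBucketCounter:
    """Fault counter for one failure group (hypothetical sketch)."""

    def __init__(self, initial: int = 5) -> None:
        # Assumed budget: with 5, the fifth fault in the window empties it.
        self.initial = initial
        self.count = initial

    def record_fault(self) -> int:
        """Decrement for one fault; return the episode number in the window."""
        self.count = max(self.count - 1, 0)
        return self.initial - self.count

    def leak(self) -> None:
        """Periodic tick: increment, but never beyond the initial value."""
        self.count = min(self.count + 1, self.initial)

    @property
    def sane(self) -> bool:
        """The group counts as sane once the counter is back at its initial value."""
        return self.count == self.initial


def action_for(episode: int, limit: int = 5) -> str:
    """Map an episode number to the 1A dynamic RAM strategy from the text."""
    if episode == 1:
        return "remove, diagnose, restore"
    if episode < limit:
        return "leave in service"
    return "remove, diagnose, leave out"
```

A caller would invoke action_for(bucket.record_fault()) on each parity trap and bucket.leak() from a periodic timer. A burst of faults followed by a quiet period lets the counter climb back to its initial value, so the next fault is treated as a first episode again.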

Resulting context A system where errors are isolated and handled (by taking devices out of service), but where transient errors (e.g., those caused by room humidity) don't trigger unnecessary out-of-service actions.

Rationale Periodically increasing the count on the resource creates a sliding time window. The resource is considered sane when the counter (re-)attains its initialized value. Humidity, heat, and other environmental problems could cause transient errors which should be treated differently (i.e., pulling the card does no good).

[Source: James Coplien, "Pattern Mining", C++ Report, Oct 95, p84]