Riding over transients

Problem How do you know whether a problem will work itself out or not?

Context A fault-tolerant application where some errors, overload conditions, etc. may be transient. The system can escalate through recovery strategies, taking more drastic action at each step. A typical example is a fault-tolerant telecommunication system using static traffic engineering, where you want to check for overload or transient faults.

Forces You want to catch faults and problems. There is no sense in wasting time solving a problem that goes away by itself. Many problems work themselves out, given time.

Solution Don't react immediately to detected conditions. Make sure the condition really exists by checking several times, or use Leaky Bucket Counters to detect a critical number of occurrences in a specific time interval. For example, average over time or simply wait a while to give transient faults a chance to pass.
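
The "checking several times" part of the solution can be sketched as follows. This is only an illustration, not code from the pattern's source: the function name persists and the parameter required are hypothetical, and the sketch assumes the condition is sampled as a sequence of booleans.

```python
from typing import Iterable


def persists(samples: Iterable[bool], required: int) -> bool:
    """Return True once `required` consecutive samples report the condition.

    `samples` and `required` are illustrative names; the pattern itself
    does not prescribe how many checks are enough.
    """
    streak = 0
    for present in samples:
        # A single clear sample resets the streak: the fault was transient.
        streak = streak + 1 if present else 0
        if streak >= required:
            return True
    return False
```

With required set to 3, a condition seen as present, clear, present, present, clear never triggers a reaction, while three consecutive detections do. How large required should be is a tuning decision for the particular system.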

Resulting context Errors can be resolved with truly minimal effort. The human operator need not intervene for transient errors (as in the pattern Minimize Human Interaction).

Rationale This pattern detects "temporally dense" events. Think of the events as spikes on a time line. If a small number of spikes (specified by a threshold) occur together (where "together" is specified by the interval), then the error is a transient. If the episode transcends the interval, it's not transient: the leak rate is faster than the refill rate, and the pattern indicates an error condition. If the burst is more intense than expected (it exceeds the error threshold) then it's unusual behavior not associated with a transient burst, and the pattern indicates an error condition. Used by Leaky Bucket Counters, Five Minutes of No Escalation Messages, and others.
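
The threshold and interval described in the rationale can be illustrated with a small counter. This is a minimal sketch under stated assumptions, not the implementation from the cited source: the class name LeakyBucketCounter, its parameters capacity and refill_interval, and the convention that each fault drains one unit while one unit is refilled per interval are all chosen here for illustration.

```python
class LeakyBucketCounter:
    """Hypothetical sketch: faults drain the bucket, time refills it."""

    def __init__(self, capacity: int, refill_interval: float) -> None:
        self.capacity = capacity              # error threshold (assumed name)
        self.level = capacity                 # bucket starts full
        self.refill_interval = refill_interval
        self.last_refill = 0.0                # time of the last refill step

    def _refill(self, now: float) -> None:
        # Add one unit per whole interval elapsed, never above capacity.
        units = int((now - self.last_refill) // self.refill_interval)
        if units > 0:
            self.level = min(self.capacity, self.level + units)
            self.last_refill += units * self.refill_interval

    def record_fault(self, now: float) -> bool:
        """Record one fault at time `now`.

        Returns True when the bucket is empty, i.e. the faults can no
        longer be treated as a transient.
        """
        self._refill(now)
        self.level = max(0, self.level - 1)
        return self.level == 0
```

With capacity 3 and a refill interval of 10 time units, two faults close together followed by a long quiet period leave the bucket refilled, so they count as a transient. Three faults in quick succession empty it (the burst exceeds the threshold), and so does a fault every 5 units, because faults drain the bucket faster than it refills. A caller could pass time.monotonic() as now.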

[Source: James Coplien, "Pattern Mining", C++ Report, Oct 95, p83]