When one failure is too many

Adam Bahret

In reliability analysis, the chance of a product failure is represented as a probability density function (PDF). A PDF plot can take many shapes: symmetrical or asymmetrical, skewed right or left, peaked or flat, long-tailed or short-tailed.

All these possible shapes are difficult to manage. To mitigate product failures, design engineers need to understand how the PDF is shaped for their own product. Based on that information, design margins and redundancies can be introduced. For curves with long tails (a large number of events, each with low probability), the mitigation effort might seem like an endless venture. After all, the Pareto chart tells us to focus on the ‘vital few’ and ignore the ‘trivial many’. While this is sound advice and an efficient way to manage program time, it is paramount that the user acknowledges the risk of ignoring the trivial.

Long tail PDF

Let me explain this idea by roping in Use Case 7. I’ve been interested in the extreme ends of the PDF plot for a while now. At the tail end of the curve are operating conditions that the designer has not planned for. The creative, weird, rebellious users of the product live in this region. For example, compare dropping a mobile phone from the 10th storey of a building with dropping it from a table. Dropped from the 10th storey, the phone fails with certainty, but the probability of that drop occurring is close to zero. This is the concept of Use Case 7: extreme conditions with low probability that will cause failures.
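The arithmetic behind the phone example can be sketched in a few lines of Python. All the probabilities below are made-up illustrative values, not measured data, and the names `scenarios` and `failure_contribution` are hypothetical. The sketch only shows why a frequency-based ranking pushes the 10th-storey drop into the tail even though the phone is certain to fail once that drop happens.

```python
# Illustrative only: every probability here is an assumed value, not field data.
scenarios = {
    # p_event: chance the user subjects the phone to this condition
    # p_fail_given_event: chance the phone fails if the condition occurs
    "drop from table": {"p_event": 0.30, "p_fail_given_event": 0.02},
    "drop from 10th storey": {"p_event": 0.00001, "p_fail_given_event": 1.0},
}


def failure_contribution(scenario: dict) -> float:
    """Unconditional failure probability contributed by one scenario."""
    return scenario["p_event"] * scenario["p_fail_given_event"]


contributions = {name: failure_contribution(s) for name, s in scenarios.items()}

# Rank the way a frequency-driven Pareto chart would: largest contribution first.
for name, p in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {p:.6f}")
```

With these assumed numbers the table drop dominates the ranking, and the 10th-storey drop barely registers. That is exactly the kind of event the Pareto view files under the ‘trivial many’.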

While it might be valid to focus only on a few vital reliability issues, it is not the best approach for risk management. A system can manifest risk from component failures, operating conditions, controls, materials used, etc. All these “risk events” can be represented as a PDF plot, just like a reliability PDF plot.

A risk with the highest severity, low occurrence, and hazardous consequence sits in the long tail. When the risk is hiding in the long tail, it can spell doom. In financial markets, there is a term for these kinds of risks: Black Swan. These risks will almost never materialize, but when one does, it scars the company forever. Examples of these risks include the Boeing plug door blowout, the Cruise AV accident, and the Fukushima nuclear disaster.

Product managers need to be aware of hidden black swans in their products. The risks can arise in design, operation, or disposal. Engineering teams use various hazard analysis tools (HAZOP, FMEA) to assess these risks. The scope of these responsibilities stretches beyond the reliability engineering team. Even if the product is reliable, an operational event can bring catastrophe.

However, reliability plays a role in reducing risks emanating from product failures. One method I follow during FMEA is to rank corrective actions based on Severity and not on the total Risk Priority Number (RPN) score. For infrequent events, or events outside the boundary of the team, a hedging strategy is used. Teams have limited resources, and one cannot practically design out all the risks. Hedge the improvements across a portfolio of possible risks. Actions can range from full redesign through safety margins, redundancy, maintenance, alarms, and warnings to data collection.
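The difference between ranking by Severity and ranking by RPN can be sketched as follows. This is a minimal illustration, not a full FMEA worksheet: the failure modes and their scores are hypothetical, the 1 to 10 scales are assumed, and RPN is taken as the conventional product Severity × Occurrence × Detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # assumed scale: 1 (negligible) to 10 (hazardous)
    occurrence: int  # assumed scale: 1 (remote) to 10 (frequent)
    detection: int   # assumed scale: 1 (easily detected) to 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Conventional Risk Priority Number: S x O x D
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes with made-up scores, for illustration only.
modes = [
    FailureMode("cosmetic scratch on housing", severity=2, occurrence=8, detection=6),
    FailureMode("connector wear", severity=5, occurrence=5, detection=4),
    FailureMode("battery thermal runaway", severity=10, occurrence=1, detection=3),
]

# Ranking by total RPN lets frequent, low-consequence items rise to the top.
by_rpn = sorted(modes, key=lambda m: m.rpn, reverse=True)

# Ranking by Severity first (RPN only breaks ties) surfaces the long-tail hazard.
by_severity = sorted(modes, key=lambda m: (m.severity, m.rpn), reverse=True)

print("By RPN:     ", [(m.name, m.rpn) for m in by_rpn])
print("By Severity:", [(m.name, m.severity) for m in by_severity])
```

In this made-up data set the thermal runaway item has the lowest RPN and would be addressed last under an RPN ranking, yet it moves to the top once Severity drives the order.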

Mitigating Hidden Risks

When products are designed to be robust, the risk dissolves. The culture and implementation of robust design by the product development team will influence other teams to reduce their share of the risk. As the responsibility spreads, the long-tailed risk PDF becomes manageable.
