Why can’t we shake MTBF?

Adam Bahret

Mean Time Between Failure (MTBF) is one of the most well-known reliability metrics.

But to anyone who works with reliability, it seems like it was developed by some evil anti-reliability mastermind to undermine the possibility of connecting reliability to anything or anyone.

 

Mean Time Between Failure means what?

  • It’s the time between two failures?
  • It’s when the first failure occurs?
  • It’s how long the product is good for?
  • It seems way too big to be a reasonable goal! “How can an air pump have an MTBF of 4 million hours?  That’s ridiculous, these things are only supposed to last for five years!”

This is the process of understanding everyone goes through as they are introduced to MTBF, formally or informally.

There are other ways to communicate the parameter that MTBF represents.  Failure rate is simply the inverse of MTBF.  Why don’t we use failure rate?  A 2.5 million hr MTBF is equivalent to 1/2,500,000 = 0.0000004 fails/hr.
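To put that conversion in slightly more familiar terms, here is a minimal Python sketch. The fleet of 1,000 units and the round-the-clock operation are assumptions made up for illustration, not figures from any real product.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: the product runs continuously


def failure_rate(mtbf_hours: float) -> float:
    """Failure rate in fails/hr, which is simply the inverse of MTBF."""
    return 1.0 / mtbf_hours


def expected_failures_per_year(mtbf_hours: float, fleet_size: int) -> float:
    """Expected failures per year across a fleet.

    Assumes a constant failure rate and that every failed unit is replaced.
    """
    return fleet_size * HOURS_PER_YEAR * failure_rate(mtbf_hours)


print(f"{failure_rate(2.5e6):.7f} fails/hr")
print(f"{expected_failures_per_year(2.5e6, 1000):.1f} failures/yr in a fleet of 1,000")
```

Scaling the rate up to a fleet and a year gives a number people can picture, but it only works once you have picked a fleet size and a time period.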

Well I guess that is our answer.  What the hell does that mean? 0.0000004 fails/hr?  That means nothing to me, I have no idea if that is good or bad.  I at least know what an hour is when we talk MTBF.

We can’t use % reliability as a direct replacement because a % reliability statement needs a period of time that it applies over, for example 90% reliability over 2 years.  MTBF and failure rate do not have to include a period of time to be defined, so they can be easily translated into specific metrics that do involve a product’s behavior in a specific time period.
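As a rough illustration of that translation, the sketch below turns an MTBF into a % reliability over a stated period. It assumes a constant failure rate (the exponential distribution) and continuous operation, and it borrows the 4 million hour air pump figure from the list above purely as an example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: the product runs continuously


def reliability(mtbf_hours: float, period_hours: float) -> float:
    """Fraction expected to survive the period: R(t) = exp(-t / MTBF).

    Only valid under the constant failure rate (exponential) assumption.
    """
    return math.exp(-period_hours / mtbf_hours)


five_years = 5 * HOURS_PER_YEAR
print(f"{reliability(4e6, five_years):.1%} reliability over 5 years")
```

Under these assumptions, a 4 million hour MTBF works out to roughly 99% reliability over five years, which says the same thing in a far less bewildering way.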

Are those our only three options?  MTBF, failure rate, and % reliability/unreliability (if we have a time period to express it over)?

I think that is part of why we seem stuck in MTBF.

So let’s discuss what MTBF actually is, because for now we are kinda stuck with it.  (Shaking fist at our nemesis, the evil MTBF mastermind: “You win this time, Dr MTBF!!!”)

MTBF is most commonly used to describe the reliability of a design during its intended use life.

MTBF is when approximately half* of the population has failed during use life.  Failures that occur due to infant mortality or wear-out are not included in this metric.

*(I said approx half because it depends on the applied distribution.  If it is an exponential distribution the MTBF is when 63.2% have failed)
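For the exponential case, the 63.2% figure falls straight out of the cumulative failure function. Here is a short derivation, writing θ for the MTBF (the symbol is just a choice made for this sketch):

```latex
F(t) = 1 - e^{-t/\theta}
\qquad\Rightarrow\qquad
F(\theta) = 1 - e^{-1} \approx 0.632

% Time at which exactly half of the population has failed:
F(t_{50}) = 0.5
\qquad\Rightarrow\qquad
t_{50} = \theta \ln 2 \approx 0.693\,\theta
```

So with a constant failure rate, half of the population has already failed at about 69% of the MTBF.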

So MTBF is a really bad point in the product’s history; a lot has gone wrong.  Half* of the products have failed in the customer’s hands.  There are a few things to consider here.  Failures that are classified as infant mortality are not included.  When a product hits its designated end of use-life (hopefully before a wear-out failure) it is removed from the population and replaced with a new one.  So the population we are measuring the fail rate for is continuously in this rotation.

Here’s a statistic that emphasizes just how unintuitive MTBF is.

What is the MTBF of a human?   Over 800 years!  The surprise most of us have with that answer highlights the misconceptions as to what it represents.

We all know that the likely “use-life” of a human is around 65 years.  Basically, on average, “wear-out” based failure modes are going to become more dominant at this point.

If we then take the MTBF number to be representative of the failure rate during use life, it would be attributed to random accidents and illnesses that are fatal.  We will assume that these accidents and illnesses are random, which means we will use the exponential distribution to represent use life.

So by definition, when the time equals MTBF, 63.2% of the population has failed from random accidents and illness during use life.  Each person is replaced by a new one that is past infant mortality and has not run to wear-out (retirement).  So this is sounding a lot like measuring if an employee is going to die.  You don’t care about the status of children who can’t work and people who have retired.  Your question is “How often will an employee call in ‘dead’ so that I have to hire a new one?”  So if it is a company with 20,000 employees, that means that at time = MTBF, 63.2% have died while in your employment.  That is 12,640 employees that have died from random accident or illness.  All of a sudden 800 years is starting to sound about right.  Imagine a town with 20,000 people.  At what point in time would their cemetery have over 12 thousand graves that were only from people between 13 and 65?  All the infant, children, and elder deaths are in a separate cemetery.  800 years.
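The arithmetic above can be checked with a few lines of Python. This is only a sketch under the same assumptions: the 800 year MTBF and the 20,000 headcount are the figures quoted above, failures are exponential, and the 20,000 is read as a fixed starting group for the first number. The per-year figure instead assumes every loss is replaced by a new hire.

```python
import math

MTBF_YEARS = 800    # the human MTBF figure quoted above
EMPLOYEES = 20_000  # the company size used above


def fraction_failed(t_years: float, mtbf_years: float) -> float:
    """Fraction failed by time t under the exponential assumption."""
    return 1.0 - math.exp(-t_years / mtbf_years)


# Share of the starting group lost by the time t equals the MTBF.
deaths_at_mtbf = EMPLOYEES * fraction_failed(MTBF_YEARS, MTBF_YEARS)

# Constant failure rate with replacement: expected losses in any single year.
deaths_per_year = EMPLOYEES / MTBF_YEARS

print(f"{deaths_at_mtbf:,.0f} of {EMPLOYEES:,} lost by t = MTBF")
print(f"{deaths_per_year:.0f} expected losses per year")
```

The unrounded fraction gives a count a couple higher than the 12,640 you get from 63.2%, and the per-year view is the same 800 year MTBF expressed as a handful of losses a year across 20,000 people.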

But all that explanation doesn’t make “A human has an MTBF of 800 years” sound any less ridiculous.

So you win this time, Dr Evil!!!  But someday we will create a better metric… someday.

If you need any therapy on this topic, please head over to www.nomtbf.com.  It’s a great little community created by Fred Schenkelberg.  He’s gone so far as to create “No MTBF” buttons that people can wear so we can find each other at conferences and events to comfort each other in person.

-Adam

 
