Reliability and life-data analysis¶
Reliability analysis asks how long a part survives, and with what confidence. The answer depends on the data you have. Sometimes you have failure times from a life test, some of them censored because the test stopped before every unit failed. Sometimes you have a count of failures over a known number of operating hours. Sometimes you have no field data at all and you are sizing a demonstration test or composing the reliability of a system from its components. mfgQC covers each of these with a separate entry point, and each one reports the assumption that makes its number trustworthy.
This page pins every formula, default, and method to the code in the
mfgqc/reliability/ subpackage: life.py, nonparametric.py, system.py,
availability.py, and demonstrate.py. The maximum-likelihood life fit is regression
tested against R's survreg; the rest are pinned to the standards and texts cited at
the end.
A note on roles. The life-data entry points read two metadata roles off the loaded
table: time (the operating time or cycles at which the unit failed or was censored)
and event (1 for an exact failure, 0 for a right-censored suspension). If you do not
set a time role, the measure column is used as the time. If you do not set an event
role, every observation is treated as an exact failure. Set the roles with .roles(...):
1. Parametric life-data fitting (life_fit)¶
What it is and when you use it¶
A life-data fit puts a probability distribution on time-to-failure. Once you have the fitted distribution you can read off the mean time to failure, the time by which 10 percent of units have failed (B10 life), the median life (B50), and the reliability \(R(t)\) at any time. You use it when you have a sample of failure times, possibly with some units still running at the end of the test (suspensions), and you want a smooth model rather than the empirical step function.
The distributions¶
life_fit supports four distributions (the _DISTS tuple in life.py):
exponential, weibull, lognormal, and normal. The default is weibull. The
parameters mfgQC reports are:
| distribution | parameters | scipy form |
|---|---|---|
exponential |
scale (\(\theta\)) |
expon(scale=θ) |
weibull |
shape (\(\beta\)), scale (\(\eta\)) |
weibull_min(β, scale=η) |
lognormal |
mu, sigma (of \(\ln t\)) |
lognorm(σ, scale=e^{μ}) |
normal |
mu, sigma |
norm(loc=μ, scale=σ) |
The Weibull shape \(\beta\) is the diagnostic that matters most on the plant floor. \(\beta < 1\) is infant mortality (failure rate decreasing with age), \(\beta = 1\) is a constant failure rate (the Weibull reduces to the exponential), and \(\beta > 1\) is wear-out (failure rate increasing with age).
Maximum likelihood and how censoring is handled¶
The default method is maximum likelihood (method="mle"). Censoring is built into the
likelihood, not worked around. Each observation contributes a different term to the log
likelihood depending on what is known about it (_negloglik in life.py):
- an exact failure at \(t\) contributes the log density \(\ln f(t)\);
- a right-censored suspension (still running at \(t\)) contributes the log survival \(\ln R(t) = \ln\big(1 - F(t)\big)\);
- a left-censored unit contributes \(\ln F(t)\);
- an interval-censored unit, known to have failed between \(t_{lo}\) and \(t_{hi}\), contributes \(\ln\big(F(t_{hi}) - F(t_{lo})\big)\).
The total log likelihood is the sum of these terms, and mfgQC minimizes its negative
with a Nelder-Mead search. The public life_fit dispatcher reads the event role and
splits the data into exact failures (event == 1) and right-censored suspensions
(event == 0). The left- and interval-censored terms exist in the likelihood machinery
but are not currently populated by the dispatcher, so through qc.life_fit(...) you get
exact failures plus right censoring. A fit needs at least 2 exact failures; with fewer,
life_fit raises an error and points you to the nonparametric life_table instead.
There is a second method, method="rankreg", which fits by rank regression on the
probability plot. It uses median-rank plotting positions adjusted for suspensions by the
Johnson method, with the Benard approximation \(F = (\text{rank} - 0.3)/(n + 0.4)\), then
fits a straight line to the linearized cumulative distribution. It is the recommended
cross-check when you have few failures, because the Weibull-shape MLE is biased upward
at small failure counts.
The reported MTTF (mean time to failure) is the mean of the fitted distribution, never the sample mean of censored data. For the Weibull, \(\text{MTTF} = \eta\,\Gamma(1 + 1/\beta)\), computed through scipy's frozen distribution.
Confidence intervals¶
For the MLE, the parameter confidence intervals are likelihood-ratio profile
intervals (_lr_ci in life.py), not Wald (normal-approximation) intervals. For each
parameter, mfgQC profiles the log likelihood: it fixes the parameter at a trial value,
re-optimizes the others, and finds the two values where the profile log likelihood drops
below the maximum by \(\chi^2_{1,\,1-\alpha}/2\). These intervals are not symmetric about
the estimate, which matters for the Weibull shape and scale at small samples. The
rank-regression method uses a Wald interval from the numerically estimated Hessian
instead.
What mfgQC checks for you¶
life_fit attaches three assumption checks (_adequacy in life.py):
- Distribution adequacy. The probability-plot correlation (PPCC) is reported, and the check passes if it is at least 0.95. mfgQC also fits all four distributions and reports their AIC so you can compare. It does not switch distributions for you.
- Failure count. Fewer than 10 exact failures is flagged as low power, with a
recommendation to cross-check the shape with
method="rankreg". - Constant failure rate. For a Weibull fit, a shape more than 0.5 away from 1 is flagged, because a constant-rate MTBF would mislead you when the rate is not constant.
Worked example¶
Ten units on a life test. Two were still running when the test stopped (event = 0):
import pandas as pd, mfgqc
df = pd.DataFrame({
"hours": [120, 230, 410, 580, 690, 750, 880, 1010, 1200, 1450],
"event": [1, 1, 1, 1, 0, 1, 1, 0, 1, 1],
})
qc = mfgqc.load(df, measure="hours").roles(time="hours", event="event")
fit = qc.life_fit(dist="weibull")
print(fit.report())
Life fit (weibull, mle): R(t) and percentiles
=============================================
n = 10 failures = 8 suspensions (right-censored) = 2
method = mle (MLE primary; rank regression secondary)
shape = 1.686 [0.86732, 2.8272] (95% CI)
scale = 918.99 [584.8, 1559.8] (95% CI)
MTTF = 820.44 (from the fitted distribution, not the sample mean)
B10 (10% fail) = 241.9 B50 (median life) = 739.44
AIC = 126.59 probability-plot correlation = 0.9929
competing fits (AIC, lower is better): weibull=126.6, exponential=127.1, normal=127.6, lognormal=128.1
Assumption checks:
[PASS] distribution_fit (prob-plot correlation): prob-plot=0.993; n=10
[FAIL] failure_count (exact failures): exact=8; n=10 [low power]
[FAIL] constant_failure_rate (Weibull shape vs 1): Weibull=1.69; n=10
The shape estimate is 1.69, so these units are in wear-out, and mfgQC flags that a
constant-rate MTBF would be the wrong summary here. The profile interval on the shape,
\([0.87,\,2.83]\), is wide because there are only 8 exact failures, and mfgQC flags that
too. The fitted result also exposes fit.R(t) for the reliability at any time and
fit.hazard(t) for the instantaneous failure rate.
2. Nonparametric reliability (life_table, Kaplan-Meier)¶
What it is and when you use it¶
The Kaplan-Meier estimator gives the reliability function \(R(t)\) without assuming any distribution. It is the product-limit step function: the empirical survival curve, corrected for censoring. Use it as the assumption-free baseline, to read the median life straight from the data, or to judge by eye whether a parametric fit is reasonable.
The product-limit estimator¶
Order the distinct failure times \(t_1 < t_2 < \dots\). At each failure time \(t_i\), let
\(d_i\) be the number of failures at \(t_i\) and \(n_i\) the number of units still at risk
(operating and not yet failed or censored) immediately before \(t_i\). The estimator is the
running product (kaplan_meier in nonparametric.py):
A suspension removes a unit from the at-risk count without contributing a failure, so it flattens the curve rather than dropping it. The confidence bounds use Greenwood's formula for the variance,
with a normal \(z\)-interval clipped to \([0, 1]\). The median life is the first failure time at which \(\hat R(t)\) falls to or below 0.5; if the curve never reaches 0.5 (heavy censoring), the median is reported as "not reached." The estimator carries no assumption checks, because there is no distributional assumption to check.
Worked example¶
The same ten units as above:
Kaplan-Meier R(t): 8 failures / 10 units
========================================
n = 10 failures = 8 suspensions = 2
median life = 750 [1200, 410]
t R(t) lower upper
0 1.0000 1.0000 1.0000
120 0.9000 0.7141 1.0000
230 0.8000 0.5521 1.0000
410 0.7000 0.4160 0.9840
580 0.6000 0.2964 0.9036
750 0.4800 0.1587 0.8013
880 0.3600 0.0445 0.6755
1200 0.1800 0.0000 0.4752
1450 0.0000 0.0000 0.0000
The median life of 750 hours sits close to the Weibull B50 of 739 from the parametric fit, which is one reason to trust the Weibull here.
3. MTBF (mtbf)¶
What it is and when you use it¶
When the failure rate is constant (the exponential life model), the natural summary is the mean time between failures: total operating time divided by the number of failures. You use it for repairable systems and for components in their useful-life period, where failures arrive at a roughly steady rate. The constant-rate assumption is the catch, and mfgQC says so on every report.
m = mfgqc.reliability.mtbf(5000.0, failures=3) # total time, failures
m = qc.mtbf() # from time + event roles
The formula and bounds¶
The point estimate is total accumulated test time \(T\) over the number of failures \(r\):
The confidence bounds come from the chi-square distribution and depend on how the test
was stopped (mtbf in demonstrate.py). For a time-terminated test (stopped at a
fixed time, the default kind="time_terminated"):
For a failure-terminated test (stopped at the \(r\)-th failure, kind="failure_terminated"):
The difference is the extra two degrees of freedom in the time-terminated lower bound,
which accounts for the partial interval after the last failure. The default confidence
is 0.90. Passed a QCData, mtbf derives \(T\) as the sum of the time column and \(r\) as
the count of event == 1.
Worked example¶
A test ran for 5000 total unit-hours and saw 3 failures, stopped at a fixed time:
MTBF (time_terminated, constant failure rate)
=============================================
total test time = 5000 failures = 3 test = time_terminated
MTBF = 1666.67 [644.857, 6114.78] (90% CI)
failure rate = 0.0006
The point MTBF is \(5000/3 = 1666.67\) hours, and the 90 percent interval runs from 645 to 6115 hours. The interval is very wide because three failures is little information, which is the honest picture at this sample size.
4. System reliability (system, series, parallel, k_of_n)¶
What it is and when you use it¶
System reliability composes the reliability of a system from the reliabilities of its components. You use it in design and planning, before you have field data, to see whether an architecture meets a target and where redundancy helps. Every formula here assumes the component failures are independent, and mfgQC attaches an independence flag to every result, because a shared cause (a common power supply, a common environment) breaks the assumption.
The formulas¶
For component reliabilities \(R_1, \dots, R_n\) (system.py):
Series (all components must survive):
Parallel (redundant, at least one must survive):
k-of-n (at least \(k\) of \(n\) identical components, each with reliability \(R\), survive) is the upper tail of a binomial:
The system(blocks) function nests these. A blocks argument is a float (one
component), a list (a series of blocks), or a dict {"series": [...]} or
{"parallel": [...]} that nests the above to any depth, so you can build a block diagram
like {"series": [0.99, {"parallel": [0.9, 0.9]}]}.
Worked examples¶
print(mfgqc.reliability.series([0.99, 0.98, 0.97]).report())
print(mfgqc.reliability.parallel([0.90, 0.90]).report())
print(mfgqc.reliability.k_of_n(2, 3, 0.95).report())
System reliability (series)
===========================
structure: series components: 3
R = prod(0.99, 0.98, 0.97)
system reliability = 0.941094
Three components in series multiply to \(0.99 \times 0.98 \times 0.97 = 0.941\). Two components in parallel at 0.90 each give \(1 - (0.10)(0.10) = 0.99\), and a 2-of-3 system of 0.95 components gives 0.99275. Each report repeats the independence assumption in plain words.
5. Availability (availability)¶
What it is and when you use it¶
Availability is the fraction of time a repairable system is up and able to do its job. Reliability asks how long until failure; availability folds in how fast you repair. You use it for equipment that is maintained and returned to service, where the steady-state uptime matters more than a single time-to-failure.
The formulas¶
mfgQC computes three availability indices (availability in availability.py), each a
steady-state ratio of uptime to total time:
Inherent availability (kind="inherent", the default) counts corrective repair only:
Achieved availability (kind="achieved") adds preventive maintenance. With
corrective rate \(\lambda = 1/\text{MTBF}\), preventive frequency \(f_{pm}\) (actions per
unit time), and preventive time \(t_{pm}\) per action, mfgQC forms the mean time between
maintenance and the mean maintenance time,
Operational availability (kind="operational") adds the mean logistics delay time
(waiting for a part or a technician):
Worked example¶
Using the MTBF from the example above (1666.67 hours) and an 8-hour mean repair time:
Availability (inherent)
=======================
inherent availability = 0.995223 (99.5223%)
formula: MTBF / (MTBF + MTTR)
inputs:
mtbf = 1666.67
mttr = 8
The system is up 99.52 percent of the time once corrective repair is accounted for.
6. Bearing life (bearing_life, ISO 281)¶
What it is and when you use it¶
For rolling-element bearings, the standard life metric is the basic rating life \(L_{10}\): the life that 90 percent of a population of identical bearings will reach under a stated constant load and speed. It is a fleet rating, not a single-unit prediction. You use it to size or compare bearings against a duty cycle.
The formula¶
The basic rating life in millions of revolutions follows the load-life relation
(bearing_life in system.py):
where \(C\) is the basic dynamic load rating, \(P\) is the equivalent dynamic bearing load,
and the life exponent \(p\) depends on the contact geometry: \(p = 3\) for ball bearings
(kind="ball") and \(p = 10/3\) for roller bearings (kind="roller"). mfgQC converts to
hours using the shaft speed in rpm,
It also reports the rated life at other reliabilities through the ISO 281 reliability factor \(a_1\), computed from a Weibull slope of 1.5:
Worked example¶
A ball bearing with a dynamic rating of 26000 N, an equivalent load of 4000 N, running at 1800 rpm:
ISO 281 rated bearing life (ball)
=================================
C (dynamic rating) = 26000 P (equivalent load) = 4000 speed = 1800 rpm life exponent p = 3
L10 = 274.62 million revolutions = 2542.82 hours
rated life at other reliabilities (ISO 281 a1 factor):
L10 (R=0.90): 2542.82 hours
L 5 (R=0.95): 1573.64 hours
L 4 (R=0.96): 1351.43 hours
L 3 (R=0.97): 1111.76 hours
L 2 (R=0.98): 845.546 hours
L 1 (R=0.99): 530.866 hours
Here \((C/P)^3 = 6.5^3 = 274.6\) million revolutions, which at 1800 rpm is 2543 hours at 90 percent reliability. Demanding 99 percent reliability cuts the rated life to 531 hours.
7. Demonstration test (demonstration_test)¶
What it is and when you use it¶
A reliability demonstration test sizes a pass/fail test to prove a reliability target at a stated confidence. The common case is the zero-failure (success-run) test: how many units must run without failure to demonstrate reliability \(R\) at confidence \(C\)? You use it to plan acceptance tests and to trade off sample size against the strength of the claim.
The formula¶
The default model is binomial (demonstration_test in demonstrate.py). For a
zero-failure plan (failures=0), the confidence that the true reliability is at
least \(R\) when \(n\) units pass is
which rearranges to the required sample size
You leave exactly one of reliability, confidence, or n as None, and
demonstration_test solves for it. For a fixed-failure plan (failures = f > 0),
the confidence comes from the binomial tail, \(C = 1 - \text{Binom.cdf}(f;\, n,\, 1 - R)\),
and mfgQC searches for the sample size (or solves for reliability) that meets the target.
A zero-failure pass bounds the reliability at the stated confidence; it does not prove it, and it demonstrates the assumed model as much as the hardware. mfgQC prints that caveat on every plan.
Worked example¶
How many units must pass a zero-failure test to demonstrate 95 percent reliability at 90 percent confidence?
Reliability demonstration test (solved for n)
=============================================
demonstrate reliability >= 0.9500 at 90% confidence
plan: n = 45 units, zero-failure (success-run)
assumed model: binomial
The plan requires 45 units to pass with no failures. That matches \(\lceil \ln(0.10)/\ln(0.95) \rceil = \lceil 44.9 \rceil = 45\).
Sources¶
- ISO 281, Rolling bearings: dynamic load ratings and rating life. The basis for the \(L_{10}\) basic rating life, the \((C/P)^p\) load-life relation with \(p = 3\) for ball bearings and \(p = 10/3\) for roller bearings, and the \(a_1\) reliability adjustment. See the bibliography.
- Montgomery, D. C. Introduction to Statistical Quality Control. Wiley. mfgQC's primary source for SPC and capability; the reliability material here draws on the same general framework for the exponential MTBF and chi-square interval. See the bibliography.
- The parametric maximum-likelihood fit matches R's
survregfrom thesurvivalpackage, a standard reference for survival models. It is not a primary text in the bibliography.
The Kaplan-Meier product-limit estimator with Greenwood variance, the series/parallel/ k-of-n system formulas, the availability indices, and the binomial demonstration plan are standard results in the reliability-engineering literature. The bibliography does not yet carry a dedicated reliability-engineering text (for example Meeker and Escobar, or O'Connor); the formulas above are pinned to the cited standards and to the code, and the text notes where no primary source is listed rather than inventing one.
For the full signatures and result-object methods, see the API reference.