Do Older Hard Drives Fail Faster?
Dealing with multiple drive failures every day is quite normal in a large datacenter. However, an unexpectedly high number of failures can pose a threat to data reliability. The simplest way to minimize such failures is to regularly replace older devices with newer ones. The question is how long we should keep drives before returning them: how long do they work reliably enough to keep the number of failures low?
Backblaze (a company providing cloud storage services) has been doing an amazing job of publicly releasing its well-sanitized hard drive logs in an easily consumable form since 2015 [1]. I came across these while working on a reliability analysis project at my university in collaboration with NetApp. This article describes a part of that project, in which I crunched these logs to obtain the failure rate of hard drives as a "continuous" function of their age, which might help in answering the above question.
All the scripts and utilities I used to generate the results, from processing the Backblaze data files to generating the plots, are available at this repo.
Table of Contents
- On Failure Rates
- The Methodology
- Notes on Plots
- Results
  - Seagate ST4000DM000
  - HGST HMS5C4040ALE640
  - HGST HMS5C4040BLE640
  - Hitachi HDS5C3030ALA630
  - Hitachi HDS722020ALA330
  - Seagate ST8000DM002
  - Hitachi HDS5C4040ALE630
  - Seagate ST6000DX000
  - Seagate ST8000NM0055
  - Hitachi HDS723030ALA640
  - Seagate ST31500541AS
  - Seagate ST500LM012 HN
- Results - Low Data
- Results - Bad Egg?
- Conclusions
On Failure Rates
Failure Rate is the frequency with which a device fails (or is expected to fail), expressed in failures per unit time. [2]
For instance, a manufacturer may specify the failure rate of a device to be 0.5 failures per year. We may interpret it in two ways:
- Such a device is expected to fail within 2 years.
- If we have 10 such devices, then 5 of them are expected to fail within a year.
On the other hand, if the device failure rate was specified at 1.5 per year, then:
- Such a device is expected to fail within 0.67 years.
- If we have 10 such devices and promptly replace each one that fails with a new one, then we expect to see 15 failures within a year.
Although perfectly valid, failure rates are rarely expressed in a unit where the value is higher than 1. Continuing the previous example, 1.5 per year is more intuitively expressed as 0.125 per month, since a natural interpretation of failure rate is the fraction of the population that is expected to fail within a unit of time. Furthermore, it is often expressed as a percentage (e.g. 12.5% per month or 150% per year) rather than a simple fraction.
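The conversions above are plain arithmetic. As a minimal sketch (the function name `convert_rate` is purely illustrative and does not come from the project's scripts):

```python
def convert_rate(rate_per_year: float, units_per_year: float) -> float:
    """Re-express a per-year failure rate in a shorter unit of time."""
    return rate_per_year / units_per_year


per_month = convert_rate(1.5, 12)  # 0.125 failures per device-month
print(f"{per_month:.3f} per month, i.e. {per_month:.1%} per month")

# Expected number of failures when the population is kept at a constant size
# through replacements: rate x population x duration.
expected_failures = 1.5 * 10 * 1  # 10 devices observed for 1 year
print(expected_failures)
```

This assumes the rate stays constant over the period, which is exactly the assumption the rest of the article sets out to examine.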
Let's take a simple example. Suppose we observed 20 failures over 2 months from a population of 1000 devices. Then the (average) failure rate is

$$ \frac{20 / 1000}{2 \text{ months}} = 0.01 \text{ per month} = 1\% \text{ per month} $$
The above expression distributes the fraction of failures over the span of observation. This holds only when the change in the size of the population is marginal (or the size is kept constant through replacements). However, in most data centers, the size of the drive population fluctuates significantly due to (a combination of)
- Batches of (new) drives being deployed.
- Batches of (old) drives being returned.
- Failure rate being too high.
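In such cases, what should the "size of population" be? The answer is to use the time-average of the population size, which leads to the following generalized expression for the failure rate:

$$ \text{failure rate} = \frac{\text{number of failures}}{\text{time-averaged population size} \times \text{span of observation}} = \frac{\text{number of failures}}{\text{total device-time observed}} $$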
To illustrate, suppose we started out with 1000 devices. During the first month, 40 devices failed. At the start of the second month, 2000 new devices were added to the population, and 120 devices failed during that month. Then the total device-time we observed is about (960 + 20) + (2840 + 60) = 3880 device-months (assuming that the 40 devices that failed in the first month contributed 20 device-months, and the 120 that failed in the second month contributed 60 device-months), and the failure rate is:

$$ \frac{40 + 120}{3880 \text{ device-months}} \approx 0.041 \text{ per month} = 4.1\% \text{ per month} $$
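The device-time bookkeeping in this example can be sketched in a few lines. The helper names are hypothetical, and the half-month contribution of failed devices is the same simplifying assumption made in the text:

```python
def device_months(survivors: int, failed: int) -> float:
    """Device-time accumulated during one month.

    Assumption: failures are spread evenly over the month, so each failed
    device contributes half a month on average.
    """
    return survivors + 0.5 * failed


def failure_rate(failures: int, device_time: float) -> float:
    """Failures per unit of device-time (here: per device-month)."""
    return failures / device_time


# Month 1: 1000 devices, 40 fail.
month1 = device_months(survivors=1000 - 40, failed=40)
# Month 2: 2000 devices are added to the 960 survivors, 120 fail.
month2 = device_months(survivors=960 + 2000 - 120, failed=120)

rate = failure_rate(40 + 120, month1 + month2)
print(month1 + month2, f"{rate:.2%} per month")
```

With exact failure timestamps, the half-month approximation could be replaced by each device's actual time in service.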
The age of a device is believed to be a key factor in its risk of failing. It is therefore more interesting to study failure rates with respect to age rather than calendar time. To do so, we simply map the events of interest (those discussed in the previous section) from the time domain to the age domain. Let's look at a small example of this mapping.
Suppose we have 2 hard disks: "disk-1" is brand new and "disk-2" is 365 days old. Both of them are kept in operation for the next two years. On the 500th day of operation, the newer disk (disk-1) fails and is removed, while disk-2 keeps working for the entire two years. Here's the time-domain plot of these events:
The red dashed line indicates the failure event. Note that the older disk-2, after two years (730 days) of operation, will be 1095 days old. Here's the same in the age domain:
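From here, the same principles described in the previous sub-section apply for calculating failure rates at various points (or regions) of age. In the above (extremely limited) example, disk-1 contributes 135 days (age 365 to 500) and disk-2 contributes 365 days to the second year of age (365-730 days), with one failure between them, so the average failure rate during that period is

$$ \frac{1 \text{ failure}}{(135 + 365) \text{ disk-days}} = 0.002 \text{ per day} = 73\% \text{ per year} $$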
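The two-disk example can be checked with a short script. This is only a sketch: the tuple layout and the function name `exposure_in_window` are assumptions made for illustration.

```python
def exposure_in_window(start_age, end_age, window_start, window_end):
    """Days a disk was observed while its age lay inside the window."""
    return max(0, min(end_age, window_end) - max(start_age, window_start))


# (age at start of observation, age at end of observation, failed?), in days
disks = {
    "disk-1": (0, 500, True),      # brand new, fails at 500 days of age
    "disk-2": (365, 1095, False),  # 365 days old, survives two more years
}

window = (365, 730)  # second year of age
disk_days = sum(exposure_in_window(s, e, *window) for s, e, _ in disks.values())
failures = sum(
    1 for _, e, failed in disks.values() if failed and window[0] <= e < window[1]
)
rate_per_day = failures / disk_days
print(disk_days, failures, f"{rate_per_day * 365:.0%} per year")
```

Each disk only counts toward a region of age for the days it was actually observed at that age, which is what lets disks deployed at different dates be pooled together.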
Sidenote: For hard disks, age is best represented by power-on time, since the self-monitoring system (S.M.A.R.T.) present in most drives reports a power-on hours attribute.
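As a rough sketch of how power-on age might be pulled from the logs, the snippet below reduces daily records to one age span per drive. It assumes the Backblaze daily CSV columns `serial_number`, `failure` and `smart_9_raw` (S.M.A.R.T. attribute 9, power-on hours). The sample rows are made up for illustration, and `age_spans` is a hypothetical helper, not one of the project's scripts.

```python
import csv
import io


def age_spans(rows):
    """Map serial number -> [first age, last age, failed], ages in days."""
    spans = {}
    for row in rows:
        hours = row["smart_9_raw"]
        if not hours:  # attribute missing in this daily record
            continue
        age_days = float(hours) / 24
        span = spans.setdefault(row["serial_number"], [age_days, age_days, False])
        span[0] = min(span[0], age_days)
        span[1] = max(span[1], age_days)
        span[2] = span[2] or row["failure"] == "1"
    return spans


# Made-up sample in the assumed column layout.
sample = io.StringIO(
    "date,serial_number,model,failure,smart_9_raw\n"
    "2016-01-01,A1,X,0,24\n"
    "2016-01-02,A1,X,1,48\n"
    "2016-01-01,B2,X,0,8760\n"
)
spans = age_spans(csv.DictReader(sample))
print(spans)
```

Real logs would also need grouping by model and some handling of drives whose power-on counter is missing or resets; none of that is covered here.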
The Methodology
- Using disk logs, construct
  - The number of disks observed as a function of power-on time (\(N(t)\)).
  - The cumulative number of failures as a function of power-on time (\(C(t)\)).
- Apply a Savitzky-Golay filter (a smoothing technique) with polynomial order 1 and a fixed window size to \(N(t)\) and \(C(t)\).
  - Let \(N_s(t)\) denote the smoothed \(N(t)\).
  - Let \(C_s'(t)\) denote the first derivative of the smoothed \(C(t)\).
- Then the failure rate as a function of power-on time is computed as
$$ \lambda(t) = \frac{C_s'(t)}{N_s(t)} $$
The above method is equivalent to what I discussed in the previous section, applied with a moving window. At \(t_0\), \(C_s'(t_0)\) is the average number of failures per unit time in a particular window around \(t_0\), and \(N_s(t_0)\) is the average number of disks in the same window.
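A minimal, dependency-free sketch of this pipeline is shown below. It is not the code from the repo. It assumes a daily age grid, per-disk spans given as `(start_age, end_age, failed)` tuples, and it skips the edges of the curve. For polynomial order 1, the Savitzky-Golay smoothed value is the window mean and the derivative is the slope of the least-squares line through the window. SciPy's `scipy.signal.savgol_filter` (with `polyorder=1`, and `deriv=1` for the derivative) should give the same interior values.

```python
def savgol_order1(y, half_width, step=1.0):
    """Order-1 Savitzky-Golay on a uniform grid, interior points only.

    Returns (smoothed values, first derivatives) at the window centres.
    """
    offsets = range(-half_width, half_width + 1)
    norm = sum(k * k for k in offsets)
    smoothed, slopes = [], []
    for i in range(half_width, len(y) - half_width):
        window = y[i - half_width : i + half_width + 1]
        smoothed.append(sum(window) / len(window))
        slopes.append(sum(k * v for k, v in zip(offsets, window)) / (norm * step))
    return smoothed, slopes


def failure_rate_curve(spans, max_age, half_width):
    """lambda(t) = C_s'(t) / N_s(t), as a list of (age in days, rate per day)."""
    ages = range(max_age + 1)
    # N(t): disks under observation at age t; C(t): failures at age <= t.
    n = [sum(1 for s, e, _ in spans if s <= t <= e) for t in ages]
    c = [sum(1 for _, e, failed in spans if failed and e <= t) for t in ages]
    n_s, _ = savgol_order1(n, half_width)
    _, c_rate = savgol_order1(c, half_width)
    centres = list(ages)[half_width : max_age + 1 - half_width]
    return [
        (t, cr / ns if ns else float("nan"))
        for t, cr, ns in zip(centres, c_rate, n_s)
    ]


# Toy data: ten disks observed from age 0 to 10 days, one of which fails at day 5.
toy = [(0, 10, False)] * 9 + [(0, 5, True)]
curve = failure_rate_curve(toy, max_age=10, half_width=1)
print(curve)
```

A wider window trades resolution in age for lower noise, which is why the window size varies from model to model in the results.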
Notes on Plots
- In the top subplots, "disks observed" denotes \(N_s(t)\) and "rate of failures" denotes \(C_s'(t)\).
- The window size of the filter is chosen per model and roughly follows the number of disks that were observed at various points of age (power-on time): models with fewer disks observed get a wider window.
- The failure rate values in regions where the number of disks observed is less than 1000 are not to be taken too seriously (both low and high).
Results
Seagate ST4000DM000
Total disk-years observed: 87647
Total failures observed: 2593
Mean failure rate: 2.96% per year
Useful power-on span of observation: 4.42 years
Mean number of disks over useful span: 19828
Window size: 1 month
HGST HMS5C4040ALE640
Total disk-years observed: 22824
Total failures observed: 129
Mean failure rate: 0.57% per year
Useful power-on span of observation: 4.88 years
Mean number of disks over useful span: 4673
Window size: 1 month
HGST HMS5C4040BLE640
Total disk-years observed: 22220
Total failures observed: 128
Mean failure rate: 0.58% per year
Useful power-on span of observation: 3.51 years
Mean number of disks over useful span: 6317
Window size: 1 month
Hitachi HDS5C3030ALA630
Total disk-years observed: 15645
Total failures observed: 123
Mean failure rate: 0.79% per year
Useful power-on span of observation: 5.72 years
Mean number of disks over useful span: 2731
Window size: 45 days
Hitachi HDS722020ALA330
Total disk-years observed: 11894
Total failures observed: 202
Mean failure rate: 1.70% per year
Useful power-on span of observation: 6.11 years
Mean number of disks over useful span: 1944
Window size: 3 months
(Note: The initial spikes in failure rate are in a region with fewer than 100 disks observed.)
Seagate ST8000DM002
Total disk-years observed: 10191
Total failures observed: 113
Mean failure rate: 1.11% per year
Useful power-on span of observation: 1.36 years
Mean number of disks over useful span: 7461
Window size: 15 days
Hitachi HDS5C4040ALE630
Total disk-years observed: 9820
Total failures observed: 63
Mean failure rate: 0.64% per year
Useful power-on span of observation: 5.06 years
Mean number of disks over useful span: 1937
Window size: 2 months
Seagate ST6000DX000
Total disk-years observed: 4749
Total failures observed: 59
Mean failure rate: 1.24% per year
Useful power-on span of observation: 3.15 years
Mean number of disks over useful span: 1502
Window size: 2 months
Seagate ST8000NM0055
Total disk-years observed: 3653
Total failures observed: 44
Mean failure rate: 1.20% per year
Useful power-on span of observation: 0.83 years
Mean number of disks over useful span: 4370
Window size: 15 days
Hitachi HDS723030ALA640
Total disk-years observed: 3332
Total failures observed: 67
Mean failure rate: 2.01% per year
Useful power-on span of observation: 5.04 years
Mean number of disks over useful span: 659
Window size: 3 months
Total disk-years observed: 3302
Total failures observed: 166
Mean failure rate: 5.03% per year
Useful power-on span of observation: 4.71 years
Mean number of disks over useful span: 701
Window size: 3 months
Seagate ST31500541AS
Total disk-years observed: 2818
Total failures observed: 274
Mean failure rate: 9.72% per year
Useful power-on span of observation: 5.70 years
Mean number of disks over useful span: 494
Window size: 6 months
Seagate ST500LM012 HN
Total disk-years observed: 1555
Total failures observed: 39
Mean failure rate: 2.51% per year
Useful power-on span of observation: 3.10 years
Mean number of disks over useful span: 502
Window size: 4 months
Total disk-years observed: 1264
Total failures observed: 58
Mean failure rate: 4.59% per year
Useful power-on span of observation: 3.17 years
Mean number of disks over useful span: 396
Window size: 6 months
Total disk-years observed: 874
Total failures observed: 40
Mean failure rate: 4.58% per year
Useful power-on span of observation: 3.47 years
Mean number of disks over useful span: 251
Window size: 6 months
Results - Low Data
The following models were not as popular as the ones seen so far, and the available logs were insufficient to produce low-noise graphs. Even with a wider window, failure rates reached significantly higher values for these models, so the Y-axis limit was raised to 48%.
Total disk-years observed: 765
Total failures observed: 52
Mean failure rate: 6.80% per year
Useful power-on span of observation: 5.16 years
Mean number of disks over useful span: 148
Window size: 8 months
Seagate ST4000DX000
Total disk-years observed: 688
Total failures observed: 76
Mean failure rate: 11.05% per year
Useful power-on span of observation: 4.61 years
Mean number of disks over useful span: 149
Window size: 8 months
Seagate ST31500341AS
Total disk-years observed: 506
Total failures observed: 125
Mean failure rate: 24.70% per year
Useful power-on span of observation: 4.57 years
Mean number of disks over useful span: 110
Window size: 8 months
Seagate ST33000651AS
Total disk-years observed: 440
Total failures observed: 19
Mean failure rate: 4.32% per year
Useful power-on span of observation: 3.75 years
Mean number of disks over useful span: 116
Window size: 8 months
Results - Bad Egg?
One model stood out from the rest with a staggeringly high failure rate.
Seagate ST3000DM001
Total disk-years observed: 3480
Total failures observed: 1454
Mean failure rate: 41.78% per year
Useful power-on span of observation: 3.28 years
Mean number of disks over useful span: 1060
Window size: 3 months
(Note: The upper-limit of Y-axis is 120%!)
Just to make sure the logs were sane [3], I plotted the number of failures reported per day and the number of disks in deployment against date.
For comparison, below is the same for the most popular model.
That number of failures is indeed quite an anomaly. I can think of a few possible reasons for this:
- It was caused by an issue in the hardware these drives were part of, such as a faulty power supply.
- These hard disks exhibit some behavior that makes the failure detection software raise "false" alarms ("false" from the perspective of the manufacturer, at least).
- They were indeed bad eggs!
I'm not sure which it is, and Backblaze only had this to say about these drives [4]:
The Seagate Barracuda 7200.14 3 TB drives are another story. We’ll cover how we handled their failure rates in a future blog post [that never came].
Conclusions
The method I presented is useful for investigating the dependence of failure rate on the age of drives. In particular, the method focuses on the span of age over which a fair amount of data is available.
Looking only at the failure rate trend of the most popular model, one might be tempted to conclude that hard drives have a higher risk of failing as they age. But most other models seem unaffected by age (within the observed span), and some even show a decreasing trend.
Experts claimed that this journey would produce bathtubs! However, none of what I produced matches that description. Perhaps forces are at play that prevent their manifestation, or perhaps the journey is incomplete.
[2] Partly borrowed from
[3] The data released by Backblaze is of exceptionally high quality. I've seen things and written queries that still haunt me at night!