Mean Time Between Failures

Posted by Ryan Brown on Feb 28, 2019

There are acronyms for everything nowadays. The technology industry is especially prone to them, and enterprise hardware even more so. Hard drives alone have dozens. Drives are classified by media type (HDD or SSD) and by interface (SAS or SATA). Within those types, performance and endurance have acronyms of their own. Here’s a breakdown of some of the acronyms related to drive failures: what they mean, how they’re calculated, and how you can use them.

The first type of failure is a physical failure of any part of the drive. Drive manufacturers express the likelihood of this failure as Mean Time Between Failures, or MTBF. There are many ways that people try to explain this calculation, but it is basically the total number of powered-on hours a population of drives is expected to accumulate between failures. It is a statistical figure for a population, not a prediction of how long an individual drive will last. If you had 1,000 drives running 24 hours a day (24,000 drive-hours per day in total) with an MTBF of 1,500,000 hours, you would expect one failure about every 62.5 days (1,500,000 / 24,000). MTBF is generally more relevant to spinning disks, which have moving parts.
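As a rough illustration, the fleet arithmetic above can be written as a few lines of Python. This is only a sketch: the function name and parameters are made up for this example, and it assumes every drive is powered on for the same number of hours each day.

```python
def days_between_expected_failures(mtbf_hours, drive_count, hours_per_day=24):
    """Expected days between failures across a fleet of identical drives.

    Hypothetical helper for illustration; assumes all drives run the
    same number of hours per day.
    """
    # Total powered-on hours the whole fleet accumulates each day.
    fleet_hours_per_day = drive_count * hours_per_day
    # MTBF is the fleet-wide hours expected between failures.
    return mtbf_hours / fleet_hours_per_day


# The example from the text: 1,000 drives, 24 hours a day, 1,500,000-hour MTBF.
print(days_between_expected_failures(1_500_000, 1_000))  # 62.5
```

Changing the drive count shows why MTBF matters more as a fleet grows: with 10,000 drives, the same MTBF works out to an expected failure roughly every 6.25 days.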

The second type of failure is specific to solid state drives. Each cell in the flash memory can only be written a limited number of times. Summing the total amount of data that can be written across all the cells gives you the drive’s Total Bytes Written, or TBW (also expanded as Terabytes Written). Many manufacturers also use Drive Writes Per Day, or DWPD: the TBW divided by the drive capacity and by the number of days in the warranty period, commonly 5 years (1,825 days). In other words, it is how many times you can overwrite the full drive each day for the length of the warranty. These figures are collectively referred to as the endurance of a drive and are the most common measure for SSD failures, because SSDs are not necessarily as sensitive to powered-on hours as spinning disks are. The tricky part is that a manufacturer may only give you one figure. But by knowing the math, you can calculate one from the other and then see if your workload is covered by the drive specs. If the manufacturer only gives you DWPD, multiply it by the drive capacity and by 1,825 to get TBW. You can then divide the TBW by the number of days you want to keep the drive to get the amount you can write per day, and compare that with your daily write workload.
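The endurance math can be sketched in Python as well. The function names here are invented for this example, capacities and write volumes are in terabytes, and the 1,825-day period is an assumption that matches a 5-year warranty. Check the warranty length on your drive's spec sheet before relying on it.

```python
WARRANTY_DAYS = 1825  # assumed 5-year warranty; adjust to the drive's spec sheet


def tbw_from_dwpd(dwpd, capacity_tb, warranty_days=WARRANTY_DAYS):
    """Total terabytes written implied by a DWPD rating."""
    return dwpd * capacity_tb * warranty_days


def dwpd_from_tbw(tbw_tb, capacity_tb, warranty_days=WARRANTY_DAYS):
    """Drive writes per day implied by a TBW rating."""
    return tbw_tb / (capacity_tb * warranty_days)


def allowed_daily_writes_tb(tbw_tb, service_days):
    """Terabytes you can write per day if the drive must last service_days."""
    return tbw_tb / service_days


def workload_is_covered(tbw_tb, service_days, daily_writes_tb):
    """True if the daily write workload fits within the drive's endurance."""
    return daily_writes_tb <= allowed_daily_writes_tb(tbw_tb, service_days)


# Hypothetical drive: 2 TB capacity rated at 1 DWPD.
tbw = tbw_from_dwpd(1, 2)
print(tbw)  # 3650 (terabytes)

# Keep it for 7 years (2,555 days) instead of 5.
print(workload_is_covered(tbw, 2555, daily_writes_tb=1.0))  # True
print(workload_is_covered(tbw, 2555, daily_writes_tb=2.0))  # False
```

In this made-up case, stretching the service life from 5 to 7 years drops the write budget from 2 TB per day to about 1.43 TB per day, so a 1 TB daily workload is still covered but a 2 TB one is not.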