S.M.A.R.T., or not so smart…

A few months ago I did some research on hard disk S.M.A.R.T. data. For those not in the loop, it stands for Self-Monitoring, Analysis, and Reporting Technology, and is basically a mechanism modern hard disks use to track their health. Wikipedia has a great write-up on the concept, but the short version is that hard disks monitor things such as temperature, bad block relocation actions, unrecoverable read errors and the like.

The disks also track interesting things like power outages, shocks, and spin-up times. There’s a whole list of them on the Wikipedia page, but of course, not all metrics are captured by all drives.

My research involved trying to find a method to predict in advance whether a hard disk was healthy enough to survive full disk encryption without breaking. The challenge was that any test to determine this was as likely to cause the drive to fail as encrypting it was. Either way you ended up with a broken hard disk and no user data.

There are two common causes of hard disk failure related to full disk encryption – thermal shock and undiscovered bad blocks.

Thermal problems are pretty easy to track using the S.M.A.R.T. system – all drives I’ve seen track “Temperature Difference from 100”, which gives a cross-manufacturer index of how close the drive is to its design thermal limit. You can get this by querying attribute 190 in the S.M.A.R.T. data using either the vendor tools or WMI (a rough sketch of the WMI route follows below). Unfortunately there’s no clear indication that thermal problems have any bearing on drive lifetime.
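
Something like the following, very roughly – this is just a sketch, assuming Windows, the third-party Python “wmi” package, the root\wmi MSStorageDriver_FailurePredictData class, and the conventional 12-byte-per-attribute layout of the vendor data block (and remember, not every drive reports attribute 190):

# Sketch: read S.M.A.R.T. attribute 190 ("Temperature Difference from 100") via WMI.
# Assumes the third-party "wmi" package (pip install wmi) and administrator rights.
import wmi

ATTR_TEMP_DIFF = 190  # also reported as "Airflow Temperature"

c = wmi.WMI(namespace="root\\wmi")
for drive in c.MSStorageDriver_FailurePredictData():
    data = bytes(drive.VendorSpecific)
    # Vendor block: 2-byte header, then up to 30 entries of 12 bytes each
    # (id, 2 flag bytes, current value, worst value, 6 raw bytes, reserved).
    for i in range(30):
        off = 2 + i * 12
        if data[off] == ATTR_TEMP_DIFF:
            value = data[off + 3]                                   # normalised value
            raw = int.from_bytes(data[off + 5:off + 11], "little")  # raw reading
            print(f"{drive.InstanceName}: attr 190 value={value} raw={raw}")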

Undiscovered bad blocks are a much harder problem to work around. The theory is that because the average hard disk is much bigger than the working data set most users actually touch, it’s quite possible for the drive to have lots of undiscovered bad sectors.

Normally, you’d never notice a bad sector on a disk – the firmware of the drive itself is responsible for noticing bad conditions and mitigating them by “remapping” the bad location to a reserved part of the drive. Again, Wikipedia has a good write-up, but simply (there’s a toy sketch after the walkthrough below):

The OS writes a file in a number of sectors:

1001,1002,1003,1004,1005,1006

The drive does the write, but notices that sector 1003 is unreliable, so it makes a remap entry pointing to a reserved sector, 10001

The OS reads the sectors from the disk:

1001,1002,1003,1004,1005,1006

but the drive returns the data using the remap table:

1001,1002,10001,1004,1005,1006
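
Here’s that same walkthrough as a toy model in Python – purely illustrative, since real firmware does all of this internally and never exposes its remap table to the OS:

# Toy model of the drive's invisible remapping – illustration only.
RESERVED_START = 10001   # pretend the reserved area begins at sector 10001
remap_table = {}         # logical sector -> reserved sector
storage = {}             # "physical" sector -> data

def write_sector(sector, data, unreliable=False):
    """Write data; if the sector proves unreliable, remap it to the reserved area."""
    if unreliable and sector not in remap_table:
        remap_table[sector] = RESERVED_START + len(remap_table)
    storage[remap_table.get(sector, sector)] = data

def read_sector(sector):
    """The OS asks for the original sector number; the remap is applied silently."""
    return storage[remap_table.get(sector, sector)]

# The OS writes sectors 1001-1006; the drive notices 1003 is unreliable.
for s in range(1001, 1007):
    write_sector(s, f"data-{s}", unreliable=(s == 1003))

print(remap_table)        # {1003: 10001}
print(read_sector(1003))  # 'data-1003', transparently served from sector 10001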

Normally this remapping would happen over time as more and more of the disk gets used, but of course with full disk encryption, the first thing we do is write to every sector of the drive to encrypt it. (Many people ask why we don’t just encrypt the sectors currently in use; the reason is that doing so opens the drive up to plaintext attacks, and also that knowing what’s really in use is quite hard – we want to encrypt files which have been deleted as well.)

Because of this, if there are undiscovered bad sectors on the disk they will get remapped as part of the encryption process. If there are too many to be invisibly remapped by the drive itself, it will start reporting them as bad to the OS, and that’s when things start breaking.

Many people have tried doing a Windows chkdsk prior to encrypting the disk, but of course, unless you do the bad block scan (the /R switch) it’s pointless – full disk encryption doesn’t really care how valid the file system is; the file system sits way above the level we are working at. Of course, a bad block scan will cause the invisible remapping of real bad blocks just like encryption would, so it’s prone to cause the same loss of data.

The only thing that seemed to help was to use the S.M.A.R.T. counters which tell you how many bad blocks the drive has remapped (or rather, how much reserved space is left for remapping new blocks). If that count starts changing during encryption, it’s a good indication that the process should be stopped so the user can back up their drive and replace it.
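
A minimal sketch of that sort of watchdog, assuming smartmontools’ smartctl is installed and a Linux-style device name like /dev/sda (attribute 5, the reallocated sector count, is the one being watched – the device name and polling interval are just examples):

# Sketch: warn if the reallocated-sector count rises while encryption is running.
# Assumes smartctl (smartmontools) is on the PATH; the device name is an example.
import subprocess
import time

ATTR_REALLOCATED = 5  # Reallocated Sector Count

def read_raw_value(device, attr_id):
    """Return the raw value of one S.M.A.R.T. attribute via 'smartctl -A'."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit() and int(fields[0]) == attr_id:
            return int(fields[9])  # the RAW_VALUE column
    return None

def watch_reallocations(device="/dev/sda", interval=60):
    """Poll during encryption; flag any growth over the starting baseline."""
    baseline = read_raw_value(device, ATTR_REALLOCATED)
    while True:
        time.sleep(interval)
        current = read_raw_value(device, ATTR_REALLOCATED)
        if None not in (baseline, current) and current > baseline:
            print(f"WARNING: reallocated sectors rose from {baseline} to {current} – "
                  "stop encrypting, back up, and replace the drive.")
            baseline = current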

Google performed a study of S.M.A.R.T. data versus actual drive failures in 2007 and came to some interesting conclusions:

Work at Google on over 100,000 drives has shown little overall predictive value of S.M.A.R.T. status as a whole, but suggests that certain sub-categories of information which some S.M.A.R.T. implementations track do correlate with actual failure rates – specifically, in the 60 days following the first scan error on a drive, the drive is, on average, 39 times more likely to fail than it would have been had no such error occurred. Also, first errors in re-allocations, offline re-allocations and probational counts are strongly correlated to higher probabilities of failure.

They also found that 56% of the drives that failed showed no worrisome indicators in the core S.M.A.R.T. data set, and even taking the entire S.M.A.R.T. data population into account, 36% still failed without any indication whatsoever. One positive metric Google produced was that once a drive reported its first scan error, it was 39 times more likely to fail in the following 60 days than a drive with no reported errors.

Even with S.M.A.R.T., and even if you track every metric, at least a third of your drives will still fail without warning.

You can download a VBS Class to show you SMART data from CTOGoneWild

  1. Ron Metgzer
    July 3, 2009 at 11:12

    While I have read and studied the S.M.A.R.T. study published by Google, I have found that S.M.A.R.T. data is useless unless someone actually looks at the data prior to failure, since Windows seems unlikely to report marginal issues to the user, or at least to a log file which could be reviewed by a systems administrator.

    That said, bad block identification has been a problem for many, many years. Steve Gibson of Gibson Research Corporation (grc.com) has a program called SpinRite which works very deeply on the drive. SpinRite identifies and potentially corrects soft errors on the drive that would otherwise get marked as ‘bad’ blocks.

    SpinRite uses a non-destructive method of recovering data from marginal sectors. It is a very data-intensive process. However, this style of technology would be a perfect fit for EEPC during the encryption process.

    Whether EEPC uses a simple test of the entire drive prior to beginning the encryption process, or simply invokes bad block recovery when issues are discovered, I leave that up to ‘you.’

    Currently:
    Step 1: If my data is important (prior to encryption), I would first back up the drive and physically secure the backup (after a test of the backup of course).

    Step 2: Next, I would run SpinRite on the drive, at Level 2 at least, Level 4 if this is a critical system. This way I have a good idea of possible bad block errors prior to the encryption process. If the drive is having issues, I would replace the drive, restore my data, and begin Step 2 again.

    Step 3: Encrypt.

    I never AssUMe the drive is of good integrity before encryption. In my humble opinion, nor should ‘you.’

  2. August 5, 2009 at 05:37

    There are tools capable of monitoring the SMART attributes and alerting you if any attribute is approaching its threshold value. Also, by tracking changes to the attributes over time and extrapolating the results, they can predict a possible failure date. You can check DiskCheckup; it’s free for personal use.

  3. March 13, 2010 at 12:18

    It can be controlled – anything can be done on a computer, because all hardware devices are controlled via software. And always use RAID II.

  4. Simon Hunt
    March 13, 2010 at 13:30

    RAID II? No one uses RAID II any more – it’s clumsy, computationally expensive, and requires complex hardware. It’s also dog-slow compared to other levels. I’m not sure there’s been any practical reason to use RAID II in the last 20 years? http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_2

