Monday, December 12, 2011

How To Detect a Failing Hard Drive

Hard drive failures are rarely positive events. How much is lost depends on what is on the drive and whether that data is replaceable. Most businesses configure RAID (redundant array of independent disks) on their systems, so the loss of a single hard drive (or potentially a small number of drives) can be tolerated without losing any data. Additionally, most companies develop disaster recovery and business continuity plans that involve backing up data and restoring it to a specific point in time.

Most home users don't build systems with RAID arrays and the systems that come pre-configured from vendors such as Dell and HP only ship with a single drive (unless multiple drives are configured into the system and purchased). Even if multiple drives are shipped, RAID is not typically configured by the factory before the PC ships. This means that when a home user's hard drive fails, everything from documents, games, music, and movies to family photos and other items that have a lot of sentimental value can be irreversibly lost. In some cases, the data can be recovered by a data recovery specialist, but the data recovered might be incomplete or corrupt.

The hard drive industry has had a number of years to work on this problem. It has made great strides both in reducing the number of drive failures (increasing the mean time between failures, MTBF) and in predictive analysis that may indicate when a drive is close to failing. Predictive analysis may not help in the case of a sudden catastrophic loss of the drive. One useful tool for predictive analysis is the Self-Monitoring, Analysis and Reporting Technology (SMART) functionality built into most modern hard drives. Note that a Google study demonstrated that only a subset of the SMART attributes is useful for predictive analysis; the others are purely informational.

Each hard drive manufacturer (Seagate, Western Digital, Intel, Hitachi, Samsung, OCZ, etc.) defines its own metrics to track and expose through SMART, but there are a number of standardized attributes. Vendors also have the flexibility to add specific logs that can be checked through SMART. If logs and metrics aren't enough, SMART can also run self-tests on the hard drive.

Viewing these attributes and logs requires a tool such as smartctl (part of smartmontools). I demonstrate smartmontools here because there are ports for a number of different platforms (Windows, Linux, UNIX, Mac OS, etc) and the source code is freely available to compile on new platforms. Note that most hardware RAID arrays do not expose the drives directly to the operating system, but most have vendor-supported tools for viewing SMART statistics on each of the attached drives.

Note that for this demonstration, I am using the Windows build of smartmontools version 5.42-1. If the command line arguments presented here don't work, see if you need slightly different parameters by running smartctl -h.

To start, I opened a command prompt (cmd.exe) and navigated to the binary install path for smartmontools (for me, C:\Program Files (x86)\smartmontools\bin). From there, I used the --scan option of smartctl to identify which drives the operating system sees:

C:\Program Files (x86)\smartmontools\bin>smartctl --scan
/dev/sda -d ata # /dev/sda, ATA device
/dev/sdb -d ata # /dev/sdb, ATA device
/dev/sdc -d ata # /dev/sdc, ATA device 
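
If you want to script this step, the scan output is easy to parse. The following is a minimal Python sketch, not part of smartmontools: the function names parse_scan_output and scan_devices are my own, and it assumes smartctl is on the PATH and prints lines in the format shown above.

```python
import subprocess

def parse_scan_output(text):
    """Turn `smartctl --scan` output into a list of (device, type) tuples."""
    devices = []
    for line in text.splitlines():
        # Drop the trailing "# ..." description, then split the remaining fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[1] == "-d":
            devices.append((fields[0], fields[2]))
    return devices

def scan_devices(smartctl="smartctl"):
    """Run `smartctl --scan` (assumes smartmontools is installed) and parse it."""
    result = subprocess.run([smartctl, "--scan"], capture_output=True, text=True)
    return parse_scan_output(result.stdout)

# Sample text copied from the scan output above.
sample = """/dev/sda -d ata # /dev/sda, ATA device
/dev/sdb -d ata # /dev/sdb, ATA device
/dev/sdc -d ata # /dev/sdc, ATA device"""

print(parse_scan_output(sample))
```

Keeping the parser separate from the subprocess call means it can be tried on saved output from another machine.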
 
In this case, my system has three SATA drives, and all of them are visible to the operating system. Note that smartctl reports the devices in a Linux-style notation (/dev/xxx) rather than the SCSI notation (port, bus, target, logical unit) that Windows uses internally for SATA drives. The Windows view of the same drives is available from the msinfo32 utility.

Next, we can select one of the drives and retrieve all of its SMART information with the -a option:

C:\Program Files (x86)\smartmontools\bin>smartctl -a /dev/sda 
  
The first section of the output is the identifying information for the hard drive:

smartctl 5.42 2011-10-20 r3458 [i686-w64-mingw32-2008r2(64)-sp1] (sf-win32-5.42-1)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3500630AS
Serial Number:    9QG3ZZZ9
Firmware Version: 3.AAK
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Dec 12 13:30:19 2011 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled 
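
For inventory scripts, the information section can be read into a dictionary. This is only a sketch under the assumption that each line has the "Key: value" form shown above; parse_info_section is a name I made up. Because "SMART support is" appears twice, each key maps to a list of values.

```python
def parse_info_section(text):
    """Collect "Key: value" lines from the information section of `smartctl -a`."""
    info = {}
    in_section = False
    for line in text.splitlines():
        if line.startswith("==="):
            # Section markers look like "=== START OF INFORMATION SECTION ===".
            in_section = "INFORMATION SECTION" in line
            continue
        if not in_section:
            continue
        # Split on the first colon only, so values may contain colons (times).
        key, sep, value = line.partition(":")
        if sep:
            info.setdefault(key.strip(), []).append(value.strip())
    return info

# Sample text copied from the output above (shortened).
sample = """Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3500630AS
Firmware Version: 3.AAK
Local Time is:    Mon Dec 12 13:30:19 2011 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""

info = parse_info_section(sample)
print(info["Device Model"], info["SMART support is"])
```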
 
The second section reports the overall health self-assessment and identifies the drive's SMART capabilities.
 

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
     was completed without error.
     Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
     without error or no self-test has ever 
     been run.
Total time to complete Offline 
data collection:   (  430) seconds.
Offline data collection
capabilities:     (0x5b) SMART execute Offline immediate.
     Auto Offline data collection on/off support.
     Suspend Offline collection upon new
     command.
     Offline surface scan supported.
     Self-test supported.
     No Conveyance Self-test supported.
     Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
     power-saving mode.
     Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
     General Purpose Logging supported.
Short self-test routine 
recommended polling time:   (   1) minutes.
Extended self-test routine
recommended polling time:   ( 163) minutes.

The next section lists the vendor-specific attributes (metrics). This is an important section for predicting failures. Several values are reported for each attribute: the normalized value (VALUE column), the worst normalized value recorded while SMART has been enabled (WORST column), the threshold (THRESH column), and the raw value (RAW_VALUE column).

The threshold requires more explanation. VALUE and WORST are normalized by the vendor (the smartctl documentation gives a range of 1 to 254) and are typically reported so that lower is worse. If VALUE is at or below THRESH, the attribute is failing now; if only WORST is at or below THRESH, the attribute failed at some point in the past. The WHEN_FAILED column records which case applies. For attributes of type Pre-fail, a failure may be a sign that the disk needs to be replaced immediately, because the drive is predicting its own failure, possibly within 24 hours. Attributes of type Old_age are more informational (temperature, for example); a failure there may not indicate an impending failure, but may indicate increased wear on the drive. In my case, I had some short-term cooling problems with the PC, so SMART reports a past failure for Airflow_Temperature_Cel (WORST 044 against THRESH 045, marked In_the_past).

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   067   006    Pre-fail  Always       -       83089801
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       161
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail  Always       -       449282521
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       33680
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       163
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   044   045    Old_age   Always   In_the_past 36 (Min/Max 24/39)
194 Temperature_Celsius     0x0022   036   056   000    Old_age   Always       -       36 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   062   056   000    Old_age   Always       -       55988229
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0 
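
The threshold comparison can be automated. The sketch below is my own illustration rather than anything shipped with smartmontools: parse_attributes and when_failed are hypothetical names, and the code assumes the ten-column layout shown above. It follows the smartctl convention that an attribute fails when its normalized value is less than or equal to the threshold.

```python
from collections import namedtuple

Attribute = namedtuple(
    "Attribute", "id name value worst thresh type when_failed raw")

def parse_attributes(text):
    """Parse rows of the 'Vendor Specific SMART Attributes' table."""
    attrs = []
    for line in text.splitlines():
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        # The raw value may contain spaces, so split at most nine times.
        f = line.split(None, 9)
        if len(f) < 10 or not f[0].isdigit():
            continue  # skips the header row and unrelated lines
        attrs.append(Attribute(int(f[0]), f[1], int(f[3]), int(f[4]),
                               int(f[5]), f[6], f[8], f[9].strip()))
    return attrs

def when_failed(attr):
    """Recompute the WHEN_FAILED column from VALUE, WORST and THRESH."""
    if attr.value <= attr.thresh:
        return "FAILING_NOW"
    if attr.worst <= attr.thresh:
        return "In_the_past"
    return "-"

# Three rows copied from the table above.
sample = """ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   044   045    Old_age   Always   In_the_past 36 (Min/Max 24/39)
194 Temperature_Celsius     0x0022   036   056   000    Old_age   Always       -       36 (0 21 0 0 0)
"""

attrs = parse_attributes(sample)
for a in attrs:
    print(a.id, a.name, a.type, when_failed(a))
```

For these rows, only attribute 190 is flagged, which matches the In_the_past entry that smartctl printed.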
 
The next pieces are the error log (also very important for determining impending failure) and the self-test logs.

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay. 
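
For unattended monitoring, the sections above do not have to be read by eye: the smartctl man page documents the exit status as a bitmask that summarizes them. Here is a short Python sketch of decoding it. The bit meanings are paraphrased from the man page, and decode_exit_status is a hypothetical helper name; check the man page for your version before relying on it.

```python
# Bit meanings paraphrased from the EXIT STATUS section of the smartctl man page.
EXIT_BITS = [
    "command line did not parse",                          # bit 0
    "device open failed",                                  # bit 1
    "a SMART or ATA command failed, or checksum error",    # bit 2
    "SMART status check returned DISK FAILING",            # bit 3
    "prefail attributes at or below threshold",            # bit 4
    "some attributes were at or below threshold in the past",  # bit 5
    "error log contains records of errors",                # bit 6
    "self-test log contains records of errors",            # bit 7
]

def decode_exit_status(status):
    """Return the list of conditions encoded in a smartctl exit status."""
    return [msg for bit, msg in enumerate(EXIT_BITS) if status & (1 << bit)]

# Example values; 32 has only bit 5 set, 0 means nothing to report.
print(decode_exit_status(0))
print(decode_exit_status(32))
```

In a script, the status would come from the returncode attribute of the result of subprocess.run on a smartctl command.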
 
From this data, the drive is not expected to fail immediately, and all of the metrics indicate that the disk should continue to function. Self-tests can be run on a drive with smartctl -t <test_name> <drive>. See smartctl -h for more details.
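
As a rough illustration of the self-test workflow, the Python sketch below builds the smartctl commands for starting a test and for reading the self-test log afterwards (smartctl -l selftest). The helper names are my own, and the one-minute wait is taken from the recommended polling time for the short test on this particular drive; other drives will report different times.

```python
import subprocess
import time

# Test names accepted by `smartctl -t`. The drive must also support the test;
# the drive above reports "No Conveyance Self-test supported".
TEST_TYPES = {"offline", "short", "long", "conveyance"}

def selftest_command(device, test="short", smartctl="smartctl"):
    """Build the command that starts a self-test on the given device."""
    if test not in TEST_TYPES:
        raise ValueError("unknown self-test type: %s" % test)
    return [smartctl, "-t", test, device]

def selftest_log_command(device, smartctl="smartctl"):
    """Build the command that prints the self-test log."""
    return [smartctl, "-l", "selftest", device]

def run_selftest(device, test="short", wait_minutes=1):
    """Start a test, wait for the polling time, then return the log text.

    Assumes smartctl is on the PATH and the script has sufficient privileges.
    """
    subprocess.run(selftest_command(device, test), check=False)
    time.sleep(wait_minutes * 60)
    log = subprocess.run(selftest_log_command(device),
                         capture_output=True, text=True)
    return log.stdout

# Show the command without running it.
print(" ".join(selftest_command("/dev/sda")))
```

The extended (long) test on this drive has a recommended polling time of 163 minutes, so a script would normally start it and check the log later rather than sleep.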

If your drive is starting to fail or becomes unbootable, it may be necessary to rescue the files from the failing hard drive.

See Also
How To Rescue Files From a Damaged System
Windows Crash Dump Analysis
Identifying Cooling Issues
Troubleshooting Memory Errors
Stress Testing a CPU To Detect Hardware Failure 
Stress Testing a Video Card










4 comments:

  1. It's like you read my mind! You appear to know a lot about this, like you wrote the book on it or something. I think that you could do with some pics to drive the message home a bit, but other than that, this is a magnificent blog. An excellent read. I'll definitely be back.
  2. I recently bought a new ASUS A53E-ES92 notebook, and sometimes I hear a clicking sound inside.

    Can it be failing?

    Ian.

  3. Thanks for your summary. I'm posting an example of the values from my failing drive, so your blog visitors get an idea of how it would show. I had time to offload the data. Check the Spin_Up_Time values and the remark on the right, "FAILING_NOW".

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       5
      2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
      3 Spin_Up_Time            0x0023   007   007   025    Pre-fail  Always   FAILING_NOW 28395
      4 Start_Stop_Count        0x0032   092   092   000    Old_age   Always       -       8487
      5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
      8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10232
     10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       2
     12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3975
    191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       66
    192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0002   064   050   000    Old_age   Always       -       31 (Min/Max 12/51)
    195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
    196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       261
    223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       2
    225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       8508
