Summer 2013: File Server HDD Failure

The HDD had been remounted read-only again.

First, get it working again.
Previous post: File Server HDD Failure Handling 2

I have done this many times, so most of the output is omitted.

# service smb stop
Shutting down SMB services: [ OK ]
Shutting down NMB services: [ OK ]
# umount /dev/sdb
# fsck -fy /dev/sdb
# reboot

What this does:

Stop smb
Unmount the disk
Repair the filesystem with fsck
Reboot the machine
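
The four steps above can be sketched as a small script. This is only an illustration under my own assumptions: the `DEVICE` and `DRY_RUN` variables and the `run` and `recover` helpers are hypothetical names, and by default nothing is executed, the commands are only printed.

```sh
#!/bin/sh
# Sketch of the recovery sequence. DEVICE and DRY_RUN are illustrative
# settings; with DRY_RUN=1 (the default) each command is printed, not run.
DEVICE="${DEVICE:-/dev/sdb}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover() {
    # Stop Samba so nothing keeps the share busy.
    run service smb stop || return 1
    # fsck must not run on a mounted filesystem, so stop if umount fails.
    run umount "$DEVICE" || return 1
    # Force a full check and answer yes to every repair prompt.
    run fsck -fy "$DEVICE"
    fsck_status=$?
    # fsck exits with 1 or 2 after a successful repair; 4 and above mean
    # errors remain or the run itself failed.
    [ "$fsck_status" -lt 4 ] || return "$fsck_status"
    # Reboot so the filesystem comes back mounted read-write.
    run reboot
}
```

Calling `recover` as-is prints the four commands in order. Setting `DRY_RUN=0` would actually run them as root, reboot included.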

I had never run a SMART check on this drive, so let's try one.
smartmontools was not installed yet, so that comes first.

# yum -y install smartmontools

Once it is installed, run a check.

# smartctl -t short /dev/sdb
(omitted)
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12835         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It reports a read failure...
But it does not say where the bad sector is.
Let's try the long test instead of the short one.

# smartctl -t long /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 255 minutes for test to complete.
Test will complete after Sun Jul 21 20:13:42 2013

Use smartctl -X to abort test.

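That is more than four hours (255 minutes).
I don't want to wait that long, so I'll reformat the disk first and try again afterwards.
Apparently fdisk is no good for a 3 TB drive.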
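
As far as I understand it, the fdisk remark comes down to arithmetic: a classic MBR partition table stores sector addresses in 32 bits, so with 512-byte logical sectors it cannot describe more than 2 TiB. A minimal sketch of that calculation follows; the variable names are mine, and the sector count is the one the kernel reports for this drive in the log further down.

```sh
# 32-bit sector addressing with 512-byte sectors versus this drive's size.
SECTOR_SIZE=512
MBR_MAX_SECTORS=4294967296   # 2^32 addressable sectors
DRIVE_SECTORS=5860533168     # "5860533168 512-byte hdwr sectors" in the log

mbr_limit_bytes=$((MBR_MAX_SECTORS * SECTOR_SIZE))
drive_bytes=$((DRIVE_SECTORS * SECTOR_SIZE))

echo "MBR limit : $mbr_limit_bytes bytes"
echo "This drive: $drive_bytes bytes"
if [ "$drive_bytes" -gt "$mbr_limit_bytes" ]; then
    echo "The drive is larger than an MBR table can address."
fi
```

The limit works out to 2,199,023,255,552 bytes against the drive's 3,000,592,982,016 bytes, which is why a GPT label is usually recommended for disks this size. In this post the filesystem is created directly on /dev/sdb, so no partition table is involved at all.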

# umount /dev/sdb
# mkfs.ext3 /dev/sdb
(omitted)
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 35 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

Format complete.
Now set a label and try mounting.

# e2label /dev/sdb 00main
# mount -a
mount: special device LABEL=00main does not exist

The mount failed. It says the label does not exist.

# e2label /dev/sdb


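Sure enough, no label is set.
Strange.
Running it again, or using a different label name, makes no difference.
I'm out of ideas, so I'll reboot.
But even after rebooting and redoing everything, the label still does not stick.

Looking at the log...

# tail -65 /var/log/messages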
Jul 21 18:00:51 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:00:51 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:00:51 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:00:51 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:00:51 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:00:51 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:00:51 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:00:51 localhost kernel: ata1: EH complete
Jul 21 18:00:54 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:00:54 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:00:54 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:00:54 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:00:54 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:00:54 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:00:54 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:00:54 localhost kernel: ata1: EH complete
Jul 21 18:00:57 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:00:57 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:00:57 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:00:57 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:00:57 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:00:57 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:00:57 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:00:57 localhost kernel: ata1: EH complete
Jul 21 18:01:00 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:01:00 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:01:00 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:01:00 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:01:00 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:01:00 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:01:00 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:01:00 localhost kernel: ata1: EH complete
Jul 21 18:01:03 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:01:03 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:01:03 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:01:03 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:01:03 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:01:03 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:01:03 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:01:03 localhost kernel: ata1: EH complete
Jul 21 18:01:06 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:01:06 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:01:06 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:01:06 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:01:06 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:01:06 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:01:06 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:01:06 localhost kernel: sd 1:0:0:0: Unhandled sense code
Jul 21 18:01:06 localhost kernel: sd 1:0:0:0: SCSI error: return code = 0x08000002
Jul 21 18:01:06 localhost kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jul 21 18:01:06 localhost kernel: sdb: Current [descriptor]: sense key: Medium Error
Jul 21 18:01:06 localhost kernel:     Add. Sense: Unrecovered read error - auto reallocate failed
Jul 21 18:01:06 localhost kernel:
Jul 21 18:01:06 localhost kernel: Descriptor sense data with sense descriptors (in hex):
Jul 21 18:01:06 localhost kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jul 21 18:01:06 localhost kernel:         22 2c 00 00
Jul 21 18:01:06 localhost kernel: ata1: EH complete
Jul 21 18:01:06 localhost kernel: sdb : very big device. try to use READ CAPACITY(16).
Jul 21 18:01:06 localhost kernel: SCSI device sdb: 5860533168 512-byte hdwr sectors (3000593 MB)
Jul 21 18:01:06 localhost kernel: sdb: Write Protect is off
Jul 21 18:01:06 localhost kernel: SCSI device sdb: drive cache: write back
Jul 21 18:01:06 localhost kernel: sdb : very big device. try to use READ CAPACITY(16).
Jul 21 18:01:06 localhost kernel: SCSI device sdb: 5860533168 512-byte hdwr sectors (3000593 MB)
Jul 21 18:01:06 localhost kernel: sdb: Write Protect is off
Jul 21 18:01:06 localhost kernel: SCSI device sdb: drive cache: write back
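
The repeated `cmd 25/00:08:00:00:2c/00:00:22:00:00/e0` lines do say which sector is failing, if the taskfile bytes are decoded. Below is a rough sketch, assuming the usual libata field order of `CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV`; the function name `decode_ata_lba` is hypothetical.

```sh
# Rebuild the 48-bit LBA from a libata "cmd" taskfile string.
decode_ata_lba() (
    # Split "25/00:08:00:00:2c/00:00:22:00:00/e0" into twelve hex bytes.
    set -- $(printf '%s\n' "$1" | tr '/:' '  ')
    lbal=$4; lbam=$5; lbah=$6
    hob_lbal=$9; hob_lbam=${10}; hob_lbah=${11}
    # Low 24 bits come from the current bytes, high 24 bits from the HOB bytes.
    low=$(( (0x$lbah << 16) | (0x$lbam << 8) | 0x$lbal ))
    high=$(( (0x$hob_lbah << 16) | (0x$hob_lbam << 8) | 0x$hob_lbal ))
    echo $(( (high << 24) | low ))
)

lba=$(decode_ata_lba "25/00:08:00:00:2c/00:00:22:00:00/e0")
echo "failing LBA: $lba"
echo "byte offset: $((lba * 512))"
```

For the line in this log the result is LBA 573308928 (0x222c0000), the same bytes that appear as `22 2c 00 00` in the sense data. Command 25 is READ DMA EXT and the sector count is 08, so the kernel keeps retrying one 4096-byte read at that address.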

The log is full of errors.
I can't make much sense of it, so I'll run SMART again.
Just the short test for now.
Wait a little, then check the results.

# smartctl -t short /dev/sdb
# smartctl -a /dev/sdb

smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-9YN166
Serial Number:    W1F08ARK
LU WWN Device Id: 5 000c50 04516c837
Firmware Version: CC47
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Jul 21 18:20:37 2013 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   095   073   006    Pre-fail  Always       -       170508513
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1109
  5 Reallocated_Sector_Ct   0x0033   094   094   036    Pre-fail  Always       -       8248
  7 Seek_Error_Rate         0x000f   052   048   030    Pre-fail  Always       -       455286117952
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12837
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       93
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   082   082   099    Old_age   Always   FAILING_NOW 18
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       16679
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   037   045    Old_age   Always   In_the_past 44 (23 23 45 44 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1088
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       449817
194 Temperature_Celsius     0x0022   044   063   000    Old_age   Always       -       44 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   087   087   000    Old_age   Always       -       2168
198 Offline_Uncorrectable   0x0010   087   087   000    Old_age   Offline      -       2168
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       191194764155146
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       869669508604
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       16158619122266

SMART Error Log Version: 1
ATA Error Count: 16591 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 16591 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:26.012  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:26.011  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:26.008  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:26.004  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:25.981  READ NATIVE MAX ADDRESS EXT

Error 16590 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:23.044  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:23.043  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:23.040  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:23.037  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:23.012  READ NATIVE MAX ADDRESS EXT

Error 16589 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:20.077  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:20.077  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:20.073  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:20.070  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:20.049  READ NATIVE MAX ADDRESS EXT

Error 16588 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:17.119  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:17.118  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:17.115  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:17.111  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:17.090  READ NATIVE MAX ADDRESS EXT

Error 16587 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:14.161  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:14.160  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:14.156  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:14.152  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:14.128  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12837         573308928
# 2  Extended offline    Completed: read failure       90%     12835         -
# 3  Short offline       Completed: read failure       90%     12835         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
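
With this much output, the self-test rows are easy to miss. Here is a small sketch that keeps only the failed self-tests and shows the power-on hours and the first failing LBA for each; `failed_selftests` is a hypothetical helper name, and it assumes the column layout shown in the log above.

```sh
# Keep only self-test rows that report a failure.
failed_selftests() {
    awk '
        # Self-test rows start with "#" and a test number; the last two
        # columns are LifeTime(hours) and LBA_of_first_error.
        /^# *[0-9]+ / && /failure/ {
            printf "hours=%s lba=%s\n", $(NF - 1), $NF
        }
    '
}
```

It reads from standard input, so something like `smartctl -l selftest /dev/sdb | failed_selftests` should work. A `-` in the LBA column is passed through unchanged.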

There are errors, all right.

198 Offline_Uncorrectable   0x0010   087   087   000    Old_age   Offline      -       2168

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12837         573308928

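These two.
Seeing this right after a fresh format means the drive has almost certainly failed physically.
I think I bought it only about two years ago, so it may still be under warranty...
Time to go look for the warranty card...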
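
For next time, the attribute table can be boiled down to the worrying rows. A minimal sketch follows, with loud assumptions: the watch list of IDs 5, 187, 197 and 198 is my own choice rather than an official rule, `flag_smart_attrs` is a hypothetical name, and the column positions are taken from the table above.

```sh
# Print attributes with a non-zero raw value on the watch list, plus any
# attribute whose WHEN_FAILED column is not "-".
flag_smart_attrs() {
    awk '
        # Attribute rows have a numeric ID in column 1 and a hex FLAG in column 3.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-f]+$/ {
            watched = ($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198)
            if (watched && $10 + 0 > 0)
                printf "%s %s raw=%s\n", $1, $2, $10
            else if ($9 != "-")
                printf "%s %s %s\n", $1, $2, $9
        }
    '
}
```

Fed with `smartctl -A /dev/sdb | flag_smart_attrs`, the table in this post would list Reallocated_Sector_Ct, Reported_Uncorrect, Current_Pending_Sector and Offline_Uncorrectable with their raw counts, and also End-to-End_Error as FAILING_NOW and Airflow_Temperature_Cel as In_the_past.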
