The HDD had been mounted read-only again.
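First, get it working again.
Previous post: File Server HDD Failure Handling 2
I have done this many times, so I'll skip the logs.
# service smb stop
Shutting down SMB services: [ OK ]
Shutting down NMB services: [ OK ]
# umount /dev/sdb
# fsck -fy /dev/sdb
# reboot
What this does:
stop smb
unmount the disk
repair with fsck
reboot the machine
I had never done it before, but this time I'll also check SMART.
smartmontools wasn't installed yet, so I start by installing it.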
yum -y install smartmontools
Once it's installed, run a check.
# smartctl -t short /dev/sdb
(omitted)
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12835         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
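It reports a read failure...
But the LBA_of_first_error column is empty, so I can't tell where the bad sectors are.
Let me try the long test instead of the short one.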
# smartctl -t long /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 255 minutes for test to complete.
Test will complete after Sun Jul 21 20:13:42 2013

Use smartctl -X to abort test.
That is more than four hours (255 minutes).
I don't want to wait that long, so I'll format the disk first and then try again.
Apparently fdisk is no good for a 3TB drive.
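The post does not say why fdisk is unsuitable, but the usual reason is that an MBR partition table stores sector counts in 32 bits, so with 512-byte logical sectors it tops out at 2 TiB. The sketch below only does that arithmetic; the function names `mbr_limit_bytes` and `fits_in_mbr` are made up for illustration, and the disk size is the "User Capacity" that smartctl reports for this drive.

```shell
# Largest size an MBR partition entry can describe: 2^32 sectors.
# The sector size defaults to 512 bytes (this drive's logical sector size).
mbr_limit_bytes() {
    sector_size=${1:-512}
    echo $(( 4294967296 * sector_size ))
}

# Succeeds (exit status 0) if a disk of $1 bytes fits within the MBR limit.
fits_in_mbr() {
    [ "$1" -le "$(mbr_limit_bytes "${2:-512}")" ]
}

disk_bytes=3000592982016   # 3.00 TB, as reported by smartctl -a
echo "MBR limit: $(mbr_limit_bytes) bytes"
if fits_in_mbr "$disk_bytes"; then
    echo "an MBR partition table is enough"
else
    echo "too large for MBR: use a GPT-capable tool such as parted, or put the filesystem on the whole device"
fi
```

For this drive the second branch is taken, which is consistent with running mkfs directly on /dev/sdb instead of partitioning it with fdisk.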
# umount /dev/sdb
# mkfs.ext3 /dev/sdb
(omitted)
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 35 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
Formatting is done.
Now set a label and try mounting.
# e2label /dev/sdb 00main
# mount -a
mount: special device LABEL=00main does not exist
The mount failed. It says the label does not exist.
# e2label /dev/sdb
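Sure enough, no label is set.
Strange.
Trying again, or using a different label name, makes no difference: the label never sticks.
I have no idea why, so I'll reboot.
Even after rebooting and redoing it, the label still doesn't stick.
Looking at the logs...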
# tail -65 /var/log/messages
Jul 21 18:00:51 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:00:51 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:00:51 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:00:51 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:00:51 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:00:51 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:00:51 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:00:51 localhost kernel: ata1: EH complete
Jul 21 18:00:54 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:00:54 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:00:54 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:00:54 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:00:54 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:00:54 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:00:54 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:00:54 localhost kernel: ata1: EH complete
Jul 21 18:00:57 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:00:57 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:00:57 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:00:57 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:00:57 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:00:57 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:00:57 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:00:57 localhost kernel: ata1: EH complete
Jul 21 18:01:00 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:01:00 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:01:00 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:01:00 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:01:00 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:01:00 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:01:00 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:01:00 localhost kernel: ata1: EH complete
Jul 21 18:01:03 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:01:03 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:01:03 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:01:03 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:01:03 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:01:03 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:01:03 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:01:03 localhost kernel: ata1: EH complete
Jul 21 18:01:06 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 21 18:01:06 localhost kernel: ata1.00: BMDMA stat 0x65
Jul 21 18:01:06 localhost kernel: ata1.00: cmd 25/00:08:00:00:2c/00:00:22:00:00/e0 tag 0 dma 4096 in
Jul 21 18:01:06 localhost kernel:          res 51/40:00:00:00:2c/40:00:22:00:00/00 Emask 0x9 (media error)
Jul 21 18:01:06 localhost kernel: ata1.00: status: { DRDY ERR }
Jul 21 18:01:06 localhost kernel: ata1.00: error: { UNC }
Jul 21 18:01:06 localhost kernel: ata1.00: configured for UDMA/133
Jul 21 18:01:06 localhost kernel: sd 1:0:0:0: Unhandled sense code
Jul 21 18:01:06 localhost kernel: sd 1:0:0:0: SCSI error: return code = 0x08000002
Jul 21 18:01:06 localhost kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jul 21 18:01:06 localhost kernel: sdb: Current [descriptor]: sense key: Medium Error
Jul 21 18:01:06 localhost kernel:     Add. Sense: Unrecovered read error - auto reallocate failed
Jul 21 18:01:06 localhost kernel:
Jul 21 18:01:06 localhost kernel: Descriptor sense data with sense descriptors (in hex):
Jul 21 18:01:06 localhost kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jul 21 18:01:06 localhost kernel:         22 2c 00 00
Jul 21 18:01:06 localhost kernel: ata1: EH complete
Jul 21 18:01:06 localhost kernel: sdb : very big device. try to use READ CAPACITY(16).
Jul 21 18:01:06 localhost kernel: SCSI device sdb: 5860533168 512-byte hdwr sectors (3000593 MB)
Jul 21 18:01:06 localhost kernel: sdb: Write Protect is off
Jul 21 18:01:06 localhost kernel: SCSI device sdb: drive cache: write back
Jul 21 18:01:06 localhost kernel: sdb : very big device. try to use READ CAPACITY(16).
Jul 21 18:01:06 localhost kernel: SCSI device sdb: 5860533168 512-byte hdwr sectors (3000593 MB)
Jul 21 18:01:06 localhost kernel: sdb: Write Protect is off
Jul 21 18:01:06 localhost kernel: SCSI device sdb: drive cache: write back
The log is full of errors.
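One thing these kernel lines do reveal is which sector the drive keeps failing on. As a rough sketch, assuming the usual libata layout of the `cmd` line (command/feature:count:lbal:lbam:lbah/hob_feature:hob_count:hob_lbal:hob_lbam:hob_lbah/device), the 48-bit LBA can be rebuilt by joining the address bytes from most to least significant. The helper name `lba_from_hex_bytes` is hypothetical.

```shell
# Join LBA48 address bytes (most significant first, two hex digits each)
# and print the result as a decimal sector number.
lba_from_hex_bytes() {
    hex=""
    for byte in "$@"; do
        hex="${hex}${byte}"
    done
    printf '%d\n' "0x${hex}"
}

# From "cmd 25/00:08:00:00:2c/00:00:22:00:00/e0", in the order
# hob_lbah hob_lbam hob_lbal lbah lbam lbal:
lba_from_hex_bytes 00 00 22 2c 00 00   # prints 573308928
```

The same bytes, 22 2c 00 00, appear at the end of the descriptor sense data, and 573308928 is the LBA_of_first_error that the SMART self-test log reports later in this post. The count field of 08 sectors at 512 bytes each also matches the "dma 4096 in" on the same line.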
I can't make much sense of it, so I'll check SMART again.
Just the short test for now.
Wait a while, then look at the results.
# smartctl -t short /dev/sdb
# smartctl -a /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-9YN166
Serial Number:    W1F08ARK
LU WWN Device Id: 5 000c50 04516c837
Firmware Version: CC47
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Jul 21 18:20:37 2013 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 ( 584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   095   073   006    Pre-fail  Always       -       170508513
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1109
  5 Reallocated_Sector_Ct   0x0033   094   094   036    Pre-fail  Always       -       8248
  7 Seek_Error_Rate         0x000f   052   048   030    Pre-fail  Always       -       455286117952
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12837
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       93
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   082   082   099    Old_age   Always   FAILING_NOW 18
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       16679
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   037   045    Old_age   Always   In_the_past 44 (23 23 45 44 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1088
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       449817
194 Temperature_Celsius     0x0022   044   063   000    Old_age   Always       -       44 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   087   087   000    Old_age   Always       -       2168
198 Offline_Uncorrectable   0x0010   087   087   000    Old_age   Offline      -       2168
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       191194764155146
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       869669508604
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       16158619122266

SMART Error Log Version: 1
ATA Error Count: 16591 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 16591 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:26.012  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:26.011  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:26.008  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:26.004  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:25.981  READ NATIVE MAX ADDRESS EXT

Error 16590 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:23.044  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:23.043  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:23.040  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:23.037  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:23.012  READ NATIVE MAX ADDRESS EXT

Error 16589 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:20.077  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:20.077  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:20.073  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:20.070  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:20.049  READ NATIVE MAX ADDRESS EXT

Error 16588 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:17.119  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:17.118  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:17.115  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:17.111  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:17.090  READ NATIVE MAX ADDRESS EXT

Error 16587 occurred at disk power-on lifetime: 12837 hours (534 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:14:14.161  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:14:14.160  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:14:14.156  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:14:14.152  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:14:14.128  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12837         573308928
# 2  Extended offline    Completed: read failure       90%     12835         -
# 3  Short offline       Completed: read failure       90%     12835         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
There are errors, all right.
198 Offline_Uncorrectable 0x0010 087 087 000 Old_age Offline - 2168
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12837         573308928
These two.
Seeing errors like these right after a fresh format means this is almost certainly a physical failure.
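To pick the sector-health rows out of that long attribute table, a small filter can help. This is only a sketch: watching attributes 5, 197 and 198 is a common rule of thumb and my own choice here, not something the drive vendor specifies, and `bad_sector_attrs` is a hypothetical helper name. It assumes the ten-column layout that `smartctl -A` printed above, with the raw value in column 10.

```shell
# Read smartctl attribute rows on stdin and print ID, name and raw value
# for attributes 5, 197 and 198 whenever the raw value is non-zero.
bad_sector_attrs() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Sample rows copied from the output above. On a live system this would be:
#   smartctl -A /dev/sdb | bad_sector_attrs
bad_sector_attrs <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   094   094   036    Pre-fail  Always       -       8248
197 Current_Pending_Sector  0x0012   087   087   000    Old_age   Always       -       2168
198 Offline_Uncorrectable   0x0010   087   087   000    Old_age   Offline      -       2168
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
EOF
```

With the sample rows this prints the three non-zero counters (8248 reallocated, 2168 pending, 2168 offline-uncorrectable) and skips the CRC row, whose raw value is 0.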
I think I bought it only about two years ago, so it may still be under warranty...
Time to go look for the warranty card, I guess...