The HDD has become unreadable again.
Time for the usual recovery routine.
First, fsck.
Memo: File server HDD failure, summer 2013
# service smb stop
SMB サービスを停止中: [ OK ]
NMB サービスを停止中: [ OK ]
# fsck -fy /dev/sdb
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/sdb is mounted.

WARNING!!! Running e2fsck on a mounted filesystem may cause
SEVERE filesystem damage.

Do you really want to continue (y/n)? yes

Error reading block 1027 (Attempt to read block from filesystem resulted in short read). Ignore error? yes
Force rewrite? yes
Resize inode not valid. Recreate? yes
Pass 1: Checking inodes, blocks, and sizes
A huge number of errors come out. This is bad.
↓ This message appears over and over.
It looks as if it is being printed for every single block.
Error reading block 1028 (Attempt to read block from filesystem resulted in short read) while doing inode scan. Ignore error? yes
Force rewrite? yes
Error reading block 1029 (Attempt to read block from filesystem resulted in short read) while doing inode scan. Ignore error? yes
Force rewrite? yes
(omitted)
Error reading block 732495873 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes
Force rewrite? yes
Error reading block 732528640 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes
Force rewrite? yes
Error reading block 732528641 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes
Force rewrite? yes
Error reading block 732561408 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes
Force rewrite? yes
Error reading block 732561409 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes
Force rewrite? yes
Warning... fsck.ext3 for device /dev/sdb exited with signal 11.
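It finished after about an hour and a half.
Judging by the volume of output, it feels like it repaired almost every block.
The log shows the errors are not reported for every consecutive block, so it was not literally all of them, but the log file still came to 1.72 GB (1,849,350,069 bytes).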
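To see how much of the disk those messages actually cover, the log can be collapsed into runs of consecutive block numbers. The following is a minimal sketch, not something that was run at the time: the name `failed_block_ranges` and the shortened sample lines are made up for illustration, and only the `Error reading block N` wording is taken from the output above.

```python
import re
from typing import Iterable, List, Tuple

# Matches the e2fsck message wording seen in the log above.
BLOCK_RE = re.compile(r"Error reading block (\d+)")


def failed_block_ranges(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Collapse reported block numbers into inclusive (start, end) runs."""
    # finditer copes with several messages ending up on one physical line.
    blocks = sorted({int(m.group(1)) for line in lines for m in BLOCK_RE.finditer(line)})
    ranges: List[Tuple[int, int]] = []
    for block in blocks:
        if ranges and block == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], block)  # extend the current run
        else:
            ranges.append((block, block))  # start a new run
    return ranges


# Shortened, illustrative lines in the same shape as the real log.
sample = [
    "Error reading block 1028 (...) while doing inode scan. Ignore error? yes",
    "Error reading block 1029 (...) while doing inode scan. Ignore error? yes",
    "Error reading block 732528640 (...) while reading inode and block bitmaps.",
    "Error reading block 732528641 (...) while reading inode and block bitmaps.",
]
print(failed_block_ranges(sample))
```

For a real log, passing the open file object as `lines` avoids loading all 1.72 GB as text at once, although the set of block numbers is still held in memory.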
Well, it did finish, so let's look into what the Warning on the last line means.
fsck doesn't work on a partition (signal 11)
QUESTION
Signal 11, what does that mean?
ANSWER
Signal 11, or officially know as "segmentation fault", means that the program
accessed a memory location that was not assigned. That's usually a bug in the
program. So if you're writing your own program, that's the most likely cause.
However, this FAQ will concentrate on the possibilities besides that.
So apparently e2fsck hit a bug and touched memory that had never been assigned to it...
Hmm.
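Signal 11 is SIGSEGV on Linux. As a small sketch of how that would show up if fsck were driven from a script: Python's `subprocess` module reports a child killed by signal N as return code -N on POSIX systems. `describe_returncode` is a hypothetical helper name, not part of any library.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess-style return code (negative means killed by a signal)."""
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


print(describe_returncode(-11))
```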
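Digging further sounds like a hassle, so I just run it again.
It seems to have started over from the very beginning, though.
So I wait.
While waiting, two questions, or rather possible improvements, came to mind.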
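- Rather than using the whole 3 TB as a single volume, wouldn't it be better to split it up a bit?
(given how often it hits errors and needs recovery...)
- Is there any point in running a file server on Linux?
(I stopped running the SVN server, so it now only serves files)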
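Especially the latter: it costs electricity, and if I can do something rsync-like on Windows, do I even need a file server any more?
I hardly use the Macbook now and only verrrry occasionally copy data onto the nexus7, so USB should be enough...
So next time I will switch to simply attaching the HDDs to Windows. For now I wait for fsck to finish.
Once it finished I looked at the log: the same error at the same place.
I realized I had been running fsck with the filesystem still mounted, so I unmount it and try again.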
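A check like the one below would have caught the mounted filesystem before fsck started. This is only a sketch under assumptions: it compares the device name literally against the first column of `/proc/mounts`, so a filesystem mounted under another name (a label or UUID alias, for example) would be missed, and the mount point in the sample is a guess for illustration.

```python
from typing import Iterable, List


def mount_points(device: str, mounts_lines: Iterable[str]) -> List[str]:
    """Return the mount points listed for `device` in /proc/mounts-style lines."""
    points = []
    for line in mounts_lines:
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points


def safe_to_fsck(device: str, mounts_path: str = "/proc/mounts") -> bool:
    """True when `device` does not appear as mounted."""
    with open(mounts_path) as f:
        return not mount_points(device, f)


# Illustrative sample; the mount point is hypothetical.
sample = [
    "/dev/sda2 / ext3 rw 0 0",
    "/dev/sdb /var/storage/00/main ext3 rw 0 0",
]
print(mount_points("/dev/sdb", sample))
```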
# umount /dev/sdb
# fsck -fy /dev/sdb
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
fsck.ext2: Attempt to read block from filesystem resulted in short read while trying to open /dev/sdb
Could this be a zero-length partition?
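Things are getting seriously bad now.
I suspect the superblock is corrupted.
Apparently the superblock has backup copies, so I try specifying one.
e2fsck
Apparently the location of the backup superblock depends on the filesystem's block size.
e2fsck - System administration command description - Linux command list
The location of the backup superblock depends on the filesystem's block size. For a filesystem with a 1k block size, a backup superblock is at block 8193; with a 2k block size it is at 16384, and with a 4k block size at 32768.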
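Those three numbers follow from the filesystem layout, and the later backups can be worked out the same way. The sketch below assumes mke2fs defaults: 8 × block size blocks per group, a first data block of 1 for 1k blocks and 0 otherwise, and the `sparse_super` feature (backups in group 1 and in groups that are powers of 3, 5 or 7). On a filesystem created with other settings the numbers will differ, and the function name is made up for illustration.

```python
from typing import List


def _is_power_of(n: int, base: int) -> bool:
    """True if n == base**k for some k >= 0 (n must be positive)."""
    while n % base == 0:
        n //= base
    return n == 1


def backup_superblocks(block_size: int, block_count: int, limit: int = 8) -> List[int]:
    """First `limit` backup superblock locations under the assumptions above."""
    blocks_per_group = 8 * block_size  # one block of bitmap bits per group
    first_data_block = 1 if block_size == 1024 else 0
    groups = -(-(block_count - first_data_block) // blocks_per_group)  # ceiling division
    result: List[int] = []
    for group in range(1, groups):
        if group == 1 or any(_is_power_of(group, b) for b in (3, 5, 7)):
            result.append(first_data_block + group * blocks_per_group)
            if len(result) == limit:
                break
    return result


# 732566646 is the block count of a 3 TB volume formatted with 4k blocks.
print(backup_superblocks(4096, 732566646))
```

If the first backup is unreadable too, any later entry in the list can be passed to `-b` in the same way.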
Let's find out the block size.
The target drive is dead, so I check another drive that was formatted the same way.
# dumpe2fs /dev/sdc
dumpe2fs 1.39 (29-May-2006)
Filesystem volume name:   00mirror
Last mounted on:
Filesystem UUID:          64ad8c36-92d2-41f3-914d-9eccbd11f26e
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              366297088
Block count:              732566646
Reserved block count:     36628332
Free blocks:              365925600
Free inodes:              366024015
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      849
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Mon Jan 30 21:21:06 2012
Last mount time:          Sat Nov 9 04:20:24 2013
Last write time:          Sat Nov 9 04:20:24 2013
Mount count:              48
Maximum mount count:      33
Last checked:             Mon Jan 30 21:21:06 2012
Check interval:           15552000 (6 months)
Next check after:         Sat Jul 28 21:21:06 2012
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      f9daf461-1700-45e3-b271-47903e96a632
Journal backup:           inode blocks
Journal size:             128M
It appears to be 4096.
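Rather than reading the value off by eye, the `Key: value` lines of the dumpe2fs header can be parsed. A minimal sketch, with a hypothetical function name and a sample trimmed down from the output above:

```python
from typing import Dict


def parse_dumpe2fs_header(text: str) -> Dict[str, str]:
    """Turn 'Key:   value' lines of dumpe2fs output into a dict of strings."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # Split at the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


sample = """dumpe2fs 1.39 (29-May-2006)
Filesystem volume name:   00mirror
Block count:              732566646
Block size:               4096
Blocks per group:         32768
Filesystem created:       Mon Jan 30 21:21:06 2012
"""
header = parse_dumpe2fs_header(sample)
print(int(header["Block size"]), int(header["Blocks per group"]))
```

Running `dumpe2fs -h` limits the output to this superblock header, which keeps the text to parse short.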
For 4k the backup is supposedly at 32768, so I run fsck with that specified.
# fsck -fy -b 32768 /dev/sdb
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
fsck.ext2: Attempt to read block from filesystem resulted in short read while trying to open /dev/sdb
Could this be a zero-length partition?
No good!!
Maybe the drive is not being detected at all.
Let's check.
# fdisk -l

Disk /dev/sda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

デバイス Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14         242     1839442+  83  Linux
/dev/sda3             243         258      128520   82  Linux swap / Solaris

Disk /dev/sdc: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sdc は正常な領域テーブルを含んでいません

Disk /dev/sdd: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sdd は正常な領域テーブルを含んでいません

Disk /dev/sde: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sde は正常な領域テーブルを含んでいません

Disk /dev/sdf: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sdf は正常な領域テーブルを含んでいません

Disk /dev/sdg: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sdg は正常な領域テーブルを含んでいません
sdb is missing.
# ls /dev MAKEDEV full loop1 md0 parport1 ram1 ram4 root sde sg6 tty1 tty18 tty26 tty34 tty42 tty50 tty59 ttyS0 usbdev2.1_ep00 usbdev5.2_ep81 vcsa2 X0R hidraw0 loop2 mem parport2 ram10 ram5 rtc sdf shm tty10 tty19 tty27 tty35 tty43 tty51 tty6 ttyS1 usbdev2.1_ep81 usbdev5.2_ep82 vcsa3 bus hidraw1 loop3 net parport3 ram11 ram6 sda sdg snapshot tty11 tty2 tty28 tty36 tty44 tty52 tty60 ttyS2 usbdev3.1_ep00 vcs vcsa4 console hpet loop4 network_latency port ram12 ram7 sda1 sg0 stderr tty12 tty20 tty29 tty37 tty45 tty53 tty61 ttyS3 usbdev3.1_ep81 vcs2 vcsa5 core initctl loop5 network_throughput ppp ram13 ram8 sda2 sg1 stdin tty13 tty21 tty3 tty38 tty46 tty54 tty62 urandom usbdev4.1_ep00 vcs3 vcsa6 cpu input loop6 null ptmx ram14 ram9 sda3 sg2 stdout tty14 tty22 tty30 tty39 tty47 tty55 tty63 usbdev1.1_ep81 usbdev4.1_ep81 vcs4 zero cpu_dma_latency kmsg loop7 nvram pts ram15 ramdisk sdb sg3 systty tty15 tty23 tty31 tty4 tty48 tty56 tty7 usbdev1.2_ep00 usbdev5.1_ep00 vcs5 disk log mapper oldmem ram ram2 random sdc sg4 tty tty16 tty24 tty32 tty40 tty49 tty57 tty8 usbdev1.2_ep02 usbdev5.1_ep81 vcs6 fd loop0 mcelog parport0 ram0 ram3 rawctl sdd sg5 tty0 tty17 tty25 tty33 tty41 tty5 tty58 tty9 usbdev1.2_ep81 usbdev5.2_ep00 vcsa
Even though it is there in /dev...
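A node under /dev only means a device file exists; whether the kernel currently sees the disk can be read from `/proc/partitions`. The sketch below compares the two. The function names and sample data are made up for illustration (the block counts are simply the byte sizes reported by fdisk divided by 1024), and it only looks at whole-disk names of the form `sdX`.

```python
import re
from typing import Iterable, List, Set

DISK_RE = re.compile(r"^sd[a-z]+$")  # whole disks only: sdb, not sdb1


def kernel_block_devices(partitions_lines: Iterable[str]) -> Set[str]:
    """Device names from /proc/partitions-style lines (major minor #blocks name)."""
    names = set()
    for line in partitions_lines:
        fields = line.split()
        if len(fields) == 4 and fields[0].isdigit():  # skips header and blank line
            names.add(fields[3])
    return names


def stale_disk_nodes(dev_entries: Iterable[str], kernel_names: Set[str]) -> List[str]:
    """Disk nodes present in /dev that the kernel does not list."""
    return sorted(n for n in dev_entries if DISK_RE.match(n) and n not in kernel_names)


# Illustrative data, not captured from the server.
sample_partitions = [
    "major minor  #blocks  name",
    "",
    "   8     0  312571224 sda",
    "   8     1     104391 sda1",
    "   8    32 2930266584 sdc",
]
sample_dev = ["sda", "sda1", "sdb", "sdc", "tty0"]
print(stale_disk_nodes(sample_dev, kernel_block_devices(sample_partitions)))
```

On a live system the inputs would be `open("/proc/partitions")` and `os.listdir("/dev")`.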
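I no longer understand what is going on, so I reboot.
(Image: boot-time message that a device could not be found)
(Image: error message that 00main could not be found)
The drive is now no longer detected at the hardware level.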
# fdisk -l

Disk /dev/sda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

デバイス Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14         242     1839442+  83  Linux
/dev/sda3             243         258      128520   82  Linux swap / Solaris

Disk /dev/sdb: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sdb は正常な領域テーブルを含んでいません

Disk /dev/sdc: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sdc は正常な領域テーブルを含んでいません

Disk /dev/sdd: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sdd は正常な領域テーブルを含んでいません

Disk /dev/sde: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sde は正常な領域テーブルを含んでいません

Disk /dev/sdf: 3000.5 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = シリンダ数 of 16065 * 512 = 8225280 bytes

ディスク /dev/sdf は正常な領域テーブルを含んでいません
What used to be sdb is gone, the remaining drives have each shifted up by one letter, and sdg no longer exists.
# cd /dev [root@localhost dev]# ls MAKEDEV hpet loop7 parport1 ram12 ramdisk sde stdout tty17 tty28 tty39 tty5 tty60 usbdev1.1_ep81 usbdev5.2_ep00 vcsa4 X0R initctl mapper parport2 ram13 random sdf systty tty18 tty29 tty4 tty50 tty61 usbdev1.2_ep00 usbdev5.2_ep81 vcsa5 bus input mcelog parport3 ram14 rawctl sg0 tty tty19 tty3 tty40 tty51 tty62 usbdev1.2_ep02 usbdev5.2_ep82 vcsa6 console kmsg md0 port ram15 root sg1 tty0 tty2 tty30 tty41 tty52 tty63 usbdev1.2_ep81 vcs zero core log mem ppp ram2 rtc sg2 tty1 tty20 tty31 tty42 tty53 tty7 usbdev2.1_ep00 vcs2 cpu loop0 net ptmx ram3 sda sg3 tty10 tty21 tty32 tty43 tty54 tty8 usbdev2.1_ep81 vcs3 cpu_dma_latency loop1 network_latency pts ram4 sda1 sg4 tty11 tty22 tty33 tty44 tty55 tty9 usbdev3.1_ep00 vcs4 disk loop2 network_throughput ram ram5 sda2 sg5 tty12 tty23 tty34 tty45 tty56 ttyS0 usbdev3.1_ep81 vcs5 fd loop3 null ram0 ram6 sda3 shm tty13 tty24 tty35 tty46 tty57 ttyS1 usbdev4.1_ep00 vcs6 full loop4 nvram ram1 ram7 sdb snapshot tty14 tty25 tty36 tty47 tty58 ttyS2 usbdev4.1_ep81 vcsa hidraw0 loop5 oldmem ram10 ram8 sdc stderr tty15 tty26 tty37 tty48 tty59 ttyS3 usbdev5.1_ep00 vcsa2 hidraw1 loop6 parport0 ram11 ram9 sdd stdin tty16 tty27 tty38 tty49 tty6 urandom usbdev5.1_ep81 vcsa3
It is not here either.
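Because the `sdX` letters shift when a drive drops out, it helps to refer to disks by a name that stays the same. One option is the symlinks udev usually creates under `/dev/disk/by-id`; whether that directory exists on this particular system is an assumption. The sketch below lists such links and where they point. To stay runnable anywhere, the demo builds a throwaway directory, and the link name in it is invented.

```python
import os
import tempfile
from typing import Dict


def stable_names(by_id_dir: str = "/dev/disk/by-id") -> Dict[str, str]:
    """Map each symlink name in `by_id_dir` to the device path it resolves to."""
    mapping: Dict[str, str] = {}
    for name in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, name)
        if os.path.islink(path):
            mapping[name] = os.path.realpath(path)
    return mapping


# Demo on a temporary directory; "ata-EXAMPLE_MODEL_SERIAL" is a made-up
# name, not one taken from this server.
demo_root = tempfile.mkdtemp()
open(os.path.join(demo_root, "sdb"), "w").close()
demo_ids = os.path.join(demo_root, "by-id")
os.mkdir(demo_ids)
os.symlink(os.path.join("..", "sdb"), os.path.join(demo_ids, "ata-EXAMPLE_MODEL_SERIAL"))
print(stable_names(demo_ids))
```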
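Maybe it is not being detected even at the BIOS level.
Let's check.
(Image: the BIOS detects six drives)
Looks like it is not detected...
I let the machine boot as it was so that I could shut it down properly, and this time the drive was detected.
A mystery.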
Before it vanishes again, I run rsync right away.
Scary.
Running with the delete option would be far too dangerous, so I run it by hand without it.
# rsync -av /var/storage/00/main/data /var/storage/00/mirror >> /var/storage/00/main/scripts/log/sync_storage_201311101129.log
# rsync -av /var/storage/01/main/data /var/storage/01/mirror >> /var/storage/00/main/scripts/log/sync_storage_201311101134.log
# rsync -av /var/storage/02/main/data /var/storage/02/mirror >> /var/storage/00/main/scripts/log/sync_storage_201311101135.log
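When the source disk is in doubt, a dry run first shows what rsync would transfer (or, with `--delete`, remove) before anything is touched. A minimal sketch of a command builder that defaults to the cautious settings; the function name and its defaults are my own choice, while `-a`, `-v`, `--dry-run` and `--delete` are standard rsync options.

```python
import shlex
from typing import List


def rsync_command(src: str, dst: str, dry_run: bool = True, delete: bool = False) -> List[str]:
    """Build an rsync argument list; dry run on and --delete off unless asked."""
    cmd = ["rsync", "-av"]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without changing it
    if delete:
        cmd.append("--delete")  # removes files in dst that are missing in src
    cmd += [src, dst]
    return cmd


cmd = rsync_command("/var/storage/00/main/data", "/var/storage/00/mirror")
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the dry-run output looks right and `dry_run=False` is passed.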
That finished without problems, so I run fsck for good measure.
# fsck -fy /dev/sdb
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/sdb is mounted.

WARNING!!! Running e2fsck on a mounted filesystem may cause
SEVERE filesystem damage.

Do you really want to continue (y/n)? yes

00main: recovering journal
Pass 1: Checking inodes, blocks, and sizes
(omitted)
Pass 5: Checking group summary information

00main: ***** FILE SYSTEM WAS MODIFIED *****
00main: 273184/366297088 files (3.0% non-contiguous), 366711110/732566646 blocks
It completed normally.
Now for SMART.
# smartctl -t short /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sun Nov 10 12:51:35 2013

Use smartctl -X to abort test.
Then check the result.
# smartctl -a /dev/sdb smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: TOSHIBA DT01ACA300 Serial Number: 53HV57EGS LU WWN Device Id: 5 000039 ff4cbe551 Firmware Version: MX6OABB0 User Capacity: 3,000,592,982,016 bytes [3.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 4 Local Time is: Sun Nov 10 12:58:09 2013 JST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x80) Offline data collection activity was never started. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (22652) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 255) minutes. SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. 
SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0 2 Throughput_Performance 0x0005 140 140 054 Pre-fail Offline - 68 3 Spin_Up_Time 0x0007 253 253 024 Pre-fail Always - 163 (Average 169) 4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 31 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0 8 Seek_Time_Performance 0x0005 121 121 020 Pre-fail Offline - 34 9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 2680 10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 13 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 42 193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 42 194 Temperature_Celsius 0x0002 150 150 000 Old_age Always - 40 (Min/Max 25/44) 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 2 SMART Error Log Version: 1 ATA Error Count: 2 CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX] DH = Device/Head Register [HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 2 occurred at disk power-on lifetime: 2671 hours (111 days + 7 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 10 28 ab 01 00 Error: ICRC, ABRT at LBA = 0x0001ab28 = 109352 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ca 00 40 f8 aa 01 e0 08 23:57:42.093 WRITE DMA 27 00 00 00 00 00 e0 08 23:57:42.093 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 08 23:57:42.090 IDENTIFY DEVICE ef 03 46 00 00 00 a0 08 23:57:42.086 SET FEATURES [Set transfer mode] 27 00 00 00 00 00 e0 08 23:57:42.086 READ NATIVE MAX ADDRESS EXT Error 1 occurred at disk power-on lifetime: 2671 hours (111 days + 7 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 20 18 ab 01 00 Error: ICRC, ABRT at LBA = 0x0001ab18 = 109336 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ca 00 40 f8 aa 01 e0 08 23:57:41.930 WRITE DMA c8 00 08 08 91 01 e0 08 23:57:41.927 READ DMA c8 00 08 b0 50 00 e0 08 23:57:41.908 READ DMA b0 d5 01 09 4f c2 00 08 23:57:38.802 SMART READ LOG b0 d5 01 06 4f c2 00 08 23:57:38.562 SMART READ LOG SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 2679 - # 2 Short offline Completed without error 00% 2679 - # 3 Short offline Completed without error 00% 2679 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. 
If Selective self-test is pending on power-up, resume after 0 minute delay.
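A few errors are logged, though nothing looks serious. Looking closely, however, the model says TOSHIBA.
This is the information for a different drive.
It may be a side effect of the drive not having been detected at the previous boot.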
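Since the device letters cannot be trusted here, comparing the model and serial number from `smartctl -a` against the expected drive is a quick sanity check. A sketch with hypothetical helper names; the sample lines are the two identity fields from the output above.

```python
from typing import Dict

IDENTITY_KEYS = ("Device Model", "Serial Number")


def drive_identity(smartctl_output: str) -> Dict[str, str]:
    """Pick the model and serial number out of `smartctl -a` text."""
    identity: Dict[str, str] = {}
    for line in smartctl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in IDENTITY_KEYS:
            identity[key.strip()] = value.strip()
    return identity


def is_expected_drive(smartctl_output: str, expected_serial: str) -> bool:
    """True when the reported serial number matches the one expected."""
    return drive_identity(smartctl_output).get("Serial Number") == expected_serial


sample = """=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA DT01ACA300
Serial Number:    53HV57EGS
"""
print(drive_identity(sample))
```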
While I am at it, let's look at the other drives too.
# smartctl -a /dev/sdc smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: ST3000DM001-9YN166 Serial Number: W1F06H8R LU WWN Device Id: 5 000c50 044de7db0 Firmware Version: CC47 User Capacity: 3,000,592,982,016 bytes [3.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 4 Local Time is: Sun Nov 10 13:03:52 2013 JST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED See vendor-specific Attribute list for marginal Attributes. General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 584) seconds. Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x3085) SCT Status supported. 
SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46780064 3 Spin_Up_Time 0x0003 094 093 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 120 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 046 039 030 Pre-fail Always - 1060868547439 9 Power_On_Hours 0x0032 083 083 000 Old_age Always - 15487 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 117 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 062 040 045 Old_age Always In_the_past 38 (20 35 39 38 0) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 92 193 Load_Cycle_Count 0x0032 047 047 000 Old_age Always - 107462 194 Temperature_Celsius 0x0022 038 060 000 Old_age Always - 38 (0 17 0 0 0) 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 88772679047248 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 6796113683641 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 943814641006 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 15486 - SMART Selective self-test log data 
structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
The Serial Number is different, so this one is not it either.
On to the next disk.
# smartctl -a /dev/sdd smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Model Family: Western Digital Caviar Green (Adv. Format) Device Model: WDC WD30EZRX-00MMMB0 Serial Number: WD-WMAWZ0384322 LU WWN Device Id: 5 0014ee 206f9e16d Firmware Version: 80.00A80 User Capacity: 3,000,592,982,016 bytes [3.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Sun Nov 10 13:05:51 2013 JST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (49380) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. 
SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 200 151 021 Pre-fail Always - 6983 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 120 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 11450 10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 118 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 94 193 Load_Cycle_Count 0x0032 179 179 000 Old_age Always - 65417 194 Temperature_Celsius 0x0022 117 093 000 Old_age Always - 35 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. 
# smartctl -a /dev/sde smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Model Family: Western Digital Caviar Green (Adv. Format) Device Model: WDC WD30EZRX-00MMMB0 Serial Number: WD-WMAWZ0381888 LU WWN Device Id: 5 0014ee 25c4f0a2e Firmware Version: 80.00A80 User Capacity: 3,000,592,982,016 bytes [3.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Sun Nov 10 13:07:17 2013 JST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (51000) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. 
SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 202 150 021 Pre-fail Always - 6858 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 120 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 11456 10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 118 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 93 193 Load_Cycle_Count 0x0032 181 181 000 Old_age Always - 59698 194 Temperature_Celsius 0x0022 119 087 000 Old_age Always - 33 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. 
# smartctl -a /dev/sdf
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format)
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ2949188
LU WWN Device Id: 5 0014ee 20758c64b
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Nov 10 13:07:52 2013 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (51000) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   156   021    Pre-fail  Always       -       7075
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       86
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9624
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       86
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       64
193 Load_Cycle_Count        0x0032   185   185   000    Old_age   Always       -       46253
194 Temperature_Celsius     0x0022   118   097   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       6
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
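The three WD disks were very clean. (/dev/sdf does show a UDMA_CRC_Error_Count of 6, but that counter tracks transfer errors on the SATA link, so it points at the cable or connector rather than the disk surface.)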
# smartctl -a /dev/sdg
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F10XY2
LU WWN Device Id: 5 000c50 04e207527
Firmware Version: CC43
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Nov 10 13:08:42 2013 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       121869464
  3 Spin_Up_Time            0x0003   097   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       137
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   064   061   030    Pre-fail  Always       -       2725211
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9638
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       80
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   049   045    Old_age   Always       -       35 (Min/Max 34/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       118
193 Load_Cycle_Count        0x0032   080   080   000    Old_age   Always       -       40017
194 Temperature_Celsius     0x0022   035   051   000    Old_age   Always       -       35 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       92505005621977
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       9201817624
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       30659811275
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
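This last one, /dev/sdg, is the disk in question.
But it isn't showing any errors: Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable and Reported_Uncorrect are all 0, and the SMART error log is empty.
So, as far as SMART can tell, this was not a physical failure.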
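The way I read the /dev/sdg table can be written down as a small filter. This is only a rough sketch: the function name check_smart_counters and the choice of attribute IDs (5, 187, 197, 198, 199) are my own assumptions about which counters should normally stay at 0, not anything smartctl defines. It assumes the usual attribute-row layout, with the raw value starting in column 10.

```shell
#!/bin/sh
# Sketch: flag SMART counters that should normally stay at 0.
# Expects attribute rows in the smartctl column layout on stdin:
# ID# in column 1, ATTRIBUTE_NAME in column 2, RAW_VALUE from column 10.
check_smart_counters() {
    awk '
        # 5, 187, 197, 198, 199: an assumed shortlist of "should be zero" counters.
        $1 ~ /^(5|187|197|198|199)$/ {
            status = ($10 + 0 == 0) ? "OK" : "CHECK"
            printf "%s %s raw=%s\n", status, $2, $10
        }
    '
}

# Rows copied from the /dev/sdg output above.
check_smart_counters <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
EOF
```

All five rows come back as OK for this disk. On a live system the same function should work as `smartctl -A /dev/sdg | check_smart_counters`, run as root.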
It looks like I can keep using it.
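One caveat before trusting that: the output says "No self-tests have been logged", so the disk has never actually been asked to test itself. A minimal sketch of how that could be done with the `-t` and `-l selftest` options of smartctl follows. The function names and the SMARTCTL variable are hypothetical helpers of mine (SMARTCTL only exists so the functions can be dry-run with `SMARTCTL=echo`); they are not part of smartmontools.

```shell
#!/bin/sh
# Sketch: start a SMART self-test and read back the self-test log.
# SMARTCTL is a hypothetical override, not a smartctl feature; it lets
# the functions be dry-run with SMARTCTL=echo.
start_selftest() {
    # $1 = device, $2 = test type (short, long or conveyance)
    "${SMARTCTL:-smartctl}" -t "${2:-short}" "$1"
}

show_selftest_log() {
    # $1 = device
    "${SMARTCTL:-smartctl}" -l selftest "$1"
}

# Intended use, as root:
#   start_selftest /dev/sdg short
#   (wait at least the recommended polling time)
#   show_selftest_log /dev/sdg
```

For /dev/sdg the recommended polling times in the output above are 1 minute for the short test and 255 minutes for the extended one, so the short test is the cheap first step.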
Even so, I'm going to retire the Linux file server.
More on that next time.