happymaster23
2009-Nov-01 18:34 UTC
[CentOS] Disk error (kicking from MD) during smartctl -t short
Hello,

I started a short SMART self-test (smartctl -t short) on a disk that is a member of RAID1, but during the test the disk was kicked out of the RAID (from only one of its three MD arrays).

/var/log/messages:

Nov  1 16:45:45 server kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov  1 16:45:45 server kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Nov  1 16:45:45 server kernel:          res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
Nov  1 16:45:45 server kernel: ata1.00: status: { DRDY ERR }
Nov  1 16:45:45 server kernel: ata1.00: error: { ABRT }
Nov  1 16:45:45 server kernel: ata1.00: configured for UDMA/133
Nov  1 16:45:45 server kernel: ata1: EH complete
Nov  1 16:45:45 server kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Nov  1 16:45:45 server kernel: sda: Write Protect is off
Nov  1 16:45:45 server kernel: SCSI device sda: drive cache: write back
Nov  1 16:45:52 server kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov  1 16:45:52 server kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Nov  1 16:45:52 server kernel:          res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
Nov  1 16:45:52 server kernel: ata1.00: status: { DRDY ERR }
Nov  1 16:45:52 server kernel: ata1.00: error: { ABRT }
Nov  1 16:45:52 server kernel: ata1.00: configured for UDMA/133
Nov  1 16:45:52 server kernel: ata1: EH complete
Nov  1 16:45:52 server kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Nov  1 16:45:52 server kernel: sda: Write Protect is off
Nov  1 16:45:52 server kernel: SCSI device sda: drive cache: write back
Nov  1 16:47:43 server kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov  1 16:47:43 server kernel: ata1.00: BMDMA stat 0x25
Nov  1 16:47:43 server kernel: ata1.00: cmd ca/00:08:1f:41:1e/00:00:00:00:00/e1 tag 0 dma 4096 out
Nov  1 16:47:43 server kernel:          res 51/10:08:1f:41:1e/00:00:00:00:00/e1 Emask 0x81 (invalid argument)
Nov  1 16:47:43 server kernel: ata1.00: status: { DRDY ERR }
Nov  1 16:47:43 server kernel: ata1.00: error: { IDNF }
Nov  1 16:47:43 server kernel: ata1.00: configured for UDMA/133
Nov  1 16:47:43 server kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Nov  1 16:47:43 server kernel: sda: Current [descriptor]: sense key: Aborted Command
Nov  1 16:47:43 server kernel:     Add. Sense: Recorded entity not found
Nov  1 16:47:43 server kernel:
Nov  1 16:47:44 server kernel: Descriptor sense data with sense descriptors (in hex):
Nov  1 16:47:44 server kernel:         72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Nov  1 16:47:44 server kernel:         01 1e 41 1f
Nov  1 16:47:44 server kernel: end_request: I/O error, dev sda, sector 18759967
Nov  1 16:47:44 server kernel: raid1: Disk failure on sda1, disabling device.
Nov  1 16:47:44 server kernel: Operation continuing on 1 devices
Nov  1 16:47:44 server kernel: ata1: EH complete
Nov  1 16:47:44 server kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Nov  1 16:47:44 server kernel: sda: Write Protect is off
Nov  1 16:47:44 server kernel: SCSI device sda: drive cache: write back
Nov  1 16:47:44 server kernel: RAID1 conf printout:
Nov  1 16:47:44 server kernel:  --- wd:1 rd:2
Nov  1 16:47:44 server kernel:  disk 0, wo:1, o:0, dev:sda1
Nov  1 16:47:44 server kernel:  disk 1, wo:0, o:1, dev:sdb1
Nov  1 16:47:44 server kernel: RAID1 conf printout:
Nov  1 16:47:44 server kernel:  --- wd:1 rd:2
Nov  1 16:47:44 server kernel:  disk 1, wo:0, o:1, dev:sdb1

And the output of smartctl --all:

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD3201ABYS-01B9A0
Serial Number:
Firmware Version: 13.01C02
User Capacity:    320 072 933 376 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Nov 1 19:26:47 2009 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (8400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 100) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   156   156   021    Pre-fail  Always       -       5183
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       82
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       14329
 10 Spin_Retry_Count        0x0012   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       82
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       58
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       82
194 Temperature_Celsius     0x0022   123   106   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 14329 hours (597 days + 1 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 a7 4a 1e e1  Error: IDNF at LBA = 0x011e4aa7 = 18762407

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 a7 4a 1e e1 0a  45d+04:44:26.835  WRITE DMA
  ca 00 08 3f 14 00 e2 0a  45d+04:44:26.816  WRITE DMA

Error 4 occurred at disk power-on lifetime: 14326 hours (596 days + 22 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 1f 41 1e e1  Error: IDNF at LBA = 0x011e411f = 18759967

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 1f 41 1e e1 0a  45d+02:30:44.518  WRITE DMA
  ca 00 08 3f 14 00 e2 0a  45d+02:30:44.504  WRITE DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     14329         -
# 2  Short offline       Completed without error       00%     14329         -
# 3  Extended offline    Completed without error       00%     14328         -
# 4  Short offline       Completed without error       00%     14326         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The device was kicked from MD during short tests #4 and #2. The second HDD (the same model as the problematic one) shows no errors.

Thank you for your help.
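
As a cross-check of the numbers above, the LBA that smartctl prints for each IDNF error can be recomputed from the SN/CL/CH/DH register values in the error log. This is only a minimal sketch, assuming the usual 28-bit LBA layout (low nibble of DH holds bits 24-27, then CH, CL, SN); the function name is made up for illustration. The result for Error 4 is the same sector number that the kernel reported in "end_request: I/O error, dev sda, sector 18759967".

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA taskfile registers into a 28-bit LBA.

    Assumed layout: DH[3:0] = LBA bits 27..24, CH = bits 23..16,
    CL = bits 15..8, SN = bits 7..0.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the SMART error log (SN, CL, CH, DH).
err5 = lba28_from_registers(0xA7, 0x4A, 0x1E, 0xE1)
err4 = lba28_from_registers(0x1F, 0x41, 0x1E, 0xE1)

print("Error 5 LBA:", hex(err5), err5)
print("Error 4 LBA:", hex(err4), err4)
```

Both failing commands are WRITE DMA requests to nearby LBAs, issued while the drive was running a self-test.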