Kai Gallasch
2014-Nov-05 23:32 UTC
10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
Hi.
Not sure if this is 10.1 related or more a problem with the SSD
model and/or AHCI controller.
I am currently running 10.1 RC4 r273903 on a zfs on root server with two
mirror pools. One of the pools is a mirror consisting of two Samsung
SSD 850 PRO 512GB SSDs.
When I start a zfs scrub on this pool the result of the scrub is:
# zpool status -v ssdpool
  pool: ssdpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 73K in 0h8m with 0 errors on Thu Nov  6 00:00:16 2014
config:

        NAME              STATE     READ WRITE CKSUM
        ssdpool           ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            gpt/ssdpool0  ONLINE       0     0    17
            gpt/ssdpool1  ONLINE       0     0    29

When I do a 'zpool clear' the pool status looks OK again. But when I
start another zpool scrub, the same thing happens and the above status
"One or more devices has experienced an unrecoverable error" shows up
again.

I find the following kernel messages in the output of 'dmesg' (after
running zpool scrub two times):

ahcich2: Timeout on slot 15 port 0
ahcich2: is 00000000 cs 000f0000 ss 000f8000 rs 000f8000 tfd 40 serr 00000000 cmd 0024cf17
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 8b a6 1d 56 40 0d 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
ahcich2: Timeout on slot 23 port 0
ahcich2: is 00000000 cs 0f000000 ss 0f800000 rs 0f800000 tfd 40 serr 00000000 cmd 0024d817
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 1b 23 81 bc 40 06 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
ahcich2: Timeout on slot 3 port 0
ahcich2: is 00000000 cs 00000030 ss 00000038 rs 00000038 tfd 40 serr 00000000 cmd 0024c317
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 26 bd 18 8e 40 12 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
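
The cs/ss/rs values in the register dumps above are 32-bit masks with one bit per command slot, so they can be turned into slot lists to see which commands were outstanding at each timeout. The following is only a sketch: the field meanings (cs = port Command Issue register, ss = port SATA Active register, rs = slots the driver considers running) are an assumption based on a reading of the FreeBSD ahci driver, and the helper names `slots` and `decode` are made up for illustration.

```python
import re

# Assumed field meanings (not stated in the thread): cs = PxCI, ss = PxSACT,
# rs = slots the driver still tracks as running.
DUMP = re.compile(r"\bcs ([0-9a-f]{8}) ss ([0-9a-f]{8}) rs ([0-9a-f]{8})")


def slots(mask):
    """Return the slot numbers whose bits are set in a 32-bit mask."""
    return [bit for bit in range(32) if mask >> bit & 1]


def decode(line):
    """Map the cs, ss and rs fields of one register dump line to slot lists."""
    match = DUMP.search(line)
    if match is None:
        raise ValueError("no cs/ss/rs fields found in line")
    cs, ss, rs = (int(field, 16) for field in match.groups())
    return {"cs": slots(cs), "ss": slots(ss), "rs": slots(rs)}


if __name__ == "__main__":
    # Register dump lines copied from the dmesg output above.
    dumps = [
        "ahcich2: is 00000000 cs 000f0000 ss 000f8000 rs 000f8000 tfd 40 serr 00000000 cmd 0024cf17",
        "ahcich2: is 00000000 cs 0f000000 ss 0f800000 rs 0f800000 tfd 40 serr 00000000 cmd 0024d817",
        "ahcich2: is 00000000 cs 00000030 ss 00000038 rs 00000038 tfd 40 serr 00000000 cmd 0024c317",
    ]
    for dump in dumps:
        print(decode(dump))
```

In all three dumps the slot named in the "Timeout on slot" line (15, 23 and 3) is set in ss and rs but no longer in cs, and four further slots were queued behind it. Under the assumed field meanings, that would suggest the command had been handed to the drive and was still pending there when the timer expired.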

Besides, smartctl shows no errors on ada2.
Here is the output:

# smartctl -a -q noserial /dev/ada2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 10.1-RC4 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM01B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov  6 00:02:04 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  33) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       154
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       5
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   070   068   000    Old_age   Always       -       30
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       400466433

SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error         00%              147  -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I wonder what the possible reason for this is. Both SSDs are new.
Is this a common problem with ZFS and SSDs (for example, AHCI timeouts
because the data rates are too high for the bus)?

K.
--
PGP-KeyID = 0xE401B671927D4A5C
Steven Hartland
2014-Nov-05 23:44 UTC
10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
Looks like a HW issue. How is it connected to the controller, and what is
the controller?
Trond Endrestøl
2014-Nov-06 08:20 UTC
10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
On Thu, 6 Nov 2014 00:32+0100, Kai Gallasch wrote:

> I find the following kernel message in the output of 'dmesg': (after
> running zpool scrub two times)
>
> ahcich2: Timeout on slot 15 port 0
> ahcich2: is 00000000 cs 000f0000 ss 000f8000 rs 000f8000 tfd 40 serr 00000000 cmd 0024cf17
> (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 8b a6 1d 56 40 0d 00 00 00 00 00
> (ada2:ahcich2:0:0:0): CAM status: Command timeout
> (ada2:ahcich2:0:0:0): Retrying command
> ahcich2: Timeout on slot 23 port 0
> ahcich2: is 00000000 cs 0f000000 ss 0f800000 rs 0f800000 tfd 40 serr 00000000 cmd 0024d817
> (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 1b 23 81 bc 40 06 00 00 00 00 00
> (ada2:ahcich2:0:0:0): CAM status: Command timeout
> (ada2:ahcich2:0:0:0): Retrying command
> ahcich2: Timeout on slot 3 port 0
> ahcich2: is 00000000 cs 00000030 ss 00000038 rs 00000038 tfd 40 serr 00000000 cmd 0024c317
> (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 26 bd 18 8e 40 12 00 00 00 00 00
> (ada2:ahcich2:0:0:0): CAM status: Command timeout
> (ada2:ahcich2:0:0:0): Retrying command

Incidentally, I've recently seen similar messages in my base/head VMs
running in VirtualBox on a Win7Pro host at home. These messages appeared
after I upgraded VBox to 4.3.18 on both host and guests, and after
upgrading base/head to roughly r273963, last Sunday.

I haven't touched my base/head VMs since Sunday, but I may find time to
investigate both base/head and base/stable/10 later this evening, and
see if they all exhibit the same symptoms as yours.

All my FreeBSD VMs use VirtualBox's SATA controller, except stable/8,
which only works correctly with VirtualBox's SCSI controller.

-- 
+-------------------------------+------------------------------------+
| Vennlig hilsen,               | Best regards,                      |
| Trond Endrestøl,              | Trond Endrestøl,                   |
| IT-ansvarlig,                 | System administrator,              |
| Fagskolen Innlandet,          | Gjøvik Technical College, Norway,  |
| tlf. mob. 952 62 567,         | Cellular...: +47 952 62 567,       |
| sentralbord 61 14 54 00.      | Switchboard: +47 61 14 54 00.      |
+-------------------------------+------------------------------------+