thr3ads.net - freebsd stable - 9211 (LSI/SAS) issues on 11.2-STABLE [Feb 2019]

If this information is useful, please help other people find it:
Share via:

Karl Denninger

2019-Feb-05 15:22 UTC

9211 (LSI/SAS) issues on 11.2-STABLE

On 2/2/2019 12:02, Karl Denninger wrote:> I recently started having some really oddball things? happening under
> stress.? This coincided with the machine being updated to 11.2-STABLE
> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>
> Specifically, I get "errors" like this:
>
> ??????? (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> length 131072 SMID 269 Aborting command 0xfffffe0001179110
> mps0: Sending reset from mpssas_send_abort for target ID 37
> ??????? (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state
> c xfer 0
> ??????? (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state
> c xfer 0
> mps0: Unfreezing devq for target ID 37
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: Command timeout
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: SCSI Status Error
> (da12:mps0:0:37:0): SCSI status: Check Condition
> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
> reset, or bus device reset occurred)
> (da12:mps0:0:37:0): Retrying command (per sense data)
>
> The "Unit Attention" implies the drive reset.? It only occurs on
certain
> drives under very heavy load (e.g. a scrub.)? I've managed to provoke
it
> on two different brands of disk across multiple firmware and capacities,
> however, which tends to point away from a drive firmware problem.
>
> A look at the pool data shows /no /errors (e.g. no checksum problems,
> etc) and a look at the disk itself (using smartctl) shows no problems
> either -- whatever is going on here the adapter is recovering from it
> without any data corruption or loss registered on *either end*!
>
> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>
> mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
>
1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
After considerable additional work this looks increasingly like either a
missed interrupt or a command is getting lost between the host adapter
and the expander.

I'm going to turn the driver debug level up and see if I can capture
more information..... whatever is behind this, however, it is
almost-certainly related to something that changed between 11.1 and
11.2, as I never saw these on the 11.1-STABLE build.

--
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4897 bytes
Desc: S/MIME Cryptographic Signature
URL:
<http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20190205/c3b5a665/attachment.bin>

Karl Denninger

2019-Feb-05 16:26 UTC

head link

9211 (LSI/SAS) issues on 11.2-STABLE

On 2/5/2019 09:22, Karl Denninger wrote:> On 2/2/2019 12:02, Karl Denninger wrote:
>> I recently started having some really oddball things? happening under
>> stress.? This coincided with the machine being updated to 11.2-STABLE
>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>>
>> Specifically, I get "errors" like this:
>>
>> ??????? (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00
00
>> length 131072 SMID 269 Aborting command 0xfffffe0001179110
>> mps0: Sending reset from mpssas_send_abort for target ID 37
>> ??????? (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00
00
>> length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0
state
>> c xfer 0
>> ??????? (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00
00
>> length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0
state
>> c xfer 0
>> mps0: Unfreezing devq for target ID 37
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: Command timeout
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: SCSI Status Error
>> (da12:mps0:0:37:0): SCSI status: Check Condition
>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
>> reset, or bus device reset occurred)
>> (da12:mps0:0:37:0): Retrying command (per sense data)
>>
>> The "Unit Attention" implies the drive reset.? It only occurs
on certain
>> drives under very heavy load (e.g. a scrub.)? I've managed to
provoke it
>> on two different brands of disk across multiple firmware and
capacities,
>> however, which tends to point away from a drive firmware problem.
>>
>> A look at the pool data shows /no /errors (e.g. no checksum problems,
>> etc) and a look at the disk itself (using smartctl) shows no problems
>> either -- whatever is going on here the adapter is recovering from it
>> without any data corruption or loss registered on *either end*!
>>
>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and
shows:
>>
>> mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem
>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on
pci3
>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>>
1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
> After considerable additional work this looks increasingly like either a
> missed interrupt or a command is getting lost between the host adapter
> and the expander.
>
> I'm going to turn the driver debug level up and see if I can capture
> more information..... whatever is behind this, however, it is
> almost-certainly related to something that changed between 11.1 and
> 11.2, as I never saw these on the 11.1-STABLE build.
>
> --
> Karl Denninger
> karl at denninger.net <mailto:karl at denninger.net>
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/Pretty decent trace here -- any ideas?

mps0: timedout cm 0xfffffe00011b5020 allocated tm 0xfffffe00011812a0
??????? (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 Aborting command 0xfffffe00011b5020
mps0: Sending reset from mpssas_send_abort for target ID 37
mps0: queued timedout cm 0xfffffe00011c2760 for processing by tm
0xfffffe00011812a0
mps0: queued timedout cm 0xfffffe00011a74f0 for processing by tm
0xfffffe00011812a0
mps0: queued timedout cm 0xfffffe00011cfd50 for processing by tm
0xfffffe00011812a0
mps0: EventReply??????? :
??????? EventDataLength: 2
??????? AckRequired: 0
??????? Event: SasDiscovery (0x16)
??????? EventContext: 0x0
??????? Flags: 1<InProgress>
??????? ReasonCode: Discovery Started
??????? PhysicalPort: 0
??????? DiscoveryStatus: 0
mps0: (0)->(mpssas_fw_work) Working on? Event: [16]
mps0: (1)->(mpssas_fw_work) Event Free: [16]
??????? (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00
length 131072 SMID 961 completed timedout cm 0xfffffe00011cfd50 ccb
0xfffff8019458e000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 length
131072 SMID 961 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
??????? (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 completed timedout cm
0xfffffe00011b5(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00
01 00 00
020 ccb 0xfffff80056fb5000 during recovery ioc 8048 scsi 0 state
c(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
??????? (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
length 131072 SMID 798 completed timedout cm 0xfffffe00011c2760 ccb
0xfffff80054e86000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00 length
131072 SMID 798 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
??????? (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
length 131072 SMID 467 completed timedout cm 0xfffffe00011a74f0 ccb
0xfffff8011efe3000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 length
131072 SMID 467 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
??????? (xpt0:mps0:0:37:0): SMID 2 abort TaskMID 634 status 0x0 code 0x0
count 4
??????? (xpt0:mps0:0:37:0): SMID 2 finished recovery after aborting
TaskMID 634
mps0: Unfreezing devq for target ID 37
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: Command timeout
(da11:mps0:0:37:0): Retrying command
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
mps0: EventReply??????? :
??????? EventDataLength: 4
??????? AckRequired: 0
??????? Event: SasTopologyChangeList (0x1c)
??????? EventContext: 0x0
??????? EnclosureHandle: 0x2
??????? ExpanderDevHandle: 0x9
??????? NumPhys: 25
??????? NumEntries: 1
??????? StartPhyNum: 12
??????? ExpStatus: Responding (0x3)
??????? PhysicalPort: 0
??????? PHY[12].AttachedDevHandle: 0x0013
??????? PHY[12].LinkRate: 6.0Gbps (0xaa)
??????? PHY[12].PhyStatus: PHYLinkStatusChange
mps0: (1)->(mpssas_fw_work) Working on? Event: [1c]
mps0: (2)->(mpssas_fw_work) Event Free: [1c]
mps0: EventReply??????? :
??????? EventDataLength: 2
??????? AckRequired: 0
??????? Event: SasDiscovery (0x16)
??????? EventContext: 0x0
??????? Flags: 0
??????? ReasonCode: Discovery Complete
??????? PhysicalPort: 0
??????? DiscoveryStatus: 0
mps0: (2)->(mpssas_fw_work) Working on? Event: [16]
mps0: (3)->(mpssas_fw_work) Event Free: [16]
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: SCSI Status Error
(da11:mps0:0:37:0): SCSI status: Check Condition
(da11:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
reset, or bus device reset occurred)
(da11:mps0:0:37:0): Retrying command (per sense data)

This is what's running right now that produced that -- note no errors
reported on the two providers (the degraded state is intentional; the
third unit is off-site in a bank vault):

[karl at NewFS /var/log]$ zpool status backup
? pool: backup
?state: DEGRADED
status: One or more devices has been taken offline by the administrator.
??????? Sufficient replicas exist for the pool to continue functioning in a
??????? degraded state.
action: Online the device using 'zpool online' or replace the device
with
??????? 'zpool replace'.
? scan: scrub in progress since Tue Feb? 5 06:57:14 2019
??????? 840G scanned at 200M/s, 517G issued at 123M/s, 3.36T total
??????? 0 repaired, 15.04% done, 0 days 06:44:17 to go
config:

??????? NAME????????????????????? STATE???? READ WRITE CKSUM
??????? backup??????????????????? DEGRADED???? 0???? 0???? 0
????????? mirror-0??????????????? DEGRADED???? 0???? 0???? 0
??????????? gpt/backup61.eli????? ONLINE?????? 0???? 0???? 0
??????????? gpt/backup62-1.eli??? ONLINE?????? 0???? 0???? 0
??????????? 13282812295755460479? OFFLINE????? 0???? 0???? 0? was
/dev/gpt/backup62-2.eli

errors: No known data errors

But, here's the SMART data on the disk -- it DOES show the reset:

Before....

Device Statistics (GP Log 0x04)
Page? Offset Size??????? Value Flags Description
0x01? =====? =?????????????? =? ===? == General Statistics (rev 2) =0x01? 0x008?
4????????????? 19? ---? Lifetime Power-On Resets
0x01? 0x018? 6???? 32002138534? ---? Logical Sectors Written
0x01? 0x020? 6?????? 155895325? ---? Number of Write Commands
0x01? 0x028? 6??? 265457016500? ---? Logical Sectors Read
0x01? 0x030? 6????? 1132601490? ---? Number of Read Commands
0x01? 0x038? 6???? 61537515300? ---? Date and Time TimeStamp
0x03? =====? =?????????????? =? ===? == Rotating Media Statistics (rev 1) =0x03?
0x008? 4??????????? 6114? ---? Spindle Motor Power-on Hours
0x03? 0x010? 4??????????? 6114? ---? Head Flying Hours
0x03? 0x018? 4?????????? 30602? ---? Head Load Events
0x03? 0x020? 4?????????????? 0? ---? Number of Reallocated Logical Sectors
0x03? 0x028? 4????????? 283735? ---? Read Recovery Attempts
0x03? 0x030? 4?????????????? 0? ---? Number of Mechanical Start Failures
0x04? =====? =?????????????? =? ===? == General Errors Statistics (rev 1) =0x04?
0x008? 4?????????????? 0? ---? Number of Reported Uncorrectable Errors
0x04? 0x010? 4????????????? 17? ---? Resets Between Cmd Acceptance and
Completio
n
0x05? =====? =?????????????? =? ===? == Temperature Statistics (rev 1) =0x05?
0x008? 1????????????? 38? ---? Current Temperature
0x05? 0x010? 1????????????? 40? N--? Average Short Term Temperature
0x05? 0x018? 1????????????? 32? N--? Average Long Term Temperature
0x05? 0x020? 1????????????? 52? ---? Highest Temperature
0x05? 0x028? 1????????????? 23? ---? Lowest Temperature
0x05? 0x030? 1????????????? 48? N--? Highest Average Short Term Temperature
0x05? 0x038? 1????????????? 25? N--? Lowest Average Short Term Temperature
0x05? 0x040? 1????????????? 44? N--? Highest Average Long Term Temperature
0x05? 0x048? 1????????????? 25? N--? Lowest Average Long Term Temperature
0x05? 0x050? 4?????????????? 0? ---? Time in Over-Temperature
0x05? 0x058? 1????????????? 60? ---? Specified Maximum Operating Temperature
0x05? 0x060? 4?????????????? 0? ---? Time in Under-Temperature
0x05? 0x068? 1?????????????? 0? ---? Specified Minimum Operating Temperature
0x06? =====? =?????????????? =? ===? == Transport Statistics (rev 1) =0x06?
0x008? 4?????????????? 6? ---? Number of Hardware Resets
0x06? 0x010? 4?????????????? 0? ---? Number of ASR Events
0x06? 0x018? 4?????????????? 0? ---? Number of Interface CRC Errors
??????????????????????????????? |||_ C monitored condition met
??????????????????????????????? ||__ D supports DSN
??????????????????????????????? |___ N normalized value

And after....

Device Statistics (GP Log 0x04)
Page? Offset Size??????? Value Flags Description
0x01? =====? =?????????????? =? ===? == General Statistics (rev 2) =0x01? 0x008?
4????????????? 19? ---? Lifetime Power-On Resets
0x01? 0x018? 6???? 32004195070? ---? Logical Sectors Written
0x01? 0x020? 6?????? 155946732? ---? Number of Write Commands
0x01? 0x028? 6??? 266340052468? ---? Logical Sectors Read
0x01? 0x030? 6????? 1136347463? ---? Number of Read Commands
0x01? 0x038? 6???? 61540814700? ---? Date and Time TimeStamp
0x03? =====? =?????????????? =? ===? == Rotating Media Statistics (rev 1) =0x03?
0x008? 4??????????? 6114? ---? Spindle Motor Power-on Hours
0x03? 0x010? 4??????????? 6114? ---? Head Flying Hours
0x03? 0x018? 4?????????? 30602? ---? Head Load Events
0x03? 0x020? 4?????????????? 0? ---? Number of Reallocated Logical Sectors
0x03? 0x028? 4????????? 283848? ---? Read Recovery Attempts
0x03? 0x030? 4?????????????? 0? ---? Number of Mechanical Start Failures
0x04? =====? =?????????????? =? ===? == General Errors Statistics (rev 1) =0x04?
0x008? 4?????????????? 0? ---? Number of Reported Uncorrectable Errors
0x04? 0x010? 4????????????? 17? ---? Resets Between Cmd Acceptance and
Completion
0x05? =====? =?????????????? =? ===? == Temperature Statistics (rev 1) =0x05?
0x008? 1????????????? 41? ---? Current Temperature
0x05? 0x010? 1????????????? 40? N--? Average Short Term Temperature
0x05? 0x018? 1????????????? 32? N--? Average Long Term Temperature
0x05? 0x020? 1????????????? 52? ---? Highest Temperature
0x05? 0x028? 1????????????? 23? ---? Lowest Temperature
0x05? 0x030? 1????????????? 48? N--? Highest Average Short Term Temperature
0x05? 0x038? 1????????????? 25? N--? Lowest Average Short Term Temperature
0x05? 0x040? 1????????????? 44? N--? Highest Average Long Term Temperature
0x05? 0x048? 1????????????? 25? N--? Lowest Average Long Term Temperature
0x05? 0x050? 4?????????????? 0? ---? Time in Over-Temperature
0x05? 0x058? 1????????????? 60? ---? Specified Maximum Operating Temperature
0x05? 0x060? 4?????????????? 0? ---? Time in Under-Temperature
0x05? 0x068? 1?????????????? 0? ---? Specified Minimum Operating Temperature
0x06? =====? =?????????????? =? ===? == Transport Statistics (rev 1) =0x06?
0x008? 4?????????????? 7? ---? Number of Hardware Resets
0x06? 0x010? 4?????????????? 0? ---? Number of ASR Events
0x06? 0x018? 4?????????????? 0? ---? Number of Interface CRC Errors
??????????????????????????????? |||_ C monitored condition met
??????????????????????????????? ||__ D supports DSN
??????????????????????????????? |___ N normalized value

Number of Hardware Resets has incremented.? There are no other errors shown:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME????????? FLAGS??? VALUE WORST THRESH FAIL RAW_VALUE
? 1 Raw_Read_Error_Rate???? PO-R--?? 100?? 100?? 016??? -??? 0
? 2 Throughput_Performance? P-S---?? 137?? 137?? 054??? -??? 104
? 3 Spin_Up_Time??????????? POS---?? 132?? 132?? 024??? -??? 497
(Average 497)
? 4 Start_Stop_Count??????? -O--C-?? 099?? 099?? 000??? -??? 7074
? 5 Reallocated_Sector_Ct?? PO--CK?? 100?? 100?? 005??? -??? 0
? 7 Seek_Error_Rate???????? PO-R--?? 100?? 100?? 067??? -??? 0
? 8 Seek_Time_Performance?? P-S---?? 128?? 128?? 020??? -??? 18
? 9 Power_On_Hours????????? -O--C-?? 098?? 098?? 000??? -??? 17094
?10 Spin_Retry_Count??????? PO--C-?? 100?? 100?? 060??? -??? 0
?12 Power_Cycle_Count?????? -O--CK?? 100?? 100?? 000??? -??? 19
192 Power-Off_Retract_Count -O--CK?? 075?? 075?? 000??? -??? 30602
193 Load_Cycle_Count??????? -O--C-?? 075?? 075?? 000??? -??? 30602
194 Temperature_Celsius???? -O----?? 146?? 146?? 000??? -??? 41 (Min/Max
23/52)
196 Reallocated_Event_Count -O--CK?? 100?? 100?? 000??? -??? 0
197 Current_Pending_Sector? -O---K?? 100?? 100?? 000??? -??? 0
198 Offline_Uncorrectable?? ---R--?? 100?? 100?? 000??? -??? 0
199 UDMA_CRC_Error_Count??? -O-R--?? 200?? 200?? 000??? -??? 0
??????????????????????????? ||||||_ K auto-keep
??????????????????????????? |||||__ C event count
??????????????????????????? ||||___ R error rate
??????????????????????????? |||____ S speed/performance
??????????????????????????? ||_____ O updated online
??????????????????????????? |______ P prefailure warning

I'd throw possible shade at the backplane or cable /but I have already
swapped both out for spares without any change in behavior./

-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4897 bytes
Desc: S/MIME Cryptographic Signature
URL:
<http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20190205/db84261a/attachment.bin>

freebsd stable - Feb 2019 - 9211 (LSI/SAS) issues on 11.2-STABLE

9211 (LSI/SAS) issues on 11.2-STABLE

9211 (LSI/SAS) issues on 11.2-STABLE