Hi,

Using a ZFS emulated volume, I wasn't expecting to see a system [1] hang caused by a SCSI error. What do you think? The error is not systematic. When it happens, the Solaris/Xen dom0 console keeps displaying the following message and the system hangs.

Aug  3 11:11:23 jesma58 scsi: WARNING: /pci@0,0/pci1022,7450@a/pci17c2,10@4/sd@1,0 (sd2):
Aug  3 11:11:23 jesma58     Error for Command: read(10)    Error Level: Retryable
Aug  3 11:11:23 jesma58 scsi:   Requested Block: 67679394    Error Block: 67679394
Aug  3 11:11:23 jesma58 scsi:   Vendor: SEAGATE    Serial Number: 3JA7XWQY
Aug  3 11:11:23 jesma58 scsi:   Sense Key: Unit_Attention
Aug  3 11:11:23 jesma58 scsi:   ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x4

Not sure whether the error happens when the ZFS emulated volume is accessed by Solaris/Xen dom0 or by the Linux/Xen domU guest OS...?

Here is the usage scenario:

1 - Created a ZFS emulated volume tank/vol2:

    # zfs create -V 5gb tank/vol2

2 - Copied an x86 boot sector disk image into that volume:

    # file disk.img
    disk.img: DOS executable (COM)
    # ls -l disk.img
    -rw-r--r--   1 ppetit   icnc     5368709120 Aug  2 18:45 disk.img
    # dd if=disk.img of=/dev/zvol/dsk/tank/vol2 bs=8192
    # zfs get all tank/vol2
    NAME       PROPERTY       VALUE                  SOURCE
    tank/vol2  type           volume                 -
    tank/vol2  creation       Wed Aug  2 18:37 2006  -
    tank/vol2  used           5.04G                  -
    tank/vol2  available      5.83G                  -
    tank/vol2  referenced     5.04G                  -
    tank/vol2  compressratio  1.00x                  -
    tank/vol2  reservation    5G                     local
    tank/vol2  volsize        5G                     -
    tank/vol2  volblocksize   8K                     -
    tank/vol2  checksum       on                     default
    tank/vol2  compression    off                    default
    tank/vol2  readonly       off                    default

3 - Boot a Linux Xen domU kernel on that volume, which contains an ext3fs rootfs partition and a swap partition.

Thanks,
Patrick

----------------------------------------
[1] SunOS 5.11 matrix-build-2006-07-14 i86xen i386 i86xen on a V20Z machine.

--
Patrick Petit
Sun Microsystems Inc.
Labs, CTO ICNC Grenoble (http://icncweb.france)
Phone: (+33)476 188 232 x38232
Fax: (+33)476 188 282
180, Avenue de l'Europe
38334 Saint-Ismier Cedex, France
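(For context on step 3: the zvol would typically be handed to the guest through the domU configuration as a phy: backend. A minimal sketch of such a config follows; the file name, guest name, memory size, kernel/ramdisk paths and guest device names are hypothetical, only the zvol path comes from the scenario above.)

    # cat /etc/xen/linux-domu.cfg          (hypothetical example, not from the original post)
    name    = "linux-domu"                 # guest name - placeholder
    memory  = 512
    kernel  = "/xen/vmlinuz-2.6-xenU"      # guest kernel path - assumption
    ramdisk = "/xen/initrd-2.6-xenU.img"   # assumption
    disk    = [ 'phy:/dev/zvol/dsk/tank/vol2,xvda,w' ]   # the zvol created in step 1
    root    = "/dev/xvda1 ro"              # ext3 root partition inside the disk image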
Patrick Petit wrote:

> Hi,
>
> Using a ZFS emulated volume, I wasn't expecting to see a system [1]
> hang caused by a SCSI error. What do you think? The error is not
> systematic. When it happens, the Solaris/Xen dom0 console keeps
> displaying the following message and the system hangs.
>
> Aug  3 11:11:23 jesma58 scsi: WARNING:
> /pci@0,0/pci1022,7450@a/pci17c2,10@4/sd@1,0 (sd2):
> Aug  3 11:11:23 jesma58     Error for Command: read(10)    Error Level: Retryable
> Aug  3 11:11:23 jesma58 scsi:   Requested Block: 67679394    Error Block: 67679394
> Aug  3 11:11:23 jesma58 scsi:   Vendor: SEAGATE    Serial Number: 3JA7XWQY
> Aug  3 11:11:23 jesma58 scsi:   Sense Key: Unit_Attention
> Aug  3 11:11:23 jesma58 scsi:   ASC: 0x29 (bus device reset message
> occurred), ASCQ: 0x3, FRU: 0x4

Have you looked into this further using FMA, using fmadm to start with?

Darren
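(Aside from fmadm faulty, the raw error telemetry is worth a look; even when nothing has been diagnosed as a fault, the driver's ereports usually land in the FMA error log. A quick sketch using the standard FMA tools:)

    # fmdump -e        # one-line summary of error reports (ereports)
    # fmdump -eV       # full detail, including the device path and sense data
    # fmadm faulty -a  # any faults actually diagnosed from those reports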
Darren Reed wrote:

> Patrick Petit wrote:
>
>> Hi,
>>
>> Using a ZFS emulated volume, I wasn't expecting to see a system [1]
>> hang caused by a SCSI error. What do you think? The error is not
>> systematic. When it happens, the Solaris/Xen dom0 console keeps
>> displaying the following message and the system hangs.
>>
>> Aug  3 11:11:23 jesma58 scsi: WARNING:
>> /pci@0,0/pci1022,7450@a/pci17c2,10@4/sd@1,0 (sd2):
>> Aug  3 11:11:23 jesma58     Error for Command: read(10)    Error Level: Retryable
>> Aug  3 11:11:23 jesma58 scsi:   Requested Block: 67679394    Error Block: 67679394
>> Aug  3 11:11:23 jesma58 scsi:   Vendor: SEAGATE    Serial Number: 3JA7XWQY
>> Aug  3 11:11:23 jesma58 scsi:   Sense Key: Unit_Attention
>> Aug  3 11:11:23 jesma58 scsi:   ASC: 0x29 (bus device reset message
>> occurred), ASCQ: 0x3, FRU: 0x4
>
> Have you looked into this further using FMA, using fmadm to start with?

fmadm shows no error :-(

jesma58# fmadm faulty -a
  STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
jesma58#

> Darren

--
Patrick Petit
Sun Microsystems Inc.
Labs, CTO - G2 Systems Exp. ICNC Grenoble (http://icncweb.france)
Phone: (+33)476 188 232 x38232
Fax: (+33)476 188 282
180, Avenue de l'Europe
38334 Saint-Ismier Cedex, France
Patrick Petit wrote:

> Darren Reed wrote:
>
>> Patrick Petit wrote:
>>
>>> Using a ZFS emulated volume, I wasn't expecting to see a system [1]
>>> hang caused by a SCSI error. What do you think? The error is not
>>> systematic. When it happens, the Solaris/Xen dom0 console keeps
>>> displaying the following message and the system hangs.
>>>
>>> Aug  3 11:11:23 jesma58 scsi: WARNING:
>>> /pci@0,0/pci1022,7450@a/pci17c2,10@4/sd@1,0 (sd2):
>>> Aug  3 11:11:23 jesma58     Error for Command: read(10)    Error Level: Retryable
>>> Aug  3 11:11:23 jesma58 scsi:   Requested Block: 67679394    Error Block: 67679394
>>> Aug  3 11:11:23 jesma58 scsi:   Vendor: SEAGATE    Serial Number: 3JA7XWQY
>>> Aug  3 11:11:23 jesma58 scsi:   Sense Key: Unit_Attention
>>> Aug  3 11:11:23 jesma58 scsi:   ASC: 0x29 (bus device reset message
>>> occurred), ASCQ: 0x3, FRU: 0x4
>>
>> Have you looked into this further using FMA, using fmadm to start with?
>
> fmadm shows no error :-(
>
> jesma58# fmadm faulty -a
>   STATE RESOURCE / UUID
> -------- ----------------------------------------------------------------------
> jesma58#

I have had a similar issue with the solitary SATA disk which makes
up my zfs root pool - errors such as these send the system to a hang
state (fortunately not a hard hang) and require a break/F1-A + forced
crash to get out of.

As I understand it, ZFS will retry operations based on various settings
such as those in 'sd', and I don't believe there are specific error case
handlers in the ZFS code to deal with issues like this.

OTOH it would be nice to see ZFS invoking an error path immediately
on receipt of a failure like yours or mine. But I fear that this would
detract from the device agnosticism that we presently have.

Patrick, is your pool mirrored? I know that mine isn't, and as a result
I know that I need to expect that I will suffer.

The other thing that I am concerned with in your scenario is that you
are dd-ing a disk image onto a zvol. I'm not sure that this is the
right way to go about it (although I don't know what *is* the right
way to do it).

best regards,
James C. McPherson
--
Solaris Datapath Engineering
Storage Division
Sun Microsystems
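(On the retry behaviour mentioned above: the sd driver's per-command timeout and retry counts are tunable from /etc/system, which mainly changes how long the box stalls before an error is surfaced up the stack. The sketch below is only illustrative; the tunable names and defaults vary between releases, so verify them against your sd driver before setting anything.)

    * /etc/system -- illustrative sketch only; verify the tunable names
    * against your sd(7D) release before setting anything.
    * Per-command timeout in seconds (often 60 by default):
    set sd:sd_io_time=0x14
    * Number of retries before the failure is reported upward:
    set sd:sd_retry_count=3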
James C. McPherson wrote:

> Patrick Petit wrote:
>
>> Darren Reed wrote:
>>
>>> Patrick Petit wrote:
>>>
>>>> Using a ZFS emulated volume, I wasn't expecting to see a system [1]
>>>> hang caused by a SCSI error. What do you think? The error is not
>>>> systematic. When it happens, the Solaris/Xen dom0 console keeps
>>>> displaying the following message and the system hangs.
>>>>
>>>> Aug  3 11:11:23 jesma58 scsi: WARNING:
>>>> /pci@0,0/pci1022,7450@a/pci17c2,10@4/sd@1,0 (sd2):
>>>> Aug  3 11:11:23 jesma58     Error for Command: read(10)    Error Level: Retryable
>>>> Aug  3 11:11:23 jesma58 scsi:   Requested Block: 67679394    Error Block: 67679394
>>>> Aug  3 11:11:23 jesma58 scsi:   Vendor: SEAGATE    Serial Number: 3JA7XWQY
>>>> Aug  3 11:11:23 jesma58 scsi:   Sense Key: Unit_Attention
>>>> Aug  3 11:11:23 jesma58 scsi:   ASC: 0x29 (bus device reset message
>>>> occurred), ASCQ: 0x3, FRU: 0x4
>>>
>>> Have you looked into this further using FMA, using fmadm to start with?
>>
>> fmadm shows no error :-(
>>
>> jesma58# fmadm faulty -a
>>   STATE RESOURCE / UUID
>> -------- ----------------------------------------------------------------------
>> jesma58#
>
> I have had a similar issue with the solitary SATA disk which makes
> up my zfs root pool - errors such as these send the system to a hang
> state (fortunately not a hard hang) and require a break/F1-A + forced
> crash to get out of.
>
> As I understand it, ZFS will retry operations based on various settings
> such as those in 'sd', and I don't believe there are specific error case
> handlers in the ZFS code to deal with issues like this.

I am wondering to what extent it is the role of ZFS to fix SCSI
controller errors. Shouldn't that be the role of the controller driver,
or even the controller itself? I would expect that in such circumstances
the lower layers repair and/or isolate the faulty block, for instance by
reassigning it. But, having written SCSI drivers in the past (in my
defense, that was a long time ago), I don't recall drivers being that
elaborate, so the upper layers were left to deal with the hot potato :-(

> OTOH it would be nice to see ZFS invoking an error path immediately
> on receipt of a failure like yours or mine. But I fear that this would
> detract from the device agnosticism that we presently have.
>
> Patrick, is your pool mirrored? I know that mine isn't, and as a result
> I know that I need to expect that I will suffer.

No, it's not mirrored. It's a simple pool backed by a physical disk drive.

> The other thing that I am concerned with in your scenario is that you
> are dd-ing a disk image onto a zvol. I'm not sure that this is the
> right way to go about it (although I don't know what *is* the right
> way to do it).

Yes, I am wondering the same. Would it be preferable to dd onto the raw
(rdsk) device?

> best regards,
>
> James C. McPherson
> --
> Solaris Datapath Engineering
> Storage Division
> Sun Microsystems
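(On the rdsk question: a zvol is exposed under both /dev/zvol/dsk and /dev/zvol/rdsk, so the copy from step 2 can be pointed at the character device directly. A minimal sketch; the larger block size is just a common choice for raw devices, not a requirement:)

    # dd if=disk.img of=/dev/zvol/rdsk/tank/vol2 bs=1048576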
I have similar problems... I have a bunch of D1000 disk shelves attached via
SCSI HBAs to a V880. If I do something as simple as unplug a drive in a raidz
vdev, it generates SCSI errors that eventually freeze the entire system. I can
access the filesystem okay for a couple of minutes until the SCSI bus resets,
then I have a frozen box. I have to stop-a/sync/reset.

If I offline the device before unplugging the drive, I have no problems.

Yeah, sure, I know you're supposed to offline it first, but I'm trying to test
unexpected failures. If the power supplies fail on one of my shelves, the data
will be intact, but the system will hang. This is good, but not great, since I
really want this to be a high-availability system.

I believe this is a failure of the OS, controller, or SCSI driver to isolate
the bad device and let the rest of the system operate, rather than a ZFS issue.


This message posted from opensolaris.org
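(For reference, the offline step Brad describes is the standard zpool subcommand; something along these lines, with the pool and device names as placeholders for your own layout:)

    # zpool offline tank c2t1d0    # take the disk out of service before pulling it
    ...pull, reseat or replace the drive...
    # zpool online tank c2t1d0     # bring it back; ZFS resilvers as needed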
Brad,

I'm investigating a similar issue and would like to get a coredump if you
have one available.

Thanks,
George

Brad Plecs wrote:

> I have similar problems... I have a bunch of D1000 disk shelves attached via
> SCSI HBAs to a V880. If I do something as simple as unplug a drive in a raidz
> vdev, it generates SCSI errors that eventually freeze the entire system. I can
> access the filesystem okay for a couple of minutes until the SCSI bus resets,
> then I have a frozen box. I have to stop-a/sync/reset.
>
> If I offline the device before unplugging the drive, I have no problems.
>
> Yeah, sure, I know you're supposed to offline it first, but I'm trying to test
> unexpected failures. If the power supplies fail on one of my shelves, the data
> will be intact, but the system will hang. This is good, but not great, since I
> really want this to be a high-availability system.
>
> I believe this is a failure of the OS, controller, or SCSI driver to isolate
> the bad device and let the rest of the system operate, rather than a ZFS issue.
The core dump timed out (related to the SCSI bus reset?), so I don't
have one. I can try it again, though, it's easy enough to reproduce.

I was seeing errors on the fibre channel disks as well, so it's possible
the whole thing was locked up.

BP
--
bplecs at cs.umd.edu
Brad,

I have a suspicion about what you might be seeing and I want to confirm
it. If it locks up again you can also collect a threadlist:

echo "$<threadlist" | mdb -k

Send me the output and that will be a good starting point.

Thanks,
George

Brad Plecs wrote:

> The core dump timed out (related to the SCSI bus reset?), so I don't
> have one. I can try it again, though, it's easy enough to reproduce.
>
> I was seeing errors on the fibre channel disks as well, so it's possible
> the whole thing was locked up.
>
> BP
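(If the machine is still responsive when it degrades, the threadlist can be captured to a file and a live crash dump taken without resetting the box; a sketch, assuming a standard dumpadm/savecore setup:)

    # echo "$<threadlist" | mdb -k > /var/tmp/threadlist.out   # save the thread stacks for later analysis
    # savecore -L                                              # live crash dump of the running system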
> Brad,
>
> I have a suspicion about what you might be seeing and I want to confirm
> it. If it locks up again you can also collect a threadlist:
>
> echo "$<threadlist" | mdb -k
>
> Send me the output and that will be a good starting point.

I tried popping out a disk again, but for whatever reason, the system
just became sluggish rather than freezing this time. I didn't get to take
a whole disk shelf offline like I'd done before, because this system went
into production use over the weekend, but I'll send you the threadlist
from the single-disk try privately.

BP
--
bplecs at cs.umd.edu