Hi All,

Don't know if this is worth reporting, as it's human error. Anyway, I had a panic on my zfs box. Here's the error:

marksburg /usr2/glowe> grep panic /var/log/syslog
Apr 8 06:57:17 marksburg savecore: [ID 570001 auth.error] reboot after panic: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 580
Apr 8 07:15:10 marksburg savecore: [ID 570001 auth.error] reboot after panic: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 580
marksburg /usr2/glowe>

What we did to cause this is we pulled a LUN from zfs and replaced it with a new LUN. We then tried to shut down the box, but it wouldn't go down. We had to send a break to the box and reboot. This is an Oracle sandbox, so we're not really concerned. Ideas?
Grant,

Didn't see a response so I'll give it a go.

Ripping a disk away and silently inserting a new one is asking for trouble, imho. I am not sure what you were trying to accomplish, but generally replacing a drive/LUN would entail commands like:

# zpool offline tank c1t3d0
# cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0            disk     connected    configured   ok
# cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
# cfgadm | grep sata1/3
sata1/3                        disk     connected    unconfigured ok
<Replace the physical disk c1t3d0>
# cfgadm -c configure sata1/3

Taken from this page:

http://docs.sun.com/app/docs/doc/819-5461/gbbzy?a=view

..Remco

Grant Lowe wrote:
> [snip]
> What we did to cause this is we pulled a LUN from zfs and replaced it with a new LUN. We then tried to shut down the box, but it wouldn't go down. We had to send a break to the box and reboot. This is an Oracle sandbox, so we're not really concerned. Ideas?
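(For completeness: once the replacement device is configured, ZFS still has to be told about it. A minimal sketch of that last step, assuming the pool is named tank and the new LUN shows up under the same c1t3d0 name; adjust both for your setup:)

# zpool replace tank c1t3d0
(resilvers the pool onto the replacement device now sitting at c1t3d0)
# zpool status tank
(confirm the resilver completes and the pool returns to ONLINE)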
Hi Remco.

Yes, I realize that was asking for trouble. It wasn't supposed to be a test of yanking a LUN. We needed a LUN for a VxVM/VxFS system and that LUN was available. I was just surprised at the panic, since the system was quiesced at the time. But there will come a time when we will be doing this. Thanks for the feedback. I appreciate it.

----- Original Message ----
From: Remco Lengers <remco at lengers.com>
To: Grant Lowe <glowe at sbcglobal.net>
Cc: zfs-discuss at opensolaris.org
Sent: Thursday, April 9, 2009 5:31:42 AM
Subject: Re: [zfs-discuss] ZFS Panic

Grant,

Didn't see a response so I'll give it a go.

Ripping a disk away and silently inserting a new one is asking for trouble, imho. I am not sure what you were trying to accomplish, but generally replacing a drive/LUN would entail commands like:

[snip: zpool offline / cfgadm unconfigure / configure procedure quoted in full above]

Taken from this page:

http://docs.sun.com/app/docs/doc/819-5461/gbbzy?a=view

..Remco
FWIW, I strongly expect live ripping of a SATA device to not panic the disk layer. It explicitly shouldn't panic the ZFS layer, as ZFS is supposed to be "fault-tolerant" and "drive dropping away at any time" is a rather expected scenario.

[I've popped disks out live in many cases, both when I was experimenting with ZFS+RAID-Z on various systems and occasionally when I've had to replace a disk live. In the latter case, I've done cfgadm about half the time - the rest, I've just live ripped and then brought the disk up after that, and it's Just Worked.]

- Rich

On Thu, Apr 9, 2009 at 3:21 PM, Grant Lowe <glowe at sbcglobal.net> wrote:
> Hi Remco.
>
> Yes, I realize that was asking for trouble. It wasn't supposed to be a test of yanking a LUN. We needed a LUN for a VxVM/VxFS system and that LUN was available. I was just surprised at the panic, since the system was quiesced at the time. But there will come a time when we will be doing this. Thanks for the feedback. I appreciate it.
>
> [snip]

--
BOFH excuse #439:

Hot Java has gone cold
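(A controlled way to see the behaviour Rich describes, without physically yanking anything: a minimal sketch, assuming a redundant mirror or raidz pool named tank with a member disk c1t3d0 you can spare:)

# zpool offline tank c1t3d0
# zpool status tank
(the pool should report DEGRADED but remain readable and writable)
# zpool online tank c1t3d0
(ZFS resilvers whatever changed while the disk was away and the pool returns to ONLINE)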
On Fri, 10 Apr 2009, Rince wrote:

> FWIW, I strongly expect live ripping of a SATA device to not panic the disk layer. It explicitly shouldn't panic the ZFS layer, as ZFS is supposed to be "fault-tolerant" and "drive dropping away at any time" is a rather expected scenario.

Ripping a SATA device out runs a goodly chance of confusing the controller. If you'd had this problem with fibre channel or even SCSI, I'd find it a far bigger concern. IME, IDE and SATA just don't hold up to the abuses we'd like to level at them. Of course, this boils down to controller and enclosure and a lot of other random chances for disaster.

In addition, where there is a procedure to gently remove the device, use it. We don't just yank disks from the FC-AL backplanes on V880s, because there is a procedure for handling this even for failed disks. The five minutes to do it properly is a good investment compared to much longer downtime from a fault condition arising from careless manhandling of hardware.

--
Andre van Eyssen.
mail: andre at purplecow.org
jabber: andre at interact.purplecow.org
purplecow.org: UNIX for the masses http://www2.purplecow.org
purplecow.org: PCOWpix http://pix.purplecow.org
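(For the V880/FC-AL case Andre mentions, the graceful path is roughly the sketch below. The device names are placeholders and luxadm arguments vary by enclosure, so check luxadm(1M) on the box before relying on this:)

# zpool offline tank c2t20d0
# luxadm remove_device /dev/rdsk/c2t20d0s2
(luxadm offlines and spins down the drive, then prompts when it is safe to pull it)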
On Fri, Apr 10, 2009 at 12:43 AM, Andre van Eyssen <andre at purplecow.org> wrote:

> Ripping a SATA device out runs a goodly chance of confusing the controller. If you'd had this problem with fibre channel or even SCSI, I'd find it a far bigger concern. IME, IDE and SATA just don't hold up to the abuses we'd like to level at them. Of course, this boils down to controller and enclosure and a lot of other random chances for disaster.
>
> In addition, where there is a procedure to gently remove the device, use it. We don't just yank disks from the FC-AL backplanes on V880s, because there is a procedure for handling this even for failed disks. The five minutes to do it properly is a good investment compared to much longer downtime from a fault condition arising from careless manhandling of hardware.

IDE isn't supposed to do this, but SATA explicitly has hotplug as a "feature". (I think this might be SATA 2, so any SATA 1 controllers out there are hedging your bets, but...)

I'm not advising this as a recommended procedure, but the failure of the controller isn't my point. *ZFS* shouldn't panic under those conditions. The disk layer, perhaps, but not ZFS. As far as it should be concerned, it's equivalent to ejecting a disk via cfgadm without telling ZFS first, which *IS* a supported operation.

- Rich

--
Procrastination means never having to say you're sorry.
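(The cfgadm-without-zpool-offline sequence Rich refers to would look something like the sketch below, reusing the sata1/3, c1t3d0 and tank names from earlier in the thread. The exact state reported (REMOVED vs. UNAVAIL), and whether the disk comes back automatically, depend on the controller driver and build:)

# cfgadm -c unconfigure sata1/3
# zpool status tank
(on a redundant pool the missing device is flagged as faulted/removed and the pool keeps running DEGRADED)
# cfgadm -c configure sata1/3
# zpool online tank c1t3d0
(only needed if ZFS does not pick the device back up on its own)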
>>>>> "r" == Rince <rincebrain at gmail.com> writes:r> *ZFS* shouldn''t panic under those conditions. The disk layer, r> perhaps, but not ZFS. well, yes, but panicing brings down the whole box anyway so there is no practical difference, just a difference in blame. I would rather say, the fact that redundant ZFS ought to be the best-practice proper way to configure ~all filesystems in the future, means that disk drivers in the future ought to expect to have ZFS above them, so panicing when enough drives are still available to keep the pool up isn''t okay, and also it''s not okay to let problems with one drive interrupt access to other drives, and finally we''ve still no reasonably-practiceable consensus on how to deal with timeout problems, like vanishing iSCSI targets, and ATA targets that remain present but take 1000x longer to respond to each command, as ATA disks often do when they''re failing and as is, I suspect, well-handled by all the serious hardware RAID storage vendors. With some chips writing a good driver has proven (on Linux) to be impossible, or beyond the skill of the person who adopted the chip, or beyond the effort warranted by the chip''s interestingness. well, fine, but these things are certainly important enough to document, and on Linux they ARE documented: http://ata.wiki.kernel.org/index.php/SATA_hardware_features It''s kind of best-effort, but still it''s a lot better than ``all those problems on X4500 were fixed AGES ago, just upgrade'''' / ``still having problems'''' / ``ok they are all fixed now'''' / ``no they''re not, still can''t hotplug, still no NCQ'''' / ``well they are much more stable now.'''' / ``can I hotplug? is NCQ working?'''' / ....... Note the LSI 1068 IT-mode cards driven by the proprietary ''mpt'' driver are supported, by a GPL driver, on Linux, and smartctl works on these cards. but they don''t appear on the wiki above, so Linux''s list of chip features isn''t complete, but it''s a start. r> As far as it should be concerned, it''s equivalent to ejecting r> a disk via cfgadm without telling ZFS first, which *IS* a r> supported operation. an interesting point! Either way, though, we''re responsible for the whole system. ``Our new handsets have microkernels, which is excellent for reliability! In the future, when there''s a bug, it won''t crash the whole celfone. It''ll just crash the, ahh, the Phone Application.'''' riiiight, sure, but SO WHAT?! -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090410/5498712a/attachment.bin>
Grant Lowe wrote:
>> Hi All,
>>
>> Don't know if this is worth reporting, as it's human error. Anyway, I had a panic on my zfs box. Here's the error:
>>
>> marksburg /usr2/glowe> grep panic /var/log/syslog
>> Apr 8 06:57:17 marksburg savecore: [ID 570001 auth.error] reboot after panic: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 580
>> Apr 8 07:15:10 marksburg savecore: [ID 570001 auth.error] reboot after panic: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 580
>> marksburg /usr2/glowe>
>>
>> What we did to cause this is we pulled a LUN from zfs and replaced it with a new LUN. We then tried to shut down the box, but it wouldn't go down. We had to send a break to the box and reboot. This is an Oracle sandbox, so we're not really concerned. Ideas?

[this is a standard response]

Assertion failures are, by definition, bugs.

"In computer programming, an assertion is a predicate (i.e., a true-false statement) placed in a program to indicate that the developer thinks that the predicate is always true at that place."
http://en.wikipedia.org/wiki/Assertion_(computing)

If you continue to run the same software with the same inputs, then it is reasonable to expect the same assertion failure. You may have to apply a patch or otherwise change software versions to continue.

If you find an assertion failure in OpenSolaris, please file a bug at http://bugs.opensolaris.org

If the bug caused a core dump, please have the dump available for the troubleshooting team.

-- richard
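(If you do file that bug, the panic string and stack are easy to pull out of the dump savecore already wrote. A minimal sketch, assuming the default savecore directory and dump number 0:)

# cd /var/crash/`hostname`
# ls
bounds  unix.0  vmcore.0
# mdb unix.0 vmcore.0
> ::status
(prints the panic message, which should match the dmu.c line 580 assertion)
> ::stack
(the panicking thread's stack, worth pasting into the bug report)
> $q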
Andre van Eyssen <andre at purplecow.org> wrote:

> On Fri, 10 Apr 2009, Rince wrote:
>
> > FWIW, I strongly expect live ripping of a SATA device to not panic the disk layer. It explicitly shouldn't panic the ZFS layer, as ZFS is supposed to be "fault-tolerant" and "drive dropping away at any time" is a rather expected scenario.
>
> Ripping a SATA device out runs a goodly chance of confusing the controller. If you'd had this problem with fibre channel or even SCSI, I'd find it a far bigger concern. IME, IDE and SATA just don't hold up to the abuses we'd like to level at them. Of course, this boils down to controller and enclosure and a lot of other random chances for disaster.

PATA (IDE) does not support hot-plug; SATA does, and SATA uses the same interface as SAS does. I would expect that there is no difference between unplugging a SATA drive and unplugging a SAS drive.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily