Hi All,

Don't know if this is worth reporting, as it's human error. Anyway, I had a panic on my zfs box. Here's the error:

marksburg /usr2/glowe> grep panic /var/log/syslog
Apr 8 06:57:17 marksburg savecore: [ID 570001 auth.error] reboot after panic: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 580
Apr 8 07:15:10 marksburg savecore: [ID 570001 auth.error] reboot after panic: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 580
marksburg /usr2/glowe>

What we did to cause this is we pulled a LUN from zfs and replaced it with a new LUN. We then tried to shut down the box, but it wouldn't go down. We had to send a break to the box and reboot. This is an Oracle sandbox, so we're not really concerned. Ideas?
Grant,

Didn't see a response so I'll give it a go.

Ripping a disk away and silently inserting a new one is asking for trouble, imho. I am not sure what you were trying to accomplish, but generally replacing a drive/LUN would entail commands like:

# zpool offline tank c1t3d0
# cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0            disk     connected    configured   ok
# cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
# cfgadm | grep sata1/3
sata1/3                        disk     connected    unconfigured ok
<Replace the physical disk c1t3d0>
# cfgadm -c configure sata1/3

Taken from this page:

http://docs.sun.com/app/docs/doc/819-5461/gbbzy?a=view

..Remco

Grant Lowe wrote:
> [snip]
> What we did to cause this is we pulled a LUN from zfs and replaced it with a new LUN. We then tried to shut down the box, but it wouldn't go down. We had to send a break to the box and reboot. This is an Oracle sandbox, so we're not really concerned. Ideas?
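(For completeness: once the replacement device is configured, ZFS still has to be told about it. A minimal sketch of that last step, assuming the pool is named tank and the new LUN shows up under the same c1t3d0 name; adjust both for your setup:)

# zpool replace tank c1t3d0
(resilvers the pool onto the replacement device now sitting at c1t3d0)
# zpool status tank
(confirm the resilver completes and the pool returns to ONLINE)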
Hi Remco.

Yes, I realize that was asking for trouble. It wasn't supposed to be a test of yanking a LUN. We needed a LUN for a VxVM/VxFS system and that LUN was available. I was just surprised at the panic, since the system was quiesced at the time. But there will come a time when we will be doing this. Thanks for the feedback. I appreciate it.

----- Original Message ----
From: Remco Lengers <remco at lengers.com>
To: Grant Lowe <glowe at sbcglobal.net>
Cc: zfs-discuss at opensolaris.org
Sent: Thursday, April 9, 2009 5:31:42 AM
Subject: Re: [zfs-discuss] ZFS Panic

Grant,

Didn't see a response so I'll give it a go.

Ripping a disk away and silently inserting a new one is asking for trouble, imho. I am not sure what you were trying to accomplish, but generally replacing a drive/LUN would entail commands like:

[snip: zpool offline / cfgadm unconfigure / configure procedure quoted in full above]

Taken from this page:

http://docs.sun.com/app/docs/doc/819-5461/gbbzy?a=view

..Remco
FWIW, I strongly expect live ripping of a SATA device to not panic the disk layer. It explicitly shouldn't panic the ZFS layer, as ZFS is supposed to be "fault-tolerant" and "drive dropping away at any time" is a rather expected scenario.

[I've popped disks out live in many cases, both when I was experimenting with ZFS+RAID-Z on various systems and occasionally when I've had to replace a disk live. In the latter case, I've done cfgadm about half the time - the rest, I've just live ripped and then brought the disk up after that, and it's Just Worked.]

- Rich

On Thu, Apr 9, 2009 at 3:21 PM, Grant Lowe <glowe at sbcglobal.net> wrote:
> Hi Remco.
>
> Yes, I realize that was asking for trouble. It wasn't supposed to be a test of yanking a LUN. We needed a LUN for a VxVM/VxFS system and that LUN was available. I was just surprised at the panic, since the system was quiesced at the time. But there will come a time when we will be doing this. Thanks for the feedback. I appreciate it.
>
> [snip]

--
BOFH excuse #439:

Hot Java has gone cold
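(A controlled way to see the behaviour Rich describes, without physically yanking anything: a minimal sketch, assuming a redundant mirror or raidz pool named tank with a member disk c1t3d0 you can spare:)

# zpool offline tank c1t3d0
# zpool status tank
(the pool should report DEGRADED but remain readable and writable)
# zpool online tank c1t3d0
(ZFS resilvers whatever changed while the disk was away and the pool returns to ONLINE)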
On Fri, 10 Apr 2009, Rince wrote:

> FWIW, I strongly expect live ripping of a SATA device to not panic the disk layer. It explicitly shouldn't panic the ZFS layer, as ZFS is supposed to be "fault-tolerant" and "drive dropping away at any time" is a rather expected scenario.

Ripping a SATA device out runs a goodly chance of confusing the controller. If you'd had this problem with fibre channel or even SCSI, I'd find it a far bigger concern. IME, IDE and SATA just don't hold up to the abuses we'd like to level at them. Of course, this boils down to controller and enclosure and a lot of other random chances for disaster.

In addition, where there is a procedure to gently remove the device, use it. We don't just yank disks from the FC-AL backplanes on V880s, because there is a procedure for handling this even for failed disks. The five minutes to do it properly is a good investment compared to much longer downtime from a fault condition arising from careless manhandling of hardware.

--
Andre van Eyssen.
mail: andre at purplecow.org
jabber: andre at interact.purplecow.org
purplecow.org: UNIX for the masses http://www2.purplecow.org
purplecow.org: PCOWpix http://pix.purplecow.org
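(For the V880/FC-AL case Andre mentions, the graceful path is roughly the sketch below. The device names are placeholders and luxadm arguments vary by enclosure, so check luxadm(1M) on the box before relying on this:)

# zpool offline tank c2t20d0
# luxadm remove_device /dev/rdsk/c2t20d0s2
(luxadm offlines and spins down the drive, then prompts when it is safe to pull it)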
On Fri, Apr 10, 2009 at 12:43 AM, Andre van Eyssen <andre at purplecow.org> wrote:

> Ripping a SATA device out runs a goodly chance of confusing the controller. If you'd had this problem with fibre channel or even SCSI, I'd find it a far bigger concern. IME, IDE and SATA just don't hold up to the abuses we'd like to level at them. Of course, this boils down to controller and enclosure and a lot of other random chances for disaster.
>
> In addition, where there is a procedure to gently remove the device, use it. We don't just yank disks from the FC-AL backplanes on V880s, because there is a procedure for handling this even for failed disks. The five minutes to do it properly is a good investment compared to much longer downtime from a fault condition arising from careless manhandling of hardware.

IDE isn't supposed to do this, but SATA explicitly has hotplug as a "feature". (I think this might be SATA 2, so any SATA 1 controllers out there are hedging your bets, but...)

I'm not advising this as a recommended procedure, but the failure of the controller isn't my point. *ZFS* shouldn't panic under those conditions. The disk layer, perhaps, but not ZFS. As far as it should be concerned, it's equivalent to ejecting a disk via cfgadm without telling ZFS first, which *IS* a supported operation.

- Rich

--
Procrastination means never having to say you're sorry.
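(The cfgadm-without-zpool-offline sequence Rich refers to would look something like the sketch below, reusing the sata1/3, c1t3d0 and tank names from earlier in the thread. The exact state reported (REMOVED vs. UNAVAIL), and whether the disk comes back automatically, depend on the controller driver and build:)

# cfgadm -c unconfigure sata1/3
# zpool status tank
(on a redundant pool the missing device is flagged as faulted/removed and the pool keeps running DEGRADED)
# cfgadm -c configure sata1/3
# zpool online tank c1t3d0
(only needed if ZFS does not pick the device back up on its own)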
>>>>> "r" == Rince <rincebrain at gmail.com> writes:r> *ZFS* shouldn''t panic under those conditions. The disk layer, r> perhaps, but not ZFS. well, yes, but panicing brings down the whole box anyway so there is no practical difference, just a difference in blame. I would rather say, the fact that redundant ZFS ought to be the best-practice proper way to configure ~all filesystems in the future, means that disk drivers in the future ought to expect to have ZFS above them, so panicing when enough drives are still available to keep the pool up isn''t okay, and also it''s not okay to let problems with one drive interrupt access to other drives, and finally we''ve still no reasonably-practiceable consensus on how to deal with timeout problems, like vanishing iSCSI targets, and ATA targets that remain present but take 1000x longer to respond to each command, as ATA disks often do when they''re failing and as is, I suspect, well-handled by all the serious hardware RAID storage vendors. With some chips writing a good driver has proven (on Linux) to be impossible, or beyond the skill of the person who adopted the chip, or beyond the effort warranted by the chip''s interestingness. well, fine, but these things are certainly important enough to document, and on Linux they ARE documented: http://ata.wiki.kernel.org/index.php/SATA_hardware_features It''s kind of best-effort, but still it''s a lot better than ``all those problems on X4500 were fixed AGES ago, just upgrade'''' / ``still having problems'''' / ``ok they are all fixed now'''' / ``no they''re not, still can''t hotplug, still no NCQ'''' / ``well they are much more stable now.'''' / ``can I hotplug? is NCQ working?'''' / ....... Note the LSI 1068 IT-mode cards driven by the proprietary ''mpt'' driver are supported, by a GPL driver, on Linux, and smartctl works on these cards. but they don''t appear on the wiki above, so Linux''s list of chip features isn''t complete, but it''s a start. r> As far as it should be concerned, it''s equivalent to ejecting r> a disk via cfgadm without telling ZFS first, which *IS* a r> supported operation. an interesting point! Either way, though, we''re responsible for the whole system. ``Our new handsets have microkernels, which is excellent for reliability! In the future, when there''s a bug, it won''t crash the whole celfone. It''ll just crash the, ahh, the Phone Application.'''' riiiight, sure, but SO WHAT?! -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090410/5498712a/attachment.bin>
Grant Lowe wrote:
>> Hi All,
>>
>> Don't know if this is worth reporting, as it's human error. Anyway, I had a panic on my zfs box. Here's the error:
>>
>> marksburg /usr2/glowe> grep panic /var/log/syslog
>> Apr 8 06:57:17 marksburg savecore: [ID 570001 auth.error] reboot after panic: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 580
>> Apr 8 07:15:10 marksburg savecore: [ID 570001 auth.error] reboot after panic: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 580
>> marksburg /usr2/glowe>
>>
>> What we did to cause this is we pulled a LUN from zfs and replaced it with a new LUN. We then tried to shut down the box, but it wouldn't go down. We had to send a break to the box and reboot. This is an Oracle sandbox, so we're not really concerned. Ideas?

[this is a standard response]

Assertion failures are, by definition, bugs.

"In computer programming, an assertion is a predicate (i.e., a true-false statement) placed in a program to indicate that the developer thinks that the predicate is always true at that place."
http://en.wikipedia.org/wiki/Assertion_(computing)

If you continue to run the same software with the same inputs, then it is reasonable to expect the same assertion failure. You may have to apply a patch or otherwise change software versions to continue.

If you find an assertion failure in OpenSolaris, please file a bug at http://bugs.opensolaris.org

If the bug caused a core dump, please have the dump available for the troubleshooting team.

-- richard
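(If you do file that bug, the panic string and stack are easy to pull out of the dump savecore already wrote. A minimal sketch, assuming the default savecore directory and dump number 0:)

# cd /var/crash/`hostname`
# ls
bounds  unix.0  vmcore.0
# mdb unix.0 vmcore.0
> ::status
(prints the panic message, which should match the dmu.c line 580 assertion)
> ::stack
(the panicking thread's stack, worth pasting into the bug report)
> $q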
Andre van Eyssen <andre at purplecow.org> wrote:

> On Fri, 10 Apr 2009, Rince wrote:
>
> > FWIW, I strongly expect live ripping of a SATA device to not panic the disk layer. It explicitly shouldn't panic the ZFS layer, as ZFS is supposed to be "fault-tolerant" and "drive dropping away at any time" is a rather expected scenario.
>
> Ripping a SATA device out runs a goodly chance of confusing the controller. If you'd had this problem with fibre channel or even SCSI, I'd find it a far bigger concern. IME, IDE and SATA just don't hold up to the abuses we'd like to level at them. Of course, this boils down to controller and enclosure and a lot of other random chances for disaster.

PATA (IDE) does not support hot-plug; SATA does, and SATA uses the same interface as SAS does. I would expect that there is no difference between unplugging a SATA drive and unplugging a SAS drive.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily