This is probably unreproducible, but I just got a panic whilst scrubbing
a simple mirrored pool on SXCE snv_124. Evidently one of the disks went
offline for some reason and shortly thereafter the panic happened. I
have the dump and the /var/adm/messages containing the trace.

Is there any point in submitting a bug report?

The panic starts with:

Jan 19 13:27:13 host6 panic[cpu1]/thread=2a1009f5c80:
Jan 19 13:27:13 host6 unix: [ID 403854 kern.notice] assertion failed: 0
== zap_update(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4, &dp->dp_scrub_bookmark,
tx), file: ../../common/fs/zfs/dsl_scrub.c, line: 853

FWIW when the system came back up, it resilvered with no problem and
now I'm rerunning the scrub.
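For what it's worth, the trace above was easy to pull out of the saved
dump with mdb; a minimal sketch, assuming savecore(1M) wrote the files
to the default /var/crash/host6 directory:

    # Open the saved crash dump (not the live kernel).
    cd /var/crash/host6
    mdb unix.0 vmcore.0

    # At the mdb prompt:
    #   ::status    - dump summary, including the panic string
    #   ::msgbuf    - console messages leading up to the panic
    #   ::stack     - stack trace of the panicking thread

Since the dump is from a panic, the initial target thread mdb presents
is the panicking one, so ::stack shows the trace that matters.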
On Tue, 19 Jan 2010, Frank Middleton wrote:

> This is probably unreproducible, but I just got a panic whilst
> scrubbing a simple mirrored pool on SXCE snv_124. Evidently
> one of the disks went offline for some reason and shortly
> thereafter the panic happened. I have the dump and the
> /var/adm/messages containing the trace.
>
> Is there any point in submitting a bug report?

I seem to recall that you are not using ECC memory. If so, maybe the
panic is a good thing.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 01/19/10 02:37 PM, Bob Friesenhahn wrote:

> I seem to recall that you are not using ECC memory. If so, maybe the
> panic is a good thing.

This is on SPARC sun4u. Has ECC etc. I agree that without ECC all bets
are off :-)

Cheers -- Frank
On Jan 19, 2010, at 14:30, Frank Middleton wrote:

> This is probably unreproducible, but I just got a panic whilst
> scrubbing a simple mirrored pool on SXCE snv_124. Evidently
> one of the disks went offline for some reason and shortly
> thereafter the panic happened. I have the dump and the
> /var/adm/messages containing the trace.
>
> Is there any point in submitting a bug report?

Was a crash dump generated? If so, then there's a chance that it can be
tracked down, I would guess.
Hi Frank,

I couldn't reproduce this problem on SXCE build 130 by failing a disk
in a mirrored pool and then immediately running a scrub on the pool. It
works as expected.

Any other symptoms (like a power failure?) before the disk went
offline? Is it possible that both disks went offline?

We would like to review the crash dump if you still have it, just let
me know when it's uploaded.

Thanks,

Cindy

On 01/19/10 12:30, Frank Middleton wrote:

> This is probably unreproducible, but I just got a panic whilst
> scrubbing a simple mirrored pool on SXCE snv_124. Evidently
> one of the disks went offline for some reason and shortly
> thereafter the panic happened. I have the dump and the
> /var/adm/messages containing the trace.
>
> Is there any point in submitting a bug report?
>
> The panic starts with:
>
> Jan 19 13:27:13 host6 panic[cpu1]/thread=2a1009f5c80:
> Jan 19 13:27:13 host6 unix: [ID 403854 kern.notice] assertion failed: 0
> == zap_update(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
> DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4, &dp->dp_scrub_bookmark,
> tx), file: ../../common/fs/zfs/dsl_scrub.c, line: 853
>
> FWIW when the system came back up, it resilvered with no
> problem and now I'm rerunning the scrub.
On 01/20/10 04:27 PM, Cindy Swearingen wrote:

> Hi Frank,
>
> I couldn't reproduce this problem on SXCE build 130 by failing a disk
> in a mirrored pool and then immediately running a scrub on the pool. It
> works as expected.

The disk has to fail whilst the scrub is running. It has happened twice
now, once with the bottom half of the mirror, and again with the top
half.

> Any other symptoms (like a power failure?) before the disk went
> offline? Is it possible that both disks went offline?

Neither. The system is on a pretty beefy UPS, and one half of the
mirror was definitely online (zpool status just before the panic showed
one disk offline and the pool as degraded).

> We would like to review the crash dump if you still have it, just let
> me know when it's uploaded.

Do you need the unix.0, vmcore.0, or both? I'll add either or both as
attachments to newly created bug 14012, "Panic running a scrub", when
you let me know which one(s) you want.

Thanks -- Frank
Hi Frank,

We need both files.

Thanks,

Cindy

On 01/20/10 15:43, Frank Middleton wrote:

> On 01/20/10 04:27 PM, Cindy Swearingen wrote:
>> Hi Frank,
>>
>> I couldn't reproduce this problem on SXCE build 130 by failing a disk
>> in a mirrored pool and then immediately running a scrub on the pool. It
>> works as expected.
>
> The disk has to fail whilst the scrub is running. It has happened twice
> now, once with the bottom half of the mirror, and again with the top
> half.
>
>> Any other symptoms (like a power failure?) before the disk went
>> offline? Is it possible that both disks went offline?
>
> Neither. The system is on a pretty beefy UPS, and one half of the
> mirror was definitely online (zpool status just before the panic showed
> one disk offline and the pool as degraded).
>
>> We would like to review the crash dump if you still have it, just let
>> me know when it's uploaded.
>
> Do you need the unix.0, vmcore.0, or both? I'll add either or both as
> attachments to newly created bug 14012, "Panic running a scrub", when
> you let me know which one(s) you want.
>
> Thanks -- Frank
On 01/20/10 05:55 PM, Cindy Swearingen wrote:

> Hi Frank,
>
> We need both files.

The vmcore is 1.4GB. An http upload is never going to complete. Is
there an ftp-able place to send it, or can you download it if I post it
somewhere?

Cheers -- Frank
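In the meantime I'll shrink it for transfer; a rough sketch, assuming
plain gzip output is acceptable on the receiving end (mdb can't read
the compressed file, so it would need gunzipping before analysis):

    # Compress the 1.4GB core for upload; unix.0 is small as-is.
    gzip -9 vmcore.0                  # produces vmcore.0.gz

    # If one transfer is still too big, split into 512MB pieces
    # that can be reassembled on the other end with cat(1).
    split -b 512m vmcore.0.gz vmcore.0.gz.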
On 01/20/10 04:27 PM, Cindy Swearingen wrote:

> Hi Frank,
>
> I couldn't reproduce this problem on SXCE build 130 by failing a disk
> in a mirrored pool and then immediately running a scrub on the pool. It
> works as expected.

As noted, the disk mustn't go offline until well after the scrub has
started. There's another wrinkle. There are some COMSTAR iscsi targets
on this pool. If there are no initiators accessing any of them, the
scrub completes with no errors after 6 hours. If one specific target is
active, the panic ensues reproducibly at about 5h30m or so.

The precise configuration has 2 disks on one LSI controller as a
mirrored pool (whole disks - no slices). Around 750GB of 1.3TB was in
use when the most recent iscsi target was created. The pool is
read-mostly, so it probably isn't fragmented. The zvol has copies=1 and
compression off (no dedup with snv_124). The initiator is VirtualBox
running on Fedora C10 on AMD64, and the target disk has 32-bit Fedora
C12 installed as "whole disk", which I believe is EFI.

To reproduce this might require setting up a COMSTAR iscsi target on a
mirrored pool, formatting it with an EFI label, and then running a
scrub (see the sketch below). Another, similar, target has OpenSolaris
installed on it, and it doesn't seem to cause a panic on a scrub if it
is running; AFAIK it doesn't use EFI, but I have not run a scrub with
it active since converting to COMSTAR either.

This wouldn't explain why one or the other disk randomly goes offline,
and it may be a red herring. But the scrub now runs to completion just
as it always has. Since I can't get FC12 to boot from the EFI disk in
VirtualBox, I may reinstall FC12 without EFI and see if that makes a
difference, but it is an extremely slow process, since it takes almost
6 hours for the panic to occur each time and there's no practical way
to "relocate" the zvol to the start of the pool.

HTH -- Frank
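For anyone who wants to try, here is a rough sketch of the setup
described above. The pool, disk, and volume names are all made up, the
COMSTAR steps assume the stmf and iscsi/target services are enabled,
and the guest install onto the LUN is done from the VirtualBox
initiator:

    # Placeholder names throughout (tank, c0t0d0, c0t1d0, vol1).
    # Prerequisites:
    #   svcadm enable stmf
    #   svcadm enable -r svc:/network/iscsi/target:default

    # Two whole disks as a mirror, plus a zvol to export.
    zpool create tank mirror c0t0d0 c0t1d0
    zfs create -V 100g tank/vol1

    # Export the zvol over iSCSI via COMSTAR.
    sbdadm create-lu /dev/zvol/rdsk/tank/vol1
    stmfadm add-view <GUID printed by sbdadm>
    itadm create-target

    # From the initiator, install a guest onto the LUN with an EFI
    # label and keep it busy, then start the scrub. Here a disk
    # dropped out on its own roughly 5.5 hours in; offlining one
    # mid-scrub may stand in for that:
    zpool scrub tank
    zpool offline tank c0t1d0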