Hello everybody,

I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic during the import of a zpool (some 30 TB) containing ~500 ZFS filesystems after reboot. This causes a reboot loop until I boot single user and remove /etc/zfs/zpool.cache.

From /var/adm/messages:

savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff002f9cec50 addr=20 occurred in module "zfs" due to a NULL pointer dereference
savecore: [ID 882351 auth.error] Saving compressed system crash dump in /var/crash/vmdump.2

This is what mdb tells:

mdb unix.2 vmcore.2
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd ip hook neti arp usba uhci sockfs qlc fctl s1394 kssl lofs random fcp idm sata fcip cpc crypto ufs logindmux ptm sppp ]
$c
zap_leaf_lookup_closest+0x45(ffffff0700ca2a98, 0, 0, ffffff002f9cedb0)
fzap_cursor_retrieve+0xcd(ffffff0700ca2a98, ffffff002f9ceed0, ffffff002f9cef10)
zap_cursor_retrieve+0x195(ffffff002f9ceed0, ffffff002f9cef10)
zfs_purgedir+0x4d(ffffff0721d32c20)
zfs_rmnode+0x57(ffffff0721d32c20)
zfs_zinactive+0xb4(ffffff0721d32c20)
zfs_inactive+0x1a3(ffffff0721d3a700, ffffff07149dc1a0, 0)
fop_inactive+0xb1(ffffff0721d3a700, ffffff07149dc1a0, 0)
vn_rele+0x58(ffffff0721d3a700)
zfs_unlinked_drain+0xa7(ffffff07022dab40)
zfsvfs_setup+0xf1(ffffff07022dab40, 1)
zfs_domount+0x152(ffffff07223e3c70, ffffff0717830080)
zfs_mount+0x4e3(ffffff07223e3c70, ffffff07223e5900, ffffff002f9cfe20, ffffff07149dc1a0)
fsop_mount+0x22(ffffff07223e3c70, ffffff07223e5900, ffffff002f9cfe20, ffffff07149dc1a0)
domount+0xd2f(0, ffffff002f9cfe20, ffffff07223e5900, ffffff07149dc1a0, ffffff002f9cfe18)
mount+0xc0(ffffff0713612c78, ffffff002f9cfe98)
syscall_ap+0x92()
_sys_sysenter_post_swapgs+0x149()

I can import the pool read-only.

The server is a mirror for our primary file server and is synced via zfs send/receive.

I saw a similar effect some time ago on an OpenSolaris box (build 111b). Back then my final solution was to copy the read-only mounted data over to a newly created pool. As this is the second time this failure occurs (on different machines), I'm really concerned about overall reliability...

Any suggestions?

thx

Carsten
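P.S. For the archives, this is roughly how I break the reboot loop and get at the data (booted single user; san_pool is the affected pool):

    # keep ZFS from auto-importing the pool on the next boot
    mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad
    reboot

    # after the clean boot, import the pool read-only, which avoids the
    # unlinked-drain/mount path that panics
    zpool import -o readonly=on san_pool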
2012-03-27 11:14, Carsten John wrote:
> I saw a similar effect some time ago on an OpenSolaris box (build 111b). Back then my final solution was to copy the read-only mounted data over to a newly created pool. As this is the second time this failure occurs (on different machines), I'm really concerned about overall reliability...
>
> Any suggestions?

A couple of months ago I reported a similar issue (though with a different stack trace and code path). I tracked it down to the code that frees deduped blocks: a valid code path could return a NULL pointer, but the calling routines used that pointer as if it were always valid - hence a NULL dereference when the pool was imported read-write and tried to release blocks marked for deletion. Adding a check for NULL in my private rebuild of oi_151a fixed the issue.

I wouldn't be surprised to see similar sloppiness in other parts of the code. Not checking input values in routines is a mistake waiting to fire (and it did fire for us). I am not sure how to make a webrev and ultimately a signed-off contribution upstream, but I posted my patch and research on this list and in the illumos bug tracker.

I am not sure how you can fix an S11 system, though. If the pool is at zpool v28 or older, you can try to import it into an OpenIndiana installation, perhaps rebuilt with similarly patched code that checks for NULLs, fix the pool there, and then reuse it in S11 if you must. The source is there on http://src.illumos.org, and your stack trace should tell you in which functions to start looking...

Good luck,
//Jim
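P.S. If you want to chase the code paths from your stack trace yourself, something like this gets you to the relevant files (a rough sketch only; it assumes a local clone of illumos-gate, and the closed S11 code may of course have diverged):

    # grab the gate (or just browse it on http://src.illumos.org)
    git clone https://github.com/illumos/illumos-gate.git
    cd illumos-gate

    # the functions in the panic stack live under the ZFS sources
    grep -rn "zap_leaf_lookup_closest" usr/src/uts/common/fs/zfs/
    grep -rn "zfs_purgedir\|zfs_unlinked_drain" usr/src/uts/common/fs/zfs/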
On Tue, Mar 27, 2012 at 3:14 AM, Carsten John <cjohn at mpi-bremen.de> wrote:
> Hello everybody,
>
> I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic during the import of a zpool (some 30 TB) containing ~500 ZFS filesystems after reboot. This causes a reboot loop until I boot single user and remove /etc/zfs/zpool.cache.
>
> From /var/adm/messages:
>
> savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff002f9cec50 addr=20 occurred in module "zfs" due to a NULL pointer dereference
> savecore: [ID 882351 auth.error] Saving compressed system crash dump in /var/crash/vmdump.2

I ran into a very similar problem with Solaris 10U9 and the replica (zfs send | zfs recv destination) of a zpool of about 25 TB of data. The problem was an incomplete snapshot (the zfs send | zfs recv had been interrupted). On boot the system was trying to import the zpool and, as part of that, trying to destroy the offending (incomplete) snapshot. This was zpool version 22, where destruction of a snapshot is handled in a single TXG. The operation was running the system out of RAM (32 GB worth). There is a fix for this in zpool version 26 (and newer), but any snapshot created while the zpool is at a version prior to 26 will still have the problem on disk. We have support with Oracle and were able to get a loaner system with 128 GB RAM to clean up the zpool (it took about 75 GB of RAM to do so).

If you are at zpool 26 or later, this is not your problem. If you are at zpool < 26, test for an incomplete snapshot by importing the pool read-only and running `zdb -d <zpool> | grep '%'`: an incomplete snapshot shows a '%' instead of a '@' as the dataset/snapshot separator. You can also run zdb against the _un_imported_ zpool using the -e option.

See the following Oracle bugs for more information:

CR# 6876953
CR# 6910767
CR# 7082249

CR# 7082249 has been marked as a duplicate of CR# 6948890.

P.S. I suspect that the incomplete snapshot was also corrupt in some strange way, but I could never make a solid determination of that. We think what caused the zfs send | zfs recv to be interrupted was hitting an e1000g Ethernet device driver bug.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
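P.P.S. Spelled out, the check is simply (pool name below is a placeholder):

    # with the pool imported read-only:
    zdb -d mypool | grep '%'

    # or against the pool while it is not imported:
    zdb -e -d mypool | grep '%'

An empty result means no half-received snapshot is hanging around.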
Hi Carsten,

This was supposed to be fixed in build 164 of Nevada (6742788). If you are still seeing this issue in S11, I think you should raise a bug with the relevant details.

As Paul has suggested, this could also be due to an incomplete snapshot. I have seen interrupted zfs recv's cause weird bugs.

Thanks,
Deepak.

On 03/27/12 12:44 PM, Carsten John wrote:
> Hello everybody,
>
> I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic during the import of a zpool (some 30 TB) containing ~500 ZFS filesystems after reboot. This causes a reboot loop until I boot single user and remove /etc/zfs/zpool.cache.
>
> From /var/adm/messages:
>
> savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff002f9cec50 addr=20 occurred in module "zfs" due to a NULL pointer dereference
> savecore: [ID 882351 auth.error] Saving compressed system crash dump in /var/crash/vmdump.2
> [...]
-----Original message-----
To: ZFS Discussions <zfs-discuss at opensolaris.org>
From: Paul Kraus <paul at kraus-haus.org>
Sent: Tue 27-03-2012 15:05
Subject: Re: [zfs-discuss] kernel panic during zfs import

> On Tue, Mar 27, 2012 at 3:14 AM, Carsten John <cjohn at mpi-bremen.de> wrote:
> > I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic during the import of a zpool (some 30 TB) containing ~500 ZFS filesystems after reboot.
>
> I ran into a very similar problem with Solaris 10U9 and the replica (zfs send | zfs recv destination) of a zpool of about 25 TB of data. The problem was an incomplete snapshot (the zfs send | zfs recv had been interrupted). [...]
>
> If you are at zpool 26 or later, this is not your problem. If you are at zpool < 26, test for an incomplete snapshot by importing the pool read-only and running `zdb -d <zpool> | grep '%'`: an incomplete snapshot shows a '%' instead of a '@' as the dataset/snapshot separator. You can also run zdb against the _un_imported_ zpool using the -e option. [...]

Hi,

this scenario seems to fit. The machine that was sending the snapshot is on OpenSolaris build 111b (which is running zpool version 14). I rebooted the receiving machine because of a hanging "zfs receive" that couldn't be killed.
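For reference, checking where both ends stand version-wise is just (san_pool is the receiving pool here; the sender is still at zpool version 14):

    zpool get version san_pool
    zpool upgrade -v      # lists which features each pool version introduced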
zdb -d -e <pool> does not give any useful information:

zdb -d -e san_pool
Dataset san_pool [ZPL], ID 18, cr_txg 1, 36.0K, 11 objects

When importing the pool read-only, I get an error about two datasets:

zpool import -o readonly=on san_pool
cannot set property for 'san_pool/home/someuser': dataset is read-only
cannot set property for 'san_pool/home/someotheruser': dataset is read-only

As this is a mirror machine, I still have the option to destroy the pool and copy the data over again via send/receive from the primary. But nobody knows how long that will hold until I'm hit again...

If an interrupted send/receive can screw up a 30 TB target pool, then send/receive isn't an option for replicating data at all; at the very least it should be flagged as "don't use it if your target pool might contain any valuable data".

I will reproduce the crash once more and try to file a bug report for S11, as recommended by Deepak (not so easy these days...).

thanks

Carsten
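P.S. For anyone in the same boat, the re-seed from the primary would look roughly like this (a sketch only; the hostname, vdev layout and dataset names below are placeholders, not our real ones). Since an interrupted send can't be resumed at these pool versions, run it inside screen or similar:

    # on the replica: recreate the pool with the same layout as before
    zpool destroy san_pool
    zpool create san_pool mirror c0t2d0 c0t3d0    # placeholder vdevs

    # pull a full recursive stream from the primary and receive it unmounted
    ssh primary-server "zfs snapshot -r tank@seed && zfs send -R tank@seed" | zfs recv -Fdu san_pool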