Steve Gonczi
2010-May-25 19:03 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
Greetings,

I see repeatable crashes on some systems after upgrading. The signature is always the same:

operating system: 5.11 snv_139 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff00175f88c0 addr=0 occurred in module "genunix" due to a NULL pointer dereference

list_remove+0x1b(ffffff03e19339f0, ffffff03e0814640)
zfs_acl_release_nodes+0x34(ffffff03e19339c0)
zfs_acl_free+0x16(ffffff03e19339c0)
zfs_znode_free+0x5e(ffffff03e17fa600)
zfs_zinactive+0x9b(ffffff03e17fa600)
zfs_inactive+0x11c(ffffff03e17f8500, ffffff03ee867528, 0)
fop_inactive+0xaf(ffffff03e17f8500, ffffff03ee867528, 0)
vn_rele_dnlc+0x6c(ffffff03e17f8500)
dnlc_purge+0x175()
nfs_idmap_args+0x5e(ffffff00175f8c38)
nfssys+0x1e1(12, 8047dd8)

The stack always looks like the above; the vnode involved is sometimes a file, sometimes a directory. E.g., I have seen the /boot/acpi directory and the /kernel/drv/amd64/acpi_driver file in the vnode's path field.

Looking at the data, I notice that z_acl.list_head indicates a single member in the list (I presume that is the case because list_prev and list_next point to the same address):

> ffffff03e19339c0::print zfs_acl_t
{
    z_acl_count = 0x6
    z_acl_bytes = 0x30
    z_version = 0x1
    z_next_ace = 0xffffff03e171d210
    z_hints = 0
    z_curr_node = 0xffffff03e0814640
    z_acl = {
        list_size = 0x40
        list_offset = 0
        list_head = {
            list_next = 0xffffff03e0814640
            list_prev = 0xffffff03e0814640
        }
    }
}

This member's next pointer is bad (sometimes zero, sometimes a low number, e.g. 0x10). The NULL pointer crash happens trying to follow the list_prev pointer:

> 0xffffff03e0814640::print zfs_acl_node_t
{
    z_next = {
        list_next = 0
        list_prev = 0
    }
    z_acldata = 0xffffff03e10b6230
    z_allocdata = 0xffffff03e171d200
    z_allocsize = 0x30
    z_size = 0x30
    z_ace_count = 0x6
    z_ace_idx = 0x2
}

This is a repeating pattern: there always seems to be a single zfs_acl_node in the list, with NULL or garbaged-out list_next and list_prev pointers.
E.g., in another instance of this crash, the zfs_acl_node looks like this:

> ::stack
list_remove+0x1b(ffffff03e10d24f0, ffffff03e0fc9a00)
zfs_acl_release_nodes+0x34(ffffff03e10d24c0)
zfs_acl_free+0x16(ffffff03e10d24c0)
zfs_znode_free+0x5e(ffffff03e10cc200)
zfs_zinactive+0x9b(ffffff03e10cc200)
zfs_inactive+0x11c(ffffff03e1281840, ffffff03ea5c7010, 0)
fop_inactive+0xaf(ffffff03e1281840, ffffff03ea5c7010, 0)
vn_rele_dnlc+0x6c(ffffff03e1281840)
dnlc_purge+0x175()
nfs_idmap_args+0x5e(ffffff001811ac38)
nfssys+0x1e1(12, 8047dd8)
_sys_sysenter_post_swapgs+0x149()

> ::status
...
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff001811a8c0 addr=10 occurred in module "genunix" due to a NULL pointer dereference

> ffffff03e0fc9a00::print zfs_acl_node_t
{
    z_next = {
        list_next = 0xffffff03e10e1cd9
        list_prev = 0x10
    }
    z_acldata = 0
    z_allocdata = 0xffffff03e10cb5d0
    z_allocsize = 0x30
    z_size = 0x30
    z_ace_count = 0x6
    z_ace_idx = 0x2
}

The crash here looks the same to me, and list_next / list_prev are garbage.

Has anybody seen this? Am I skipping too many versions when I image-update? I am hoping someone who knows this code will chime in.

Steve
--
This message posted from opensolaris.org
Steve Gonczi
2010-May-25 20:21 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
As I look at this further, I am convincing myself that this should really be caught by an assert. (I am running release builds, so asserts do not fire.) In a debug build, I think I should be seeing the !list_empty() assert in:

list_remove(list_t *list, void *object)
{
        list_node_t *lold = list_d2l(list, object);

        ASSERT(!list_empty(list));
        ASSERT(lold->list_next != NULL);
        list_remove_node(lold);
}

I suspect this may be a race; assuming there is no other interfering thread, this crash could never happen:

static void
zfs_acl_release_nodes(zfs_acl_t *aclp)
{
        zfs_acl_node_t *aclnode;

        while (aclnode = list_head(&aclp->z_acl)) {
                list_remove(&aclp->z_acl, aclnode);
                zfs_acl_node_free(aclnode);
        }
        aclp->z_acl_count = 0;
        aclp->z_acl_bytes = 0;
}

list_head() does a list_empty() check and returns NULL on an empty list. So if we got past that, list_remove() should never find an empty list; perhaps there is interference from another thread.
Steve Gonczi
2010-May-26 17:40 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
More info: the crashes go away just by swapping in a faster, more powerful CPU. On one box where the crash consistently happened (a slow 2-core CPU), I no longer see the crash after swapping in a quad-core.