Steve Gonczi
2010-May-25 19:03 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
Greetings,

I see repeatable crashes on some systems after upgrading. The signature is always the same:

operating system: 5.11 snv_139 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff00175f88c0 addr=0 occurred in module "genunix" due to a NULL pointer dereference

list_remove+0x1b(ffffff03e19339f0, ffffff03e0814640)
zfs_acl_release_nodes+0x34(ffffff03e19339c0)
zfs_acl_free+0x16(ffffff03e19339c0)
zfs_znode_free+0x5e(ffffff03e17fa600)
zfs_zinactive+0x9b(ffffff03e17fa600)
zfs_inactive+0x11c(ffffff03e17f8500, ffffff03ee867528, 0)
fop_inactive+0xaf(ffffff03e17f8500, ffffff03ee867528, 0)
vn_rele_dnlc+0x6c(ffffff03e17f8500)
dnlc_purge+0x175()
nfs_idmap_args+0x5e(ffffff00175f8c38)
nfssys+0x1e1(12, 8047dd8)

The stack always looks like the above; the vnode involved is sometimes a file, sometimes a directory. E.g., I have seen the /boot/acpi directory and the /kernel/drv/amd64/acpi_driver file in the vnode's path field.

Looking at the data, I notice that z_acl.list_head indicates a single member in the list (I presume that is the case because list_prev and list_next point to the same address):

> ffffff03e19339c0::print zfs_acl_t
{
    z_acl_count = 0x6
    z_acl_bytes = 0x30
    z_version = 0x1
    z_next_ace = 0xffffff03e171d210
    z_hints = 0
    z_curr_node = 0xffffff03e0814640
    z_acl = {
        list_size = 0x40
        list_offset = 0
        list_head = {
            list_next = 0xffffff03e0814640
            list_prev = 0xffffff03e0814640
        }
    }
}

This member's next pointer is bad (sometimes zero, sometimes a low number, e.g. 0x10). The NULL pointer crash happens trying to follow the list_prev pointer:

> 0xffffff03e0814640::print zfs_acl_node_t
{
    z_next = {
        list_next = 0
        list_prev = 0
    }
    z_acldata = 0xffffff03e10b6230
    z_allocdata = 0xffffff03e171d200
    z_allocsize = 0x30
    z_size = 0x30
    z_ace_count = 0x6
    z_ace_idx = 0x2
}

This is a repeating pattern: there always seems to be a single zfs_acl_node in the list, with NULL or garbaged-out list_next and list_prev pointers.
E.g., in another instance of this crash, the zfs_acl_node looks like this:

> ::stack
list_remove+0x1b(ffffff03e10d24f0, ffffff03e0fc9a00)
zfs_acl_release_nodes+0x34(ffffff03e10d24c0)
zfs_acl_free+0x16(ffffff03e10d24c0)
zfs_znode_free+0x5e(ffffff03e10cc200)
zfs_zinactive+0x9b(ffffff03e10cc200)
zfs_inactive+0x11c(ffffff03e1281840, ffffff03ea5c7010, 0)
fop_inactive+0xaf(ffffff03e1281840, ffffff03ea5c7010, 0)
vn_rele_dnlc+0x6c(ffffff03e1281840)
dnlc_purge+0x175()
nfs_idmap_args+0x5e(ffffff001811ac38)
nfssys+0x1e1(12, 8047dd8)
_sys_sysenter_post_swapgs+0x149()

> ::status
...
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff001811a8c0 addr=10 occurred in module "genunix" due to a NULL pointer dereference

> ffffff03e0fc9a00::print zfs_acl_node_t
{
    z_next = {
        list_next = 0xffffff03e10e1cd9
        list_prev = 0x10
    }
    z_acldata = 0
    z_allocdata = 0xffffff03e10cb5d0
    z_allocsize = 0x30
    z_size = 0x30
    z_ace_count = 0x6
    z_ace_idx = 0x2
}

The crash here looks the same to me, and list_next / list_prev are garbage.

Has anybody seen this? Am I skipping too many versions when I image-update? I am hoping someone who knows this code will chime in.

Steve
--
This message posted from opensolaris.org
Steve Gonczi
2010-May-25 20:21 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
As I look at this further, I am convincing myself that this should really be caught by an assert. (I am running release builds, so asserts do not fire.) In a debug build, I think I should be seeing the !list_empty() assert in:

list_remove(list_t *list, void *object)
{
        list_node_t *lold = list_d2l(list, object);

        ASSERT(!list_empty(list));
        ASSERT(lold->list_next != NULL);
        list_remove_node(lold);
}

I suspect this may be a race; assuming there is no other interfering thread, this crash could never happen:

static void
zfs_acl_release_nodes(zfs_acl_t *aclp)
{
        zfs_acl_node_t *aclnode;

        while (aclnode = list_head(&aclp->z_acl)) {
                list_remove(&aclp->z_acl, aclnode);
                zfs_acl_node_free(aclnode);
        }
        aclp->z_acl_count = 0;
        aclp->z_acl_bytes = 0;
}

list_head() does a list_empty() check and returns NULL on an empty list. So if we got past that, list_remove() should never find an empty list; perhaps there is interference from another thread.
Steve Gonczi
2010-May-26 17:40 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
More info: the crashes go away just by swapping in a faster, more powerful CPU. On one box where the crash consistently happened (a slow 2-core CPU), I no longer see the crash after swapping in a quad-core.