Steve Gonczi
2010-May-25 19:03 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
Greetings,
I see repeatable crashes on some systems after upgrading. The signature is
always the same:
operating system: 5.11 snv_139 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff00175f88c0 addr=0
occurred in module "genunix" due to a NULL pointer dereference
list_remove+0x1b(ffffff03e19339f0, ffffff03e0814640)
zfs_acl_release_nodes+0x34(ffffff03e19339c0)
zfs_acl_free+0x16(ffffff03e19339c0)
zfs_znode_free+0x5e(ffffff03e17fa600)
zfs_zinactive+0x9b(ffffff03e17fa600)
zfs_inactive+0x11c(ffffff03e17f8500, ffffff03ee867528, 0)
fop_inactive+0xaf(ffffff03e17f8500, ffffff03ee867528, 0)
vn_rele_dnlc+0x6c(ffffff03e17f8500)
dnlc_purge+0x175()
nfs_idmap_args+0x5e(ffffff00175f8c38)
nfssys+0x1e1(12, 8047dd8)
The stack always looks like the above; the vnode involved is sometimes a file,
sometimes a directory. For example, I have seen the /boot/acpi directory and the
/kernel/drv/amd64/acpi_driver
file in the vnode's path field.
Looking at the data, I notice that z_acl.list_head indicates a single
member in the list (I presume that is the case
because list_prev and list_next point to the same address):
(ffffff03e19339c0)::print zfs_acl_t
{
z_acl_count = 0x6
z_acl_bytes = 0x30
z_version = 0x1
z_next_ace = 0xffffff03e171d210
z_hints = 0
z_curr_node = 0xffffff03e0814640
z_acl = {
list_size = 0x40
list_offset = 0
list_head = {
list_next = 0xffffff03e0814640
list_prev = 0xffffff03e0814640
}
}
This member's next pointer is bad (sometimes zero, sometimes a low
number, e.g. 0x10).
The NULL pointer crash happens while trying to follow the list_prev pointer:
0xffffff03e0814640::print zfs_acl_node_t
{
z_next = {
list_next = 0
list_prev = 0
}
z_acldata = 0xffffff03e10b6230
z_allocdata = 0xffffff03e171d200
z_allocsize = 0x30
z_size = 0x30
z_ace_count = 0x6
z_ace_idx = 0x2
}
This is a repeating pattern: as far as I can tell, always a single zfs_acl_node
in the list,
with NULL or garbage list_next and list_prev pointers.
For example, in another instance of this crash, the zfs_acl_node looks like this:
::stack
list_remove+0x1b(ffffff03e10d24f0, ffffff03e0fc9a00)
zfs_acl_release_nodes+0x34(ffffff03e10d24c0)
zfs_acl_free+0x16(ffffff03e10d24c0)
zfs_znode_free+0x5e(ffffff03e10cc200)
zfs_zinactive+0x9b(ffffff03e10cc200)
zfs_inactive+0x11c(ffffff03e1281840, ffffff03ea5c7010, 0)
fop_inactive+0xaf(ffffff03e1281840, ffffff03ea5c7010, 0)
vn_rele_dnlc+0x6c(ffffff03e1281840)
dnlc_purge+0x175()
nfs_idmap_args+0x5e(ffffff001811ac38)
nfssys+0x1e1(12, 8047dd8)
_sys_sysenter_post_swapgs+0x149()
> ::status
...
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff001811a8c0 addr=10
occurred in module "genunix" due to a NULL pointer dereference
> ffffff03e0fc9a00::print zfs_acl_node_t
{
z_next = {
list_next = 0xffffff03e10e1cd9
list_prev = 0x10
}
z_acldata = 0
z_allocdata = 0xffffff03e10cb5d0
z_allocsize = 0x30
z_size = 0x30
z_ace_count = 0x6
z_ace_idx = 0x2
}
Looks to me like the crash here is the same, and list_next / list_prev are garbage.
Anybody seen this?
Am I skipping too many builds when I image-update?
I am hoping someone who knows this code would chime in.
Steve
--
This message posted from opensolaris.org
Steve Gonczi
2010-May-25 20:21 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
As I look at this further, I am becoming convinced this should really fire an
assert.
(I am running release builds, so asserts do not fire.)
In a debug build, I believe I would be hitting the !list_empty() assert in:
void
list_remove(list_t *list, void *object)
{
	list_node_t *lold = list_d2l(list, object);
	ASSERT(!list_empty(list));
	ASSERT(lold->list_next != NULL);
	list_remove_node(lold);
}
I suspect this may be a race.
Assuming there is no other interfering thread, this crash could never happen:
static void
zfs_acl_release_nodes(zfs_acl_t *aclp)
{
	zfs_acl_node_t *aclnode;

	while (aclnode = list_head(&aclp->z_acl)) {
		list_remove(&aclp->z_acl, aclnode);
		zfs_acl_node_free(aclnode);
	}
	aclp->z_acl_count = 0;
	aclp->z_acl_bytes = 0;
}
list_head() does a list_empty() check and returns NULL on an empty list.
So if we got past that, list_remove() should never find an empty list; perhaps
there is interference from another thread.
Steve Gonczi
2010-May-26 17:40 UTC
[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140
More info: the crashes go away just by swapping in a faster CPU with more
horsepower. On one box where the crash happened consistently (a slow two-core
CPU), I no longer see the crash after swapping in a quad core.