Hello all, I am not sure my original mail got through to the list (I haven''t received it back), so I attach it below. Anyhow, now I have a saved kernel crash dump of the system panicking when it tries to - I believe - deferred-release the corrupted deduped blocks which are no longer referenced by the userdata/blockpointer tree. As I previously wrote in my thread on unfixeable corruptions vs. dedup, there may be some single- or multiple-block references from DDT pointing into "nowhere" - the block pointed to contains non-matching data, and now the BP tree does not point to those blocks either. And there are other blocks with same recorded checksums (repaired from another source). I am not really sure what and how breaks the system into panic. I would like to analyze the kernel dump to see if my assumptions is true, and perhaps catch the data block reference from the dump - but I don''t know how and ask for a crash-course ;) Ultimately I think that if the current system fails to safely process whatever data it does find on-disk, then it may be possible to fix that data (or the system) so that the error is not there and fatal failures don''t occur either. In the current state of things the pool is practically unimportable - within a minute after import the system crashes. I did not try rolling back yet... and without deferred release of blocks (i.e. no guarantee that I can safely roll back to a consistent BP tree where all nodes exist), I am actually reluctant to try rollbacks until required to... Thanks, //Jim 2012-02-04 4:28, Jim Klimov wrote:> I got the machine with my 6-disk raidz2 pool booted again, > into oi_151a, but it reboots soon after importing the pool. > Kernel hits a NULL pointer dereference in DDT-related > routines and crashes. > > According to fmdump, error and stacktrace is more or less > the same each time. It seems that "repairing" corrupted > deduped data by overwriting blocks or whole files with > good copies did not go too well, even though all of my > deduped datasets now use "dedup=verify" mode. > > The uptimes are too short for "savecore" to complete :( > I''ll try to catch the system at a good moment to prevent > "pool"''s importing and hope to get the kernel dump. > > What should I look for with ZDB, MDB or in dump files? > > Any suggestions how to analyze and ultimately fix this > problem on-disk (without destroying and remaking the pool)? > > > > Here''s the latest fmdump: > > # fmdump -Vp -u 4f6725c1-509f-eba4-8774-e627e1925461 > TIME UUID SUNW-MSG-ID > Feb 04 2012 04:11:20.930300000 4f6725c1-509f-eba4-8774-e627e1925461 > SUNOS-8000-KL > > TIME CLASS ENA > Feb 04 04:11:11.3891 ireport.os.sunos.panic.dump_pending_on_device > 0x0000000000000000 > > nvlist version: 0 > version = 0x0 > class = list.suspect > uuid = 4f6725c1-509f-eba4-8774-e627e1925461 > code = SUNOS-8000-KL > diag-time = 1328314280 870353 > de = fmd:///module/software-diagnosis > fault-list-sz = 0x1 > fault-list = (array of embedded nvlists) > (start fault-list[0]) > nvlist version: 0 > version = 0x0 > class = defect.sunos.kernel.panic > certainty = 0x64 > asru > sw:///:path=/var/crash/bofh-sol/.4f6725c1-509f-eba4-8774-e627e1925461 > resource > sw:///:path=/var/crash/bofh-sol/.4f6725c1-509f-eba4-8774-e627e1925461 > savecore-succcess = 0 > os-instance-uuid = 4f6725c1-509f-eba4-8774-e627e1925461 > panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920 addr=30 > occurred in module "zfs" due to a NULL pointer dereference > panicstack = unix:die+dd () | unix:trap+1799 () | unix:cmntrap+e6 () | > zfs:ddt_phys_decref+c () | zfs:zio_ddt_free+5c () | zfs:zio_execute+8d > () | genunix:taskq_thread+285 () | unix:thread_start+8 () | > crashtime = 1328314064 > panic-time = February 4, 2012 04:07:44 AM MSK MSK > (end fault-list[0]) > > fault-status = 0x1 > severity = Major > __ttl = 0x1 > __tod = 0x4f2c77a8 0x37734060 > > And here''s some dmesg leading up to that fmdump reference: > > Feb 4 04:11:10 bofh-sol zfs: [ID 249136 kern.info] imported version 28 > pool pool using 28 > Feb 4 04:11:11 bofh-sol savecore: [ID 570001 auth.error] reboot after > panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920 addr=30 > occurred in module "zfs" due to a NULL pointer dereference > Feb 4 04:11:11 bofh-sol savecore: [ID 564761 auth.error] Panic crashdump > pending on dump device but dumpadm -n in effect; run savecore(1M) > manually to extract. Image UUID 4f6725c1-509f-eba4-8774-e627e1925461. > Feb 4 04:11:13 bofh-sol unix: [ID 954099 kern.info] NOTICE: IRQ17 is > being shared by drivers with different interrupt levels. > Feb 4 04:11:13 bofh-sol This may result in reduced system performance. > Feb 4 04:11:20 bofh-sol fmd: [ID 377184 daemon.error] SUNW-MSG-ID: > SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major > Feb 4 04:11:20 bofh-sol EVENT-TIME: Sat Feb 4 04:11:20 MSK 2012 > Feb 4 04:11:20 bofh-sol PLATFORM: System-Product-Name, CSN: > System-Serial-Number, HOSTNAME: bofh-sol > Feb 4 04:11:20 bofh-sol SOURCE: software-diagnosis, REV: 0.1 > Feb 4 04:11:20 bofh-sol EVENT-ID: 4f6725c1-509f-eba4-8774-e627e1925461 > Feb 4 04:11:20 bofh-sol DESC: The system has rebooted after a kernel > panic. Refer to http://sun.com/msg/SUNOS-8000-KL for more information. > Feb 4 04:11:20 bofh-sol AUTO-RESPONSE: The failed system image was > dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a > copy of the dump will be written to the savecore directory . > Feb 4 04:11:20 bofh-sol IMPACT: There may be some performance impact > while the panic is copied to the savecore directory. Disk space usage by > panics can be substantial. > Feb 4 04:11:20 bofh-sol REC-ACTION: If savecore is not enabled then > please take steps to preserve the crash image. > Feb 4 04:11:20 bofh-sol Use ''fmdump -Vp -u > 4f6725c1-509f-eba4-8774-e627e1925461'' to view more panic detail. Please > refer to the knowledge article for additional information. > > > Thanks in advance, > //Jim Klimov >
2012-02-04 18:27, Jim Klimov wrote:> panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920 > addr=30 occurred in module "zfs" due to a NULL pointer dereference > panicstack = unix:die+dd () | unix:trap+1799 () | unix:cmntrap+e6 > () | zfs:ddt_phys_decref+c () | zfs:zio_ddt_free+5c () | > zfs:zio_execute+8d () | genunix:taskq_thread+285 () | > unix:thread_start+8 () | crashtime = 1328314064 > panic-time = February 4, 2012 04:07:44 AM MSK MSK > (end fault-list[0])I''ve finally looked into the code, and the problem seems weird indeed. With "properly" wrong on-disk data the code can take such a path that it tries to dereference a NULL pointer. This is a bug, methinks: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zio.c#zio_ddt_free 2100 static int 2101 zio_ddt_free(zio_t *zio) 2102 { 2103 spa_t *spa = zio->io_spa; 2104 blkptr_t *bp = zio->io_bp; 2105 ddt_t *ddt = ddt_select(spa, bp); 2106 ddt_entry_t *dde; 2107 ddt_phys_t *ddp; 2108 2109 ASSERT(BP_GET_DEDUP(bp)); 2110 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL); 2111 2112 ddt_enter(ddt); 2113 freedde = dde = ddt_lookup(ddt, bp, B_TRUE); 2114 ddp = ddt_phys_select(dde, bp); 2115 ddt_phys_decref(ddp); 2116 ddt_exit(ddt); 2117 2118 return (ZIO_PIPELINE_CONTINUE); 2119 } The way I see it, this function: 2109) asserts that the dedup bit is set on the blockpointer, 2113) selects a DDT entry matching the BP with ddt_lookup() Strangely for me, it requests creation of a new DDT entry if a matching one doesn''t exist (B_TRUE) - even though this is a DDT-free routine. I doubt this autocreated entry gets a DVA assigned, but I did not trace the code that far to be certain. 2114) then it tries to find the "ddp" pointer to the DDT entry with ddt_phys_select(), which can validly return NULL if no DDT entries, with same DVA[0] and same phys_birth as the bp, exist. I speculated above that an autocreated DDT entry might have no DVA yet, maybe. I also wonder about cases where phys_birth TXGs would differ for some reason, even if an otherwise matching DDT entry exists (BP rewrite in future, some lag in committing-to-disk today)? http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/ddt.c#ddt_phys_select 2115) ddt_phys_decref() just tries to use ddp as a pointer to structure without checking if it is NULL itself: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/ddt.c#ddt_phys_decref 309 void 310 ddt_phys_decref(ddt_phys_t *ddp) 311 { 312 ASSERT((int64_t)ddp->ddp_refcnt > 0); 313 ddp->ddp_refcnt--; 314 } 315 According to my stacktrace in the kernel panic message, this should be where the system panics and fails. Seemingly this should work if on-disk data is consistent. Since I''ve had some "strangeness" regarding corrupt data blocks pointed to by DDT, with some hickups and reboots during or soon-after repair attempts, I can''t vouch that my on-disk data is consistent, or that the whole block tree is readable and accessible to the OS now. But the code allows for failure is something goes wrong. I''d add a lot more NULL checks for values (especially pointers) returned by functions called from zio_ddt_free() and/or sanity checks for values fed into such functions. Actually, I''d rewrite this part of zio_ddt_free() like this: - freedde = dde = ddt_lookup(ddt, bp, B_TRUE); + freedde = dde = ddt_lookup(ddt, bp, B_FALSE); + if (dde) { ddp = ddt_phys_select(dde, bp); + if (ddp) { ddt_phys_decref(ddp); + } + } ddt_exit(ddt); If ddt_lookup() finds no matching DDE, fine - we''re freed. If ddt_phys_select() finds no DDP, not so fine, but that''s not a good reason to kernel-panic. Perhaps log a message and maybe leave a leaked block? Or relax the phys_birth check? Am I mistaken anywhere? :) Thanks, //Jim
2012-02-09 22:43, Jim Klimov ?????:> 2012-02-04 18:27, Jim Klimov wrote: >> panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920 >> addr=30 occurred in module "zfs" due to a NULL pointer dereference >> panicstack = unix:die+dd () | unix:trap+1799 () | unix:cmntrap+e6 >> () | zfs:ddt_phys_decref+c () | zfs:zio_ddt_free+5c () | >> zfs:zio_execute+8d () | genunix:taskq_thread+285 () | >> unix:thread_start+8 () | crashtime = 1328314064 >> panic-time = February 4, 2012 04:07:44 AM MSK MSK >> (end fault-list[0]) > > I''ve finally looked into the code, and the problem seems > weird indeed. With "properly" wrong on-disk data the code > can take such a path that it tries to dereference a NULL > pointer. (...)> Actually, I''d rewrite this part of zio_ddt_free() like > this (...) > If ddt_lookup() finds no matching DDE, fine - we''re freed. > If ddt_phys_select() finds no DDP, not so fine, but that''s > not a good reason to kernel-panic. Perhaps log a message > and maybe leave a leaked block? Or relax the phys_birth check? > Am I mistaken anywhere? :)To follow up, I couraged up and implemented this simple fix on my (now tweaked) oi_151a installation (diff below). While importing the pool and then while rsyncing the remaining problematic files into it, I''ve got no more kernel panics. There were several (expected) hits of the diagnostic message (embedded into ddt_phys_select() below) reporting that it has indeed failed to find a DDP matching the requested BP, and returns a NULL. Further checks in ddt_phys_decref() now skip dereferencing if the pointer is NULL, and the thing "just works" okay ;) Possibly, the ASSERTs added below into other functions should be replaced by if() blocks to fire regardless of debug/nondebug kernels. Anyhow, I still believe that functions should do sanity-checks on input values, even if at cost of some performance - so as to not dereference NULL pointers, at least... ;) I am still open to suggestions from ZFS coding gurus as to why my solution or overall approach might be a bad idea, or to accept the solution into upstream code ;) I see that the zfs-code mailing list is essentially dead for months, and the recent active writers were people from zfs-dscuss, so I just post here... ==================Under /code/illumos-gate/usr/src/uts/common/fs/zfs : --- ddt.c.orig 2012-01-25 20:20:11.000000000 +0400 +++ ddt.c 2012-02-24 02:14:29.273050980 +0400 @@ -297,20 +297,25 @@ void ddt_phys_clear(ddt_phys_t *ddp) { + ASSERT(ddp != NULL); bzero(ddp, sizeof (*ddp)); } void ddt_phys_addref(ddt_phys_t *ddp) { + ASSERT(ddp != NULL); ddp->ddp_refcnt++; } void ddt_phys_decref(ddt_phys_t *ddp) { - ASSERT((int64_t)ddp->ddp_refcnt > 0); - ddp->ddp_refcnt--; +// ASSERT(ddp != NULL); + if (ddp) { + ASSERT((int64_t)ddp->ddp_refcnt > 0); + ddp->ddp_refcnt--; + } } void @@ -333,6 +338,17 @@ BP_PHYSICAL_BIRTH(bp) == ddp->ddp_phys_birth) return (ddp); } + (void) printf("ddt_phys_select() found nothing for \n" + "\tDVA[BP] =<%llu:%llx:%llx> and phys_birth[BP] %llu and\n" + "\tDVA[DDP]=<%llu:%llx:%llx> and phys_birth[DDP] %llu\n", + (u_longlong_t)DVA_GET_VDEV(BP_IDENTITY(bp)), + (u_longlong_t)DVA_GET_OFFSET(BP_IDENTITY(bp)), + (u_longlong_t)DVA_GET_ASIZE(BP_IDENTITY(bp)), + BP_PHYSICAL_BIRTH(bp), + (u_longlong_t)DVA_GET_VDEV(&ddp->ddp_dva[0]), + (u_longlong_t)DVA_GET_OFFSET(&ddp->ddp_dva[0]), + (u_longlong_t)DVA_GET_ASIZE(&ddp->ddp_dva[0]), + ddp->ddp_phys_birth); return (NULL); } --- zio.c.orig 2012-01-25 20:20:12.000000000 +0400 +++ zio.c 2012-02-10 20:58:02.745614986 +0400 @@ -2111,8 +2111,12 @@ ddt_enter(ddt); freedde = dde = ddt_lookup(ddt, bp, B_TRUE); - ddp = ddt_phys_select(dde, bp); - ddt_phys_decref(ddp); + if (dde) { + ddp = ddt_phys_select(dde, bp); + if (ddp) { + ddt_phys_decref(ddp); + } + } ddt_exit(ddt); return (ZIO_PIPELINE_CONTINUE); ================== HTH, //Jim Klimov
Apparently Analagous Threads
- Shared clipboard not working
- Naming violation: objectClass: myObjectClass not a valid child class for <>
- Naming violation: objectClass: myObjectClass not a valid child class for <>
- Naming violation: objectClass: myObjectClass not a valid child class for <>
- sysvol / netlogon / logoinscript