thr3ads.net - zfs discuss - [zfs-discuss] zpool fails with panic in zio_ddt

If this information is useful, please help other people find it:
Share via:

Jim Klimov

2012-Feb-04 14:27 UTC

[zfs-discuss] zpool fails with panic in zio_ddt_free()

Hello all,

I am not sure my original mail got through to the list
(I haven''t received it back), so I attach it below.

Anyhow, now I have a saved kernel crash dump of the system
panicking when it tries to - I believe - deferred-release
the corrupted deduped blocks which are no longer referenced
by the userdata/blockpointer tree.

As I previously wrote in my thread on unfixeable corruptions
vs. dedup, there may be some single- or multiple-block
references from DDT pointing into "nowhere" - the block
pointed to contains non-matching data, and now the BP
tree does not point to those blocks either. And there
are other blocks with same recorded checksums (repaired
from another source).

I am not really sure what and how breaks the system into
panic. I would like to analyze the kernel dump to see if
my assumptions is true, and perhaps catch the data block
reference from the dump - but I don''t know how and ask
for a crash-course ;)

Ultimately I think that if the current system fails to
safely process whatever data it does find on-disk, then
it may be possible to fix that data (or the system) so
that the error is not there and fatal failures don''t
occur either.

In the current state of things the pool is practically
unimportable - within a minute after import the system
crashes.

I did not try rolling back yet... and without deferred
release of blocks (i.e. no guarantee that I can safely
roll back to a consistent BP tree where all nodes exist),
I am actually reluctant to try rollbacks until required to...

Thanks,
//Jim

2012-02-04 4:28, Jim Klimov wrote:> I got the machine with my 6-disk raidz2 pool booted again,
> into oi_151a, but it reboots soon after importing the pool.
> Kernel hits a NULL pointer dereference in DDT-related
> routines and crashes.
>
> According to fmdump, error and stacktrace is more or less
> the same each time. It seems that "repairing" corrupted
> deduped data by overwriting blocks or whole files with
> good copies did not go too well, even though all of my
> deduped datasets now use "dedup=verify" mode.
>
> The uptimes are too short for "savecore" to complete :(
> I''ll try to catch the system at a good moment to prevent
> "pool"''s importing and hope to get the kernel dump.
>
> What should I look for with ZDB, MDB or in dump files?
>
> Any suggestions how to analyze and ultimately fix this
> problem on-disk (without destroying and remaking the pool)?
>
>
>
> Here''s the latest fmdump:
>
> # fmdump -Vp -u 4f6725c1-509f-eba4-8774-e627e1925461
> TIME UUID SUNW-MSG-ID
> Feb 04 2012 04:11:20.930300000 4f6725c1-509f-eba4-8774-e627e1925461
> SUNOS-8000-KL
>
> TIME CLASS ENA
> Feb 04 04:11:11.3891 ireport.os.sunos.panic.dump_pending_on_device
> 0x0000000000000000
>
> nvlist version: 0
> version = 0x0
> class = list.suspect
> uuid = 4f6725c1-509f-eba4-8774-e627e1925461
> code = SUNOS-8000-KL
> diag-time = 1328314280 870353
> de = fmd:///module/software-diagnosis
> fault-list-sz = 0x1
> fault-list = (array of embedded nvlists)
> (start fault-list[0])
> nvlist version: 0
> version = 0x0
> class = defect.sunos.kernel.panic
> certainty = 0x64
> asru >
sw:///:path=/var/crash/bofh-sol/.4f6725c1-509f-eba4-8774-e627e1925461
> resource >
sw:///:path=/var/crash/bofh-sol/.4f6725c1-509f-eba4-8774-e627e1925461
> savecore-succcess = 0
> os-instance-uuid = 4f6725c1-509f-eba4-8774-e627e1925461
> panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920 addr=30
> occurred in module "zfs" due to a NULL pointer dereference
> panicstack = unix:die+dd () | unix:trap+1799 () | unix:cmntrap+e6 () |
> zfs:ddt_phys_decref+c () | zfs:zio_ddt_free+5c () | zfs:zio_execute+8d
> () | genunix:taskq_thread+285 () | unix:thread_start+8 () |
> crashtime = 1328314064
> panic-time = February 4, 2012 04:07:44 AM MSK MSK
> (end fault-list[0])
>
> fault-status = 0x1
> severity = Major
> __ttl = 0x1
> __tod = 0x4f2c77a8 0x37734060
>
> And here''s some dmesg leading up to that fmdump reference:
>
> Feb 4 04:11:10 bofh-sol zfs: [ID 249136 kern.info] imported version 28
> pool pool using 28
> Feb 4 04:11:11 bofh-sol savecore: [ID 570001 auth.error] reboot after
> panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920 addr=30
> occurred in module "zfs" due to a NULL pointer dereference
> Feb 4 04:11:11 bofh-sol savecore: [ID 564761 auth.error] Panic crashdump
> pending on dump device but dumpadm -n in effect; run savecore(1M)
> manually to extract. Image UUID 4f6725c1-509f-eba4-8774-e627e1925461.
> Feb 4 04:11:13 bofh-sol unix: [ID 954099 kern.info] NOTICE: IRQ17 is
> being shared by drivers with different interrupt levels.
> Feb 4 04:11:13 bofh-sol This may result in reduced system performance.
> Feb 4 04:11:20 bofh-sol fmd: [ID 377184 daemon.error] SUNW-MSG-ID:
> SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
> Feb 4 04:11:20 bofh-sol EVENT-TIME: Sat Feb 4 04:11:20 MSK 2012
> Feb 4 04:11:20 bofh-sol PLATFORM: System-Product-Name, CSN:
> System-Serial-Number, HOSTNAME: bofh-sol
> Feb 4 04:11:20 bofh-sol SOURCE: software-diagnosis, REV: 0.1
> Feb 4 04:11:20 bofh-sol EVENT-ID: 4f6725c1-509f-eba4-8774-e627e1925461
> Feb 4 04:11:20 bofh-sol DESC: The system has rebooted after a kernel
> panic. Refer to sun.com/msg/SUNOS-8000-KL for more information.
> Feb 4 04:11:20 bofh-sol AUTO-RESPONSE: The failed system image was
> dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a
> copy of the dump will be written to the savecore directory .
> Feb 4 04:11:20 bofh-sol IMPACT: There may be some performance impact
> while the panic is copied to the savecore directory. Disk space usage by
> panics can be substantial.
> Feb 4 04:11:20 bofh-sol REC-ACTION: If savecore is not enabled then
> please take steps to preserve the crash image.
> Feb 4 04:11:20 bofh-sol Use ''fmdump -Vp -u
> 4f6725c1-509f-eba4-8774-e627e1925461'' to view more panic detail.
Please
> refer to the knowledge article for additional information.
>
>
> Thanks in advance,
> //Jim Klimov
>

Jim Klimov

2012-Feb-09 18:43 UTC

head link

[zfs-discuss] zpool fails with panic in zio_ddt_free()

2012-02-04 18:27, Jim Klimov wrote:> panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920
> addr=30 occurred in module "zfs" due to a NULL pointer
dereference
> panicstack = unix:die+dd () | unix:trap+1799 () | unix:cmntrap+e6
> () | zfs:ddt_phys_decref+c () | zfs:zio_ddt_free+5c () |
> zfs:zio_execute+8d () | genunix:taskq_thread+285 () |
> unix:thread_start+8 () | crashtime = 1328314064
> panic-time = February 4, 2012 04:07:44 AM MSK MSK
> (end fault-list[0])
I''ve finally looked into the code, and the problem seems
weird indeed. With "properly" wrong on-disk data the code
can take such a path that it tries to dereference a NULL
pointer. This is a bug, methinks:

src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zio.c#zio_ddt_free

    2100 static int
    2101 zio_ddt_free(zio_t *zio)
    2102 {
    2103 	spa_t *spa = zio->io_spa;
    2104 	blkptr_t *bp = zio->io_bp;
    2105 	ddt_t *ddt = ddt_select(spa, bp);
    2106 	ddt_entry_t *dde;
    2107 	ddt_phys_t *ddp;
    2108
    2109 	ASSERT(BP_GET_DEDUP(bp));
    2110 	ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
    2111
    2112 	ddt_enter(ddt);
    2113 	freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
    2114 	ddp = ddt_phys_select(dde, bp);
    2115 	ddt_phys_decref(ddp);
    2116 	ddt_exit(ddt);
    2117
    2118 	return (ZIO_PIPELINE_CONTINUE);
    2119 }

The way I see it, this function:
2109) asserts that the dedup bit is set on the blockpointer,

2113) selects a DDT entry matching the BP with ddt_lookup()
       Strangely for me, it requests creation of a new DDT
       entry if a matching one doesn''t exist (B_TRUE) -
       even though this is a DDT-free routine.
       I doubt this autocreated entry gets a DVA assigned,
       but I did not trace the code that far to be certain.

2114) then it tries to find the "ddp" pointer to the DDT
       entry with ddt_phys_select(), which can validly return
       NULL if no DDT entries, with same DVA[0] and same
       phys_birth as the bp, exist. I speculated above that
       an autocreated DDT entry might have no DVA yet, maybe.
       I also wonder about cases where phys_birth TXGs would
       differ for some reason, even if an otherwise matching
       DDT entry exists (BP rewrite in future, some lag in
       committing-to-disk today)?
src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/ddt.c#ddt_phys_select

2115) ddt_phys_decref() just tries to use ddp as a pointer
       to structure without checking if it is NULL itself:
src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/ddt.c#ddt_phys_decref

     309 void
     310 ddt_phys_decref(ddt_phys_t *ddp)
     311 {
     312 	ASSERT((int64_t)ddp->ddp_refcnt > 0);
     313 	ddp->ddp_refcnt--;
     314 }
     315

According to my stacktrace in the kernel panic message,
this should be where the system panics and fails.

Seemingly this should work if on-disk data is consistent.

Since I''ve had some "strangeness" regarding corrupt data
blocks pointed to by DDT, with some hickups and reboots
during or soon-after repair attempts, I can''t vouch that
my on-disk data is consistent, or that the whole block
tree is readable and accessible to the OS now.

But the code allows for failure is something goes wrong.
I''d add a lot more NULL checks for values (especially
pointers) returned by functions called from zio_ddt_free()
and/or sanity checks for values fed into such functions.

Actually, I''d rewrite this part of zio_ddt_free() like
this:

-  	freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
+      	freedde = dde = ddt_lookup(ddt, bp, B_FALSE);
+	if (dde) {
        	    ddp = ddt_phys_select(dde, bp);
+	    if (ddp) {
         	ddt_phys_decref(ddp);
+	    }
+	}
  	ddt_exit(ddt);

If ddt_lookup() finds no matching DDE, fine - we''re freed.
If ddt_phys_select() finds no DDP, not so fine, but that''s
not a good reason to kernel-panic. Perhaps log a message
and maybe leave a leaked block? Or relax the phys_birth check?
Am I mistaken anywhere? :)

Thanks,
//Jim

Jim Klimov

2012-Feb-23 22:29 UTC

head link

[zfs-discuss] zpool fails with panic in zio_ddt_free()

2012-02-09 22:43, Jim Klimov ?????:> 2012-02-04 18:27, Jim Klimov wrote:
>> panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920
>> addr=30 occurred in module "zfs" due to a NULL pointer
dereference
>> panicstack = unix:die+dd () | unix:trap+1799 () | unix:cmntrap+e6
>> () | zfs:ddt_phys_decref+c () | zfs:zio_ddt_free+5c () |
>> zfs:zio_execute+8d () | genunix:taskq_thread+285 () |
>> unix:thread_start+8 () | crashtime = 1328314064
>> panic-time = February 4, 2012 04:07:44 AM MSK MSK
>> (end fault-list[0])
>
> I''ve finally looked into the code, and the problem seems
> weird indeed. With "properly" wrong on-disk data the code
> can take such a path that it tries to dereference a NULL
> pointer. (...)
> Actually, I''d rewrite this part of zio_ddt_free() like
> this (...)
> If ddt_lookup() finds no matching DDE, fine - we''re freed.
> If ddt_phys_select() finds no DDP, not so fine, but that''s
> not a good reason to kernel-panic. Perhaps log a message
> and maybe leave a leaked block? Or relax the phys_birth check?
> Am I mistaken anywhere? :)
To follow up, I couraged up and implemented this simple fix
on my (now tweaked) oi_151a installation (diff below).

While importing the pool and then while rsyncing the remaining
problematic files into it, I''ve got no more kernel panics.
There were several (expected) hits of the diagnostic message
(embedded into ddt_phys_select() below) reporting that it
has indeed failed to find a DDP matching the requested BP,
and returns a NULL. Further checks in ddt_phys_decref() now
skip dereferencing if the pointer is NULL, and the thing
"just works" okay ;)

Possibly, the ASSERTs added below into other functions
should be replaced by if() blocks to fire regardless of
debug/nondebug kernels. Anyhow, I still believe that
functions should do sanity-checks on input values, even
if at cost of some performance - so as to not dereference
NULL pointers, at least... ;)

I am still open to suggestions from ZFS coding gurus as to
why my solution or overall approach might be a bad idea,
or to accept the solution into upstream code ;)

I see that the zfs-code mailing list is essentially dead
for months, and the recent active writers were people from
zfs-dscuss, so I just post here...

==================Under /code/illumos-gate/usr/src/uts/common/fs/zfs :

--- ddt.c.orig  2012-01-25 20:20:11.000000000 +0400
+++ ddt.c       2012-02-24 02:14:29.273050980 +0400
@@ -297,20 +297,25 @@
  void
  ddt_phys_clear(ddt_phys_t *ddp)
  {
+       ASSERT(ddp != NULL);
         bzero(ddp, sizeof (*ddp));
  }

  void
  ddt_phys_addref(ddt_phys_t *ddp)
  {
+       ASSERT(ddp != NULL);
         ddp->ddp_refcnt++;
  }

  void
  ddt_phys_decref(ddt_phys_t *ddp)
  {
-       ASSERT((int64_t)ddp->ddp_refcnt > 0);
-       ddp->ddp_refcnt--;
+//     ASSERT(ddp != NULL);
+       if (ddp) {
+               ASSERT((int64_t)ddp->ddp_refcnt > 0);
+               ddp->ddp_refcnt--;
+       }
  }

  void
@@ -333,6 +338,17 @@
                     BP_PHYSICAL_BIRTH(bp) == ddp->ddp_phys_birth)
                         return (ddp);
         }
+       (void) printf("ddt_phys_select() found nothing for \n"
+           "\tDVA[BP] =<%llu:%llx:%llx> and phys_birth[BP]  %llu
and\n"
+           "\tDVA[DDP]=<%llu:%llx:%llx> and phys_birth[DDP]
%llu\n",
+           (u_longlong_t)DVA_GET_VDEV(BP_IDENTITY(bp)),
+           (u_longlong_t)DVA_GET_OFFSET(BP_IDENTITY(bp)),
+           (u_longlong_t)DVA_GET_ASIZE(BP_IDENTITY(bp)),
+           BP_PHYSICAL_BIRTH(bp),
+           (u_longlong_t)DVA_GET_VDEV(&ddp->ddp_dva[0]),
+           (u_longlong_t)DVA_GET_OFFSET(&ddp->ddp_dva[0]),
+           (u_longlong_t)DVA_GET_ASIZE(&ddp->ddp_dva[0]),
+           ddp->ddp_phys_birth);
         return (NULL);
  }

--- zio.c.orig  2012-01-25 20:20:12.000000000 +0400
+++ zio.c       2012-02-10 20:58:02.745614986 +0400
@@ -2111,8 +2111,12 @@

         ddt_enter(ddt);
         freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
-       ddp = ddt_phys_select(dde, bp);
-       ddt_phys_decref(ddp);
+       if (dde) {
+               ddp = ddt_phys_select(dde, bp);
+               if (ddp) {
+                       ddt_phys_decref(ddp);
+               }
+       }
         ddt_exit(ddt);

         return (ZIO_PIPELINE_CONTINUE);

==================
HTH,
//Jim Klimov

Reasonably Related Threads

Search for more seemingly similar threads

zfs discuss - Feb 2012 - zpool fails with panic in zio_ddt_free()

[zfs-discuss] zpool fails with panic in zio_ddt_free()

[zfs-discuss] zpool fails with panic in zio_ddt_free()

[zfs-discuss] zpool fails with panic in zio_ddt_free()

Reasonably Related Threads