Hello all,
While it is deemed uncool to reply to one's own posts,
there's often no other choice ;)
Here is some more detail on that failure: the problem
was traced to the rpool, and any attempt to import it
(including rollback or read-only modes) leads to an
immediate freeze of the system, with those warnings
on the console.
My current guess is that ZFS wrongly tries to use a
"very" old TXG number (beyond the actual last 128)
which references since-overwritten metadata, leading
to seeming inconsistencies such as double allocations
and double frees. I am not sure how to properly
"roll forward" the TXGs in the pool labels so that
they point to a newer, COW-secured block hierarchy.
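For reference, here is roughly how one can list which
TXGs the on-disk uberblocks in a label actually hold.
This is only a minimal sketch, assuming the standard
label layout (four 256 KiB labels, each with an array
of 128 one-KiB uberblock slots starting 128 KiB into
the label) and little-endian byte order; the device
path below is just my rpool disk:

import struct

UB_MAGIC   = 0x00bab10c   # uberblock magic ("oo-ba-bloc")
LABEL_SIZE = 256 * 1024   # each of the 4 labels is 256 KiB
UB_OFFSET  = 128 * 1024   # uberblock array starts 128 KiB in
UB_SLOT    = 1024         # one 1 KiB slot per uberblock

def list_uberblocks(path, label_offset=0):
    with open(path, 'rb') as dev:
        dev.seek(label_offset + UB_OFFSET)
        data = dev.read(LABEL_SIZE - UB_OFFSET)
    for slot in range(128):
        # first five uint64 fields of an uberblock:
        # magic, version, txg, guid_sum, timestamp
        magic, ver, txg, guid_sum, ts = struct.unpack_from(
            '<5Q', data, slot * UB_SLOT)
        if magic == UB_MAGIC:
            print('slot %3d  txg %d  timestamp %d' % (slot, txg, ts))

list_uberblocks('/dev/dsk/c4t1d0s0')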
Details follow:
Some ZDB research has shown that, according to the
labels, the latest TXG is 500179 (zdb -l). However,
the pool history mentions newer TXGs (zdb -h):
2011-12-19.20:00:00 zpool clear rpool
2011-12-19.20:00:00 [internal pool scrub txg:500179]
func=1 mintxg=0 maxtxg=500179
2011-12-19.20:00:10 zpool scrub rpool
2011-12-19.20:19:44 [internal pool scrub done txg:500355]
complete=1
When I tried the ZFS forensics script from the sources
linked below (the original source link is down at this
time), it reported yet newer TXG numbers, ranging from
500422 to 500549, and not including either of the
values discovered above.
Info page:
* [1]
http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script
Script code:
* [2] http://markmail.org/download.xqy?id=gde5k3zynpfhftgd&number=1
I tried to roll back to TXG 500535, which was about
a minute before the most recent one and the presumed
crash. Here's a screenshot AFTER the rollback:
# ./zfs_revert.py -bs=512 -tb=75409110 /dev/dsk/c4t1d0s0
512
Total of 75409110 blocks
Reading from the beginning to 131072 blocks
Reading from 75278038 blocks to the end
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 44.791 s, 1.5 MB/s
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 26.8802 s, 2.5 MB/s
----
TXG TIME TIMESTAMP BLOCK ADDRESSES
500422 19 Dec 2011 20:25:17 1324326317 [396, 908, 75408268, 75408780]
500423 19 Dec 2011 20:25:22 1324326322 [398, 910, 75408270, 75408782]
...
500530 19 Dec 2011 20:32:27 1324326747 [356, 868, 75408228, 75408740]
500531 19 Dec 2011 20:32:31 1324326751 [358, 870, 75408230, 75408742]
500532 19 Dec 2011 20:32:37 1324326757 [360, 872, 75408232, 75408744]
500533 19 Dec 2011 20:32:40 1324326760 [362, 874, 75408234, 75408746]
500534 19 Dec 2011 20:32:44 1324326764 [364, 876, 75408236, 75408748]
500535 19 Dec 2011 20:32:48 1324326768 [366, 878, 75408238, 75408750]
What is the last TXG you wish to keep?
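By the way, the TIMESTAMP column above is plain Unix
time; a quick check in Python, using the last row as
an example, confirms the TIME column is its UTC form:

import time
# 1324326768 is the TIMESTAMP of txg 500535 above
print(time.strftime('%d %b %Y %H:%M:%S',
                    time.gmtime(1324326768)))
# prints: 19 Dec 2011 20:32:48, matching TIME above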
Apparently, the script did roll back TXGs on disk
(I did not look deeply into it; probably it zeroed
and invalidated the newer uberblocks); however, the
pool label still references the out-of-range TXG
number 500179.
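To verify what the rollback actually left on disk,
a sketch like the following could report the highest
surviving uberblock TXG in each of the four labels
(two at the front and two at the tail of the device).
It reuses the layout and byte-order assumptions from
the earlier sketch, plus the block count printed by
zfs_revert.py; note that ZFS places the tail labels
relative to the vdev's usable (aligned) size, which
can be slightly less than the raw device size, so
the tail offsets may need adjustment:

import struct

UB_MAGIC   = 0x00bab10c
LABEL_SIZE = 256 * 1024
UB_OFFSET  = 128 * 1024

def label_max_txg(dev, label_offset):
    # Scan all 128 slots; keep TXGs of slots with a valid magic
    dev.seek(label_offset + UB_OFFSET)
    data = dev.read(LABEL_SIZE - UB_OFFSET)
    txgs = []
    for slot in range(128):
        magic, ver, txg = struct.unpack_from('<3Q', data, slot * 1024)
        if magic == UB_MAGIC:
            txgs.append(txg)
    return max(txgs) if txgs else None

dev_size = 75409110 * 512   # total blocks * block size, as above
offsets = [0, LABEL_SIZE,                # labels 0 and 1 (front)
           dev_size - 2 * LABEL_SIZE,    # label 2 (tail, approximate)
           dev_size - LABEL_SIZE]        # label 3 (tail, approximate)
with open('/dev/dsk/c4t1d0s0', 'rb') as dev:
    for n, off in enumerate(offsets):
        print('label %d: max live txg = %s' % (n, label_max_txg(dev, off)))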
I've had some strange problems with ZDB's "-t"
option: when I referenced the pool with either
"-e rpool" or "-e GUIDNUMBER", it complained about
not finding "rpool" (which is not imported and can't
be), while without this flag it apparently used the
wrong old TXG number:
root@openindiana:~# zdb -b -t 500355 -e 12076177533503245216
zdb: can't open 'rpool': No such device or address
root@openindiana:~# zdb -b -F -t 500355 -e 12076177533503245216
zdb: can't open 'rpool': No such device or address
root@openindiana:~# zdb -b -F -e 12076177533503245216
Traversing all blocks to verify nothing leaked ...
error: zfs: freeing free segment (offset=3146341888 size=1024)
Abort (core dumped)
root@openindiana:~# zdb -b -F -e rpool
Traversing all blocks to verify nothing leaked ...
error: zfs: freeing free segment (offset=3146341888 size=1024)
Abort (core dumped)
So... "Kowalski, options?" (C) Madagascar
Thanks,
//Jim Klimov
2011-12-24 1:43, Jim Klimov wrote:
> Hello all,
>
> My computer has recently crashed with the following messages
> as the last ones displayed; they also pop up on boot attempts,
> and then the system freezes:
>
> Dec 20 00:33:12 bofh-sol genunix: [ID 415322 kern.warning]
> WARNING: zfs: allocating allocated segment(offset=9662417920 size=512)
>
> Dec 20 00:33:14 bofh-sol genunix: [ID 361072 kern.warning]
> WARNING: zfs: freeing free segment (offset=9608101376 size=1536)
>
> I believe it is not good ;)
> But the message has no info even on which pool the error was,
> and a 9 GB offset could be on either the rpool or the data pool.
>
> What can be done to debug and repair this? ;)
>
> Thanks,
> //Jim Klimov