I have a ZFS pool that has been corrupted. The pool contains a single device
which was actually a file on UFS. The machine was accidentally halted and now
the pool is corrupt. There are (of course) no backups and I've been asked to
recover the pool.

The system panics when trying to do anything with the pool.

root@:/$ zpool status
panic[cpu1]/thread=fffffe8000758c80: assertion failed:
dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 (0x5 == 0x0),
file: ../../common/fs/zfs/space_map.c, line: 319
<system reboots>

I've booted single user, moved /etc/zfs/zpool.cache out of the way, and now
have access to the pool from the command line. However zdb fails with a
similar assertion.

root@kestrel:/opt$ zdb -U -bcv zones
Traversing all blocks to verify checksums and verify nothing leaked ...
Assertion failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0
(0x5 == 0x0), file ../../../uts/common/fs/zfs/space_map.c, line 319
Abort (core dumped)

I've read Victor's suggestion to invalidate the active uberblock, forcing ZFS
to use an older uberblock and thereby recovering the pool. However I don't
know how to work out the offset of the active uberblock. I have the following
information from zdb.

root@kestrel:/opt$ zdb -U -uuuv zones
Uberblock
    magic = 0000000000bab10c
    version = 4
    txg = 1504158
    guid_sum = 10365405068077835008
    timestamp = 1229142108 UTC = Sat Dec 13 15:21:48 2008
    rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:52e3edc00:200>
        DVA[1]=<0:6f9c1d600:200> DVA[2]=<0:16e280400:200> fletcher4 lzjb LE
        contiguous birth=1504158 fill=172
        cksum=b0a5275f3:474e0ed6469:e993ed9bee4d:205661fa1d4016

I've also checked the labels.

root@kestrel:/opt$ zdb -U -lv zpool.zones
--------------------------------------------
LABEL 0
--------------------------------------------
    version=4
    name='zones'
    state=0
    txg=4
    pool_guid=17407806223688303760
    top_guid=11404342918099082864
    guid=11404342918099082864
    vdev_tree
        type='file'
        id=0
        guid=11404342918099082864
        path='/opt/zpool.zones'
        metaslab_array=14
        metaslab_shift=28
        ashift=9
        asize=42944954368
--------------------------------------------
LABEL 1
--------------------------------------------
    version=4
    name='zones'
    state=0
    txg=4
    pool_guid=17407806223688303760
    top_guid=11404342918099082864
    guid=11404342918099082864
    vdev_tree
        type='file'
        id=0
        guid=11404342918099082864
        path='/opt/zpool.zones'
        metaslab_array=14
        metaslab_shift=28
        ashift=9
        asize=42944954368
--------------------------------------------
LABEL 2
--------------------------------------------
    version=4
    name='zones'
    state=0
    txg=4
    pool_guid=17407806223688303760
    top_guid=11404342918099082864
    guid=11404342918099082864
    vdev_tree
        type='file'
        id=0
        guid=11404342918099082864
        path='/opt/zpool.zones'
        metaslab_array=14
        metaslab_shift=28
        ashift=9
        asize=42944954368
--------------------------------------------
LABEL 3
--------------------------------------------
    version=4
    name='zones'
    state=0
    txg=4
    pool_guid=17407806223688303760
    top_guid=11404342918099082864
    guid=11404342918099082864
    vdev_tree
        type='file'
        id=0
        guid=11404342918099082864
        path='/opt/zpool.zones'
        metaslab_array=14
        metaslab_shift=28
        ashift=9
        asize=42944954368

I'm hoping somebody here can give me direction on how to work out the active
uberblock offset, and the dd parameters I'd need to intentionally corrupt the
uberblock and force an earlier uberblock into service. The pool is currently
on Solaris 05/08, however I'll transfer the pool to OpenSolaris if necessary.
I have moved the zpool image file to an OpenSolaris machine running 101b.

root@opensolaris:~# uname -a
SunOS opensolaris 5.11 snv_101b i86pc i386 i86pc Solaris

Here I am able to attempt an import of the pool and at least the OS does not
panic.

root@opensolaris:~# zpool import -d /mnt
  pool: zones
    id: 17407806223688303760
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though
        some features will not be available without an explicit 'zpool upgrade'.
config:

        zones               ONLINE
          /mnt/zpool.zones  ONLINE

But it hangs forever when I actually attempt the import.

root@opensolaris:~# zpool import -d /mnt -f zones
<never returns>

The thread associated with the import is stuck on txg_wait_synced.

root@opensolaris:~# echo "0t757::pid2proc|::walk thread|::findstack -v" | mdb -k
stack pointer for thread d6dcc800: d51bdc44
  d51bdc74 swtch+0x195()
  d51bdc84 cv_wait+0x53(d62ef1e6, d62ef1a8, d51bdcc4, fa15f9e1)
  d51bdcc4 txg_wait_synced+0x90(d62ef040, 0, 0, 2)
  d51bdd34 spa_load+0xd0b(d6c1f080, da5dccd8, 2, 1)
  d51bdd84 spa_import_common+0xbd()
  d51bddb4 spa_import+0x18(d6c8f000, da5dccd8, 0, fa187dac)
  d51bdde4 zfs_ioc_pool_import+0xcd(d6c8f000, 0, 0)
  d51bde14 zfsdev_ioctl+0xe0()
  d51bde44 cdev_ioctl+0x31(2d80000, 5a02, 8042450, 100003, da532b28, d51bdf00)
  d51bde74 spec_ioctl+0x6b(d6dbfc80, 5a02, 8042450, 100003, da532b28, d51bdf00)
  d51bdec4 fop_ioctl+0x49(d6dbfc80, 5a02, 8042450, 100003, da532b28, d51bdf00)
  d51bdf84 ioctl+0x171()
  d51bdfac sys_call+0x10c()

There is a corresponding thread stuck on zio_wait.

d5e50de0 fec1dad8        0    0  60 d5cb76c8
  PC: _resume_from_idle+0xb1    THREAD: txg_sync_thread()
  stack pointer for thread d5e50de0: d5e50a58
    swtch+0x195()
    cv_wait+0x53()
    zio_wait+0x55()
    dbuf_read+0x1fd()
    dbuf_will_dirty+0x30()
    dmu_write+0xd7()
    space_map_sync+0x304()
    metaslab_sync+0x284()
    vdev_sync+0xc6()
    spa_sync+0x35c()
    txg_sync_thread+0x295()
    thread_start+8()

I see from another discussion on zfs-discuss that Victor Latushkin helped
Erik Gulliksson recover from a similar situation by using a specially patched
zfs module. Would it be possible for me to get that same module?
I don't know if this is relevant or merely a coincidence, but the zdb command
fails an assertion in the same txg_wait_synced function.

root@opensolaris:~# zdb -p /mnt -e zones
Assertion failed: tx->tx_threads == 2, file ../../../uts/common/fs/zfs/txg.c,
line 423, function txg_wait_synced
Abort (core dumped)
On Mon, 15 Dec 2008 06:12:19 PST, Nathan Hand wrote:

> I have moved the zpool image file to an OpenSolaris machine running 101b.
>
> root@opensolaris:~# uname -a
> SunOS opensolaris 5.11 snv_101b i86pc i386 i86pc Solaris
>
> Here I am able to attempt an import of the pool and at least the OS does
> not panic.
[snip]
> But it hangs forever when I actually attempt the import.

The failmode is a property of the pool.

       PROPERTY        EDIT    VALUES
       failmode        YES     wait | continue | panic

In OpenSolaris 101b it defaults to "wait", and that's what seems to be
happening now. Perhaps you can change it to continue (and keep your fingers
crossed)?
--
( Kees Nuyt )
c[_]
Thanks for the reply. I tried the following:

$ zpool import -o failmode=continue -d /mnt -f zones

But the situation did not improve. It still hangs on the import.
I've had some success. I started with the ZFS on-disk format PDF.

http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf

The uberblocks all have the magic value 0x00bab10c. I used od -x to find that
value in the vdev.

root@opensolaris:~# od -A x -x /mnt/zpool.zones | grep "b10c 00ba"
020000 b10c 00ba 0000 0000 0004 0000 0000 0000
020400 b10c 00ba 0000 0000 0004 0000 0000 0000
020800 b10c 00ba 0000 0000 0004 0000 0000 0000
020c00 b10c 00ba 0000 0000 0004 0000 0000 0000
021000 b10c 00ba 0000 0000 0004 0000 0000 0000
021400 b10c 00ba 0000 0000 0004 0000 0000 0000
021800 b10c 00ba 0000 0000 0004 0000 0000 0000
021c00 b10c 00ba 0000 0000 0004 0000 0000 0000
022000 b10c 00ba 0000 0000 0004 0000 0000 0000
022400 b10c 00ba 0000 0000 0004 0000 0000 0000
...

So the uberblock array begins 128kB into the vdev and there's an uberblock
every 1kB. To identify the active uberblock I used zdb.

root@kestrel:/opt$ zdb -U -uuuv zones
Uberblock
    magic = 0000000000bab10c
    version = 4
    txg = 1504158 (= 0x16F39E)
    guid_sum = 10365405068077835008 (= 0x8FD950FDBBD02300)
    timestamp = 1229142108 UTC = Sat Dec 13 15:21:48 2008 (= 0x4943385C)
    rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:52e3edc00:200>
        DVA[1]=<0:6f9c1d600:200> DVA[2]=<0:16e280400:200> fletcher4 lzjb LE
        contiguous birth=1504158 fill=172
        cksum=b0a5275f3:474e0ed6469:e993ed9bee4d:205661fa1d4016

I spy those hex values in the uberblock starting at 027800.

027800 b10c 00ba 0000 0000 0004 0000 0000 0000
027810 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
027820 385c 4943 0000 0000 0001 0000 0000 0000
027830 1f6e 0297 0000 0000 0001 0000 0000 0000
027840 e0eb 037c 0000 0000 0001 0000 0000 0000
027850 1402 00b7 0000 0000 0001 0000 0703 800b
027860 0000 0000 0000 0000 0000 0000 0000 0000
027870 0000 0000 0000 0000 f39e 0016 0000 0000
027880 00ac 0000 0000 0000 75f3 0a52 000b 0000
027890 6469 e0ed 0474 0000 ee4d ed9b e993 0000
0278a0 4016 fa1d 5661 0020 0000 0000 0000 0000
0278b0 0000 0000 0000 0000 0000 0000 0000 0000

Breaking it down:

* the first 8 bytes are the uberblock magic number (b10c 00ba 0000 0000)
* the second 8 bytes are the version number (0004 0000 0000 0000)
* the third 8 bytes are the transaction group, a.k.a. txg (f39e 0016 0000 0000)
* the fourth 8 bytes are the guid sum (2300 bbd0 50fd 8fd9)
* the fifth 8 bytes are the timestamp (385c 4943 0000 0000)

The remainder of the bytes are the blkptr structure and I'll ignore them.

Those values match the active uberblock exactly, so I know this is the
on-disk location of the first copy of the active uberblock. Scanning further
I find an exact duplicate 256kB later in the device.

067800 b10c 00ba 0000 0000 0004 0000 0000 0000
067810 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
067820 385c 4943 0000 0000 0001 0000 0000 0000
067830 1f6e 0297 0000 0000 0001 0000 0000 0000
067840 e0eb 037c 0000 0000 0001 0000 0000 0000
067850 1402 00b7 0000 0000 0001 0000 0703 800b
067860 0000 0000 0000 0000 0000 0000 0000 0000
067870 0000 0000 0000 0000 f39e 0016 0000 0000
067880 00ac 0000 0000 0000 75f3 0a52 000b 0000
067890 6469 e0ed 0474 0000 ee4d ed9b e993 0000
0678a0 4016 fa1d 5661 0020 0000 0000 0000 0000
0678b0 0000 0000 0000 0000 0000 0000 0000 0000

I know ZFS keeps four copies of the label: two at the front of the device and
two at the back, each 256kB in size.

root@opensolaris:~# ls -l /mnt/zpool.zones
-rw-r--r-- 1 root root 42949672960 Dec 15 04:49 /mnt/zpool.zones

That's 0xA00000000 = 42949672960 bytes = 41943040kB. If I subtract 512kB I
should see the third and fourth labels.
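A minimal Python sketch (illustrative only; the 256kB label size, the
two-front/two-back label layout, and the 0x27800 slot offset are taken from
the output above, not read back from the pool) turns that layout into the
byte and dd-friendly 1kB offsets of all four copies of this uberblock slot:

# Sketch only: where should the four copies of the active uberblock slot be?
vdev_size = 42949672960    # size of /mnt/zpool.zones in bytes
label_size = 256 * 1024    # each label is 256kB
slot_off = 0x27800         # active uberblock slot, relative to label start

label_starts = [0, label_size, vdev_size - 2 * label_size, vdev_size - label_size]
for i, start in enumerate(label_starts):
    off = start + slot_off
    print('label %d: byte offset 0x%x (dd bs=1k skip=%d)' % (i, off, off // 1024))

For this image that works out to skip values of 158, 414, 41942686 and
41942942. The dd commands that follow confirm what is actually on disk at
those offsets.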
root@opensolaris:~# dd if=/mnt/zpool.zones bs=1k skip=41942528 | od -A x -x | grep "385c 4943 0000 0000"
027820 385c 4943 0000 0000 0001 0000 0000 0000
512+0 records in
512+0 records out
524288 bytes (524 kB) copied, 0.0577013 s, 9.1 MB/s
root@opensolaris:~#

Oddly enough I see the third copy of the uberblock at 0x27800 within that
region, but the fourth copy at 0x67800 is missing. Perhaps corrupted? No
matter. I now work out the exact offsets of the three valid copies and
confirm I'm looking at the right uberblocks.

root@opensolaris:~# dd if=/mnt/zpool.zones bs=1k skip=158 | od -A x -x | head -3
000000 b10c 00ba 0000 0000 0004 0000 0000 0000
000010 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
000020 385c 4943 0000 0000 0001 0000 0000 0000

root@opensolaris:~# dd if=/mnt/zpool.zones bs=1k skip=414 | od -A x -x | head -3
000000 b10c 00ba 0000 0000 0004 0000 0000 0000
000010 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
000020 385c 4943 0000 0000 0001 0000 0000 0000

root@opensolaris:~# dd if=/mnt/zpool.zones bs=1k skip=41942686 | od -A x -x | head -3
000000 b10c 00ba 0000 0000 0004 0000 0000 0000
000010 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
000020 385c 4943 0000 0000 0001 0000 0000 0000

They all have the same timestamp, so I'm looking at the correct uberblocks.
Now I intentionally harm them.

root@opensolaris:/mnt# dd if=/dev/zero of=/mnt/zpool.zones bs=1k seek=158 count=1 conv=notrunc
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 0.000315229 s, 3.2 MB/s
root@opensolaris:/mnt# dd if=/dev/zero of=/mnt/zpool.zones bs=1k seek=414 count=1 conv=notrunc
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 3.5e-08 s, 29.3 GB/s
root@opensolaris:/mnt# dd if=/dev/zero of=/mnt/zpool.zones bs=1k seek=41942686 count=1 conv=notrunc
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 0.00192728 s, 531 kB/s

And... fingers crossed...

root@opensolaris:/mnt# zpool import -d /mnt -f zones
root@opensolaris:/mnt#

Huzzah, the import worked.

root@opensolaris:/mnt# zpool status
  pool: zones
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool
        will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME                STATE     READ WRITE CKSUM
        zones               ONLINE       0     0     0
          /mnt/zpool.zones  ONLINE       0     0     0

errors: No known data errors

And my filesystems are back.

root@opensolaris:/mnt# zfs list
NAME              USED   AVAIL  REFER  MOUNTPOINT
zones             23.7G  15.5G    27K  /zones
zones/appserver   1.69G  15.5G  5.55G  /zones/appserver
zones/base         847M  15.5G  4.20G  /zones/base
zones/centos      1.35G  15.5G  1.34G  /zones/centos
zones/cgiserver   2.43G  15.5G  6.24G  /zones/cgiserver
zones/ds1         5.47G  15.5G  3.91G  /zones/ds1
zones/ds2          616M  15.5G  3.88G  /zones/ds2
zones/webserver   11.3G  15.5G  15.1G  /zones/webserver

Initial inspection of the filesystems is promising. I can read from files,
there are no panics, and everything seems to be intact.

I hope this helps other people recover corrupted zpools, until such time as
there are tools to automate this process.
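As a step in that direction, here is a minimal sketch of such a tool in
Python. It is illustrative only, not an existing ZFS utility: it walks the
uberblock array in each of the four labels of a file-backed vdev and prints
the txg and timestamp of every slot, so the active uberblock (highest txg)
and the older candidates can be identified before anything is overwritten.
The layout constants are assumptions taken from this thread (256kB labels,
uberblock array 128kB into each label, 1kB slots, little-endian values).

import struct, sys, time

LABEL_SIZE = 256 * 1024      # four labels: two at the front, two at the back
UB_ARRAY_OFF = 128 * 1024    # uberblock array starts 128kB into each label
UB_SLOT = 1024               # one uberblock per 1kB slot
UB_MAGIC = 0x00bab10c

def scan(path):
    f = open(path, 'rb')
    f.seek(0, 2)
    size = f.tell()
    label_starts = [0, LABEL_SIZE, size - 2 * LABEL_SIZE, size - LABEL_SIZE]
    for lnum, lstart in enumerate(label_starts):
        for slot in range(128):              # 128 x 1kB slots per label
            off = lstart + UB_ARRAY_OFF + slot * UB_SLOT
            f.seek(off)
            magic, version, txg, guid_sum, ts = struct.unpack('<5Q', f.read(40))
            if magic != UB_MAGIC:
                continue                     # empty or damaged slot
            print('label %d slot %3d offset 0x%010x txg %d %s'
                  % (lnum, slot, off, txg, time.ctime(ts)))
    f.close()

if __name__ == '__main__':
    scan(sys.argv[1])

Run against an image like /mnt/zpool.zones, the slot with the highest txg in
each label should correspond to the active uberblock found above; zeroing
that 1kB slot in every label is what forces ZFS back to the previous one.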
On Mon, 15 Dec 2008 14:23:37 PST, Nathan Hand wrote:
[snip]
> Initial inspection of the filesystems is promising. I can read from files,
> there are no panics, and everything seems to be intact.

Good work, congratulations, and thanks for the clear description of the
process. I hope I never need it.

Now one wonders why zfs doesn't have a rescue like that built in...
--
( Kees Nuyt )
c[_]
I know Eric mentioned the possibility of zpool import doing more of this kind
of thing, and he said that its current inability to do this will be fixed,
but I don't know if it's an official project, RFE or bug. Can anybody shed
some light on this?

See Jeff's post on Oct 10, and Eric's follow-up later that day, in this thread:
http://opensolaris.org/jive/thread.jspa?messageID=289537
On Tue, Dec 16, 2008 at 11:39 AM, Ross <myxiplx@googlemail.com> wrote:

> I know Eric mentioned the possibility of zpool import doing more of this
> kind of thing, and he said that its current inability to do this will be
> fixed, but I don't know if it's an official project, RFE or bug. Can
> anybody shed some light on this?
>
> See Jeff's post on Oct 10, and Eric's follow-up later that day, in this
> thread:
> http://opensolaris.org/jive/thread.jspa?messageID=289537

When current uber-block A is detected to point to corrupted on-disk data, how
would "zpool import" (or any other tool for that matter) quickly and safely
know, once it found an older uber-block B, that it points to a set of blocks
which does not include any blocks that have since been freed and re-allocated
and, thus, corrupted? E.g., without scanning the entire on-disk structure?

--
Any sufficiently advanced technology is indistinguishable from magic.
  Arthur C. Clarke

My blog: http://initialprogramload.blogspot.com
Casper.Dik@Sun.COM
2008-Dec-16 11:43 UTC
[zfs-discuss] Need Help Invalidating Uberblock
> When current uber-block A is detected to point to corrupted on-disk data,
> how would "zpool import" (or any other tool for that matter) quickly and
> safely know, once it found an older uber-block B, that it points to a set
> of blocks which does not include any blocks that have since been freed and
> re-allocated and, thus, corrupted? E.g., without scanning the entire
> on-disk structure?

Without a scrub, you mean?

Not possible, except for the first few uberblocks (blocks aren't reused until
a few uberblocks later).

Casper
On Tue, Dec 16, 2008 at 1:43 PM, <Casper.Dik@sun.com> wrote:

> > When current uber-block A is detected to point to corrupted on-disk data,
> > how would "zpool import" (or any other tool for that matter) quickly and
> > safely know, once it found an older uber-block B, that it points to a set
> > of blocks which does not include any blocks that have since been freed
> > and re-allocated and, thus, corrupted? E.g., without scanning the entire
> > on-disk structure?
>
> Without a scrub, you mean?
>
> Not possible, except for the first few uberblocks (blocks aren't reused
> until a few uberblocks later).
>
> Casper

Does that mean that each of the last "few-minus-1" uberblocks points to a
consistent version of the file system? Does "few" have a definition?

--
Any sufficiently advanced technology is indistinguishable from magic.
  Arthur C. Clarke

My blog: http://initialprogramload.blogspot.com
It sounds to me like there are several potentially valid filesystem
uberblocks available, am I understanding this right?

1. There are four copies of the current uberblock. Any one of these should
   be enough to load your pool with no data loss.

2. There are also a few (would love to know how many) previous uberblocks
   which will point to a consistent filesystem, but with some data loss.

3. Failing that, the system could be rolled back to any snapshot uberblock.
   Any data saved since that snapshot will be lost.

Is there any chance at all of automated tools that can take advantage of all
of these for pool recovery?

On Tue, Dec 16, 2008 at 11:55 AM, Johan Hartzenberg <jhartzen@gmail.com> wrote:
>
> On Tue, Dec 16, 2008 at 1:43 PM, <Casper.Dik@sun.com> wrote:
> >
> > > When current uber-block A is detected to point to corrupted on-disk
> > > data, how would "zpool import" (or any other tool for that matter)
> > > quickly and safely know, once it found an older uber-block B, that it
> > > points to a set of blocks which does not include any blocks that have
> > > since been freed and re-allocated and, thus, corrupted? E.g., without
> > > scanning the entire on-disk structure?
> >
> > Without a scrub, you mean?
> >
> > Not possible, except for the first few uberblocks (blocks aren't reused
> > until a few uberblocks later).
> >
> > Casper
>
> Does that mean that each of the last "few-minus-1" uberblocks points to a
> consistent version of the file system? Does "few" have a definition?
>
> --
> Any sufficiently advanced technology is indistinguishable from magic.
>   Arthur C. Clarke
>
> My blog: http://initialprogramload.blogspot.com
Well done, Nathan. Thank you for taking on the additional effort to write it
all up.
On Tue, Dec 16, 2008 at 12:07:52PM +0000, Ross Smith wrote:

> It sounds to me like there are several potentially valid filesystem
> uberblocks available, am I understanding this right?
>
> 1. There are four copies of the current uberblock. Any one of these
> should be enough to load your pool with no data loss.
>
> 2. There are also a few (would love to know how many) previous
> uberblocks which will point to a consistent filesystem, but with some
> data loss.

My memory is that someone on this list said "3" in response to a question I
had about it. I looked through the archives and couldn't come up with the
post. It was over a year ago.

> 3. Failing that, the system could be rolled back to any snapshot
> uberblock. Any data saved since that snapshot will be lost.

What is a "snapshot uberblock"? The uberblock points to the entire tree:
live data, snapshots, clones, etc. If you don't have a valid uberblock, you
don't have any snapshots.

> Is there any chance at all of automated tools that can take advantage
> of all of these for pool recovery?

I'm sure there is. In addition, I think there needs to be more that can be
done non-destructively. Any successful import is read-write, potentially
destroying other information. It would be nice to get "df" or "zfs list"
information so you could make a decision about using an older uberblock.
Even better would be a read-only (at the pool level) mount so the data could
be directly examined.

--
Darren
Does anyone know the correct syntax to use the zdb command on a
/dev/dsk/c6t0d0s2 device? I'm trying to determine the active uberblock on an
attached USB drive.

> To identify the active uberblock I used zdb.
>
> root@kestrel:/opt$ zdb -U -uuuv zones
> Uberblock
>     magic = 0000000000bab10c
>     version = 4
>     txg = 1504158 (= 0x16F39E)
>     guid_sum = 10365405068077835008 (= 0x8FD950FDBBD02300)
>     timestamp = 1229142108 UTC = Sat Dec 13 15:21:48 2008 (= 0x4943385C)
>     rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:52e3edc00:200>
>         DVA[1]=<0:6f9c1d600:200> DVA[2]=<0:16e280400:200> fletcher4 lzjb LE
>         contiguous birth=1504158 fill=172
>         cksum=b0a5275f3:474e0ed6469:e993ed9bee4d:205661fa1d4016
>
> I spy those hex values in the uberblock starting at 027800.