Martin Vool
2009-Nov-16 13:30 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
I have written a Python script that makes it possible to get back already-deleted files and pools/partitions. This is highly experimental, but I managed to recover a month's work after all the partitions were deleted by accident (and of course backups are for the weak ;-). I hope someone can pass this information on to the ZFS forensics project, or wherever it belongs. The basics come first and the how-to after that. I am not a Solaris or ZFS expert, so I am sure there are many things to improve; I hope you can help me out with some of the problems this still has.

[b]Basics:[/b]
The script finds all the uberblocks, reads their metadata and orders them by time, and then lets you destroy every uberblock that was created after the event you want to roll back past. After that you destroy the ZFS cache and make the machine boot up again. This will only work if the disks are not very full and there was not much activity after the bad event. I managed to get files back from a ZFS partition after it had been deleted (several times) and new ones created on top.

I got this far with the help of these materials; the ones with * are the key parts:
*http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html*
http://blogs.sun.com/blogfinger/entry/zfs_and_the_uberblock
*http://www.opensolaris.org/jive/thread.jspa?threadID=85794&tstart=0*
http://opensolaris.org/os/project/forensics/ZFS-Forensics/
http://docs.huihoo.com/opensolaris/solaris-zfs-administration-guide/html/ch04s06.html
http://www.lildude.co.uk/zfs-cheatsheet/

[b]How-to[/b]
This is the scenario I had... First check the pool status:

$ zpool status zones

From there you will get the disk name, e.g. c2t60060E800457AB00000057AB00000146d0.

Now we look up the history of the pool so we can find the timeline and some uberblocks (their TXGs) to roll back to:

$ zpool history -il zones

Save this output for later use.

You will definitely want to back up the disk before you continue from this point, e.g.:

ssh root@host "dd if=/dev/dsk/c..." | dd of=Desktop/zones.dd

Now take the script I have attached, zfs_revert.py. It has two options:
-bs is the block size, 512 by default (other values never tested)
-tb is the total number of blocks [this is mandatory; maybe someone could automate this]

To find the total number of blocks on Solaris you can use:

$ prtvtoc /dev/dsk/c2t60060E800457AB00000057AB00000146d0 | grep sectors

and look at the "sectors" row. If you have a file or loop device, it is simply size in bytes / block size = total blocks.

Now run the script, for example:

./zfs_revert.py -bs=512 -tb=41944319 /dev/dsk/c2t60060E800457AB00000057AB00000146d0

It uses dd, od and grep (GNU) to find the required information, and should work on both Linux and Solaris. It will give you a listing of the uberblocks it found (I tested it with a 20 GB pool; it did not take very long, since the uberblocks are only at the beginning and end of the disk). Something like this, but probably much longer:

TXG     timestamp             unixtime    addresses (there are 4 copies of each uberblock)
411579  05 Oct 2009 14:39:51  1254742791  [630, 1142, 41926774, 41927286]
411580  05 Oct 2009 14:40:21  1254742821  [632, 1144, 41926776, 41927288]
411586  05 Oct 2009 14:43:21  1254743001  [644, 1156, 41926788, 41927300]
411590  05 Oct 2009 14:45:21  1254743121  [652, 1164, 41926796, 41927308]

Now comes the FUN part: take a wild guess which block might be the right one. It took me about 10 tries to get it right, and I have no idea which blocks are the "good" ones or how to check that up front; you will see later what I mean by that. Enter the last TXG you want to KEEP.
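[Editor's note: for anyone curious what such a scan looks like under the hood, here is a minimal sketch of the idea. It is NOT the attached zfs_revert.py; it assumes the documented uberblock header layout (magic 0x00bab10c, version, TXG, GUID sum and timestamp as five consecutive 64-bit words, on 1 KiB slot boundaries) and simply brute-force reads the whole device instead of using the dd/od/grep approach the script takes.]

#!/usr/bin/env python
# Minimal sketch only -- not the attached zfs_revert.py.
# Scans a disk or image for ZFS uberblocks by their magic number,
# collects TXG, timestamp and sector address of every copy, and
# prints them sorted by TXG.
import struct
import sys
import time

UB_MAGIC = 0x00bab10c        # uberblock magic ("oo-ba-bloc")
UB_ALIGN = 1024              # assume uberblock slots on 1 KiB boundaries
CHUNK    = 1024 * 1024       # read the device 1 MiB at a time

def scan(path, bs=512):
    found = {}               # (txg, timestamp) -> [sector, sector, ...]
    dev = open(path, 'rb')
    offset = 0
    while True:
        buf = dev.read(CHUNK)
        if not buf:
            break
        for pos in range(0, len(buf), UB_ALIGN):
            if pos + 40 > len(buf):
                break
            for endian in ('<', '>'):   # pool may be little- or big-endian
                magic, ver, txg, guid, ts = struct.unpack_from(endian + '5Q', buf, pos)
                if magic == UB_MAGIC:
                    sector = (offset + pos) // bs
                    found.setdefault((txg, ts), []).append(sector)
                    break
        offset += len(buf)
    dev.close()
    return found

if __name__ == '__main__':
    for (txg, ts), sectors in sorted(scan(sys.argv[1]).items()):
        print("%d  %s  %d  %s" % (txg, time.ctime(ts), ts, sectors))

Dividing the byte offset of each hit by the block size gives the same sector addresses that appear in the last column of the listing above.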
Now the script writes zeroes over all of the uberblocks after the TXG you entered. Next, clear the ZFS cache and reboot (does anyone have a better solution?):

rm -rf /etc/zfs/zpool.cache && reboot

After the box comes up you have to hurry; you don't have much time, if any at all, since ZFS will realize within a minute or two that something is fishy. First try to import the pool, if it is not imported already:

zpool import -f zones

Now see whether it imports or fails miserably. There is a good chance you will hit "corrupt data" and be unable to import, but as I said earlier it took me about 10 tries to get it right. I did not have to restore the whole thing every time; I just took baby steps, each time deleting a few more blocks until I found something stable (not quite stable: it will still crash after a few minutes, but that is enough time to get back config files or some code).

Problems and unknown factors:
1) After the machine boots up you have limited time before ZFS realizes it has been corrupted (checksums? I tried to turn them off, but as soon as I turn checksumming off it crashes, and even when I could turn it off the data might be corrupted).
2) If you copy files and one of them is corrupted, the whole thing halts/crashes and you have to start over with the zfs_revert.py script and reboot again.
3) It may be that reverting to a TXG at which the pool was exported gives a better chance of importing it after reboot, but I am not sure.
4) Is there a way to force the pool to stay intact even if ZFS does not want to?
5) Why does the zpool command hang completely if I try to do anything with the reverted disk? It just hangs once it realizes something is wrong (which is why I have to reboot after every try, and this takes a lot of time).
6) I tried this on loop devices before, using a 1 GB pool, and there ZFS did not crash after I deleted some uberblocks. That might be because I only deleted a file and a partition and then restored them (I did not create new files or pools). It worked perfectly, so my real scenario may have been a worst case.

If something in this post is unclear, or you have any suggestions/ideas on how to make this better, speak up. You can e-mail me if you don't want to post in the forum: mardicas at gmail.com

I did not add a licence to this script, only my name and contact. I don't know which licences are popular in the Solaris world; is GPL2 a good idea?

Sorry if my English is not perfect; it is not my native language.

[Attachment scrubbed: zfs_revert.py (application/octet-stream, 4124 bytes) -- URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091116/26b7569e/attachment.obj>]
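[Editor's note: again only a sketch, not the attached script. Given the {(txg, timestamp): [sectors]} mapping produced by a scan like the one shown earlier, the zero-out step described above amounts to blanking every uberblock slot whose TXG is newer than the one you chose to keep. The 1 KiB slot size is an assumption; work on a dd copy of the disk, as the post advises.]

# Sketch of the zero-out step -- not the attached zfs_revert.py.
# 'found' is the {(txg, timestamp): [sector, ...]} mapping from the
# scan sketch above; keep_txg is the last TXG you want to keep.
def zero_after(path, found, keep_txg, bs=512):
    dev = open(path, 'r+b')
    for (txg, ts), sectors in found.items():
        if txg > keep_txg:
            for sector in sectors:          # all four label copies
                dev.seek(sector * bs)
                dev.write(b'\0' * 1024)     # blank the whole 1 KiB slot
    dev.close()

After that, clearing /etc/zfs/zpool.cache and rebooting, as described above, still applies.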
Martin Vool
2009-Nov-16 13:34 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
I forgot to add the script...

[Attachment scrubbed: zfs_revert.py (application/octet-stream, 4273 bytes) -- URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091116/a89ae16f/attachment.obj>]
Martin Vool
2009-Nov-16 13:38 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
The links work fine if you take the * off the end... sorry about that...
The zpool for a zone of a customer-facing production appserver hung due to iSCSI transport errors. How can I (forcibly) reset this pool? zfs commands are hanging and iscsiadm remove refuses:

root@Raadiku~[8]8:48#iscsiadm remove static-config iqn.1986-03.com.sun:02:aef78e-955a-4072-c7f6-afe087723466
iscsiadm: logical unit in use
iscsiadm: Unable to complete operation

root@Raadiku~[6]8:45#dmesg
[...]
Nov 16 00:03:30 Raadiku scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Nov 16 00:03:30 Raadiku /scsi_vhci/ssd@g0100003048c514da00002a0049ae9806 (ssd3): Command Timeout on path /iscsi (iscsi0)
Nov 16 00:03:30 Raadiku scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g0100003048c514da00002a0049ae9806 (ssd3):
Nov 16 00:03:30 Raadiku SCSI transport failed: reason 'timeout': retrying command
Nov 16 08:40:10 Raadiku su: [ID 810491 auth.crit] 'su root' failed for jritorto on /dev/pts/1
Nov 16 08:47:05 Raadiku iscsi: [ID 213721 kern.notice] NOTICE: iscsi session(9) - session logout failed (1)
I encountered the same problem... like I said in the first post, the zpool command freezes. Does anyone know how to make it respond again?
Martin Vool
2009-Nov-16 20:15 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
I have no idea why this forum just makes files disappear??? I will put up a link tomorrow... a file was attached before...
On Mon, Nov 16, 2009 at 2:10 PM, Martin Vool <mardicas at gmail.com> wrote:

> I encountered the same problem... like I said in the first post, the zpool
> command freezes. Does anyone know how to make it respond again?

Is your failmode set to wait?

--Tim
I already got my files back, actually, and the disk already contains new pools, so I have no idea how it was set.

I have to make a VirtualBox installation and test it. Can you please tell me how to set the failmode?
On Mon, Nov 16, 2009 at 4:00 PM, Martin Vool <mardicas at gmail.com> wrote:

> I already got my files back, actually, and the disk already contains new
> pools, so I have no idea how it was set.
>
> I have to make a VirtualBox installation and test it.
> Can you please tell me how to set the failmode?

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
On Nov 16, 2009, at 2:00 PM, Martin Vool wrote:

> I already got my files back, actually, and the disk already contains
> new pools, so I have no idea how it was set.
>
> I have to make a VirtualBox installation and test it.

Don't forget to change VirtualBox's default cache flush setting.
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#OpenSolaris.2FZFS.2FVirtual_Box_Recommendations

-- richard
On Mon, Nov 16, 2009 at 4:49 PM, Tim Cook <tim at cook.ms> wrote:

> Is your failmode set to wait?

Yes. This box has like ten prod zones and ten corresponding zpools that initiate to iSCSI targets on the filers. We can't panic the whole box just because one {zone/zpool/iscsi target} fails.

Are there undocumented commands to reset a specific zpool, or something?

thx
jake
Hi all,

Not sure if you missed my last response or what, but yes, the pool is set to wait because it's one of many pools on this prod server and we can't just panic everything because one pool goes away.

I just need a way to reset one pool that's stuck.

If the architecture of ZFS can't handle this scenario, I understand and can rework the layout. Just let me know one way or the other, please.

thx
jake
On Nov 18, 2009, at 5:44 AM, Jacob Ritorto wrote:

> Hi all,
> Not sure if you missed my last response or what, but yes, the pool
> is set to wait because it's one of many pools on this prod server
> and we can't just panic everything because one pool goes away.
>
> I just need a way to reset one pool that's stuck.
>
> If the architecture of ZFS can't handle this scenario, I understand
> and can rework the layout.

ZFS relies on the underlying drivers to report errors. For the iSCSI initiator, the default timeout is 180 seconds. This was fixed, but is now tunable (b121).
-- richard

> Just let me know one way or the other, please.
>
> thx
> jake
On Wed, Nov 18, 2009 at 10:30 AM, Richard Elling <richard.elling at gmail.com> wrote:

> On Nov 18, 2009, at 5:44 AM, Jacob Ritorto wrote:
>
>> Hi all,
>> Not sure if you missed my last response or what, but yes, the pool
>> is set to wait because it's one of many pools on this prod server and we
>> can't just panic everything because one pool goes away.
>>
>> I just need a way to reset one pool that's stuck.
>>
>> If the architecture of ZFS can't handle this scenario, I understand
>> and can rework the layout.
>
> ZFS relies on the underlying drivers to report errors. For the iSCSI
> initiator, the default timeout is 180 seconds. This was fixed, but is now
> tunable (b121).
> -- richard

Also, I never said anything about setting it to panic. I'm not sure why you can't set it to continue while alerting you that a vdev has failed?

--
--Tim
Tim Cook wrote:

> Also, I never said anything about setting it to panic. I'm not sure why
> you can't set it to continue while alerting you that a vdev has failed?

Ah, right, thanks for the reminder, Tim!

Now, I'd asked about this some months ago but didn't get an answer, so forgive me for asking again: what's the difference between wait and continue in my scenario? Will it allow the one faulted pool to fully fail and accept that it's broken, thereby allowing me to frob the iSCSI initiator, re-import the pool and restart the zone? That'd be exactly what I need.

thx
jake
On Wed, Nov 18, 2009 at 12:49 PM, Jacob Ritorto <Jacob.Ritorto at gmail.com> wrote:

> Tim Cook wrote:
>
>> Also, I never said anything about setting it to panic. I'm not sure why
>> you can't set it to continue while alerting you that a vdev has failed?
>
> Ah, right, thanks for the reminder, Tim!
>
> Now, I'd asked about this some months ago but didn't get an answer, so
> forgive me for asking again: what's the difference between wait and continue
> in my scenario? Will it allow the one faulted pool to fully fail and accept
> that it's broken, thereby allowing me to frob the iSCSI initiator,
> re-import the pool and restart the zone? That'd be exactly what I need.

I'm not sure I've seen what your setup is. If it's raid-z on top of several iSCSI LUNs, losing one won't cause much of anything. The pool will be marked as degraded, but it will keep chugging along. When the LUN comes back, it should resilver, and I'd suggest doing a scrub as well.

If there's no redundancy, I'd imagine it depends on what data is on the LUN. If it's something with the core OS, I'd expect a panic. If it's just a directory with shares, I would expect the shares to go offline but the system to continue functioning. You'd have to test to verify that, though; I can't say for certain.

--
--Tim
fred pam
2010-Apr-14 12:13 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
I have a similar problem that differs in a subtle way. I moved a zpool (single disk) from one system to another. Due to my inexperience I did not import the zpool but (doh!) 'zpool create'-ed it (I may also have used a -f somewhere in there...).

Interestingly, the script still gives me the old uberblocks, but in this case the first few (lowest TXGs) are actually younger (later timestamp) than the higher-TXG ones. Obviously removing the highest TXGs would actually remove the uberblocks I want to keep.

Is there a way to copy an uberblock over another one? Or could I perhaps remove the low-TXG uberblocks instead of the high-TXG ones (and would that mean the old pool becomes available again)? Or are more things missing than just the uberblocks, and should I move to a file-based approach (on ZFS?)

Regards, Fred
Richard Elling
2010-Apr-14 17:21 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
On Apr 14, 2010, at 5:13 AM, fred pam wrote:

> I have a similar problem that differs in a subtle way. I moved a zpool (single disk) from one system to another. Due to my inexperience I did not import the zpool but (doh!) 'zpool create'-ed it (I may also have used a -f somewhere in there...).

You have destroyed the previous pool. There is a reason the "-f" flag is required, though it is human nature to ignore such reasons.

> Interestingly, the script still gives me the old uberblocks, but in this case the first few (lowest TXGs) are actually younger (later timestamp) than the higher-TXG ones. Obviously removing the highest TXGs would actually remove the uberblocks I want to keep.

This is because creation of the new pool did not zero out the uberblocks.

> Is there a way to copy an uberblock over another one? Or could I perhaps remove the low-TXG uberblocks instead of the high-TXG ones (and would that mean the old pool becomes available again)? Or are more things missing than just the uberblocks, and should I move to a file-based approach (on ZFS?)

I do not believe you can recover the data on the previous pool without considerable time and effort.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
fred pam
2010-Apr-15 19:39 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
Hi Richard,

Hm, I guess I misunderstand the function of uberblocks. I thought uberblocks contained pointers (to...?) which the system then uses to retrieve the files. If I'm incorrect in thinking that I could use an older uberblock to retrieve the data, what am I missing?

I've tried to find some basic zpool <-> uberblock relation info without much success (eh... well, the wiki wasn't helpful and I try to avoid reading RFC/IEEE documents since I still value my sanity ;-)

Grtz, Fred
Richard Elling
2010-Apr-15 22:25 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
On Apr 15, 2010, at 12:39 PM, fred pam wrote:

> Hi Richard,
>
> Hm, I guess I misunderstand the function of uberblocks. I thought uberblocks contained pointers (to...?) which the system then uses to retrieve the files.

Uberblocks are the trunk of the tree.

> If I'm incorrect in thinking that I could use an older uberblock to retrieve the data, what am I missing?

Uberblocks point to the meta-object set (MOS), which describes the configuration of the pool and ultimately the datasets and files. What you've done is plant another tree over the previous tree, and it is unlikely that the previous tree remains intact.

> I've tried to find some basic zpool <-> uberblock relation info without much success (eh... well, the wiki wasn't helpful and I try to avoid reading RFC/IEEE documents since I still value my sanity ;-)

This is a design detail. In practical terms, you clobbered the previous ZFS pool by creating a new one on top of it.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
fred pam
2010-Apr-16 12:02 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
Hi Richard, thanks for your time, I really appreciate it, but I'm still unclear on how this works.

So uberblocks point to the MOS. Why, then, are multiple uberblocks required? Or are there actually multiple MOSes? Or is there one MOS plus multiple deltas to it (and its predecessors), and do the uberblocks then point to the latest delta? In the latter case I can understand why nullifying the latest uberblocks reverts to a previous situation; otherwise I don't see the difference between "nullifying the first uberblocks" and "nullifying the last uberblocks".

Thanks, Fred
max at bruningsystems.com
2010-Apr-16 16:40 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
Hi Fred,

Have you read the ZFS On-Disk Format Specification paper at
http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/ondiskformat0822.pdf ?

fred pam wrote:

> Hi Richard, thanks for your time, I really appreciate it, but I'm still unclear on how this works.
>
> So uberblocks point to the MOS. Why, then, are multiple uberblocks required? Or are there actually multiple MOSes?
> Or is there one MOS plus multiple deltas to it (and its predecessors), and do the uberblocks then point to the latest delta?
> In the latter case I can understand why nullifying the latest uberblocks reverts to a previous situation; otherwise I don't see the difference between "nullifying the first uberblocks" and "nullifying the last uberblocks".

One reason for multiple uberblocks is that uberblocks, like everything else, are copy-on-write. The reason you have 4 copies (2 labels at the front and 2 labels at the end of every disk) is redundancy. No, there are not multiple MOSes in one pool (though there may be multiple copies of the MOS via "ditto" blocks).

The current (or "active") uberblock is the one with the highest transaction id that has a valid checksum. Transaction ids are basically monotonically increasing, so nullifying the last uberblock can revert you to a previous state.

max

> Thanks, Fred
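[Editor's note: to make that selection rule concrete, a toy sketch in the same spirit as the earlier ones. checksum_ok is a hypothetical stand-in for the real label checksum verification, which is not shown here.]

# Toy illustration of the rule above: among the uberblock slots that still
# verify, ZFS activates the one with the highest TXG, so wiping the newest
# slots makes an older TXG become the active uberblock again.
# 'candidates' are (txg, timestamp, sectors) tuples from a scan;
# 'checksum_ok' is a hypothetical callable supplied by the caller.
def active_uberblock(candidates, checksum_ok):
    valid = [ub for ub in candidates if checksum_ok(ub)]
    return max(valid, key=lambda ub: ub[0]) if valid else None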
fred pam
2010-Apr-19 07:55 UTC
[zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
Hi Max,

Thanks, that's what I was looking for. After reading it, I come to the conclusion that it's actually the fact that I've lost my MOS that makes it 'impossible' to retrieve the data.

My understanding of it all (growing, yet still meager ;-): uberblocks do not point to different MOSes but refer to a transaction history within the MOS; within an uberblock it is in fact not the block pointer (as it only points to the MOS) but the TXG that determines what the system 'sees' as data. The fact that I may or may not have older uberblocks is then irrelevant, right?

From a forensics perspective, this seems like quite a quick and powerful way of destroying data (especially when it is also encrypted). Mind you, I do not necessarily think this is a bad thing; I just like to be sure I understand the consequences...

Grtz, Fred