On Apr 30, 2019, at 11:17 PM, Michelle Sullivan <michelle at sorbs.net> wrote:

>> I have had it happen several times over my IT career.  If that
>> happens to you the odds are that it's absolutely unrecoverable and
>> whatever gets corrupted is *gone.*
>
> Every drive corruption I have suffered in my career I have been able
> to recover, all or partial data, except where the hardware itself was
> totally hosed (i.e., clean-room options only available)... even with
> btrfs.. yuk.. puck.. yuk.. oh what a mess that was... still get
> nightmares on that one... but I still managed to get most of the data
> off... in fact I put it onto this machine I currently have problems
> with.. so after the nightmare of btrfs it looks like ZFS eventually
> nailed me.

It sounds from reading this thread that FreeBSD's built-in tools for
ZFS recovery were insufficient for the corruption your pool suffered.
Have you looked at the digital forensics realm to see whether those
tools might help you?  This article claims to extend The Sleuth Kit to
support pooled storage such as ZFS, and the authors even describe
recovering the bulk of an image file from a pool that has a disk
missing (Evaluation section, "Scenario C: reconstructing an incomplete
pool"):

"Extending The Sleuth Kit and its underlying model for pooled storage
file system forensic analysis"
https://www.sciencedirect.com/science/article/pii/S1742287617301901

>> If said software has no tools to "walk" said data or if it's
>> impractical to have it do so you're at severe risk of being hosed.
>
> Umm, what?  I'm talking about a userland (libzfs) tool (i.e., one
> that doesn't need the pool imported), such as zfs send (which
> requires the pool to be imported, hence me not calling it a userland
> tool), to allow sending of whatever data can be found to other
> places, where it can either be blindly recovered (corruption might
> be present) or be used to locate files/paths etc. that are known to
> be good (checksums match etc.)... walk the structures, feed the data
> elsewhere where it can be examined/recovered... don't alter it...
> it's a last-resort tool for when you don't have working backups..

See above.

>> BTW if you've never had a UFS volume unlink all the blocks within a
>> file on an fsck and then recover them back into the free list after
>> a crash you're a rare bird indeed.  If you think a corrupt ZFS
>> volume is fun try to get your data back from said file after that
>> happens.
>
> Been there done that, though with ext2 rather than UFS.. still got
> all my data back... even though it was a nightmare..

Is that an implication that had all your data been on UFS (or ext2 :)
this time around you would have got it all back?  (I've got that
impression through this thread from things you've written.)  That sort
of makes it sound like UFS is bulletproof to me.

There are levels of corruption.  Maybe what you suffered would have
taken down UFS, too?  I guess there's no way to know unless there's
some way you can recreate exactly the circumstances that took down
your original system (but this time with your data on UFS). ;-)

Cheers,

Paul.
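[As an aside on the "walk the structures without importing the pool"
idea above: zdb(8) already provides a limited, read-only approximation
of it.  It is built on the userland libzpool, so it can examine an
exported (or unimportable) pool directly from the devices.  A rough
sketch; the pool name, dataset name, and block address below are
placeholders, not anything from this thread:

    zdb -e -d mypool                  # enumerate datasets and object
                                      # counts without importing
    zdb -e -AAA -mmm mypool           # dump metaslab/spacemap detail,
                                      # ignoring assertion failures
    zdb -e -L -dd mypool/somedataset  # -L skips loading the spacemaps,
                                      # useful when a spacemap is bad
    zdb -e -R mypool 0:400000:20000   # read one raw block, specified
                                      # as vdev:offset:size

It is read-only and nowhere near a full "send the datasets elsewhere"
tool, but it can at least confirm what still checksums correctly
before anything destructive is attempted.]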
Paul Mather wrote:
> On Apr 30, 2019, at 11:17 PM, Michelle Sullivan
> <michelle at sorbs.net> wrote:
>
>> Been there done that, though with ext2 rather than UFS.. still got
>> all my data back... even though it was a nightmare..
>
> Is that an implication that had all your data been on UFS (or ext2 :)
> this time around you would have got it all back?  (I've got that
> impression through this thread from things you've written.)  That
> sort of makes it sound like UFS is bulletproof to me.

It's definitely not bulletproof (far from it); however, when the data
on disk is not corrupt I have managed to recover it, even if it has
been a nightmare: no directory structure, all files in lost+found,
etc... or even resorting to R-Studio in the event of lost RAID
information, etc..

> There are levels of corruption.  Maybe what you suffered would have
> taken down UFS, too?

Pretty sure not, and even if it had, with the files intact I have
always been able to recover them... R-Studio being the last resort.

> I guess there's no way to know unless there's some way you can
> recreate exactly the circumstances that took down your original
> system (but this time your data on UFS). ;-)

True.

In this case, from what my limited knowledge has managed to fathom, a
spacemap has become corrupt due to a partial write during the hard
power failure.  This was the second hard outage during the resilver
process following a drive platter failure (on a RAIDZ2, so a single
platter failure should be completely recoverable in all cases, barring
HBA failure or other corruption, which does not appear to be the
case)..  The spacemap fails its checksum (no surprise there, given
that it was partially written); however, it cannot be repaired (for
whatever reason)...

This is why I think it's an interesting case: one cannot just assume
anything about the corrupt spacemap... it could be complete and just
the checksum is wrong, or it could be completely corrupt and
ignorable..  But as I understand ZFS (and please, watchers, chime in
if I'm wrong), the spacemap is just the free-space map.  If it is
corrupt or missing, one cannot simply "fix it", because there is a
very good chance that the fix would corrupt something that is actually
allocated; therefore the safest "fix" would be to consider the region
100% full and write it off as dead space.. but ZFS doesn't do that
(probably a good thing), the result being that a drive that is
supposed to be good (and zdb reports some 36M+ objects there) becomes
completely unreadable...

My thought (desire/want) for a "walk" tool would be a last-resort tool
that could walk the datasets and send them elsewhere (like zfs send),
so that I could create a new pool elsewhere, send the data it knows
about to that new pool, and then blow away the original.  If there are
corruptions or data missing, that's my problem; it's a last resort..
But when the critical structures become corrupt it means a local
recovery option exists..  It means that if the data is all there and
the corruption is just a spacemap, one can transfer the entire drive's
data to a new pool whilst the original host is rebuilt...  This would
*significantly* help most people with large pools who currently have
to blow them away and re-create the pools because of
errors/corruptions etc...  And with the addition of rsync-style
checksumming of files, it would be trivial to "fix" just the data that
is corrupted or missing, from a mirror host, rather than transferring
the entire pool from (possibly) offsite....

Regards,

-- 
Michelle Sullivan
http://www.mhix.org/
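[For readers who hit a similar spacemap failure: a rough approximation
of the workflow Michelle describes is sometimes possible with the
stock tools, by combining a forced read-only import with replication
to a fresh pool.  A sketch; the pool names, snapshot name, mount
points, and mirror host below are all placeholders:

    # A forced, read-only import with rewind (-F) will sometimes bring
    # up a pool whose spacemaps cannot be loaded for a writeable
    # import, since read-only operation never needs to allocate:
    zpool import -o readonly=on -f -F -R /mnt mypool

    # zfs send needs a snapshot, and a read-only pool cannot take new
    # ones, so replicate an existing snapshot if one exists:
    zfs send -R mypool@lastsnap | zfs receive -dF newpool

    # Otherwise, fall back to a file-level copy off the read-only
    # mount.  Afterwards, rsync --checksum compares file contents, so
    # a known-good mirror host can be used to patch only the damaged
    # files in the new pool instead of re-sending everything offsite:
    rsync -a --checksum mirrorhost:/backup/data/ /newpool/data/

Whether the read-only import succeeds depends on how badly the
spacemap is damaged; in the worst case the import itself still fails,
which is exactly the gap the proposed walk tool would fill.]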