Hi,

Have a problem with a ZFS on a single device, this device is 48 1T SATA
drives presented as a 42T LUN via hardware RAID 6 on a SAS bus, with ZFS
on it as a single device.

There was a problem with the SAS bus which caused various errors,
including the inevitable kernel panic; the thing came back up with 3 out
of 4 zfs mounted.

I've tried reading the partition table with format, which works fine, and
I can also dd the first 100G from the device quite happily, so the
communication issue appears resolved, however the device just won't
mount.  Googling around I see that ZFS does have features designed to
reduce the impact of corruption at a particular point, multiple metadata
copies and so on, however the commands to help me tidy up a zfs will only
run once the thing has been mounted.

Would be grateful for any ideas, relevant output here:

root at cs3:~# zpool import
  pool: content
    id: 14205780542041739352
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported
        using the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        content     FAULTED   corrupted data
          c2t9d0    ONLINE

root at cs3:~# zpool import content
cannot import 'content': pool may be in use from other system
use '-f' to import anyway

root at cs3:~# zpool import -f content
cannot import 'content': I/O error

root at cs3:~# uname -a
SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200

Thanks
-- 
Tom

// www.portfast.co.uk -- internet services and consultancy
// hosting from 1.65 per domain
Tom Bird wrote:
> Hi,
>
> Have a problem with a ZFS on a single device, this device is 48 1T SATA
> drives presented as a 42T LUN via hardware RAID 6 on a SAS bus, with ZFS
> on it as a single device.
>
> There was a problem with the SAS bus which caused various errors
> including the inevitable kernel panic, the thing came back up with 3 out
> of 4 zfs mounted.
>
> I've tried reading the partition table with format, works fine, also can
> dd the first 100G from the device quite happily so the communication
> issue appears resolved however the device just won't mount.  Googling
> around I see that ZFS does have features designed to reduce the impact
> of corruption at a particular point, multiple metadata copies and so
> on, however commands to help me tidy up a zfs will only run once the
> thing has been mounted.

You should also check the end of the LUN.  ZFS stores its configuration
data at the beginning and end of the LUN.

An I/O error is a fairly generic error, but it can also be an indicator
of a catastrophic condition.  You should also check the system log in
/var/adm/messages as well as any faults reported by fmdump.

In general, ZFS can only repair conditions for which it owns data
redundancy.  In this case, ZFS does not own the redundancy function, so
you are susceptible to faults of this sort.
 -- richard

> Would be grateful for any ideas, relevant output here:
>
> root at cs3:~# zpool import
>   pool: content
>     id: 14205780542041739352
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
>         content     FAULTED   corrupted data
>           c2t9d0    ONLINE
>
> root at cs3:~# zpool import content
> cannot import 'content': pool may be in use from other system
> use '-f' to import anyway
>
> root at cs3:~# zpool import -f content
> cannot import 'content': I/O error
>
> root at cs3:~# uname -a
> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200
>
> Thanks
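For anyone following along, the checks Richard suggests might look roughly
like the following.  The device and slice names, and the seek count, are
illustrative only: the slice is a guess at where the pool lives, and the
iseek value assumes a 42 TiB LUN read in 1 MiB blocks, so it needs to be
adjusted to land in the last gigabyte or so of your actual device.

  # recent fault events and kernel messages
  fmdump
  fmdump -eV | tail -100
  tail -200 /var/adm/messages

  # read the tail of the LUN, where ZFS keeps two of its four labels
  dd if=/dev/rdsk/c2t9d0s0 of=/dev/null bs=1024k iseek=44039168 count=1024

If the tail of the LUN cannot be read, that points at the transport or the
array rather than at ZFS.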
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes:

    tb> There was a problem with the SAS bus which caused various
    tb> errors including the inevitable kernel panic, the thing came
    tb> back up with 3 out of 4 zfs mounted.

    re> In general, ZFS can only repair conditions for which it owns
    re> data redundancy.

If that's really the excuse for this situation, then ZFS is not
``always consistent on the disk'' for single-VDEV pools.

There was no loss of data here, just an interruption in the connection
to the target, like power loss or any other unplanned shutdown.
Corruption in this scenario is a significant regression w.r.t. UFS:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html

How about the scenario where you lose power suddenly, but only half of
a mirrored VDEV is available when power is restored?  Is ZFS
vulnerable to this type of unfixable corruption in that scenario, too?
On Wed, Aug 6, 2008 at 13:57, Miles Nordin <carton at ivy.net> wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes: > > tb> There was a problem with the SAS bus which caused various > tb> errors including the inevitable kernel panic, the thing came > tb> back up with 3 out of 4 zfs mounted. > > re> In general, ZFS can only repair conditions for which it owns > re> data redundancy. > > If that''s really the excuse for this situation, then ZFS is not > ``always consistent on the disk'''' for single-VDEV pools.Well, yes. If data is sent, but corruption somewhere (the SAS bus, apparently, here) causes bad data to be written, ZFS can generally detect but not fix that. It might be nice to have a "verifywrites" mode or something similar to make sure that good data has ended up on disk (at least at the time it checks), but failing that there''s not much ZFS (or any filesystem) can do. Using a pool with some level of redundancy (mirroring, raidz) at least gives zfs a chance to read the missing pieces from the redundancy that it''s kept.> How about the scenario where you lose power suddenly, but only half of > a mirrored VDEV is available when power is restored? Is ZFS > vulnerable to this type of unfixable corruption in that scenario, > too?Every filesystem is vulnerable to corruption, all the time. I''m willing to dispute any claims otherwise. Some are just more likely than others to hit their error conditions. I''ve personally run into UFS'' problems more often than ZFS... but that doesn''t mean I think I''m safe. Will
Miles Nordin wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes: >>>>>> > > tb> There was a problem with the SAS bus which caused various > tb> errors including the inevitable kernel panic, the thing came > tb> back up with 3 out of 4 zfs mounted. > > re> In general, ZFS can only repair conditions for which it owns > re> data redundancy. > > If that''s really the excuse for this situation, then ZFS is not > ``always consistent on the disk'''' for single-VDEV pools. >I disagree with your assessment. The on-disk format (any on-disk format) necessarily assumes no faults on the media. The difference between ZFS on-disk format and most other file systems is that the metadata will be consistent to some point in time because it is COW. With UFS, for instance, the metadata is overwritten, which is why it cannot be considered always consistent (and why fsck exists).> There was no loss of data here, just an interruption in the connection > to the target, like power loss or any other unplanned shutdown. > Corruption in this scenario is is a significant regression w.r.t. UFS: >I see no evidence that the data is or is not correct. What we know is that ZFS is attempting to read something and the device driver is returning EIO. Unfortunately, EIO is a catch-all error code, so more digging to find the root cause is needed. However, I will bet a steak dinner that if this device was mirrored to another, the pool will import just fine, with the affected device in a faulted or degraded state.> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html >I have no idea what Eric is referring to, and it does not match my experience. Unfortunately, he didn''t reference any CRs either :-(. "Your baby is ugly" posts aren''t very useful. That said, we are constantly improving the resiliency of ZFS (more good stuff coming in b96), so it might be worth trying to recover with a later version. For example, boot SXCE b94 and try to import the pool.> How about the scenario where you lose power suddenly, but only half of > a mirrored VDEV is available when power is restored? Is ZFS > vulnerable to this type of unfixable corruption in that scenario, > too? >No, this works just fine as long as one side works. But that is a very different case. -- richard
Richard Elling wrote:
> I see no evidence that the data is or is not correct.  What we know is
> that ZFS is attempting to read something and the device driver is
> returning EIO.  Unfortunately, EIO is a catch-all error code, so more
> digging to find the root cause is needed.

I'm currently checking the whole LUN, although as a 42TB unit this will
take a few hours so we'll see how that is tomorrow.

> However, I will bet a steak dinner that if this device was mirrored to
> another, the pool will import just fine, with the affected device in a
> faulted or degraded state.

On any other file system though, I could probably kick off a fsck and
get back most of the data.  I see the argument a lot that ZFS "doesn't
need" a fsck utility, however I would be inclined to disagree, if not a
full on fsck then something that can patch it up to the point where I
can mount it and then get some data off or run a scrub.

-- 
Tom

// www.portfast.co.uk -- internet services and consultancy
// hosting from 1.65 per domain
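A full surface read of the kind Tom describes can be done with dd; a
minimal sketch, assuming the pool device is c2t9d0 and that slice s0
covers the whole LUN (adjust to your layout):

  dd if=/dev/rdsk/c2t9d0s0 of=/dev/null bs=1024k
  echo $?     # a non-zero exit, or driver errors on the console, would
              # point at the transport or the array

Note that this only proves the LUN is readable end to end; it says nothing
about whether the blocks still hold what ZFS last wrote.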
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:

    c> If that's really the excuse for this situation, then ZFS is
    c> not ``always consistent on the disk'' for single-VDEV pools.

    re> I disagree with your assessment.  The on-disk format (any
    re> on-disk format) necessarily assumes no faults on the media.

The media never failed, only the connection to the media.  We've every
good reason to believe that every CDB that the storage controller
acknowledged as complete, was completed and is still there---and that
is the only statement which must be true of unfaulty media.  We've no
strong reason to doubt it.

    re> I see no evidence that the data is or is not correct.

the ``evidence'' is that it was on a SAN, and the storage itself never
failed, only the connection between ZFS and the storage.  Remember:

    this device is 48 1T SATA drives presented as a 42T LUN via hardware
    RAID 6 on a SAS bus which had a ZFS on it as a single device.

This sort of SAN-outage happens all the time, so it's not straining my
belief to suggest that probably nothing else happened other than
disruption of the connection between ZFS and the storage.  It's not like
a controller randomly ``acted up'' or something, so that I would suspect
a bad disk.

    c> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html

    re> I have no idea what Eric is referring to, and it does not
    re> match my experience.

unfortunately it's very easy to match the experience of ``nothing
happened'' and hard to match the experience ``exactly the same thing
happened to me.''  Have you been provoking ZFS in exactly the way Eric
described, a single-vdev pool on FC where the FC SAN often has outages or
where the storage is rebooted while ZFS is still running?  If not,
obviously it doesn't match your experience because you have none with
this situation.  OTOH if you've been doing that a lot, your not running
into this problem means something.  Otherwise, it's another case of the
home-user defense: ``I can't tell you how close to zero the number of
problems I've had with it is.  It's so close to zero, it is zero, so
there's virtually 0% chance what you're saying happened to you really did
happen to you.  and also to this other guy.''

When I say ``doesn't match my experience'' I meant I _do_ see Mac OS X
pinwheels, and for me it's ``usually'' traceable back to VM pressure or a
dead NFS server, not some random application-level user-interface
modal-wait as others claimed: I'm selecting for the same situation you
are, and getting a different result.

that said, yeah, a CR would be nice.  For such a serious problem, I'd
like to think someone's collected an image of the corrupt filesystem and
is trying to figure out wtf happened.  I care about how safe is my data,
not how pretty is your baby.  I want its relative safety accurately
represented based on the experience available to us.

    c> How about the scenario where you lose power suddenly, but only
    c> half of a mirrored VDEV is available when power is restored?
    c> Is ZFS vulnerable to this type of unfixable corruption in that
    c> scenario, too?

    re> No, this works just fine as long as one side works.  But that
    re> is a very different case.  -- richard

Why do you regard this case as very different from a single vdev?  I
don't have confidence that it's clearly different w.r.t. whatever
hypothetical bug Eric and Tom have run into.
    wm> If data is sent, but corruption somewhere (the SAS bus,
    wm> apparently, here) causes bad data to be written, ZFS can
    wm> generally detect but not fix that.

Why would there be bad data written?  The SAS bus has checksums.  The
problem AIUI was that the bus went away, not that it started scribbling
random data all over the place.  Am I wrong?  Remember what Tom's SAS bus
is connected to.

    wm> "verifywrites"

The verification is the storage array returning success to the command it
was issued.  ZFS is supposed to, for example, delay returning from
fsync() until this has happened.  The same mechanism is used to write
batches of things in a well-defined order to supposedly achieve the
``always-consistent''.  It depends on the drive/array's ability to
accurately report when data is committed to stable storage, not on
rereading what was written, and this is the correct dependency because
ZFS leaves write caches on, so the drive could satisfy a read from the
small on-disk cache RAM even though that data would be lost if you pulled
the disk's power cord.  The system contains all the tools needed to keep
the consistency promises even if you go around yanking SAS cables.  And
this is a data-loss issue, not just an availability issue like we were
discussing before w.r.t. pulling drives.

    wm> Every filesystem is vulnerable to corruption, all the time.

Every filesystem in recent history makes rigorous guarantees about what
will survive if you pull the connection to the disk array, or the host's
power, at any time you wish.  The guarantees always include the integrity
of data written before an fsync() command was called, so long as
power/connectivity is lost after fsync() returns.  They also include
enough metadata consistency that you won't lose a whole friggin' pool
like this scenario with some ``corrupt data, End of Line'' error.

    UFS+logging
    vxfs
    FFS+softdep
    ext3
    xfs
    reiserfs
    HFS+

Disks that go bad, storage subsystems with a RAID5 write hole, PATA
busses that given noisy cables autodegrade to a non-CRC mode and then
corrupt data, disks that silently return bad data, controllers that go
nuts and scribble random data as the 5V rail starts dropping after the
cord is pulled, can, yes, all interfere with these guarantees.  but NONE
OF THOSE THINGS HAPPENED IN THIS CASE.  We absolutely do not live in fear
that we will lose whole filesystems if the cord is pulled at the wrong
time.  That has not been true since, like, the early 90's.  ancient
history.
On Wed, Aug 6, 2008 at 8:20 AM, Tom Bird <tom at marmot.org.uk> wrote:
> Hi,
>
> Have a problem with a ZFS on a single device, this device is 48 1T SATA
> drives presented as a 42T LUN via hardware RAID 6 on a SAS bus, with ZFS
> on it as a single device.
>
> There was a problem with the SAS bus which caused various errors
> including the inevitable kernel panic, the thing came back up with 3 out
> of 4 zfs mounted.

Hi Tom,

After reading this and the followups to date... this could be due to
anything, and we (on the list) don't know the history of the system or
the RAID device.  You could have a bad SAS controller, bad system memory,
a bad cable or a RAID controller with a firmware bug.

The first step would be to form a ZFS pool with 2 mirrors, beat up on it
and gain some confidence in the overall system components.  Write lots of
data to it, run zpool scrub etc. and verify that it's 100% rock solid
before you then zpool destroy it and then test with a larger pool.

In every case where someone has initially posted an opening story like
yours, the problem has almost always turned out to be outside of ZFS.  As
others have explained, if ZFS does not have a config with data redundancy
- there is not much that can be learned - except that it "just broke".

Keep testing and report back.  Also, any additional data on the hardware
and software config would be useful, and let us know if this is a "new"
system or if the hardware has already been in service and what its
reliability track record is.

> I've tried reading the partition table with format, works fine, also can
> dd the first 100G from the device quite happily so the communication
> issue appears resolved however the device just won't mount.  Googling
> around I see that ZFS does have features designed to reduce the impact
> of corruption at a particular point, multiple metadata copies and so
> on, however commands to help me tidy up a zfs will only run once the
> thing has been mounted.
>
> Would be grateful for any ideas, relevant output here:
>
> root at cs3:~# zpool import
>   pool: content
>     id: 14205780542041739352
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
>         content     FAULTED   corrupted data
>           c2t9d0    ONLINE
>
> root at cs3:~# zpool import content
> cannot import 'content': pool may be in use from other system
> use '-f' to import anyway
>
> root at cs3:~# zpool import -f content
> cannot import 'content': I/O error
>
> root at cs3:~# uname -a
> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200
>
> Thanks
> --
> Tom

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX al at logical-approach.com
           Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
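A burn-in along the lines Al suggests might look like the sketch below;
the device names are purely illustrative and would need to match spare
disks on the system under test.

  zpool create testpool mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0

  # generate some traffic, e.g. copy a large tree onto it a few times
  cp -rp /export/home /testpool/copy1

  zpool scrub testpool
  zpool status -v testpool    # look for CKSUM errors, and check fmdump
  zpool destroy testpool

If this stays clean over repeated runs, confidence in the controller,
cabling and memory goes up before the larger pool is rebuilt.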
Tom Bird wrote:
> Richard Elling wrote:
>> I see no evidence that the data is or is not correct.  What we know is
>> that ZFS is attempting to read something and the device driver is
>> returning EIO.  Unfortunately, EIO is a catch-all error code, so more
>> digging to find the root cause is needed.
>
> I'm currently checking the whole LUN, although as a 42TB unit this will
> take a few hours so we'll see how that is tomorrow.
>
>> However, I will bet a steak dinner that if this device was mirrored to
>> another, the pool will import just fine, with the affected device in a
>> faulted or degraded state.
>
> On any other file system though, I could probably kick off a fsck and
> get back most of the data.  I see the argument a lot that ZFS "doesn't
> need" a fsck utility, however I would be inclined to disagree, if not a
> full on fsck then something that can patch it up to the point where I
> can mount it and then get some data off or run a scrub.

Probably not.  fsck only repairs metadata, it does not restore or correct
data.  If the data is gone or damaged, then there isn't much ZFS could
do, since ZFS was not in control of the data redundancy (by default, ZFS
metadata is redundant).

BTW, another good sanity test is to try to read the ZFS labels:
    zdb -l /dev/rdsk/...

Cindy, I note that we don't explicitly address the case where the pool
cannot be imported in the Troubleshooting and Data Recovery chapter of
the ZFS administration guide.  Can we put this on the todo list?
 -- richard
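A concrete form of that label check might be the following; the slice
name is a guess, so use whichever slice the pool was actually built on:

  zdb -l /dev/rdsk/c2t9d0s0

ZFS keeps four copies of the label, two at the front of the device and
two at the end, so seeing fewer than four readable labels here, or labels
whose txg values disagree wildly, would help narrow down where the damage
is.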
Tom Bird wrote:> Richard Elling wrote: > > >> I see no evidence that the data is or is not correct. What we know is that >> ZFS is attempting to read something and the device driver is returning EIO. >> Unfortunately, EIO is a catch-all error code, so more digging to find the >> root cause is needed. >> > > I''m currently checking the whole LUN, although as a 42TB unit this will > take a few hours so we''ll see how that is tomorrow. > > >> However, I will bet a steak dinner that if this device was mirrored to >> another, >> the pool will import just fine, with the affected device in a faulted or >> degraded >> state. >> > > On any other file system though, I could probably kick off a fsck and > get back most of the data. I see the argument a lot that ZFS "doesn''t > need" a fsck utility, however I would be inclined to disagree, if not a > full on fsck then something that can patch it up to the point where I > can mount it and then get some data off or run a scrub. > >From the ZFS Administration Guide, Chapter 11, Data Repair section: Given that the fsck utility is designed to repair known pathologies specific to individual file systems, writing such a utility for a file system with no known pathologies is impossible. Future experience might prove that certain data corruption problems are common enough and simple enough such that a repair utility can be developed, but these problems can always be avoided by using redundant pools. If your pool is not redundant, the chance that data corruption can render some or all of your data inaccessible is always present. If you go through the archives you should find similar conversations. -- richard
> From the ZFS Administration Guide, Chapter 11, Data Repair section:
> Given that the fsck utility is designed to repair known pathologies
> specific to individual file systems, writing such a utility for a file
> system with no known pathologies is impossible.

That's a fallacy (and is incorrect even for the UFS fsck; refer to the
McKusick/Kowalski paper and the distinction they make between 'expected'
corruptions and other inconsistencies).

First, there are two types of utilities which might be useful in the
situation where a ZFS pool has become corrupted.  The first is a file
system checking utility (call it zfsck); the second is a data recovery
utility.  The difference between those is that the first tries to bring
the pool (or file system) back to a usable state, while the second simply
tries to recover the files to a new location.

What does a file system check do?  It verifies that a file system is
internally consistent, and makes it consistent if it is not.  If ZFS were
always consistent on disk, then only a verification would be needed.
Since we have evidence that it is not always consistent in the face of
hardware failures, at least, repair may also be needed.  This doesn't
need to be that hard.  For instance, the space maps can be reconstructed
by walking the various block trees; the uberblock effectively has several
backups (though it might be better in some cases if an older backup were
retained); and the ZFS checksums make it easy to identify block types and
detect bad pointers.  Files can be marked as damaged if they contain
pointers to bad data; directories can be repaired if their hash
structures are damaged (as long as the names and pointers can be
salvaged); etc.  Much more complex file systems than ZFS have file system
checking utilities, because journaling, COW, etc. don't help you in the
face of software bugs or certain classes of hardware failures.

A recovery tool is even simpler, because all it needs to do is find a
tree root and then walk the file system, discovering directories and
files, verifying that each of them is readable by using the checksums to
check intermediate and leaf blocks, and extracting the data.  The tricky
bit with ZFS is simply identifying a relatively new root, so that the
newest copy of the data can be identified.

Almost every file system starts out without an fsck utility, and
implements one once it becomes obvious that "sorry, you have to
reinitialize the file system" -- or worse, "sorry, we lost all of your
data" -- is unacceptable to a certain proportion of customers.
> As others have explained, if ZFS does not have a
> config with data redundancy - there is not much that
> can be learned - except that it "just broke".

Plenty can be learned by just looking at the pool.  Unfortunately ZFS
currently doesn't have tools which make that easy; as I understand it,
zdb doesn't work (in a useful way) on a pool which won't import, so
dumping out the raw data structures and looking at them by hand is the
only way to determine what ZFS doesn't like and deduce what went wrong
(and how to fix it).
On Wed, Aug 06, 2008 at 02:23:44PM -0400, Will Murnane wrote:
> On Wed, Aug 6, 2008 at 13:57, Miles Nordin <carton at ivy.net> wrote:
> > If that's really the excuse for this situation, then ZFS is not
> > ``always consistent on the disk'' for single-VDEV pools.
> Well, yes.  If data is sent, but corruption somewhere (the SAS bus,
> apparently, here) causes bad data to be written, ZFS can generally
> detect but not fix that.  It might be nice to have a "verifywrites"
> mode or something similar to make sure that good data has ended up on
> disk (at least at the time it checks), but failing that there's not
> much ZFS (or any filesystem) can do.  Using a pool with some level of
> redundancy (mirroring, raidz) at least gives zfs a chance to read the
> missing pieces from the redundancy that it's kept.

There's also ditto blocks.  So even on a one-vdev pool ZFS can recover
from random corruption unless you're really unlucky.

Of course, this is a feature.  Without ZFS the OP would have had silent,
undetected (by the OS, that is) data corruption.

Basically you don't want to have one-vdev pools.  If you'll use HW RAID
then you should also do mirroring at the ZFS layer.

Nico
-- 
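To make Nico's two suggestions concrete (pool and device names are
illustrative, and the copies property is only available on reasonably
recent ZFS releases):

  # mirror two hardware LUNs at the ZFS layer
  zpool create content mirror c2t9d0 c2t10d0

  # or, on an existing single-LUN pool, keep extra copies of user data
  # (metadata already has ditto copies by default)
  zfs set copies=2 content

Neither protects against losing the only LUN outright, but both give ZFS
a second copy to read from when a checksum fails.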
On Wed, Aug 06, 2008 at 03:44:08PM -0400, Miles Nordin wrote:
> >>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>
>     c> If that's really the excuse for this situation, then ZFS is
>     c> not ``always consistent on the disk'' for single-VDEV pools.
>
>     re> I disagree with your assessment.  The on-disk format (any
>     re> on-disk format) necessarily assumes no faults on the media.
>
> The media never failed, only the connection to the media.  We've every
> good reason to believe that every CDB that the storage controller
> acknowledged as complete, was completed and is still there---and that
> is the only statement which must be true of unfaulty media.  We've no
> strong reason to doubt it.

zdb should be able to pinpoint the problem, no?
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:

    re> If your pool is not redundant, the chance that data
    re> corruption can render some or all of your data inaccessible is
    re> always present.

1. data corruption != unclean shutdown

2. other filesystems do not need a mirror to recover from unclean
   shutdown.  They only need it for when disks fail, or for when disks
   misremember their contents (silent corruption, as in the NetApp
   paper).  I would call data corruption and silent corruption the same
   thing: what the CKSUM column was _supposed_ to count, though not in
   fact the only thing it counts.

3. saying ZFS needs a mirror to recover from unclean shutdown does not
   agree with the claim ``always consistent on the disk''

4. I'm not sure exactly your position.  Before you were saying what Erik
   warned about doesn't happen, because there's no CR, and Tom must be
   confused too.  Now you're saying of course it happens, ZFS's claims of
   ``always consistent on disk'' count for nothing unless you have pool
   redundancy.  And that is exactly what I said to start with:

    re> In general, ZFS can only repair conditions for which it owns
    re> data redundancy.

    c> If that's really the excuse for this situation, then ZFS is
    c> not ``always consistent on the disk'' for single-VDEV pools.

that is the take-home message?

If so, it still leaves me with the concern, what if the breaking of one
component in a mirrored vdev takes my system down uncleanly?  This seems
like a really plausible failure mode (as Tom said, ``the inevitable
kernel panic'').  In that case, I no longer have any redundancy when the
system boots back up.  If ZFS calls the inconsistent states through which
it apparently sometimes transitions pools ``data corruption'' and depends
on redundancy to recover from them, then isn't it extremely dangerous to
remove power or SAN connectivity from any DEGRADED pool?  The pool should
be rebuilt onto a hot spare IMMEDIATELY so that it's ONLINE as soon as
possible, because if ZFS loses power with a DEGRADED pool all bets are
off.

If this DEGRADED-pool unclean shutdown is, as you say, a completely
different scenario from single-vdev pools that isn't dangerous and has no
trouble with ZFS corruption, then no one should ever run a single-vdev
pool.  We should instead run mirrored vdevs that are always DEGRADED,
since this configuration looks identical to everything outside ZFS but
supposedly magically avoids the issue.  If only we had some way to attach
to vdevs fake mirror components that immediately get marked FAULTED, then
we can avoid the corruption risk.  But, that's clearly absurd!

so, let's say ZFS's requirement is, as we seem to be describing it: might
lose the whole pool if your kernel panics or you pull the power cord in a
situation without redundancy.  Then I think this is an extremely serious
issue, even for redundant pools.  It is very plausible that a machine
will panic or lose power during a resilver.

And if, on the other hand, ZFS doesn't transition disks through
inconsistent states and then excuse itself calling what it did ``data
corruption'' when it bites you after an unclean shutdown, then what
happened to Erik and Tom?

It seems to me it is ZFS's fault and can't be punted off to the
administrator's ``asking for it.''
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

    nw> Without ZFS the OP would have had silent, undetected (by the
    nw> OS that is) data corruption.

It sounds to me more like the system would have panicked as soon as he
pulled the cord, and when it rebooted, it would have rolled the UFS log
and mounted, without even an fsck, with no corruption at all, silent or
otherwise.

Note that the storage controller never even lost power, and does not
appear to be faulty.
Hi Tom and all,

Tom Bird wrote:
> Hi,
>
> Have a problem with a ZFS on a single device, this device is 48 1T SATA
> drives presented as a 42T LUN via hardware RAID 6 on a SAS bus, with ZFS
> on it as a single device.
>
> There was a problem with the SAS bus which caused various errors
> including the inevitable kernel panic, the thing came back up with 3 out
> of 4 zfs mounted.

It would be nice to see a panic stack.

> I've tried reading the partition table with format, works fine, also can
> dd the first 100G from the device quite happily so the communication
> issue appears resolved however the device just won't mount.  Googling
> around I see that ZFS does have features designed to reduce the impact
> of corruption at a particular point, multiple metadata copies and so
> on, however commands to help me tidy up a zfs will only run once the
> thing has been mounted.
>
> Would be grateful for any ideas, relevant output here:
>
> root at cs3:~# zpool import
>   pool: content
>     id: 14205780542041739352
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
>         content     FAULTED   corrupted data
>           c2t9d0    ONLINE
>
> root at cs3:~# zpool import content
> cannot import 'content': pool may be in use from other system
> use '-f' to import anyway
>
> root at cs3:~# zpool import -f content
> cannot import 'content': I/O error

As long as it does not panic and just returns an I/O error, which is
rather generic, you may try to dig a little bit deeper with DTrace to
have a chance to see where this I/O error is generated first, e.g.
something like this with the attached dtrace script:

dtrace -s /path/to/script -c "zpool import -f content"

It would also be interesting to know what impact the SAS bus problem had
on the storage controller.  Btw, what is the storage controller in
question here?

> root at cs3:~# uname -a
> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200

Btw, have you considered opening a support call for this issue?

hth,
victor
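Victor's zpool.d attachment was not preserved in the archive.  Purely as a
guess at the kind of thing such a script might do, one that flags
functions in the zfs kernel module returning EIO while the import runs
could look like this (the predicate will also match unrelated functions
that happen to return 5, so treat the output as a hint, not proof):

  #!/usr/sbin/dtrace -s
  /* print any zfs-module function that returns EIO (5),
     plus the kernel stack at that point */
  fbt:zfs::return
  /arg1 == EIO/
  {
          printf("%s returned EIO\n", probefunc);
          stack();
  }

Run it against the failing import, e.g.:

  dtrace -s zfs_eio.d -c "zpool import -f content"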
> Would be grateful for any ideas, relevant output here:
>
> root at cs3:~# zpool import
>   pool: content
>     id: 14205780542041739352
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
>         content     FAULTED   corrupted data
>           c2t9d0    ONLINE
>
> root at cs3:~# zpool import content
> cannot import 'content': pool may be in use from other system
> use '-f' to import anyway
>
> root at cs3:~# zpool import -f content
> cannot import 'content': I/O error

If you have a DVD with recent SXCE bits handy you may try to use zdb to
get more details:

zdb -e -dddd content

victor
Hi folks, Miles, I don''t know if you have more information about this problem than I''m seeing, but from what Tom wrote I don''t see how you can assume this is such a simple problem as an unclean shutdown? Tom wrote "There was a problem with the SAS bus which caused various errors including the inevitable kernel panic". It''s the various errors part that catches my eye, to me this means the SAS bus could have caused bad data to be written to disk for some time before the kernel panic, and that is far more serious to a filesystem than a simple power cut. Can fsck always recover a disk? Or if the corruption is severe enough, are there times when even that fails? I don''t see that we have enough information here to really compare ZFS with UFS although I do agree that some kind of ZFS repair tool sounds like it would be useful. The problem for me is that I don''t know enough about the low level stuff to really have an informed opinion on that. To me, it sounds like Sun have designed ZFS to always know if there is corruption on the disk, and to write data in a way that corruption of the whole filesystem *should* never happen. But I also feel that there are times that hardware can fail in strange ways, and there''s always a chance that a pool could become corrupted due to hardware error in a way that prevents it being mounted. While I can see where Sun are coming from in that they''ve designed ZFS to engineer around these problems, and avoid the need to repair filesystems by using mirroring, multiple copies, etc. I do think a fsck like utility that can try to mount a failed system would be good for ZFS, it''s certainly worth somebody who knows ZFS sitting down and thinking "how can we recover a pool if we know that X is corrupted", where X refers to any of the core pieces ZFS needs on its disk(s). Ross This message posted from opensolaris.org
Miles Nordin wrote:
>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes:
>
>     tb> There was a problem with the SAS bus which caused various
>     tb> errors including the inevitable kernel panic, the thing came
>     tb> back up with 3 out of 4 zfs mounted.
>
>     re> In general, ZFS can only repair conditions for which it owns
>     re> data redundancy.
>
> If that's really the excuse for this situation, then ZFS is not
> ``always consistent on the disk'' for single-VDEV pools.

This is the wrong implication.  ZFS does not write new (meta)data over
currently allocated blocks; this is how on-disk consistency is achieved.

Recovery from corruption is another thing.  Data that is read back may
not be the data which was written in the first place, and ZFS has a
facility to detect this - checksums.  If there's more than one copy,
there is a good chance that another copy may be good.  If there's only
one copy, there's not much to do besides returning an I/O error.

There's another failure scenario as well - (meta)data may be corrupted in
memory before it is checksummed and written to disk.  In this case, no
matter how many copies are stored on disk, all of them are incorrect,
though they may still checksum properly.

> There was no loss of data here, just an interruption in the connection
> to the target, like power loss or any other unplanned shutdown.

Unfortunately this is an assumption only.  Saying that there was no loss
of data, you assume that the storage controller is bug-free or was not
affected by the SAS bus issues in any way.  This may not be the case.
But it is impossible to tell with the provided data.

victor
Anton B. Rang writes:
> dumping out the raw data structures and looking at
> them by hand is the only way to determine what
> ZFS doesn't like and deduce what went wrong (and
> how to fix it).

http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf

:-)

-- 
------------------------------------------------------------------------
Volker A. Brandt              Consulting and Support for Sun Solaris
Brandt & Brandt Computer GmbH                 WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim             Email: vab at bb-c.de
Handelsregister: Amtsgericht Bonn, HRB 10513  Schuhgröße: 45
Geschäftsführer: Rainer J. H. Brandt und Volker A. Brandt
Hi Richard, Yes, sure. We can add that scenario. What''s been on my todo list is a ZFS troubleshooting wiki. I''ve been collecting issues. Let''s talk soon. Cindy Richard Elling wrote:> Tom Bird wrote: > >> Richard Elling wrote: >> >> >> >>> I see no evidence that the data is or is not correct. What we know >>> is that >>> ZFS is attempting to read something and the device driver is >>> returning EIO. >>> Unfortunately, EIO is a catch-all error code, so more digging to find >>> the >>> root cause is needed. >>> >> >> >> I''m currently checking the whole LUN, although as a 42TB unit this will >> take a few hours so we''ll see how that is tomorrow. >> >> >> >>> However, I will bet a steak dinner that if this device was mirrored >>> to another, >>> the pool will import just fine, with the affected device in a faulted >>> or degraded >>> state. >>> >> >> >> On any other file system though, I could probably kick off a fsck and >> get back most of the data. I see the argument a lot that ZFS "doesn''t >> need" a fsck utility, however I would be inclined to disagree, if not a >> full on fsck then something that can patch it up to the point where I >> can mount it and then get some data off or run a scrub. >> >> > > > Probably not. fsck only repairs metadata, it does not restore or correct > data. If the data is gone or damaged, then there isn''t much ZFS could > do, since ZFS was not in control of the data redundancy (by default, > ZFS metadata is redundant). > > BTW, another good sanity test is to try to read the ZFS labels: > zdb -l /dev/rdsk/... > > Cindy, I note that we don''t explicitly address the case where the pool > cannot be imported in the Troubleshooting and Data Recovery chapter > of the ZFS administration guide. Can we put this on the todo list? > -- richard >
>>>>> "r" == Ross <myxiplx at hotmail.com> writes:r> Tom wrote "There was a problem with the SAS bus which caused r> various errors including the inevitable kernel panic". It''s r> the various errors part that catches my eye, yeah, possibly, but there are checksums on the SAS bus, and its confirmation of what CDB''s have completed should always be accurate. If the problem was ``another machine booted up, and I told the other machine to ''zpool import -f'' '''' then maybe you have some point. but just tripping over a cable shouldn''t qualify as weird, nor should Erik''s problem of the FC array losing power or connectivity. These are both within the ``unclean shutdown'''' category handled by UFS+log, FFS+softdep, ext3, reiser, xfs, vxfs, jfs, HFS+, ... r> Can fsck always recover a disk? Or if the corruption is r> severe enough, are there times when even that fails? This question is obviously silly. write zeroes over the disk, and now the corruption is severe enough. However fsck can always recover a disk from a kernel panic, or a power failure of the host or of the disks, because these things don''t randomly scribble over the disk. (now, yeah, I know I posted earlier a story from Ted Ts''o about SGI hardware and about random disk scribbling as the 5V rail started drooping. yes, I posted that one. but it doesn''t happen _that much_. and it doesn''t even apply to Tom and Erik''s case of a loose SAS cable or tripping over an FC cord.) If the kernel panic was caused by a bug in the filesystem, then you''ll say aHA! aaHAh! but then, then it might do the scribbling! Well, yes. so in that case we agree there''s a bug in the filesystem. :) You''ll say ``but WHAT if the kernel panic was a bug in the DISK DRIVER, eh? eh, then maybe ZFS is not at fault!'''' sure, fine, read on. r> I don''t see that we have enough information here to really r> compare ZFS with UFS what we certainly have, between Tom and Erik and my own experience with resilvering-related errors accumulating in the CKSUM column when iSCSI targets go away, is enough information that ``you should have had redundant pools'''' doesn''t settle the issue. Reports of zpool corruption on single vdev''s mounted over SAN''s would benefit from further investigation, or at least a healthily-suspicious scientific attitude that encourages someone to investigate this if it happens in more favorable conditions, such as inside Sun, or to someone with a support contract and enough time to work on a case (maybe Tom?), or someone who knows ZFS well like Pavel. Also, there is enough concern for people designing paranoid systems to approach them with the view, ``ZFS is not always-consistent-on-disk unless it has working redundancy''''---choosing to build a ZFS system the same way as a UFS system without ZFS-level redundancy, based on our experience so far, is not just foregoing some of ZFS''s whizz-bang new feeechurs. It''s significantly less safe than the UFS system. For as long as the argument remains unsettled, conservative people need to understand that. Conservative people should also understand point (c) below. It sounds to me like Tom''s and Erik''s problems are more likely ZFS''s fault than not. The dialog has gone like this: 1. This isn''t within the class of errors ZFS should handle. get redundancy. 2. It sounds to me exactly like the class of error ZFS is supposed to handle. 3. You cannot prove 100% that this is necessarily the class of error ZFS is supposed to handle. Somethinig else might have happened. 
BTW, did I tell you how good ZFS (sometimes) is at dealing with ``might have happened'''' if you give it redundancy? It''s new, and exciting, and unprecedented! Is that a rabbit over there? Look, a redheaded girl juggling frisbies! What next, you''ll drag out screaming Dick Cheney on a chain? Recapping my view: a. it looks like a ZFS problem (okay, okay, PROBABLY a zfs problem) b. it''s a big problem c. there''s no good reason to believe people with redundant pools are immune from it, because they will run into it when they need their redundancy to cover a broken disk. It also deserves more testing by me: I''m going to back up my smaller ''aboveground'' pool and try to provoke it. r> although I do agree that some kind of ZFS repair tool r> sounds like it would be useful. I don''t want to dictate architecture when I don''t know the internals well. What''s immediately important to me is that ZFS handle unclean shutdown rigorously, as most other filesystems claim to and eventually mostly accomplish. This could be adding an fsck tool, but more likely it will be simply fixing a bug. Old computers had to bring up their swap space before fsck''ing big filesystems because the fsck process needed so much memory. The filesystem implementation was a small text of fragile code that would panic if it read the wrong bits from the disk, but it was fast and didn''t take much memory. It made sense to split the filesystem into two pieces, the fsck piece and the main piece, to conserve the machine''s core (and make the programming simpler). We have plenty of memory for text segments now, so it might make more sense to build fsck into the filesystem. The filesystem should be able to mount any state you would expect a hypothetical fsck tool to handle, and mount it almost immediately, and correct any ``errors'''' it finds while running. If you want to proactively correct errors, it should do this while mounted. That was the original ZFS pitch, and I think it''s not crazy. It''s basically what we''re supposed to have now with the ``always consistent on disk'''' claim and ''zpool scrub'' O(n)? online fsck-equivalent. FFS+softdep sort of works this way, too. It''s designed to safely mount ``unclean'''' filesystems, so in that sense, it''s ``always consistent.'''' It does not roll a log, because there isn''t one---it just mounts the filesystem as it was when the cord was pulled, and it can do this with no risk of kernel panicing or odd behavior to userland because of the careful order in which it writes data before the panic. However, after an unclean shutdown, the filesystem is still considered dirty even though it mounts and works. FreeBSD then starts the old fsck tool in the background. The fsck is still O(n^2). so...FFS+softdep sort of follows the new fsck-less model where the filesystem is one unified piece that does all its work after mounting, but follows it clumsily because it''s reusing the old FFS code and on-disk format. To my non-developer perspective, there seem to be the equivalent of mini-FFS+softdep-style fsck''s inside ZFS already. Sometimes when a mirror component goes away, ZFS does (what looks in ''zpool status'' like) a mini-resilver on the remaining component. There''s no redundancy in the vdev, so there''s nothing to actually resilver. Maybe this has to do with the quorum rules or the (seemingly broken) dirty region logging, both of which I still don''t understand. 
And there is also my old problem of 'zpool offline' reporting ``no valid
replicas'', until I've done a scrub, after which 'zpool offline' works
again, so a scrub is not really a purely proactive thing: buried inside
ZFS there is some notion of dirtiness preventing my 'zpool offline', and
a successful scrub clears the dirty bit (as do, possibly, other things,
like rebooting :( ).

so, the architecture might be fine as-is since scrub is already a little
more than what it claims to be, and is doing some sort of metadata or
RAID-level fsck-ing.  I wouldn't expect that the fix for these corrupt
single-vdev pools come in some specific form based on prejudices from
earlier filesystems.

Now there is another tool Anton mentioned, a recovery tool or forensic
tool: one that leaves the filesystem unmounted, treats the disks as
read-only, and tries to copy data out of it onto a new filesystem.  If
there were going to be a separate tool---say, something to handle disks
that have been scribbled on, or fixes for problems that are really tricky
or logically inappropriate to deal with on the mounted filesystem---I
think a forensic/recovery tool makes more sense than an fsck.  If this
odd stuff isn't supposed to happen, and it has happened anyway, you want
a tool you can run more than once.  You want the chance to improve the
tool and run it again, or to try an older version of the tool if the
current one keeps crashing.  I'm just really far from convinced that Tom
needs this tool.

    r> To me, it sounds like Sun have designed ZFS to always know if
    r> there is corruption on the disk, and to write data in a way
    r> that corruption of the whole filesystem *should* never happen.

sounds like depends on to what you're listening.  If you're listening to
Sun's claims, then yes, of course that's exactly what they claim.  If
you're listening to experience on this list, it sounds different.  The
closest we've come is, we agree I haven't completely invalidated the
original claims, which is pretty far from making me believe them again.
Anton B. Rang wrote:>> From the ZFS Administration Guide, Chapter 11, Data Repair section: >> Given that the fsck utility is designed to repair known pathologies >> specific to individual file systems, writing such a utility for a file >> system with no known pathologies is impossible. >> > > That''s a fallacy (and is incorrect even for the UFS fsck; refer to the McKusick/Kowalski paper and the distinction they make between ''expected'' corruptions and other inconsistencies). > > First, there are two types of utilities which might be useful in the situation where a ZFS pool has become corrupted. The first is a file system checking utility (call it zfsck); the second is a data recovery utility. The difference between those is that the first tries to bring the pool (or file system) back to a usable state, while the second simply tries to recover the files to a new location. >Hi Anton, How would you describe the difference between the file system checking utility and zpool scrub? Is zpool scrub lacking in its verification of the data? How would you describe the difference between the data recovery utility and ZFS''s normal data recovery process?> What does a file system check do? It verifies that a file system is internally consistent, and makes it consistent if it is not. If ZFS were always consistent on disk, then only a verification would be needed. Since we have evidence that it is not always consistent in the face of hardware failures, at least, repair may also be needed. This doesn''t need to be that hard. For instance, the space maps can be reconstructed by walking the various block trees; the uberblock effectively has several backups (though it might be better in some cases if an older backup were retained); and the ZFS checksums make it easy to identify block types and detect bad pointers. Files can be marked as damaged if they contain pointers to bad data; directories can be repaired if their hash structures are damaged (as long as the names and pointers can be salvaged); etc. Much more complex file systems than ZFS have file system checking utilities, because journaling, COW, etc. don''t help you in !the> face of software bugs or certain classes of hardware failures. > > A recovery tool is even simpler, because all it needs to do is find a tree root and then walk the file system, discovering directories and files, verifying that each of them is readable by using the checksums to check intermediate and leaf blocks, and extracting the data. The tricky bit with ZFS is simply identifying a relatively new root, so that the newest copy of the data can be identified. > > Almost every file system starts out without an fsck utility, and implements one once it becomes obvious that "sorry, you have to reinitialize the file system" -- or worse, "sorry, we lost all of your data" -- is unacceptable to a certain proportion of customers. > >Nobody thinks that an answer of "sorry, we lost all of your data" is acceptable. However, there are failures which will result in loss of data no matter how clever the file system is. But people will still believe their hardware is infallible and refuse to configure ZFS to be able to repair their data. You can only push a rope so far... -- richard
On Thu, 7 Aug 2008, Miles Nordin wrote: I must apologize that I was not able to read your complete email due to local buffer overflow ...> someone who knows ZFS well like Pavel. Also, there is enough concern > for people designing paranoid systems to approach them with the view, > ``ZFS is not always-consistent-on-disk unless it has working > redundancy''''---choosing to build a ZFS system the same way as a UFS > system without ZFS-level redundancy, based on our experience so far, > is not just foregoing some of ZFS''s whizz-bang new feeechurs. It''s > significantly less safe than the UFS system. For as long as the > argument remains unsettled, conservative people need to understand > that. Conservative people should also understand point (c) below.I don''t think that non-redundant ZFS can be classified as "significantly less safe than the UFS system". It seems that the world has little experience with 48TB single-LUN UFS filesystems, if indeed that is even possible. I would hate to wait for fsck of 48TB since some of the disks might wear out and need to be replaced before it completes. According to your logic, AIDS was safer before people were routinely tested (http://en.wikipedia.org/wiki/AIDS#HIV_test) to see if they were HIV positive. With ZFS you may learn that you have contracted AIDs within minutes of the event while with UFS you might not know until your immune system is beyond salvaging and the family is crying at your bed. Apparently you are in the "prefer not to know" group. The largest UFS filesystems I have here are under 120GB and even at that size they make me uneasy since I know that the data can silently fail (bad), or be read incorrectly (worse) and that if fsck is needed, it might take hours. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Thu, 2008-08-07 at 11:34 -0700, Richard Elling wrote:> How would you describe the difference between the data recovery > utility and ZFS''s normal data recovery process?I''m not Anton but I think I see what he''s getting at. Assume you have disks which once contained a pool but all of the uberblocks have been clobbered. So you don''t know where the root of the block tree is, but all the actual data is there, intact, on the disks. Given the checksums you could rebuild one or more plausible structure of the pool from the bottom up. I''d think that you could construct an offline zpool data recovery tool where you''d start with N disk images and a large amount of extra working space, compute checksums of all possible data blocks on the images, scan the disk images looking for things that might be valid block pointers, and attempt to stitch together subtrees of the filesystem and recover as much as you can even if many upper nodes in the block tree have had holes shot in them by a miscreant device. - Bill
[I think Miles and I seem to be talking about two different topics] Miles Nordin wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> > > re> If your pool is not redundant, the chance that data > re> corruption can render some or all of your data inaccessible is > re> always present. > > 1. data corruption != unclean shutdown >Agree. One is a state, the other is an event.> 2. other filesystems do not need a mirror to recover from unclean > shutdown. They only need it for when disks fail, or for when disks > misremember their contents (silent corruption, as in NetApp paper). >Agree. ZFS fits this category.> I would call data corruption and silent corruption the same thing: > what the CKSUM column was _supposed_ to count, though not in fact > the only thing it counts. >Agree. Data corruption takes two forms: detectable and undetectable (aka silent).> 3. saying ZFS needs a mirror to recover from unclean shutdown does not > agree with the claim ``always consistent on the disk'''' >Disagree. We test ZFS with unclean shutdowns all of the time and it works fine. However, if there is data corruption, then it may be possible that ZFS cannot recover unless there is a surviving copy of the good data. This is what mirrors and raidz do.> 4. I''m not sure exactly your position. Before you were saying what > Erik warned about doesn''t happen, because there''s no CR, and Tom > must be confused too. Now you''re saying of course it happens, > ZFS''s claims of ``always consistent on disk'''' count for nothing > unless you have pool redundancy. >No, I''m saying that data corruption without a surviving good copy of the data may lead to an unrecoverable data condition.> > And that is exactly what I said to start with: > > re> In general, ZFS can only repair conditions for which it owns > re> data redundancy. > > c> If that''s really the excuse for this situation, then ZFS is > c> not ``always consistent on the disk'''' for single-VDEV pools. > > that is the take-home message? >ZFS is always consistent on disk. If there is data corruption, then all bets are off, no matter what file system you choose.> If so, it still leaves me with the concern, what if the breaking of > one component in a mirrored vdev takes my system down uncleanly? This > seems like a really plausible failure mode (as Tom said, ``the > inevitable kernel panic''''). >Tom has not provided any data as to why the kernel panic''ed. Panic messages, as a minimum, would be enlightening.> In that case, I no longer have any redundancy when the system boots > back up. If ZFS calls the inconsistent states through which it > apparently sometimes transitions pools ``data corruption'''' and depends > on redundancy to recover from them, then isn''t it extremely dangerous > to remove power or SAN connectivity from any DEGRADED pool? The pool > should be rebuilt onto a hot spare IMMEDIATELY so that it''s ONLINE as > soon as possible, because if ZFS loses power with a DEGRADED pool all > bets are off. >In Tom''s case, ZFS was not configured such that it could rebuild a failed vdev on a hot spare.> If this DEGRADED-pool unclean shutdown is, as you say, a completely > different scenario from single-vdev pools that isn''t dangerous and has > no trouble with ZFS corruption, then no one should ever run a > single-vdev pool. We should instead run mirrored vdevs that are > always DEGRADED, since this configuration looks identical to > everything outside ZFS but supposedly magically avoids the issue. 
If > only we had some way to attach to vdevs fake mirror components that > immediately get marked FAULTED then we can avoid the corruption risk. > But, that''s clearly absurd! >Fast, reliable, inexpensive: pick two.> so, let''s say ZFS''s requirement is, as we seem to be describing it: > might lose the whole pool if your kernel panics or you pull the power > cord in a situation without redundancy. Then I think this is an > extremely serious issue, even for redundant pools.Agree. But in Tom''s case, there is no proof that the fault condition is cleared. The fact that zpool import fails with an I/O error is a strong indicator that the fault is still present. We do not yet know if there is a data corruption issue or not.> It is very > plausible that a machine will panic or lose power during a resilver. >I think this is an unfounded statement. There are many cases where resilvers complete successfully. In our data reliability models, we have a parameter for the probability of [un]successful resilver, but all of our research in determining a value for this centers around actual data loss or corruption in the devices. Do you have research that points to another cause?> And if, on the other hand, ZFS doesn''t transition disks through > inconsistent states and then excuse itself calling what it did ``data > corruption'''' when it bites you after an unclean shutdown, then what > happened to Erik and Tom? >I have no idea what happened to Erik. His post makes claims of loss followed by claims of unfixed, known problems, but no real pointer to bugids. Hence my comment about his post being of the "your baby is ugly" variety. At least point out the mole in the middle of the forehead, aka CR???> It seems to me it is ZFS''s fault and can''t be punted off to the > administrator''s ``asking for it.'''' >I think the jury is still out. Tom needs to complete his tests and provide the messages and FMA notifications so that a root cause can be determined. Meanwhile, we''ll work on putting together some docs on how to proceed when your pool can''t be imported, because it would be good to have. And, as Anton notes, we can''t scrub the pool if we can''t import the pool. -- richard
On Thu, Aug 07, 2008 at 11:34:12AM -0700, Richard Elling wrote:> Anton B. Rang wrote: > > First, there are two types of utilities which might be useful in the situation where a ZFS pool has become corrupted. The first is a file system checking utility (call it zfsck); the second is a data recovery utility. The difference between those is that the first tries to bring the pool (or file system) back to a usable state, while the second simply tries to recover the files to a new location. > > > Hi Anton, > How would you describe the difference between the file system > checking utility and zpool scrub? Is zpool scrub lacking in its > verification of the data?One thing I can think of is that scrub is only available on an imported pool, and you can only import a writable pool. It might be nice to verify that a read-only image is valid. Or to import/mount a pool on damaged media for recovery.> How would you describe the difference between the data recovery > utility and ZFS''s normal data recovery process?If the most recent uberblock appears valid, but doesn''t have useful data, I don''t think there''s any way currently to see what the tree of an older uberblock looks like. It would be nice to see if that data appears valid and try to create a view that would be readable/recoverable. -- Darren
Miles Nordin wrote:

>>>>>> "r" == Ross <myxiplx at hotmail.com> writes:
>
> r> Tom wrote "There was a problem with the SAS bus which caused
> r> various errors including the inevitable kernel panic". It's
> r> the various errors part that catches my eye,
>
> yeah, possibly, but there are checksums on the SAS bus, and its
> confirmation of what CDBs have completed should always be accurate.

But there's more to it than that -- there's a storage controller behind the SAS bus with its cache, and loads of disks behind that. Even though there are checksums on the SAS bus, and the storage controller should not have lost or damaged any of its cache, there's still a possibility for a disk to drop a write on the floor silently or misdirect it, or for the storage controller itself to be configured in such a way that it does not guarantee data protection all the time...

> If the problem was ``another machine booted up, and I told the other
> machine to 'zpool import -f' '' then maybe you have some point. but
> just tripping over a cable shouldn't qualify as weird, nor should
> Erik's problem of the FC array losing power or connectivity. These
> are both within the ``unclean shutdown'' category handled by UFS+log,
> FFS+softdep, ext3, reiser, xfs, vxfs, jfs, HFS+, ...

Does forceful removal of power count as unclean shutdown? If yes, I do it several times a day to my notebook with ZFS root. I'm typing this from it, booted just fine from ZFS after another unclean shutdown.

> r> Can fsck always recover a disk? Or if the corruption is
> r> severe enough, are there times when even that fails?
>
> This question is obviously silly. write zeroes over the disk, and now
> the corruption is severe enough. However fsck can always recover a
> disk from a kernel panic, or a power failure of the host or of the
> disks, because these things don't randomly scribble over the disk.

I have an image of a UFS filesystem which passes fsck just fine but then panics the system as soon as writes are started.

> Reports of zpool
> corruption on single vdevs mounted over SANs would benefit from
> further investigation, or at least a healthily-suspicious scientific
> attitude that encourages someone to investigate this if it happens in
> more favorable conditions, such as inside Sun, or to someone with a
> support contract and enough time to work on a case (maybe Tom?),

The problem is that such reports often do not have enough details, and investigation in such cases can take lots of time and yield nothing...

> or
> someone who knows ZFS well like Pavel. Also, there is enough concern
> for people designing paranoid systems to approach them with the view,
> ``ZFS is not always-consistent-on-disk unless it has working
> redundancy''

Again, always-consistent-on-disk is not related to redundancy. On-disk consistency is achieved by not writing new blocks over currently allocated ones, regardless of the redundancy of the underlying vdevs. If the underlying vdevs are redundant, you have a better chance of surviving corruption of data stored on disk.

> Now there is another tool Anton mentioned, a recovery tool or forensic
> tool: one that leaves the filesystem unmounted, treats the disks as
> read-only, and tries to copy data out of it onto a new filesystem. If
> there were going to be a separate tool---say, something to handle disks
> that have been scribbled on, or fixes for problems that are really
> tricky or logically inappropriate to deal with on the mounted
> filesystem---I think a forensic/recovery tool makes more sense than an
> fsck. If this odd stuff isn't supposed to happen, and it has happened
> anyway, you want a tool you can run more than once. You want the
> chance to improve the tool and run it again, or to try an older
> version of the tool if the current one keeps crashing.

Reads in ZFS can be broadly classified into two types:

- ones that are not critical from the ZFS perspective, meaning reads of user data and associated metadata, where ZFS can safely return an I/O error in case of checksum failure;

- ones that are critical from the ZFS perspective -- reads of ZFS metadata required to perform writes; depending on context it may be impossible to return an I/O error, and ZFS has either to panic or act according to the failmode property setting.

So for some cases of non-redundant (and even redundant, where all redundant copies are corrupted, e.g. simultaneous import of a pool from two hosts) pool corruption, it may be enough to import the pool in pure read-only mode, not trying to write anything into the pool (hence not having to read any metadata required to do so), in order to save all the data which can be read. There's an RFE for this feature but I do not have the number handy.

Victor
> How would you describe the difference between the file system
> checking utility and zpool scrub? Is zpool scrub lacking in its
> verification of the data?

To answer the second question first: yes, zpool scrub is lacking, at least to the best of my knowledge (I haven't looked at the ZFS source in a few months). It does not verify that any internal data structures are correct; rather, it simply verifies that data and metadata blocks match their checksums. This makes it useless in situations such as those described in bugs 6458218/6634517, where a pool cannot be imported because its metadata is inconsistent.

It also would not repair a damaged directory, for instance. If a directory in ZFS is damaged, its files become permanently inaccessible; if the same happens in UFS, fsck will create new links to the files in the lost+found directory. It's as if the UFS fsck could only work on mounted file systems, and could tell you that there was a problem, but not fix it.

> How would you describe the difference between the
> data recovery utility and ZFS's normal data recovery process?

What do you consider the "normal data recovery process"? Take, for instance, the pool which Borys just mentioned on this list, which causes a kernel panic at import. I'm not sure how ZFS can recover from that.

A data recovery utility would (for instance) scan the pool, locate a healthy-looking uberblock (or, failing that, look for one or more top-level file system blocks), and traverse the tree down from that point, pulling files from the disk as it goes. When a damaged metadata block is found, a scan can be performed for blocks which are candidates to belong under it; or potential block numbers can be extracted from the damaged block itself.

This message posted from opensolaris.org
Victor Latushkin wrote:> Hi Tom and all, > > Tom Bird wrote: >> Hi, >> >> Have a problem with a ZFS on a single device, this device is 48 1T SATA >> drives presented as a 42T LUN via hardware RAID 6 on a SAS bus which had >> a ZFS on it as a single device. >> >> There was a problem with the SAS bus which caused various errors >> including the inevitable kernel panic, the thing came back up with 3 out >> of 4 zfs mounted. > > It would be nice to see a panic stack.I''m afraid I don''t have that but now have an open connection to the terminal server logging everything in case it should happen again.>> root at cs3:~# zpool import -f content >> cannot import ''content'': I/O error > > As long as it does not panic and just returns I/O error which is rather > generic, you may try to dig a little bit deeper with DTrace to have a > chance to see where this I/O error is generated first, e.g. something > like this with the attached dtrace script: > > dtrace -s /path/to/script -c "zpool import -f content"dtrace output was 6MB, a bit rude to post to the list so I''ve uploaded it here: http://picard.portfast.net/~tom/import.txt> It is also interesting what impact SAS bus problem had on the storage > controller. Btw, what is storage controller in question here?The controller is an LSI Logic PCI express with 2 external SAS ports which runs to an eonstor 2u 12 disk RAID chassis with 3 JBOD packs daisy chained from that. It seems I can''t run the JBODs directly to the SAS controller when using SATA drives (may be a different story with proper SAS) and the RAID box has no JBOD mode so the redundancy has to stay in the box and can''t be transferred to ZFS. The entire faulted array reads cleanly at /dev/rdsk level into /dev/null. There are 4 such arrays connected to the server via two SAS cards with a ZFS on each one, the supplied internal SAS card and an ixgb NIC are the only other cards installed. System boots from the standard internal disks.>> root at cs3:~# uname -a >> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200 > > Btw, have you considered opening support call for this issue?Would have thought that unless they have a secret zfsck utility there''s probably not much they can do. It''s not a Sun disk array or Sun branded SAS card. thanks -- Tom // www.portfast.co.uk -- internet services and consultancy // hosting from 1.65 per domain
I''m not Anton Rang, but: | How would you describe the difference between the data recovery | utility and ZFS''s normal data recovery process? The data recovery utility should not panic my entire system if it runs into some situation that it utterly cannot handle. Solaris 10 U5 kernel ZFS code does not have this property; it is possible to wind up with ZFS pools that will panic your system when you try to touch them. (The same thing is true of a theoretical file system checking utility.) The data recovery utility can ask me questions about what I want it to do in an ambiguous situation, or give me only partial results. The data recovery can be run read-only, so that I am sure that any problems in it are not making my situation worse. | Nobody thinks that an answer of "sorry, we lost all of your data" is | acceptable. However, there are failures which will result in loss of | data no matter how clever the file system is. The problem is that there are currently ways to make ZFS lose all your data when there are no hardware faults or failures, merely people or software mis-handling pools. This is especially frustrating when the only thing that is likely to be corrupted is ZFS metadata and the vast majority (or all) of the data in the pool is intact, readable, and so on. - cks
> | How would you describe the difference between the data recovery
> | utility and ZFS's normal data recovery process?
>
> The data recovery utility should not panic my entire system if it runs
> into some situation that it utterly cannot handle. Solaris 10 U5 kernel
> ZFS code does not have this property; it is possible to wind up with ZFS
> pools that will panic your system when you try to touch them.

I do agree. For the last three weeks I have been testing an ARC-1680 SAS controller with an external cabinet with 16 SAS disks at 1 TB. The server is an E5405 quad-core with 8 GB RAM. Setting the card to JBOD mode gave me a somewhat unstable setup where the disks would stop responding. After I had put all the disks on the controller in passthrough mode the setup did stabilize, and I was able to copy 3.8 of 4 TB of small files before some of the disks bailed out. I brought the disks back online and restarted the server. A zpool status that showed the disks online also told me:

errors: Permanent errors have been detected in the following files:

        ef1/image/z018_16:<0x0>

The zpool consisted of five disks in each of three separate raidz vdevs, plus one spare.

The only time I've experienced that the server could not get the zpool back online was when the disks for some reason failed. I find it completely valid that the server panics rather than write inconsistent data.

Every time our internal file server suffered an unplanned restart (power failure) it always recovered (Solaris 10/08 and zfs ver. 4). But this Sunday, Aug. the 10th, the same file server was brought down by a faulty UPS. When power was restored the zpool had become inconsistent. This time the storage was also affected by the power outage.

Is it a valid point that zfs is able to recover more gracefully when the server itself goes down rather than when some of the disks/LUNs bail out? The reason I ask is that this is the only time I've personally seen zfs unable to recover.

> The data recovery utility can ask me questions about what I want it
> to do in an ambiguous situation, or give me only partial results.

Our NFS server was also on this faulty UPS. It is running Solaris 9 on sparc with vxfs and is managing 109 TB of storage on an HDS. When I switched on the server I saw that it replayed the journal, marked the partition as clean and came online. I know that there is no guarantee that the data are consistent, but at least vxfs has had many years to mature.

I had initially planned to migrate some of the older partitions to zfs and thereby test it. But I've changed that and will try the setup with the ARC-1680 controller and SAS disks as an internal file server instead for a while, and rather add additional storage to our Solaris 9 server and vxfs.

Zfs has changed the way I look at filesystems and I'm very glad that Sun gives it so much exposure. But atm. I'd give vxfs the edge. :-)

--
regards
Claus

When lenity and cruelty play for a kingdom, the gentlest gamester is the soonest winner.
Shakespeare
Claus Guttesen wrote:>> | How would you describe the difference between the data recovery >> | utility and ZFS''s normal data recovery process? >> >> The data recovery utility should not panic my entire system if it runs >> into some situation that it utterly cannot handle. Solaris 10 U5 kernel >> ZFS code does not have this property; it is possible to wind up with ZFS >> pools that will panic your system when you try to touch them. >> > > I do agree. The last three weeks I have been testing an > arc-1680-sas-controller with an external cabinet with 16 sas-disk at 1 > TB. The server is a E5405 quad-core with 8 GB RAM. Setting the card to > jbod-mode gave me a somewhat unstable setup where the disks would stop > responding. After I had put all the disk on the controller in > passthrough-mode the setup did stabilize and I was able to copy 3.8 of > 4 TB of small files when some of the disks bailed out. I brought the > disks back online and restarted the server. A zpool status that show > online disks also told me: > > errors: Permanent errors have been detected in the following files: > ef1/image/z018_16:<0x0> > > The zpool consisted of five disks in three seperate raidz in one zpool > including one spare. > > The only time I''ve experienced that the server could not get the zpool > back online was when the disks for some reason failed. I find it > completely valid that the server panics rather than write inconsisten > data. > > Everytime when our internal file-server suffered a unplanned restart > (power failure) it always recovered (solaris 10/08 and zfs ver. 4). > But this Sunday Aug. the 10''th the same file-server was brought down > by a faulty UPS. When power was restored the zpool had become > inconsistent. This time the storage was also affected by the > power-outage. > > Is it a valid point that zfs is able to recover more gracefully when > the server itself goes down rather than when some of the disks/LUN''s > bails out? The reason I ask is because that is the only time I''ve > personally seen zfs unable to recover. >Later versions of ZFS, not yet in Solaris 10, are much more tolerant of disappearing storage. Solaris 10 update 6 should contain these features later this year. OpenSolaris 2008.05 and SXCE b72 or later already have these features. There is a failure mode that we worry about: ZFS depends on the disk actually writing (flushing) data to nonvolatile storage when ZFS issues the flush request. If that does not actually occur, then you may see the problems you describe. While ZFS distrusts storage better than most file systems, it must still trust a flush request.>> The data recovery utility can ask me questions about what I want it >> to do in an ambiguous situation, or give me only partial results. >> > > Our nfs-server was also on this faulty UPS. This is running solaris 9 > on sparc with vxfs and is managing 109 TB of storage on a HDS. When I > switched on the server I saw the it replayed the journal and marked > the partition as clean and came online. I know that there is no > guarantee that the data are consistent but at least vxfs have had many > years to mature. > > I had initially planned to migrate some of the older partitions to zfs > and thereby test it. But I''ve changed that and will try the setup with > the arc-1680-controller and sas-disks as an internal file-server > instead for a while and rather add additional storage to our solaris > 9-server and vxfs. 
> > Zfs have changed the way I look at filesystems and I''m very glad that > Sun gives it so much exposure. But atm. I''d give vxfs the edge. :-) > >I''ve had excellent experiences with Sun-branded HDS storage: rock solid. For flaky hardware that seems to lose data during a power outage, I''d prefer a file system that can detect that my data is corrupted. -- richard
Richard Elling
2008-Aug-12 00:25 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
Chris Siebenmann wrote:> I''m not Anton Rang, but: > | How would you describe the difference between the data recovery > | utility and ZFS''s normal data recovery process? > > The data recovery utility should not panic my entire system if it runs > into some situation that it utterly cannot handle. Solaris 10 U5 kernel > ZFS code does not have this property; it is possible to wind up with ZFS > pools that will panic your system when you try to touch them. > > (The same thing is true of a theoretical file system checking utility.) > > The data recovery utility can ask me questions about what I want it > to do in an ambiguous situation, or give me only partial results. > > The data recovery can be run read-only, so that I am sure that any > problems in it are not making my situation worse. > > | Nobody thinks that an answer of "sorry, we lost all of your data" is > | acceptable. However, there are failures which will result in loss of > | data no matter how clever the file system is. > > The problem is that there are currently ways to make ZFS lose all your > data when there are no hardware faults or failures, merely people or > software mis-handling pools. This is especially frustrating when the > only thing that is likely to be corrupted is ZFS metadata and the vast > majority (or all) of the data in the pool is intact, readable, and so > on. >As others have noted, the COW nature of ZFS means that there is a good chance that on a mostly-empty pool, previous data is still intact long after you might think it is gone. A utility to recover such data is (IMHO) more likely to be in the category of forensic analysis than a mount (import) process. There is more than enough information publically available for someone to build such a tool (hint, hint :-) -- richard
From: Richard Elling <Richard.Elling at Sun.COM>

Miles Nordin wrote:
>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes:
>...
>
> re> In general, ZFS can only repair conditions for which it owns
> re> data redundancy.

tb> If that's really the excuse for this situation, then ZFS is not
tb> ``always consistent on the disk'' for single-VDEV pools.

re> I disagree with your assessment. The on-disk
re> format (any on-disk format) necessarily assumes
re> no faults on the media. The difference between ZFS
re> on-disk format and most other file systems is that
re> the metadata will be consistent to some point in time
re> because it is COW.
...
tb> There was no loss of data here, just an interruption in the connection
tb> to the target, like power loss or any other unplanned shutdown.
tb> Corruption in this scenario is a significant regression w.r.t. UFS:

re> I see no evidence that the data is or is not correct.
...
re> However, I will bet a steak dinner that if this device
re> was mirrored to another, the pool will import just fine,
re> with the affected device in a faulted or degraded state.

tb> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html

re> I have no idea what Eric is referring to, and it does
re> not match my experience.

We had a similar problem in our environment. We lost a CPU on the server, resulting in metadata corruption and an unrecoverable pool. We were told that we were seeing a known bug that will be fixed in S10u6.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html

From: Tom Bird <tom at marmot.org.uk>

tb> On any other file system though, I could probably kick
tb> off a fsck and get back most of the data. I see the
tb> argument a lot that ZFS "doesn't need" a fsck utility,
tb> however I would be inclined to disagree, if not a
tb> full on fsck then something that can patch it up to the
tb> point where I can mount it and then get some data off or
tb> run a scrub.

If not that, then we need some sort of recovery tool. We ought to be able to have some sort of recovery mode that allows us to read off the known good data or roll back to a snapshot or something.

When you have a really big file system, telling us (as Sun support told us) that our only option was to rebuild the zpool and restore from tape, it becomes really difficult to justify using the product in certain production environments. (For example, consider an environment where the available storage is on a hardware RAID-5 system, and where mirroring large amounts of already RAID-ed space adds up to more cost than a VxFS license. Not every type of data requires more protection than you get with standard hardware-based RAID-5.)

--Scott
Chris Siebenmann <cks at cs.toronto.edu>:

I'm not Anton Rang, but:
| How would you describe the difference between the data recovery
| utility and ZFS's normal data recovery process?

cks> The data recovery utility should not panic
cks> my entire system if it runs into some situation
cks> that it utterly cannot handle. Solaris 10 U5
cks> kernel ZFS code does not have this property;
cks> it is possible to wind up with ZFS pools that
cks> will panic your system when you try to touch them.
...

I'll go you one worse. Imagine a Sun Cluster with several resource groups and several zpools. You blow a proc on one of the servers. As a result, the metadata on one of the pools becomes corrupted.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html

Now, each of the servers in your cluster attempts to import the zpool--and panics.

As a result of a single part failure on a single server, your entire cluster (and all the services on it) are sitting in a smoking heap on your machine room floor.

| Nobody thinks that an answer of "sorry, we lost all of your data" is
| acceptable. However, there are failures which will result in loss of
| data no matter how clever the file system is.

cks> The problem is that there are currently ways to
cks> make ZFS lose all your data when there are no
cks> hardware faults or failures, merely people or
cks> software mis-handling pools. This is especially
cks> frustrating when the only thing that is likely
cks> to be corrupted is ZFS metadata and the vast
cks> majority (or all) of the data in the pool is intact,
cks> readable, and so on.

I'm just glad that our pool corruption experience happened during testing, and not after the system had gone into production. Not exactly a resume-enhancing experience.

--Scott
Wade.Stuart at fallon.com
2008-Aug-12 15:52 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
> As others have noted, the COW nature of ZFS means that there is a
> good chance that on a mostly-empty pool, previous data is still intact
> long after you might think it is gone. A utility to recover such data is
> (IMHO) more likely to be in the category of forensic analysis than
> a mount (import) process. There is more than enough information
> publically available for someone to build such a tool (hint, hint :-)
> -- richard

Veritas, the makers of vxfs, which I consider ZFS to be competing against, has higher-level (normal) support engineers who have access to tools that let them scan the disk for inodes and other filesystem fragments and recover them. When you log a support call on a faulty filesystem (in one such case I was involved in, something zeroed out 100 MB of the first portion of the volume, killing off both top OLTs -- bad, bad), they can actually help you at a very low level dig data out of the filesystem, or even recover from pretty nasty issues. They can scan for inodes (marked by a magic number) and have utilities to pull out files from those inodes (including indirect blocks/extents). Given the tools and help from their support, I was able to pull back 500 GB of files (99%) from a filesystem that EMC killed during a botched PowerPath upgrade. Can Sun's support engineers do the same, or is their answer to pull from tape? (hint, hint ;-)

-Wade
Darren J Moffat
2008-Aug-12 15:59 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
Wade.Stuart at fallon.com wrote:>> As others have noted, the COW nature of ZFS means that there is a >> good chance that on a mostly-empty pool, previous data is still intact >> long after you might think it is gone. A utility to recover such data is >> (IMHO) more likely to be in the category of forensic analysis than >> a mount (import) process. There is more than enough information >> publically available for someone to build such a tool (hint, hint :-) >> -- richard > > Veritas, the makers if vxfs, whom I consider ZFS to be trying to > compete against has higher level (normal) support engineers that have > access to tools that let them scan the disk for inodes and other filesystem > fragments and recover. When you log a support call on a faulty filesystem > (in one such case I was involved in zeroed out 100mb of the first portion > of the volume killing off both top OLT''s -- bad bad) they can actually help > you at a very low level dig data out of the filesystem or even recover from > pretty nasty issues. They can scan for inodes (marked by a magic number), > have utilities to pull out files from those inodes (including indirect > blocks/extents). Given the tools and help from their support I was able to > pull back 500 gb of files (99%) from a filesystem that emc killed during a > botched powerpath upgrade. Can Sun''s support engineers, or is their > answer pull from tape? (hint, hint ;-)Sounds like a good topic for here: http://opensolaris.org/os/project/forensics/ -- Darren J Moffat
Chris Siebenmann
2008-Aug-12 16:14 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
| As others have noted, the COW nature of ZFS means that there is a good | chance that on a mostly-empty pool, previous data is still intact long | after you might think it is gone. In the cases I am thinking of I am sure that the data was there. Kernel panics just didn''t let me get at it. Fortunately it was only testing data, but I am now concerned about it happening in production. | A utility to recover such data is (IMHO) more likely to be in the | category of forensic analysis than a mount (import) process. There is | more than enough information publically available for someone to build | such a tool (hint, hint :-) To put it crudely, if I wanted to write my own software for this sort of thing I would run Linux. - cks
max at bruningsystems.com
2008-Aug-12 17:03 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
Darren J Moffat wrote:> Wade.Stuart at fallon.com wrote: > >>> As others have noted, the COW nature of ZFS means that there is a >>> good chance that on a mostly-empty pool, previous data is still intact >>> long after you might think it is gone. A utility to recover such data is >>> (IMHO) more likely to be in the category of forensic analysis than >>> a mount (import) process. There is more than enough information >>> publically available for someone to build such a tool (hint, hint :-) >>> -- richard >>> >> Veritas, the makers if vxfs, whom I consider ZFS to be trying to >> compete against has higher level (normal) support engineers that have >> access to tools that let them scan the disk for inodes and other filesystem >> fragments and recover. When you log a support call on a faulty filesystem >> (in one such case I was involved in zeroed out 100mb of the first portion >> of the volume killing off both top OLT''s -- bad bad) they can actually help >> you at a very low level dig data out of the filesystem or even recover from >> pretty nasty issues. They can scan for inodes (marked by a magic number), >> have utilities to pull out files from those inodes (including indirect >> blocks/extents). Given the tools and help from their support I was able to >> pull back 500 gb of files (99%) from a filesystem that emc killed during a >> botched powerpath upgrade. Can Sun''s support engineers, or is their >> answer pull from tape? (hint, hint ;-) >> > > Sounds like a good topic for here: > > http://opensolaris.org/os/project/forensics/ >I took a look at this project, specifically http://opensolaris.org/os/project/forensics/ZFS-Forensics/. Is there any reason that the paper and slides I presented at the OpenSolaris Developers Conference on zfs on-disk format not mentioned? The paper is at: http://www.osdevcon.org/2008/files/osdevcon2008-proceedings.pdf starting on page 36, and the slides are at: http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf thanks, max
On Aug 7, 2008, at 10:25 PM, Anton B. Rang wrote:>> How would you describe the difference between the file system >> checking utility and zpool scrub? Is zpool scrub lacking in its >> verification of the data? > > To answer the second question first, yes, zpool scrub is lacking, at > least to the best of my knowledge (I haven''t looked at the ZFS > source in a few months). It does not verify that any internal data > structures are correct; rather, it simply verifies that data and > metadata blocks match their checksums.Hey Anton, What do you mean by "internal data structures"? Are you referring to things like space maps, props, history obj, etc. (basically anything other than user data and the indirect blocks that point to user data)? eric
Cromar Scott wrote:> Chris Siebenmann <cks at cs.toronto.edu> > > I''m not Anton Rang, but: > | How would you describe the difference between the data recovery > | utility and ZFS''s normal data recovery process? > > cks> The data recovery utility should not panic > cks> my entire system if it runs into some situation > cks> that it utterly cannot handle. Solaris 10 U5 > cks> kernel ZFS code does not have this property; > cks> it is possible to wind up with ZFS pools that > cks> will panic your system when you try to touch them. > ... > > I''ll go you one worse. Imagine a Sun Cluster with several resource > groups and several zpools. You blow a proc on one of the servers. As a > result, the metadata on one of the pools becomes corrupted. >This failure mode affects all shared-storage clusters. I don''t see how ZFS should or should not be any different than raw, UFS, et.al.> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html > > Now, each of the servers in your cluster attempts to import the > zpool--and panics. > > As a result of a singe part failure on a single server, your entire > cluster (and all the services on it) are sitting in a smoking heap on > your machine room floor. >Yes, but your data is corrupted. If you were my bank, then I would greatly appreciate you getting the data corrected prior to bringing my account online. If you study highly available clusters and services then you will see many cases where human interaction is preferred to automation for just such cases. You will also find that a combination of shared storage and non-shared storage cluster technology is used for truly important data. For example, we would use Solaris Cluster for the local shared-storage framework and Solaris Cluster Geographic Edition for a remote site (no shared hardware components with the local cluster).> | Nobody thinks that an answer of "sorry, we lost all of your data" is > | acceptable. However, there are failures which will result in loss of > | data no matter how clever the file system is. > > cks> The problem is that there are currently ways to > cks> make ZFS lose all your data when there are no > cks> hardware faults or failures, merely people or > cks> software mis-handling pools. This is especially > cks> frustrating when the only thing that is likely > cks> to be corrupted is ZFS metadata and the vast > cks> majority (or all) of the data in the pool is intact, > cks> readable, and so on. > > I''m just glad that our pool corruption experience happened during > testing, and not after the system had gone into production. Not exactly > a resume-enhancing experience. >I''m glad you found this in testing. BTW, what was the root cause? -- richard
Richard Elling <Richard.Elling at Sun.COM> Cromar Scott wrote:> Chris Siebenmann <cks at cs.toronto.edu> > > I''m not Anton Rang, but: > | How would you describe the difference between the data recovery > | utility and ZFS''s normal data recovery process? > > cks> The data recovery utility should not panic > cks> my entire system if it runs into some situation > cks> that it utterly cannot handle. Solaris 10 U5 > cks> kernel ZFS code does not have this property; > cks> it is possible to wind up with ZFS pools that > cks> will panic your system when you try to touch them. > ... > > I''ll go you one worse. Imagine a Sun Cluster with several resource > groups and several zpools. You blow a proc on one of the servers. Asa> result, the metadata on one of the pools becomes corrupted. >re> This failure mode affects all shared-storage re> clusters. I don''t see how ZFS should or should re> not be any different than raw, UFS, et.al. Absolutely true. The file system definitely had a problem.>http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html> > Now, each of the servers in your cluster attempts to import the > zpool--and panics. > > As a result of a singe part failure on a single server, your entire > cluster (and all the services on it) are sitting in a smoking heap on > your machine room floor. >re> Yes, but your data is corrupted. My data was only corrupted on ONE of the zpools. In a cluster with several zpools and several resource groups, we ended up with ALL of the pools and ALL of the resource groups offline as one node after another panicked. re> If you were my bank, then I would greatly re> appreciate you getting the data corrected re> prior to bringing my account online. Fair enough, but do we have to take Fred''s and Joe''s accounts offline too? re> If you study highly available clusters and services re> then you will see many cases where human interaction re> is preferred to automation for just such cases. I see your point about requiring intervention to deal with a potentially corrupt file system. I would have preferred a behavior more like we get with VxVM and VxFS, where the corrupted file system fails to mount without human intervention, but the nodes don''t panic on the failed vxdg import. That particular service group and that particular file system are offline, but everything else keeps running because none of the other nodes panics. We handled the issue of not corrupting the file system further by panicking the original node, but I don''t understand why we need to panic each other successive node in the cluster. Why can''t we just refuse to import automatically?> I''m just glad that our pool corruption experience happened during > testing, and not after the system had gone into production. Notexactly> a resume-enhancing experience.re> I''m glad you found this in testing. I''m a believer. Some people wanted us to just throw the box into production, but I insisted on keeping our test schedule. I''m glad I did. re> BTW, what was the root cause? It appears that the metadata on that pool became corrupted when the processor failed. The exact mechanism is a bit of a mystery, since we didn''t get a valid crash dump. The other pools were fine, once we imported them after a boot -x. We ended up converting to VxVM and VxFS on that server because we could not guarantee that the same thing wouldn''t just happen again after we went into production. If we had a tool that had allowed us to roll back to a previous snapshot or something, it might have made a difference. 
We were told that the probability of metadata corruption would have been reduced but not eliminated by having a mirrored LUN. We were also told that the issue will be fixed in U6.

--Scott
>>>>> "cs" == Cromar Scott <SCromar at caxton.com> writes:

cs> It appears that the metadata on that pool became corrupted
cs> when the processor failed. The exact mechanism is a bit of a
cs> mystery, [...]
cs> We were told that the probability of metadata corruption would
cs> have been reduced but not eliminated by having a mirrored LUN.
cs> We were also told that the issue will be fixed in U6.

how can one fix an issue which is a bit of a mystery? Or do you mean the lazy-panic issue is fixed, but the corruption issue is not?
Miles Nordin <carton at Ivy.NET>:

>>>>> "cs" == Cromar Scott <SCromar at caxton.com> writes:

cs> It appears that the metadata on that pool became corrupted
cs> when the processor failed. The exact mechanism is a bit of a
cs> mystery, [...]
cs> We were told that the probability of metadata corruption would
cs> have been reduced but not eliminated by having a mirrored LUN.
cs> We were also told that the issue will be fixed in U6.

mn> how can one fix an issue which is a bit of a
mn> mystery? Or do you mean the lazy-panic issue
mn> is fixed, but the corruption issue is not?

We opened a call with Sun support. We were told that the corruption issue was due to a race condition within ZFS. We were also told that the issue was known and was scheduled for a fix in S10U6. Sun support recommended that we use a mirrored pool to reduce the possibility of the bug re-emerging, but they told us that even a mirrored pool might be subject to the same bug.

Moving to OpenSolaris was not an option due to the nature of the application and our requirement for support. It is possible that we might have been able to move to SVM and UFS rather than VxVM and VxFS, but we were bumping up against our deadline, and we knew from previous deployments that VxVM and VxFS would work. (And we had the management infrastructure in place to deal with Veritas.)

We run ZFS in several other environments with different availability requirements, but we were hoping to start using it across the board as a VxVM/VxFS replacement.

--Scott
>>>>> "cs" == Cromar Scott <SCromar at caxton.com> writes:

cs> We opened a call with Sun support. We were told that the
cs> corruption issue was due to a race condition within ZFS. We
cs> were also told that the issue was known and was scheduled for
cs> a fix in S10U6.

nice. Is there a bug number? or is this one of the secret bugs?
Miles Nordin <carton at Ivy.NET>:

>>>>> "cs" == Cromar Scott <SCromar at caxton.com> writes:

cs> We opened a call with Sun support. We were told that the
cs> corruption issue was due to a race condition within ZFS. We
cs> were also told that the issue was known and was scheduled for
cs> a fix in S10U6.

mn> nice. Is there a bug number? or is this one of the secret bugs?

We were told bug number 6565042, but the description doesn't quite match up with what the Sun engineer told us on the phone. Maybe it's one of those things where the fix for 6565042 also fixes our problem.

--Scott
A Darren Dunham wrote:> > If the most recent uberblock appears valid, but doesn''t have useful > data, I don''t think there''s any way currently to see what the tree of an > older uberblock looks like. It would be nice to see if that data > appears valid and try to create a view that would be > readable/recoverable. > >I have a method to examine uberblocks on disk. Using this, along with my modified mdb and zdb, I have been able to recover a previously removed file. I''ll post details in a blog if there is interest. max
Hello max,

Sunday, August 17, 2008, 1:02:05 PM, you wrote:

mbc> A Darren Dunham wrote:
>> If the most recent uberblock appears valid, but doesn't have useful
>> data, I don't think there's any way currently to see what the tree of an
>> older uberblock looks like. It would be nice to see if that data
>> appears valid and try to create a view that would be
>> readable/recoverable.

mbc> I have a method to examine uberblocks on disk. Using this, along with
mbc> my modified mdb and zdb, I have been able to recover a previously
mbc> removed file. I'll post details in a blog if there is interest.

Of course, please do so.

--
Best regards,
Robert Milkowski                   mailto:milek at task.gda.pl
                                   http://milek.blogspot.com
Hi Robert, et.al., I have blogged about a method I used to recover a removed file from a zfs file system at http://mbruning.blogspot.com. Be forewarned, it is very long... All comments are welcome. max Robert Milkowski wrote:> Hello max, > > Sunday, August 17, 2008, 1:02:05 PM, you wrote: > > mbc> A Darren Dunham wrote: > >>> If the most recent uberblock appears valid, but doesn''t have useful >>> data, I don''t think there''s any way currently to see what the tree of an >>> older uberblock looks like. It would be nice to see if that data >>> appears valid and try to create a view that would be >>> readable/recoverable. >>> >>> >>> > mbc> I have a method to examine uberblocks on disk. Using this, along with > mbc> my modified > mbc> mdb and zdb, I have been able to recover a previously removed file. > mbc> I''ll post > mbc> details in a blog if there is interest. > > Of course, pleas do so. > > > >
Victor Latushkin wrote:
> Hi Tom and all,
>
>> root at cs3:~# uname -a
>> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200
>
> Btw, have you considered opening support call for this issue?

As a follow-up to the whole story, with the fantastic help of Victor, the failed pool is now imported and functional, thanks to the redundancy in the metadata.

This does however highlight the need for, and practical application of, an fsck-like tool. It is fine to say that if ZFS can't guarantee my data then I should restore from backups so I know what I've got, but in the case of this 42T device that would take days.

Something to think about,

Tom
> As a follow up to the whole story, with the fantastic > help of Victor, > the failed pool is now imported and functional thanks > to the redundancy > in the meta data.It would be really useful if you could publish the steps to recover the pools. This message posted from opensolaris.org
Borys Saulyak wrote:>> As a follow up to the whole story, with the fantastic help of >> Victor, the failed pool is now imported and functional thanks to >> the redundancy in the meta data.> It would be really useful if you could publish the steps to recover > the pools.Here it is: Executive summary: Thanks to COW nature of ZFS it was possible to successfully recover pool state which was only 5 seconds older than last unopenable one. Details: The whole story started with the pool which was not importable. Any attempt to import it with ''zpool import content'' reported I/O error. This situation was preceded by some HW-related issue where host could not communicate with array for some time over SAS bus. It affected array so badly that it was power-cycled to get back to life. I/O error reported by ''zpool import'' is fairly generic, so the first step is to find out more about _when_ it happens, what stage of pool import process detects it. This is where DTrace is very useful. Simple DTrace script tracing entries and exits in/out of ZFS module functions and reporting retuned value provided us with the following output: ... 4 -> spa_last_synced_txg 4 <- spa_last_synced_txg = 156012 4 -> dsl_pool_open 4 -> dsl_pool_open_impl ... 4 <- dsl_pool_open_impl = 6597523216512 4 -> dmu_objset_open_impl 4 -> arc_read ... 4 -> zio_read ... 4 <- zio_read = 3299027537664 4 -> zio_wait ... 4 <- zio_wait = 5 4 <- arc_read = 5 4 <- dmu_objset_open_impl = 5 ... 4 -> dsl_pool_close 4 <- dsl_pool_close = 6597523216512 4 <- dsl_pool_open = 5 4 -> vdev_set_state 4 <- vdev_set_state = 6597441769472 4 -> zfs_ereport_post ... 28 <- zfs_ereport_post = 0 28 <- spa_load Source code reveals that this means that we fail to open Meta ObjSet (MOS) of the pool which is pointed by block pointer stored in uberblock. But MOS has three copies (ditto-blocks) even on unreplicated pools! There must be something bad happened. This is the first moment where it is worth to stop and try to understand what it means. First, we get pointer to MOS from active uberblock, which means that it passed checksum verification, so it was written to disk successfully. It has pointers to three copies of MOS and all three are somehow corrupted. How it could happen? Answer to this question is left as an exercise to the reader. Next we tried to extract offsets and sizes from zios initiated to read MOS from disk (this could be done easier by looking into active uberblock, but anyway). This provided us with the following results: CPU FUNCTION 2 <- zio_read cmd: 0, retries: 0, numerrors: 0, error: 0, checksum: 0, compress: 0, ndvas: 0, txg: 156012, size: 512, offset1: 23622594110, offset2: 40803003282, offset3: 1610705882 2 <- zio_read cmd: 0, retries: 0, numerrors: 0, error: 0, checksum: 0, compress: 0, ndvas: 0, txg: 156012, size: 512, offset1: 23622594110, offset2: 40803003282, offset3: 1610705882 With these offsets we read out three copies of MOS to files to compare them individually (you need to add 4M to the offset to account for ZFS front label if you are doing it with, say, ''dd''). It turned out that all three are completely different. How it could happen? Most likely it happened because corresponding writes never reached disk surface. Being unable to read MOS is bad - since it is starting point into the pool state you cannot discover any other data in the pool? So is all hope lost? No, not really. Since ZFS is COW, it is natural to try previous uberblock copy and see if that points to consistent view of the pool. 
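The copy-and-compare step above is straightforward to reproduce with a few lines of script. Below is a hedged sketch in Python: the device path, and the assumption that the DTrace-reported offsets can be used directly as byte offsets into the vdev's allocatable area (which starts 4 MiB in, past the front labels and boot block, as noted above), should both be checked against your own pool before relying on the output.

#!/usr/bin/env python3
"""Sketch: read each ditto copy of the MOS block and compare digests.
Uses the block device rather than the raw device so reads need not be
sector-aligned; a dd image of the LUN works just as well."""

import hashlib

DEVICE = "/dev/dsk/c2t9d0s0"            # vdev backing the damaged pool (assumed)
LABEL_SKIP = 4 * 1024 * 1024            # front labels + boot block
SIZE = 512                              # physical size reported for the MOS block
OFFSETS = (23622594110, 40803003282, 1610705882)   # from the DTrace output above

with open(DEVICE, "rb") as dev:
    for i, off in enumerate(OFFSETS, 1):
        dev.seek(LABEL_SKIP + off)
        copy = dev.read(SIZE)
        print("copy %d at offset %d: sha256 %s"
              % (i, off, hashlib.sha256(copy).hexdigest()))

If all three digests differ, as they did here for the active uberblock, none of the copies made it to disk intact; if they agree, the uberblock that points at them is a much better candidate.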
So we saved all front and back ZFS labels (which turned out to be the same) and looked for uberblock array to see what is available there: Previous: ub[107].txg = 2616b o1 = 23622593482 o2 = 40803002710 o3 = 1610705850 timestamp = Sun Jul 27 06:04:03 2008 Active: ub[108].txg = 2616c o1 = 23622594110 o2 = 40803003282 o3 = 1610705882 timestamp = Sun Jul 27 06:04:08 2008 This is interesting pieces of data we extracted from active and previous uberblocks. There are couple things to note: 1. 2616c = 156012 (as reported by spa_last_synced_txg() above) 2. all three offsets from the active uberblock are the same as discovered earlier with dtrace script. We tried to read three MOS copies pointed by previous uberblock and found that all three were the same! So it was likely that previous uberblock pointed to (more) consistent pool state, at least it''s MOS might be ok. So the next step was to deactivate active uberblock and activate previous one. How this can be done? Well, it is enough to change checksum of the currently active uberblock to make it inactive and make sure that checksum for the previous uberblock is correct. This way previous uberblock would be selected as active since it would have highest txg id (and timestamp) of all uberblocks with correct checksum. This can be achieved with simple tools like ''dd'', you just need to be sure to apply "corruption" to all four labels. Though a tool to do this would be useful and there''s RFE to address this: CR 6667683 "need a way to select an uberblock from a previous txg" Fortunately pool configuration in this case is simple - it contains only one vdev, so we were able to leverage little quick and dirty utility which dumps uberblock array and allows to activate any uberblock in the array (by changing checksums of all others to be incorrect). The next step was to try import pool. One option was to try to import it on a live system (running with three other pools and providing service) and see the outcome. But we came up with a better idea. zdb was very helpful here. We took zdb (along with libzpool and libzfs) from Nevada (one in Solaris 10 does not yet work with exported pools) and tried to run it on the pool with activated uberblock from txg 0x2616b in userspace to verify pool consistency. So we started it with ''zdb -bbccsv content'' and after couple of days it ended with the following output: Traversing all blocks to verify checksums and verify nothing leaked ... 
No leaks (block sum matches space maps exactly) bp count: 66320566 bp logical: 8629761329152 avg: 130121 bp physical: 8624565999616 avg: 130043 compression: 1.00 bp allocated: 8628035570688 avg: 130095 compression: 1.00 SPA allocated: 8628035570688 used: 18.80% Blocks LSIZE PSIZE ASIZE avg comp %Total Type 3 48.0K 5.50K 16.5K 5.50K 8.73 0.00 L1 deferred free 35 152K 85.5K 174K 4.97K 1.78 0.00 L0 deferred free 38 200K 91.0K 191K 5.01K 2.20 0.00 deferred free 1 512 512 1K 1K 1.00 0.00 object directory 3 1.50K 1.50K 3.00K 1K 1.00 0.00 object array 1 16K 1K 2K 2K 16.00 0.00 packed nvlist - - - - - - - packed nvlist size 1 16K 16K 32K 32K 1.00 0.00 bplist - - - - - - - bplist header - - - - - - - SPA space map header 39 624K 149K 447K 11.5K 4.19 0.00 L1 SPA space map 2.22K 8.89M 5.21M 10.4M 4.69K 1.71 0.00 L0 SPA space map 2.26K 9.5M 5.35M 10.9M 4.80K 1.77 0.00 SPA space map - - - - - - - ZIL intent log 1 16K 1K 3.00K 3.00K 16.00 0.00 L6 DMU dnode 1 16K 1K 3.00K 3.00K 16.00 0.00 L5 DMU dnode 1 16K 1K 3.00K 3.00K 16.00 0.00 L4 DMU dnode 1 16K 1K 3.00K 3.00K 16.00 0.00 L3 DMU dnode 1 16K 1.50K 4.50K 4.50K 10.67 0.00 L2 DMU dnode 12 192K 110K 329K 27.4K 1.75 0.00 L1 DMU dnode 1.47K 23.5M 4.85M 9.7M 6.62K 4.84 0.00 L0 DMU dnode 1.49K 23.8M 4.97M 10.1M 6.77K 4.78 0.00 DMU dnode 2 2K 1K 3.00K 1.50K 2.00 0.00 DMU objset - - - - - - - DSL directory 2 1K 1K 2K 1K 1.00 0.00 DSL directory child map 1 512 512 1K 1K 1.00 0.00 DSL dataset snap map 2 1K 1K 2K 1K 1.00 0.00 DSL props - - - - - - - DSL dataset - - - - - - - ZFS znode - - - - - - - ZFS V0 ACL 1.07K 17.1M 1.07M 2.14M 2.01K 15.94 0.00 L3 ZFS plain file 8.13K 130M 31.8M 63.6M 7.83K 4.09 0.00 L2 ZFS plain file 505K 7.89G 3.19G 6.38G 12.9K 2.48 0.08 L1 ZFS plain file 62.7M 7.84T 7.84T 7.84T 128K 1.00 99.92 L0 ZFS plain file 63.2M 7.85T 7.84T 7.85T 127K 1.00 100.00 ZFS plain file 991 1.77M 596K 1.16M 1.20K 3.04 0.00 ZFS directory 1 512 512 1K 1K 1.00 0.00 ZFS master node 1 512 512 1K 1K 1.00 0.00 ZFS delete queue - - - - - - - zvol object - - - - - - - zvol prop - - - - - - - other uint8[] - - - - - - - other uint64[] - - - - - - - other ZAP - - - - - - - persistent error log 1 128K 4.50K 9.00K 9.00K 28.44 0.00 SPA history - - - - - - - SPA history offsets - - - - - - - Pool properties - - - - - - - DSL permissions - - - - - - - ZFS ACL - - - - - - - ZFS SYSACL - - - - - - - FUID table - - - - - - - FUID table size - - - - - - - DSL dataset next clones - - - - - - - scrub work queue 1 16K 1K 3.00K 3.00K 16.00 0.00 L6 Total 1 16K 1K 3.00K 3.00K 16.00 0.00 L5 Total 1 16K 1K 3.00K 3.00K 16.00 0.00 L4 Total 1.07K 17.1M 1.07M 2.15M 2.01K 15.94 0.00 L3 Total 8.13K 130M 31.8M 63.6M 7.83K 4.09 0.00 L2 Total 505K 7.89G 3.19G 6.38G 12.9K 2.48 0.08 L1 Total 62.7M 7.84T 7.84T 7.84T 128K 1.00 99.92 L0 Total 63.2M 7.85T 7.84T 7.85T 127K 1.00 100.00 Total capacity operations bandwidth ---- errors ---- description used avail read write read write read write cksum content 7.85T 33.9T 420 0 51.8M 0 0 0 0 /dev/dsk/c2t9d0s0 7.85T 33.9T 420 0 51.8M 0 0 0 0 bash-3.00# This confirmed that previous pool state is completely consistent, so it should be safe to import pool in this state. Import worked just fine and additional scrub did not find any errors. Hope this helps, Victor
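For anyone who wants to reproduce the "dump the uberblock array" step without Victor's private utility, a read-only sketch follows. The layout it assumes (256 KiB vdev labels, the uberblock array occupying the second half of the first label, 1 KiB slots, and the magic/version/txg/guid_sum/timestamp field order) follows the published ZFS on-disk format documentation of this era, but it is an illustration rather than a supported tool; cross-check its output against zdb before acting on it. It never writes to the device.

#!/usr/bin/env python3
"""Read-only sketch: walk the uberblock array in the first vdev label and
print the txg and timestamp of every slot whose magic matches."""

import struct
import sys
import time

UB_MAGIC = 0x00bab10c         # uberblock magic number
LABEL_SIZE = 256 * 1024       # each of the four vdev labels is 256 KiB
UB_ARRAY_OFFSET = 128 * 1024  # uberblock array starts halfway into the label
UB_SLOT = 1024                # one uberblock per 1 KiB slot

def dump_uberblocks(dev):
    with open(dev, "rb") as f:
        f.seek(UB_ARRAY_OFFSET)                  # first label starts at offset 0
        array = f.read(LABEL_SIZE - UB_ARRAY_OFFSET)
    for slot in range(len(array) // UB_SLOT):
        raw = array[slot * UB_SLOT:(slot + 1) * UB_SLOT]
        for endian in ("<", ">"):                # magic is in the writer's byte order
            magic, version, txg, guid_sum, ts = struct.unpack(endian + "5Q", raw[:40])
            if magic == UB_MAGIC:
                print("ub[%3d] txg %#x  timestamp %s"
                      % (slot, txg, time.ctime(ts)))
                break

if __name__ == "__main__":
    dump_uberblocks(sys.argv[1] if len(sys.argv) > 1 else "/dev/rdsk/c2t9d0s0")

Output in the same form as the ub[107]/ub[108] lines above makes it easy to spot which txg is currently active and which older ones are still available to fall back to.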
Victor, thanks for posting that. It really is interesting to see exactly what happened, and to read about how zfs pools can be recovered. Your work on these forums has done much to reassure me that ZFS is stable enough for us to be using on a live server, and I look forward to seeing automated tools appear to do some of the recoveries you're currently having to work so hard on.
--
This message posted from opensolaris.org