Hi,

Have a problem with a ZFS on a single device, this device is 48 1T SATA
drives presented as a 42T LUN via hardware RAID 6 on a SAS bus, with ZFS
on it as a single device.

There was a problem with the SAS bus which caused various errors,
including the inevitable kernel panic; the thing came back up with 3 out
of 4 zfs mounted.

I've tried reading the partition table with format, which works fine, and
I can also dd the first 100G from the device quite happily, so the
communication issue appears resolved, however the device just won't
mount.  Googling around I see that ZFS does have features designed to
reduce the impact of corruption at a particular point, multiple metadata
copies and so on, however the commands to help me tidy up a zfs will only
run once the thing has been mounted.

Would be grateful for any ideas, relevant output here:

root at cs3:~# zpool import
  pool: content
    id: 14205780542041739352
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported
        using the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        content     FAULTED   corrupted data
          c2t9d0    ONLINE

root at cs3:~# zpool import content
cannot import 'content': pool may be in use from other system
use '-f' to import anyway

root at cs3:~# zpool import -f content
cannot import 'content': I/O error

root at cs3:~# uname -a
SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200

Thanks
-- 
Tom

// www.portfast.co.uk -- internet services and consultancy
// hosting from 1.65 per domain
Tom Bird wrote:
> Hi,
>
> Have a problem with a ZFS on a single device, this device is 48 1T SATA
> drives presented as a 42T LUN via hardware RAID 6 on a SAS bus, with ZFS
> on it as a single device.
>
> There was a problem with the SAS bus which caused various errors
> including the inevitable kernel panic, the thing came back up with 3 out
> of 4 zfs mounted.
>
> I've tried reading the partition table with format, works fine, also can
> dd the first 100G from the device quite happily so the communication
> issue appears resolved however the device just won't mount.  Googling
> around I see that ZFS does have features designed to reduce the impact
> of corruption at a particular point, multiple metadata copies and so
> on, however commands to help me tidy up a zfs will only run once the
> thing has been mounted.

You should also check the end of the LUN.  ZFS stores its configuration
data at the beginning and end of the LUN.

An I/O error is a fairly generic error, but it can also be an indicator
of a catastrophic condition.  You should also check the system log in
/var/adm/messages as well as any faults reported by fmdump.

In general, ZFS can only repair conditions for which it owns data
redundancy.  In this case, ZFS does not own the redundancy function, so
you are susceptible to faults of this sort.
 -- richard

> Would be grateful for any ideas, relevant output here:
>
> root at cs3:~# zpool import
>   pool: content
>     id: 14205780542041739352
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
>         content     FAULTED   corrupted data
>           c2t9d0    ONLINE
>
> root at cs3:~# zpool import content
> cannot import 'content': pool may be in use from other system
> use '-f' to import anyway
>
> root at cs3:~# zpool import -f content
> cannot import 'content': I/O error
>
> root at cs3:~# uname -a
> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200
>
> Thanks
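For anyone following along, the checks Richard suggests might look roughly
like the following.  The device and slice names, and the seek count, are
illustrative only: the slice is a guess at where the pool lives, and the
iseek value assumes a 42 TiB LUN read in 1 MiB blocks, so it needs to be
adjusted to land in the last gigabyte or so of your actual device.

  # recent fault events and kernel messages
  fmdump
  fmdump -eV | tail -100
  tail -200 /var/adm/messages

  # read the tail of the LUN, where ZFS keeps two of its four labels
  dd if=/dev/rdsk/c2t9d0s0 of=/dev/null bs=1024k iseek=44039168 count=1024

If the tail of the LUN cannot be read, that points at the transport or the
array rather than at ZFS.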
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes:

    tb> There was a problem with the SAS bus which caused various
    tb> errors including the inevitable kernel panic, the thing came
    tb> back up with 3 out of 4 zfs mounted.

    re> In general, ZFS can only repair conditions for which it owns
    re> data redundancy.

If that's really the excuse for this situation, then ZFS is not
``always consistent on the disk'' for single-VDEV pools.

There was no loss of data here, just an interruption in the connection
to the target, like power loss or any other unplanned shutdown.
Corruption in this scenario is a significant regression w.r.t. UFS:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html

How about the scenario where you lose power suddenly, but only half of
a mirrored VDEV is available when power is restored?  Is ZFS
vulnerable to this type of unfixable corruption in that scenario, too?
On Wed, Aug 6, 2008 at 13:57, Miles Nordin <carton at ivy.net> wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes: > > tb> There was a problem with the SAS bus which caused various > tb> errors including the inevitable kernel panic, the thing came > tb> back up with 3 out of 4 zfs mounted. > > re> In general, ZFS can only repair conditions for which it owns > re> data redundancy. > > If that''s really the excuse for this situation, then ZFS is not > ``always consistent on the disk'''' for single-VDEV pools.Well, yes. If data is sent, but corruption somewhere (the SAS bus, apparently, here) causes bad data to be written, ZFS can generally detect but not fix that. It might be nice to have a "verifywrites" mode or something similar to make sure that good data has ended up on disk (at least at the time it checks), but failing that there''s not much ZFS (or any filesystem) can do. Using a pool with some level of redundancy (mirroring, raidz) at least gives zfs a chance to read the missing pieces from the redundancy that it''s kept.> How about the scenario where you lose power suddenly, but only half of > a mirrored VDEV is available when power is restored? Is ZFS > vulnerable to this type of unfixable corruption in that scenario, > too?Every filesystem is vulnerable to corruption, all the time. I''m willing to dispute any claims otherwise. Some are just more likely than others to hit their error conditions. I''ve personally run into UFS'' problems more often than ZFS... but that doesn''t mean I think I''m safe. Will
Miles Nordin wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes: >>>>>> > > tb> There was a problem with the SAS bus which caused various > tb> errors including the inevitable kernel panic, the thing came > tb> back up with 3 out of 4 zfs mounted. > > re> In general, ZFS can only repair conditions for which it owns > re> data redundancy. > > If that''s really the excuse for this situation, then ZFS is not > ``always consistent on the disk'''' for single-VDEV pools. >I disagree with your assessment. The on-disk format (any on-disk format) necessarily assumes no faults on the media. The difference between ZFS on-disk format and most other file systems is that the metadata will be consistent to some point in time because it is COW. With UFS, for instance, the metadata is overwritten, which is why it cannot be considered always consistent (and why fsck exists).> There was no loss of data here, just an interruption in the connection > to the target, like power loss or any other unplanned shutdown. > Corruption in this scenario is is a significant regression w.r.t. UFS: >I see no evidence that the data is or is not correct. What we know is that ZFS is attempting to read something and the device driver is returning EIO. Unfortunately, EIO is a catch-all error code, so more digging to find the root cause is needed. However, I will bet a steak dinner that if this device was mirrored to another, the pool will import just fine, with the affected device in a faulted or degraded state.> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html >I have no idea what Eric is referring to, and it does not match my experience. Unfortunately, he didn''t reference any CRs either :-(. "Your baby is ugly" posts aren''t very useful. That said, we are constantly improving the resiliency of ZFS (more good stuff coming in b96), so it might be worth trying to recover with a later version. For example, boot SXCE b94 and try to import the pool.> How about the scenario where you lose power suddenly, but only half of > a mirrored VDEV is available when power is restored? Is ZFS > vulnerable to this type of unfixable corruption in that scenario, > too? >No, this works just fine as long as one side works. But that is a very different case. -- richard
Richard Elling wrote:
> I see no evidence that the data is or is not correct.  What we know is
> that ZFS is attempting to read something and the device driver is
> returning EIO.  Unfortunately, EIO is a catch-all error code, so more
> digging to find the root cause is needed.

I'm currently checking the whole LUN, although as a 42TB unit this will
take a few hours so we'll see how that is tomorrow.

> However, I will bet a steak dinner that if this device was mirrored to
> another, the pool will import just fine, with the affected device in a
> faulted or degraded state.

On any other file system though, I could probably kick off a fsck and
get back most of the data.  I see the argument a lot that ZFS "doesn't
need" a fsck utility, however I would be inclined to disagree, if not a
full on fsck then something that can patch it up to the point where I
can mount it and then get some data off or run a scrub.

-- 
Tom

// www.portfast.co.uk -- internet services and consultancy
// hosting from 1.65 per domain
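A full surface read of the kind Tom describes can be done with dd; a
minimal sketch, assuming the pool device is c2t9d0 and that slice s0
covers the whole LUN (adjust to your layout):

  dd if=/dev/rdsk/c2t9d0s0 of=/dev/null bs=1024k
  echo $?     # a non-zero exit, or driver errors on the console, would
              # point at the transport or the array

Note that this only proves the LUN is readable end to end; it says nothing
about whether the blocks still hold what ZFS last wrote.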
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:

    c> If that's really the excuse for this situation, then ZFS is
    c> not ``always consistent on the disk'' for single-VDEV pools.

    re> I disagree with your assessment.  The on-disk format (any
    re> on-disk format) necessarily assumes no faults on the media.

The media never failed, only the connection to the media.  We've every
good reason to believe that every CDB that the storage controller
acknowledged as complete, was completed and is still there---and that
is the only statement which must be true of unfaulty media.  We've no
strong reason to doubt it.

    re> I see no evidence that the data is or is not correct.

the ``evidence'' is that it was on a SAN, and the storage itself never
failed, only the connection between ZFS and the storage.  Remember:

    this device is 48 1T SATA drives presented as a 42T LUN via hardware
    RAID 6 on a SAS bus which had a ZFS on it as a single device.

This sort of SAN-outage happens all the time, so it's not straining my
belief to suggest that probably nothing else happened other than
disruption of the connection between ZFS and the storage.  It's not like
a controller randomly ``acted up'' or something, so that I would suspect
a bad disk.

    c> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html

    re> I have no idea what Eric is referring to, and it does not
    re> match my experience.

unfortunately it's very easy to match the experience of ``nothing
happened'' and hard to match the experience ``exactly the same thing
happened to me.''  Have you been provoking ZFS in exactly the way Eric
described, a single-vdev pool on FC where the FC SAN often has outages or
where the storage is rebooted while ZFS is still running?  If not,
obviously it doesn't match your experience because you have none with
this situation.  OTOH if you've been doing that a lot, your not running
into this problem means something.  Otherwise, it's another case of the
home-user defense: ``I can't tell you how close to zero the number of
problems I've had with it is.  It's so close to zero, it is zero, so
there's virtually 0% chance what you're saying happened to you really did
happen to you.  and also to this other guy.''

When I say ``doesn't match my experience'' I meant I _do_ see Mac OS X
pinwheels, and for me it's ``usually'' traceable back to VM pressure or a
dead NFS server, not some random application-level user-interface
modal-wait as others claimed: I'm selecting for the same situation you
are, and getting a different result.

that said, yeah, a CR would be nice.  For such a serious problem, I'd
like to think someone's collected an image of the corrupt filesystem and
is trying to figure out wtf happened.  I care about how safe is my data,
not how pretty is your baby.  I want its relative safety accurately
represented based on the experience available to us.

    c> How about the scenario where you lose power suddenly, but only
    c> half of a mirrored VDEV is available when power is restored?
    c> Is ZFS vulnerable to this type of unfixable corruption in that
    c> scenario, too?

    re> No, this works just fine as long as one side works.  But that
    re> is a very different case.  -- richard

Why do you regard this case as very different from a single vdev?  I
don't have confidence that it's clearly different w.r.t. whatever
hypothetical bug Eric and Tom have run into.
    wm> If data is sent, but corruption somewhere (the SAS bus,
    wm> apparently, here) causes bad data to be written, ZFS can
    wm> generally detect but not fix that.

Why would there be bad data written?  The SAS bus has checksums.  The
problem AIUI was that the bus went away, not that it started scribbling
random data all over the place.  Am I wrong?  Remember what Tom's SAS bus
is connected to.

    wm> "verifywrites"

The verification is the storage array returning success to the command it
was issued.  ZFS is supposed to, for example, delay returning from
fsync() until this has happened.  The same mechanism is used to write
batches of things in a well-defined order to supposedly achieve the
``always-consistent''.  It depends on the drive/array's ability to
accurately report when data is committed to stable storage, not on
rereading what was written, and this is the correct dependency because
ZFS leaves write caches on, so the drive could satisfy a read from the
small on-disk cache RAM even though that data would be lost if you pulled
the disk's power cord.  The system contains all the tools needed to keep
the consistency promises even if you go around yanking SAS cables.  And
this is a data-loss issue, not just an availability issue like we were
discussing before w.r.t. pulling drives.

    wm> Every filesystem is vulnerable to corruption, all the time.

Every filesystem in recent history makes rigorous guarantees about what
will survive if you pull the connection to the disk array, or the host's
power, at any time you wish.  The guarantees always include the integrity
of data written before an fsync() command was called, so long as
power/connectivity is lost after fsync() returns.  They also include
enough metadata consistency that you won't lose a whole friggin' pool
like this scenario with some ``corrupt data, End of Line'' error.

    UFS+logging
    vxfs
    FFS+softdep
    ext3
    xfs
    reiserfs
    HFS+

Disks that go bad, storage subsystems with a RAID5 write hole, PATA
busses that given noisy cables autodegrade to a non-CRC mode and then
corrupt data, disks that silently return bad data, controllers that go
nuts and scribble random data as the 5V rail starts dropping after the
cord is pulled, can, yes, all interfere with these guarantees.  but NONE
OF THOSE THINGS HAPPENED IN THIS CASE.  We absolutely do not live in fear
that we will lose whole filesystems if the cord is pulled at the wrong
time.  That has not been true since, like, the early 90's.  ancient
history.
On Wed, Aug 6, 2008 at 8:20 AM, Tom Bird <tom at marmot.org.uk> wrote:
> Hi,
>
> Have a problem with a ZFS on a single device, this device is 48 1T SATA
> drives presented as a 42T LUN via hardware RAID 6 on a SAS bus, with ZFS
> on it as a single device.
>
> There was a problem with the SAS bus which caused various errors
> including the inevitable kernel panic, the thing came back up with 3 out
> of 4 zfs mounted.

Hi Tom,

After reading this and the followups to date... this could be due to
anything, and we (on the list) don't know the history of the system or
the RAID device.  You could have a bad SAS controller, bad system memory,
a bad cable or a RAID controller with a firmware bug.

The first step would be to form a ZFS pool with 2 mirrors, beat up on it
and gain some confidence in the overall system components.  Write lots of
data to it, run zpool scrub etc. and verify that it's 100% rock solid
before you then zpool destroy it and then test with a larger pool.

In every case where someone has initially posted an opening story like
yours, the problem has almost always turned out to be outside of ZFS.  As
others have explained, if ZFS does not have a config with data redundancy
- there is not much that can be learned - except that it "just broke".

Keep testing and report back.  Also, any additional data on the hardware
and software config would be useful, and let us know if this is a "new"
system or if the hardware has already been in service and what its
reliability track record is.

> I've tried reading the partition table with format, works fine, also can
> dd the first 100G from the device quite happily so the communication
> issue appears resolved however the device just won't mount.  Googling
> around I see that ZFS does have features designed to reduce the impact
> of corruption at a particular point, multiple metadata copies and so
> on, however commands to help me tidy up a zfs will only run once the
> thing has been mounted.
>
> Would be grateful for any ideas, relevant output here:
>
> root at cs3:~# zpool import
>   pool: content
>     id: 14205780542041739352
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
>         content     FAULTED   corrupted data
>           c2t9d0    ONLINE
>
> root at cs3:~# zpool import content
> cannot import 'content': pool may be in use from other system
> use '-f' to import anyway
>
> root at cs3:~# zpool import -f content
> cannot import 'content': I/O error
>
> root at cs3:~# uname -a
> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200
>
> Thanks
> --
> Tom

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX al at logical-approach.com
           Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
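A burn-in along the lines Al suggests might look like the sketch below;
the device names are purely illustrative and would need to match spare
disks on the system under test.

  zpool create testpool mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0

  # generate some traffic, e.g. copy a large tree onto it a few times
  cp -rp /export/home /testpool/copy1

  zpool scrub testpool
  zpool status -v testpool    # look for CKSUM errors, and check fmdump
  zpool destroy testpool

If this stays clean over repeated runs, confidence in the controller,
cabling and memory goes up before the larger pool is rebuilt.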
Tom Bird wrote:
> Richard Elling wrote:
>> I see no evidence that the data is or is not correct.  What we know is
>> that ZFS is attempting to read something and the device driver is
>> returning EIO.  Unfortunately, EIO is a catch-all error code, so more
>> digging to find the root cause is needed.
>
> I'm currently checking the whole LUN, although as a 42TB unit this will
> take a few hours so we'll see how that is tomorrow.
>
>> However, I will bet a steak dinner that if this device was mirrored to
>> another, the pool will import just fine, with the affected device in a
>> faulted or degraded state.
>
> On any other file system though, I could probably kick off a fsck and
> get back most of the data.  I see the argument a lot that ZFS "doesn't
> need" a fsck utility, however I would be inclined to disagree, if not a
> full on fsck then something that can patch it up to the point where I
> can mount it and then get some data off or run a scrub.

Probably not.  fsck only repairs metadata, it does not restore or correct
data.  If the data is gone or damaged, then there isn't much ZFS could
do, since ZFS was not in control of the data redundancy (by default, ZFS
metadata is redundant).

BTW, another good sanity test is to try to read the ZFS labels:
    zdb -l /dev/rdsk/...

Cindy, I note that we don't explicitly address the case where the pool
cannot be imported in the Troubleshooting and Data Recovery chapter of
the ZFS administration guide.  Can we put this on the todo list?
 -- richard
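A concrete form of that label check might be the following; the slice
name is a guess, so use whichever slice the pool was actually built on:

  zdb -l /dev/rdsk/c2t9d0s0

ZFS keeps four copies of the label, two at the front of the device and
two at the end, so seeing fewer than four readable labels here, or labels
whose txg values disagree wildly, would help narrow down where the damage
is.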
Tom Bird wrote:> Richard Elling wrote: > > >> I see no evidence that the data is or is not correct. What we know is that >> ZFS is attempting to read something and the device driver is returning EIO. >> Unfortunately, EIO is a catch-all error code, so more digging to find the >> root cause is needed. >> > > I''m currently checking the whole LUN, although as a 42TB unit this will > take a few hours so we''ll see how that is tomorrow. > > >> However, I will bet a steak dinner that if this device was mirrored to >> another, >> the pool will import just fine, with the affected device in a faulted or >> degraded >> state. >> > > On any other file system though, I could probably kick off a fsck and > get back most of the data. I see the argument a lot that ZFS "doesn''t > need" a fsck utility, however I would be inclined to disagree, if not a > full on fsck then something that can patch it up to the point where I > can mount it and then get some data off or run a scrub. > >From the ZFS Administration Guide, Chapter 11, Data Repair section: Given that the fsck utility is designed to repair known pathologies specific to individual file systems, writing such a utility for a file system with no known pathologies is impossible. Future experience might prove that certain data corruption problems are common enough and simple enough such that a repair utility can be developed, but these problems can always be avoided by using redundant pools. If your pool is not redundant, the chance that data corruption can render some or all of your data inaccessible is always present. If you go through the archives you should find similar conversations. -- richard
> From the ZFS Administration Guide, Chapter 11, Data Repair section:
> Given that the fsck utility is designed to repair known pathologies
> specific to individual file systems, writing such a utility for a file
> system with no known pathologies is impossible.

That's a fallacy (and is incorrect even for the UFS fsck; refer to the
McKusick/Kowalski paper and the distinction they make between 'expected'
corruptions and other inconsistencies).

First, there are two types of utilities which might be useful in the
situation where a ZFS pool has become corrupted.  The first is a file
system checking utility (call it zfsck); the second is a data recovery
utility.  The difference between those is that the first tries to bring
the pool (or file system) back to a usable state, while the second simply
tries to recover the files to a new location.

What does a file system check do?  It verifies that a file system is
internally consistent, and makes it consistent if it is not.  If ZFS were
always consistent on disk, then only a verification would be needed.
Since we have evidence that it is not always consistent in the face of
hardware failures, at least, repair may also be needed.  This doesn't
need to be that hard.  For instance, the space maps can be reconstructed
by walking the various block trees; the uberblock effectively has several
backups (though it might be better in some cases if an older backup were
retained); and the ZFS checksums make it easy to identify block types and
detect bad pointers.  Files can be marked as damaged if they contain
pointers to bad data; directories can be repaired if their hash
structures are damaged (as long as the names and pointers can be
salvaged); etc.  Much more complex file systems than ZFS have file system
checking utilities, because journaling, COW, etc. don't help you in the
face of software bugs or certain classes of hardware failures.

A recovery tool is even simpler, because all it needs to do is find a
tree root and then walk the file system, discovering directories and
files, verifying that each of them is readable by using the checksums to
check intermediate and leaf blocks, and extracting the data.  The tricky
bit with ZFS is simply identifying a relatively new root, so that the
newest copy of the data can be identified.

Almost every file system starts out without an fsck utility, and
implements one once it becomes obvious that "sorry, you have to
reinitialize the file system" -- or worse, "sorry, we lost all of your
data" -- is unacceptable to a certain proportion of customers.
> As others have explained, if ZFS does not have a
> config with data redundancy - there is not much that
> can be learned - except that it "just broke".

Plenty can be learned by just looking at the pool.  Unfortunately ZFS
currently doesn't have tools which make that easy; as I understand it,
zdb doesn't work (in a useful way) on a pool which won't import, so
dumping out the raw data structures and looking at them by hand is the
only way to determine what ZFS doesn't like and deduce what went wrong
(and how to fix it).
On Wed, Aug 06, 2008 at 02:23:44PM -0400, Will Murnane wrote:
> On Wed, Aug 6, 2008 at 13:57, Miles Nordin <carton at ivy.net> wrote:
> > If that's really the excuse for this situation, then ZFS is not
> > ``always consistent on the disk'' for single-VDEV pools.
> Well, yes.  If data is sent, but corruption somewhere (the SAS bus,
> apparently, here) causes bad data to be written, ZFS can generally
> detect but not fix that.  It might be nice to have a "verifywrites"
> mode or something similar to make sure that good data has ended up on
> disk (at least at the time it checks), but failing that there's not
> much ZFS (or any filesystem) can do.  Using a pool with some level of
> redundancy (mirroring, raidz) at least gives zfs a chance to read the
> missing pieces from the redundancy that it's kept.

There's also ditto blocks.  So even on a one-vdev pool ZFS can recover
from random corruption unless you're really unlucky.

Of course, this is a feature.  Without ZFS the OP would have had silent,
undetected (by the OS, that is) data corruption.

Basically you don't want to have one-vdev pools.  If you'll use HW RAID
then you should also do mirroring at the ZFS layer.

Nico
-- 
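To make Nico's two suggestions concrete (pool and device names are
illustrative, and the copies property is only available on reasonably
recent ZFS releases):

  # mirror two hardware LUNs at the ZFS layer
  zpool create content mirror c2t9d0 c2t10d0

  # or, on an existing single-LUN pool, keep extra copies of user data
  # (metadata already has ditto copies by default)
  zfs set copies=2 content

Neither protects against losing the only LUN outright, but both give ZFS
a second copy to read from when a checksum fails.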
On Wed, Aug 06, 2008 at 03:44:08PM -0400, Miles Nordin wrote:
> >>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>
>     c> If that's really the excuse for this situation, then ZFS is
>     c> not ``always consistent on the disk'' for single-VDEV pools.
>
>     re> I disagree with your assessment.  The on-disk format (any
>     re> on-disk format) necessarily assumes no faults on the media.
>
> The media never failed, only the connection to the media.  We've every
> good reason to believe that every CDB that the storage controller
> acknowledged as complete, was completed and is still there---and that
> is the only statement which must be true of unfaulty media.  We've no
> strong reason to doubt it.

zdb should be able to pinpoint the problem, no?
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:

    re> If your pool is not redundant, the chance that data
    re> corruption can render some or all of your data inaccessible is
    re> always present.

1. data corruption != unclean shutdown

2. other filesystems do not need a mirror to recover from unclean
   shutdown.  They only need it for when disks fail, or for when disks
   misremember their contents (silent corruption, as in the NetApp
   paper).  I would call data corruption and silent corruption the same
   thing: what the CKSUM column was _supposed_ to count, though not in
   fact the only thing it counts.

3. saying ZFS needs a mirror to recover from unclean shutdown does not
   agree with the claim ``always consistent on the disk''

4. I'm not sure exactly your position.  Before you were saying what Erik
   warned about doesn't happen, because there's no CR, and Tom must be
   confused too.  Now you're saying of course it happens, ZFS's claims of
   ``always consistent on disk'' count for nothing unless you have pool
   redundancy.  And that is exactly what I said to start with:

    re> In general, ZFS can only repair conditions for which it owns
    re> data redundancy.

    c> If that's really the excuse for this situation, then ZFS is
    c> not ``always consistent on the disk'' for single-VDEV pools.

that is the take-home message?

If so, it still leaves me with the concern, what if the breaking of one
component in a mirrored vdev takes my system down uncleanly?  This seems
like a really plausible failure mode (as Tom said, ``the inevitable
kernel panic'').  In that case, I no longer have any redundancy when the
system boots back up.  If ZFS calls the inconsistent states through which
it apparently sometimes transitions pools ``data corruption'' and depends
on redundancy to recover from them, then isn't it extremely dangerous to
remove power or SAN connectivity from any DEGRADED pool?  The pool should
be rebuilt onto a hot spare IMMEDIATELY so that it's ONLINE as soon as
possible, because if ZFS loses power with a DEGRADED pool all bets are
off.

If this DEGRADED-pool unclean shutdown is, as you say, a completely
different scenario from single-vdev pools that isn't dangerous and has no
trouble with ZFS corruption, then no one should ever run a single-vdev
pool.  We should instead run mirrored vdevs that are always DEGRADED,
since this configuration looks identical to everything outside ZFS but
supposedly magically avoids the issue.  If only we had some way to attach
to vdevs fake mirror components that immediately get marked FAULTED, then
we can avoid the corruption risk.  But, that's clearly absurd!

so, let's say ZFS's requirement is, as we seem to be describing it: might
lose the whole pool if your kernel panics or you pull the power cord in a
situation without redundancy.  Then I think this is an extremely serious
issue, even for redundant pools.  It is very plausible that a machine
will panic or lose power during a resilver.

And if, on the other hand, ZFS doesn't transition disks through
inconsistent states and then excuse itself calling what it did ``data
corruption'' when it bites you after an unclean shutdown, then what
happened to Erik and Tom?

It seems to me it is ZFS's fault and can't be punted off to the
administrator's ``asking for it.''
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

    nw> Without ZFS the OP would have had silent, undetected (by the
    nw> OS that is) data corruption.

It sounds to me more like the system would have panicked as soon as he
pulled the cord, and when it rebooted, it would have rolled the UFS log
and mounted, without even an fsck, with no corruption at all, silent or
otherwise.

Note that the storage controller never even lost power, and does not
appear to be faulty.
Hi Tom and all,

Tom Bird wrote:
> Hi,
>
> Have a problem with a ZFS on a single device, this device is 48 1T SATA
> drives presented as a 42T LUN via hardware RAID 6 on a SAS bus, with ZFS
> on it as a single device.
>
> There was a problem with the SAS bus which caused various errors
> including the inevitable kernel panic, the thing came back up with 3 out
> of 4 zfs mounted.

It would be nice to see a panic stack.

> I've tried reading the partition table with format, works fine, also can
> dd the first 100G from the device quite happily so the communication
> issue appears resolved however the device just won't mount.  Googling
> around I see that ZFS does have features designed to reduce the impact
> of corruption at a particular point, multiple metadata copies and so
> on, however commands to help me tidy up a zfs will only run once the
> thing has been mounted.
>
> Would be grateful for any ideas, relevant output here:
>
> root at cs3:~# zpool import
>   pool: content
>     id: 14205780542041739352
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
>         content     FAULTED   corrupted data
>           c2t9d0    ONLINE
>
> root at cs3:~# zpool import content
> cannot import 'content': pool may be in use from other system
> use '-f' to import anyway
>
> root at cs3:~# zpool import -f content
> cannot import 'content': I/O error

As long as it does not panic and just returns an I/O error, which is
rather generic, you may try to dig a little bit deeper with DTrace to
have a chance to see where this I/O error is generated first, e.g.
something like this with the attached dtrace script:

dtrace -s /path/to/script -c "zpool import -f content"

It would also be interesting to know what impact the SAS bus problem had
on the storage controller.  Btw, what is the storage controller in
question here?

> root at cs3:~# uname -a
> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200

Btw, have you considered opening a support call for this issue?

hth,
victor
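Victor's zpool.d attachment was not preserved in the archive.  Purely as a
guess at the kind of thing such a script might do, one that flags
functions in the zfs kernel module returning EIO while the import runs
could look like this (the predicate will also match unrelated functions
that happen to return 5, so treat the output as a hint, not proof):

  #!/usr/sbin/dtrace -s
  /* print any zfs-module function that returns EIO (5),
     plus the kernel stack at that point */
  fbt:zfs::return
  /arg1 == EIO/
  {
          printf("%s returned EIO\n", probefunc);
          stack();
  }

Run it against the failing import, e.g.:

  dtrace -s zfs_eio.d -c "zpool import -f content"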
> Would be grateful for any ideas, relevant output here:
>
> root at cs3:~# zpool import
>   pool: content
>     id: 14205780542041739352
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
>         content     FAULTED   corrupted data
>           c2t9d0    ONLINE
>
> root at cs3:~# zpool import content
> cannot import 'content': pool may be in use from other system
> use '-f' to import anyway
>
> root at cs3:~# zpool import -f content
> cannot import 'content': I/O error

If you have a DVD with recent SXCE bits handy you may try to use zdb to
get more details:

zdb -e -dddd content

victor
Hi folks, Miles, I don''t know if you have more information about this problem than I''m seeing, but from what Tom wrote I don''t see how you can assume this is such a simple problem as an unclean shutdown? Tom wrote "There was a problem with the SAS bus which caused various errors including the inevitable kernel panic". It''s the various errors part that catches my eye, to me this means the SAS bus could have caused bad data to be written to disk for some time before the kernel panic, and that is far more serious to a filesystem than a simple power cut. Can fsck always recover a disk? Or if the corruption is severe enough, are there times when even that fails? I don''t see that we have enough information here to really compare ZFS with UFS although I do agree that some kind of ZFS repair tool sounds like it would be useful. The problem for me is that I don''t know enough about the low level stuff to really have an informed opinion on that. To me, it sounds like Sun have designed ZFS to always know if there is corruption on the disk, and to write data in a way that corruption of the whole filesystem *should* never happen. But I also feel that there are times that hardware can fail in strange ways, and there''s always a chance that a pool could become corrupted due to hardware error in a way that prevents it being mounted. While I can see where Sun are coming from in that they''ve designed ZFS to engineer around these problems, and avoid the need to repair filesystems by using mirroring, multiple copies, etc. I do think a fsck like utility that can try to mount a failed system would be good for ZFS, it''s certainly worth somebody who knows ZFS sitting down and thinking "how can we recover a pool if we know that X is corrupted", where X refers to any of the core pieces ZFS needs on its disk(s). Ross This message posted from opensolaris.org
Miles Nordin wrote:
>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes:
>
>     tb> There was a problem with the SAS bus which caused various
>     tb> errors including the inevitable kernel panic, the thing came
>     tb> back up with 3 out of 4 zfs mounted.
>
>     re> In general, ZFS can only repair conditions for which it owns
>     re> data redundancy.
>
> If that's really the excuse for this situation, then ZFS is not
> ``always consistent on the disk'' for single-VDEV pools.

This is the wrong implication.  ZFS does not write new (meta)data over
currently allocated blocks; this is how on-disk consistency is achieved.

Recovery from corruption is another thing.  Data that is read back may
not be the data which was written in the first place, and ZFS has a
facility to detect this - checksums.  If there's more than one copy,
there is a good chance that another copy may be good.  If there's only
one copy, there's not much to do besides returning an I/O error.

There's another failure scenario as well - (meta)data may be corrupted in
memory before it is checksummed and written to disk.  In this case, no
matter how many copies are stored on disk, all of them are incorrect,
though they may still checksum properly.

> There was no loss of data here, just an interruption in the connection
> to the target, like power loss or any other unplanned shutdown.

Unfortunately this is an assumption only.  Saying that there was no loss
of data, you assume that the storage controller is bug-free or was not
affected by the SAS bus issues in any way.  This may not be the case.
But it is impossible to tell with the provided data.

victor
Anton B. Rang writes:
> dumping out the raw data structures and looking at
> them by hand is the only way to determine what
> ZFS doesn't like and deduce what went wrong (and
> how to fix it).

http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf

:-)

-- 
------------------------------------------------------------------------
Volker A. Brandt              Consulting and Support for Sun Solaris
Brandt & Brandt Computer GmbH                 WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim             Email: vab at bb-c.de
Handelsregister: Amtsgericht Bonn, HRB 10513  Schuhgröße: 45
Geschäftsführer: Rainer J. H. Brandt und Volker A. Brandt
Hi Richard, Yes, sure. We can add that scenario. What''s been on my todo list is a ZFS troubleshooting wiki. I''ve been collecting issues. Let''s talk soon. Cindy Richard Elling wrote:> Tom Bird wrote: > >> Richard Elling wrote: >> >> >> >>> I see no evidence that the data is or is not correct. What we know >>> is that >>> ZFS is attempting to read something and the device driver is >>> returning EIO. >>> Unfortunately, EIO is a catch-all error code, so more digging to find >>> the >>> root cause is needed. >>> >> >> >> I''m currently checking the whole LUN, although as a 42TB unit this will >> take a few hours so we''ll see how that is tomorrow. >> >> >> >>> However, I will bet a steak dinner that if this device was mirrored >>> to another, >>> the pool will import just fine, with the affected device in a faulted >>> or degraded >>> state. >>> >> >> >> On any other file system though, I could probably kick off a fsck and >> get back most of the data. I see the argument a lot that ZFS "doesn''t >> need" a fsck utility, however I would be inclined to disagree, if not a >> full on fsck then something that can patch it up to the point where I >> can mount it and then get some data off or run a scrub. >> >> > > > Probably not. fsck only repairs metadata, it does not restore or correct > data. If the data is gone or damaged, then there isn''t much ZFS could > do, since ZFS was not in control of the data redundancy (by default, > ZFS metadata is redundant). > > BTW, another good sanity test is to try to read the ZFS labels: > zdb -l /dev/rdsk/... > > Cindy, I note that we don''t explicitly address the case where the pool > cannot be imported in the Troubleshooting and Data Recovery chapter > of the ZFS administration guide. Can we put this on the todo list? > -- richard >
>>>>> "r" == Ross <myxiplx at hotmail.com> writes:r> Tom wrote "There was a problem with the SAS bus which caused r> various errors including the inevitable kernel panic". It''s r> the various errors part that catches my eye, yeah, possibly, but there are checksums on the SAS bus, and its confirmation of what CDB''s have completed should always be accurate. If the problem was ``another machine booted up, and I told the other machine to ''zpool import -f'' '''' then maybe you have some point. but just tripping over a cable shouldn''t qualify as weird, nor should Erik''s problem of the FC array losing power or connectivity. These are both within the ``unclean shutdown'''' category handled by UFS+log, FFS+softdep, ext3, reiser, xfs, vxfs, jfs, HFS+, ... r> Can fsck always recover a disk? Or if the corruption is r> severe enough, are there times when even that fails? This question is obviously silly. write zeroes over the disk, and now the corruption is severe enough. However fsck can always recover a disk from a kernel panic, or a power failure of the host or of the disks, because these things don''t randomly scribble over the disk. (now, yeah, I know I posted earlier a story from Ted Ts''o about SGI hardware and about random disk scribbling as the 5V rail started drooping. yes, I posted that one. but it doesn''t happen _that much_. and it doesn''t even apply to Tom and Erik''s case of a loose SAS cable or tripping over an FC cord.) If the kernel panic was caused by a bug in the filesystem, then you''ll say aHA! aaHAh! but then, then it might do the scribbling! Well, yes. so in that case we agree there''s a bug in the filesystem. :) You''ll say ``but WHAT if the kernel panic was a bug in the DISK DRIVER, eh? eh, then maybe ZFS is not at fault!'''' sure, fine, read on. r> I don''t see that we have enough information here to really r> compare ZFS with UFS what we certainly have, between Tom and Erik and my own experience with resilvering-related errors accumulating in the CKSUM column when iSCSI targets go away, is enough information that ``you should have had redundant pools'''' doesn''t settle the issue. Reports of zpool corruption on single vdev''s mounted over SAN''s would benefit from further investigation, or at least a healthily-suspicious scientific attitude that encourages someone to investigate this if it happens in more favorable conditions, such as inside Sun, or to someone with a support contract and enough time to work on a case (maybe Tom?), or someone who knows ZFS well like Pavel. Also, there is enough concern for people designing paranoid systems to approach them with the view, ``ZFS is not always-consistent-on-disk unless it has working redundancy''''---choosing to build a ZFS system the same way as a UFS system without ZFS-level redundancy, based on our experience so far, is not just foregoing some of ZFS''s whizz-bang new feeechurs. It''s significantly less safe than the UFS system. For as long as the argument remains unsettled, conservative people need to understand that. Conservative people should also understand point (c) below. It sounds to me like Tom''s and Erik''s problems are more likely ZFS''s fault than not. The dialog has gone like this: 1. This isn''t within the class of errors ZFS should handle. get redundancy. 2. It sounds to me exactly like the class of error ZFS is supposed to handle. 3. You cannot prove 100% that this is necessarily the class of error ZFS is supposed to handle. Somethinig else might have happened. 
BTW, did I tell you how good ZFS (sometimes) is at dealing with ``might have happened'''' if you give it redundancy? It''s new, and exciting, and unprecedented! Is that a rabbit over there? Look, a redheaded girl juggling frisbies! What next, you''ll drag out screaming Dick Cheney on a chain? Recapping my view: a. it looks like a ZFS problem (okay, okay, PROBABLY a zfs problem) b. it''s a big problem c. there''s no good reason to believe people with redundant pools are immune from it, because they will run into it when they need their redundancy to cover a broken disk. It also deserves more testing by me: I''m going to back up my smaller ''aboveground'' pool and try to provoke it. r> although I do agree that some kind of ZFS repair tool r> sounds like it would be useful. I don''t want to dictate architecture when I don''t know the internals well. What''s immediately important to me is that ZFS handle unclean shutdown rigorously, as most other filesystems claim to and eventually mostly accomplish. This could be adding an fsck tool, but more likely it will be simply fixing a bug. Old computers had to bring up their swap space before fsck''ing big filesystems because the fsck process needed so much memory. The filesystem implementation was a small text of fragile code that would panic if it read the wrong bits from the disk, but it was fast and didn''t take much memory. It made sense to split the filesystem into two pieces, the fsck piece and the main piece, to conserve the machine''s core (and make the programming simpler). We have plenty of memory for text segments now, so it might make more sense to build fsck into the filesystem. The filesystem should be able to mount any state you would expect a hypothetical fsck tool to handle, and mount it almost immediately, and correct any ``errors'''' it finds while running. If you want to proactively correct errors, it should do this while mounted. That was the original ZFS pitch, and I think it''s not crazy. It''s basically what we''re supposed to have now with the ``always consistent on disk'''' claim and ''zpool scrub'' O(n)? online fsck-equivalent. FFS+softdep sort of works this way, too. It''s designed to safely mount ``unclean'''' filesystems, so in that sense, it''s ``always consistent.'''' It does not roll a log, because there isn''t one---it just mounts the filesystem as it was when the cord was pulled, and it can do this with no risk of kernel panicing or odd behavior to userland because of the careful order in which it writes data before the panic. However, after an unclean shutdown, the filesystem is still considered dirty even though it mounts and works. FreeBSD then starts the old fsck tool in the background. The fsck is still O(n^2). so...FFS+softdep sort of follows the new fsck-less model where the filesystem is one unified piece that does all its work after mounting, but follows it clumsily because it''s reusing the old FFS code and on-disk format. To my non-developer perspective, there seem to be the equivalent of mini-FFS+softdep-style fsck''s inside ZFS already. Sometimes when a mirror component goes away, ZFS does (what looks in ''zpool status'' like) a mini-resilver on the remaining component. There''s no redundancy in the vdev, so there''s nothing to actually resilver. Maybe this has to do with the quorum rules or the (seemingly broken) dirty region logging, both of which I still don''t understand. 
And there is also my old problem of 'zpool offline' reporting ``no valid
replicas'', until I've done a scrub, after which 'zpool offline' works
again, so a scrub is not really a purely proactive thing: buried inside
ZFS there is some notion of dirtiness preventing my 'zpool offline', and
a successful scrub clears the dirty bit (as do, possibly, other things,
like rebooting :( ).

so, the architecture might be fine as-is since scrub is already a little
more than what it claims to be, and is doing some sort of metadata or
RAID-level fsck-ing.  I wouldn't expect that the fix for these corrupt
single-vdev pools come in some specific form based on prejudices from
earlier filesystems.

Now there is another tool Anton mentioned, a recovery tool or forensic
tool: one that leaves the filesystem unmounted, treats the disks as
read-only, and tries to copy data out of it onto a new filesystem.  If
there were going to be a separate tool---say, something to handle disks
that have been scribbled on, or fixes for problems that are really tricky
or logically inappropriate to deal with on the mounted filesystem---I
think a forensic/recovery tool makes more sense than an fsck.  If this
odd stuff isn't supposed to happen, and it has happened anyway, you want
a tool you can run more than once.  You want the chance to improve the
tool and run it again, or to try an older version of the tool if the
current one keeps crashing.  I'm just really far from convinced that Tom
needs this tool.

    r> To me, it sounds like Sun have designed ZFS to always know if
    r> there is corruption on the disk, and to write data in a way
    r> that corruption of the whole filesystem *should* never happen.

sounds like depends on to what you're listening.  If you're listening to
Sun's claims, then yes, of course that's exactly what they claim.  If
you're listening to experience on this list, it sounds different.  The
closest we've come is, we agree I haven't completely invalidated the
original claims, which is pretty far from making me believe them again.
Anton B. Rang wrote:>> From the ZFS Administration Guide, Chapter 11, Data Repair section: >> Given that the fsck utility is designed to repair known pathologies >> specific to individual file systems, writing such a utility for a file >> system with no known pathologies is impossible. >> > > That''s a fallacy (and is incorrect even for the UFS fsck; refer to the McKusick/Kowalski paper and the distinction they make between ''expected'' corruptions and other inconsistencies). > > First, there are two types of utilities which might be useful in the situation where a ZFS pool has become corrupted. The first is a file system checking utility (call it zfsck); the second is a data recovery utility. The difference between those is that the first tries to bring the pool (or file system) back to a usable state, while the second simply tries to recover the files to a new location. >Hi Anton, How would you describe the difference between the file system checking utility and zpool scrub? Is zpool scrub lacking in its verification of the data? How would you describe the difference between the data recovery utility and ZFS''s normal data recovery process?> What does a file system check do? It verifies that a file system is internally consistent, and makes it consistent if it is not. If ZFS were always consistent on disk, then only a verification would be needed. Since we have evidence that it is not always consistent in the face of hardware failures, at least, repair may also be needed. This doesn''t need to be that hard. For instance, the space maps can be reconstructed by walking the various block trees; the uberblock effectively has several backups (though it might be better in some cases if an older backup were retained); and the ZFS checksums make it easy to identify block types and detect bad pointers. Files can be marked as damaged if they contain pointers to bad data; directories can be repaired if their hash structures are damaged (as long as the names and pointers can be salvaged); etc. Much more complex file systems than ZFS have file system checking utilities, because journaling, COW, etc. don''t help you in !the> face of software bugs or certain classes of hardware failures. > > A recovery tool is even simpler, because all it needs to do is find a tree root and then walk the file system, discovering directories and files, verifying that each of them is readable by using the checksums to check intermediate and leaf blocks, and extracting the data. The tricky bit with ZFS is simply identifying a relatively new root, so that the newest copy of the data can be identified. > > Almost every file system starts out without an fsck utility, and implements one once it becomes obvious that "sorry, you have to reinitialize the file system" -- or worse, "sorry, we lost all of your data" -- is unacceptable to a certain proportion of customers. > >Nobody thinks that an answer of "sorry, we lost all of your data" is acceptable. However, there are failures which will result in loss of data no matter how clever the file system is. But people will still believe their hardware is infallible and refuse to configure ZFS to be able to repair their data. You can only push a rope so far... -- richard
On Thu, 7 Aug 2008, Miles Nordin wrote: I must apologize that I was not able to read your complete email due to local buffer overflow ...> someone who knows ZFS well like Pavel. Also, there is enough concern > for people designing paranoid systems to approach them with the view, > ``ZFS is not always-consistent-on-disk unless it has working > redundancy''''---choosing to build a ZFS system the same way as a UFS > system without ZFS-level redundancy, based on our experience so far, > is not just foregoing some of ZFS''s whizz-bang new feeechurs. It''s > significantly less safe than the UFS system. For as long as the > argument remains unsettled, conservative people need to understand > that. Conservative people should also understand point (c) below.I don''t think that non-redundant ZFS can be classified as "significantly less safe than the UFS system". It seems that the world has little experience with 48TB single-LUN UFS filesystems, if indeed that is even possible. I would hate to wait for fsck of 48TB since some of the disks might wear out and need to be replaced before it completes. According to your logic, AIDS was safer before people were routinely tested (http://en.wikipedia.org/wiki/AIDS#HIV_test) to see if they were HIV positive. With ZFS you may learn that you have contracted AIDs within minutes of the event while with UFS you might not know until your immune system is beyond salvaging and the family is crying at your bed. Apparently you are in the "prefer not to know" group. The largest UFS filesystems I have here are under 120GB and even at that size they make me uneasy since I know that the data can silently fail (bad), or be read incorrectly (worse) and that if fsck is needed, it might take hours. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Thu, 2008-08-07 at 11:34 -0700, Richard Elling wrote:> How would you describe the difference between the data recovery > utility and ZFS''s normal data recovery process?I''m not Anton but I think I see what he''s getting at. Assume you have disks which once contained a pool but all of the uberblocks have been clobbered. So you don''t know where the root of the block tree is, but all the actual data is there, intact, on the disks. Given the checksums you could rebuild one or more plausible structure of the pool from the bottom up. I''d think that you could construct an offline zpool data recovery tool where you''d start with N disk images and a large amount of extra working space, compute checksums of all possible data blocks on the images, scan the disk images looking for things that might be valid block pointers, and attempt to stitch together subtrees of the filesystem and recover as much as you can even if many upper nodes in the block tree have had holes shot in them by a miscreant device. - Bill
[I think Miles and I seem to be talking about two different topics] Miles Nordin wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> > > re> If your pool is not redundant, the chance that data > re> corruption can render some or all of your data inaccessible is > re> always present. > > 1. data corruption != unclean shutdown >Agree. One is a state, the other is an event.> 2. other filesystems do not need a mirror to recover from unclean > shutdown. They only need it for when disks fail, or for when disks > misremember their contents (silent corruption, as in NetApp paper). >Agree. ZFS fits this category.> I would call data corruption and silent corruption the same thing: > what the CKSUM column was _supposed_ to count, though not in fact > the only thing it counts. >Agree. Data corruption takes two forms: detectable and undetectable (aka silent).> 3. saying ZFS needs a mirror to recover from unclean shutdown does not > agree with the claim ``always consistent on the disk'''' >Disagree. We test ZFS with unclean shutdowns all of the time and it works fine. However, if there is data corruption, then it may be possible that ZFS cannot recover unless there is a surviving copy of the good data. This is what mirrors and raidz do.> 4. I''m not sure exactly your position. Before you were saying what > Erik warned about doesn''t happen, because there''s no CR, and Tom > must be confused too. Now you''re saying of course it happens, > ZFS''s claims of ``always consistent on disk'''' count for nothing > unless you have pool redundancy. >No, I''m saying that data corruption without a surviving good copy of the data may lead to an unrecoverable data condition.> > And that is exactly what I said to start with: > > re> In general, ZFS can only repair conditions for which it owns > re> data redundancy. > > c> If that''s really the excuse for this situation, then ZFS is > c> not ``always consistent on the disk'''' for single-VDEV pools. > > that is the take-home message? >ZFS is always consistent on disk. If there is data corruption, then all bets are off, no matter what file system you choose.> If so, it still leaves me with the concern, what if the breaking of > one component in a mirrored vdev takes my system down uncleanly? This > seems like a really plausible failure mode (as Tom said, ``the > inevitable kernel panic''''). >Tom has not provided any data as to why the kernel panic''ed. Panic messages, as a minimum, would be enlightening.> In that case, I no longer have any redundancy when the system boots > back up. If ZFS calls the inconsistent states through which it > apparently sometimes transitions pools ``data corruption'''' and depends > on redundancy to recover from them, then isn''t it extremely dangerous > to remove power or SAN connectivity from any DEGRADED pool? The pool > should be rebuilt onto a hot spare IMMEDIATELY so that it''s ONLINE as > soon as possible, because if ZFS loses power with a DEGRADED pool all > bets are off. >In Tom''s case, ZFS was not configured such that it could rebuild a failed vdev on a hot spare.> If this DEGRADED-pool unclean shutdown is, as you say, a completely > different scenario from single-vdev pools that isn''t dangerous and has > no trouble with ZFS corruption, then no one should ever run a > single-vdev pool. We should instead run mirrored vdevs that are > always DEGRADED, since this configuration looks identical to > everything outside ZFS but supposedly magically avoids the issue. 
If > only we had some way to attach to vdevs fake mirror components that > immediately get marked FAULTED then we can avoid the corruption risk. > But, that''s clearly absurd! >Fast, reliable, inexpensive: pick two.> so, let''s say ZFS''s requirement is, as we seem to be describing it: > might lose the whole pool if your kernel panics or you pull the power > cord in a situation without redundancy. Then I think this is an > extremely serious issue, even for redundant pools.Agree. But in Tom''s case, there is no proof that the fault condition is cleared. The fact that zpool import fails with an I/O error is a strong indicator that the fault is still present. We do not yet know if there is a data corruption issue or not.> It is very > plausible that a machine will panic or lose power during a resilver. >I think this is an unfounded statement. There are many cases where resilvers complete successfully. In our data reliability models, we have a parameter for the probability of [un]successful resilver, but all of our research in determining a value for this centers around actual data loss or corruption in the devices. Do you have research that points to another cause?> And if, on the other hand, ZFS doesn''t transition disks through > inconsistent states and then excuse itself calling what it did ``data > corruption'''' when it bites you after an unclean shutdown, then what > happened to Erik and Tom? >I have no idea what happened to Erik. His post makes claims of loss followed by claims of unfixed, known problems, but no real pointer to bugids. Hence my comment about his post being of the "your baby is ugly" variety. At least point out the mole in the middle of the forehead, aka CR???> It seems to me it is ZFS''s fault and can''t be punted off to the > administrator''s ``asking for it.'''' >I think the jury is still out. Tom needs to complete his tests and provide the messages and FMA notifications so that a root cause can be determined. Meanwhile, we''ll work on putting together some docs on how to proceed when your pool can''t be imported, because it would be good to have. And, as Anton notes, we can''t scrub the pool if we can''t import the pool. -- richard
On Thu, Aug 07, 2008 at 11:34:12AM -0700, Richard Elling wrote:> Anton B. Rang wrote: > > First, there are two types of utilities which might be useful in the situation where a ZFS pool has become corrupted. The first is a file system checking utility (call it zfsck); the second is a data recovery utility. The difference between those is that the first tries to bring the pool (or file system) back to a usable state, while the second simply tries to recover the files to a new location. > > > Hi Anton, > How would you describe the difference between the file system > checking utility and zpool scrub? Is zpool scrub lacking in its > verification of the data?One thing I can think of is that scrub is only available on an imported pool, and you can only import a writable pool. It might be nice to verify that a read-only image is valid. Or to import/mount a pool on damaged media for recovery.> How would you describe the difference between the data recovery > utility and ZFS''s normal data recovery process?If the most recent uberblock appears valid, but doesn''t have useful data, I don''t think there''s any way currently to see what the tree of an older uberblock looks like. It would be nice to see if that data appears valid and try to create a view that would be readable/recoverable. -- Darren
Miles Nordin wrote:

>>>>>> "r" == Ross <myxiplx at hotmail.com> writes:
>
> r> Tom wrote "There was a problem with the SAS bus which caused
> r> various errors including the inevitable kernel panic". It's
> r> the various errors part that catches my eye,
>
> yeah, possibly, but there are checksums on the SAS bus, and its
> confirmation of what CDBs have completed should always be accurate.

But there's more to it than that -- there's a storage controller behind the SAS bus with its cache, and loads of disks behind that. Even though there are checksums on the SAS bus, and the storage controller should not have lost or damaged any of its cache, there's still a possibility for a disk to drop a write on the floor silently or misdirect it, or for the storage controller itself to be configured in such a way that it does not guarantee data protection all the time...

> If the problem was ``another machine booted up, and I told the other
> machine to 'zpool import -f' '' then maybe you have some point. but
> just tripping over a cable shouldn't qualify as weird, nor should
> Erik's problem of the FC array losing power or connectivity. These
> are both within the ``unclean shutdown'' category handled by UFS+log,
> FFS+softdep, ext3, reiser, xfs, vxfs, jfs, HFS+, ...

Does forceful removal of power count as unclean shutdown? If yes, I do it several times a day to my notebook with ZFS root. I'm typing this from it, booted just fine from ZFS after another unclean shutdown.

> r> Can fsck always recover a disk? Or if the corruption is
> r> severe enough, are there times when even that fails?
>
> This question is obviously silly. write zeroes over the disk, and now
> the corruption is severe enough. However fsck can always recover a
> disk from a kernel panic, or a power failure of the host or of the
> disks, because these things don't randomly scribble over the disk.

I have an image of a UFS filesystem which passes fsck just fine but then panics the system as soon as writes are started.

> Reports of zpool
> corruption on single vdevs mounted over SANs would benefit from
> further investigation, or at least a healthily-suspicious scientific
> attitude that encourages someone to investigate this if it happens in
> more favorable conditions, such as inside Sun, or to someone with a
> support contract and enough time to work on a case (maybe Tom?),

The problem is that such reports often do not have enough details, and investigation in such cases can take lots of time and yield nothing...

> or
> someone who knows ZFS well like Pavel. Also, there is enough concern
> for people designing paranoid systems to approach them with the view,
> ``ZFS is not always-consistent-on-disk unless it has working
> redundancy''

Again, always-consistent-on-disk is not related to redundancy. On-disk consistency is achieved by not writing new blocks over currently allocated ones, regardless of the redundancy of the underlying vdevs. If the underlying vdevs are redundant, you have a better chance of surviving corruption of data stored on disk.

> Now there is another tool Anton mentioned, a recovery tool or forensic
> tool: one that leaves the filesystem unmounted, treats the disks as
> read-only, and tries to copy data out of it onto a new filesystem. If
> there were going to be a separate tool---say, something to handle disks
> that have been scribbled on, or fixes for problems that are really
> tricky or logically inappropriate to deal with on the mounted
> filesystem---I think a forensic/recovery tool makes more sense than an
> fsck. If this odd stuff isn't supposed to happen, and it has happened
> anyway, you want a tool you can run more than once. You want the
> chance to improve the tool and run it again, or to try an older
> version of the tool if the current one keeps crashing.

Reads in ZFS can be broadly classified into two types:

- ones that are not critical from the ZFS perspective, meaning reads of user data and associated metadata, where ZFS can safely return an I/O error in case of checksum failure;

- ones that are critical from the ZFS perspective -- reads of ZFS metadata required to perform writes; depending on context it may be impossible to return an I/O error, and ZFS has either to panic or act according to the failmode property setting.

So for some cases of non-redundant (and even redundant, where all redundant copies are corrupted, e.g. simultaneous import of a pool from two hosts) pool corruption, it may be enough to import the pool in pure read-only mode, not trying to write anything into the pool (hence not having to read any metadata required to do so), in order to save all the data which can be read. There's an RFE for this feature but I do not have the number handy.

Victor
> How would you describe the difference between the file system
> checking utility and zpool scrub? Is zpool scrub lacking in its
> verification of the data?

To answer the second question first: yes, zpool scrub is lacking, at least to the best of my knowledge (I haven't looked at the ZFS source in a few months). It does not verify that any internal data structures are correct; rather, it simply verifies that data and metadata blocks match their checksums. This makes it useless in situations such as those described in bugs 6458218/6634517, where a pool cannot be imported because its metadata is inconsistent.

It also would not repair a damaged directory, for instance. If a directory in ZFS is damaged, its files become permanently inaccessible; if the same happens in UFS, fsck will create new links to the files in the lost+found directory. It's as if the UFS fsck could only work on mounted file systems, and could tell you that there was a problem, but not fix it.

> How would you describe the difference between the
> data recovery utility and ZFS's normal data recovery process?

What do you consider the "normal data recovery process"? Take, for instance, the pool which Borys just mentioned on this list, which causes a kernel panic at import. I'm not sure how ZFS can recover from that.

A data recovery utility would (for instance) scan the pool, locate a healthy-looking uberblock (or, failing that, look for one or more top-level file system blocks), and traverse the tree down from that point, pulling files from the disk as it goes. When a damaged metadata block is found, a scan can be performed for blocks which are candidates to belong under it; or potential block numbers can be extracted from the damaged block itself.

This message posted from opensolaris.org
Victor Latushkin wrote:> Hi Tom and all, > > Tom Bird wrote: >> Hi, >> >> Have a problem with a ZFS on a single device, this device is 48 1T SATA >> drives presented as a 42T LUN via hardware RAID 6 on a SAS bus which had >> a ZFS on it as a single device. >> >> There was a problem with the SAS bus which caused various errors >> including the inevitable kernel panic, the thing came back up with 3 out >> of 4 zfs mounted. > > It would be nice to see a panic stack.I''m afraid I don''t have that but now have an open connection to the terminal server logging everything in case it should happen again.>> root at cs3:~# zpool import -f content >> cannot import ''content'': I/O error > > As long as it does not panic and just returns I/O error which is rather > generic, you may try to dig a little bit deeper with DTrace to have a > chance to see where this I/O error is generated first, e.g. something > like this with the attached dtrace script: > > dtrace -s /path/to/script -c "zpool import -f content"dtrace output was 6MB, a bit rude to post to the list so I''ve uploaded it here: http://picard.portfast.net/~tom/import.txt> It is also interesting what impact SAS bus problem had on the storage > controller. Btw, what is storage controller in question here?The controller is an LSI Logic PCI express with 2 external SAS ports which runs to an eonstor 2u 12 disk RAID chassis with 3 JBOD packs daisy chained from that. It seems I can''t run the JBODs directly to the SAS controller when using SATA drives (may be a different story with proper SAS) and the RAID box has no JBOD mode so the redundancy has to stay in the box and can''t be transferred to ZFS. The entire faulted array reads cleanly at /dev/rdsk level into /dev/null. There are 4 such arrays connected to the server via two SAS cards with a ZFS on each one, the supplied internal SAS card and an ixgb NIC are the only other cards installed. System boots from the standard internal disks.>> root at cs3:~# uname -a >> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200 > > Btw, have you considered opening support call for this issue?Would have thought that unless they have a secret zfsck utility there''s probably not much they can do. It''s not a Sun disk array or Sun branded SAS card. thanks -- Tom // www.portfast.co.uk -- internet services and consultancy // hosting from 1.65 per domain
I''m not Anton Rang, but: | How would you describe the difference between the data recovery | utility and ZFS''s normal data recovery process? The data recovery utility should not panic my entire system if it runs into some situation that it utterly cannot handle. Solaris 10 U5 kernel ZFS code does not have this property; it is possible to wind up with ZFS pools that will panic your system when you try to touch them. (The same thing is true of a theoretical file system checking utility.) The data recovery utility can ask me questions about what I want it to do in an ambiguous situation, or give me only partial results. The data recovery can be run read-only, so that I am sure that any problems in it are not making my situation worse. | Nobody thinks that an answer of "sorry, we lost all of your data" is | acceptable. However, there are failures which will result in loss of | data no matter how clever the file system is. The problem is that there are currently ways to make ZFS lose all your data when there are no hardware faults or failures, merely people or software mis-handling pools. This is especially frustrating when the only thing that is likely to be corrupted is ZFS metadata and the vast majority (or all) of the data in the pool is intact, readable, and so on. - cks
> | How would you describe the difference between the data recovery
> | utility and ZFS's normal data recovery process?
>
> The data recovery utility should not panic my entire system if it runs
> into some situation that it utterly cannot handle. Solaris 10 U5 kernel
> ZFS code does not have this property; it is possible to wind up with ZFS
> pools that will panic your system when you try to touch them.

I do agree. For the last three weeks I have been testing an ARC-1680 SAS controller with an external cabinet with 16 SAS disks at 1 TB. The server is an E5405 quad-core with 8 GB RAM. Setting the card to JBOD mode gave me a somewhat unstable setup where the disks would stop responding. After I had put all the disks on the controller in passthrough mode the setup did stabilize, and I was able to copy 3.8 of 4 TB of small files before some of the disks bailed out. I brought the disks back online and restarted the server. A zpool status that showed the disks online also told me:

errors: Permanent errors have been detected in the following files:

        ef1/image/z018_16:<0x0>

The zpool consisted of five disks in each of three separate raidz vdevs, plus one spare.

The only time I've experienced that the server could not get the zpool back online was when the disks for some reason failed. I find it completely valid that the server panics rather than write inconsistent data.

Every time our internal file server suffered an unplanned restart (power failure) it always recovered (Solaris 10/08 and zfs ver. 4). But this Sunday, Aug. the 10th, the same file server was brought down by a faulty UPS. When power was restored the zpool had become inconsistent. This time the storage was also affected by the power outage.

Is it a valid point that zfs is able to recover more gracefully when the server itself goes down rather than when some of the disks/LUNs bail out? The reason I ask is that this is the only time I've personally seen zfs unable to recover.

> The data recovery utility can ask me questions about what I want it
> to do in an ambiguous situation, or give me only partial results.

Our NFS server was also on this faulty UPS. It is running Solaris 9 on sparc with vxfs and is managing 109 TB of storage on an HDS. When I switched on the server I saw that it replayed the journal, marked the partition as clean and came online. I know that there is no guarantee that the data are consistent, but at least vxfs has had many years to mature.

I had initially planned to migrate some of the older partitions to zfs and thereby test it. But I've changed that and will try the setup with the ARC-1680 controller and SAS disks as an internal file server instead for a while, and rather add additional storage to our Solaris 9 server and vxfs.

Zfs has changed the way I look at filesystems and I'm very glad that Sun gives it so much exposure. But atm. I'd give vxfs the edge. :-)

--
regards
Claus

When lenity and cruelty play for a kingdom, the gentlest gamester is the soonest winner.
Shakespeare
Claus Guttesen wrote:>> | How would you describe the difference between the data recovery >> | utility and ZFS''s normal data recovery process? >> >> The data recovery utility should not panic my entire system if it runs >> into some situation that it utterly cannot handle. Solaris 10 U5 kernel >> ZFS code does not have this property; it is possible to wind up with ZFS >> pools that will panic your system when you try to touch them. >> > > I do agree. The last three weeks I have been testing an > arc-1680-sas-controller with an external cabinet with 16 sas-disk at 1 > TB. The server is a E5405 quad-core with 8 GB RAM. Setting the card to > jbod-mode gave me a somewhat unstable setup where the disks would stop > responding. After I had put all the disk on the controller in > passthrough-mode the setup did stabilize and I was able to copy 3.8 of > 4 TB of small files when some of the disks bailed out. I brought the > disks back online and restarted the server. A zpool status that show > online disks also told me: > > errors: Permanent errors have been detected in the following files: > ef1/image/z018_16:<0x0> > > The zpool consisted of five disks in three seperate raidz in one zpool > including one spare. > > The only time I''ve experienced that the server could not get the zpool > back online was when the disks for some reason failed. I find it > completely valid that the server panics rather than write inconsisten > data. > > Everytime when our internal file-server suffered a unplanned restart > (power failure) it always recovered (solaris 10/08 and zfs ver. 4). > But this Sunday Aug. the 10''th the same file-server was brought down > by a faulty UPS. When power was restored the zpool had become > inconsistent. This time the storage was also affected by the > power-outage. > > Is it a valid point that zfs is able to recover more gracefully when > the server itself goes down rather than when some of the disks/LUN''s > bails out? The reason I ask is because that is the only time I''ve > personally seen zfs unable to recover. >Later versions of ZFS, not yet in Solaris 10, are much more tolerant of disappearing storage. Solaris 10 update 6 should contain these features later this year. OpenSolaris 2008.05 and SXCE b72 or later already have these features. There is a failure mode that we worry about: ZFS depends on the disk actually writing (flushing) data to nonvolatile storage when ZFS issues the flush request. If that does not actually occur, then you may see the problems you describe. While ZFS distrusts storage better than most file systems, it must still trust a flush request.>> The data recovery utility can ask me questions about what I want it >> to do in an ambiguous situation, or give me only partial results. >> > > Our nfs-server was also on this faulty UPS. This is running solaris 9 > on sparc with vxfs and is managing 109 TB of storage on a HDS. When I > switched on the server I saw the it replayed the journal and marked > the partition as clean and came online. I know that there is no > guarantee that the data are consistent but at least vxfs have had many > years to mature. > > I had initially planned to migrate some of the older partitions to zfs > and thereby test it. But I''ve changed that and will try the setup with > the arc-1680-controller and sas-disks as an internal file-server > instead for a while and rather add additional storage to our solaris > 9-server and vxfs. 
> > Zfs have changed the way I look at filesystems and I''m very glad that > Sun gives it so much exposure. But atm. I''d give vxfs the edge. :-) > >I''ve had excellent experiences with Sun-branded HDS storage: rock solid. For flaky hardware that seems to lose data during a power outage, I''d prefer a file system that can detect that my data is corrupted. -- richard
Richard Elling
2008-Aug-12 00:25 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
Chris Siebenmann wrote:> I''m not Anton Rang, but: > | How would you describe the difference between the data recovery > | utility and ZFS''s normal data recovery process? > > The data recovery utility should not panic my entire system if it runs > into some situation that it utterly cannot handle. Solaris 10 U5 kernel > ZFS code does not have this property; it is possible to wind up with ZFS > pools that will panic your system when you try to touch them. > > (The same thing is true of a theoretical file system checking utility.) > > The data recovery utility can ask me questions about what I want it > to do in an ambiguous situation, or give me only partial results. > > The data recovery can be run read-only, so that I am sure that any > problems in it are not making my situation worse. > > | Nobody thinks that an answer of "sorry, we lost all of your data" is > | acceptable. However, there are failures which will result in loss of > | data no matter how clever the file system is. > > The problem is that there are currently ways to make ZFS lose all your > data when there are no hardware faults or failures, merely people or > software mis-handling pools. This is especially frustrating when the > only thing that is likely to be corrupted is ZFS metadata and the vast > majority (or all) of the data in the pool is intact, readable, and so > on. >As others have noted, the COW nature of ZFS means that there is a good chance that on a mostly-empty pool, previous data is still intact long after you might think it is gone. A utility to recover such data is (IMHO) more likely to be in the category of forensic analysis than a mount (import) process. There is more than enough information publically available for someone to build such a tool (hint, hint :-) -- richard
From: Richard Elling <Richard.Elling at Sun.COM>

Miles Nordin wrote:
>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>>> "tb" == Tom Bird <tom at marmot.org.uk> writes:
>...
>
> re> In general, ZFS can only repair conditions for which it owns
> re> data redundancy.

tb> If that's really the excuse for this situation, then ZFS is not
tb> ``always consistent on the disk'' for single-VDEV pools.

re> I disagree with your assessment. The on-disk
re> format (any on-disk format) necessarily assumes
re> no faults on the media. The difference between ZFS
re> on-disk format and most other file systems is that
re> the metadata will be consistent to some point in time
re> because it is COW.
...
tb> There was no loss of data here, just an interruption in the connection
tb> to the target, like power loss or any other unplanned shutdown.
tb> Corruption in this scenario is a significant regression w.r.t. UFS:

re> I see no evidence that the data is or is not correct.
...
re> However, I will bet a steak dinner that if this device
re> was mirrored to another, the pool will import just fine,
re> with the affected device in a faulted or degraded state.

tb> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html

re> I have no idea what Eric is referring to, and it does
re> not match my experience.

We had a similar problem in our environment. We lost a CPU on the server, resulting in metadata corruption and an unrecoverable pool. We were told that we were seeing a known bug that will be fixed in S10u6.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html

From: Tom Bird <tom at marmot.org.uk>

tb> On any other file system though, I could probably kick
tb> off a fsck and get back most of the data. I see the
tb> argument a lot that ZFS "doesn't need" a fsck utility,
tb> however I would be inclined to disagree, if not a
tb> full on fsck then something that can patch it up to the
tb> point where I can mount it and then get some data off or
tb> run a scrub.

If not that, then we need some sort of recovery tool. We ought to be able to have some sort of recovery mode that allows us to read off the known good data or roll back to a snapshot or something.

When you have a really big file system, telling us (as Sun support told us) that our only option was to rebuild the zpool and restore from tape, it becomes really difficult to justify using the product in certain production environments. (For example, consider an environment where the available storage is on a hardware RAID-5 system, and where mirroring large amounts of already RAID-ed space adds up to more cost than a VxFS license. Not every type of data requires more protection than you get with standard hardware-based RAID-5.)

--Scott
Chris Siebenmann <cks at cs.toronto.edu>:

I'm not Anton Rang, but:
| How would you describe the difference between the data recovery
| utility and ZFS's normal data recovery process?

cks> The data recovery utility should not panic
cks> my entire system if it runs into some situation
cks> that it utterly cannot handle. Solaris 10 U5
cks> kernel ZFS code does not have this property;
cks> it is possible to wind up with ZFS pools that
cks> will panic your system when you try to touch them.
...

I'll go you one worse. Imagine a Sun Cluster with several resource groups and several zpools. You blow a proc on one of the servers. As a result, the metadata on one of the pools becomes corrupted.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html

Now, each of the servers in your cluster attempts to import the zpool--and panics.

As a result of a single part failure on a single server, your entire cluster (and all the services on it) are sitting in a smoking heap on your machine room floor.

| Nobody thinks that an answer of "sorry, we lost all of your data" is
| acceptable. However, there are failures which will result in loss of
| data no matter how clever the file system is.

cks> The problem is that there are currently ways to
cks> make ZFS lose all your data when there are no
cks> hardware faults or failures, merely people or
cks> software mis-handling pools. This is especially
cks> frustrating when the only thing that is likely
cks> to be corrupted is ZFS metadata and the vast
cks> majority (or all) of the data in the pool is intact,
cks> readable, and so on.

I'm just glad that our pool corruption experience happened during testing, and not after the system had gone into production. Not exactly a resume-enhancing experience.

--Scott
Wade.Stuart at fallon.com
2008-Aug-12 15:52 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
> As others have noted, the COW nature of ZFS means that there is a
> good chance that on a mostly-empty pool, previous data is still intact
> long after you might think it is gone. A utility to recover such data is
> (IMHO) more likely to be in the category of forensic analysis than
> a mount (import) process. There is more than enough information
> publically available for someone to build such a tool (hint, hint :-)
> -- richard

Veritas, the makers of vxfs, which I consider ZFS to be competing against, has higher-level (normal) support engineers who have access to tools that let them scan the disk for inodes and other filesystem fragments and recover them. When you log a support call on a faulty filesystem (in one such case I was involved in, something zeroed out 100 MB of the first portion of the volume, killing off both top OLTs -- bad, bad), they can actually help you at a very low level dig data out of the filesystem, or even recover from pretty nasty issues. They can scan for inodes (marked by a magic number) and have utilities to pull out files from those inodes (including indirect blocks/extents). Given the tools and help from their support, I was able to pull back 500 GB of files (99%) from a filesystem that EMC killed during a botched PowerPath upgrade. Can Sun's support engineers do the same, or is their answer to pull from tape? (hint, hint ;-)

-Wade
Darren J Moffat
2008-Aug-12 15:59 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
Wade.Stuart at fallon.com wrote:>> As others have noted, the COW nature of ZFS means that there is a >> good chance that on a mostly-empty pool, previous data is still intact >> long after you might think it is gone. A utility to recover such data is >> (IMHO) more likely to be in the category of forensic analysis than >> a mount (import) process. There is more than enough information >> publically available for someone to build such a tool (hint, hint :-) >> -- richard > > Veritas, the makers if vxfs, whom I consider ZFS to be trying to > compete against has higher level (normal) support engineers that have > access to tools that let them scan the disk for inodes and other filesystem > fragments and recover. When you log a support call on a faulty filesystem > (in one such case I was involved in zeroed out 100mb of the first portion > of the volume killing off both top OLT''s -- bad bad) they can actually help > you at a very low level dig data out of the filesystem or even recover from > pretty nasty issues. They can scan for inodes (marked by a magic number), > have utilities to pull out files from those inodes (including indirect > blocks/extents). Given the tools and help from their support I was able to > pull back 500 gb of files (99%) from a filesystem that emc killed during a > botched powerpath upgrade. Can Sun''s support engineers, or is their > answer pull from tape? (hint, hint ;-)Sounds like a good topic for here: http://opensolaris.org/os/project/forensics/ -- Darren J Moffat
Chris Siebenmann
2008-Aug-12 16:14 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
| As others have noted, the COW nature of ZFS means that there is a good | chance that on a mostly-empty pool, previous data is still intact long | after you might think it is gone. In the cases I am thinking of I am sure that the data was there. Kernel panics just didn''t let me get at it. Fortunately it was only testing data, but I am now concerned about it happening in production. | A utility to recover such data is (IMHO) more likely to be in the | category of forensic analysis than a mount (import) process. There is | more than enough information publically available for someone to build | such a tool (hint, hint :-) To put it crudely, if I wanted to write my own software for this sort of thing I would run Linux. - cks
max at bruningsystems.com
2008-Aug-12 17:03 UTC
[zfs-discuss] Forensic analysis [was: more ZFS recovery]
Darren J Moffat wrote:> Wade.Stuart at fallon.com wrote: > >>> As others have noted, the COW nature of ZFS means that there is a >>> good chance that on a mostly-empty pool, previous data is still intact >>> long after you might think it is gone. A utility to recover such data is >>> (IMHO) more likely to be in the category of forensic analysis than >>> a mount (import) process. There is more than enough information >>> publically available for someone to build such a tool (hint, hint :-) >>> -- richard >>> >> Veritas, the makers if vxfs, whom I consider ZFS to be trying to >> compete against has higher level (normal) support engineers that have >> access to tools that let them scan the disk for inodes and other filesystem >> fragments and recover. When you log a support call on a faulty filesystem >> (in one such case I was involved in zeroed out 100mb of the first portion >> of the volume killing off both top OLT''s -- bad bad) they can actually help >> you at a very low level dig data out of the filesystem or even recover from >> pretty nasty issues. They can scan for inodes (marked by a magic number), >> have utilities to pull out files from those inodes (including indirect >> blocks/extents). Given the tools and help from their support I was able to >> pull back 500 gb of files (99%) from a filesystem that emc killed during a >> botched powerpath upgrade. Can Sun''s support engineers, or is their >> answer pull from tape? (hint, hint ;-) >> > > Sounds like a good topic for here: > > http://opensolaris.org/os/project/forensics/ >I took a look at this project, specifically http://opensolaris.org/os/project/forensics/ZFS-Forensics/. Is there any reason that the paper and slides I presented at the OpenSolaris Developers Conference on zfs on-disk format not mentioned? The paper is at: http://www.osdevcon.org/2008/files/osdevcon2008-proceedings.pdf starting on page 36, and the slides are at: http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf thanks, max
On Aug 7, 2008, at 10:25 PM, Anton B. Rang wrote:>> How would you describe the difference between the file system >> checking utility and zpool scrub? Is zpool scrub lacking in its >> verification of the data? > > To answer the second question first, yes, zpool scrub is lacking, at > least to the best of my knowledge (I haven''t looked at the ZFS > source in a few months). It does not verify that any internal data > structures are correct; rather, it simply verifies that data and > metadata blocks match their checksums.Hey Anton, What do you mean by "internal data structures"? Are you referring to things like space maps, props, history obj, etc. (basically anything other than user data and the indirect blocks that point to user data)? eric
Cromar Scott wrote:> Chris Siebenmann <cks at cs.toronto.edu> > > I''m not Anton Rang, but: > | How would you describe the difference between the data recovery > | utility and ZFS''s normal data recovery process? > > cks> The data recovery utility should not panic > cks> my entire system if it runs into some situation > cks> that it utterly cannot handle. Solaris 10 U5 > cks> kernel ZFS code does not have this property; > cks> it is possible to wind up with ZFS pools that > cks> will panic your system when you try to touch them. > ... > > I''ll go you one worse. Imagine a Sun Cluster with several resource > groups and several zpools. You blow a proc on one of the servers. As a > result, the metadata on one of the pools becomes corrupted. >This failure mode affects all shared-storage clusters. I don''t see how ZFS should or should not be any different than raw, UFS, et.al.> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html > > Now, each of the servers in your cluster attempts to import the > zpool--and panics. > > As a result of a singe part failure on a single server, your entire > cluster (and all the services on it) are sitting in a smoking heap on > your machine room floor. >Yes, but your data is corrupted. If you were my bank, then I would greatly appreciate you getting the data corrected prior to bringing my account online. If you study highly available clusters and services then you will see many cases where human interaction is preferred to automation for just such cases. You will also find that a combination of shared storage and non-shared storage cluster technology is used for truly important data. For example, we would use Solaris Cluster for the local shared-storage framework and Solaris Cluster Geographic Edition for a remote site (no shared hardware components with the local cluster).> | Nobody thinks that an answer of "sorry, we lost all of your data" is > | acceptable. However, there are failures which will result in loss of > | data no matter how clever the file system is. > > cks> The problem is that there are currently ways to > cks> make ZFS lose all your data when there are no > cks> hardware faults or failures, merely people or > cks> software mis-handling pools. This is especially > cks> frustrating when the only thing that is likely > cks> to be corrupted is ZFS metadata and the vast > cks> majority (or all) of the data in the pool is intact, > cks> readable, and so on. > > I''m just glad that our pool corruption experience happened during > testing, and not after the system had gone into production. Not exactly > a resume-enhancing experience. >I''m glad you found this in testing. BTW, what was the root cause? -- richard
Richard Elling <Richard.Elling at Sun.COM> Cromar Scott wrote:> Chris Siebenmann <cks at cs.toronto.edu> > > I''m not Anton Rang, but: > | How would you describe the difference between the data recovery > | utility and ZFS''s normal data recovery process? > > cks> The data recovery utility should not panic > cks> my entire system if it runs into some situation > cks> that it utterly cannot handle. Solaris 10 U5 > cks> kernel ZFS code does not have this property; > cks> it is possible to wind up with ZFS pools that > cks> will panic your system when you try to touch them. > ... > > I''ll go you one worse. Imagine a Sun Cluster with several resource > groups and several zpools. You blow a proc on one of the servers. Asa> result, the metadata on one of the pools becomes corrupted. >re> This failure mode affects all shared-storage re> clusters. I don''t see how ZFS should or should re> not be any different than raw, UFS, et.al. Absolutely true. The file system definitely had a problem.>http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html> > Now, each of the servers in your cluster attempts to import the > zpool--and panics. > > As a result of a singe part failure on a single server, your entire > cluster (and all the services on it) are sitting in a smoking heap on > your machine room floor. >re> Yes, but your data is corrupted. My data was only corrupted on ONE of the zpools. In a cluster with several zpools and several resource groups, we ended up with ALL of the pools and ALL of the resource groups offline as one node after another panicked. re> If you were my bank, then I would greatly re> appreciate you getting the data corrected re> prior to bringing my account online. Fair enough, but do we have to take Fred''s and Joe''s accounts offline too? re> If you study highly available clusters and services re> then you will see many cases where human interaction re> is preferred to automation for just such cases. I see your point about requiring intervention to deal with a potentially corrupt file system. I would have preferred a behavior more like we get with VxVM and VxFS, where the corrupted file system fails to mount without human intervention, but the nodes don''t panic on the failed vxdg import. That particular service group and that particular file system are offline, but everything else keeps running because none of the other nodes panics. We handled the issue of not corrupting the file system further by panicking the original node, but I don''t understand why we need to panic each other successive node in the cluster. Why can''t we just refuse to import automatically?> I''m just glad that our pool corruption experience happened during > testing, and not after the system had gone into production. Notexactly> a resume-enhancing experience.re> I''m glad you found this in testing. I''m a believer. Some people wanted us to just throw the box into production, but I insisted on keeping our test schedule. I''m glad I did. re> BTW, what was the root cause? It appears that the metadata on that pool became corrupted when the processor failed. The exact mechanism is a bit of a mystery, since we didn''t get a valid crash dump. The other pools were fine, once we imported them after a boot -x. We ended up converting to VxVM and VxFS on that server because we could not guarantee that the same thing wouldn''t just happen again after we went into production. If we had a tool that had allowed us to roll back to a previous snapshot or something, it might have made a difference. 
We were told that the probability of metadata corruption would have been reduced but not eliminated by having a mirrored LUN. We were also told that the issue will be fixed in U6.

--Scott
>>>>> "cs" == Cromar Scott <SCromar at caxton.com> writes:

cs> It appears that the metadata on that pool became corrupted
cs> when the processor failed. The exact mechanism is a bit of a
cs> mystery, [...]
cs> We were told that the probability of metadata corruption would
cs> have been reduced but not eliminated by having a mirrored LUN.
cs> We were also told that the issue will be fixed in U6.

how can one fix an issue which is a bit of a mystery? Or do you mean the lazy-panic issue is fixed, but the corruption issue is not?
Miles Nordin <carton at Ivy.NET>:

>>>>> "cs" == Cromar Scott <SCromar at caxton.com> writes:

cs> It appears that the metadata on that pool became corrupted
cs> when the processor failed. The exact mechanism is a bit of a
cs> mystery, [...]
cs> We were told that the probability of metadata corruption would
cs> have been reduced but not eliminated by having a mirrored LUN.
cs> We were also told that the issue will be fixed in U6.

mn> how can one fix an issue which is a bit of a
mn> mystery? Or do you mean the lazy-panic issue
mn> is fixed, but the corruption issue is not?

We opened a call with Sun support. We were told that the corruption issue was due to a race condition within ZFS. We were also told that the issue was known and was scheduled for a fix in S10U6. Sun support recommended that we use a mirrored pool to reduce the possibility of the bug re-emerging, but they told us that even a mirrored pool might be subject to the same bug.

Moving to OpenSolaris was not an option due to the nature of the application and our requirement for support. It is possible that we might have been able to move to SVM and UFS rather than VxVM and VxFS, but we were bumping up against our deadline, and we knew from previous deployments that VxVM and VxFS would work. (And we had the management infrastructure in place to deal with Veritas.)

We run ZFS in several other environments with different availability requirements, but we were hoping to start using it across the board as a VxVM/VxFS replacement.

--Scott
>>>>> "cs" == Cromar Scott <SCromar at caxton.com> writes:

cs> We opened a call with Sun support. We were told that the
cs> corruption issue was due to a race condition within ZFS. We
cs> were also told that the issue was known and was scheduled for
cs> a fix in S10U6.

nice. Is there a bug number? or is this one of the secret bugs?
Miles Nordin <carton at Ivy.NET>:

>>>>> "cs" == Cromar Scott <SCromar at caxton.com> writes:

cs> We opened a call with Sun support. We were told that the
cs> corruption issue was due to a race condition within ZFS. We
cs> were also told that the issue was known and was scheduled for
cs> a fix in S10U6.

mn> nice. Is there a bug number? or is this one of the secret bugs?

We were told bug number 6565042, but the description doesn't quite match up with what the Sun engineer told us on the phone. Maybe it's one of those things where the fix for 6565042 also fixes our problem.

--Scott
A Darren Dunham wrote:> > If the most recent uberblock appears valid, but doesn''t have useful > data, I don''t think there''s any way currently to see what the tree of an > older uberblock looks like. It would be nice to see if that data > appears valid and try to create a view that would be > readable/recoverable. > >I have a method to examine uberblocks on disk. Using this, along with my modified mdb and zdb, I have been able to recover a previously removed file. I''ll post details in a blog if there is interest. max
Hello max,

Sunday, August 17, 2008, 1:02:05 PM, you wrote:

mbc> A Darren Dunham wrote:
>> If the most recent uberblock appears valid, but doesn't have useful
>> data, I don't think there's any way currently to see what the tree of an
>> older uberblock looks like. It would be nice to see if that data
>> appears valid and try to create a view that would be
>> readable/recoverable.

mbc> I have a method to examine uberblocks on disk. Using this, along with
mbc> my modified mdb and zdb, I have been able to recover a previously
mbc> removed file. I'll post details in a blog if there is interest.

Of course, please do so.

--
Best regards,
Robert Milkowski                   mailto:milek at task.gda.pl
                                   http://milek.blogspot.com
Hi Robert, et.al., I have blogged about a method I used to recover a removed file from a zfs file system at http://mbruning.blogspot.com. Be forewarned, it is very long... All comments are welcome. max Robert Milkowski wrote:> Hello max, > > Sunday, August 17, 2008, 1:02:05 PM, you wrote: > > mbc> A Darren Dunham wrote: > >>> If the most recent uberblock appears valid, but doesn''t have useful >>> data, I don''t think there''s any way currently to see what the tree of an >>> older uberblock looks like. It would be nice to see if that data >>> appears valid and try to create a view that would be >>> readable/recoverable. >>> >>> >>> > mbc> I have a method to examine uberblocks on disk. Using this, along with > mbc> my modified > mbc> mdb and zdb, I have been able to recover a previously removed file. > mbc> I''ll post > mbc> details in a blog if there is interest. > > Of course, pleas do so. > > > >
Victor Latushkin wrote:
> Hi Tom and all,
>
>> root at cs3:~# uname -a
>> SunOS cs3.kw 5.10 Generic_127127-11 sun4v sparc SUNW,Sun-Fire-T200
>
> Btw, have you considered opening support call for this issue?

As a follow-up to the whole story, with the fantastic help of Victor, the failed pool is now imported and functional, thanks to the redundancy in the metadata.

This does however highlight the need for, and practical application of, an fsck-like tool. It is fine to say that if ZFS can't guarantee my data then I should restore from backups so I know what I've got, but in the case of this 42T device that would take days.

Something to think about,

Tom
> As a follow up to the whole story, with the fantastic > help of Victor, > the failed pool is now imported and functional thanks > to the redundancy > in the meta data.It would be really useful if you could publish the steps to recover the pools. This message posted from opensolaris.org
Borys Saulyak wrote:>> As a follow up to the whole story, with the fantastic help of >> Victor, the failed pool is now imported and functional thanks to >> the redundancy in the meta data.> It would be really useful if you could publish the steps to recover > the pools.Here it is: Executive summary: Thanks to COW nature of ZFS it was possible to successfully recover pool state which was only 5 seconds older than last unopenable one. Details: The whole story started with the pool which was not importable. Any attempt to import it with ''zpool import content'' reported I/O error. This situation was preceded by some HW-related issue where host could not communicate with array for some time over SAS bus. It affected array so badly that it was power-cycled to get back to life. I/O error reported by ''zpool import'' is fairly generic, so the first step is to find out more about _when_ it happens, what stage of pool import process detects it. This is where DTrace is very useful. Simple DTrace script tracing entries and exits in/out of ZFS module functions and reporting retuned value provided us with the following output: ... 4 -> spa_last_synced_txg 4 <- spa_last_synced_txg = 156012 4 -> dsl_pool_open 4 -> dsl_pool_open_impl ... 4 <- dsl_pool_open_impl = 6597523216512 4 -> dmu_objset_open_impl 4 -> arc_read ... 4 -> zio_read ... 4 <- zio_read = 3299027537664 4 -> zio_wait ... 4 <- zio_wait = 5 4 <- arc_read = 5 4 <- dmu_objset_open_impl = 5 ... 4 -> dsl_pool_close 4 <- dsl_pool_close = 6597523216512 4 <- dsl_pool_open = 5 4 -> vdev_set_state 4 <- vdev_set_state = 6597441769472 4 -> zfs_ereport_post ... 28 <- zfs_ereport_post = 0 28 <- spa_load Source code reveals that this means that we fail to open Meta ObjSet (MOS) of the pool which is pointed by block pointer stored in uberblock. But MOS has three copies (ditto-blocks) even on unreplicated pools! There must be something bad happened. This is the first moment where it is worth to stop and try to understand what it means. First, we get pointer to MOS from active uberblock, which means that it passed checksum verification, so it was written to disk successfully. It has pointers to three copies of MOS and all three are somehow corrupted. How it could happen? Answer to this question is left as an exercise to the reader. Next we tried to extract offsets and sizes from zios initiated to read MOS from disk (this could be done easier by looking into active uberblock, but anyway). This provided us with the following results: CPU FUNCTION 2 <- zio_read cmd: 0, retries: 0, numerrors: 0, error: 0, checksum: 0, compress: 0, ndvas: 0, txg: 156012, size: 512, offset1: 23622594110, offset2: 40803003282, offset3: 1610705882 2 <- zio_read cmd: 0, retries: 0, numerrors: 0, error: 0, checksum: 0, compress: 0, ndvas: 0, txg: 156012, size: 512, offset1: 23622594110, offset2: 40803003282, offset3: 1610705882 With these offsets we read out three copies of MOS to files to compare them individually (you need to add 4M to the offset to account for ZFS front label if you are doing it with, say, ''dd''). It turned out that all three are completely different. How it could happen? Most likely it happened because corresponding writes never reached disk surface. Being unable to read MOS is bad - since it is starting point into the pool state you cannot discover any other data in the pool? So is all hope lost? No, not really. Since ZFS is COW, it is natural to try previous uberblock copy and see if that points to consistent view of the pool. 
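The copy-and-compare step above is straightforward to reproduce with a few lines of script. Below is a hedged sketch in Python: the device path, and the assumption that the DTrace-reported offsets can be used directly as byte offsets into the vdev's allocatable area (which starts 4 MiB in, past the front labels and boot block, as noted above), should both be checked against your own pool before relying on the output.

#!/usr/bin/env python3
"""Sketch: read each ditto copy of the MOS block and compare digests.
Uses the block device rather than the raw device so reads need not be
sector-aligned; a dd image of the LUN works just as well."""

import hashlib

DEVICE = "/dev/dsk/c2t9d0s0"            # vdev backing the damaged pool (assumed)
LABEL_SKIP = 4 * 1024 * 1024            # front labels + boot block
SIZE = 512                              # physical size reported for the MOS block
OFFSETS = (23622594110, 40803003282, 1610705882)   # from the DTrace output above

with open(DEVICE, "rb") as dev:
    for i, off in enumerate(OFFSETS, 1):
        dev.seek(LABEL_SKIP + off)
        copy = dev.read(SIZE)
        print("copy %d at offset %d: sha256 %s"
              % (i, off, hashlib.sha256(copy).hexdigest()))

If all three digests differ, as they did here for the active uberblock, none of the copies made it to disk intact; if they agree, the uberblock that points at them is a much better candidate.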
So we saved all front and back ZFS labels (which turned out to be the same) and looked for uberblock array to see what is available there: Previous: ub[107].txg = 2616b o1 = 23622593482 o2 = 40803002710 o3 = 1610705850 timestamp = Sun Jul 27 06:04:03 2008 Active: ub[108].txg = 2616c o1 = 23622594110 o2 = 40803003282 o3 = 1610705882 timestamp = Sun Jul 27 06:04:08 2008 This is interesting pieces of data we extracted from active and previous uberblocks. There are couple things to note: 1. 2616c = 156012 (as reported by spa_last_synced_txg() above) 2. all three offsets from the active uberblock are the same as discovered earlier with dtrace script. We tried to read three MOS copies pointed by previous uberblock and found that all three were the same! So it was likely that previous uberblock pointed to (more) consistent pool state, at least it''s MOS might be ok. So the next step was to deactivate active uberblock and activate previous one. How this can be done? Well, it is enough to change checksum of the currently active uberblock to make it inactive and make sure that checksum for the previous uberblock is correct. This way previous uberblock would be selected as active since it would have highest txg id (and timestamp) of all uberblocks with correct checksum. This can be achieved with simple tools like ''dd'', you just need to be sure to apply "corruption" to all four labels. Though a tool to do this would be useful and there''s RFE to address this: CR 6667683 "need a way to select an uberblock from a previous txg" Fortunately pool configuration in this case is simple - it contains only one vdev, so we were able to leverage little quick and dirty utility which dumps uberblock array and allows to activate any uberblock in the array (by changing checksums of all others to be incorrect). The next step was to try import pool. One option was to try to import it on a live system (running with three other pools and providing service) and see the outcome. But we came up with a better idea. zdb was very helpful here. We took zdb (along with libzpool and libzfs) from Nevada (one in Solaris 10 does not yet work with exported pools) and tried to run it on the pool with activated uberblock from txg 0x2616b in userspace to verify pool consistency. So we started it with ''zdb -bbccsv content'' and after couple of days it ended with the following output: Traversing all blocks to verify checksums and verify nothing leaked ... 
No leaks (block sum matches space maps exactly) bp count: 66320566 bp logical: 8629761329152 avg: 130121 bp physical: 8624565999616 avg: 130043 compression: 1.00 bp allocated: 8628035570688 avg: 130095 compression: 1.00 SPA allocated: 8628035570688 used: 18.80% Blocks LSIZE PSIZE ASIZE avg comp %Total Type 3 48.0K 5.50K 16.5K 5.50K 8.73 0.00 L1 deferred free 35 152K 85.5K 174K 4.97K 1.78 0.00 L0 deferred free 38 200K 91.0K 191K 5.01K 2.20 0.00 deferred free 1 512 512 1K 1K 1.00 0.00 object directory 3 1.50K 1.50K 3.00K 1K 1.00 0.00 object array 1 16K 1K 2K 2K 16.00 0.00 packed nvlist - - - - - - - packed nvlist size 1 16K 16K 32K 32K 1.00 0.00 bplist - - - - - - - bplist header - - - - - - - SPA space map header 39 624K 149K 447K 11.5K 4.19 0.00 L1 SPA space map 2.22K 8.89M 5.21M 10.4M 4.69K 1.71 0.00 L0 SPA space map 2.26K 9.5M 5.35M 10.9M 4.80K 1.77 0.00 SPA space map - - - - - - - ZIL intent log 1 16K 1K 3.00K 3.00K 16.00 0.00 L6 DMU dnode 1 16K 1K 3.00K 3.00K 16.00 0.00 L5 DMU dnode 1 16K 1K 3.00K 3.00K 16.00 0.00 L4 DMU dnode 1 16K 1K 3.00K 3.00K 16.00 0.00 L3 DMU dnode 1 16K 1.50K 4.50K 4.50K 10.67 0.00 L2 DMU dnode 12 192K 110K 329K 27.4K 1.75 0.00 L1 DMU dnode 1.47K 23.5M 4.85M 9.7M 6.62K 4.84 0.00 L0 DMU dnode 1.49K 23.8M 4.97M 10.1M 6.77K 4.78 0.00 DMU dnode 2 2K 1K 3.00K 1.50K 2.00 0.00 DMU objset - - - - - - - DSL directory 2 1K 1K 2K 1K 1.00 0.00 DSL directory child map 1 512 512 1K 1K 1.00 0.00 DSL dataset snap map 2 1K 1K 2K 1K 1.00 0.00 DSL props - - - - - - - DSL dataset - - - - - - - ZFS znode - - - - - - - ZFS V0 ACL 1.07K 17.1M 1.07M 2.14M 2.01K 15.94 0.00 L3 ZFS plain file 8.13K 130M 31.8M 63.6M 7.83K 4.09 0.00 L2 ZFS plain file 505K 7.89G 3.19G 6.38G 12.9K 2.48 0.08 L1 ZFS plain file 62.7M 7.84T 7.84T 7.84T 128K 1.00 99.92 L0 ZFS plain file 63.2M 7.85T 7.84T 7.85T 127K 1.00 100.00 ZFS plain file 991 1.77M 596K 1.16M 1.20K 3.04 0.00 ZFS directory 1 512 512 1K 1K 1.00 0.00 ZFS master node 1 512 512 1K 1K 1.00 0.00 ZFS delete queue - - - - - - - zvol object - - - - - - - zvol prop - - - - - - - other uint8[] - - - - - - - other uint64[] - - - - - - - other ZAP - - - - - - - persistent error log 1 128K 4.50K 9.00K 9.00K 28.44 0.00 SPA history - - - - - - - SPA history offsets - - - - - - - Pool properties - - - - - - - DSL permissions - - - - - - - ZFS ACL - - - - - - - ZFS SYSACL - - - - - - - FUID table - - - - - - - FUID table size - - - - - - - DSL dataset next clones - - - - - - - scrub work queue 1 16K 1K 3.00K 3.00K 16.00 0.00 L6 Total 1 16K 1K 3.00K 3.00K 16.00 0.00 L5 Total 1 16K 1K 3.00K 3.00K 16.00 0.00 L4 Total 1.07K 17.1M 1.07M 2.15M 2.01K 15.94 0.00 L3 Total 8.13K 130M 31.8M 63.6M 7.83K 4.09 0.00 L2 Total 505K 7.89G 3.19G 6.38G 12.9K 2.48 0.08 L1 Total 62.7M 7.84T 7.84T 7.84T 128K 1.00 99.92 L0 Total 63.2M 7.85T 7.84T 7.85T 127K 1.00 100.00 Total capacity operations bandwidth ---- errors ---- description used avail read write read write read write cksum content 7.85T 33.9T 420 0 51.8M 0 0 0 0 /dev/dsk/c2t9d0s0 7.85T 33.9T 420 0 51.8M 0 0 0 0 bash-3.00# This confirmed that previous pool state is completely consistent, so it should be safe to import pool in this state. Import worked just fine and additional scrub did not find any errors. Hope this helps, Victor
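For anyone who wants to reproduce the "dump the uberblock array" step without Victor's private utility, a read-only sketch follows. The layout it assumes (256 KiB vdev labels, the uberblock array occupying the second half of the first label, 1 KiB slots, and the magic/version/txg/guid_sum/timestamp field order) follows the published ZFS on-disk format documentation of this era, but it is an illustration rather than a supported tool; cross-check its output against zdb before acting on it. It never writes to the device.

#!/usr/bin/env python3
"""Read-only sketch: walk the uberblock array in the first vdev label and
print the txg and timestamp of every slot whose magic matches."""

import struct
import sys
import time

UB_MAGIC = 0x00bab10c         # uberblock magic number
LABEL_SIZE = 256 * 1024       # each of the four vdev labels is 256 KiB
UB_ARRAY_OFFSET = 128 * 1024  # uberblock array starts halfway into the label
UB_SLOT = 1024                # one uberblock per 1 KiB slot

def dump_uberblocks(dev):
    with open(dev, "rb") as f:
        f.seek(UB_ARRAY_OFFSET)                  # first label starts at offset 0
        array = f.read(LABEL_SIZE - UB_ARRAY_OFFSET)
    for slot in range(len(array) // UB_SLOT):
        raw = array[slot * UB_SLOT:(slot + 1) * UB_SLOT]
        for endian in ("<", ">"):                # magic is in the writer's byte order
            magic, version, txg, guid_sum, ts = struct.unpack(endian + "5Q", raw[:40])
            if magic == UB_MAGIC:
                print("ub[%3d] txg %#x  timestamp %s"
                      % (slot, txg, time.ctime(ts)))
                break

if __name__ == "__main__":
    dump_uberblocks(sys.argv[1] if len(sys.argv) > 1 else "/dev/rdsk/c2t9d0s0")

Output in the same form as the ub[107]/ub[108] lines above makes it easy to spot which txg is currently active and which older ones are still available to fall back to.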
Victor, thanks for posting that. It really is interesting to see exactly what happened, and to read about how zfs pools can be recovered. Your work on these forums has done much to reassure me that ZFS is stable enough for us to be using on a live server, and I look forward to seeing automated tools appear to do some of the recoveries you're currently having to work so hard on.
--
This message posted from opensolaris.org