Hi,

I recently discovered some - or at least one - corrupted file on one of my ZFS datasets, which caused an I/O error when trying to send a ZFS snapshot to another host:

zpool status -v obelixData
  pool: obelixData
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        obelixData                 ONLINE       4     0     0
          c4t210000D023038FA8d0    ONLINE       0     0     0
          c4t210000D02305FF42d0    ONLINE       4     0     0

errors: Permanent errors have been detected in the following files:

        <0x949>:<0x12b9b9>
        obelixData/JvMpreprint@2010-10-02_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
        obelixData/JvMpreprint@BackupSnapshot_2010-10-05-08:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
        obelixData/JvMpreprint@2010-09-24_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor 6_210/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
        /obelixData/JvMpreprint/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps

Now, a scrub would reveal corrupted blocks on the devices, but is there a way to identify damaged files as well?

Thanks,
budy
On 06 October, 2010 - Stephan Budach sent me these 2,1K bytes:

> Hi,
>
> I recently discovered some - or at least one - corrupted file on one of my ZFS datasets, which caused an I/O error when trying to send a ZFS snapshot to another host:
>
> zpool status -v obelixData
>   pool: obelixData
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         obelixData                 ONLINE       4     0     0
>           c4t210000D023038FA8d0    ONLINE       0     0     0
>           c4t210000D02305FF42d0    ONLINE       4     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         <0x949>:<0x12b9b9>
>         obelixData/JvMpreprint@2010-10-02_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
>         obelixData/JvMpreprint@BackupSnapshot_2010-10-05-08:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
>         obelixData/JvMpreprint@2010-09-24_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor 6_210/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
>         /obelixData/JvMpreprint/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
>
> Now, a scrub would reveal corrupted blocks on the devices, but is there a way to identify damaged files as well?

Is this a trick question or something? The filenames are right over your question..?

/Tomas
--
Tomas Ögren, stric@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
No - not a trick question, but maybe I didn't make myself clear.
Is there a way to discover such bad files other than trying to actually read from them one by one, say using cp or by sending a snapshot elsewhere?

I am well aware that the file shown in zpool status -v is damaged and I have already restored it, but I wanted to know if there are more of them.

Regards,
budy
Scrub?

On Oct 6, 2010, at 6:48 AM, Stephan Budach wrote:

> No - not a trick question, but maybe I didn't make myself clear.
> Is there a way to discover such bad files other than trying to actually read from them one by one, say using cp or by sending a snapshot elsewhere?
>
> I am well aware that the file shown in zpool status -v is damaged and I have already restored it, but I wanted to know if there are more of them.
>
> Regards,
> budy

Scott Meilicke
Budy,

> No - not a trick question, but maybe I didn't make myself clear.
> Is there a way to discover such bad files other than trying to actually read from them one by one, say using cp or by sending a snapshot elsewhere?

As noted in your original email, ZFS reports any corruption via the "zpool status" command. ZFS detects corruption as part of its normal filesystem operations, which may be triggered by cp, send/recv, etc., or by a forced reading of the entire filesystem by scrub.

> I am well aware that the file shown in zpool status -v is damaged and I have already restored it, but I wanted to know if there are more of them.

Assuming that the ZFS filesystem in question is not degrading further (as in a disk going bad), upon completion of a successful scrub, zpool status reports the complete state of the filesystem being reported on.

- Jim
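For reference, that sequence looks roughly like the following on Solaris/OpenSolaris; the pool name is taken from the thread, and this is only a sketch, not output from the system in question:

    # read and verify every allocated block in the pool
    zpool scrub obelixData

    # check progress, and list any files affected by permanent errors
    zpool status -v obelixData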
Well, I think that answers my question then: after a successful scrub, zpool status -v should list all damaged files on the entire zpool.

I only asked because I read a thread in this forum where one guy had a problem with different files, even after a successful scrub.

Thanks,
budy
Budy,

Your previous zpool status output shows a non-redundant pool with data corruption. You should use the fmdump -eV command to find out the underlying cause of this corruption.

You can review the hardware-level monitoring tools here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

Thanks,

Cindy

On 10/06/10 13:09, Stephan Budach wrote:
> Well, I think that answers my question then: after a successful scrub, zpool status -v should list all damaged files on the entire zpool.
>
> I only asked because I read a thread in this forum where one guy had a problem with different files, even after a successful scrub.
>
> Thanks,
> budy
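A typical way to inspect those fault-management logs (standard Solaris FMA commands, shown as a general sketch rather than output from this particular system):

    # one-line summary of logged error telemetry events
    fmdump -e

    # full detail (nvlists) for each error event
    fmdump -eV | less

    # diagnosed faults, as opposed to raw error telemetry
    fmdump -v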
On 10/ 6/10 09:52 PM, Stephan Budach wrote:
> Hi,
>
> I recently discovered some - or at least one - corrupted file on one of my ZFS datasets, which caused an I/O error when trying to send a ZFS snapshot to another host:
>
> zpool status -v obelixData
>   pool: obelixData
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         obelixData                 ONLINE       4     0     0
>           c4t210000D023038FA8d0    ONLINE       0     0     0
>           c4t210000D02305FF42d0    ONLINE       4     0     0
>

Are you aware that this is a very dangerous configuration? Your pool lacks redundancy and you will lose it if one of the devices fails.

--
Ian.
Hi Cindy,

thanks for bringing that to my attention. I checked fmdump and found a lot of these entries:

Okt 06 2010 17:52:12.862812483 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x514dc67d57e00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,340e@7/pci1077,138@0,1/fp@0,0/disk@w210000d02305ff42,0
        (end detector)
        driver-assessment = retry
        op-code = 0x88
        cdb = 0x88 0x0 0x0 0x0 0x0 0x2 0xac 0xd4 0x3d 0x80 0x0 0x0 0x0 0x80 0x0 0x0
        pkt-reason = 0x3
        pkt-state = 0x0
        pkt-stats = 0x20
        __ttl = 0x1
        __tod = 0x4cac9b2c 0x336d7943

Okt 06 2010 17:52:12.862813713 ereport.io.scsi.cmd.disk.recovered
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.recovered
        ena = 0x514dc67d57e00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,340e@7/pci1077,138@0,1/fp@0,0/disk@w210000d02305ff42,0
                devid = id1,sd@n600d02310005ff4200000000712ab96c
        (end detector)
        driver-assessment = recovered
        op-code = 0x88
        cdb = 0x88 0x0 0x0 0x0 0x0 0x2 0xac 0xd4 0x3d 0x80 0x0 0x0 0x0 0x80 0x0 0x0
        pkt-reason = 0x0
        pkt-state = 0x1f
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4cac9b2c 0x336d7e11

Googling these errors brought me directly to this document:

http://dsc.sun.com/solaris/articles/scsi_disk_fma2.html

which talks about these SCSI errors. Since we're talking FC here, it seems to point to some FC issue I have not been aware of. Furthermore, it's always the same FC device that shows these errors, so I will check that device and its connections to the fabric first.

Thanks,
budy
Ian,

yes, although these vdevs are FC raids themselves, so the risk is… uhm… calculated. Unfortunately, one of the devices seems to have some issues, as stated in my previous post.

I will, nevertheless, add redundancy to my pool asap.

Thanks,
budy
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> Ian,
>
> yes, although these vdevs are FC raids themselves, so the risk is… uhm…
> calculated.

Whenever possible, you should always JBOD the storage and let ZFS manage the raid, for several reasons (see below). Also, as counter-intuitive as this sounds (see below), you should disable the hardware write-back cache (even with BBU) because it hurts performance in any of these situations:

(a) Disable WB if you have access to SSD or other nonvolatile dedicated log device.
(b) Disable WB if you know all of your writes to be async mode and not sync mode.
(c) Disable WB if you've opted to disable ZIL.

* Hardware raid blindly assumes the redundant data written to disk is written correctly. So later, if you experience a checksum error (such as you have), then it's impossible for ZFS to correct it. The hardware raid doesn't know a checksum error has occurred, and there is no way for the OS to read the "other side of the mirror" to attempt correcting the checksum via redundant data.

* ZFS has knowledge of both the filesystem and the block level devices, while hardware raid has only knowledge of block level devices. Which means ZFS is able to optimize performance in ways that hardware cannot possibly do. For example, whenever there are many small writes taking place concurrently, ZFS is able to remap the physical disk blocks of those writes, to aggregate them into a single sequential write. Depending on your metric, this yields 1-2 orders of magnitude higher IOPS.

* Because ZFS automatically buffers writes in ram in order to aggregate as previously mentioned, the hardware WB cache is not beneficial. There is one exception. If you are doing sync writes to spindle disks, and you don't have a dedicated log device, then the WB cache will benefit you, approx half as much as you would benefit by adding a dedicated log device. The sync write sort-of by-passes the ram buffer, and that's the reason why the WB is able to do some good in the case of sync writes. Ironically, if you have WB enabled, and you have a SSD log device, then the WB hurts you. You get the best performance with SSD log, and no WB. Because the WB "lies" to the OS, saying some tiny chunk of data has been written... then the OS will happily write another tiny chunk, and another, and another. The WB is only buffering a lot of tiny random writes, and in aggregate, it will only go as fast as the random writes. It undermines ZFS's ability to aggregate small writes into sequential writes.
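The dedicated log device mentioned above is added with a single command; the pool and device names here are placeholders, and an slog only helps synchronous write workloads:

    # attach an SSD as a dedicated ZFS intent log (slog) device
    zpool add tank log c6t0d0

    # it then shows up under a separate "logs" section
    zpool status tank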
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> Now, scrub would reveal corrupted blocks on the devices, but is there a
> way to identify damaged files as well?

I saw a lot of people offering the same knee-jerk reaction that I had: "Scrub." And that is the only correct answer, to make a best effort at salvaging data. But I think there is a valid question here which was neglected.

*Does* scrub produce a list of all the names of all the corrupted files? And if so, how does it do that?

If scrub is operating at a block level (and I think it is), then how can checksum failures be mapped to file names? For example, this is a long-requested feature of "zfs send" which is fundamentally difficult or impossible to implement. Zfs send operates at a block level, and there is a desire to produce a list of all the incrementally changed files in a zfs incremental send, but no capability of doing that. It seems, if scrub is able to list the names of files that correspond to corrupted blocks, then zfs send should be able to list the names of files that correspond to changed blocks, right?

I am reaching the opposite conclusion of what's already been said. I think you should scrub, but don't expect file names as a result. I think if you want file names, then tar > /dev/null will be your best friend.

I didn't answer anything at first, cuz I was hoping somebody would have that answer. I only know that I don't know, and the above is my best guess.
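The "tar to /dev/null" idea above works because ZFS verifies checksums on every read, so forcing a read of every file makes any bad block surface as an I/O error and the affected file name then appears in zpool status -v. A sketch, reusing the dataset path from earlier in the thread:

    # read every file once, discarding the data
    tar cf /dev/null /obelixData/JvMpreprint

    # files whose blocks failed their checksums are now listed
    zpool status -v obelixData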
Hi Edward,

these are interesting points. I have considered a couple of them when I started playing around with ZFS.

I am not sure whether I disagree with all of your points, but I conducted a couple of tests where I configured my raids as JBODs and mapped each drive out as a separate LUN, and I couldn't notice a difference in performance in any way.

I'd love to discuss this in a separate thread, but first I will have to check the archives and Google. ;)

Thanks,
budy
On Wed, Oct 6 at 22:04, Edward Ned Harvey wrote:
> * Because ZFS automatically buffers writes in ram in order to
>   aggregate as previously mentioned, the hardware WB cache is not
>   beneficial. There is one exception. If you are doing sync writes
>   to spindle disks, and you don't have a dedicated log device, then
>   the WB cache will benefit you, approx half as much as you would
>   benefit by adding a dedicated log device. The sync write sort-of
>   by-passes the ram buffer, and that's the reason why the WB is able
>   to do some good in the case of sync writes.

All of your comments made sense except for this one.

Every N seconds, when the system decides to burst writes to media from RAM, those writes are only sequential in the case where the underlying storage devices are significantly empty. Once you're in a situation where your allocations are scattered across the disk due to longer-term fragmentation, I don't see any way that a write cache would hurt performance on the devices, since it'd allow the drive to reorder writes to the media within that burst of data. Even though ZFS is issuing writes of ~256 sectors if it can, that is only a fraction of a revolution on a modern drive, so random writes of 128KB still have significant opportunity for reordering optimization.

Granted, with NCQ or TCQ you can get back much of the cache-disabled performance loss; however, in any system that implements an internal queue depth greater than the protocol-allowed queue depth, there is opportunity for improvement, to an asymptotic limit driven by servo settle speed.

Obviously this performance improvement comes with the standard WB risks, and YMMV, IANAL, etc.

--eric

--
Eric D. Mudama
edmudama@mail.bounceswoosh.org
Hi Edward,

well, that was exactly my point when I raised this question. If zfs send is able to identify corrupted files while it transfers a snapshot, why shouldn't scrub be able to do the same? zfs send quit with an I/O error and zpool status -v showed me the file that indeed had problems.

Since I thought that zfs send also operates on the block level, I wondered whether or not scrub would basically do the same thing. On the other hand, scrub really doesn't care about what to read from the device - it simply reads all blocks, which is not the case when running zfs send.

Maybe zfs send could just go on and not halt on an I/O error, and instead just print out the errors?

Cheers,
budy
On 10/ 7/10 06:22 PM, Stephan Budach wrote:
> Hi Edward,
>
> these are interesting points. I have considered a couple of them when I started playing around with ZFS.
>
> I am not sure whether I disagree with all of your points, but I conducted a couple of tests where I configured my raids as JBODs and mapped each drive out as a separate LUN, and I couldn't notice a difference in performance in any way.
>

The time you will notice is when a cable falls out or becomes loose and you get corrupted data and lose the pool due to lack of redundancy. Even though your LUNs are RAID, there are still numerous single points of failure between them and the target system.

--
Ian.
Ian,

I know - and I will address this by upgrading the vdevs to mirrors, but there are a lot of other SPOFs around. So I started out by reducing the most common failures, and I have found those to be the disc drives, not the chassis.

The beauty is: one can work their way up until the desired level of security is reached, or until there is no more money to spend.

Cheers,
budy
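Upgrading the existing single-device vdevs to mirrors, as described above, can be done in place by attaching a second device to each one. A sketch only: the first device names are taken from the zpool status output earlier in the thread, and NEW_LUN_1/NEW_LUN_2 are placeholders for whatever new LUNs become available.

    # each attach turns a single-device vdev into a two-way mirror
    zpool attach obelixData c4t210000D023038FA8d0 NEW_LUN_1
    zpool attach obelixData c4t210000D02305FF42d0 NEW_LUN_2

    # resilvering starts automatically; watch it with
    zpool status obelixData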
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> I conducted a couple of tests where I configured my raids as JBODs and
> mapped each drive out as a separate LUN, and I couldn't notice a
> difference in performance in any way.

Not sure if my original points were communicated clearly. Giving JBODs to ZFS is not for the sake of performance. The reason for JBOD is reliability. Because hardware raid cannot detect or correct checksum errors. ZFS can. So it's better to skip the hardware raid and use JBOD, to enable ZFS access to each separate side of the redundant data.
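As a sketch of what that could look like in practice (device names are placeholders for LUNs exported one-per-disk from two enclosures): pairing one LUN from each enclosure in every mirror lets ZFS repair checksum errors from its own redundancy and also survive the loss of a whole enclosure.

    # ZFS-managed mirrors built from JBOD LUNs across two enclosures
    zpool create tank \
      mirror c5t0d0 c6t0d0 \
      mirror c5t1d0 c6t1d0 \
      mirror c5t2d0 c6t2d0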
> From: edmudama@mail.bounceswoosh.org [mailto:edmudama@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama
>
> On Wed, Oct 6 at 22:04, Edward Ned Harvey wrote:
> > * Because ZFS automatically buffers writes in ram in order to
> >   aggregate as previously mentioned, the hardware WB cache is not
> >   beneficial. There is one exception. If you are doing sync writes
> >   to spindle disks, and you don't have a dedicated log device, then
> >   the WB cache will benefit you, approx half as much as you would
> >   benefit by adding a dedicated log device. The sync write sort-of
> >   by-passes the ram buffer, and that's the reason why the WB is able
> >   to do some good in the case of sync writes.
>
> All of your comments made sense except for this one.
>
> (etc)

Your point about long-term fragmentation and significant drive emptiness is well received. I never let a pool get over 90% full, for several reasons including this one. My target is 70%, which seems to be sufficiently empty. Also, as you indicated, blocks of 128K are not sufficiently large for reordering to benefit. There's another thread here where I calculated that you need blocks approx 40MB in size in order to reduce random seek time below 1% of total operation time. So all that I said will only be relevant or accurate if, within 30 sec (or 5 sec in the future), there exists at least 40MB of aggregatable sequential writes.

It's really easy to measure and quantify what I was saying. Just create a pool, and benchmark it in each configuration. Results that I measured were (stripe of 2 mirrors):

     721 IOPS without WB or slog
    2114 IOPS with WB
    2722 IOPS with WB and slog
    2927 IOPS with slog, and no WB

There's a whole spreadsheet full of results that I can't publish, but the trend of WB versus slog was clear and consistent. I will admit the above were performed on relatively new, relatively empty pools. It would be interesting to see if any of that changes if the test is run on a system that has been in production for a long time, with real user data in it.
I would not discount the performance issue... Depending on your workload, you might find that performance increases with ZFS on your hardware RAID in JBOD mode.

Cindy

On 10/07/10 06:26, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>>
>> I conducted a couple of tests where I configured my raids as JBODs and
>> mapped each drive out as a separate LUN, and I couldn't notice a
>> difference in performance in any way.
>
> Not sure if my original points were communicated clearly. Giving JBODs to
> ZFS is not for the sake of performance. The reason for JBOD is reliability.
> Because hardware raid cannot detect or correct checksum errors. ZFS can.
> So it's better to skip the hardware raid and use JBOD, to enable ZFS access
> to each separate side of the redundant data.
On 7-Oct-10, at 1:22 AM, Stephan Budach wrote:

> Hi Edward,
>
> these are interesting points. I have considered a couple of them when I started playing around with ZFS.
>
> I am not sure whether I disagree with all of your points, but I conducted a couple of tests where I configured my raids as JBODs and mapped each drive out as a separate LUN, and I couldn't notice a difference in performance in any way.
>

The integrity issue is, however, clear cut. ZFS must manage the redundancy. ZFS just alerted you that your 'FC RAID' doesn't actually provide data integrity, and you just lost the 'calculated' bet. :)

--Toby

> I'd love to discuss this in a separate thread, but first I will have to check the archives and Google. ;)
>
> Thanks,
> budy
> From: Cindy Swearingen [mailto:cindy.swearingen@oracle.com]
>
> I would not discount the performance issue...
>
> Depending on your workload, you might find that performance increases
> with ZFS on your hardware RAID in JBOD mode.

Depends on the raid card you're comparing to. I've certainly seen some raid cards that were too dumb to read from 2 disks in a mirror simultaneously for the sake of read performance enhancement. And many other similar situations. But I would not say that's generally true anymore. In the last several years, all the hardware raid cards that I've bothered to test were able to utilize all the hardware available. Just like ZFS.

There are performance differences... like ... the hardware raid might be able to read 15% faster in raid5, while ZFS is able to write 15% faster in raidz, and so forth. Differences that roughly balance each other out.

For example, here's one data point I can share (2 mirrors striped, results normalized):

          8 initial writers   8 rewriters         8 readers
    ZFS   1.43                2.99                5.05
    HW    2.00                2.54                2.96

          8 re-readers        8 reverse readers   8 stride readers
    ZFS   4.19                3.59                3.93
    HW    3.02                2.80                2.90

          8 random readers    8 random mix        8 random writers
    ZFS   2.57                2.40                1.69
    HW    1.99                1.70                1.73

          average
    ZFS   3.09
    HW    2.40

There were some categories where ZFS was faster. Some where HW was faster. On average, ZFS was faster, but they were all in the same ballpark, and the results were highly dependent on specific details and tunables. AKA, not a place you should explore, unless you have a highly specialized use case that you wish to optimize.
So, I decided to give tar a whirl, after zfs send encountered the next corrupted file, resulting in an I/O error, even though scrub ran successfully without any errors.

I then issued a

/usr/gnu/bin/tar -cf /dev/null /obelixData/?/.zfs/snapshot/<actual snapshot>/DTP

which finished without any issue, and I have now issued a zfs send of this snapshot to my remote host. Let's see what happens in approx. 9 hrs.

budy
So - after 10 hrs and 21 mins, the incremental zfs send/recv finished without a problem. ;)

Seems that using tar for checking all files is an appropriate action.

Cheers,
budy
On Fri, October 8, 2010 04:47, Stephan Budach wrote:

> So, I decided to give tar a whirl, after zfs send encountered the next
> corrupted file, resulting in an I/O error, even though scrub ran
> successfully w/o any errors.

I must say that this concept of scrub running w/o error when corrupted files, detectable to zfs send, apparently exist, is very disturbing. Background scrubbing, and the block checksums that make it more meaningful than just reading the disk blocks, was the key thing that drew me into ZFS, and this seems to suggest that it doesn't work.

Does your sequence of tests happen to provide evidence that the problem isn't new errors appearing, sometimes after a scrub and before the send? For example, have you done 1) scrub finds no error, 2) send finds error, 3) scrub finds no error? (With nothing in between that could have cleared or fixed the error.)

--
David Dyer-Bennet, dd-b@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Oct 6, 2010, at 1:26 PM, Stephan Budach wrote:

> Hi Cindy,
>
> thanks for bringing that to my attention. I checked fmdump and found a lot of these entries:
>
> Okt 06 2010 17:52:12.862812483 ereport.io.scsi.cmd.disk.tran
...
> Okt 06 2010 17:52:12.862813713 ereport.io.scsi.cmd.disk.recovered
...
> Googling these errors brought me directly to this document:
>
> http://dsc.sun.com/solaris/articles/scsi_disk_fma2.html
>
> which talks about these SCSI errors. Since we're talking FC here, it seems to point to some FC issue I have not been aware of. Furthermore, it's always the same FC device that shows these errors, so I will check that device and its connections to the fabric first.

SCSI transport errors occur between the HBA and the target. These are reported up the stack to Solaris. As you can see, a retry was successful. However, these will have negative impacts on performance, so it is best to solve the problem.
 -- richard
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of David Dyer-Bennet
>
> I must say that this concept of scrub running w/o error when corrupted
> files, detectable to zfs send, apparently exist, is very disturbing.

As previously mentioned, the OP is using a hardware raid system. It is impossible for ZFS to read "both sides of the mirror," which means it's pure chance. The hardware raid may fetch data from a bad disk one time, and fetch good data from another disk the next time. Or vice-versa.

You should always configure JBOD and allow ZFS to manage the raid. Don't do it in hardware, as the OP of this thread is soundly demonstrating the reasons why.
I think one has to accept that zfs send apparently is able to detect such errors while scrub is not. Scrub operates only on the block level and makes sure that each block can be read and is in line with its checksum. However, zfs send seems to have detected some errors in the file system structure itself, resulting in a couple of files being unreadable. What caused these errors, I have no idea, but deleting the affected files and replacing them did the job.

I think that my understanding of zfs send/recv only operating on the block level, bypassing the higher-level fs stuff, has been too simple.

Now to answer your question: I did 1), 2) and 3), but between 2) and 3) I verified using tar that all files were accessible. Also, I haven't had any problems since.

Cheers,
budy
You are implying that the issues resulted from the H/W raid(s), and I don't think that this is appropriate.

I configured a striped pool using two raids - this is exactly the same as using two single hard drives without mirroring them. I simply cannot see what ZFS would be able to do about a block corruption in that case. You are not stating that a single hard drive is more reliable than a HW raid box, are you? Actually, my pool has no mirror capabilities at all, unless I am seriously mistaken.

What scrub has found out is that none of the blocks had any issue, but the filesystem was not "clean" either. So if scrub does its job right and doesn't report any errors, the error must have occurred somewhere else up the stack, way before the checksum had been calculated. No?
On Tue, Oct 12, 2010 at 9:39 AM, Stephan Budach <stephan.budach@jvm.de> wrote:

> You are implying that the issues resulted from the H/W raid(s), and I don't think that this is appropriate.

Not exactly. Because the raid is managed in hardware and not by ZFS, ZFS cannot fix these errors when it encounters them.

> I configured a striped pool using two raids - this is exactly the same as using two single hard drives without mirroring them. I simply cannot see what ZFS would be able to do about a block corruption in that case.

It cannot, exactly.

> You are not stating that a single hard drive is more reliable than a HW raid box, are you? Actually, my pool has no mirror capabilities at all, unless I am seriously mistaken.

No, but ZFS-managed raid is more reliable than hardware raid.

> What scrub has found out is that none of the blocks had any issue, but the filesystem was not "clean" either. So if scrub does its job right and doesn't report any errors, the error must have occurred somewhere else up the stack, way before the checksum had been calculated.

If the case is, as speculated, that one mirror has bad data and one has good, scrub or any I/O has a 50% chance of seeing the corruption. Scrub does verify checksums.

Tuomas
> If the case is, as speculated, that one mirror has bad data and one has good, scrub or any I/O has a 50% chance of seeing the corruption. Scrub does verify checksums.

Yes, if the vdev were a mirrored one, which it isn't. There weren't any mirrors set up. Plus, if the checksums had been bad, scrub would have detected that. It would not have been able to resolve it, but that wasn't the case.

zpool status backupPool_01
  pool: backupPool_01
 state: ONLINE
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        backupPool_01            ONLINE       0     0     0
          c3t2100001378AC0253d0  ONLINE       0     0     0
          c3t2100001378AC026Ed0  ONLINE       0     0     0

errors: No known data errors

If one of the two devices goes bad, boom - that'd be it for the entire pool, but as long as the two devices work, it's okay.
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> You are implying that the issues resulted from the H/W raid(s), and I
> don't think that this is appropriate.

Please quote originals when you reply. If you don't, then it's easy to follow the thread on the web forum, but not in email. So if you don't quote, you'll be losing a lot of the people following the thread.

I think it's entirely appropriate to imply that your problem this time stems from hardware. I'll say it outright: you have a hardware problem. Because if there is a repeatable checksum failure (bad disk), then if anything can find it, scrub can. And scrub is the best way to find it. If you have a nonrepeatable checksum failure (such as you have), then there is only one possibility: you are experiencing a hardware problem.

One possibility is that there's a failing disk in your hardware raid set, and your hardware raid controller is unable to detect it, because hardware raid doesn't do checksumming. Sometimes ZFS reads the device and gets an error. Sometimes the hardware raid controller reads the other side of the mirror, and there is no error.

This is not the only possibility. There could be some other piece of hardware yielding your intermittent checksum errors. But there's one absolute conclusion: your intermittent checksum errors are caused by hardware.

If scrub didn't find an error, then there was no error at the time of scrub. If scrub didn't find an error, and then something else *did* find an error, it means one of two things: (a) maybe the error only occurred after the scrub, or (b) the hardware raid controller or some other piece of hardware didn't produce corrupted data during the scrub, but will produce corrupted data at some other time.
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> c3t2100001378AC0253d0  ONLINE       0     0     0

How many disks are there inside of c3t2100001378AC0253d0?

How are they configured? Hardware raid 5? A mirror of two hardware raid 5's? The point is: this device, as seen by ZFS, is not a pure storage device. It is a high-level device representing some LUN or something, which is configured & controlled by hardware raid.

If there's zero redundancy in that device, then scrub would probably find the checksum errors consistently and repeatably.

If there's some redundancy in that device, then all bets are off. Sometimes scrub might read the "good half" of the data, and other times, the bad half.

But then again, the error might not be in the physical disks themselves. The error might be somewhere in the raid controller(s) or the interconnect. Or even some weird unsupported driver or something.
On 12.10.10 14:21, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>>
>> c3t2100001378AC0253d0  ONLINE       0     0     0
>
> How many disks are there inside of c3t2100001378AC0253d0?
>
> How are they configured? Hardware raid 5? A mirror of two hardware raid
> 5's? The point is: this device, as seen by ZFS, is not a pure storage
> device. It is a high-level device representing some LUN or something, which
> is configured & controlled by hardware raid.
>
> If there's zero redundancy in that device, then scrub would probably find
> the checksum errors consistently and repeatably.
>
> If there's some redundancy in that device, then all bets are off. Sometimes
> scrub might read the "good half" of the data, and other times, the bad half.
>
> But then again, the error might not be in the physical disks themselves.
> The error might be somewhere in the raid controller(s) or the interconnect.
> Or even some weird unsupported driver or something.

Both raid boxes run raid6 with 16 drives each. This is the reason I was running a non-mirrored pool in the first place. I fully understand that ZFS's power comes into play when you're running with multiple independent drives, but that was what I had at hand.

I now also got what you meant by "good half", but I don't dare to say whether or not this is also the case in a raid6 setup.

Regards

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114

E-Mail: stephan.budach@jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
On Oct 12, 2010, at 8:21 AM, "Edward Ned Harvey" <shill@nedharvey.com> wrote:

>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>>
>> c3t2100001378AC0253d0  ONLINE       0     0     0
>
> How many disks are there inside of c3t2100001378AC0253d0?
>
> How are they configured? Hardware raid 5? A mirror of two hardware raid
> 5's? The point is: this device, as seen by ZFS, is not a pure storage
> device. It is a high-level device representing some LUN or something, which
> is configured & controlled by hardware raid.
>
> If there's zero redundancy in that device, then scrub would probably find
> the checksum errors consistently and repeatably.
>
> If there's some redundancy in that device, then all bets are off. Sometimes
> scrub might read the "good half" of the data, and other times, the bad half.
>
> But then again, the error might not be in the physical disks themselves.
> The error might be somewhere in the raid controller(s) or the interconnect.
> Or even some weird unsupported driver or something.

If it were a parity-based raid set, then the error would most likely be reproducible, if not detected by the raid controller. The biggest problem is with hardware mirrors, where the hardware can't detect an error on one side vs the other.

For mirrors it's always best to use ZFS's built-in mirrors; otherwise, if I were to use HW RAID I would use RAID5/6/50/60, since errors encountered can be reproduced. Two parity raids mirrored in ZFS would probably provide the best of both worlds, for a steep cost though.

-Ross
> From: Stephan Budach [mailto:stephan.budach@jvm.de]
>
> I now also got what you meant by "good half", but I don't dare to say
> whether or not this is also the case in a raid6 setup.

The same concept applies to raid5 or raid6. When you read the device, you never know if you're actually reading the "data" or the "parity", and in fact they're mixed together in order to fully utilize all the hardware available. (Assuming you have some decently smart hardware.)

But all of that is mostly irrelevant. One fact remains: you have checksum errors. There is only one cause for checksum errors: hardware failure. It may be the physical disks themselves, or the raid card, or ram, or cpu, or any of the interconnect in between. I suppose it could be a driver problem, but that's less likely.
Budy,

if you are using raid-5 or raid-6 underneath ZFS, then you should know that raid-5/6 might corrupt data. See here for lots of technical articles on why raid-5 is bad:
http://www.baarf.com/
raid-6 is not better. I can show you links about raid-6 being unsafe as well.

It is a good thing you run ZFS, because ZFS can detect those errors, whereas raid-5/6 can not. There is a lot of research from computer scientists that shows this. You want to see some research papers on data corruption and hardware raid?

On the other hand, ZFS is safe. There are research papers showing that ZFS detects and corrects all errors. You want to see them?

The bottom line is: ZFS should manage the discs directly. Do not let hardware raid (which can not detect all errors) run the discs. ZFS can detect and repair those errors. That is the reason to use ZFS: data safety. Not performance (that is secondary).

You do have problems with your discs; only ZFS detects those errors. Your hardware raid did not detect those errors. ZFS can not repair the errors unless ZFS runs the discs.
On Oct 13, 2010, at 12:59 PM, Orvar Korvar wrote:

> On the other hand, ZFS is safe. There are research papers showing that ZFS detects and corrects all errors. You want to see them?

I would. URLs please?
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS Tutorial at USENIX LISA '10 Conference
November 8, 2010 San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com
I'd like to see those docs as well.

All HW raids are driven by software, of course - and software can be buggy. I don't want to heat up the discussion about ZFS-managed discs vs. HW raids, but if RAID5/6 were that bad, no one would use it anymore.

So… just post the links and I will take a close look at the docs.

Thanks,
budy
On 14-Oct-10, at 3:27 AM, Stephan Budach wrote:

> I'd like to see those docs as well.
> All HW raids are driven by software, of course - and software can be buggy.

It's not that the software 'can be buggy' - that's not the point here. The point being made is that conventional RAID just doesn't offer data *integrity* - it's not a design factor. The necessary mechanisms simply aren't there. Contrariwise, with ZFS, end-to-end integrity is *designed in*.

The 'papers' which demonstrate this difference are the design documents; anyone could start with Mr Bonwick's blog - with which I am sure most list readers are already familiar.
http://blogs.sun.com/bonwick/en_US/category/ZFS
e.g. http://blogs.sun.com/bonwick/en_US/entry/zfs_end_to_end_data

> I don't want to heat up the discussion about ZFS-managed discs vs. HW raids, but if RAID5/6 were that bad, no one would use it anymore.

It is. And there's no reason not to point it out. The world has changed a lot since RAID was 'state of the art'. It is important to understand its limitations (most RAID users apparently don't).

The saddest part is that your experience clearly shows these limitations. As expected, the hardware RAID didn't protect your data, since it's designed neither to detect nor repair such errors. If you had been running any other filesystem on your RAID you would never even have found out about it until you accessed a damaged part of it. Furthermore, backups would probably have been silently corrupt, too.

As many other replies have said: the correct solution is to let ZFS, and not conventional RAID, manage your redundancy. That's the bottom line of any discussion of "ZFS managed discs vs. HW raids". If still unclear, read Bonwick's blog posts, or the detailed reply to you from Edward Harvey (10/6).

--Toby

> So… just post the links and I will take a close look at the docs.
>
> Thanks,
> budy
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Toby Thain
>
>> I don't want to heat up the discussion about ZFS managed discs vs.
>> HW raids, but if RAID5/6 were that bad, no one would use it
>> anymore.
>
> It is. And there's no reason not to point it out. The world has

Well, neither one of the above statements is really fair.

The truth is: raid5/6 are generally not that bad. Data integrity failures are not terribly common (maybe one bit per year out of 20 large disks, or something like that).

And in order to reach the conclusion "nobody would use it," the people using it would have to first *notice* the failure. Which they don't. That's kind of the point.

Since I started using ZFS in production, about a year ago, on three servers totaling approx 1.5TB used, I have had precisely one checksum error, which ZFS corrected. I have every reason to believe, if that were on a raid5/6, the error would have gone undetected and nobody would have noticed.
On 14-Oct-10, at 11:48 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Toby Thain
>>
>>> I don't want to heat up the discussion about ZFS managed discs vs.
>>> HW raids, but if RAID5/6 were that bad, no one would use it
>>> anymore.
>>
>> It is. And there's no reason not to point it out. The world has
>
> Well, neither one of the above statements is really fair.
>
> The truth is: raid5/6 are generally not that bad. Data integrity failures
> are not terribly common (maybe one bit per year out of 20 large disks,
> or something like that).

Such statistics assume that no part of the stack (drive, cable, network, controller, memory, etc) has any fault and is operating normally. This is, indeed, the base presumption of RAID (which also assumes a perfect error reporting chain).

> And in order to reach the conclusion "nobody would use it," the people using
> it would have to first *notice* the failure. Which they don't. That's kind
> of the point.

Indeed it is. And then we could talk about self healing (also missing from RAID).

--Toby

> Since I started using ZFS in production, about a year ago, on three servers
> totaling approx 1.5TB used, I have had precisely one checksum error, which
> ZFS corrected. I have every reason to believe, if that were on a raid5/6,
> the error would have gone undetected and nobody would have noticed.
On 14.10.10 17:48, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Toby Thain
>>
>>> I don't want to heat up the discussion about ZFS managed discs vs.
>>> HW raids, but if RAID5/6 were that bad, no one would use it
>>> anymore.
>> It is. And there's no reason not to point it out. The world has
> Well, neither one of the above statements is really fair.
>
> The truth is: raid5/6 are generally not that bad. Data integrity failures
> are not terribly common (maybe one bit per year out of 20 large disks,
> or something like that).
>
> And in order to reach the conclusion "nobody would use it," the people using
> it would have to first *notice* the failure. Which they don't. That's kind
> of the point.
>
> Since I started using ZFS in production, about a year ago, on three servers
> totaling approx 1.5TB used, I have had precisely one checksum error, which
> ZFS corrected. I have every reason to believe, if that were on a raid5/6,
> the error would have gone undetected and nobody would have noticed.
>

Point taken!

So, what would you suggest if I wanted to create really big pools? Say in the 100 TB range? That would be quite a number of single drives then, especially when you want to go with zpool raid-1.

Cheers,
budy

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114

E-Mail: stephan.budach@jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
On Oct 15, 2010, at 9:18 AM, Stephan Budach <stephan.budach@jvm.de> wrote:

> On 14.10.10 17:48, Edward Ned Harvey wrote:
>>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Toby Thain
>>>
>>>> I don't want to heat up the discussion about ZFS managed discs vs.
>>>> HW raids, but if RAID5/6 were that bad, no one would use it
>>>> anymore.
>>> It is. And there's no reason not to point it out. The world has
>> Well, neither one of the above statements is really fair.
>>
>> The truth is: raid5/6 are generally not that bad. Data integrity failures
>> are not terribly common (maybe one bit per year out of 20 large disks,
>> or something like that).
>>
>> And in order to reach the conclusion "nobody would use it," the people using
>> it would have to first *notice* the failure. Which they don't. That's kind
>> of the point.
>>
>> Since I started using ZFS in production, about a year ago, on three servers
>> totaling approx 1.5TB used, I have had precisely one checksum error, which
>> ZFS corrected. I have every reason to believe, if that were on a raid5/6,
>> the error would have gone undetected and nobody would have noticed.
>
> Point taken!
>
> So, what would you suggest if I wanted to create really big pools? Say in the 100 TB range? That would be quite a number of single drives then, especially when you want to go with zpool raid-1.

A pool consisting of 4-disk raidz vdevs (25% overhead) or 6-disk raidz2 vdevs (33% overhead) should deliver the storage and performance for a pool that size, versus a pool of mirrors (50% overhead). You need a lot of spindles to reach 100TB.

-Ross
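To put rough numbers on those overhead figures (assuming 2 TB drives, as supposed elsewhere in the thread, and ignoring metadata overhead and free-space headroom): a 6-disk raidz2 vdev yields about 8 TB usable, so roughly 13 such vdevs - 78 drives - reach 100 TB; 4-disk raidz1 vdevs yield about 6 TB each, needing around 17 vdevs (68 drives); a pool of 2-way mirrors would need about 100 drives.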
> From: Stephan Budach [mailto:stephan.budach@jvm.de]
>
> Point taken!
>
> So, what would you suggest if I wanted to create really big pools? Say
> in the 100 TB range? That would be quite a number of single drives
> then, especially when you want to go with zpool raid-1.

You have a lot of disks. You either tell the hardware to manage a lot of disks and then tell ZFS to manage a single device, and you take unnecessary risk and performance degradation for no apparent reason... or you tell ZFS to manage a lot of disks. Either way, you have a lot of disks that need to be managed by something. Why would you want that something to be hardware instead of ZFS?

For 100TB ... I suppose you have 2TB disks. I suppose you have 12 buses. I would make a raidz1 using 1 disk from bus0, bus1, ... bus5. I would make another raidz1 vdev using a disk from bus6, bus7, ... bus11. And so forth. Then, even if you lose a whole bus, you still haven't lost your pool. Each raidz1 vdev would be 6 disks with a capacity of 5, so you would have a total of 10 vdevs, and that means 5 disks on each bus.

Or do whatever you want. The point is: yes, give all the individual disks to ZFS.
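A sketch of that layout in zpool terms, with placeholder device names (cNtMd0 standing for a disk on bus N). Because each raidz1 vdev takes exactly one disk from each of six buses, losing an entire bus costs any vdev at most a single disk, which raidz1 tolerates:

    zpool create bigpool \
      raidz1 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
      raidz1 c6t0d0 c7t0d0 c8t0d0 c9t0d0 c10t0d0 c11t0d0
    # ...and so on, adding further raidz1 vdevs until all 10 are in place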
On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:

> So, what would you suggest if I wanted to create really big pools? Say in the 100 TB range? That would be quite a number of single drives then, especially when you want to go with zpool raid-1.

For 100 TB, the methods change dramatically. You can't just reload 100 TB from CD or tape. When you get to this scale you need to be thinking about raidz2+ *and* mirroring.

I will be exploring these issues of scale at the "Techniques for Managing Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
http://www.usenix.org/events/lisa10/training/
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16
ZFS and performance consulting
http://www.RichardElling.com
On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
> On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
>> So, what would you suggest if I wanted to create really big pools? Say
>> in the 100 TB range? That would be quite a number of single drives then,
>> especially when you want to go with zpool raid-1.
>
> For 100 TB, the methods change dramatically. You can't just reload 100 TB
> from CD or tape. When you get to this scale you need to be thinking about
> raidz2+ *and* mirroring.
> I will be exploring these issues of scale at the "Techniques for Managing
> Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
> http://www.usenix.org/events/lisa10/training/

Hopefully your presentation will be available online after the event!

-- Pasi

> -- richard
>
> --
> OpenStorage Summit, October 25-27, Palo Alto, CA
> http://nexenta-summit2010.eventbrite.com
> USENIX LISA '10 Conference November 8-16
> ZFS and performance consulting
> http://www.RichardElling.com
On Oct 16, 2010, at 4:13 PM, Pasi Kärkkäinen wrote:

> On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
>> On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
>>> So, what would you suggest if I wanted to create really big pools? Say
>>> in the 100 TB range? That would be quite a number of single drives then,
>>> especially when you want to go with zpool raid-1.
>>
>> For 100 TB, the methods change dramatically. You can't just reload 100 TB
>> from CD or tape. When you get to this scale you need to be thinking about
>> raidz2+ *and* mirroring.
>> I will be exploring these issues of scale at the "Techniques for Managing
>> Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
>> http://www.usenix.org/events/lisa10/training/
>
> Hopefully your presentation will be available online after the event!

Sure, though I would encourage everyone to attend :-)
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16, 2010
ZFS and performance consulting
http://www.RichardElling.com
budy, here are some links. Remember, the reason you get corrupted files is because ZFS detects them. Probably, you got corruption earlier as well, but your hardware did not notice it. This is called silent corruption. ZFS is designed to detect and correct silent corruption, which no normal hardware is designed for.

The thing is, ZFS does end-to-end checksumming. The data in RAM - is it identical on disc? From RAM down to the controller to the disk, there can be errors in the passing between the realms. Normally there are checksums within each realm (checksums on the disc), but no checksums from the beginning of the chain to the end - end-to-end checksums:
http://jforonda.blogspot.com/2007/01/faulty-fc-port-meets-zfs.html

Here are some links. CERN did a data integrity survey on 3000 hardware raid setups and saw silent corruptions:
http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

In another CERN paper, they say "such data corruption is found in all solutions, no matter price (even very expensive Enterprise solutions)"!!! From that paper (can not find the link now):

"Conclusions
- silent corruptions are a fact of life
- first step towards a solution is detection
- elimination seems impossible
- existing datasets are at the mercy of Murphy
- correction will cost time AND money
- effort has to start now (if not started already)
- multiple cost-schemes exist
-- trade time and storage space (à la Google)
-- trade time and CPU power (correction codes)"

CERN writes: "checksumming - not necessarily enough", you need to use "end-to-end checksumming (ZFS has a point)".

See the specifications of a new SAS Enterprise disk; typically it says: "one irrecoverable error in 10^15 bits". With today's large and fast raids, you quickly reach 10^15 bits in a short time. Greenplum's database solution faces one such bit every 15 min:
http://queue.acm.org/detail.cfm?id=1317400

Ordinary filesystems such as XFS, ReiserFS, JFS, etc. do not protect your data, nor detect all errors (here is a PhD thesis link):
http://www.zdnet.com/blog/storage/how-microsoft-puts-your-data-at-risk/169

ZFS data integrity tested by researchers:
http://www.zdnet.com/blog/storage/zfs-data-integrity-tested/811?tag=rbxccnbzd1
(If they had run ZFS raid, ZFS would have corrected all artificially injected errors. Here, ZFS only detected all errors - which is already very difficult to do. The first step is detection, then repairing the errors.)

Companies try to hide silent corruption:
http://www.enterprisestorageforum.com/sans/features/article.php/3704666

http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
"When a drive returns garbage, since RAID5 does not EVER check parity on read (RAID3 & RAID4 do BTW and both perform better for databases than RAID5 to boot) if you write a garbage sector back garbage parity will be calculated and your RAID5 integrity is lost! Similarly if a drive fails and one of the remaining drives is flaky the replacement will be rebuilt with garbage also propagating the problem to two blocks instead of just one."

http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
"The paper explains that the best RAID-6 can do is use probabilistic methods to distinguish between single and dual-disk corruption, eg. "there are 95% chances it is single-disk corruption so I am going to fix it assuming that, but there are 5% chances I am going to actually corrupt more data, I just can't tell".
I wouldn''t want to rely on a RAID controller that takes gambles :-)" Researchers write regarding hw-raid: http://www.cs.wisc.edu/adsl/Publications/parity-fast08.html "We use the model checker to evaluate a number of different approaches found in real RAID systems, focusing on parity-based protection and single errors. We find holes in all of the schemes examined, where systems potentially exposes data to loss or returns corrupt data to the user. In data loss scenarios, the error is detected, but the data cannot be recovered, while in the rest, the error is not detected and therefore corrupt data is returned to the user. For example, we examine a combination of two techniques ? block-level checksums (where checksums of the data block are stored within the same disk block as data and verified on every read) and write-verify (where data is read back immediately after it is written to disk and verified for correctness), and show that the scheme could still fail to detect certain error conditions, thus returning corrupt data to the user. We discover one particularly interesting and general problem that we call parity pollution. In this situation, corrupt data in one block of a stripe spreads to other blocks through various parity calculations. We find a number of cases where parity pollution occurs, and show how pollution can lead to data loss. Specifically, we find that data scrubbing (which is used to reduce the chances of double disk failures) tends to be one of themain causes of parity pollution." http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf "Detecting and recovering from data corruption requires protection techniques beyond those provided by the disk drive. In fact, basic protection schemes such as RAID [13] may also be unable to detect these problems. ... as we discuss later, checksums do not protect against all forms of corruption" http://www.cs.wisc.edu/adsl/Publications/corrupt-mysql-icde10.pdf "More reliable SCSI drives encounter fewer problems, but even within this expensive and carefully-engineered drive class, corruption still takes place." .... Recent work has shown that even with sophisticated RAID protection strategies, the ?right? combination of a single fault and certain repair activities (e.g., a parity scrub) can still lead to data loss [19]. Thus, while these schemes reduce the chances of corruption, the possibility still exists; any higher-level client of storage that is serious about managing data reliably must consider the possibility that a disk will return data in a corrupted form." -- This message posted from opensolaris.org
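As a rough illustration of what "end-to-end" means here, the sketch below verifies a single file copy by hand - essentially what ZFS automates per block with its checksum tree - and works the arithmetic behind the 10^15-bit figure. The paths and the 100 TB pool size are made-up examples, and the checksum command differs per platform (Solaris has digest, GNU systems have sha256sum).

    SRC=/tank/data/somefile            # placeholder source path
    DST=/backup/somefile               # placeholder destination path
    cp "$SRC" "$DST"
    # compare checksums computed at both ends of the copy
    # (Solaris: digest -a sha256; on GNU systems use: sha256sum FILE | awk '{print $1}')
    [ "$(digest -a sha256 "$SRC")" = "$(digest -a sha256 "$DST")" ] \
        && echo "copy verified end to end" \
        || echo "corruption somewhere between source and destination"

    # back-of-the-envelope arithmetic for "one unrecoverable error in 10^15 bits":
    # reading a 100 TB pool once is roughly 8 x 10^14 bits, i.e. close to one
    # expected unrecoverable read error per full pass over a pool that size
    awk 'BEGIN { printf("expected errors per full read: %.2f\n", 8e14 / 1e15) }'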
On Sun, 17 Oct 2010 03:05:34 PDT, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:
> here are some links.

Wow, that's a great overview, thanks!
--
( Kees Nuyt )
c[_]
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> If scrub is operating at a block-level (and I think it is), then how can
> checksum failures be mapped to file names? For example, this is a
> long-requested feature of "zfs send" which is fundamentally difficult or
> impossible to implement.

How about that. I recently learned that "zfs diff" does exist already, in b147 of OpenIndiana. That means it's already in the ZFS code that Oracle open-sourced, but apparently too new to be included in any of the present releases.

So it seems ZFS does have some ability to figure out which file owns a particular block on disk.
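For what it's worth, a minimal sketch of how "zfs diff" is used on a build that has it (b147 or later); the pool, dataset and snapshot names here are made up:

    zfs snapshot tank/data@before
    # ... some files are created, modified, renamed or removed ...
    zfs snapshot tank/data@after
    zfs diff tank/data@before tank/data@after
    # output lines look roughly like:
    #   M       /tank/data/changed_file
    # where M = modified, + = created, - = removed, R = renamed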
On Oct 17, 2010, at 6:17 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> If scrub is operating at a block-level (and I think it is), then how can
>> checksum failures be mapped to file names? For example, this is a
>> long-requested feature of "zfs send" which is fundamentally difficult or
>> impossible to implement.
>
> How about that. I recently learned that "zfs diff" does exist already, in
> b147 of OpenIndiana. That means it's already in the ZFS code that Oracle
> open-sourced, but apparently too new to be included in any of the present
> releases.
>
> So it seems ZFS does have some ability to figure out which file owns a
> particular block on disk.

uhm... of course this exists. The problem is that the efficient mapping goes the other way: files to blocks. Snapshots further complicate this, because a block may belong to a filename in one snapshot but the file got renamed in another snapshot. Deduplication also complicates it, because a block may be referenced by multiple files. Maintaining this mapping live is probably not worth the effort.
 -- richard
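As an aside, when "zpool status -v" can only print <dataset>:<object> pairs instead of file names, the forward mapping (object to path) can usually be walked by hand with zdb. A rough sketch, assuming a pool named tank and an example object number; the exact number of -d flags and the output details vary between releases:

    zdb -d tank                   # list the datasets in the pool with their IDs,
                                  # to resolve the <dataset> half of the pair
    zdb -dddd tank/data 123456    # dump object 123456 (decimal) of that dataset;
                                  # for a plain file the output includes a line like:
                                  #   path    /data/some/dir/file.eps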
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> On Oct 17, 2010, at 6:17 AM, Edward Ned Harvey wrote:
>
>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>>>
>>> If scrub is operating at a block-level (and I think it is), then how can
>>> checksum failures be mapped to file names? For example, this is a
>>> long-requested feature of "zfs send" which is fundamentally difficult or
>>> impossible to implement.
>>
>> How about that. I recently learned that "zfs diff" does exist already, in
>> b147 of OpenIndiana. That means it's already in the ZFS code that Oracle
>> open-sourced, but apparently too new to be included in any of the present
>> releases.
>>
>> So it seems ZFS does have some ability to figure out which file owns a
>> particular block on disk.
>
> uhm... of course this exists. The problem is that the efficient mapping
> goes the other way: files to blocks. Snapshots further complicate this,
> because a block may belong to a filename in one snapshot but the file
> got renamed in another snapshot. Deduplication also complicates it,
> because a block may be referenced by multiple files. Maintaining this
> mapping live is probably not worth the effort.

Thank you, but the original question was whether a scrub would identify just corrupt blocks, or whether it would be able to map corrupt blocks to a list of corrupt files. Until I wrote this comment about "zfs diff", no answer existed in this thread (unless I overlooked it somehow). So thank you for the information about dedup and the difficulty of maintaining that mapping live, although it was irrelevant to the discussion at hand.
On Mon, Oct 18, 2010 at 4:55 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> Thank you, but the original question was whether a scrub would identify
> just corrupt blocks, or whether it would be able to map corrupt blocks to
> a list of corrupt files.

Just in case this wasn't already clear:

After scrub sees read or checksum errors, zpool status -v will list the filenames that are affected. At least in my experience.
--
- Tuomas
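In other words, the usual sequence looks roughly like this (the pool name is an example):

    zpool scrub tank
    zpool status -v tank
    # once the scrub has completed, permanent errors are listed with file
    # names where ZFS can resolve them; blocks that belong only to metadata
    # or to since-destroyed snapshots may still show as <dataset>:<object> pairs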
On 19.10.2010 at 22:36, Tuomas Leikola <tuomas.leikola at gmail.com> wrote:
> On Mon, Oct 18, 2010 at 4:55 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
>> Thank you, but the original question was whether a scrub would identify
>> just corrupt blocks, or whether it would be able to map corrupt blocks to
>> a list of corrupt files.
>
> Just in case this wasn't already clear:
>
> After scrub sees read or checksum errors, zpool status -v will list the
> filenames that are affected. At least in my experience.
> --
> - Tuomas

That didn't do it for me. I ran a scrub, and afterwards zpool status -v didn't show any additional corrupted files, although the same three files were corrupted in a number of snapshots - which, of course, zfs send detected when trying to actually send them.

budy
> From: Stephan Budach [mailto:stephan.budach at jvm.de]
>
>> Just in case this wasn't already clear:
>>
>> After scrub sees read or checksum errors, zpool status -v will list the
>> filenames that are affected. At least in my experience.
>> --
>> - Tuomas
>
> That didn't do it for me. I ran a scrub, and afterwards zpool status -v
> didn't show any additional corrupted files, although the same three files
> were corrupted in a number of snapshots - which, of course, zfs send
> detected when trying to actually send them.

Budy, we've been over this.

The behavior you experienced is explained by having corrupt data inside a hardware RAID: during the scrub you luckily read the good copy of the redundant data, and during zfs send you unluckily read the bad copy. This is a known problem as long as you use hardware RAID.

It's one of the big selling points, one of the reasons for ZFS to exist. You should always give ZFS JBOD devices to work on, so ZFS is able to scrub both of the redundant sides of the data, and when a checksum error occurs, ZFS is able to detect *and* correct it. Don't use hardware raid.
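A minimal sketch of what "give ZFS the JBOD devices" means in practice; the pool and device names are placeholders:

    # let ZFS manage the redundancy itself instead of a single hardware RAID LUN
    zpool create tank mirror c0t0d0 c0t1d0
    # with a ZFS mirror (or raidz), a block that fails its checksum on one side
    # is re-read and repaired from the other side, both on normal reads and
    # during a scrub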
On 20/10/2010 12:20, Edward Ned Harvey wrote:
> It's one of the big selling points, one of the reasons for ZFS to exist. You
> should always give ZFS JBOD devices to work on, so ZFS is able to scrub
> both of the redundant sides of the data, and when a checksum error occurs,
> ZFS is able to detect *and* correct it. Don't use hardware raid.

That isn't the recommended best practice; you are stating it far too strongly.

The recommended best practice is to always create ZFS pools with redundancy under the control of ZFS. That doesn't require that the back-end storage be JBOD or whole disks, nor does it require you not to use hardware RAID. Some or all of that is impossible in many cases if you are using a SAN or other remote block storage devices - and certainly in the case where the SAN is provided by a Sun ZFS Storage appliance.

--
Darren J Moffat
>> From: Stephan Budach [mailto:stephan.budach at jvm.de]
>>
>>> Just in case this wasn't already clear:
>>>
>>> After scrub sees read or checksum errors, zpool status -v will list the
>>> filenames that are affected. At least in my experience.
>>> --
>>> - Tuomas
>>
>> That didn't do it for me. I ran a scrub, and afterwards zpool status -v
>> didn't show any additional corrupted files, although the same three files
>> were corrupted in a number of snapshots - which, of course, zfs send
>> detected when trying to actually send them.
>
> Budy, we've been over this.
>
> The behavior you experienced is explained by having corrupt data inside a
> hardware RAID: during the scrub you luckily read the good copy of the
> redundant data, and during zfs send you unluckily read the bad copy. This
> is a known problem as long as you use hardware RAID.
>
> It's one of the big selling points, one of the reasons for ZFS to exist. You
> should always give ZFS JBOD devices to work on, so ZFS is able to scrub
> both of the redundant sides of the data, and when a checksum error occurs,
> ZFS is able to detect *and* correct it. Don't use hardware raid.

Edward - I am working on that!

Although, I have to say that I had exactly three files that were corrupt in every snapshot, until I finally deleted them and restored them from their original source. zfs send would abort when trying to send them, while scrub never noticed anything.

If zfs send had managed to send any of these snapshots successfully, or if any of my read attempts on these files had worked one time and failed another, I'd agree. As it is, I can't see how this behaviour could be explained - or rather: what are the chances that only scrub gets the "clean" blocks from the h/w RAIDs, while zfs send or cp always gets the corrupted ones?
> From: Stephan Budach [mailto:stephan.budach at jvm.de]
>
> Although, I have to say that I had exactly three files that were corrupt
> in every snapshot, until I finally deleted them and restored them from
> their original source.
>
> zfs send would abort when trying to send them, while scrub never
> noticed anything.

That cannot be consistently repeatable. If anything will notice corrupt data, scrub will too. The only way you will find corrupt data with something else and not with scrub is... if the corrupt data didn't exist during the scrub.

I'm glad you're working to change the raid setup to JBOD, because, although that's not the only possible explanation, it is the most obvious explanation.
> -----Original Message-----
> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>
>> It's one of the big selling points, one of the reasons for ZFS to exist. You
>> should always give ZFS JBOD devices to work on, so ZFS is able to scrub
>> both of the redundant sides of the data, and when a checksum error occurs,
>> ZFS is able to detect *and* correct it. Don't use hardware raid.
>
> That isn't the recommended best practice; you are stating it far too
> strongly.
>
> The recommended best practice is to always create ZFS pools with
> redundancy under the control of ZFS. That doesn't require that the back-end
> storage be JBOD or whole disks, nor does it require you not to use hardware
> RAID. Some or all of that is impossible in many cases if you are using a SAN
> or other remote block storage devices - and certainly in the case where the
> SAN is provided by a Sun ZFS Storage appliance.

You're right, though; I'm stating that too strongly. Never say never. And never say always. The truth is exactly as you said: even if you have redundancy in hardware, make sure you also have redundancy in ZFS.

If you let the hardware alone manage redundancy, then, just as Budy has experienced, when corruption is found it is not consistently repeatable, and it can appear anywhere in the storage unit at random. ZFS is unable to isolate the individual failing disk. After enough checksum failures, the whole storage unit will be marked failed and taken offline. So much for your redundancy. It is a problem if your only redundancy is in hardware. It is not a problem if you also have redundancy managed by ZFS.

So a more correct conclusion would be: "whenever possible" don't use hardware RAID, and "whenever possible" use JBOD managed by ZFS. But whatever you do, make sure ZFS has some redundancy it can manage.
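For example, two ways of giving ZFS redundancy it can manage even when the back end is a SAN or hardware RAID; the pool, LUN and dataset names are made up:

    # 1) mirror two LUNs, ideally presented by different arrays or controllers
    zpool create tank mirror c5t0d0 c6t0d0
    # 2) for a pool that is already a single LUN, keep two copies of each block
    #    so ZFS has something to repair from; this survives bad blocks on the
    #    LUN but not the loss of the whole LUN
    zfs set copies=2 tank/important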
On 20.10.10 15:11, Edward Ned Harvey wrote:
>> From: Stephan Budach [mailto:stephan.budach at jvm.de]
>>
>> Although, I have to say that I had exactly three files that were corrupt
>> in every snapshot, until I finally deleted them and restored them from
>> their original source.
>>
>> zfs send would abort when trying to send them, while scrub never
>> noticed anything.
>
> That cannot be consistently repeatable. If anything will notice corrupt
> data, scrub will too. The only way you will find corrupt data with
> something else and not with scrub is... if the corrupt data didn't exist
> during the scrub.

I will do some more scrubbing - it only takes a couple of hours, and then scrub should at least show some of the errors.

When I use zpool clear on that pool, why does zpool status still show the errors that have been encountered? I'd figure it would be a lot easier to track whether scrub finds "new" errors if zpool status -v didn't keep showing the "old" ones.

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.budach at jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
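On the zpool clear question, a hedged sketch of the sequence I would expect; the pool name is an example, and exactly when old entries age out of the permanent-error list differs between releases:

    zpool clear tank          # resets the per-device READ/WRITE/CKSUM counters
    zpool scrub tank          # re-checks every block; the permanent-error list is
                              # re-evaluated from what the scrub actually finds
    zpool status -v tank      # errors belonging to files or snapshots that have
                              # since been repaired or deleted should drop off here,
                              # though on some releases it takes a second scrub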