Hi,

I recently discovered some - or at least one - corrupted file on one of my ZFS datasets, which caused an I/O error when trying to send a ZFS snapshot to another host:

zpool status -v obelixData
  pool: obelixData
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        obelixData                 ONLINE       4     0     0
          c4t210000D023038FA8d0    ONLINE       0     0     0
          c4t210000D02305FF42d0    ONLINE       4     0     0

errors: Permanent errors have been detected in the following files:

        <0x949>:<0x12b9b9>
        obelixData/JvMpreprint@2010-10-02_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
        obelixData/JvMpreprint@BackupSnapshot_2010-10-05-08:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
        obelixData/JvMpreprint@2010-09-24_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor 6_210/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
        /obelixData/JvMpreprint/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps

Now, a scrub would reveal corrupted blocks on the devices, but is there a way to identify damaged files as well?

Thanks,
budy
On 06 October, 2010 - Stephan Budach sent me these 2,1K bytes:

> Hi,
>
> I recently discovered some - or at least one - corrupted file on one of my ZFS datasets, which caused an I/O error when trying to send a ZFS snapshot to another host:
>
> zpool status -v obelixData
>   pool: obelixData
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         obelixData                 ONLINE       4     0     0
>           c4t210000D023038FA8d0    ONLINE       0     0     0
>           c4t210000D02305FF42d0    ONLINE       4     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         <0x949>:<0x12b9b9>
>         obelixData/JvMpreprint@2010-10-02_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
>         obelixData/JvMpreprint@BackupSnapshot_2010-10-05-08:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
>         obelixData/JvMpreprint@2010-09-24_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor 6_210/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
>         /obelixData/JvMpreprint/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
>
> Now, a scrub would reveal corrupted blocks on the devices, but is there a way to identify damaged files as well?

Is this a trick question or something? The filenames are right over your question..?

/Tomas
--
Tomas Ögren, stric@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
No - not a trick question, but maybe I didn't make myself clear.
Is there a way to discover such bad files other than trying to actually read from them one by one, say using cp or by sending a snapshot elsewhere?

I am well aware that the file shown in zpool status -v is damaged and I have already restored it, but I wanted to know if there are more of them.

Regards,
budy
Scrub?

On Oct 6, 2010, at 6:48 AM, Stephan Budach wrote:

> No - not a trick question, but maybe I didn't make myself clear.
> Is there a way to discover such bad files other than trying to actually read from them one by one, say using cp or by sending a snapshot elsewhere?
>
> I am well aware that the file shown in zpool status -v is damaged and I have already restored it, but I wanted to know if there are more of them.
>
> Regards,
> budy

Scott Meilicke
Budy,

> No - not a trick question, but maybe I didn't make myself clear.
> Is there a way to discover such bad files other than trying to actually read from them one by one, say using cp or by sending a snapshot elsewhere?

As noted in your original email, ZFS reports any corruption via the "zpool status" command. ZFS detects corruption as part of its normal filesystem operations, which may be triggered by cp, send/recv, etc., or by a forced reading of the entire filesystem by scrub.

> I am well aware that the file shown in zpool status -v is damaged and I have already restored it, but I wanted to know if there are more of them.

Assuming that the ZFS filesystem in question is not degrading further (as in a disk going bad), upon completion of a successful scrub, zpool status reports the complete state of the filesystem being reported on.

- Jim
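For reference, that sequence looks roughly like the following on Solaris/OpenSolaris; the pool name is taken from the thread, and this is only a sketch, not output from the system in question:

    # read and verify every allocated block in the pool
    zpool scrub obelixData

    # check progress, and list any files affected by permanent errors
    zpool status -v obelixData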
Well, I think that answers my question then: after a successful scrub, zpool status -v should list all damaged files on the entire zpool.

I only asked because I read a thread in this forum where one guy had a problem with different files, even after a successful scrub.

Thanks,
budy
Budy,

Your previous zpool status output shows a non-redundant pool with data corruption. You should use the fmdump -eV command to find out the underlying cause of this corruption.

You can review the hardware-level monitoring tools here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

Thanks,

Cindy

On 10/06/10 13:09, Stephan Budach wrote:
> Well, I think that answers my question then: after a successful scrub, zpool status -v should list all damaged files on the entire zpool.
>
> I only asked because I read a thread in this forum where one guy had a problem with different files, even after a successful scrub.
>
> Thanks,
> budy
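A typical way to inspect those fault-management logs (standard Solaris FMA commands, shown as a general sketch rather than output from this particular system):

    # one-line summary of logged error telemetry events
    fmdump -e

    # full detail (nvlists) for each error event
    fmdump -eV | less

    # diagnosed faults, as opposed to raw error telemetry
    fmdump -v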
On 10/ 6/10 09:52 PM, Stephan Budach wrote:
> Hi,
>
> I recently discovered some - or at least one - corrupted file on one of my ZFS datasets, which caused an I/O error when trying to send a ZFS snapshot to another host:
>
> zpool status -v obelixData
>   pool: obelixData
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         obelixData                 ONLINE       4     0     0
>           c4t210000D023038FA8d0    ONLINE       0     0     0
>           c4t210000D02305FF42d0    ONLINE       4     0     0
>

Are you aware that this is a very dangerous configuration? Your pool lacks redundancy and you will lose it if one of the devices fails.

--
Ian.
Hi Cindy,

thanks for bringing that to my attention. I checked fmdump and found a lot of these entries:

Okt 06 2010 17:52:12.862812483 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x514dc67d57e00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,340e@7/pci1077,138@0,1/fp@0,0/disk@w210000d02305ff42,0
        (end detector)
        driver-assessment = retry
        op-code = 0x88
        cdb = 0x88 0x0 0x0 0x0 0x0 0x2 0xac 0xd4 0x3d 0x80 0x0 0x0 0x0 0x80 0x0 0x0
        pkt-reason = 0x3
        pkt-state = 0x0
        pkt-stats = 0x20
        __ttl = 0x1
        __tod = 0x4cac9b2c 0x336d7943

Okt 06 2010 17:52:12.862813713 ereport.io.scsi.cmd.disk.recovered
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.recovered
        ena = 0x514dc67d57e00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,340e@7/pci1077,138@0,1/fp@0,0/disk@w210000d02305ff42,0
                devid = id1,sd@n600d02310005ff4200000000712ab96c
        (end detector)
        driver-assessment = recovered
        op-code = 0x88
        cdb = 0x88 0x0 0x0 0x0 0x0 0x2 0xac 0xd4 0x3d 0x80 0x0 0x0 0x0 0x80 0x0 0x0
        pkt-reason = 0x0
        pkt-state = 0x1f
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4cac9b2c 0x336d7e11

Googling these errors brought me directly to this document:

http://dsc.sun.com/solaris/articles/scsi_disk_fma2.html

which talks about these SCSI errors. Since we're talking FC here, it seems to point to some FC issue I have not been aware of. Furthermore, it's always the same FC device that shows these errors, so I will check that device and its connections to the fabric first.

Thanks,
budy
Ian,

yes, although these vdevs are FC raids themselves, so the risk is… uhm… calculated. Unfortunately, one of the devices seems to have some issues, as stated in my previous post.

I will, nevertheless, add redundancy to my pool asap.

Thanks,
budy
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> Ian,
>
> yes, although these vdevs are FC raids themselves, so the risk is… uhm…
> calculated.

Whenever possible, you should always JBOD the storage and let ZFS manage the raid, for several reasons (see below). Also, as counter-intuitive as this sounds (see below), you should disable the hardware write-back cache (even with BBU) because it hurts performance in any of these situations:

(a) Disable WB if you have access to SSD or other nonvolatile dedicated log device.
(b) Disable WB if you know all of your writes to be async mode and not sync mode.
(c) Disable WB if you've opted to disable ZIL.

* Hardware raid blindly assumes the redundant data written to disk is written correctly. So later, if you experience a checksum error (such as you have), then it's impossible for ZFS to correct it. The hardware raid doesn't know a checksum error has occurred, and there is no way for the OS to read the "other side of the mirror" to attempt correcting the checksum via redundant data.

* ZFS has knowledge of both the filesystem and the block level devices, while hardware raid has only knowledge of block level devices. Which means ZFS is able to optimize performance in ways that hardware cannot possibly do. For example, whenever there are many small writes taking place concurrently, ZFS is able to remap the physical disk blocks of those writes, to aggregate them into a single sequential write. Depending on your metric, this yields 1-2 orders of magnitude higher IOPS.

* Because ZFS automatically buffers writes in ram in order to aggregate as previously mentioned, the hardware WB cache is not beneficial. There is one exception. If you are doing sync writes to spindle disks, and you don't have a dedicated log device, then the WB cache will benefit you, approx half as much as you would benefit by adding a dedicated log device. The sync write sort-of by-passes the ram buffer, and that's the reason why the WB is able to do some good in the case of sync writes. Ironically, if you have WB enabled, and you have a SSD log device, then the WB hurts you. You get the best performance with SSD log, and no WB. Because the WB "lies" to the OS, saying some tiny chunk of data has been written... then the OS will happily write another tiny chunk, and another, and another. The WB is only buffering a lot of tiny random writes, and in aggregate, it will only go as fast as the random writes. It undermines ZFS's ability to aggregate small writes into sequential writes.
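The dedicated log device mentioned above is added with a single command; the pool and device names here are placeholders, and an slog only helps synchronous write workloads:

    # attach an SSD as a dedicated ZFS intent log (slog) device
    zpool add tank log c6t0d0

    # it then shows up under a separate "logs" section
    zpool status tank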
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> Now, scrub would reveal corrupted blocks on the devices, but is there a
> way to identify damaged files as well?

I saw a lot of people offering the same knee-jerk reaction that I had: "Scrub." And that is the only correct answer, to make a best effort at salvaging data. But I think there is a valid question here which was neglected.

*Does* scrub produce a list of all the names of all the corrupted files? And if so, how does it do that?

If scrub is operating at a block level (and I think it is), then how can checksum failures be mapped to file names? For example, this is a long-requested feature of "zfs send" which is fundamentally difficult or impossible to implement. Zfs send operates at a block level, and there is a desire to produce a list of all the incrementally changed files in a zfs incremental send, but no capability of doing that. It seems, if scrub is able to list the names of files that correspond to corrupted blocks, then zfs send should be able to list the names of files that correspond to changed blocks, right?

I am reaching the opposite conclusion of what's already been said. I think you should scrub, but don't expect file names as a result. I think if you want file names, then tar > /dev/null will be your best friend.

I didn't answer anything at first, cuz I was hoping somebody would have that answer. I only know that I don't know, and the above is my best guess.
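The "tar to /dev/null" idea above works because ZFS verifies checksums on every read, so forcing a read of every file makes any bad block surface as an I/O error and the affected file name then appears in zpool status -v. A sketch, reusing the dataset path from earlier in the thread:

    # read every file once, discarding the data
    tar cf /dev/null /obelixData/JvMpreprint

    # files whose blocks failed their checksums are now listed
    zpool status -v obelixData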
Hi Edward,

these are interesting points. I have considered a couple of them when I started playing around with ZFS.

I am not sure whether I disagree with all of your points, but I conducted a couple of tests where I configured my raids as JBODs and mapped each drive out as a separate LUN, and I couldn't notice a difference in performance in any way.

I'd love to discuss this in a separate thread, but first I will have to check the archives and Google. ;)

Thanks,
budy
On Wed, Oct 6 at 22:04, Edward Ned Harvey wrote:
> * Because ZFS automatically buffers writes in ram in order to
>   aggregate as previously mentioned, the hardware WB cache is not
>   beneficial. There is one exception. If you are doing sync writes
>   to spindle disks, and you don't have a dedicated log device, then
>   the WB cache will benefit you, approx half as much as you would
>   benefit by adding a dedicated log device. The sync write sort-of
>   by-passes the ram buffer, and that's the reason why the WB is able
>   to do some good in the case of sync writes.

All of your comments made sense except for this one.

Every N seconds, when the system decides to burst writes to media from RAM, those writes are only sequential in the case where the underlying storage devices are significantly empty. Once you're in a situation where your allocations are scattered across the disk due to longer-term fragmentation, I don't see any way that a write cache would hurt performance on the devices, since it'd allow the drive to reorder writes to the media within that burst of data. Even though ZFS is issuing writes of ~256 sectors if it can, that is only a fraction of a revolution on a modern drive, so random writes of 128KB still have significant opportunity for reordering optimization.

Granted, with NCQ or TCQ you can get back much of the cache-disabled performance loss; however, in any system that implements an internal queue depth greater than the protocol-allowed queue depth, there is opportunity for improvement, to an asymptotic limit driven by servo settle speed.

Obviously this performance improvement comes with the standard WB risks, and YMMV, IANAL, etc.

--eric

--
Eric D. Mudama
edmudama@mail.bounceswoosh.org
Hi Edward,

well, that was exactly my point when I raised this question. If zfs send is able to identify corrupted files while it transfers a snapshot, why shouldn't scrub be able to do the same? zfs send quit with an I/O error and zpool status -v showed me the file that indeed had problems.

Since I thought that zfs send also operates on the block level, I wondered whether or not scrub would basically do the same thing. On the other hand, scrub really doesn't care about what to read from the device - it simply reads all blocks, which is not the case when running zfs send.

Maybe zfs send could just go on and not halt on an I/O error, and instead just print out the errors?

Cheers,
budy
On 10/ 7/10 06:22 PM, Stephan Budach wrote:
> Hi Edward,
>
> these are interesting points. I have considered a couple of them when I started playing around with ZFS.
>
> I am not sure whether I disagree with all of your points, but I conducted a couple of tests where I configured my raids as JBODs and mapped each drive out as a separate LUN, and I couldn't notice a difference in performance in any way.
>

The time you will notice is when a cable falls out or becomes loose and you get corrupted data and lose the pool due to lack of redundancy. Even though your LUNs are RAID, there are still numerous single points of failure between them and the target system.

--
Ian.
Ian,

I know - and I will address this by upgrading the vdevs to mirrors, but there are a lot of other SPOFs around. So I started out by reducing the most common failures, and I have found those to be the disc drives, not the chassis.

The beauty is: one can work their way up until the desired level of security is reached, or until there is no more money to spend.

Cheers,
budy
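Upgrading the existing single-device vdevs to mirrors, as described above, can be done in place by attaching a second device to each one. A sketch only: the first device names are taken from the zpool status output earlier in the thread, and NEW_LUN_1/NEW_LUN_2 are placeholders for whatever new LUNs become available.

    # each attach turns a single-device vdev into a two-way mirror
    zpool attach obelixData c4t210000D023038FA8d0 NEW_LUN_1
    zpool attach obelixData c4t210000D02305FF42d0 NEW_LUN_2

    # resilvering starts automatically; watch it with
    zpool status obelixData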
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> I conducted a couple of tests where I configured my raids as JBODs and
> mapped each drive out as a separate LUN, and I couldn't notice a
> difference in performance in any way.

Not sure if my original points were communicated clearly. Giving JBODs to ZFS is not for the sake of performance. The reason for JBOD is reliability. Because hardware raid cannot detect or correct checksum errors. ZFS can. So it's better to skip the hardware raid and use JBOD, to enable ZFS access to each separate side of the redundant data.
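As a sketch of what that could look like in practice (device names are placeholders for LUNs exported one-per-disk from two enclosures): pairing one LUN from each enclosure in every mirror lets ZFS repair checksum errors from its own redundancy and also survive the loss of a whole enclosure.

    # ZFS-managed mirrors built from JBOD LUNs across two enclosures
    zpool create tank \
      mirror c5t0d0 c6t0d0 \
      mirror c5t1d0 c6t1d0 \
      mirror c5t2d0 c6t2d0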
> From: edmudama@mail.bounceswoosh.org [mailto:edmudama@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama
>
> On Wed, Oct 6 at 22:04, Edward Ned Harvey wrote:
> > * Because ZFS automatically buffers writes in ram in order to
> >   aggregate as previously mentioned, the hardware WB cache is not
> >   beneficial. There is one exception. If you are doing sync writes
> >   to spindle disks, and you don't have a dedicated log device, then
> >   the WB cache will benefit you, approx half as much as you would
> >   benefit by adding a dedicated log device. The sync write sort-of
> >   by-passes the ram buffer, and that's the reason why the WB is able
> >   to do some good in the case of sync writes.
>
> All of your comments made sense except for this one.
>
> (etc)

Your point about long-term fragmentation and significant drive emptiness is well received. I never let a pool get over 90% full, for several reasons including this one. My target is 70%, which seems to be sufficiently empty. Also, as you indicated, blocks of 128K are not sufficiently large for reordering to benefit. There's another thread here where I calculated that you need blocks approx 40MB in size in order to reduce random seek time below 1% of total operation time. So all that I said will only be relevant or accurate if, within 30 sec (or 5 sec in the future), there exists at least 40MB of aggregatable sequential writes.

It's really easy to measure and quantify what I was saying. Just create a pool, and benchmark it in each configuration. Results that I measured were (stripe of 2 mirrors):

     721 IOPS without WB or slog
    2114 IOPS with WB
    2722 IOPS with WB and slog
    2927 IOPS with slog, and no WB

There's a whole spreadsheet full of results that I can't publish, but the trend of WB versus slog was clear and consistent. I will admit the above were performed on relatively new, relatively empty pools. It would be interesting to see if any of that changes if the test is run on a system that has been in production for a long time, with real user data in it.
I would not discount the performance issue... Depending on your workload, you might find that performance increases with ZFS on your hardware RAID in JBOD mode.

Cindy

On 10/07/10 06:26, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>>
>> I conducted a couple of tests where I configured my raids as JBODs and
>> mapped each drive out as a separate LUN, and I couldn't notice a
>> difference in performance in any way.
>
> Not sure if my original points were communicated clearly. Giving JBODs to
> ZFS is not for the sake of performance. The reason for JBOD is reliability.
> Because hardware raid cannot detect or correct checksum errors. ZFS can.
> So it's better to skip the hardware raid and use JBOD, to enable ZFS access
> to each separate side of the redundant data.
On 7-Oct-10, at 1:22 AM, Stephan Budach wrote:

> Hi Edward,
>
> these are interesting points. I have considered a couple of them when I started playing around with ZFS.
>
> I am not sure whether I disagree with all of your points, but I conducted a couple of tests where I configured my raids as JBODs and mapped each drive out as a separate LUN, and I couldn't notice a difference in performance in any way.
>

The integrity issue is, however, clear cut. ZFS must manage the redundancy. ZFS just alerted you that your 'FC RAID' doesn't actually provide data integrity, and you just lost the 'calculated' bet. :)

--Toby

> I'd love to discuss this in a separate thread, but first I will have to check the archives and Google. ;)
>
> Thanks,
> budy
> From: Cindy Swearingen [mailto:cindy.swearingen@oracle.com]
>
> I would not discount the performance issue...
>
> Depending on your workload, you might find that performance increases
> with ZFS on your hardware RAID in JBOD mode.

Depends on the raid card you're comparing to. I've certainly seen some raid cards that were too dumb to read from 2 disks in a mirror simultaneously for the sake of read performance enhancement. And many other similar situations. But I would not say that's generally true anymore. In the last several years, all the hardware raid cards that I've bothered to test were able to utilize all the hardware available. Just like ZFS.

There are performance differences... like ... the hardware raid might be able to read 15% faster in raid5, while ZFS is able to write 15% faster in raidz, and so forth. Differences that roughly balance each other out.

For example, here's one data point I can share (2 mirrors striped, results normalized):

          8 initial writers   8 rewriters         8 readers
    ZFS   1.43                2.99                5.05
    HW    2.00                2.54                2.96

          8 re-readers        8 reverse readers   8 stride readers
    ZFS   4.19                3.59                3.93
    HW    3.02                2.80                2.90

          8 random readers    8 random mix        8 random writers
    ZFS   2.57                2.40                1.69
    HW    1.99                1.70                1.73

          average
    ZFS   3.09
    HW    2.40

There were some categories where ZFS was faster. Some where HW was faster. On average, ZFS was faster, but they were all in the same ballpark, and the results were highly dependent on specific details and tunables. AKA, not a place you should explore, unless you have a highly specialized use case that you wish to optimize.
So, I decided to give tar a whirl, after zfs send encountered the next corrupted file, resulting in an I/O error, even though scrub ran successfully without any errors.

I then issued a

/usr/gnu/bin/tar -cf /dev/null /obelixData/?/.zfs/snapshot/<actual snapshot>/DTP

which finished without any issue, and I have now issued a zfs send of this snapshot to my remote host. Let's see what happens in approx. 9 hrs.

budy
So - after 10 hrs and 21 mins, the incremental zfs send/recv finished without a problem. ;)

Seems that using tar for checking all files is an appropriate action.

Cheers,
budy
On Fri, October 8, 2010 04:47, Stephan Budach wrote:

> So, I decided to give tar a whirl, after zfs send encountered the next
> corrupted file, resulting in an I/O error, even though scrub ran
> successfully w/o any errors.

I must say that this concept of scrub running w/o error when corrupted files, detectable to zfs send, apparently exist, is very disturbing. Background scrubbing, and the block checksums that make it more meaningful than just reading the disk blocks, was the key thing that drew me into ZFS, and this seems to suggest that it doesn't work.

Does your sequence of tests happen to provide evidence that the problem isn't new errors appearing, sometimes after a scrub and before the send? For example, have you done 1) scrub finds no error, 2) send finds error, 3) scrub finds no error? (With nothing in between that could have cleared or fixed the error.)

--
David Dyer-Bennet, dd-b@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Oct 6, 2010, at 1:26 PM, Stephan Budach wrote:

> Hi Cindy,
>
> thanks for bringing that to my attention. I checked fmdump and found a lot of these entries:
>
> Okt 06 2010 17:52:12.862812483 ereport.io.scsi.cmd.disk.tran
...
> Okt 06 2010 17:52:12.862813713 ereport.io.scsi.cmd.disk.recovered
...
> Googling these errors brought me directly to this document:
>
> http://dsc.sun.com/solaris/articles/scsi_disk_fma2.html
>
> which talks about these SCSI errors. Since we're talking FC here, it seems to point to some FC issue I have not been aware of. Furthermore, it's always the same FC device that shows these errors, so I will check that device and its connections to the fabric first.

SCSI transport errors occur between the HBA and the target. These are reported up the stack to Solaris. As you can see, a retry was successful. However, these will have negative impacts on performance, so it is best to solve the problem.
 -- richard
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of David Dyer-Bennet
>
> I must say that this concept of scrub running w/o error when corrupted
> files, detectable to zfs send, apparently exist, is very disturbing.

As previously mentioned, the OP is using a hardware raid system. It is impossible for ZFS to read "both sides of the mirror," which means it's pure chance. The hardware raid may fetch data from a bad disk one time, and fetch good data from another disk the next time. Or vice-versa.

You should always configure JBOD and allow ZFS to manage the raid. Don't do it in hardware, as the OP of this thread is soundly demonstrating the reasons why.
I think one has to accept that zfs send apparently is able to detect such errors while scrub is not. Scrub operates only on the block level and makes sure that each block can be read and is in line with its checksum. However, zfs send seems to have detected some errors in the file system structure itself, resulting in a couple of files being unreadable. What caused these errors, I have no idea, but deleting the affected files and replacing them did the job.

I think that my understanding of zfs send/recv only operating on the block level, bypassing the higher-level fs stuff, has been too simple.

Now to answer your question: I did 1), 2) and 3), but between 2) and 3) I verified using tar that all files were accessible. Also, I haven't had any problems since.

Cheers,
budy
You are implying that the issues resulted from the H/W raid(s), and I don't think that this is appropriate.

I configured a striped pool using two raids - this is exactly the same as using two single hard drives without mirroring them. I simply cannot see what ZFS would be able to do about a block corruption in that case. You are not stating that a single hard drive is more reliable than a HW raid box, are you? Actually, my pool has no mirror capabilities at all, unless I am seriously mistaken.

What scrub has found out is that none of the blocks had any issue, but the filesystem was not "clean" either. So if scrub does its job right and doesn't report any errors, the error must have occurred somewhere else up the stack, way before the checksum had been calculated. No?
On Tue, Oct 12, 2010 at 9:39 AM, Stephan Budach <stephan.budach@jvm.de> wrote:

> You are implying that the issues resulted from the H/W raid(s), and I don't think that this is appropriate.

Not exactly. Because the raid is managed in hardware and not by ZFS, ZFS cannot fix these errors when it encounters them.

> I configured a striped pool using two raids - this is exactly the same as using two single hard drives without mirroring them. I simply cannot see what ZFS would be able to do about a block corruption in that case.

It cannot, exactly.

> You are not stating that a single hard drive is more reliable than a HW raid box, are you? Actually, my pool has no mirror capabilities at all, unless I am seriously mistaken.

No, but ZFS-managed raid is more reliable than hardware raid.

> What scrub has found out is that none of the blocks had any issue, but the filesystem was not "clean" either. So if scrub does its job right and doesn't report any errors, the error must have occurred somewhere else up the stack, way before the checksum had been calculated.

If the case is, as speculated, that one mirror has bad data and one has good, scrub or any I/O has a 50% chance of seeing the corruption. Scrub does verify checksums.

Tuomas
> If the case is, as speculated, that one mirror has bad data and one has good, scrub or any I/O has a 50% chance of seeing the corruption. Scrub does verify checksums.

Yes, if the vdev were a mirrored one, which it isn't. There weren't any mirrors set up. Plus, if the checksums had been bad, scrub would have detected that. It would not have been able to resolve it, but that wasn't the case.

zpool status backupPool_01
  pool: backupPool_01
 state: ONLINE
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        backupPool_01            ONLINE       0     0     0
          c3t2100001378AC0253d0  ONLINE       0     0     0
          c3t2100001378AC026Ed0  ONLINE       0     0     0

errors: No known data errors

If one of the two devices goes bad, boom - that'd be it for the entire pool, but as long as the two devices work, it's okay.
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> You are implying that the issues resulted from the H/W raid(s), and I
> don't think that this is appropriate.

Please quote originals when you reply. If you don't, then it's easy to follow the thread on the web forum, but not in email. So if you don't quote, you'll be losing a lot of the people following the thread.

I think it's entirely appropriate to imply that your problem this time stems from hardware. I'll say it outright: you have a hardware problem. Because if there is a repeatable checksum failure (bad disk), then if anything can find it, scrub can. And scrub is the best way to find it. If you have a nonrepeatable checksum failure (such as you have), then there is only one possibility: you are experiencing a hardware problem.

One possibility is that there's a failing disk in your hardware raid set, and your hardware raid controller is unable to detect it, because hardware raid doesn't do checksumming. Sometimes ZFS reads the device and gets an error. Sometimes the hardware raid controller reads the other side of the mirror, and there is no error.

This is not the only possibility. There could be some other piece of hardware yielding your intermittent checksum errors. But there's one absolute conclusion: your intermittent checksum errors are caused by hardware.

If scrub didn't find an error, then there was no error at the time of scrub. If scrub didn't find an error, and then something else *did* find an error, it means one of two things: (a) maybe the error only occurred after the scrub, or (b) the hardware raid controller or some other piece of hardware didn't produce corrupted data during the scrub, but will produce corrupted data at some other time.
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>
> c3t2100001378AC0253d0  ONLINE       0     0     0

How many disks are there inside of c3t2100001378AC0253d0?

How are they configured? Hardware raid 5? A mirror of two hardware raid 5's? The point is: this device, as seen by ZFS, is not a pure storage device. It is a high-level device representing some LUN or something, which is configured & controlled by hardware raid.

If there's zero redundancy in that device, then scrub would probably find the checksum errors consistently and repeatably.

If there's some redundancy in that device, then all bets are off. Sometimes scrub might read the "good half" of the data, and other times, the bad half.

But then again, the error might not be in the physical disks themselves. The error might be somewhere in the raid controller(s) or the interconnect. Or even some weird unsupported driver or something.
On 12.10.10 14:21, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>>
>> c3t2100001378AC0253d0  ONLINE       0     0     0
>
> How many disks are there inside of c3t2100001378AC0253d0?
>
> How are they configured? Hardware raid 5? A mirror of two hardware raid
> 5's? The point is: this device, as seen by ZFS, is not a pure storage
> device. It is a high-level device representing some LUN or something, which
> is configured & controlled by hardware raid.
>
> If there's zero redundancy in that device, then scrub would probably find
> the checksum errors consistently and repeatably.
>
> If there's some redundancy in that device, then all bets are off. Sometimes
> scrub might read the "good half" of the data, and other times, the bad half.
>
> But then again, the error might not be in the physical disks themselves.
> The error might be somewhere in the raid controller(s) or the interconnect.
> Or even some weird unsupported driver or something.

Both raid boxes run raid6 with 16 drives each. This is the reason I was running a non-mirrored pool in the first place. I fully understand that ZFS's power comes into play when you're running with multiple independent drives, but that was what I had at hand.

I now also got what you meant by "good half", but I don't dare to say whether or not this is also the case in a raid6 setup.

Regards

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114

E-Mail: stephan.budach@jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
On Oct 12, 2010, at 8:21 AM, "Edward Ned Harvey" <shill@nedharvey.com> wrote:

>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Stephan Budach
>>
>> c3t2100001378AC0253d0  ONLINE       0     0     0
>
> How many disks are there inside of c3t2100001378AC0253d0?
>
> How are they configured? Hardware raid 5? A mirror of two hardware raid
> 5's? The point is: this device, as seen by ZFS, is not a pure storage
> device. It is a high-level device representing some LUN or something, which
> is configured & controlled by hardware raid.
>
> If there's zero redundancy in that device, then scrub would probably find
> the checksum errors consistently and repeatably.
>
> If there's some redundancy in that device, then all bets are off. Sometimes
> scrub might read the "good half" of the data, and other times, the bad half.
>
> But then again, the error might not be in the physical disks themselves.
> The error might be somewhere in the raid controller(s) or the interconnect.
> Or even some weird unsupported driver or something.

If it were a parity-based raid set, then the error would most likely be reproducible, if not detected by the raid controller. The biggest problem is with hardware mirrors, where the hardware can't detect an error on one side vs the other.

For mirrors it's always best to use ZFS's built-in mirrors; otherwise, if I were to use HW RAID I would use RAID5/6/50/60, since errors encountered can be reproduced. Two parity raids mirrored in ZFS would probably provide the best of both worlds, for a steep cost though.

-Ross
> From: Stephan Budach [mailto:stephan.budach@jvm.de]
>
> I now also got what you meant by "good half", but I don't dare to say
> whether or not this is also the case in a raid6 setup.

The same concept applies to raid5 or raid6. When you read the device, you never know if you're actually reading the "data" or the "parity", and in fact they're mixed together in order to fully utilize all the hardware available. (Assuming you have some decently smart hardware.)

But all of that is mostly irrelevant. One fact remains: you have checksum errors. There is only one cause for checksum errors: hardware failure. It may be the physical disks themselves, or the raid card, or ram, or cpu, or any of the interconnect in between. I suppose it could be a driver problem, but that's less likely.
Budy,

if you are using raid-5 or raid-6 underneath ZFS, then you should know that raid-5/6 might corrupt data. See here for lots of technical articles on why raid-5 is bad:
http://www.baarf.com/
raid-6 is not better. I can show you links about raid-6 being unsafe as well.

It is a good thing you run ZFS, because ZFS can detect those errors, whereas raid-5/6 can not. There is a lot of research from computer scientists that shows this. You want to see some research papers on data corruption and hardware raid?

On the other hand, ZFS is safe. There are research papers showing that ZFS detects and corrects all errors. You want to see them?

The bottom line is: ZFS should manage the discs directly. Do not let hardware raid (which can not detect all errors) run the discs. ZFS can detect and repair those errors. That is the reason to use ZFS: data safety. Not performance (that is secondary).

You do have problems with your discs; only ZFS detects those errors. Your hardware raid did not detect those errors. ZFS can not repair the errors unless ZFS runs the discs.
On Oct 13, 2010, at 12:59 PM, Orvar Korvar wrote:

> On the other hand, ZFS is safe. There are research papers showing that ZFS detects and corrects all errors. You want to see them?

I would. URLs please?
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS Tutorial at USENIX LISA '10 Conference
November 8, 2010 San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com
I'd like to see those docs as well.

All HW raids are driven by software, of course - and software can be buggy. I don't want to heat up the discussion about ZFS-managed discs vs. HW raids, but if RAID5/6 were that bad, no one would use it anymore.

So… just post the links and I will take a close look at the docs.

Thanks,
budy
On 14-Oct-10, at 3:27 AM, Stephan Budach wrote:

> I'd like to see those docs as well.
> All HW raids are driven by software, of course - and software can be buggy.

It's not that the software 'can be buggy' - that's not the point here. The point being made is that conventional RAID just doesn't offer data *integrity* - it's not a design factor. The necessary mechanisms simply aren't there. Contrariwise, with ZFS, end-to-end integrity is *designed in*.

The 'papers' which demonstrate this difference are the design documents; anyone could start with Mr Bonwick's blog - with which I am sure most list readers are already familiar.
http://blogs.sun.com/bonwick/en_US/category/ZFS
e.g. http://blogs.sun.com/bonwick/en_US/entry/zfs_end_to_end_data

> I don't want to heat up the discussion about ZFS-managed discs vs. HW raids, but if RAID5/6 were that bad, no one would use it anymore.

It is. And there's no reason not to point it out. The world has changed a lot since RAID was 'state of the art'. It is important to understand its limitations (most RAID users apparently don't).

The saddest part is that your experience clearly shows these limitations. As expected, the hardware RAID didn't protect your data, since it's designed neither to detect nor repair such errors. If you had been running any other filesystem on your RAID you would never even have found out about it until you accessed a damaged part of it. Furthermore, backups would probably have been silently corrupt, too.

As many other replies have said: the correct solution is to let ZFS, and not conventional RAID, manage your redundancy. That's the bottom line of any discussion of "ZFS managed discs vs. HW raids". If still unclear, read Bonwick's blog posts, or the detailed reply to you from Edward Harvey (10/6).

--Toby

> So… just post the links and I will take a close look at the docs.
>
> Thanks,
> budy
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Toby Thain
>
>> I don't want to heat up the discussion about ZFS managed discs vs.
>> HW raids, but if RAID5/6 were that bad, no one would use it
>> anymore.
>
> It is. And there's no reason not to point it out. The world has

Well, neither one of the above statements is really fair.

The truth is: raid5/6 are generally not that bad. Data integrity failures are not terribly common (maybe one bit per year out of 20 large disks, or something like that).

And in order to reach the conclusion "nobody would use it," the people using it would have to first *notice* the failure. Which they don't. That's kind of the point.

Since I started using ZFS in production, about a year ago, on three servers totaling approx 1.5TB used, I have had precisely one checksum error, which ZFS corrected. I have every reason to believe, if that were on a raid5/6, the error would have gone undetected and nobody would have noticed.
On 14-Oct-10, at 11:48 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Toby Thain
>>
>>> I don't want to heat up the discussion about ZFS managed discs vs.
>>> HW raids, but if RAID5/6 were that bad, no one would use it
>>> anymore.
>>
>> It is. And there's no reason not to point it out. The world has
>
> Well, neither one of the above statements is really fair.
>
> The truth is: raid5/6 are generally not that bad. Data integrity failures
> are not terribly common (maybe one bit per year out of 20 large disks,
> or something like that).

Such statistics assume that no part of the stack (drive, cable, network, controller, memory, etc) has any fault and is operating normally. This is, indeed, the base presumption of RAID (which also assumes a perfect error reporting chain).

> And in order to reach the conclusion "nobody would use it," the people using
> it would have to first *notice* the failure. Which they don't. That's kind
> of the point.

Indeed it is. And then we could talk about self healing (also missing from RAID).

--Toby

> Since I started using ZFS in production, about a year ago, on three servers
> totaling approx 1.5TB used, I have had precisely one checksum error, which
> ZFS corrected. I have every reason to believe, if that were on a raid5/6,
> the error would have gone undetected and nobody would have noticed.
On 14.10.10 17:48, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Toby Thain
>>
>>> I don't want to heat up the discussion about ZFS managed discs vs.
>>> HW raids, but if RAID5/6 were that bad, no one would use it
>>> anymore.
>> It is. And there's no reason not to point it out. The world has
> Well, neither one of the above statements is really fair.
>
> The truth is: raid5/6 are generally not that bad. Data integrity failures
> are not terribly common (maybe one bit per year out of 20 large disks,
> or something like that).
>
> And in order to reach the conclusion "nobody would use it," the people using
> it would have to first *notice* the failure. Which they don't. That's kind
> of the point.
>
> Since I started using ZFS in production, about a year ago, on three servers
> totaling approx 1.5TB used, I have had precisely one checksum error, which
> ZFS corrected. I have every reason to believe, if that were on a raid5/6,
> the error would have gone undetected and nobody would have noticed.
>

Point taken!

So, what would you suggest if I wanted to create really big pools? Say in the 100 TB range? That would be quite a number of single drives then, especially when you want to go with zpool raid-1.

Cheers,
budy

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114

E-Mail: stephan.budach@jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
On Oct 15, 2010, at 9:18 AM, Stephan Budach <stephan.budach@jvm.de> wrote:

> On 14.10.10 17:48, Edward Ned Harvey wrote:
>>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-bounces@opensolaris.org] On Behalf Of Toby Thain
>>>
>>>> I don't want to heat up the discussion about ZFS managed discs vs.
>>>> HW raids, but if RAID5/6 were that bad, no one would use it
>>>> anymore.
>>> It is. And there's no reason not to point it out. The world has
>> Well, neither one of the above statements is really fair.
>>
>> The truth is: raid5/6 are generally not that bad. Data integrity failures
>> are not terribly common (maybe one bit per year out of 20 large disks,
>> or something like that).
>>
>> And in order to reach the conclusion "nobody would use it," the people using
>> it would have to first *notice* the failure. Which they don't. That's kind
>> of the point.
>>
>> Since I started using ZFS in production, about a year ago, on three servers
>> totaling approx 1.5TB used, I have had precisely one checksum error, which
>> ZFS corrected. I have every reason to believe, if that were on a raid5/6,
>> the error would have gone undetected and nobody would have noticed.
>
> Point taken!
>
> So, what would you suggest if I wanted to create really big pools? Say in the 100 TB range? That would be quite a number of single drives then, especially when you want to go with zpool raid-1.

A pool consisting of 4-disk raidz vdevs (25% overhead) or 6-disk raidz2 vdevs (33% overhead) should deliver the storage and performance for a pool that size, versus a pool of mirrors (50% overhead). You need a lot of spindles to reach 100TB.

-Ross
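To put rough numbers on those overhead figures (assuming 2 TB drives, as supposed elsewhere in the thread, and ignoring metadata overhead and free-space headroom): a 6-disk raidz2 vdev yields about 8 TB usable, so roughly 13 such vdevs - 78 drives - reach 100 TB; 4-disk raidz1 vdevs yield about 6 TB each, needing around 17 vdevs (68 drives); a pool of 2-way mirrors would need about 100 drives.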
> From: Stephan Budach [mailto:stephan.budach@jvm.de]
>
> Point taken!
>
> So, what would you suggest if I wanted to create really big pools? Say
> in the 100 TB range? That would be quite a number of single drives
> then, especially when you want to go with zpool raid-1.

You have a lot of disks. You either tell the hardware to manage a lot of disks and then tell ZFS to manage a single device, and you take unnecessary risk and performance degradation for no apparent reason... or you tell ZFS to manage a lot of disks. Either way, you have a lot of disks that need to be managed by something. Why would you want that something to be hardware instead of ZFS?

For 100TB ... I suppose you have 2TB disks. I suppose you have 12 buses. I would make a raidz1 using 1 disk from bus0, bus1, ... bus5. I would make another raidz1 vdev using a disk from bus6, bus7, ... bus11. And so forth. Then, even if you lose a whole bus, you still haven't lost your pool. Each raidz1 vdev would be 6 disks with a capacity of 5, so you would have a total of 10 vdevs, and that means 5 disks on each bus.

Or do whatever you want. The point is: yes, give all the individual disks to ZFS.
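A sketch of that layout in zpool terms, with placeholder device names (cNtMd0 standing for a disk on bus N). Because each raidz1 vdev takes exactly one disk from each of six buses, losing an entire bus costs any vdev at most a single disk, which raidz1 tolerates:

    zpool create bigpool \
      raidz1 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
      raidz1 c6t0d0 c7t0d0 c8t0d0 c9t0d0 c10t0d0 c11t0d0
    # ...and so on, adding further raidz1 vdevs until all 10 are in place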
On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:

> So, what would you suggest if I wanted to create really big pools? Say in the 100 TB range? That would be quite a number of single drives then, especially when you want to go with zpool raid-1.

For 100 TB, the methods change dramatically. You can't just reload 100 TB from CD or tape. When you get to this scale you need to be thinking about raidz2+ *and* mirroring.

I will be exploring these issues of scale at the "Techniques for Managing Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
http://www.usenix.org/events/lisa10/training/
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16
ZFS and performance consulting
http://www.RichardElling.com
On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
> On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
>> So, what would you suggest if I wanted to create really big pools? Say
>> in the 100 TB range? That would be quite a number of single drives then,
>> especially when you want to go with zpool raid-1.
>
> For 100 TB, the methods change dramatically. You can't just reload 100 TB
> from CD or tape. When you get to this scale you need to be thinking about
> raidz2+ *and* mirroring.
> I will be exploring these issues of scale at the "Techniques for Managing
> Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
> http://www.usenix.org/events/lisa10/training/

Hopefully your presentation will be available online after the event!

-- Pasi

> -- richard
>
> --
> OpenStorage Summit, October 25-27, Palo Alto, CA
> http://nexenta-summit2010.eventbrite.com
> USENIX LISA '10 Conference November 8-16
> ZFS and performance consulting
> http://www.RichardElling.com
On Oct 16, 2010, at 4:13 PM, Pasi Kärkkäinen wrote:

> On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
>> On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
>>> So, what would you suggest if I wanted to create really big pools? Say
>>> in the 100 TB range? That would be quite a number of single drives then,
>>> especially when you want to go with zpool raid-1.
>>
>> For 100 TB, the methods change dramatically. You can't just reload 100 TB
>> from CD or tape. When you get to this scale you need to be thinking about
>> raidz2+ *and* mirroring.
>> I will be exploring these issues of scale at the "Techniques for Managing
>> Huge Amounts of Data" tutorial at the USENIX LISA '10 Conference.
>> http://www.usenix.org/events/lisa10/training/
>
> Hopefully your presentation will be available online after the event!

Sure, though I would encourage everyone to attend :-)
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16, 2010
ZFS and performance consulting
http://www.RichardElling.com
budy, here are some links. Remember, the reason you get corrupted files is because ZFS detects them. Probably, you got corruption earlier as well, but your hardware did not notice it. This is called silent corruption. ZFS is designed to detect and correct silent corruption, which no normal hardware is designed for.

The thing is, ZFS does end-to-end checksumming. The data in RAM - is it identical on disc? From RAM down to the controller to the disk, there can be errors in the passing between the realms. Normally there are checksums within each realm (checksums on the disc), but no checksums from the beginning of the chain to the end - end-to-end checksums:
http://jforonda.blogspot.com/2007/01/faulty-fc-port-meets-zfs.html

Here are some links. CERN did a data integrity survey on 3000 hardware raid setups and saw silent corruptions:
http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

In another CERN paper, they say "such data corruption is found in all solutions, no matter price (even very expensive Enterprise solutions)"!!! From that paper (can not find the link now):

"Conclusions
- silent corruptions are a fact of life
- first step towards a solution is detection
- elimination seems impossible
- existing datasets are at the mercy of Murphy
- correction will cost time AND money
- effort has to start now (if not started already)
- multiple cost-schemes exist
-- trade time and storage space (à la Google)
-- trade time and CPU power (correction codes)"

CERN writes: "checksumming - not necessarily enough", you need to use "end-to-end checksumming (ZFS has a point)".

See the specifications of a new SAS Enterprise disk; typically it says: "one irrecoverable error in 10^15 bits". With today's large and fast raids, you quickly reach 10^15 bits in a short time. Greenplum's database solution faces one such bit every 15 min:
http://queue.acm.org/detail.cfm?id=1317400

Ordinary filesystems such as XFS, ReiserFS, JFS, etc. do not protect your data, nor detect all errors (here is a PhD thesis link):
http://www.zdnet.com/blog/storage/how-microsoft-puts-your-data-at-risk/169

ZFS data integrity tested by researchers:
http://www.zdnet.com/blog/storage/zfs-data-integrity-tested/811?tag=rbxccnbzd1
(If they had run ZFS raid, ZFS would have corrected all artificially injected errors. Here, ZFS only detected all errors - which is already very difficult to do. The first step is detection, then repairing the errors.)

Companies try to hide silent corruption:
http://www.enterprisestorageforum.com/sans/features/article.php/3704666

http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
"When a drive returns garbage, since RAID5 does not EVER check parity on read (RAID3 & RAID4 do BTW and both perform better for databases than RAID5 to boot) if you write a garbage sector back garbage parity will be calculated and your RAID5 integrity is lost! Similarly if a drive fails and one of the remaining drives is flaky the replacement will be rebuilt with garbage also propagating the problem to two blocks instead of just one."

http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
"The paper explains that the best RAID-6 can do is use probabilistic methods to distinguish between single and dual-disk corruption, eg. "there are 95% chances it is single-disk corruption so I am going to fix it assuming that, but there are 5% chances I am going to actually corrupt more data, I just can't tell".
I wouldn''t want to rely on a RAID controller that takes gambles :-)" Researchers write regarding hw-raid: http://www.cs.wisc.edu/adsl/Publications/parity-fast08.html "We use the model checker to evaluate a number of different approaches found in real RAID systems, focusing on parity-based protection and single errors. We find holes in all of the schemes examined, where systems potentially exposes data to loss or returns corrupt data to the user. In data loss scenarios, the error is detected, but the data cannot be recovered, while in the rest, the error is not detected and therefore corrupt data is returned to the user. For example, we examine a combination of two techniques ? block-level checksums (where checksums of the data block are stored within the same disk block as data and verified on every read) and write-verify (where data is read back immediately after it is written to disk and verified for correctness), and show that the scheme could still fail to detect certain error conditions, thus returning corrupt data to the user. We discover one particularly interesting and general problem that we call parity pollution. In this situation, corrupt data in one block of a stripe spreads to other blocks through various parity calculations. We find a number of cases where parity pollution occurs, and show how pollution can lead to data loss. Specifically, we find that data scrubbing (which is used to reduce the chances of double disk failures) tends to be one of themain causes of parity pollution." http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf "Detecting and recovering from data corruption requires protection techniques beyond those provided by the disk drive. In fact, basic protection schemes such as RAID [13] may also be unable to detect these problems. ... as we discuss later, checksums do not protect against all forms of corruption" http://www.cs.wisc.edu/adsl/Publications/corrupt-mysql-icde10.pdf "More reliable SCSI drives encounter fewer problems, but even within this expensive and carefully-engineered drive class, corruption still takes place." .... Recent work has shown that even with sophisticated RAID protection strategies, the ?right? combination of a single fault and certain repair activities (e.g., a parity scrub) can still lead to data loss [19]. Thus, while these schemes reduce the chances of corruption, the possibility still exists; any higher-level client of storage that is serious about managing data reliably must consider the possibility that a disk will return data in a corrupted form." -- This message posted from opensolaris.org
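As a rough illustration of what "end-to-end" means here, the sketch below verifies a single file copy by hand - essentially what ZFS automates per block with its checksum tree - and works the arithmetic behind the 10^15-bit figure. The paths and the 100 TB pool size are made-up examples, and the checksum command differs per platform (Solaris has digest, GNU systems have sha256sum).

    SRC=/tank/data/somefile            # placeholder source path
    DST=/backup/somefile               # placeholder destination path
    cp "$SRC" "$DST"
    # compare checksums computed at both ends of the copy
    # (Solaris: digest -a sha256; on GNU systems use: sha256sum FILE | awk '{print $1}')
    [ "$(digest -a sha256 "$SRC")" = "$(digest -a sha256 "$DST")" ] \
        && echo "copy verified end to end" \
        || echo "corruption somewhere between source and destination"

    # back-of-the-envelope arithmetic for "one unrecoverable error in 10^15 bits":
    # reading a 100 TB pool once is roughly 8 x 10^14 bits, i.e. close to one
    # expected unrecoverable read error per full pass over a pool that size
    awk 'BEGIN { printf("expected errors per full read: %.2f\n", 8e14 / 1e15) }'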
On Sun, 17 Oct 2010 03:05:34 PDT, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:
> here are some links.

Wow, that's a great overview, thanks!
--
( Kees Nuyt )
c[_]
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> If scrub is operating at a block-level (and I think it is), then how can
> checksum failures be mapped to file names? For example, this is a
> long-requested feature of "zfs send" which is fundamentally difficult or
> impossible to implement.

How about that. I recently learned that "zfs diff" does exist already, in b147 of OpenIndiana. That means it's already in the ZFS code that Oracle open-sourced, but apparently too new to be included in any of the present releases.

So it seems ZFS does have some ability to figure out which file owns a particular block on disk.
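For what it's worth, a minimal sketch of how "zfs diff" is used on a build that has it (b147 or later); the pool, dataset and snapshot names here are made up:

    zfs snapshot tank/data@before
    # ... some files are created, modified, renamed or removed ...
    zfs snapshot tank/data@after
    zfs diff tank/data@before tank/data@after
    # output lines look roughly like:
    #   M       /tank/data/changed_file
    # where M = modified, + = created, - = removed, R = renamed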
On Oct 17, 2010, at 6:17 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> If scrub is operating at a block-level (and I think it is), then how can
>> checksum failures be mapped to file names? For example, this is a
>> long-requested feature of "zfs send" which is fundamentally difficult or
>> impossible to implement.
>
> How about that. I recently learned that "zfs diff" does exist already, in
> b147 of OpenIndiana. That means it's already in the ZFS code that Oracle
> open-sourced, but apparently too new to be included in any of the present
> releases.
>
> So it seems ZFS does have some ability to figure out which file owns a
> particular block on disk.

uhm... of course this exists. The problem is that the efficient mapping goes the other way: files to blocks. Snapshots further complicate this, because a block may belong to a filename in one snapshot but the file got renamed in another snapshot. Deduplication also complicates it, because a block may be referenced by multiple files. Maintaining this mapping live is probably not worth the effort.
 -- richard
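As an aside, when "zpool status -v" can only print <dataset>:<object> pairs instead of file names, the forward mapping (object to path) can usually be walked by hand with zdb. A rough sketch, assuming a pool named tank and an example object number; the exact number of -d flags and the output details vary between releases:

    zdb -d tank                   # list the datasets in the pool with their IDs,
                                  # to resolve the <dataset> half of the pair
    zdb -dddd tank/data 123456    # dump object 123456 (decimal) of that dataset;
                                  # for a plain file the output includes a line like:
                                  #   path    /data/some/dir/file.eps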
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> On Oct 17, 2010, at 6:17 AM, Edward Ned Harvey wrote:
>
>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>>>
>>> If scrub is operating at a block-level (and I think it is), then how can
>>> checksum failures be mapped to file names? For example, this is a
>>> long-requested feature of "zfs send" which is fundamentally difficult or
>>> impossible to implement.
>>
>> How about that. I recently learned that "zfs diff" does exist already, in
>> b147 of OpenIndiana. That means it's already in the ZFS code that Oracle
>> open-sourced, but apparently too new to be included in any of the present
>> releases.
>>
>> So it seems ZFS does have some ability to figure out which file owns a
>> particular block on disk.
>
> uhm... of course this exists. The problem is that the efficient mapping
> goes the other way: files to blocks. Snapshots further complicate this,
> because a block may belong to a filename in one snapshot but the file
> got renamed in another snapshot. Deduplication also complicates it,
> because a block may be referenced by multiple files. Maintaining this
> mapping live is probably not worth the effort.

Thank you, but the original question was whether a scrub would identify just corrupt blocks, or whether it would be able to map corrupt blocks to a list of corrupt files. Until I wrote this comment about "zfs diff", no answer existed in this thread (unless I overlooked it somehow). So thank you for the information about dedup and the difficulty of maintaining that mapping live, although it was irrelevant to the discussion at hand.
On Mon, Oct 18, 2010 at 4:55 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> Thank you, but the original question was whether a scrub would identify
> just corrupt blocks, or whether it would be able to map corrupt blocks to
> a list of corrupt files.

Just in case this wasn't already clear:

After scrub sees read or checksum errors, zpool status -v will list the filenames that are affected. At least in my experience.
--
- Tuomas
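In other words, the usual sequence looks roughly like this (the pool name is an example):

    zpool scrub tank
    zpool status -v tank
    # once the scrub has completed, permanent errors are listed with file
    # names where ZFS can resolve them; blocks that belong only to metadata
    # or to since-destroyed snapshots may still show as <dataset>:<object> pairs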
On 19.10.2010 at 22:36, Tuomas Leikola <tuomas.leikola at gmail.com> wrote:
> On Mon, Oct 18, 2010 at 4:55 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
>> Thank you, but the original question was whether a scrub would identify
>> just corrupt blocks, or whether it would be able to map corrupt blocks to
>> a list of corrupt files.
>
> Just in case this wasn't already clear:
>
> After scrub sees read or checksum errors, zpool status -v will list the
> filenames that are affected. At least in my experience.
> --
> - Tuomas

That didn't do it for me. I ran a scrub, and afterwards zpool status -v didn't show any additional corrupted files, although the same three files were corrupted in a number of snapshots - which, of course, zfs send detected when trying to actually send them.

budy
> From: Stephan Budach [mailto:stephan.budach at jvm.de]
>
>> Just in case this wasn't already clear:
>>
>> After scrub sees read or checksum errors, zpool status -v will list the
>> filenames that are affected. At least in my experience.
>> --
>> - Tuomas
>
> That didn't do it for me. I ran a scrub, and afterwards zpool status -v
> didn't show any additional corrupted files, although the same three files
> were corrupted in a number of snapshots - which, of course, zfs send
> detected when trying to actually send them.

Budy, we've been over this.

The behavior you experienced is explained by having corrupt data inside a hardware RAID: during the scrub you luckily read the good copy of the redundant data, and during zfs send you unluckily read the bad copy. This is a known problem as long as you use hardware RAID.

It's one of the big selling points, one of the reasons for ZFS to exist. You should always give ZFS JBOD devices to work on, so ZFS is able to scrub both of the redundant sides of the data, and when a checksum error occurs, ZFS is able to detect *and* correct it. Don't use hardware raid.
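A minimal sketch of what "give ZFS the JBOD devices" means in practice; the pool and device names are placeholders:

    # let ZFS manage the redundancy itself instead of a single hardware RAID LUN
    zpool create tank mirror c0t0d0 c0t1d0
    # with a ZFS mirror (or raidz), a block that fails its checksum on one side
    # is re-read and repaired from the other side, both on normal reads and
    # during a scrub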
On 20/10/2010 12:20, Edward Ned Harvey wrote:
> It's one of the big selling points, one of the reasons for ZFS to exist. You
> should always give ZFS JBOD devices to work on, so ZFS is able to scrub
> both of the redundant sides of the data, and when a checksum error occurs,
> ZFS is able to detect *and* correct it. Don't use hardware raid.

That isn't the recommended best practice; you are stating it far too strongly.

The recommended best practice is to always create ZFS pools with redundancy under the control of ZFS. That doesn't require that the back-end storage be JBOD or whole disks, nor does it require you not to use hardware RAID. Some or all of that is impossible in many cases if you are using a SAN or other remote block storage devices - and certainly in the case where the SAN is provided by a Sun ZFS Storage appliance.

--
Darren J Moffat
>> From: Stephan Budach [mailto:stephan.budach at jvm.de]
>>
>>> Just in case this wasn't already clear:
>>>
>>> After scrub sees read or checksum errors, zpool status -v will list the
>>> filenames that are affected. At least in my experience.
>>> --
>>> - Tuomas
>>
>> That didn't do it for me. I ran a scrub, and afterwards zpool status -v
>> didn't show any additional corrupted files, although the same three files
>> were corrupted in a number of snapshots - which, of course, zfs send
>> detected when trying to actually send them.
>
> Budy, we've been over this.
>
> The behavior you experienced is explained by having corrupt data inside a
> hardware RAID: during the scrub you luckily read the good copy of the
> redundant data, and during zfs send you unluckily read the bad copy. This
> is a known problem as long as you use hardware RAID.
>
> It's one of the big selling points, one of the reasons for ZFS to exist. You
> should always give ZFS JBOD devices to work on, so ZFS is able to scrub
> both of the redundant sides of the data, and when a checksum error occurs,
> ZFS is able to detect *and* correct it. Don't use hardware raid.

Edward - I am working on that!

Although, I have to say that I had exactly three files that were corrupt in every snapshot, until I finally deleted them and restored them from their original source. zfs send would abort when trying to send them, while scrub never noticed anything.

If zfs send had managed to send any of these snapshots successfully, or if any of my read attempts on these files had worked one time and failed another, I'd agree. As it is, I can't see how this behaviour could be explained - or rather: what are the chances that only scrub gets the "clean" blocks from the h/w RAIDs, while zfs send or cp always gets the corrupted ones?
> From: Stephan Budach [mailto:stephan.budach at jvm.de]
>
> Although, I have to say that I had exactly three files that were corrupt
> in every snapshot, until I finally deleted them and restored them from
> their original source.
>
> zfs send would abort when trying to send them, while scrub never
> noticed anything.

That cannot be consistently repeatable. If anything will notice corrupt data, scrub will too. The only way you will find corrupt data with something else and not with scrub is... if the corrupt data didn't exist during the scrub.

I'm glad you're working to change the raid setup to JBOD, because, although that's not the only possible explanation, it is the most obvious explanation.
> -----Original Message-----
> From: Darren J Moffat [mailto:darrenm at opensolaris.org]
>
>> It's one of the big selling points, one of the reasons for ZFS to exist. You
>> should always give ZFS JBOD devices to work on, so ZFS is able to scrub
>> both of the redundant sides of the data, and when a checksum error occurs,
>> ZFS is able to detect *and* correct it. Don't use hardware raid.
>
> That isn't the recommended best practice; you are stating it far too
> strongly.
>
> The recommended best practice is to always create ZFS pools with
> redundancy under the control of ZFS. That doesn't require that the back-end
> storage be JBOD or whole disks, nor does it require you not to use hardware
> RAID. Some or all of that is impossible in many cases if you are using a SAN
> or other remote block storage devices - and certainly in the case where the
> SAN is provided by a Sun ZFS Storage appliance.

You're right, though; I'm stating that too strongly. Never say never. And never say always. The truth is exactly as you said: even if you have redundancy in hardware, make sure you also have redundancy in ZFS.

If you let the hardware alone manage redundancy, then, just as Budy has experienced, when corruption is found it is not consistently repeatable, and it can appear anywhere in the storage unit at random. ZFS is unable to isolate the individual failing disk. After enough checksum failures, the whole storage unit will be marked failed and taken offline. So much for your redundancy. It is a problem if your only redundancy is in hardware. It is not a problem if you also have redundancy managed by ZFS.

So a more correct conclusion would be: "whenever possible" don't use hardware RAID, and "whenever possible" use JBOD managed by ZFS. But whatever you do, make sure ZFS has some redundancy it can manage.
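For example, two ways of giving ZFS redundancy it can manage even when the back end is a SAN or hardware RAID; the pool, LUN and dataset names are made up:

    # 1) mirror two LUNs, ideally presented by different arrays or controllers
    zpool create tank mirror c5t0d0 c6t0d0
    # 2) for a pool that is already a single LUN, keep two copies of each block
    #    so ZFS has something to repair from; this survives bad blocks on the
    #    LUN but not the loss of the whole LUN
    zfs set copies=2 tank/important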
On 20.10.10 15:11, Edward Ned Harvey wrote:
>> From: Stephan Budach [mailto:stephan.budach at jvm.de]
>>
>> Although, I have to say that I had exactly three files that were corrupt
>> in every snapshot, until I finally deleted them and restored them from
>> their original source.
>>
>> zfs send would abort when trying to send them, while scrub never
>> noticed anything.
>
> That cannot be consistently repeatable. If anything will notice corrupt
> data, scrub will too. The only way you will find corrupt data with
> something else and not with scrub is... if the corrupt data didn't exist
> during the scrub.

I will do some more scrubbing - it only takes a couple of hours, and then scrub should at least show some of the errors.

When I use zpool clear on that pool, why does zpool status still show the errors that have been encountered? I'd figure it would be a lot easier to track whether scrub finds "new" errors if zpool status -v didn't keep showing the "old" ones.

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.budach at jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
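On the zpool clear question, a hedged sketch of the sequence I would expect; the pool name is an example, and exactly when old entries age out of the permanent-error list differs between releases:

    zpool clear tank          # resets the per-device READ/WRITE/CKSUM counters
    zpool scrub tank          # re-checks every block; the permanent-error list is
                              # re-evaluated from what the scrub actually finds
    zpool status -v tank      # errors belonging to files or snapshots that have
                              # since been repaired or deleted should drop off here,
                              # though on some releases it takes a second scrub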