Adam Leventhal
2009-Sep-03 09:08 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Hey folks,

There are two problems with RAID-Z in builds snv_120 through snv_123, both of which will be resolved in build snv_124. The problems are as follows:

1. Data corruption on a RAID-Z system of any sort (raidz1, raidz2, raidz3) can lead to spurious checksum errors being reported on devices that were not used as part of the reconstruction. These errors are harmless and can be cleared safely (zpool clear <pool>).

2. There is a far more serious problem with single-parity RAID-Z that can lead to data corruption. This data corruption is recoverable as long as no additional data corruption or drive failure occurs. That is to say, data is fine provided there is not an additional problem. The problem is present on all raidz1 configurations that use an odd number of children (disks), e.g. 4+1 or 6+1. Note that raidz1 configurations with an even number of children (e.g. 3+1), raidz2, and raidz3 are unaffected.

The recommended course of action is to roll back to build snv_119 or earlier. If for some reason this is impossible, please email me PRIVATELY and we can discuss the best course of action for you.

After rolling back, initiate a scrub. ZFS will identify and correct these errors, but if enough accumulate it will (incorrectly) identify drives as faulty (which they likely aren't). You can clear these failures (zpool clear <pool>).

Without rolling back, repeated scrubs will eventually remove all traces of the data corruption. You may need to clear checksum failures as they're identified to ensure that enough drives remain online.
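Concretely, the scrub-and-clear sequence looks something like this (tank is just a placeholder pool name; repeat the scrub and clear as needed until no new errors appear):

   zpool status -v tank     # note which devices show checksum errors
   zpool scrub tank
   zpool status -v tank     # wait for the scrub to finish, then review
   zpool clear tank         # clear the spurious checksum error counts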
For reference, here's the bug:

  6869090 filebench on thumper with ZFS (snv_120) raidz causes checksum errors from all drives

Apologies for the bug and for any inconvenience this caused. Below is a technical description of the two issues. This is for interest only and does not contain additional discussion of symptoms or prescriptive action.

Adam

---8<---

1. In situations where a block read from a RAID-Z vdev fails to checksum but there were no errors from any of the child vdevs (e.g. hard drives), we must enter combinatorial reconstruction, in which we attempt every combination of data and parity until we find the correct data. The logic was modified to scale to triple-parity RAID-Z, and in doing so I introduced a bug in which spurious error reports may in some circumstances be generated for vdevs that were not used as part of the data reconstruction. These do not represent actual corruption or problems with the underlying devices and can be ignored and cleared.

2. This one is far subtler and requires an understanding of how RAID-Z writes work. For that I strongly recommend the following blog post from Jeff Bonwick:

  http://blogs.sun.com/bonwick/entry/raid_z

Basically, RAID-Z writes full stripes every time; note that without careful accounting it would be possible to effectively fragment the vdev such that single sectors were free but useless, since single-parity RAID-Z requires two adjacent sectors to store data (one for data, one for parity). To address this, RAID-Z rounds up its allocation to the next multiple of (nparity + 1). This ensures that all space is accounted for. RAID-Z will thus skip sectors that are unused based on this rounding. For example, under raidz1 a write of 1024 bytes would result in 512 bytes of parity, 512 bytes of data on two devices, and 512 bytes skipped.

To improve performance, ZFS aggregates multiple adjacent IOs into a single large IO. Further, hard drives themselves can perform aggregation of adjacent IOs. We noted that these skipped sectors were inhibiting performance, so we added "optional" IOs that could be used to improve aggregation. This yielded a significant performance boost for all RAID-Z configurations.

Another nuance of single-parity RAID-Z is that while it normally lays down stripes as P D D (parity, data, data, ...), it will switch every megabyte to move the parity into the second position (data, parity, data, ...). This was ostensibly to effect the same improvement as between RAID-4 and RAID-5 -- distributed parity. However, RAID-5 actually requires full distribution of parity, and RAID-Z already distributes parity by virtue of the skipped sectors and variable-width stripes. In other words, this was not a particularly valid optimization. It was accordingly discarded for double- and triple-parity RAID-Z; they contain no such swapping.

The implementation of this swapping was not taken into account for the optional IOs, so rather than writing the optional IO into the skipped sector, the optional IO overwrote the first sector of the subsequent stripe with zeros. The aggregation does not always happen, so the corruption is usually not pervasive. Further, raidz1 vdevs with odd numbers of children are more likely to encounter the problem.

Let's say we have a raidz1 vdev with three children. Two writes of 1K each would look like this:

         disks
       0   1   2
     _____________
    |   |   |   |            P = parity
    | P | D | D |   LBAs     D = data
    |___|___|___|     |      X = skipped sector
    |   |   |   |     |
    | X | P | D |     v
    |___|___|___|
    |   |   |   |
    | D | X |   |
    |___|___|___|

The logic for the optional IOs, effectively (though not literally) in this case, would fill in the next LBA on the disk with a 0:

     _____________
    |   |   |   |            P = parity
    | P | D | D |   LBAs     D = data
    |___|___|___|     |      X = skipped sector
    |   |   |   |     |      0 = zero-data from aggregation
    | 0 | P | D |     v
    |___|___|___|
    |   |   |   |
    | D | X |   |
    |___|___|___|

We can see the problem when the parity undergoes the swap described above:

         disks
       0   1   2
     _____________
    |   |   |   |            P = parity
    | D | P | D |   LBAs     D = data
    |___|___|___|     |      X = skipped sector
    |   |   |   |     |      0 = zero-data from aggregation
    | X | 0 | P |     v
    |___|___|___|
    |   |   |   |
    | D | X |   |
    |___|___|___|

Note that the 0 is also incorrectly swapped, thus inadvertently overwriting a data sector in the subsequent stripe. This only occurs if there is IO aggregation, making it much more likely with small, synchronous IOs. It's also only possible with an odd number (N) of child vdevs, since to induce the problem the size of the data written must consume a multiple of N-1 sectors _and_ the total number of sectors used for data and parity must be odd (to create the need for a skipped sector). The number of data sectors is simply size / 512, and the number of parity sectors is ceil(size / 512 / (N-1)).

  1) size / 512 = K * (N-1)
  2) size / 512 + ceil(size / 512 / (N-1)) is odd

  therefore K * (N-1) + K = K * N is odd

If N is even, K * N cannot be odd and therefore the situation cannot arise. If N is odd, it is possible to satisfy (1) and (2).

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
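As a quick sanity check of the size / 512 and K * N argument above, a few lines of shell arithmetic show which write shapes produce a skipped sector. This is a sketch only: the variable names and example values (N=5, SIZE=2048) are illustrative and none of it comes from the ZFS source.

   # raidz1: N child disks, a write of SIZE bytes, 512-byte sectors
   N=5; SIZE=2048; NPARITY=1
   data=$(( SIZE / 512 ))
   parity=$(( (data + N - 2) / (N - 1) ))      # ceil(data / (N-1))
   total=$(( data + parity ))
   alloc=$(( ((total + NPARITY) / (NPARITY + 1)) * (NPARITY + 1) ))
   echo "data=$data parity=$parity allocated=$alloc skipped=$(( alloc - total ))"
   # the dangerous shape: data fills whole (N-1)-sector rows AND the
   # data+parity total is odd (forcing a skipped sector) -- only possible
   # when N is odd, matching the K * N argument above
   [ $(( data % (N - 1) )) -eq 0 ] && [ $(( total % 2 )) -eq 1 ] && \
       echo "this write shape can hit the raidz1 skip-sector bug"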
Gary Gendel
2009-Sep-03 14:45 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Adam,

Thanks for the detailed explanation. The rollback successfully fixed my 5-disk RAID-Z errors. I'll hold off on another upgrade attempt until 124 rolls out.

Fortunately, I didn't do a zfs upgrade right away after installing 121. For those that did, this could be very painful.

Gary

-- 
This message posted from opensolaris.org
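For what it's worth, on an OpenSolaris dev-build install the rollback described above is usually just re-activating an older boot environment. A sketch, with a placeholder BE name (use whatever beadm list shows for your snv_119-or-earlier environment):

   beadm list                       # find a boot environment at snv_119 or earlier
   beadm activate opensolaris-119   # placeholder BE name -- substitute your own
   init 6                           # reboot into it, then scrub the pool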
Roman Naumenko
2009-Sep-03 15:38 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
> Hey folks,
>
> There are two problems with RAID-Z in builds snv_120 through snv_123,
> both of which will be resolved in build snv_124. The problems are as follows:

Thanks for letting us know.

Is there a way to get prompt updates on such issues for OpenSolaris (other than reading a discussion list)? Maybe paid support is the answer? Is there any?

-- 
Roman

-- 
This message posted from opensolaris.org
Roman Naumenko
2009-Sep-03 16:03 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
And a question here: how do you control which dev build version gets installed?

-- 
Roman

-- 
This message posted from opensolaris.org
Jeff Victor
2009-Sep-03 16:10 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Roman Naumenko wrote:
>> Hey folks,
>>
>> There are two problems with RAID-Z in builds snv_120 through snv_123,
>> both of which will be resolved in build snv_124. The problems are as follows:
>
> Thanks for letting us know.
>
> Is there a way to get prompt updates on such issues for OpenSolaris
> (other than reading a discussion list)? Maybe paid support is the
> answer? Is there any?

You can learn about support options for OpenSolaris 2009.06 at
http://www.sun.com/service/opensolaris/index.jsp?intcmp=2166 .
However, AFAIK OpenSolaris 2009.06 does not have the problem being discussed.

The snv_ builds are developer builds, and support contracts are not available for them. So if you want the newest supportable features, choose OpenSolaris (the distro). If you want to test out new features "fresh out of the oven" you can test the snv_ builds.

--JeffV
Collier Minerich
2009-Sep-03 17:15 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Please unsubscribe me.

COLLIER
Yeah, I wouldn't mind knowing that too. With the old snv builds I just downloaded the appropriate image; with OpenSolaris and the development repository, is there any way to pick a particular build?

-- 
This message posted from opensolaris.org
Sorry if this is a FAQ, but I just got a time-sensitive dictum from the higher-ups to disable and remove all remnants of rolling snapshots on our DR filer. Is there a way for me to nuke all snapshots with a single command, or do I have to manually destroy all 600+ snapshots with zfs destroy?

osol 2008.11

thx
jake
On 3 Sep 09, at 19:57, Jacob Ritorto wrote:

> Sorry if this is a FAQ, but I just got a time-sensitive dictum from
> the higher-ups to disable and remove all remnants of rolling
> snapshots on our DR filer. Is there a way for me to nuke all
> snapshots with a single command, or do I have to manually destroy
> all 600+ snapshots with zfs destroy?

zfs list -r -t snapshot -o name -H pool | xargs -tl zfs destroy

should destroy all the snapshots in a pool.

Gaëtan

-- 
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr
Ross Walker
2009-Sep-03 18:29 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
On Sep 3, 2009, at 1:25 PM, Ross <myxiplx at googlemail.com> wrote:

> Yeah, I wouldn't mind knowing that too. With the old snv builds I
> just downloaded the appropriate image; with OpenSolaris and the
> development repository, is there any way to pick a particular build?

I just do a 'pkg list entire' and then install the build I want with a 'pkg install entire-<build>'.

-Ross
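For reference, a hedged sketch of that sequence. The @-version syntax and the 0.5.11-0.124 string (for snv_124) are assumptions -- substitute whatever your configured dev publisher actually offers:

   pkg refresh                        # refresh the catalog from the dev publisher
   pkg list entire                    # shows the build currently installed
   pkg install entire@0.5.11-0.124    # assumed syntax and version for snv_124
   # reboot (or activate the new boot environment) when the update completes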
Gaëtan Lehmann wrote:

> zfs list -r -t snapshot -o name -H pool | xargs -tl zfs destroy
>
> should destroy all the snapshots in a pool

Thanks Gaëtan. I added 'grep auto' to filter on just the rolling snaps and found that xargs wouldn't let me put both flags on the same dash, so:

zfs list -r -t snapshot -o name -H poolName | grep auto | xargs -t -l zfs destroy

worked for me.
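One extra belt-and-braces step worth considering, using the exact commands above (poolName and the 'auto' match are placeholders from that message): review the list before piping it into zfs destroy.

   # preview exactly what will be destroyed
   zfs list -r -t snapshot -o name -H poolName | grep auto
   # then the real run; xargs -t echoes each zfs destroy as it executes
   zfs list -r -t snapshot -o name -H poolName | grep auto | xargs -t -l zfs destroy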
Roman Naumenko
2009-Sep-03 19:26 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
> On Sep 3, 2009, at 1:25 PM, Ross <myxiplx at googlemail.com> wrote:
>
>> Yeah, I wouldn't mind knowing that too. With the old snv builds I
>> just downloaded the appropriate image; with OpenSolaris and the
>> development repository, is there any way to pick a particular build?
>
> I just do a 'pkg list entire' and then install the build I want with
> a 'pkg install entire-<build>'.

Ross, can you provide details? Doesn't it show the latest?

uname -a
SunOS zsan00 5.11 snv_118 i86pc i386 i86pc Solaris

root at zsan00:~# pkg list entire
NAME (PUBLISHER)     VERSION        STATE      UFIX
entire               0.5.11-0.118   installed  u---

root at zsan00:~# pkg refresh
root at zsan00:~# pkg list entire
NAME (PUBLISHER)     VERSION        STATE      UFIX
entire               0.5.11-0.118   installed  u---

-- 
Roman Naumenko

-- 
This message posted from opensolaris.org
Roman Naumenko
2009-Sep-03 19:28 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Hey, web admins, you see what happens when a mailing list is screwed up from the beginning?

-- 
Roman

-- 
This message posted from opensolaris.org
Simon Breden
2009-Sep-03 19:55 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Hi Adam,

Thanks for the info on this. Some people, including myself, reported seeing checksum errors within mirrors too. Is it considered that these checksum errors within mirrors could also be related to this bug, or is there another bug related to checksum errors within mirrors that I should take a look at?

Search for 'mirror' here:
http://opensolaris.org/jive/thread.jspa?threadID=111316&tstart=0

Cheers,
Simon

And good luck with the fix for build 124. Are we talking days or weeks for the fix to be available, do you think? :)

-- 
This message posted from opensolaris.org
Adam Leventhal
2009-Sep-03 20:09 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Hey Simon,

> Thanks for the info on this. Some people, including myself, reported seeing
> checksum errors within mirrors too. Is it considered that these checksum
> errors within mirrors could also be related to this bug, or is there another
> bug related to checksum errors within mirrors that I should take a look at?

Absolutely not. That is an unrelated issue. This problem is isolated to RAID-Z.

> And good luck with the fix for build 124. Are we talking days or weeks for
> the fix to be available, do you think? :)

Days or hours.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Simon Breden
2009-Sep-03 20:46 UTC
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
OK, thanks Adam. I'll look elsewhere for the mirror checksum error issue. In fact there's already a response here, which I shall check up on:

http://opensolaris.org/jive/thread.jspa?messageID=413169#413169

Thanks again, and I look forward to grabbing 124 soon.

Cheers,
Simon

-- 
This message posted from opensolaris.org