Gary Gendel
2009-Aug-25 12:29 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
I have a Raid-Z pool of five 500GB disks that has been producing checksum errors right after upgrading SXCE to build 121. They seem to be randomly occurring on all 5 disks, so it doesn't look like a disk failure situation.

Repeatedly running a scrub on the pool repairs anywhere between 20 and a few hundred checksum errors each time.

Since I hadn't physically touched the machine, it seems a very strong coincidence that it started right after I upgraded to 121.

This machine is a SunFire v20z with a Marvell SATA 8-port controller (the same one as in the original thumper). I've seen this kind of problem way back around build 40-50 ish, but haven't seen it after that until now.

Is anyone else experiencing this problem, or does anyone know how to isolate it definitively?

Thanks,
Gary
--
This message posted from opensolaris.org
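A minimal sketch of how the per-device checksum counts described above could be pulled out of `zpool status` after each scrub, so that runs can be compared. The function name `cksum_by_device` is hypothetical, and the parsing assumes the usual `NAME STATE READ WRITE CKSUM` config table; the pool name `archive` is the one from this thread.

```sh
#!/bin/sh
# Hypothetical helper: read `zpool status` output on stdin and print every
# row of the config table whose CKSUM column is not zero.
# Assumes the usual table layout: NAME STATE READ WRITE CKSUM.
cksum_by_device() {
    awk '
        $1 == "NAME" && $5 == "CKSUM"    { in_table = 1; next }  # header row starts the table
        in_table && NF == 0              { in_table = 0 }        # blank line ends the table
        in_table && NF >= 5 && $5 != "0" { print $1, $5 }        # row with a nonzero count
    '
}

# Typical use on a live system (not run here):
#   zpool scrub archive
#   zpool status -v archive | cksum_by_device
```

If the counts land on different devices after every scrub, as reported here, that points away from a single failing disk.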
Henrik Johansson
2009-Aug-25 14:27 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hello,

On 25 aug 2009, at 14.29, Gary Gendel <gary at genashor.com> wrote:
> I have a 5-500GB disk Raid-Z pool that has been producing checksum
> errors right after upgrading SXCE to build 121. They seem to be
> randomly occurring on all 5 disks, so it doesn't look like a disk
> failure situation.
>
> Repeatedly running a scrub on the pools randomly repairs between 20
> and a few hundred checksum errors.
>
> Since I hadn't physically touched the machine, it seems a very
> strong coincidence that it started right after I upgraded to 121.

I had my first checksum errors in almost a year yesterday after upgrading to snv_121 on my filer. I blamed an eSATA device that was not part of the pool. I will do some testing tonight and see if I still get errors.

The machine that got the errors has an Asus M3N78-VM MB (GF8200).

Henrik
http://sparcv9.blogspot.com
Neal Pollack
2009-Aug-25 15:55 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 08/25/09 05:29 AM, Gary Gendel wrote:
> I have a 5-500GB disk Raid-Z pool that has been producing checksum errors right after upgrading SXCE to build 121. They seem to be randomly occurring on all 5 disks, so it doesn't look like a disk failure situation.
>
> Repeatedly running a scrub on the pools randomly repairs between 20 and a few hundred checksum errors.
>
> Since I hadn't physically touched the machine, it seems a very strong coincidence that it started right after I upgraded to 121.
>
> This machine is a SunFire v20z with a Marvell SATA 8-port controller (the same one as in the original thumper). I've seen this kind of problem way back around build 40-50 ish, but haven't seen it after that until now.
>
> Anyone else experiencing this problem or knows how to isolate the problem definitively?
>
> Thanks,
> Gary

My group also upgraded a small server with 6 disks to build 121 and almost immediately all 6 disks were showing between dozens and hundreds of checksum errors.

Neal
Gary Gendel
2009-Aug-27 13:29 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
It looks like it's definitely related to the snv_121 upgrade. I decided to roll back to snv_110 and the checksum errors have disappeared. I'd like to file a bug report, but I don't have any information that might help track this down, just lots of checksum errors.

Looks like I'm stuck at snv_110 until someone figures out what is broken.

If it helps, here is my property list for this pool.

gary at phoenix[~]101>zfs get all archive
NAME     PROPERTY         VALUE                  SOURCE
archive  type             filesystem             -
archive  creation         Mon Jun 18 20:40 2007  -
archive  used             787G                   -
archive  available        1.01T                  -
archive  referenced       125G                   -
archive  compressratio    1.13x                  -
archive  mounted          yes                    -
archive  quota            none                   default
archive  reservation      none                   default
archive  recordsize       128K                   default
archive  mountpoint       /archive               default
archive  sharenfs         off                    default
archive  checksum         on                     default
archive  compression      on                     local
archive  atime            off                    local
archive  devices          on                     default
archive  exec             on                     default
archive  setuid           on                     default
archive  readonly         off                    default
archive  zoned            off                    default
archive  snapdir          hidden                 default
archive  aclmode          groupmask              default
archive  aclinherit       restricted             default
archive  canmount         on                     default
archive  shareiscsi       off                    default
archive  xattr            on                     default
archive  copies           1                      default
archive  version          3                      -
archive  utf8only         off                    -
archive  normalization    none                   -
archive  casesensitivity  sensitive              -
archive  vscan            off                    default
archive  nbmand           off                    default
archive  sharesmb         off                    local
archive  refquota         none                   default
archive  refreservation   none                   default
archive  primarycache     all                    default
archive  secondarycache   all                    default

And each of the sub-pools looks like this:

gary at phoenix[~]101>zfs get all archive/gary
archive/gary  type             filesystem             -
archive/gary  creation         Mon Jun 18 20:56 2007  -
archive/gary  used             141G                   -
archive/gary  available        1.01T                  -
archive/gary  referenced       141G                   -
archive/gary  compressratio    1.22x                  -
archive/gary  mounted          yes                    -
archive/gary  quota            none                   default
archive/gary  reservation      none                   default
archive/gary  recordsize       128K                   default
archive/gary  mountpoint       /archive/gary          default
archive/gary  sharenfs         off                    default
archive/gary  checksum         on                     default
archive/gary  compression      on                     inherited from archive
archive/gary  atime            off                    inherited from archive
archive/gary  devices          on                     default
archive/gary  exec             on                     default
archive/gary  setuid           on                     default
archive/gary  readonly         off                    default
archive/gary  zoned            off                    default
archive/gary  snapdir          hidden                 default
archive/gary  aclmode          groupmask              default
archive/gary  aclinherit       passthrough            local
archive/gary  canmount         on                     default
archive/gary  shareiscsi       off                    default
archive/gary  xattr            on                     default
archive/gary  copies           1                      default
archive/gary  version          3                      -
archive/gary  utf8only         off                    -
archive/gary  normalization    none                   -
archive/gary  casesensitivity  sensitive              -
archive/gary  vscan            off                    default
archive/gary  nbmand           off                    default
archive/gary  sharesmb         name=garybackup        local
archive/gary  refquota         none                   default
archive/gary  refreservation   none                   default
archive/gary  primarycache     all                    default
archive/gary  secondarycache   all                    default
Albert Chin
2009-Aug-27 13:36 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On Thu, Aug 27, 2009 at 06:29:52AM -0700, Gary Gendel wrote:
> It looks like it's definitely related to the snv_121 upgrade. I
> decided to roll back to snv_110 and the checksum errors have
> disappeared. I'd like to issue a bug report, but I don't have any
> information that might help track this down, just lots of checksum
> errors.

So, on snv_121, can you read the files with checksum errors? Is it simply the reporting mechanism that is wrong or are the files really damaged?

--
albert chin (china at thewrittenword.com)
Casper.Dik at Sun.COM
2009-Aug-27 13:39 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
>It looks like it's definitely related to the snv_121 upgrade. I decided to roll
>back to snv_110 and the checksum errors have disappeared. I'd like to issue a
>bug report, but I don't have any information that might help track this down,
>just lots of checksum errors.
>
>Looks like I'm stuck at snv_110 until someone figures out what is broken.
>If it helps, here is my property list for this pool.

There are many components in the stack: zfs, the device drivers and such. What hardware are you using? Perhaps it's an issue with your SATA driver or something else. I've seen no checksum errors on snv_121.

Casper
Adam Leventhal
2009-Aug-27 17:23 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hey Gary,

There appears to be a bug in the RAID-Z code that can generate spurious checksum errors. I'm looking into it now and hope to have it fixed in build 123 or 124. Apologies for the inconvenience.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Gary Gendel
2009-Aug-29 02:25 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Adam,

Super find. Thanks, I thought I was just going crazy until I rolled back to 110 and the errors disappeared. When you do work out a fix, please ping me to let me know when I can try an upgrade again.

Gary
James Lever
2009-Aug-29 03:20 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 28/08/2009, at 3:23 AM, Adam Leventhal wrote:
> There appears to be a bug in the RAID-Z code that can generate
> spurious checksum errors. I'm looking into it now and hope to have
> it fixed in build 123 or 124. Apologies for the inconvenience.

Are the errors being generated likely to cause any significant problem running 121 with a RAID-Z volume or should users of RAID-Z* wait until this issue is resolved?

cheers,
James
Adam Leventhal
2009-Sep-01 23:54 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hi James,

After investigating this problem a bit I'd suggest avoiding deploying RAID-Z until this issue is resolved. I anticipate having it fixed in build 124.

Apologies for the inconvenience.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
James Lever
2009-Sep-02 00:02 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 02/09/2009, at 9:54 AM, Adam Leventhal wrote:
> After investigating this problem a bit I'd suggest avoiding
> deploying RAID-Z until this issue is resolved. I anticipate
> having it fixed in build 124.

Thanks for the status update on this Adam.

cheers,
James
Nigel Smith
2009-Sep-02 09:22 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Adam,

The 'OpenSolaris Development Release Packaging Repository' has recently been updated to release 121.

http://mail.opensolaris.org/pipermail/opensolaris-announce/2009-August/001253.html
http://pkg.opensolaris.org/dev/en/index.shtml

Just to be totally clear, are you recommending that anyone using raidz, raidz2 or raidz3 should not upgrade to that release?

For the people who have already upgraded, presumably the recommendation is that they should revert to a pre-121 BE.

Thanks
Nigel Smith
Daniel Carosone
2009-Sep-02 09:38 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Furthermore, this clarity needs to be posted somewhere much, much more visible than buried in some discussion thread.
Henrik Johansson
2009-Sep-02 09:40 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hi Adam,

On Sep 2, 2009, at 1:54 AM, Adam Leventhal wrote:
> Hi James,
>
> After investigating this problem a bit I'd suggest avoiding
> deploying RAID-Z until this issue is resolved. I anticipate
> having it fixed in build 124.

For those of us who have already upgraded and written data to our raidz pools, are there any risks of inconsistency or wrong checksums in the pool? Is there a bug id?

Regards

Henrik
http://sparcv9.blogspot.com
Frank Middleton
2009-Sep-02 13:27 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 05:40 AM, Henrik Johansson wrote:
> For those of us which have already upgraded and written data to our
> raidz pools, are there any risks of inconsistency, wrong checksums in
> the pool? Is there a bug id?

This may not be a new problem insofar as it may also affect mirrors. As part of the ancient "mirrored drives should not have checksum errors" thread, I used Richard Elling's amazing zcksummon script http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon to help diagnose this (thanks, Richard, for all your help).

The bottom line is that hardware glitches (as found on cheap PCs without ECC on buses and memory) can put ZFS into a mode where it detects bogus checksum errors. If you set copies=2, it seems to always be able to repair them, but they are never actually repaired. Every time you scrub, it finds a checksum error on the affected file(s) and it pretends to repair it (or may fail if you have copies=1 set).

Note: I have not tried this on raidz, only mirrors, where it is highly reproducible. It would be really interesting to see if raidz gets results similar to the mirror case when running zcksummon. Note I have NEVER had this problem on SPARC, only on certain bargain-basement PCs (used as X-Terminals) which as it turns out have mobos notorious for not detecting bus parity errors.

If this is the same problem, you can certainly mitigate it by setting copies=2 and actually copying the files (e.g., by promoting a snapshot, which I believe will do this - can someone confirm?). My guess is that snv121 has done something to make the problem more likely to occur, but the problem itself is quite old (predates snv100). Could you share with us some details of your hardware, especially how much memory and if it has ECC or bus parity?

Cheers -- Frank
Gaëtan Lehmann
2009-Sep-02 14:01 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 2 Sept. 09 at 15:27, Frank Middleton wrote:
> On 09/02/09 05:40 AM, Henrik Johansson wrote:
>
>> For those of us which have already upgraded and written data to our
>> raidz pools, are there any risks of inconsistency, wrong checksums in
>> the pool? Is there a bug id?
>
> This may not be a new problem insofar as it may also affect mirrors.
> As part of the ancient "mirrored drives should not have checksum
> errors thread", I used Richard Elling's amazing zcksummon script
> http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
> to help diagnose this (thanks, Richard, for all your help).
>
> The bottom line is that hardware glitches (as found on cheap PCs
> without ECC on buses and memory) can put ZFS into a mode where it
> detects bogus checksum errors. If you set copies=2, it seems to
> always be able to repair them, but they are never actually repaired.
> Every time you scrub, it finds a checksum error on the affected
> file(s) and it pretends to repair it (or may fail if you have
> copies=1 set).
>
> Note: I have not tried this on raidz, only mirrors, where it is
> highly reproducible. It would be really interesting to see if
> raidz gets results similar to the mirror case when running zcksummon.
> Note I have NEVER had this problem on SPARC, only on certain
> bargain-basement PCs (used as X-Terminals) which as it turns out
> have mobos notorious for not detecting bus parity errors.
>
> If this is the same problem, you can certainly mitigate it by
> setting copies=2 and actually copying the files (e.g., by
> promoting a snapshot, which I believe will do this - can someone
> confirm?). My guess is that snv121 has done something to make
> the problem more likely to occur, but the problem itself is
> quite old (predates snv100). Could you share with us some details
> of your hardware, especially how much memory and if it has ECC
> or bus parity?

I see the same problem on a workstation with ECC RAM and disks in mirror.
The host is a Dell T5500 with 2 cpus and 24 GB of RAM.

Gaëtan

--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66 fax: 01 34 65 29 09
http://voxel.jouy.inra.fr http://www.itk.org
http://www.mandriva.org http://www.bepo.fr
Eric Sproul
2009-Sep-02 14:27 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Adam Leventhal wrote:
> Hi James,
>
> After investigating this problem a bit I'd suggest avoiding deploying
> RAID-Z until this issue is resolved. I anticipate having it fixed in
> build 124.

Adam,

Is it known approximately when this bug was introduced? I have a system running snv_111 with a large raidz2 pool and I keep running into checksum errors though the drives are brand new. They are 2TB drives, but the pool is only about 14% used (~250G/drive across 13 drives). For a drive to develop hundreds of checksum errors at less than 20% capacity seems far above the expected error rate.

Of course, I understand there could be plenty of other reasons for this, but if I can eliminate this issue as a possibility that will help focus my troubleshooting.

Thanks,
Eric
Frank Middleton
2009-Sep-02 14:28 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 10:01 AM, Gaëtan Lehmann wrote:
> I see the same problem on a workstation with ECC RAM and disks in mirror.
> The host is a Dell T5500 with 2 cpus and 24 GB of RAM.

Would you know if it has ECC on the buses? I have no idea if or what Solaris does on x86 to check or correct bus errors, but I vaguely remember seeing a thread about it. Asking, because it really does seem to require a hardware problem to make this happen. Did you try zcksummon?

Cheers -- Frank
Simon Breden
2009-Sep-02 14:34 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
I too see checksum errors occurring for the first time using OpenSolaris 2009.06 on the /dev package repository at version snv_121.

I see the problem occur within a mirrored boot pool (rpool) using SSDs.

Hardware is AMD BE-2350 (ECC) processor with 4GB ECC memory on MCP55 chipset, although SATA is using the mpt driver on a SuperMicro AOC-USAS-L8i controller card.

More here:
http://breden.org.uk/2009/09/02/home-fileserver-handling-pool-errors/

So I'm going to check my other boot environments to see if a rollback makes sense (< snv_121).

Cheers,
Simon
Markus Kovero
2009-Sep-02 14:48 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Please check iostat -xen to see whether there are transport or hw errors generated by, say, device timeouts or bad cables. Consumer disks usually just time out from time to time under load, whereas RE versions usually report an error.

Yours
Markus Kovero
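A rough sketch of the check suggested above: filter `iostat -xen` output down to the devices whose total error count is not zero. The function name `devices_with_errors` is hypothetical, and the sketch assumes the error columns appear as `s/w h/w trn tot device` at the end of the header line; it looks the columns up by name rather than by fixed position.

```sh
#!/bin/sh
# Hypothetical helper: read `iostat -xen` output on stdin and print the
# devices whose total error count ("tot" column) is greater than zero.
# Assumes the header line ends with: s/w h/w trn tot device
devices_with_errors() {
    awk '
        # Until the header is found, look for the "tot" and "device" columns by name.
        !have_cols {
            tot = 0; dev = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "tot")    tot = i
                if ($i == "device") dev = i
            }
            if (tot && dev) have_cols = 1
            next
        }
        # Data rows: soft, hard and transport counts sit just before "tot".
        $tot + 0 > 0 { print $dev, "s/w=" $(tot - 3), "h/w=" $(tot - 2), "trn=" $(tot - 1) }
    '
}

# Typical use on a live system (not run here):
#   iostat -xen | devices_with_errors
```

An empty result would suggest the checksum errors are not accompanied by driver-level soft, hard or transport errors.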
Frank Middleton
2009-Sep-02 14:58 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 10:34 AM, Simon Breden wrote:
> I too see checksum errors occurring for the first time using OpenSolaris 2009.06 on the /dev package repository at version snv_121.
>
> I see the problem occur within a mirrored boot pool (rpool) using SSDs.
>
> Hardware is AMD BE-2350 (ECC) processor with 4GB ECC memory on MCP55 chipset, although SATA is using mpt driver on a SuperMicro AOC-USAS-L8i controller card.
>
> More here:
> http://breden.org.uk/2009/09/02/home-fileserver-handling-pool-errors/

Boy, that looks familiar. Did you try zcksummon to see if the checksums are really being fixed? If it is the same problem I encountered, then they are not, even though the scrub says no errors (and the problem goes back before snv100). Your hardware seems pretty beefy, though.

Note that iostat -Ene never reported any hard errors in my case even though the mobo was known to have problems, so hard errors do not explain the problem.

Cheers -- Frank
Simon Breden
2009-Sep-02 16:03 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Thanks Markus, I'll give that a try.
Simon Breden
2009-Sep-02 16:05 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Cheers Frank, I'll give it a try... also, doesn't sound good if the problem goes back pre snv_100...
Brent Jones
2009-Sep-02 16:15 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On Wed, Sep 2, 2009 at 6:27 AM, Frank Middleton <f.middleton at apogeect.com> wrote:
> On 09/02/09 05:40 AM, Henrik Johansson wrote:
>
>> For those of us which have already upgraded and written data to our
>> raidz pools, are there any risks of inconsistency, wrong checksums in
>> the pool? Is there a bug id?
>
> This may not be a new problem insofar as it may also affect mirrors.
> As part of the ancient "mirrored drives should not have checksum
> errors thread", I used Richard Elling's amazing zcksummon script
> http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
> to help diagnose this (thanks, Richard, for all your help).
>
> The bottom line is that hardware glitches (as found on cheap PCs
> without ECC on buses and memory) can put ZFS into a mode where it
> detects bogus checksum errors. If you set copies=2, it seems to
> always be able to repair them, but they are never actually repaired.
> Every time you scrub, it finds a checksum error on the affected file(s)
> and it pretends to repair it (or may fail if you have copies=1 set).
>
> Note: I have not tried this on raidz, only mirrors, where it is
> highly reproducible. It would be really interesting to see if
> raidz gets results similar to the mirror case when running zcksummon.
> Note I have NEVER had this problem on SPARC, only on certain
> bargain-basement PCs (used as X-Terminals) which as it turns out
> have mobos notorious for not detecting bus parity errors.
>
> If this is the same problem, you can certainly mitigate it by
> setting copies=2 and actually copying the files (e.g., by
> promoting a snapshot, which I believe will do this - can someone
> confirm?). My guess is that snv121 has done something to make
> the problem more likely to occur, but the problem itself is
> quite old (predates snv100). Could you share with us some details
> of your hardware, especially how much memory and if it has ECC
> or bus parity?
>
> Cheers -- Frank

I see this issue on each of my X4540's, 64GB of ECC memory, 1TB drives.
Rolling back to snv_118 does not reveal any checksum errors, only snv_121 does.

So, the commodity hardware theory doesn't hold up here, unless Sun isn't validating their equipment (not likely, as these servers have had no hardware issues prior to this build).

--
Brent Jones
brent at servuhome.net
Richard Elling
2009-Sep-02 16:19 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On Sep 2, 2009, at 2:38 AM, Daniel Carosone wrote:
> Furthermore, this clarity needs to be posted somewhere much, much
> more visible than buried in some discussion thread.

I've added a note in the ZFS Troubleshooting Guide wiki. However, I could not find a public CR. If someone inside Sun can provide a CR number, I'll add that to the reference.
http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Resolving_Software_Problems

-- richard
Richard Elling
2009-Sep-02 16:31 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On Sep 2, 2009, at 6:27 AM, Frank Middleton wrote:
> On 09/02/09 05:40 AM, Henrik Johansson wrote:
>
>> For those of us which have already upgraded and written data to our
>> raidz pools, are there any risks of inconsistency, wrong checksums in
>> the pool? Is there a bug id?
>
> This may not be a new problem insofar as it may also affect mirrors.
> As part of the ancient "mirrored drives should not have checksum
> errors thread", I used Richard Elling's amazing zcksummon script
> http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
> to help diagnose this (thanks, Richard, for all your help).

I believe this is a different problem. Adam, was this introduced in b120?

There is more work that can be leveraged from zcksummon, perhaps I'll get a few spare moments to test and update the procedure in the next few days.

-- richard
Simon Breden
2009-Sep-02 16:35 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hi Richard,

I just took a look at that link and it only mentions problems with RAID-Z vdevs, but some people here, including myself, have checksum errors with mirrors too, so maybe the link could be updated with this info?

Cheers,
Simon
Henrik Johansson
2009-Sep-02 17:13 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hello all,

I have backed down to snv_117. When scrubbing this pool I got my first checksum errors ever on any build except snv_121. I wonder if this is a coincidence or if bad checksums have been generated by snv_121?

So I have been running for 10 months without any checksum errors, I installed snv_121 and got plenty of them, and now I also get them after backing down to snv_117. I will check my hardware after the scrub is completed.

Someone asked what hardware we were using. I have an Asus M3N78-VM (nforce 8200) with ECC protected memory (and I think HT uses CRC?). The pool is a 3 disk raidz.

Henrik
http://sparcv9.blogspot.com
Simon Breden
2009-Sep-02 17:26 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
And in addition to which Solaris version people are using, is it relevant which ZFS level their pool is using?
Frank Middleton
2009-Sep-02 17:58 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 12:31 PM, Richard Elling wrote:
> I believe this is a different problem. Adam, was this introduced in b120?

Doubtless you are correct as usual. However, if this is a new problem, how did it get through Sun's legendary testing process unless it is (as you have always maintained) triggered by a hardware problem? If so, I believe that any new CR would be regarded as a duplicate of any CR that described the problem you and I researched, even if they have different root causes. Of course this seems to be new as of snv121, so one can only speculate that it might be a latent problem or a new one. Do you think that there are separate mirror vs. raidz issues?

> There is more work that can be leveraged from zcksummon, perhaps
> I'll get a few spare moments to test and update the procedure in the next
> few days.

If you think it would be relevant, you know I can reproduce this at will. I wonder if any Sun hardware users have experienced this problem. So far IIRC the only reports are Asus and Dell. Does anyone else recollect the thread about how Solaris does (or does not) do bus error checking on x86?

Cheers -- Frank
Edward Pilatowicz
2009-Sep-02 18:00 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
hey richard,

so i just got a bunch of zfs checksum errors after replacing some mirrored disks on my desktop (u27). i originally blamed the new disks, until i saw this thread, at which point i started digging in bugster. i found the following related bugs (i'm not sure which one adam was referring to):

6847180 Status of newly replaced drive became faulted due to checksum errors after scrub in raidz1 pool
http://bugs.opensolaris.org/view_bug.do?bug_id=6847180

6869090 on thumper with ZFS (snv_120) raidz causes checksum errors from all drives
http://bugs.opensolaris.org/view_bug.do?bug_id=6869090

i think the issue i'm seeing may be 6847180. reading through the bug, i get the impression that it can affect disks in mirrors as well as raidz configurations.

to complicate the situation, i just upgraded from snv_121 to snv_122. the initial checksum errors i saw after resilvering were on snv_121. i'm not seeing any new errors with snv_122. (of course i haven't tried a new resilvering operation since upgrading to snv_122, i'll probably do that tomorrow.)

i've zpool clear'ed the problem, did a scrub, and things look ok. i'm currently testing the pool by doing more scrubs + builds on it to see if i get any more errors.

ed
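The clear-then-scrub cycle described above can be sketched as a small wrapper. The names `run`, `recheck_pool` and `DRY_RUN` are hypothetical conveniences, not part of ZFS; only `zpool clear`, `zpool scrub` and `zpool status -v` are real subcommands. Setting `DRY_RUN` prints the commands without touching a pool.

```sh
#!/bin/sh
# Hypothetical wrapper: print each command, and run it only when DRY_RUN
# is unset or empty.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# Hypothetical helper: reset the error counters on a pool, start a scrub,
# then show the pool state.
recheck_pool() {
    pool=$1
    run zpool clear "$pool"      # reset the per-device error counters
    run zpool scrub "$pool"      # start verifying every block (returns immediately)
    run zpool status -v "$pool"  # show scrub progress, counters and affected files
}

# Typical use on a live system (not run here):
#   recheck_pool rpool
```

Because the scrub continues in the background, the final status call shows progress rather than the result; running `zpool status -v` again once the scrub completes shows whether new checksum errors appeared.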
Bob Friesenhahn
2009-Sep-02 18:10 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On Wed, 2 Sep 2009, Frank Middleton wrote:
> On 09/02/09 12:31 PM, Richard Elling wrote:
>
>> I believe this is a different problem. Adam, was this introduced in b120?
>
> Doubtless you are correct as usual. However, if this is a new problem,
> how did it get through Sun's legendary testing process unless it is

These snv releases do not go through "Sun's legendary testing process". They only go through simple sanity checks which take a week or two rather than a year. Instead, OpenSolaris users become part of a new testing process.

I imagine that Solaris 11 will go through "Sun's legendary testing process" but with the considerable benefit that much of it will already have gone through the OpenSolaris trial-by-fire process.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jeff Victor
2009-Sep-02 18:17 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Bob Friesenhahn wrote:
> On Wed, 2 Sep 2009, Frank Middleton wrote:
>> On 09/02/09 12:31 PM, Richard Elling wrote:
>>> I believe this is a different problem. Adam, was this introduced in
>>> b120?
>> Doubtless you are correct as usual. However, if this is a new problem,
>> how did it get through Sun's legendary testing process unless it is
> These snv releases do not go through "Sun's legendary testing
> process". They only go through simple sanity checks which take a week
> or two rather than a year. Instead, OpenSolaris users become part of
> a new testing process.
>
> I imagine that Solaris 11 will go through "Sun's legendary testing
> process" but with the considerable benefit that much of it will
> already have gone through the OpenSolaris trial-by-fire process.

Just to expand on that: there are now three levels of testing (and therefore stability) in [Open]Solaris:

* Nevada builds - I don't know the details, but it's what BobF referred to with "simple sanity checks" and, I think, what he meant by "OpenSolaris users become part of a new testing process."

* OpenSolaris distro (e.g. 2009.06) - this goes through significant testing, but not as much as Solaris 10 updates. OpenSolaris users (that is, users of the OpenSolaris distro) benefit from this testing.

* Solaris 10 goes through "Sun's legendary testing process" :-)

--JeffV
Frank Middleton
2009-Sep-02 20:02 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 02:17 PM, Jeff Victor wrote:
> Just to expand on that: there are now three levels of testing (and
> therefore stability) in [Open]Solaris:
> * Nevada builds - I don't know the details, but it's what BobF referred
> to with "simple sanity checks" and, I think, what he meant by
> "OpenSolaris users become part of a new testing process."
>
> * OpenSolaris distro (e.g. 2009.06) - this goes through significant
> testing, but not as much as Solaris 10 updates. OpenSolaris users (that
> is, users of the OpenSolaris distro) benefit from this testing.
>
> * Solaris 10 goes through "Sun's legendary testing process" :-)

OK, I stand corrected. So the new snv121 checksum bug somehow made it through the "simple sanity checks". Based on this thread, I wonder if it is still doing so (my intuition is that the problem still doesn't show up on Sun hardware). No doubt there's someone out there itching to prove me wrong :-)

Note that the "old" checksum bug evidently hasn't shown up much at all, although with the right (grotty) hardware it is quite reproducible even though iostat -Ene shows no hard errors at all...

In the context of bug id 6848079, the only time new files get added to the list of the invisible checksum errors is after reboot of an otherwise read only file system. The new files show up with a checksum failure that a scrub clears, but zcksummon shows that scrub still finds them with checksum failures and supposedly repairs them (until next time). What's the betting (lottery aside) that fixing 6848079 will also fix the problem I found? Note also that 6848079 was reported against snv115. Baffling.

My question here is - if this bug isn't triggered by some kind of (soft) hardware glitch, how come it isn't affecting more systems? After all, you have to reboot to actually run snv121, and there must be quite a few folks who use ZFS who must have done so by now.
Bob Friesenhahn
2009-Sep-02 21:13 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On Wed, 2 Sep 2009, Frank Middleton wrote:
>
> OK, I stand corrected. So the new snv121 checksum bug somehow made it
> through the "simple sanity checks". Based on this thread, I wonder if
> it is still doing so (my intuition is that the problem still doesn't
> show up on Sun hardware). No doubt there's someone out there itching
> to prove me wrong :-)

I have seen few people more prone to unsubstantiated conjecture than you. The raidz checksum code was recently reworked to add raidz3. It seems likely that a subtle bug was added at that time.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Adam Leventhal
2009-Sep-02 21:22 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hey Bob,

> I have seen few people more prone to unsubstantiated conjecture than you.
> The raidz checksum code was recently reworked to add raidz3. It seems
> likely that a subtle bug was added at that time.

That appears to be the case. I'm investigating the problem and hope to have an update to the list either later today or tomorrow.

Adam
--
Adam Leventhal, Fishworks
http://blogs.sun.com/ahl
Tim Cook
2009-Sep-02 22:54 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On Wed, Sep 2, 2009 at 3:02 PM, Frank Middleton <f.middleton at apogeect.com> wrote:
> On 09/02/09 02:17 PM, Jeff Victor wrote:
>> Just to expand on that: there are now three levels of testing (and
>> therefore stability) in [Open]Solaris:
>> * Nevada builds - I don't know the details, but it's what BobF referred
>> to with "simple sanity checks" and, I think, what he meant by
>> "OpenSolaris users become part of a new testing process."
>>
>> * OpenSolaris distro (e.g. 2009.06) - this goes through significant
>> testing, but not as much as Solaris 10 updates. OpenSolaris users (that
>> is, users of the OpenSolaris distro) benefit from this testing.
>>
>> * Solaris 10 goes through "Sun's legendary testing process" :-)
>
> OK, I stand corrected. So the new snv121 checksum bug somehow made it
> through the "simple sanity checks". Based on this thread, I wonder if
> it is still doing so (my intuition is that the problem still doesn't
> show up on Sun hardware). No doubt there's someone out there itching
> to prove me wrong :-)
>
> Note that the "old" checksum bug evidently hasn't shown up much at
> all, although with the right (grotty) hardware it is quite reproducible
> even though iostat -Ene shows no hard errors at all...
>
> In the context of bug id 6848079, the only time new files get added
> to the list of the invisible checksum errors is after reboot of
> an otherwise read only file system. The new files show up with
> a checksum failure that a scrub clears, but zcksummon shows that
> scrub still finds them with checksum failures and supposedly
> repairs them (until next time). What's the betting (lottery
> aside) that fixing 6848079 will also fix the problem I found?
> Note also that 6848079 was reported against snv115. Baffling.
>
> My question here is - if this bug isn't triggered by some kind
> of (soft) hardware glitch, how come it isn't affecting more systems?
> After all, you have to reboot to actually run snv121, and there
> must be quite a few folks who use ZFS who must have done so by now.

Define "more systems". How many people do you think are on 121? And of those, how many are on the zfs mailing list? And of those, how many have done a scrub recently to see the checksum errors? Do you have some proof to validate your beliefs?

REGARDLESS, had you read all the posts to this thread, you'd know you've already been proven wrong:

On Wed, Sep 2, 2009 at 11:15 AM, Brent Jones <brent at servuhome.net> wrote:
> I see this issue on each of my X4540's, 64GB of ECC memory, 1TB drives.
> Rolling back to snv_118 does not reveal any checksum errors, only snv_121
>
> So, the commodity hardware here doesn't hold up, unless Sun isn't
> validating their equipment (not likely, as these servers have had no
> hardware issues prior to this build)
>
> --
> Brent Jones
> brent at servuhome.net

--Tim
Chris Csanady
2009-Sep-03 01:52 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
2009/9/2 Eric Sproul <esproul at omniti.com>:
>
> Adam,
> Is it known approximately when this bug was introduced? I have a system running
> snv_111 with a large raidz2 pool and I keep running into checksum errors though
> the drives are brand new. They are 2TB drives, but the pool is only about 14%
> used (~250G/drive across 13 drives). For a drive to develop hundreds of
> checksum errors at less than 20% capacity seems far above the expected error rate.

This may be 6826470, which was present for some time and was fixed in b114. If you have replaced a device on b111, you will see a lot of checksum errors, even after the resilver completes. In fact, when I scrubbed my pool it encountered so many that it transitioned the vdev to a faulted state. (I had to run zpool clear periodically in a loop to allow the scrub to finish.) See the details at:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6826470

Chris
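The workaround Chris describes (clearing the error counters periodically so that a scrub can run to completion) could be sketched roughly as follows. This is only an illustration in Python, not something posted in the thread: the pool name "tank", the five-minute interval, and the helper names are hypothetical, and it assumes that `zpool status` prints the phrase "scrub in progress" while a scrub is running.

```python
import subprocess
import time


def scrub_in_progress(pool, run=subprocess.run):
    """Return True while `zpool status <pool>` reports a running scrub.

    Assumption: the status text contains the phrase "scrub in progress".
    """
    result = run(["zpool", "status", pool],
                 capture_output=True, text=True, check=True)
    return "scrub in progress" in result.stdout


def clear_until_scrub_done(pool, interval=300, run=subprocess.run,
                           sleep=time.sleep):
    """Run `zpool clear <pool>` every `interval` seconds until the scrub ends.

    Returns the number of times the error counters were cleared.
    `run` and `sleep` are injectable so the loop can be exercised
    without a real pool.
    """
    clears = 0
    while scrub_in_progress(pool, run=run):
        run(["zpool", "clear", pool], check=True)
        clears += 1
        sleep(interval)
    return clears


# Hypothetical usage, after starting a scrub with `zpool scrub tank`:
#     clear_until_scrub_done("tank")
```

Clearing the counters only stops the vdev from being faulted while the scrub runs; it does nothing about the underlying cause of the checksum errors.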
Simon Breden
2009-Sep-03 19:59 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
So what's the consensus on checksum errors appearing within mirror vdevs? Is it caused by the same bug announced by Adam, or is something else causing it? If so, what's the bug id?

Cheers,
Simon
Gaëtan Lehmann
2009-Sep-03 20:28 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 3 Sept. 09 at 21:59, Simon Breden wrote:
> So what's the consensus on checksum errors appearing within mirror
> vdevs?
> Is it caused the same bug announced by Adam, or is something else
> causing it?
> If so, what's the bug id?

Sorry, I forgot to report the end of my experiments. They showed that the checksum errors I've seen are unlikely to be related to snv 121. They were more likely caused by the iommu bug on Intel platforms, which led to several crashes some time ago, combined with a lack of scrubs in recent days.

After the errors reported during the scrub on snv 121, I ran a scrub on snv 118 and found the same number of errors, all on rpool/dump. I dropped that zvol and reran the scrub, still on snv 118, without any errors. After a reboot on snv 121 and a new scrub, no checksum errors were reported.

Regards,
Gaëtan
--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66 fax: 01 34 65 29 09
http://voxel.jouy.inra.fr http://www.itk.org http://www.mandriva.org http://www.bepo.fr
Simon Breden
2009-Sep-03 21:18 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Thanks Gaëtan.

What's the bug id for this iommu bug on Intel platforms?

In my case, I have an AMD processor with ECC RAM, so it's probably not related to the Intel iommu bug. I'm seeing the checksum errors in a mirrored rpool using SSDs, so maybe it could be something like cosmic rays causing occasional random bits to flip?

After clearing the errors and scrubbing the pool a couple of times until the errors were fixed, I have not seen any new checksum errors. I'm using 121 at the moment, though I should probably drop back to 117 to avoid the RAID-Z bug, although I have a RAID-Z2 vdev and not a RAID-Z1 vdev, so I should not encounter the more serious problem mentioned.

> After the errors reported during the scrub on snv 121, I run a scrub on snv 118 and find the same
> amount of error, all on rpool/dump. I dropped that zvol, rerun the scrub again still on snv 118
> without any error. After a reboot on snv 121 and a new scrub, no checksum error are reported.

You did # zfs destroy rpool/dump ?
Frank Middleton
2009-Sep-04 03:30 UTC
[zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
It was someone from Sun that recently asked me to repost here about the checksum problem on mirrored drives. I was reluctant to do so because you and Bob might start flames again, and you did! You both sound very defensive, but of course I would never make an unsubstantiated speculation that you might have vulnerable hardware :-). But in case you do, please don't shoot the messenger...

Instead of being negative, how about some conjectures of your own about this? Here's a summary of what is happening:

An old machine with mirrored drives and a suspect mobo (maybe not checking PCI parity) gets checksum errors on reboot and scrub. With copies=1 it fails to repair them. With copies=2 it apparently fixes them, but zcksummon shows quite clearly that zfs finds and repairs them again on every scrub, even though scrub shows no errors. Typically these files are system libraries, and unless you actually replace them, they are never truly repaired.

Although I really don't think this is caused by cosmic rays, are you also saying that PCs without ECC on memory and/or buses will *never* experience a glitch? You obviously don't play the lottery :-) [ZFS errors due to memory hits seem far more likely than winning a 6 ball lottery for typical retail consumer loads]

On 09/02/09 06:54 PM, Tim Cook wrote:
> Define "more systems". How many people do you think are on 121? And of

Absolutely no idea. Enough, though.

> those, how many are on the zfs mailing list? And of those, how many

Probably - all of them (yes, this is an unsubstantiated speculation).

> have done a scrub recently to see the checksum errors? Do you have some
> proof to validate your beliefs?

If you had read the thread carefully, you would note that a scrub actually clears the errors (but zcksummon shows that they really aren't cleared). And doesn't the guide tell us to run scrubs frequently? I am sure we all dutifully do so :-). I'd be quite happy to send you the proof.

> REGARDLESS, had you read all the posts to this thread, you'd know you've
> already been proven wrong:

Wrong about what? Reading posts before they are posted? I have read every post most carefully. Having experienced checksum failures on mirrored drives for 4 months now (and there's a CR against snv115 for a similar problem), what exactly do you think I am trying to prove, or what beliefs? After 4 months of hearing the hardware being blamed for the checksum problem (which is easy to reproduce against snv111b), all I'm doing is agreeing that it is likely triggered by some kind of soft hardware glitch; we just don't know what the glitch might be.

The SPoFs on this machine are the disk controller, the PCI bus, and memory (and cpu, of course). Take your pick. FWIW it always picks on SUNWcsl (libdlpi.so.1) - 3 or 4 times now - and more recently, /usr/share/doc/SUNWmusicbrainz/COPYING.bz2. I am skeptical that the disk controller is picking on certain files, so that leaves memory and the bus. Take your pick. New files get added to the list quite infrequently. But it could also be a pure software bug - some kind of race condition, perhaps.

> On Wed, Sep 2, 2009 at 11:15 AM, Brent Jones <brent at servuhome.net
> <mailto:brent at servuhome.net>> wrote:
> I see this issue on each of my X4540's, 64GB of ECC memory, 1TB drives.
> Rolling back to snv_118 does not reveal any checksum errors, only
> snv_121
>
> So, the commodity hardware here doesn't hold up, unless Sun isn't
> validating their equipment (not likely, as these servers have had no
> hardware issues prior to this build)

Exactly. My whole point. Glad to hear that Sun hardware is as reliable as ever! I hope Richard's new and improved zcksummon will shed more light on this...

Cheers -- Frank
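For readers unfamiliar with the copies setting Frank refers to: it is a per-dataset ZFS property that accepts 1, 2, or 3, and a change applies only to data written afterwards. A minimal sketch of setting it from Python follows; the dataset name "rpool/ROOT" and both helper names are hypothetical and chosen only for illustration.

```python
import subprocess

# The copies property accepts only these values.
VALID_COPIES = (1, 2, 3)


def set_copies_command(dataset, copies):
    """Build the argument list for `zfs set copies=N <dataset>`."""
    if copies not in VALID_COPIES:
        raise ValueError(f"copies must be one of {VALID_COPIES}, got {copies!r}")
    return ["zfs", "set", f"copies={copies}", dataset]


def set_copies(dataset, copies, run=subprocess.run):
    """Apply the property; only blocks written afterwards get extra copies."""
    return run(set_copies_command(dataset, copies), check=True)


# Hypothetical usage:
#     set_copies("rpool/ROOT", 2)
```

Because the property is not retroactive, files that already exist keep a single copy until they are rewritten.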