I've been using a 5-disk raidz for years on an SXCE machine, which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, and that was fixed.

Now I'm on OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it fixes errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.

fmdump only reports three types of errors:

ereport.fs.zfs.checksum
ereport.io.scsi.cmd.disk.tran
ereport.io.scsi.cmd.disk.recovered

The middle one seems to be the issue; I'd like to track down its source. Any docs on how to do this?

Thanks,
Gary
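A rough sketch of chasing the transport ereports down to a device with fmdump(1M): -e reads the fault-management error log, -V prints each event's full detail (including the device path), and -c filters by class. The awk field position below is an assumption about the default one-line output format, so treat it as a starting point rather than a recipe.

    # full detail (including device path) for just the transport errors
    fmdump -eV -c ereport.io.scsi.cmd.disk.tran | less

    # rough count of logged ereports by class
    fmdump -e | awk 'NR > 1 { print $NF }' | sort | uniq -c

The verbose output should show which disk (or controller path) each transport error came from, which is usually enough to tell a single bad cable from a flaky HBA.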
On Sun, Jan 10, 2010 at 16:40, Gary Gendel <gary at genashor.com> wrote:
> I've been using a 5-disk raidz for years on an SXCE machine, which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, and that was fixed.
>
> Now I'm on OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it fixes errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.

That is actually a good indication of hardware-related errors. Software will do the same thing every time, but hardware errors are often random.

You are running an older build now, though; I would recommend an upgrade.
Mattias Pantzare wrote:
> On Sun, Jan 10, 2010 at 16:40, Gary Gendel <gary at genashor.com> wrote:
>> I've been using a 5-disk raidz for years on an SXCE machine, which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, and that was fixed.
>>
>> Now I'm on OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it fixes errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.
>
> That is actually a good indication of hardware-related errors. Software will do the same thing every time, but hardware errors are often random.
>
> You are running an older build now, though; I would recommend an upgrade.

I would have thought that too if it hadn't started right immediately after the switch from SXCE to OSOL.

As for an upgrade: I use the dev repository on my laptop, and I find that OSOL updates aren't nearly as stable as SXCE was. I tried for a bit but always had to go back to 111b because something crucial broke. I was hoping to wait until the official release in March to let things stabilize. This is my main web/mail/file/etc. server and I don't really want to muck with it too much.

That said, I may take a gamble on upgrading, as we're getting closer to the 2010.x release.

Gary
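For reference, pointing an OpenSolaris image at the dev repository and updating came down to two pkg(1M) commands; this is a sketch assuming the publisher name and repository URL of that era, so verify both before relying on it. pkg image-update builds the new bits into a fresh boot environment, so the old build stays bootable.

    # point the image at the dev repository (URL as it was at the time)
    pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org

    # update into a new boot environment
    pkg image-update

    # list boot environments; if the new build misbehaves,
    # boot back into the old BE from the GRUB menu
    beadm list

The ability to fall back to the previous boot environment is what makes this kind of gamble survivable on a production server.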
I've just run a couple of consecutive scrubs; each time it found a couple of checksum errors, but on different drives. No indication of any other errors.

That a disk scrubs cleanly on a quiescent pool in one run but fails in the next is puzzling. It reminds me of the odd-number-of-disks raidz bug I reported against snv_120.

Looks like I've got to bite the bullet, upgrade to the dev tree, and hope for the best.

Gary
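Repeating scrubs and capturing the result of each pass can be scripted; here is a minimal Bourne-shell sketch, with "tank" standing in as a hypothetical pool name. The "scrub in progress" string is an assumption about the zpool status wording of this era, and the poll interval should be sized to the pool's actual scrub time.

    #!/bin/sh
    # run three consecutive scrubs and log pool status after each pass
    POOL=tank        # hypothetical pool name, substitute the real one

    i=1
    while [ $i -le 3 ]; do
        zpool scrub "$POOL"
        # poll until the scrub completes
        while zpool status "$POOL" | grep "scrub in progress" > /dev/null; do
            sleep 60
        done
        echo "=== scrub pass $i ===" >> /var/tmp/scrub.log
        zpool status -v "$POOL" >> /var/tmp/scrub.log
        i=`expr $i + 1`
    done

Diffing the logged passes makes it easy to see whether the checksum errors really do wander between drives or stick to one device.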
Hi Gary,

You might consider running OSOL on a later build, like build 130.

Have you reviewed the fmdump -eV output to determine which devices the ereports below were generated on? This might give you more clues as to what the issues are.

I would also be curious whether you have any driver-level errors reported in /var/adm/messages or by the iostat -En command. Repeated random problems across disks make me think of cable problems or controller issues.

Thanks,

Cindy

On 01/10/10 08:40, Gary Gendel wrote:
> I've been using a 5-disk raidz for years on an SXCE machine, which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, and that was fixed.
>
> Now I'm on OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it fixes errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.
>
> fmdump only reports three types of errors:
>
> ereport.fs.zfs.checksum
> ereport.io.scsi.cmd.disk.tran
> ereport.io.scsi.cmd.disk.recovered
>
> The middle one seems to be the issue; I'd like to track down its source. Any docs on how to do this?
>
> Thanks,
> Gary
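The driver-level checks suggested here amount to a couple of one-liners; a sketch assuming the stock iostat(1M) output format and the usual syslog location (the grep patterns are guesses at the typical message text, not exact strings):

    # per-device error counters plus model/serial info
    iostat -En

    # show only devices that have accumulated nonzero error counts
    iostat -En | egrep "Errors: [1-9]"

    # recent transport/driver complaints in the system log
    egrep -i "transport|scsi|sata" /var/adm/messages | tail -50

If iostat's Transport Errors counter climbs on one disk while the ZFS checksum errors wander across all of them, that points at shared hardware (cable, backplane, or controller) rather than the disks themselves.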
Thanks for all the suggestions. Now for a strange tale...

I tried upgrading to dev 130 and, as expected, things did not go well. All sorts of permission errors flew by during the upgrade stage, and it would not start X windows. I've heard that things installed from the contrib and extras repositories might cause issues, but I didn't want to spend time with my server offline while I tried to figure this out. So I booted back to 111b, and scrubs still showed errors.

Late in the evening the pool faulted, preventing any backups from the other servers to this pool. Being greeted this morning with the "recover files from backup" status message sent shivers up my spine. This IS my backup.

I exported the pool and then imported it, which succeeded. Now the scrubs run cleanly (at least for a few repeated scrubs spanning several hours).

So, was it hardware? What the heck could have fixed it by just exporting and importing the pool?
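For the record, the export/import cycle described above is just two zpool(1M) subcommands, sketched here with "tank" again standing in for the real pool name. Note that export unmounts every dataset in the pool, so anything served from it goes offline for the duration.

    # cleanly detach the pool (unmounts all of its datasets)
    zpool export tank

    # re-import it; this rereads the on-disk labels and
    # rebuilds the pool's in-core state from scratch
    zpool import tank

    # verify with a fresh scrub
    zpool scrub tank
    zpool status -v tank

That import rebuilds the pool's in-memory state from the on-disk labels is one plausible reason an export/import cycle could clear symptoms that a reboot alone did not, though it would not explain away a genuine hardware fault.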