I've been using a 5-disk raidz for years on an SXCE machine, which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, and that was fixed.

Now I'm on OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it fixes errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.

fmdump only reports three types of errors:

ereport.fs.zfs.checksum
ereport.io.scsi.cmd.disk.tran
ereport.io.scsi.cmd.disk.recovered

The middle one seems to be the issue; I'd like to track down its source. Any docs on how to do this?

Thanks,
Gary
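A rough sketch of chasing the transport ereports down to a device with fmdump(1M): -e reads the fault-management error log, -V prints each event's full detail (including the device path), and -c filters by class. The awk field position below is an assumption about the default one-line output format, so treat it as a starting point rather than a recipe.

    # full detail (including device path) for just the transport errors
    fmdump -eV -c ereport.io.scsi.cmd.disk.tran | less

    # rough count of logged ereports by class
    fmdump -e | awk 'NR > 1 { print $NF }' | sort | uniq -c

The verbose output should show which disk (or controller path) each transport error came from, which is usually enough to tell a single bad cable from a flaky HBA.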
On Sun, Jan 10, 2010 at 16:40, Gary Gendel <gary at genashor.com> wrote:
> I've been using a 5-disk raidz for years on an SXCE machine, which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, and that was fixed.
>
> Now I'm on OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it fixes errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.

That is actually a good indication of hardware-related errors. Software will do the same thing every time, but hardware errors are often random.

You are running an older build now, though; I would recommend an upgrade.
Mattias Pantzare wrote:
> On Sun, Jan 10, 2010 at 16:40, Gary Gendel <gary at genashor.com> wrote:
>> I've been using a 5-disk raidz for years on an SXCE machine, which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, and that was fixed.
>>
>> Now I'm on OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it fixes errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.
>
> That is actually a good indication of hardware-related errors. Software will do the same thing every time, but hardware errors are often random.
>
> You are running an older build now, though; I would recommend an upgrade.

I would have thought that too if it hadn't started right immediately after the switch from SXCE to OSOL.

As for an upgrade: I use the dev repository on my laptop, and I find that OSOL updates aren't nearly as stable as SXCE was. I tried for a bit but always had to go back to 111b because something crucial broke. I was hoping to wait until the official release in March to let things stabilize. This is my main web/mail/file/etc. server and I don't really want to muck with it too much.

That said, I may take a gamble on upgrading, as we're getting closer to the 2010.x release.

Gary
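For reference, pointing an OpenSolaris image at the dev repository and updating came down to two pkg(1M) commands; this is a sketch assuming the publisher name and repository URL of that era, so verify both before relying on it. pkg image-update builds the new bits into a fresh boot environment, so the old build stays bootable.

    # point the image at the dev repository (URL as it was at the time)
    pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org

    # update into a new boot environment
    pkg image-update

    # list boot environments; if the new build misbehaves,
    # boot back into the old BE from the GRUB menu
    beadm list

The ability to fall back to the previous boot environment is what makes this kind of gamble survivable on a production server.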
I've just run a couple of consecutive scrubs; each time it found a couple of checksum errors, but on different drives. No indication of any other errors.

That a disk scrubs cleanly on a quiescent pool in one run but fails in the next is puzzling. It reminds me of the odd-number-of-disks raidz bug I reported against snv_120.

Looks like I've got to bite the bullet, upgrade to the dev tree, and hope for the best.

Gary
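Repeating scrubs and capturing the result of each pass can be scripted; here is a minimal Bourne-shell sketch, with "tank" standing in as a hypothetical pool name. The "scrub in progress" string is an assumption about the zpool status wording of this era, and the poll interval should be sized to the pool's actual scrub time.

    #!/bin/sh
    # run three consecutive scrubs and log pool status after each pass
    POOL=tank        # hypothetical pool name, substitute the real one

    i=1
    while [ $i -le 3 ]; do
        zpool scrub "$POOL"
        # poll until the scrub completes
        while zpool status "$POOL" | grep "scrub in progress" > /dev/null; do
            sleep 60
        done
        echo "=== scrub pass $i ===" >> /var/tmp/scrub.log
        zpool status -v "$POOL" >> /var/tmp/scrub.log
        i=`expr $i + 1`
    done

Diffing the logged passes makes it easy to see whether the checksum errors really do wander between drives or stick to one device.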
Hi Gary,

You might consider running OSOL on a later build, like build 130.

Have you reviewed the fmdump -eV output to determine which devices the ereports below were generated on? This might give you more clues as to what the issues are.

I would also be curious whether you have any driver-level errors reported in /var/adm/messages or by the iostat -En command. Repeated random problems across disks make me think of cable problems or controller issues.

Thanks,

Cindy

On 01/10/10 08:40, Gary Gendel wrote:
> I've been using a 5-disk raidz for years on an SXCE machine, which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, and that was fixed.
>
> Now I'm on OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it fixes errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.
>
> fmdump only reports three types of errors:
>
> ereport.fs.zfs.checksum
> ereport.io.scsi.cmd.disk.tran
> ereport.io.scsi.cmd.disk.recovered
>
> The middle one seems to be the issue; I'd like to track down its source. Any docs on how to do this?
>
> Thanks,
> Gary
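The driver-level checks suggested here amount to a couple of one-liners; a sketch assuming the stock iostat(1M) output format and the usual syslog location (the grep patterns are guesses at the typical message text, not exact strings):

    # per-device error counters plus model/serial info
    iostat -En

    # show only devices that have accumulated nonzero error counts
    iostat -En | egrep "Errors: [1-9]"

    # recent transport/driver complaints in the system log
    egrep -i "transport|scsi|sata" /var/adm/messages | tail -50

If iostat's Transport Errors counter climbs on one disk while the ZFS checksum errors wander across all of them, that points at shared hardware (cable, backplane, or controller) rather than the disks themselves.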
Thanks for all the suggestions. Now for a strange tale...

I tried upgrading to dev 130 and, as expected, things did not go well. All sorts of permission errors flew by during the upgrade stage, and it would not start X windows. I've heard that things installed from the contrib and extras repositories might cause issues, but I didn't want to spend time with my server offline while I tried to figure this out. So I booted back to 111b, and scrubs still showed errors.

Late in the evening the pool faulted, preventing any backups from the other servers to this pool. Being greeted this morning with the "recover files from backup" status message sent shivers up my spine. This IS my backup.

I exported the pool and then imported it, which succeeded. Now the scrubs run cleanly (at least for a few repeated scrubs spanning several hours).

So, was it hardware? What the heck could have fixed it by just exporting and importing the pool?
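For the record, the export/import cycle described above is just two zpool(1M) subcommands, sketched here with "tank" again standing in for the real pool name. Note that export unmounts every dataset in the pool, so anything served from it goes offline for the duration.

    # cleanly detach the pool (unmounts all of its datasets)
    zpool export tank

    # re-import it; this rereads the on-disk labels and
    # rebuilds the pool's in-core state from scratch
    zpool import tank

    # verify with a fresh scrub
    zpool scrub tank
    zpool status -v tank

That import rebuilds the pool's in-memory state from the on-disk labels is one plausible reason an export/import cycle could clear symptoms that a reboot alone did not, though it would not explain away a genuine hardware fault.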