[this seems to be the question of the day, today...]
On Apr 14, 2010, at 2:57 AM, bonso wrote:
> Hi all,
> I recently experienced a disk failure on my home server and observed
> checksum errors while resilvering the pool and on the first scrub after the
> resilver had completed. Now everything seems fine, but I'm posting this
> to get help with calming my nerves and detecting any possible future faults.
>
> Let's start with some specs.
> OSOL 2009.06
> Intel SASUC8i (w LSI 1.30IT FW)
> Gigabyte MA770-UD3 mobo w 8GB ECC RAM
> Hitachi P7K500 harddrives
>
> When checking the condition of my pool some days ago (yes, I should make it
> mail me if something like this happens again), one disk in my pool was labeled
> as "Removed" with a small number of read errors, nine-ish I think; all
> other disks were fine. I removed and tested the drive (DFT crashed, so the disk
> seemed very broken), replaced it, and started a resilver.
>
> Checking the status of the resilver, everything looked good from the start,
> but when it was finished the status report looked like this:
> pool: sasuc8i
> state: ONLINE
> status: One or more devices has experienced an unrecoverable error. An
> attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> using 'zpool clear' or replace the device with 'zpool replace'.
> see: http://www.sun.com/msg/ZFS-8000-9P
> scrub: resilver completed after 4h9m with 0 errors on Mon Apr 12 18:12:26 2010
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         sasuc8i      ONLINE       0     0     0
>           raidz2     ONLINE       0     0     0
>             c12t4d0  ONLINE       0     0     5  108K resilvered
>             c12t8d0  ONLINE       0     0     0  254G resilvered
>             c12t6d0  ONLINE       0     0     0
>             c12t7d0  ONLINE       0     0     0
>             c12t0d0  ONLINE       0     0     1  21.5K resilvered
>             c12t1d0  ONLINE       0     0     2  43K resilvered
>             c12t2d0  ONLINE       0     0     4  86K resilvered
>             c12t3d0  ONLINE       0     0     1  21.5K resilvered
>
> errors: No known data errors
>
> All I really cared about at this point was the "Applications are
> unaffected" and "No known data errors", and I thought that the
> checksum errors might be down to the failing drive (c12t5d0 failed; the
> controller labeled the new drive as c12t8d0) going out during a write. Then
> again, ZFS is atomic, so I cleared the errors and ran a scrub. It came out
> like this:
> pool: sasuc8i
> state: ONLINE
> status: One or more devices has experienced an unrecoverable error. An
> attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> using 'zpool clear' or replace the device with 'zpool replace'.
> see: http://www.sun.com/msg/ZFS-8000-9P
> scrub: scrub completed after 1h16m with 0 errors on Tue Apr 13 01:29:32 2010
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         sasuc8i      ONLINE       0     0     0
>           raidz2     ONLINE       0     0     0
>             c12t4d0  ONLINE       0     0     5
>             c12t8d0  ONLINE       0     0     0
>             c12t6d0  ONLINE       0     0     0
>             c12t7d0  ONLINE       0     0     4  86K repaired
>             c12t0d0  ONLINE       0     0     1
>             c12t1d0  ONLINE       0     0     6  86K repaired
>             c12t2d0  ONLINE       0     0     4
>             c12t3d0  ONLINE       0     0     6  108K repaired
>
> errors: No known data errors
>
> Now I'm getting nervous. Checksum errors, some repaired, others
> not. Am I going to end up with multiple drive failures or what the * is going on
> here?
When I see many disks suddenly reporting errors, I suspect a common
element: HBA, cables, backplane, mobo, CPU, power supply, etc.
If you search the zfs-discuss archives you can find instances where
HBA firmware, driver issues, or firmware+driver interactions caused
such reports. Cabling and power supplies are less commonly reported.
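A rough way to spot that pattern automatically is to count how many leaf devices show a nonzero CKSUM column in the `zpool status` output. The sketch below is only an illustration: the function names, the `min_devices` threshold of 2, and the list of grouping vdev prefixes are my own assumptions, and it only understands plain integer counters.

```python
import re

# One row of the config table: NAME STATE READ WRITE CKSUM [notes].
# Assumption: counters are plain integers; abbreviated counts are not handled.
ROW = re.compile(r"^(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")

# Assumption: names starting with these prefixes are grouping vdevs, not disks.
GROUPING = ("raidz", "mirror", "spare", "log", "cache")


def cksum_errors(status_text):
    """Return {device: CKSUM count} for leaf devices with a nonzero count."""
    pool = None
    errors = {}
    for raw in status_text.splitlines():
        # Tolerate mail quoting and indentation in front of each line.
        line = raw.lstrip("> \t")
        if line.startswith("pool:"):
            pool = line.split(":", 1)[1].strip()
            continue
        match = ROW.match(line)
        if not match:
            continue
        name, cksum = match.group(1), int(match.group(5))
        if name == pool or name.startswith(GROUPING):
            continue  # skip the pool row and grouping vdevs such as raidz2
        if cksum > 0:
            errors[name] = cksum
    return errors


def suspect_common_element(errors, min_devices=2):
    """True when errors are spread over several devices (hypothetical threshold)."""
    return len(errors) >= min_devices
```

Fed the resilver report quoted above, `cksum_errors` would pick out five of the eight disks, which is the kind of spread that points away from a single bad drive and toward something they share.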
> Ran one more scrub and everything came up roses.
> Checked SMART status on the drives with checksum errors and they are fine,
> although I expect only read/write errors would show up there.
>
> I'm not sure how to turn this into a proper question, but what
> I'm after is "is this normal to be expected after a resilver and
> can I start breathing again?". Checksum errors are, as far as I can gather,
> dodgy data on disk, and read/write errors come from somewhere in the physical
> link (more or less).
Breathing is good. Then check your firmware releases.
-- richard
ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com