After performing the following steps in exact order, I am now seeing CKSUM
errors in my zpool.  I've never seen any checksum errors in this pool
before.

1. Pre-existing setup: RAIDZ 7D+1P across 8x 1TB disks, running Solaris 10
   Update 3 x86.
2. Disk 6 (c6t2d0) was dying: zpool status showed read errors, and device
   errors appeared in /var/adm/messages.
3. In addition to replacing this disk, I thought I would give myself a
   challenge and upgrade Solaris 10 as well as change my CPU/motherboard.
   3.1 The CPU went from an Athlon FX-51 to an AthlonXP 3500+.
   3.2 The motherboard went from an Asus SK8N to an Asus A8N-SLI Premium.
   3.3 The memory stayed the same at 2GB ECC DDR (all other components
       identical).
   3.4 And finally, I replaced the failed Disk 6.
4. The Solaris 10 U5 x86 install went through without a problem, and the
   zpool imported fine (obviously DEGRADED).
5. zpool replace worked without a problem and the pool resilvered with 0
   read, write or cksum errors.
6. After the zpool replace, zfs recommended I run zfs upgrade to go from
   version 3 to version 4, which I have done (a rough sketch of the
   commands is just below this list).
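For context, steps 5 and 6 amounted to roughly the following.  This is a
sketch from memory rather than a transcript: the device argument is my best
recollection, and since the jump was from version 3 to 4 the upgrade may in
fact have been the pool version (zpool upgrade) rather than the filesystem
version (zfs upgrade).

# zpool replace rzdata c6t2d0     (swap in the new disk 6 and resilver)
# zpool status -v rzdata          (confirm the resilver finished with 0 errors)
# zpool upgrade -v                (list the versions the software supports)
# zpool upgrade rzdata            (bump the pool from version 3 to version 4)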
This is where the problem starts to appear.
The upgrade itself was fine; however, immediately afterwards I ran a scrub
and noticed a very high number of CKSUM errors on the newly replaced disk 6
(now c4t2d0; before the reinstall it was c6t2d0).
Here is the progress of the scrub, and you can see how the CKSUM count is
rapidly and steadily increasing:
[/root][root]# date
Fri Oct 10 00:19:16 EST 2008
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 7.34% done, 6h10m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0   390
            c4t3d0  ONLINE       0     0     0

errors: No known data errors
[/root][root]# date
Fri Oct 10 00:23:12 EST 2008
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 8.01% done, 6h6m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     1
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c4t2d0  ONLINE       0     0   768
            c4t3d0  ONLINE       0     0     0
[/root][root]# date
Fri Oct 10 00:29:44 EST 2008
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 9.88% done, 5h57m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     2
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c4t2d0  ONLINE       0     0   931
            c4t3d0  ONLINE       0     0     1
It eventually finished with roughly 6.4K CKSUM errors against c4t2d0 and an
average of fewer than 5 errors on each of the remaining disks.  I was not
(and still am not) convinced it's a physical hardware problem; my initial
thought was that there is (or was?) a bug in zfs when you run an upgrade
against a mounted, live zpool.  So, to be thorough, I rebooted the server
and initiated another scrub.
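For anyone following along, that reset-and-rescrub amounted to roughly the
following.  This is a sketch; whether an explicit zpool clear is strictly
needed is debatable, since the per-device counters shown by zpool status
are in-core and reset across a reboot anyway.

# zpool clear rzdata        (zero the READ/WRITE/CKSUM counters)
# init 6                    (reboot the box)
# zpool scrub rzdata        (kick off a fresh scrub)
# zpool status -v rzdata    (watch the counters while it runs)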
This is the outcome of this scrub:
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Mon Oct 13 09:42:41 2008
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     1
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     1
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0    22
            c4t3d0  ONLINE       0     0     2
The next avenue I plan on investigating is running a complete memtest86
pass against the hardware to ensure the memory isn't occasionally returning
garbage (even though it's ECC).
So this is where I stand.  I'd like to ask zfs-discuss whether they've seen
any ZIL/replay-style bugs associated with u3/u5 x86.  Again, I'm confident
in my hardware, and /var/adm/messages is showing no warnings or errors.
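For completeness, these are the other places I know of to check for
device-level problems on Solaris 10, beyond /var/adm/messages.  A rough
sketch only; I am quoting /var/adm/messages above, so treat these as
additional places to look rather than checks I have already exhausted:

# iostat -En                (per-device soft/hard/transport error counts)
# fmdump -eV | more         (raw FMA error telemetry, if any has been logged)
# fmadm faulty              (faults the fault manager has actually diagnosed)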
Thank You
> So this is where I stand.  I'd like to ask zfs-discuss if they've seen
> any ZIL/Replay style bugs associated with u3/u5 x86?  Again, I'm
> confident in my hardware, and /var/adm/messages is showing no
> warnings/errors.

Are you absolutely sure the hardware is OK?  Is there another disk you can
test in its place?  If I read your post correctly, your first disk was
having errors logged against it, and now the second disk -- plugged into
the same port -- is also logging errors.

This seems to me more like the port is bad.  Is there a third disk you can
try in that same port?

I have a hard time seeing that this could be a zfs bug - I've been doing
lots of testing on u5 and the only time I see checksum errors is when I
deliberately induce them.
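Concretely, the port test suggested here might look something like the
following on Solaris 10.  This is only a sketch: the attachment point
sata1/2 and the reuse of the c4t2d0 name are placeholders, not details
taken from this thread.

# zpool offline rzdata c4t2d0     (take the suspect disk offline)
# cfgadm -al                      (find the attachment point for that port)
# cfgadm -c unconfigure sata1/2   (sata1/2 is a placeholder ap_id)
  ... physically swap a known-good disk into the same port ...
# cfgadm -c configure sata1/2
# zpool replace rzdata c4t2d0     (resilver onto the test disk)
# zpool scrub rzdata              (then watch the CKSUM column)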
The original disk failure was very explicit: high read errors and errors
inside /var/adm/messages.  When I replaced the disk, however, these all
went away and the resilver was okay.  I am not seeing any read/write or
/var/adm/messages errors -- but for some reason I am seeing errors in the
CKSUM column, which I've never seen before.

I hope you're right and it's a simple memory corruption problem.  I will be
running memtest86 overnight, and hopefully it fails so we can rule out zfs.

On Wed, Oct 15, 2008 at 11:48 AM, Mark J Musante <mark.musante at sun.com>
wrote:
> Are you absolutely sure the hardware is OK?  Is there another disk you
> can test in its place?  [...]
Another update:

Last night, having already read many blog posts about si3124 chipset
problems with Solaris 10, I applied patch 138053-02, which updates the
si3124 driver from 1.2 to 1.4 and fixes numerous performance and
interrupt-related bugs.
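For anyone hitting the same thing, checking the driver revision before and
after is straightforward.  A sketch only: it assumes the patch has been
downloaded and unpacked into the current directory, and the exact module
description printed by modinfo may differ.

# modinfo | grep -i si3124     (note the si3124 driver currently loaded)
# patchadd 138053-02           (apply the patch)
# init 6                       (reboot so the new driver is loaded)
# modinfo | grep -i si3124     (confirm the driver revision has changed)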
And it appears to have helped.  Below is the zpool scrub after the new
driver, but I'm still not confident about the exact cause of the problem.
# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Wed Oct 29 05:32:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     2
          raidz1    ONLINE       0     0     2
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     3
            c4t3d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /rzdata/downloads/linux/ubuntu-8.04.1-desktop-i386.iso
It still didn't clear the errored file I have, which I'm curious about
considering it's a RAIDZ.
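Assuming the file really is unrecoverable (i.e. the raidz1 redundancy could
not reconstruct those blocks), the recovery path that the ZFS-8000-8A
message points at would look roughly like this.  A sketch only: the /backup
path is a placeholder for wherever another copy of the ISO lives.

# rm /rzdata/downloads/linux/ubuntu-8.04.1-desktop-i386.iso
# cp /backup/ubuntu-8.04.1-desktop-i386.iso /rzdata/downloads/linux/
# zpool clear rzdata     (reset the per-device error counters)
# zpool scrub rzdata     (a clean scrub or two should drop the permanent-error entry)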
On Mon, Oct 27, 2008 at 2:57 PM, Matthew Angelo <bangers at gmail.com> wrote:

> Another update.
>
> The weekly cron kicked in again this week, but this time it failed with a
> lot of CKSUM errors and now also complained about corrupted files.  The
> single file it complained about is a new one I recently copied into it.
>
> I'm stumped with this.  How do I verify the x86 hardware under the OS?
>
> I've run Memtest86 and it ran overnight without a problem.  Tonight I
> will be moving back to my old motherboard/CPU/memory.  Hopefully this is
> a simple hardware problem.
>
> But the question I'd like to pose to everyone is: how can we validate our
> x86 hardware?
>
>
> On Tue, Oct 21, 2008 at 8:23 AM, David Turnbull <dsturnbull at gmail.com>
> wrote:
>
>> I don't think it's normal, no... it seems to occur when the resilver is
>> interrupted and gets marked as "done" prematurely?
>>
>>
>> On 20/10/2008, at 12:28 PM, Matthew Angelo wrote:
>>
>>> Hi David,
>>>
>>> Thanks for the additional input.  This is the reason why I thought I'd
>>> start a thread about it.
>>>
>>> To continue my original topic, I have additional information to add.
>>> After last week's initial replace/resilver/scrub, my weekly cron scrub
>>> (which runs Sunday morning) kicked off and all CKSUM errors have now
>>> cleared:
>>>
>>>
>>>  pool: rzdata
>>> state: ONLINE
>>> scrub: scrub completed with 0 errors on Mon Oct 20 09:41:31 2008
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         rzdata      ONLINE       0     0     0
>>>           raidz1    ONLINE       0     0     0
>>>             c3t0d0  ONLINE       0     0     0
>>>             c3t1d0  ONLINE       0     0     0
>>>             c3t2d0  ONLINE       0     0     0
>>>             c3t3d0  ONLINE       0     0     0
>>>             c4t0d0  ONLINE       0     0     0
>>>             c4t1d0  ONLINE       0     0     0
>>>             c4t2d0  ONLINE       0     0     0
>>>             c4t3d0  ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> Which requires me to ask -- is it standard to see high checksum (CKSUM)
>>> errors on a zpool when you replace a failed disk, after it has
>>> resilvered?
>>>
>>> Is there anything I can feed back to the zfs community on this matter?
>>>
>>> Matt
>>>
>>> On Sun, Oct 19, 2008 at 9:26 AM, David Turnbull <dsturnbull at gmail.com>
>>> wrote:
>>>
>>> Hi Matthew.
>>>
>>> I had a similar problem occur last week.  One disk in the raidz had the
>>> first 4GB zeroed out (manually) before we then offlined it and replaced
>>> it with a new disk.
>>> High checksum errors were occurring on the partially-zeroed disk, as
>>> you'd expect, but when the new disk was inserted, checksum errors
>>> occurred on all disks.
>>>
>>> Not sure how relevant this is to your particular situation, but
>>> unexpected checksum errors on known-good hardware has definitely
>>> happened to me as well.
>>>
>>> -- Dave