thr3ads.net - freebsd stable - Checksum errors across ZFS array [Jul 2012]

James Snow

2012-Jul-19 15:36 UTC

Checksum errors across ZFS array

I have a ZFS server on which I've seen periodic checksum errors on
almost every drive. While scrubbing the pool last night, it began to
report unrecoverable data errors on a single file.

I compared an md5 of the supposedly corrupted file to an md5 of the
original copy, stored on different media. They were the same, suggesting
no corruption.
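
For anyone repeating this comparison, here is a minimal sketch in Python. The helper names are hypothetical and not from the original post. It hashes each file in chunks, so large files do not need to fit in memory. A matching digest only shows that the bytes returned by that particular read were identical to the reference copy.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True when both files hash to the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Calling `same_content()` with the path of the suspect file on the pool and the path of the copy on other media would reproduce the check described above. Both paths are whatever applies on your system.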

A large file was being written to the pool while the scrub was in
progress, and the entire array became unresponsive. The OS was still up,
but 'zpool status' showed the scrub progress stuck at the same spot,
with the throughput rate falling. 'shutdown -r now' stalled. Eventually
I hard power cycled the system.

Now, attempting to read the file that ZFS reports errors on yields
"Input/output error." The scrub completed, with the following result:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     7
          mirror-0   ONLINE       0     0     0
            aacd0p1  ONLINE       0     0     0
            aacd4p1  ONLINE       0     0     1
          mirror-1   ONLINE       0     0     0
            aacd1p1  ONLINE       0     0     0
            aacd5p1  ONLINE       0     0     0
          mirror-2   ONLINE       0     0    14
            aacd2p1  ONLINE       0     0    14
            aacd6p1  ONLINE       0     0    14
          mirror-3   ONLINE       0     0     0
            aacd3p1  ONLINE       0     0     0
            aacd7p1  ONLINE       0     0     0
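
To track which devices accumulate checksum errors over time, the table above can be parsed mechanically. The following is a rough sketch, assuming the five-column NAME/STATE/READ/WRITE/CKSUM layout shown here. The function names are hypothetical, and rows whose counters are not plain integers are skipped rather than interpreted.

```python
SAMPLE = """
        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     7
          mirror-0   ONLINE       0     0     0
            aacd0p1  ONLINE       0     0     0
            aacd4p1  ONLINE       0     0     1
          mirror-1   ONLINE       0     0     0
            aacd1p1  ONLINE       0     0     0
            aacd5p1  ONLINE       0     0     0
          mirror-2   ONLINE       0     0    14
            aacd2p1  ONLINE       0     0    14
            aacd6p1  ONLINE       0     0    14
          mirror-3   ONLINE       0     0     0
            aacd3p1  ONLINE       0     0     0
            aacd7p1  ONLINE       0     0     0
"""


def parse_zpool_config(text):
    """Return (name, state, read, write, cksum) tuples for each table row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything not in 5 columns.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, state, read, write, cksum = fields
        # Assumption: counters are plain integers; other rows are ignored.
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue
        rows.append((name, state, int(read), int(write), int(cksum)))
    return rows


def entries_with_checksum_errors(rows):
    """Map each pool, vdev, or device name to its non-zero CKSUM count."""
    return {name: cksum for name, _, _, _, cksum in rows if cksum > 0}
```

Run against `SAMPLE`, `entries_with_checksum_errors(parse_zpool_config(SAMPLE))` returns entries for tank, aacd4p1, mirror-2, aacd2p1, and aacd6p1, which matches the non-zero rows in the table.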

The system configuration is as follows:

Controller:  Adaptec 2805 
Motherboard: Supermicro X8STE
Drive Cage:  2x Supermicro CSE-M35T-1
Memory:      2x Kingston 12GB ECC (KVR1066D3E7SK3/12G)
PSU:         Nexus RX-7000
OS:          9.0-RELEASE-p3
ZFS:         ZFS filesystem version 5, ZFS storage pool version 28


The Adaptec card has 2 ports, each of which uses a 4-port fan-out cable.
The cables are routed as shown:

      /--- aacd0 (ST1000DM003-9YN1 CC4D)
     / /-- aacd1 (ST1000DM003-9YN1 CC4D)
p1-----
     \ \-- aacd2 (WDC WD1001FALS-0 05.0)
      \--- aacd3 (WDC WD1001FALS-0 05.0)

      /--- aacd4 (ST1000DM003-9YN1 CC4D)
     / /-- aacd5 (ST1000DM003-9YN1 CC4D)
p2-----
     \ \-- aacd6 (WDC WD1002FAEX-0 05.0)
      \--- aacd7 (WDC WD1002FAEX-0 05.0)

You can see that each ZFS mirror device consists of one drive from
each drive carrier, on separate ports, on separate cables.
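
The port separation can be checked mechanically. This is a small sketch in Python, with the drive-to-port map transcribed from the cable diagram and the mirror membership taken from the 'zpool status' output (partition suffix dropped). The names `PORT_OF`, `MIRRORS`, and `mirrors_sharing_a_port` are hypothetical, and only controller ports are checked, not drive carriers.

```python
# Drive-to-port assignment, transcribed from the fan-out cable diagram.
PORT_OF = {
    "aacd0": "p1", "aacd1": "p1", "aacd2": "p1", "aacd3": "p1",
    "aacd4": "p2", "aacd5": "p2", "aacd6": "p2", "aacd7": "p2",
}

# Mirror membership, as listed in the pool status output.
MIRRORS = {
    "mirror-0": ("aacd0", "aacd4"),
    "mirror-1": ("aacd1", "aacd5"),
    "mirror-2": ("aacd2", "aacd6"),
    "mirror-3": ("aacd3", "aacd7"),
}


def mirrors_sharing_a_port(mirrors, port_of):
    """Return the names of mirrors with two or more members on one port."""
    shared = []
    for name, members in mirrors.items():
        ports = {port_of[member] for member in members}
        # Fewer distinct ports than members means at least two share a port.
        if len(ports) < len(members):
            shared.append(name)
    return shared
```

For this layout, `mirrors_sharing_a_port(MIRRORS, PORT_OF)` returns an empty list, so no single port or cable is common to both halves of any mirror.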

Since I have seen periodic checksum errors on almost every drive, and the
only common components are the Adaptec controller and the motherboard, I
suspect the controller. (Or the motherboard, but I'm starting with the
controller since it's much simpler to swap out.)

Could it be something else? What else should I be looking at? Any input
is greatly appreciated.


-Snow

Dr Joe Karthauser

2012-Jul-19 17:13 UTC

Hi James,

It's almost definitely a memory problem. I'd change it ASAP if I were
you.

I lost about 70 MB from my ZFS pool for this very reason just a few weeks ago.
Luckily I had enough snapshots from before the rot set in to recover most of
what I lost.

Joe

-- 
Dr Joe Karthauser

Steven Hartland

2012-Jul-19 17:26 UTC

----- Original Message ----- 
From: "James Snow" <snow@teardrop.org>

> I have a ZFS server on which I've seen periodic checksum errors on
> almost every drive. While scrubbing the pool last night, it began to
> report unrecoverable data errors on a single file.
> 
> I compared an md5 of the supposedly corrupted file to an md5 of the
> original copy, stored on different media. They were the same, suggesting
> no corruption....

Had this before; it has always turned out to be failing hardware. It's
been a mixture of faults for us:
1. Memory, even though it was ECC and reported no failures in use or
via memtest.
2. CPU / northbridge on old AMDs, not 100% sure which. This started
as ZFS checksum issues and then, weeks or months later, resulted in
random untraceable panics and watchdog timeouts in the bge NIC.
Disabling the cores on the second CPU fixed this for us on two separate
machines, e.g. in /boot/loader.conf:
hint.lapic.2.disabled="1"
hint.lapic.3.disabled="1"

So while ZFS can report errors on files that aren't errors on the
disks themselves, and hence the data, as you confirmed, is fine, don't
ignore them.

    Regards
    Steve
