Eugene M. Zheganin
2017-Aug-05  17:08 UTC
a strange and terrible saga of the cursed iSCSI ZFS SAN
Hi,
I've got a problem that I cannot solve by myself. I have an iSCSI ZFS 
SAN system that crashes, corrupting its data. I'll be brief and try to 
describe its genesis:
1) autumn 2016, the SAN is set up: a Supermicro server, an external 
JBOD, SanDisk SSDs, several redundant pools, FreeBSD 11.x (probably a 
release, I don't really remember - see below).
2) everything works just fine until early spring 2017.
3) the system starts to crash (various panics):
panic: general protection fault
panic: page fault
panic: Solaris(panic): zfs: allocating allocated 
segment(offset=6599069589504 size=81920)
panic: page fault
panic: page fault
panic: Solaris(panic): zfs: allocating allocated 
segment(offset=8245779054592 size=8192)
panic: page fault
panic: page fault
panic: page fault
panic: Solaris(panic): zfs: allocating allocated 
segment(offset=1792100934656 size=46080)
4) we memtest it immediately; no problems found.
5) we swap the SanDisks for Toshibas; we also swap the server for an 
identical one and the JBOD for an identical one, leaving the same cables.
6) crashes don't stop.
7) we find that field engineers had physically damaged (sic!) the SATA 
cables (the main one and the spares), and that 90% of the disks show 
ICRC SMART errors (see the smartctl sketch right after this list).
8) we replace the cable (with a brand-new HP one).
9) the ATA SMART errors stop increasing.
10) crashes continue.
11) we decide that ZFS itself probably got damaged while the data was 
being moved over the broken cables between the JBODs, and that this is 
why it is panicking now. So we wipe the data completely, reinitialize 
the SAN system and put it back into production; we even dd each disk 
with zeroes (!) - just in case (a sketch of the procedure follows the 
pool listings below). Important note: the data was imported using zfs 
send from another, stable system that is running in production in 
another DC.
12) today we got another panic.
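
For reference, this is roughly how we read the ICRC counters mentioned 
in step 7 (a sketch: the disk names are examples, and smartmontools is 
assumed to be installed from ports):

for d in da2 da3 da4; do
  echo "=== $d ==="
  # attribute 199 is UDMA_CRC_Error_Count; a raw value that keeps
  # growing points at the cabling, not the disk itself
  smartctl -A /dev/$d | grep -i crc
done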
BTW, the pools now look like this:
# zpool status -v
   pool: data
  state: ONLINE
status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://illumos.org/msg/ZFS-8000-8A
   scan: none requested
config:
         NAME        STATE     READ WRITE CKSUM
         data        ONLINE       0     0    62
           raidz1-0  ONLINE       0     0     0
             da2     ONLINE       0     0     0
             da3     ONLINE       0     0     0
             da4     ONLINE       0     0     0
             da5     ONLINE       0     0     0
             da6     ONLINE       0     0     0
           raidz1-1  ONLINE       0     0     0
             da7     ONLINE       0     0     0
             da8     ONLINE       0     0     0
             da9     ONLINE       0     0     0
             da10    ONLINE       0     0     0
             da11    ONLINE       0     0     0
           raidz1-2  ONLINE       0     0    62
             da12    ONLINE       0     0     0
             da13    ONLINE       0     0     0
             da14    ONLINE       0     0     0
             da15    ONLINE       0     0     0
             da16    ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
         data/userdata/worker208:<0x1>
   pool: userdata
  state: ONLINE
status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://illumos.org/msg/ZFS-8000-8A
   scan: none requested
config:
         NAME               STATE     READ WRITE CKSUM
         userdata           ONLINE       0     0  216K
           mirror-0         ONLINE       0     0  432K
             gpt/userdata0  ONLINE       0     0  432K
             gpt/userdata1  ONLINE       0     0  432K
errors: Permanent errors have been detected in the following files:
         userdata/worker36:<0x1>
         userdata/worker30:<0x1>
         userdata/worker31:<0x1>
         userdata/worker35:<0x1>
13) somewhere between steps 5 and 10, dedup was enabled on the pool 
(not directly connected to the problem, just for production reasons).
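
For reference, the wipe/reseed from step 11 was done roughly like this 
(a sketch: the disk, pool and host names are examples):

# on the SAN, for every disk in the pool - destroys everything on it (!)
dd if=/dev/zero of=/dev/da2 bs=1m

# then recreate the pools, and reseed from the stable system in the
# other DC (run there; snapshot made with -r so that send -R works):
zfs snapshot -r userdata@seed
zfs send -R userdata@seed | ssh san1 zfs receive -F userdata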
So, to conclude: we had bad hardware, we replaced EVERY piece (server, 
adapter, JBOD, cable, disks), and the crashes just don't stop. We have 
five other iSCSI SAN systems, almost fully identical, that don't crash. 
The crashes on this particular system began while it was running the 
same set of versions as the stable systems.
So, besides calling an exorcist, I would really like to hear what other 
options I have.
I also want to ask: what happens when the system's memory isn't enough 
for deduplication - does the system crash, or does the pool fail to 
mount, as some articles mention?
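
For reference, this is how I'd gauge the size of the dedup table (a 
sketch: the pool name is an example):

# print DDT statistics and a histogram for the pool
zdb -DD userdata
# rule of thumb: each DDT entry wants on the order of ~320 bytes of
# RAM, so (number of entries) x 320 gives a ballpark for the memory
# the table needs to stay resident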
I could have encumbered this message with junk data like the exact 
FreeBSD releases we ran (assuming it's normal for some 11.x revisions 
to crash and damage data while others don't, which I believe is 
nonsense), the pool configurations and disk lists (assuming, likewise, 
that you can provoke data loss with certain redundant pool 
configurations - not counting raidz with more than 5 disks - which I 
believe is not true), and so on, but I decided not to include any of 
that until requested. And as I also said, we have five other SAN 
systems running similar/identical configurations without major problems.
Thanks.
Eugene.
Eugene M. Zheganin
2017-Aug-05  17:16 UTC
a strange and terrible saga of the cursed iSCSI ZFS SAN
Hi,

On 05.08.2017 22:08, Eugene M. Zheganin wrote:
>    pool: userdata
>   state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: none requested
> config:
>
>         NAME               STATE     READ WRITE CKSUM
>         userdata           ONLINE       0     0  216K
>           mirror-0         ONLINE       0     0  432K
>             gpt/userdata0  ONLINE       0     0  432K
>             gpt/userdata1  ONLINE       0     0  432K

This would be funny if it weren't so sad, but while I was writing that 
message the pool came to look like the output below (I just ran zpool 
status twice in a row and compared the results):

[root@san1:~]# zpool status userdata
   pool: userdata
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://illumos.org/msg/ZFS-8000-8A
   scan: none requested
 config:

         NAME               STATE     READ WRITE CKSUM
         userdata           ONLINE       0     0  728K
           mirror-0         ONLINE       0     0 1,42M
             gpt/userdata0  ONLINE       0     0 1,42M
             gpt/userdata1  ONLINE       0     0 1,42M

errors: 4 data errors, use '-v' for a list

[root@san1:~]# zpool status userdata
   pool: userdata
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://illumos.org/msg/ZFS-8000-8A
   scan: none requested
 config:

         NAME               STATE     READ WRITE CKSUM
         userdata           ONLINE       0     0  730K
           mirror-0         ONLINE       0     0 1,43M
             gpt/userdata0  ONLINE       0     0 1,43M
             gpt/userdata1  ONLINE       0     0 1,43M

errors: 4 data errors, use '-v' for a list

So, you see, the error counters grow at something like the speed of 
light, and I'm not sure the data access rate is anywhere near that 
enormous - it looks like they are increasing on their own. Maybe 
someone has an idea of what this really means.

Thanks.
Eugene.
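
P.S. Watching the counters grow is trivial, if anyone wants to see the 
same thing (a sketch):

while true; do
  # print just the counter lines for the pool and its vdevs
  zpool status userdata | grep -E 'userdata|gpt/'
  sleep 5
done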
Eugene M. Zheganin
2017-Aug-08  07:23 UTC
a strange and terrible saga of the cursed iSCSI ZFS SAN
On 05.08.2017 22:08, Eugene M. Zheganin wrote:
> [...]

So far my working theory is that something got broken in the iSCSI+ZFS 
stack somewhere between r310734 (the most recent revision that works on 
my SAN systems) and r320056 (probably earlier, but r320056 is the first 
revision with a documented crash). So I downgraded back to r310734 
(from 11.1-RELEASE, which is affected, if I'm right).

Several things speak for this theory:

- the system was stable pre-spring 2017, before the upgrade happened;

- the ZFS corruption happens _only_ on the pools that iSCSI serves 
from; no corruption happens on the ZFS pools that have nothing to do 
with providing zvols as iSCSI targets (and this seems to be the most 
convincing point);

- the faulty hardware was replaced. True, it was replaced with 
identical hardware, BUT I have the very same set of identical hardware 
working in an almost identical environment under r310734 in another DC.

So far I'm not sure, because only 20 hours have passed since the 
downgrade. However, if the system stays stable for more than a week (it 
was never stable that long on the recent revisions), that will prove me 
right and I'll file a PR.

Thanks.
Eugene.
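
P.S. For reference, the downgrade itself was roughly the standard 
source upgrade dance (a sketch: it assumes the box tracks stable/11 via 
svn - adjust the branch to whatever yours actually runs):

svnlite checkout -r 310734 https://svn.freebsd.org/base/stable/11 /usr/src
cd /usr/src
make buildworld buildkernel
make installkernel
shutdown -r now
# after the reboot:
cd /usr/src
make installworld
mergemaster -Ui
shutdown -r now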