Eugene M. Zheganin
2017-Jun-29 11:37 UTC
redundant zfs pool, system traps and tons of corrupted files
Hi.

Say I have a server that traps more and more often (different panics: ZFS panics, GPFs, fatal traps while in kernel mode, etc.), and then I realize it has tons of permanent errors on all of its pools that scrub is unable to heal. Does this situation mean it's a bad memory case? Unfortunately, I switched the hardware to an identical server before discovering that the zpools have errors, so I'm not sure when they appeared. Right now I'm about to run a memtest on the old hardware.

So, what do you say: does it point at the memory as the root problem?

Thanks.
Eugene.
Eugene M. Zheganin
2017-Jun-29 12:04 UTC
redundant zfs pool, system traps and tons of corrupted files
Hi,

On 29.06.2017 16:37, Eugene M. Zheganin wrote:
> Hi.
>
> Say I have a server that traps more and more often (different
> panics: ZFS panics, GPFs, fatal traps while in kernel mode, etc.), and
> then I realize it has tons of permanent errors on all of its pools
> that scrub is unable to heal. Does this situation mean it's a bad
> memory case? Unfortunately, I switched the hardware to an identical
> server before discovering that the zpools have errors, so I'm not sure
> when they appeared. Right now I'm about to run a memtest on the old
> hardware.
>
> So, what do you say: does it point at the memory as the root problem?
>

I also don't quite understand how I can have errors at the vdev level but 0 errors at the lower device layer (could someone please explain this):

  pool: esx
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3,74G in 0h5m with 0 errors on Tue Dec 27 05:14:32 2016
config:

        NAME        STATE     READ WRITE CKSUM
        esx         ONLINE       0     0 99,0K
          raidz1-0  ONLINE       0     0  113K
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     2
            da3     ONLINE       0     0     0
            da5     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0 84,7K
            da12    ONLINE       0     0     0
            da13    ONLINE       0     0     1
            da14    ONLINE       0     0     0
            da15    ONLINE       0     0     0
            da16    ONLINE       0     0     0

errors: 25 data errors, use '-v' for a list

  pool: gamestop
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Jun 29 12:30:21 2017
        1,67T scanned out of 4,58T at 1002M/s, 0h50m to go
        0 repaired, 36,44% done
config:

        NAME        STATE     READ WRITE CKSUM
        gamestop    ONLINE       0     0     1
          raidz1-0  ONLINE       0     0     2
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da11    ONLINE       0     0     0

errors: 10 data errors, use '-v' for a list

P.S. This is FreeBSD 11.1-BETA2 r320056M (the M stands for CTL_MAX_PORTS = 1024), with ECC memory.

Thanks.
Eugene.
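The mismatch described above (checksum errors counted on the pool and raidz vdevs but almost none on the member disks) can be put into numbers straight from the config table. What follows is only a minimal sketch in Python, not part of any ZFS tooling: the helper names parse_count, parse_config and unattributed are made up for illustration, the sample text is the esx table from this message with the usual zpool status indentation assumed, and the K suffix is assumed to mean a multiple of 1024, so the resulting totals are approximate.

```python
# Config table of the "esx" pool as posted above (indentation assumed).
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        esx         ONLINE       0     0 99,0K
          raidz1-0  ONLINE       0     0  113K
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     2
            da3     ONLINE       0     0     0
            da5     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0 84,7K
            da12    ONLINE       0     0     0
            da13    ONLINE       0     0     1
            da14    ONLINE       0     0     0
            da15    ONLINE       0     0     0
            da16    ONLINE       0     0     0

errors: 25 data errors, use '-v' for a list
"""

# Assumption: suffixes are binary multiples; treat the results as approximate.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a counter such as '0', '113K' or '99,0K' to an integer."""
    text = text.replace(",", ".")  # the output above uses a decimal comma
    multiplier = 1
    if text[-1] in SUFFIXES:
        multiplier = SUFFIXES[text[-1]]
        text = text[:-1]
    return int(float(text) * multiplier)


def parse_config(status_text):
    """Return (indent, name, cksum) for every row of the config table."""
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True
        elif in_table and len(fields) == 5:
            indent = len(line) - len(line.lstrip())
            rows.append((indent, fields[0], parse_count(fields[4])))
        elif fields:
            in_table = False  # any other non-blank line ends the table
    return rows


def unattributed(rows):
    """For each non-leaf row, return (name, own CKSUM, sum of CKSUM over
    the leaf disks nested below it)."""
    result = []
    for i, (indent, name, cksum) in enumerate(rows):
        leaf_sum = 0
        has_children = False
        for j in range(i + 1, len(rows)):
            child_indent, _, child_cksum = rows[j]
            if child_indent <= indent:
                break  # left this vdev's subtree
            has_children = True
            # A row is a leaf if the next row is not indented deeper.
            if j + 1 == len(rows) or rows[j + 1][0] <= child_indent:
                leaf_sum += child_cksum
        if has_children:
            result.append((name, cksum, leaf_sum))
    return result


for name, own, leaves in unattributed(parse_config(SAMPLE)):
    print(f"{name}: CKSUM {own} on the vdev, {leaves} on its disks")
```

The comparison is done per non-leaf row so that the pool and each raidz vdev are checked separately against the disks beneath them. Feeding it the gamestop table instead only requires replacing SAMPLE.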