On Tue, Apr 30, 2019 at 8:05 AM Michelle Sullivan <michelle at sorbs.net> wrote:
>
> Michelle Sullivan
> http://www.mhix.org/
> Sent from my iPad
>
> On 01 May 2019, at 00:01, Alan Somers <asomers at freebsd.org> wrote:
>
>> On Tue, Apr 30, 2019 at 7:30 AM Michelle Sullivan <michelle at sorbs.net> wrote:
>>
>>> Karl Denninger wrote:
>>>> On 4/30/2019 05:14, Michelle Sullivan wrote:
>>>>> On 30 Apr 2019, at 19:50, Xin LI <delphij at gmail.com> wrote:
>>>>>> On Tue, Apr 30, 2019 at 5:08 PM Michelle Sullivan <michelle at sorbs.net> wrote:
>>>>>>> but in my recent experience 2 issues colliding at the same time results in disaster
>>>>>> Do we know exactly what kind of corruption happened to your pool? If you see it twice in a row, it might suggest a software bug that should be investigated.
>>>>>
>>>>> All I know is it's a checksum error on a metaslab (122), and from what I can gather it's the spacemap that is corrupt... but I am no expert. I don't believe it's a software fault as such, because this was caused by a hard outage (damaged UPSes) whilst resilvering a single (but completely failed) drive. ...and after the first outage a second occurred (same as the first but more damaging to the power hardware)... the host itself was not damaged, nor were the drives or controller.
>>>> .....
>>>>>> Note that ZFS stores multiple copies of its essential metadata, and in my experience with my old, consumer-grade crappy hardware (non-ECC RAM, several faults, a single-hard-drive pool: bad enough to crash almost monthly and damage my data from time to time),
>>>>> This was a top-end consumer-grade motherboard with non-ECC RAM that had been running for 8+ years without fault (except for hard drive platter failures). Uptime would have been years if it weren't for patching.
>>>> Yuck.
>>>>
>>>> I'm sorry, but that may well be what nailed you.
>>>>
>>>> ECC is not just about the random cosmic ray. It also saves your bacon
>>>> when there are power glitches.
>>>
>>> No. Sorry, no. If the data is only half on disk, ECC isn't going to save
>>> you at all... it's all about power on the drives to complete the write.
>>
>> ECC RAM isn't about saving the last few seconds' worth of data from
>> before a power crash. It's about not corrupting the data that gets
>> written long before a crash. If you have non-ECC RAM, then a cosmic
>> ray/alpha ray/row-hammer attack/bad luck can corrupt data after it's
>> been checksummed but before it gets DMAed to disk. Then the disk will
>> contain corrupt data and you won't know it until you try to read it
>> back.
>
> I know this... unless I misread Karl's message, he implied the ECC would have saved the corruption in the crash... which is patently false... I think you'll agree.

I don't think that's what Karl meant. I think he meant that the
non-ECC RAM could've caused latent corruption that was only detected
when the crash forced a reboot and resilver.

> Michelle
>
>> -Alan
>>
>>>> Unfortunately, however, there is also cache memory on most modern hard
>>>> drives; most of the time (unless you explicitly shut it off) it's on for
>>>> write caching, and it'll nail you too. Oh, and it's never, in my
>>>> experience, ECC.
>>
>> Fortunately, ZFS never sends non-checksummed data to the hard drive,
>> so an error in the hard drive's cache RAM will usually get detected by
>> the ZFS checksum.
>>
>>> No comment on that - you're right in the first part; I can't comment on
>>> whether there are drives with ECC.
>>>
>>>> In addition, however, and this is something I learned a LONG time ago
>>>> (think Z-80 processors!): as in so many very important things,
>>>> "two is one and one is none."
>>>>
>>>> In other words, without a backup you WILL lose data eventually, and it
>>>> WILL be important.
>>>>
>>>> Raidz2 is very nice, but as the name implies, you have two
>>>> redundancies. If you take three errors, or if, God forbid, you *write*
>>>> a block that has a bad checksum in it because it got scrambled while in
>>>> RAM, you're dead if that happens in the wrong place.
>>>
>>> Or in my case you write partial data, therefore invalidating the checksum...
>>>
>>>>> Yeah.. unlike UFS, which has to get really, really hosed before you're
>>>>> restoring from backup with nothing recoverable, it seems ZFS can get
>>>>> hosed where issues occur in just the wrong bit... but mostly it is
>>>>> recoverable (and my experience has been some nasty shit that always
>>>>> ended up being recoverable.)
>>>>>
>>>>> Michelle
>>>> Oh, that is definitely NOT true.... again, from hard experience,
>>>> including (but not limited to) on FreeBSD.
>>>>
>>>> My experience is that ZFS is materially more resilient, but there is no
>>>> such thing as "can never be corrupted by any set of events."
>>>
>>> The latter part is true - and my blog and my current situation are not
>>> limited to or aimed at FreeBSD specifically; FreeBSD is my experience.
>>> The former part... it has been very resilient, but I think (based on
>>> this certain set of events) it is easily corruptible and I have just
>>> been lucky. You just have to hit a certain write to activate the issue,
>>> and whilst that write and issue might be very, very difficult (read: hit
>>> and miss) to hit in normal everyday scenarios, it can and will
>>> eventually happen.
>>>
>>>> Backup
>>>> strategies for moderately large (e.g. many terabytes) to very large
>>>> (e.g. petabytes and beyond) pools get quite complex, but they're also
>>>> very necessary.
>>>
>>> And therein lies the problem. If you don't have a many-tens-of-thousands-
>>> of-dollars backup solution, you're either:
>>>
>>> 1/ down for a looooong time, or
>>> 2/ losing all data and starting again...
>>>
>>> ..and that's the problem... With UFS you can recover most (in most
>>> situations), and providing the *data* is there, uncorrupted by the fault,
>>> you can get it all off with various tools even if it is a complete
>>> mess.... Here I am with the data that is apparently OK, but the
>>> metadata is corrupt (and note: as I had stopped writing to the drive
>>> when it started resilvering, the data - all of it - should be intact...
>>> even if a mess.)
>>>
>>> Michelle
>>>
>>> --
>>> Michelle Sullivan
>>> http://www.mhix.org/
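(Aside for readers hitting the same metaslab/spacemap symptom: zdb(8) can
inspect that metadata read-only, without importing the pool writable. A
minimal sketch; the pool name "storage" is a placeholder, and a badly
damaged or exported pool may also need zdb's -e and -p options:

    # Summarize each vdev's metaslabs and spacemap usage.
    zdb -m storage

    # Repeat -m for more detail about each metaslab's spacemap; a corrupt
    # spacemap (e.g. the metaslab 122 mentioned above) should show up here.
    zdb -mm storage

    # Walk the block tree and verify metadata checksums (-cc adds data blocks).
    zdb -c storage

None of this repairs anything; it only tells you what failed to decode.)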
This issue is definitely related to sudden, unexpected loss of power during the resilver... not ECC/non-ECC issues.

Michelle Sullivan
http://www.mhix.org/
Sent from my iPad

> On 01 May 2019, at 00:12, Alan Somers <asomers at freebsd.org> wrote:
>
>> I know this... unless I misread Karl's message, he implied the ECC would
>> have saved the corruption in the crash... which is patently false... I
>> think you'll agree.
>
> I don't think that's what Karl meant. I think he meant that the
> non-ECC RAM could've caused latent corruption that was only detected
> when the crash forced a reboot and resilver.
>
> [...]
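(On the "backups cost many tens of thousands" point quoted above: ZFS's own
replication is often enough for a second box full of cheap disks. A minimal
sketch, assuming a local pool "storage" and a remote host "backuphost" with
a pool named "backup" - all names hypothetical:

    # One-time full replication of a recursive snapshot.
    zfs snapshot -r storage@2019-04-30
    zfs send -R storage@2019-04-30 | ssh backuphost zfs receive -duF backup

    # Thereafter, ship only the changes between successive snapshots.
    zfs snapshot -r storage@2019-05-01
    zfs send -R -i storage@2019-04-30 storage@2019-05-01 | \
        ssh backuphost zfs receive -duF backup

The receive side's -d keeps the sent dataset paths, -u leaves them
unmounted, and -F rolls the target back to the last common snapshot.)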
On 4/30/2019 09:12, Alan Somers wrote:
> On Tue, Apr 30, 2019 at 8:05 AM Michelle Sullivan <michelle at sorbs.net> wrote:
>> I know this... unless I misread Karl's message, he implied the ECC would
>> have saved the corruption in the crash... which is patently false... I
>> think you'll agree.
> I don't think that's what Karl meant. I think he meant that the
> non-ECC RAM could've caused latent corruption that was only detected
> when the crash forced a reboot and resilver.

Exactly. Non-ECC memory means you can potentially write data to *all*
copies of a block (and its parity, in the case of a raidz) where the
checksum is invalid, and there is no way for the code to know it happened
or to defend against it. Unfortunately, since the checksum is very small
compared to the data size, the odds are that IF that happens it's the
*data* and not the checksum that's bad, and there are *no* good copies.

Contrary to popular belief, the "power good" signal on your PSU and
motherboard does not provide 100% protection against transient power
problems causing this to occur with non-ECC memory either.

IMHO non-ECC memory systems are OK for personal desktop and laptop
machines, where loss of stored data requiring a restore is acceptable
(assuming you have a reasonable backup paradigm for same), but not for
servers and *especially* not for ZFS storage. I don't like the price of
ECC memory, and I really don't like Intel's practice of only enabling
ECC RAM on their "server" class line of CPUs either, but it is what it
is. Pay up for the machines where it matters.

One of the ironies is that there's better data *integrity* with ZFS than
with other filesystems in this circumstance; you're much more likely to
*know* you're hosed, even if the situation is unrecoverable and requires
a restore. With UFS and other filesystems you can quite easily wind up
with silent corruption that goes undetected; the filesystem "works" just
fine, but the data is garbage. From my point of view that's *much* worse.

In addition, IMHO consumer drives are not exactly safe for online ZFS
storage. Ironically, they're *safer* for archival use, because when not
actively in use they're dismounted and thus not subject to "you're
silently hosed" sorts of failures. What sort of "you're hosed" failures?
Oh, for example, claiming to have flushed their cache buffers before
returning "complete" on that request when they really did not! In
combination with write re-ordering, that can *really* screw you, and
there's nothing that any filesystem can defensively do about it either.
This sort of "cheat" is much more likely to be present in consumer drives
than in ones sold for either enterprise or NAS purposes, and it's quite
difficult to accurately test for on an individual basis too.

--
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
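(Karl's "drives that lie about cache flushes" point can at least be probed
crudely on FreeBSD. A rough sketch for recent releases; the device name
ada0 is a placeholder, and the tunable is documented in ada(4):

    # Does the drive advertise and enable a volatile write cache?
    camcontrol identify ada0 | grep -i 'write cache'

    # Disable ATA write caching at boot (costs write throughput, but
    # removes one lying-cache failure mode):
    #   echo 'kern.cam.ada.write_cache=0' >> /boot/loader.conf

    # Crude honesty check: diskinfo's synchronous-write test (-S, which
    # requires -w and is destructive to free space timing, not data)
    # issues small writes each followed by a cache flush. Flush latencies
    # far below one platter rotation suggest the "flush" is being faked.
    diskinfo -wS /dev/ada0

This is only a heuristic, as Karl says: a drive can pass a quick test and
still reorder or drop flushes under other workloads.)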
Michelle Sullivan
http://www.mhix.org/
Sent from my iPad

> On 01 May 2019, at 01:15, Karl Denninger <karl at denninger.net> wrote:
>
> IMHO non-ECC memory systems are OK for personal desktop and laptop
> machines, where loss of stored data requiring a restore is acceptable
> (assuming you have a reasonable backup paradigm for same), but not for
> servers and *especially* not for ZFS storage. I don't like the price of
> ECC memory, and I really don't like Intel's practice of only enabling
> ECC RAM on their "server" class line of CPUs either, but it is what it
> is. Pay up for the machines where it matters.

And the irony is the FreeBSD policy of defaulting to ZFS on new installs,
using the complete drive... even when there is only one disk available,
and regardless of the CPU or RAM class. With one USB stick I have around
here, it attempted to use ZFS on one of my laptops. Damned if you do,
damned if you don't comes to mind.

Michelle
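(For completeness: the guided installer still offers UFS interactively,
and a scripted install can pin it explicitly. A rough sketch of an
/etc/installerconfig per bsdinstall(8) - the disk name, sizes, and
distribution list are placeholders, and the exact scriptedpart syntax
should be checked against the manpage for your release:

    # Preamble: scripted partitioning onto a single disk with a UFS root.
    PARTITIONS="ada0 GPT { 4G freebsd-swap, auto freebsd-ufs / }"
    DISTRIBUTIONS="kernel.txz base.txz"

    #!/bin/sh
    # Setup script section, run inside the newly installed system.
    echo 'hostname="laptop"' >> /etc/rc.conf

The scriptedpart target handles the boot partition and bootcode itself, so
only the swap and root layout need to be spelled out here.)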