Michelle Sullivan
http://www.mhix.org/

Sent from my iPad

> On 30 Apr 2019, at 17:10, Andrea Venturoli <ml at netfence.it> wrote:
>
>> On 4/30/19 2:41 AM, Michelle Sullivan wrote:
>>
>> The system was originally built on 9.0 and got upgraded throughout the
>> years... zfsd was not available back then. So I get your point, but
>> maybe you didn't realize this blog was a history of 8+ years?
>
> That's one of the first things I thought about while reading the
> original post: what can be inferred from it is that ZFS might not have
> been that good in the past.
> It *could* still suffer from the same problems or it *could* have
> improved and be more resilient.
> Answering that would be interesting...

Without a doubt it has come a long way, but in my opinion, until there is
a tool to walk the data (to transfer it out), or something that can either
repair or invalidate metadata (such as a spacemap corruption), there is
still a fatal flaw that makes it questionable to use... and that is for
one reason alone (regardless of my current problems.)

Consider..

If one triggers such a fault on a production server, how can one justify
transferring multiple terabytes (or even petabytes now) of data from
backup to repair an unmountable/faulted array... because all backup
solutions I know of currently would take days, if not weeks, to restore
the sort of store ZFS is touted as supporting.

Now, yes, most production environments have multiple backing stores, so
there will be a server or ten to switch to whilst the store is being
recovered, but it still wouldn't be a pleasant experience... not to
mention the possibility that if one store is corrupted, there is a chance
that the other store(s) would also be affected in the same way if they are
in the same DC (e.g. a DC fire, which I have seen)... and if you have
multi-DC stores to protect against that, the size of the pipes between DCs
clearly comes into play.

Thoughts?

Michelle
Hi!

> If one triggers such a fault on a production server, how can one
> justify transferring from backup multiple terabytes (or even petabytes
> now) of data to repair an unmountable/faulted array.... because all
> backup solutions I know currently would take days if not weeks to
> restore the sort of store ZFS is touted with supporting.

Isn't that the problem with all large storage systems? Even mainframe
storage (with very different concepts behind it) can become messed up so
badly that there's no hope left.

-- 
pi at opsec.eu            +49 171 3101372            One year to go !
On 2019-04-30 10:09, Michelle Sullivan wrote:

> Now, yes most production environments have multiple backing stores so
> will have a server or ten to switch to whilst the store is being
> recovered, but it still wouldn't be a pleasant experience... not to
> mention the possibility that if one store is corrupted there is a
> chance that the other store(s) would also be affected in the same way
> if in the same DC... (e.g. a DC fire - which I have seen)... and if you
> have multi-DC stores to protect from that, the size of the pipes between
> DCs clearly comes into play.

I have one customer with about 13T of ZFS - and because it would take a
while to restore (from actual backups), it zfs-sends delta snapshots every
hour to a standby system. That came in handy when we had to rebuild the
system with different HBAs.
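For what it's worth, the mechanics of that are just a periodic incremental
"zfs send | zfs receive" over SSH. Below is a minimal sketch (in Python,
wrapping the zfs commands) of how such an hourly delta replication job
could look. The dataset and host names are made-up placeholders, not the
customer's actual setup, and it assumes SSH key authentication and receive
permissions already exist on the standby.

#!/usr/bin/env python3
# Minimal sketch of hourly ZFS delta replication to a standby host.
# All names below (tank/data, standby.example.com) are placeholders;
# assumes SSH key auth and "zfs receive" rights exist on the standby side.
import datetime
import subprocess

DATASET = "tank/data"                # hypothetical local dataset
REMOTE_HOST = "standby.example.com"  # hypothetical standby host
REMOTE_DATASET = "tank/data"         # hypothetical target dataset on standby

def snapshots_oldest_first():
    """List snapshot names of DATASET, oldest first (by creation time)."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name",
         "-s", "creation", "-d", "1", DATASET],
        check=True, capture_output=True, text=True)
    return out.stdout.split()

def replicate_once():
    # Take a new snapshot named after the current hour.
    snap = f"{DATASET}@hourly-{datetime.datetime.now():%Y%m%d%H}"
    subprocess.run(["zfs", "snapshot", snap], check=True)

    snaps = snapshots_oldest_first()
    if len(snaps) < 2:
        # No earlier snapshot yet: send the full stream once.
        send_cmd = ["zfs", "send", snap]
    else:
        # Send only the delta between the previous snapshot and the new one.
        send_cmd = ["zfs", "send", "-i", snaps[-2], snap]

    # Equivalent of: zfs send ... | ssh standby zfs receive -F tank/data
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(
        ["ssh", REMOTE_HOST, "zfs", "receive", "-F", REMOTE_DATASET],
        stdin=sender.stdout, check=True)
    sender.stdout.close()
    if sender.wait() != 0:
        raise RuntimeError("zfs send failed")

if __name__ == "__main__":
    replicate_once()   # run from cron/periodic once an hour

Scheduled once an hour from cron or periodic, the standby then lags
production by at most an hour, which is the kind of setup described above.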
On 4/30/2019 03:09, Michelle Sullivan wrote:

> Consider..
>
> If one triggers such a fault on a production server, how can one justify
> transferring from backup multiple terabytes (or even petabytes now) of
> data to repair an unmountable/faulted array.... because all backup
> solutions I know currently would take days if not weeks to restore the
> sort of store ZFS is touted with supporting.

Had it happen on a production server a few years back with ZFS. The
*hardware* went insane (disk adapter) and scribbled on *all* of the vdevs.
The machine crashed and would not come back up -- at all. I insist on (and
had) emergency boot media physically in the box (a USB key) in any
production machine, and it was quite quickly obvious that all of the vdevs
were corrupted beyond repair. There was no rational option other than to
restore.

It was definitely not a pleasant experience, but this is why, when you get
into systems and data store sizes where a restore is a five-alarm pain in
the neck, you must figure out some sort of strategy that covers you 99% of
the time without a large amount of downtime involved, and in the 1% case
accept said downtime. In this particular circumstance the customer didn't
want to spend on a doubled-and-transaction-level-protected on-site (in the
same DC) redundancy setup originally, so restoring, as opposed to failing
over/promoting and then restoring and building a new "redundant" box where
the old "primary" resided, was the most viable option. Time to recover
essential functions was ~8 hours (and over 24 hours for everything to be
restored.)

Incidentally, that's not the first time I've had a disk adapter failure on
a production machine in my career as a systems dude; it was, in fact, the
*third* such failure. Then again, I've been doing this stuff since the
1980s and learned long ago that if it can break it eventually will, and
that Murphy is a real b******.

The answer to your question, Michelle, is that when restore times get into
"seriously disruptive" amounts of time (e.g. hours, days or worse,
depending on the application involved and how critical it is), you spend
the time and money to have redundancy in multiple places and via paths
that do not destroy the redundant copies when things go wrong, and you
spend the engineering time to figure out what those potential faults are
and how to design such that a fault which can destroy the data set does
not propagate to the redundant copies before it is detected.

-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/