On 5/8/2019 10:14, Michelle Sullivan wrote:
> Paul Mather wrote:
>> On May 8, 2019, at 9:59 AM, Michelle Sullivan <michelle at sorbs.net>
>> wrote:
>>
>>>> Did you have regular pool scrubs enabled?  It would have picked up
>>>> silent data corruption like this.  It does for me.
>>> Yes, every month (once a month because, (1) the data doesn't change
>>> much (new data is added, old is not touched), and (2) because to
>>> complete it took 2 weeks.)
>>
>> Do you also run sysutils/smartmontools to monitor S.M.A.R.T.
>> attributes?  Although imperfect, it can sometimes signal trouble
>> brewing with a drive (e.g., increasing Reallocated_Sector_Ct and
>> Current_Pending_Sector counts) that can lead to proactive remediation
>> before catastrophe strikes.
> Not automatically.
>>
>> Unless you have been gathering periodic drive metrics, you have no
>> way of knowing whether these hundreds of bad sectors have happened
>> suddenly or slowly over a period of time.
> No, it's something I have thought about but been unable to spend the
> time on.

There are two issues here that would concern me greatly and that IMHO
you should address.

I have a system here with about the same amount of net storage on it as
you did.  It runs scrubs regularly; none of them take more than 8 hours
on *any* of the pools.  The SSD-based pool is of course *much* faster,
but even the many-way RaidZ2 on spinning rust is an ~8 hour deal; it
kicks off automatically at 2:00 AM when the time comes and is complete
before noon.  I run them on 14-day intervals.

If you have pool(s) that are taking *two weeks* to run a scrub, IMHO
either something is badly wrong or you need to rethink the organization
of the pool structure -- that is, IMHO you likely have either a severe
performance problem with one or more members or an architectural
problem that you *really* need to determine and fix.  If a scrub takes
two weeks *then a resilver could conceivably take that long as well*,
and that's *extremely* bad, as the window for getting screwed is at its
worst while a resilver is running.

Second, smartmontools/smartd isn't the be-all, end-all, but it *does*
sometimes catch incipient problems with specific units before they turn
into all-out death, and IMHO in any installation of any material size
where one cares about the data (as opposed to "if it fails, just
restore it from backup") it should be running.  It's very easy to set
up and there are no real downsides to using it.  I have one disk that I
rotate in and out that was bought as a "refurb" and has 70 permanently
reallocated sectors on it.  It has never grown another one since I
acquired it, but every time it goes in the machine, within minutes I
get an alert on that.  If I were ever to get *71*, or a *different*
drive grew a new one, said drive would be replaced *instantly*.  Over
the years it has flagged two disks before they "hard failed", and both
were immediately taken out of service, replaced, and then destroyed and
thrown away.  Maybe that's me being paranoid, but IMHO it's the correct
approach to such notifications.
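For readers wanting to replicate that setup, here is a minimal sketch
of a smartd configuration along those lines (the device names, mail
address, and test schedule are examples, not anything from Karl's
actual config; the FreeBSD port installs its config as
/usr/local/etc/smartd.conf):

  # /usr/local/etc/smartd.conf -- minimal example
  # -a       monitor health status and all SMART attributes
  # -m root  mail root on failure or new errors
  # -s (...) short self-test daily at 02:00, long test Saturdays 03:00
  # -C 197+ / -U 198+  alert only when the pending / offline-
  #          uncorrectable sector counts *increase* (the "71st
  #          sector" case described above)
  /dev/da0 -a -m root -s (S/../.././02|L/../../6/03) -C 197+ -U 198+
  /dev/da1 -a -m root -s (S/../.././02|L/../../6/03) -C 197+ -U 198+

Then add smartd_enable="YES" to /etc/rc.conf so the daemon starts at
boot.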
BTW, that tool will *also* tell you if something else software-wise is
going on that you *might* think is drive-related.  For example,
recently here on the list I ran into a really oddball thing happening
with SAS expanders that showed up with 12-STABLE and was *not* present
in the same box with 11.1.  Smartmontools confirmed that while the
driver was reporting errors from the disks, *the disks themselves were
not in fact taking errors.*  Had I not had that information I might
well have traveled down a road that led to a catastrophic pool failure
by attempting to replace disks that weren't actually bad.  The SAS
expander wound up being taken out of service and replaced with an HBA
that has more ports -- and the issues disappeared.

Finally, while you *think* you only have a metadata problem, I'm with
the other people here in expressing disbelief that the damage is
limited to that.  There is enough redundancy in the metadata on ZFS
that if *all* copies are destroyed, or are inconsistent to the degree
that they're unusable, it's extremely likely that if you do get some
sort of "disaster recovery" tool working, you're going to find out that
what you thought was a metadata problem is really a "you're hosed; the
data is also gone" sort of problem.

-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
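A quick illustration of the kind of cross-check Karl describes (the
device name da0 is just an example): compare what the driver is
reporting against the drive's own error log.

  # What the OS/driver thinks is happening on da0
  dmesg | grep da0

  # What the drive itself has logged: overall health assessment plus
  # its internal SMART error log
  smartctl -H -l error /dev/da0

If the driver is reporting transport errors while smartctl shows a
passing health status and "No Errors Logged", suspect the path
(expander, cabling, backplane, HBA) rather than the disk itself.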
On Wed, May 8, 2019 at 9:31 AM Karl Denninger <karl at denninger.net>
wrote:
> I have a system here with about the same amount of net storage on it
> as you did.  It runs scrubs regularly; none of them take more than 8
> hours on *any* of the pools.  The SSD-based pool is of course *much*
> faster but even the many-way RaidZ2 on spinning rust is an ~8 hour
> deal; it kicks off automatically at 2:00 AM when the time comes but
> is complete before noon.  I run them on 14 day intervals.

Damn, I wish our scrubs took 8 hours.  :)

Storage pool 1: 90 drives in 6-disk raidz2 vdevs (mix of 2 TB and 4 TB
SATA).  45 hours to scrub.

Storage pool 2: 90 drives in 6-disk raidz2 vdevs (mix of 2 TB and 4 TB
SATA).  33 hours to scrub.

Storage pool 3: 24 drives in 6-disk raidz2 vdevs (mix of 2 TB and 4 TB
SATA).  134 hours to scrub.

Storage pool 4: 24 drives in 6-disk raidz2 vdevs (mix of 1 TB, 2 TB,
and 4 TB SATA).  Dedupe enabled.  256 hours to scrub.

Storage pool 5: 90 drives in 6-disk raidz2 vdevs (mix of 2 TB and 4 TB
SATA).  Dedupe enabled.  Takes about 6 weeks to resilver a drive, and
it's constantly resilvering drives these days as it's the oldest pool
and all the drives are dying.  :D

Pools 1, 3, and 4 are in DC1.  Pools 2 and 5 are in DC2 across town.
Pool 1 sends snapshots to pool 2.  Pools 3 and 4 send snapshots to
pool 5.  These pools are highly fragmented.  :)

> If you have pool(s) that are taking *two weeks* to run a scrub, IMHO
> either something is badly wrong or you need to rethink the
> organization of the pool structure -- that is, IMHO you likely have
> either a severe performance problem with one or more members or an
> architectural problem that you *really* need to determine and fix.
> If a scrub takes two weeks *then a resilver could conceivably take
> that long as well*, and that's *extremely* bad, as the window for
> getting screwed is at its worst while a resilver is running.

Thankfully, ours are strictly storage for backups of other systems, so
as long as the nightly backups complete successfully before 6 AM,
we're not worried about performance.  :)  And we do have plans to
replace pools 2 and 5 to remove dedupe from the equation.  There's not
a lot we can do about the fragmentation issue, as these servers all
run rsync backups from 200-odd other servers and remove the oldest
snapshot every night.

So, while a 2-week scrub may be horrible, it all depends on the use
case.  If these were direct storage systems for in-production servers,
then I'd be worried.  But as redundant backup systems (3 copies of
everything, in 3 separate locations around the city), I'm not too
worried.  Yet.  :D

-- 
Freddie Cash
fjwcash at gmail.com
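For anyone wanting to see where their own pools sit on this spectrum,
the numbers being discussed here are all visible from the command line
(the pool name "tank" is a placeholder):

  # Per-pool size, fill level, fragmentation, and dedup ratio
  zpool list -o name,size,allocated,fragmentation,capacity,dedupratio

  # Scrub/resilver progress and estimated completion time
  zpool status tank | grep -A 2 scan:

  # Dedup table histogram -- a large DDT is a big part of what makes
  # scrubs and resilvers on deduped pools so slow
  zpool status -D tank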