Nils Goroll
2008-Oct-29 08:17 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
Hi Graham,

(This message was posted on opensolaris-bugs initially; I am CC'ing and reply-to'ing zfs-discuss as it seems to be a more appropriate place to discuss this.)

> I'm surprised to see that the status of bug 6592835 hasn't moved beyond
> "yes that's a problem".

My understanding is that the resilver speed is tied to the fact that the current resilver implementation follows the ZFS on-disk structures, which requires random-like I/O, while a traditional RAID rebuild issues sequential I/O only. Simply put, the former is very slow and the latter very fast relative to the amount of data that has to be touched.

IIRC, this issue has been discussed on zfs-discuss several times already, and my understanding is that it would be very difficult to implement some kind of "sequential resilver" (walking disk blocks sequentially rather than the ZFS on-disk structures), so I doubt an improvement can be expected anytime soon.

That said, the core developers on zfs-discuss will know more than I do and might be willing to give more background on this.

Nils
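To illustrate the difference, here is a toy sketch in Python (not ZFS code; the disk size, block size, seek cost, and streaming rate are all assumptions made up for illustration). It contrasts one sequential pass over a device with visiting the same blocks in whatever order a metadata tree happens to reference them.

import random

DISK_BYTES  = 10**12            # a 1 TB drive (assumed)
BLOCK_BYTES = 128 * 1024        # a common ZFS record size (assumed)
SEEK_SEC    = 0.008             # average seek per random read (assumed)
STREAM_BPS  = 60 * 10**6        # sustained sequential throughput (assumed)

n_blocks = DISK_BYTES // BLOCK_BYTES

# Traditional rebuild: copy the device end to end, one long sequential stream.
sequential_hours = DISK_BYTES / STREAM_BPS / 3600

# Metadata-driven resilver: only allocated blocks are read, but in an order
# that looks random on disk, so each block costs roughly one seek.
visit_order = list(range(n_blocks))
random.shuffle(visit_order)     # stand-in for the tree-traversal order
random_hours = len(visit_order) * SEEK_SEC / 3600

print(f"sequential pass       : {sequential_hours:5.1f} h")
print(f"metadata-order visits : {random_hours:5.1f} h")

With these assumed figures the sequential pass comes out at a few hours and the seek-bound traversal at well over a dozen, which is the gap being described above.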
Graham McArdle
2008-Oct-29 11:39 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
Hi Nils, thanks for the detailed info.

I've tried searching the zfs-discuss archive for both the bug id and 'resilver', but in both cases the only result I can find from the whole history is this thread: http://www.opensolaris.org/jive/thread.jspa?messageID=276358 Maybe the discussions you recall aren't fully indexed for searching on these keywords, or they were in another forum, but thanks for giving me the gist of it.

It is potentially quite an Achilles heel for ZFS, though. I've argued locally to migrate our main online data archive (currently 3.5TB) to ZFS, but if the recovery time for disk failures keeps getting slower as the archive grows and accumulates snapshots etc., some questions might be asked about this policy. I've suggested we can do continuous data replication to a secondary server by sending incremental snapshot streams, but if the CDR had to be suspended for a significant time (days) to allow a resilver, this would be a real problem, at least until the 6343667 bugfix from snv_94 finds its way into a Solaris 10 patch (will this happen?).

Does the severity of the problem depend on the access/write patterns used, i.e. would a simple archiving system where data is only ever added sequentially be less susceptible to slow rebuilds than a system where data is written, snapshotted, moved, modified, deleted, etc.? Does the time taken to scrub the pool give some indication of the likely resilvering time, or does that process walk a different kind of tree?

Graham

--
This message posted from opensolaris.org
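As a rough sketch of what that incremental-snapshot replication could look like, here is plain Python around the standard zfs send -i / zfs receive commands. The dataset name "archive", the host "backuphost", and the hourly interval are placeholders, not anything taken from this thread.

import subprocess
import time

DATASET   = "archive"        # placeholder dataset name
REMOTE    = "backuphost"     # placeholder secondary server
PREV_SNAP = None             # last snapshot already on the secondary

def replicate(label):
    """Take a snapshot and send it (incrementally, after the first run)."""
    global PREV_SNAP
    snap = DATASET + "@" + label
    subprocess.run(["zfs", "snapshot", snap], check=True)
    if PREV_SNAP is None:
        send_cmd = ["zfs", "send", snap]                   # initial full stream
    else:
        send_cmd = ["zfs", "send", "-i", PREV_SNAP, snap]  # incremental stream
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    # Receive on the secondary over ssh; -F rolls the target back if needed.
    subprocess.run(["ssh", REMOTE, "zfs", "receive", "-F", DATASET],
                   stdin=sender.stdout, check=True)
    sender.stdout.close()
    sender.wait()
    PREV_SNAP = snap

# Example: replicate once an hour.
# while True:
#     replicate(time.strftime("%Y%m%d-%H%M"))
#     time.sleep(3600)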
Bob Friesenhahn
2008-Oct-29 12:59 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
On Wed, 29 Oct 2008, Nils Goroll wrote:

> My understanding is that the resilver speed is tied to the fact that the
> current resilver implementation follows the ZFS on-disk structures, which
> requires random-like I/O, while a traditional RAID rebuild issues
> sequential I/O only. Simply put, the former is very slow and the latter
> very fast relative to the amount of data that has to be touched.

If this is indeed an issue, then the pool itself is severely fragmented and will be slow in normal use. This could happen if the pool is allowed to become overly full.

Bob
======================================
Bob Friesenhahn    bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
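A quick way to keep an eye on that is the fill level reported by zpool list. A minimal Python sketch, assuming a hypothetical pool named "tank" and an arbitrary 80% warning threshold:

import subprocess

POOL      = "tank"   # placeholder pool name
THRESHOLD = 80       # warn at this percent full (an assumption, not a ZFS rule)

out = subprocess.run(["zpool", "list", "-H", "-o", "capacity", POOL],
                     capture_output=True, text=True, check=True)
capacity = int(out.stdout.strip().rstrip("%"))

if capacity >= THRESHOLD:
    print(f"{POOL} is {capacity}% full; expect fragmentation and slower resilvers")
else:
    print(f"{POOL} is {capacity}% full")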
Bob Friesenhahn
2008-Oct-29 13:06 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
On Wed, 29 Oct 2008, Graham McArdle wrote:

> Maybe the discussions you recall aren't fully indexed for searching
> on these keywords, or they were in another forum, but thanks for
> giving me the gist of it. It is potentially quite an Achilles heel
> for ZFS, though. I've argued locally to migrate our main online data
> archive (currently 3.5TB) to ZFS, but if the recovery time for disk
> failures keeps getting slower as the archive grows and accumulates
> snapshots etc., some questions might be asked about this policy.

The simple solution is to use a reasonable pool design. Limit the maximum LUN size to one that can be resilvered in a reasonable time. Don't do something silly like building a mirror across two 10TB LUNs. Huge disks take longer to resilver, and a secondary failure is more likely to occur during the resilver. Manage your potential losses by managing the size of a loss, and therefore the time to recover.

Another rule is to control how full a pool is allowed to become. If you fill it to 99% you can be assured that the pool will become fragmented and resilvers will take much longer. The pool will be slower in normal use as well.

Bob
======================================
Bob Friesenhahn    bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
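To put rough numbers on "huge disks take longer": the rates below are assumptions chosen for illustration, not measurements, but they show how the size of a single disk sets the floor for a full rebuild pass.

def rebuild_hours(disk_bytes, mb_per_sec):
    """Hours for one full pass over disk_bytes at a sustained rate."""
    return disk_bytes / (mb_per_sec * 1024**2) / 3600

for size_gb in (250, 500, 1000):
    disk = size_gb * 1000**3          # drive vendors use decimal GB
    print(f"{size_gb:5d} GB disk: "
          f"{rebuild_hours(disk, 60):5.1f} h at 60 MB/s sequential, "
          f"{rebuild_hours(disk, 5):6.1f} h at 5 MB/s seek-bound")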
Graham McArdle
2008-Oct-29 16:48 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
We have a 24-disk server, so the current design is a 2-disk root mirror and 2x 11-disk RAIDZ2 vdevs. I suppose another option would have been 3x 7-disk vdevs plus a hot spare, but then the capacity starts to get compromised (rough numbers below). Using 1TB disks in our current config will give us growth capacity to 16TiB. Obviously 3.5TiB is a small starting point, but we are facing an exponential growth curve. It seems the recommendation is to keep expanding out in disk quantity rather than upgrading disk size to meet growth requirements. Perhaps a 1TB SATA disk is already too big a lump for a ZFS resilver.

It also looks like there is some hope that if "bp rewrite" becomes a reality it will help by allowing online defragmentation: http://www.opensolaris.org/jive/thread.jspa?messageID=186582

NEWS: Actually, I've just noticed that Matt Ahrens has updated the status of this bug to "need more information" and added a comment that the bugfix for 6343667 involved a major rewrite, which might have removed the problem. Has anyone out there got a large zpool running under snv_94 or higher, and if so, have you had to rebuild a disk yet? If this has fixed both bugs then I'm definitely hoping for an early backport to Solaris 10.

--
This message posted from opensolaris.org
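For reference, a quick back-of-envelope comparison of the two layouts mentioned above, assuming 1TB (decimal) drives and ignoring metadata and slop overhead; these are illustrative numbers, not what zpool list would report exactly.

TB_IN_TIB = 1000**4 / 1024**4          # one decimal TB expressed in TiB (~0.909)

def raidz2_usable_tib(disks_per_vdev, vdevs, disk_tb=1.0):
    data_disks = disks_per_vdev - 2    # RAIDZ2 spends two disks per vdev on parity
    return data_disks * vdevs * disk_tb * TB_IN_TIB

print(f"2 x 11-disk RAIDZ2          : {raidz2_usable_tib(11, 2):4.1f} TiB")
print(f"3 x  7-disk RAIDZ2 + spare  : {raidz2_usable_tib(7, 3):4.1f} TiB")

This works out to roughly 16.4 TiB for the current 2x 11-disk layout versus about 13.6 TiB for 3x 7-disk vdevs plus a spare, which is the capacity trade-off described above.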