Nils Goroll
2008-Oct-29  08:17 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
Hi Graham,

(This message was initially posted on opensolaris-bugs. I'm CC'ing and setting reply-to to zfs-discuss, as it seems to be a more appropriate place to discuss this.)

> I'm surprised to see that the status of bug 6592835 hasn't moved beyond "yes that's a problem".

My understanding is that the resilver speed is tied to the fact that the current resilver implementation follows the ZFS on-disk structures, which requires random-like I/O operations, while a traditional RAID rebuild issues sequential I/O only. Simply put, the former is very slow while the latter is very fast with respect to the amount of data that has to be touched.

IIRC, this issue has been discussed on zfs-discuss several times already, and my understanding is that it would be very difficult to implement some kind of "sequential resilver" (walking disk blocks sequentially rather than ZFS on-disk structures), so I doubt that an improvement can be expected anytime soon.

That said, the core developers on zfs-discuss will know more than I do and might be willing to give more background on this.

Nils
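To put rough numbers on the sequential-versus-random contrast Nils describes, here is a back-of-envelope sketch. Every figure in it (80 MB/s sequential throughput, 100 random IOPS, 32 KiB average block size, a half-full 1 TB disk) is an assumption chosen for illustration, not a measurement from this thread, and the one-random-I/O-per-block model is a simplification rather than a description of the actual resilver code.

```python
# Back-of-envelope comparison; every figure below is an assumption, not a measurement.

def sequential_rebuild_hours(disk_bytes, seq_bytes_per_sec):
    """Traditional RAID rebuild: copy the whole disk front to back."""
    return disk_bytes / seq_bytes_per_sec / 3600


def block_walk_resilver_hours(used_bytes, avg_block_bytes, random_iops):
    """Rebuild that follows on-disk structures: modeled as one random I/O per block."""
    blocks = used_bytes / avg_block_bytes
    return blocks / random_iops / 3600


TB = 10**12

# Assumed: 1 TB disk copied sequentially at 80 MB/s.
seq_hours = sequential_rebuild_hours(1 * TB, 80 * 10**6)

# Assumed: 500 GB allocated, 32 KiB average block, 100 random IOPS.
walk_hours = block_walk_resilver_hours(0.5 * TB, 32 * 1024, 100)

print(f"sequential rebuild: {seq_hours:.1f} h")
print(f"block-walk rebuild: {walk_hours:.1f} h")
```

Under these assumed figures the block walk takes roughly twelve times longer even though it touches only half as much data. The model is very sensitive to the average block size and to how well the blocks happen to be laid out on disk, which is consistent with the points about fragmentation raised later in the thread.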
Graham McArdle
2008-Oct-29  11:39 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
Hi Nils,

Thanks for the detailed info. I've tried searching the zfs-discuss archive for both the bug ID and 'resilver', but in both cases the only result I can find from the whole history is this thread:

http://www.opensolaris.org/jive/thread.jspa?messageID=276358

Maybe the discussions you recall aren't fully indexed for searching on these keywords, or they were in another forum, but thanks for giving me the gist of it.

It is potentially quite an Achilles heel for ZFS though. I've argued locally to migrate our main online data archive (currently 3.5TB) to ZFS, but if the recovery time for disk failures keeps getting slower as the archive grows and accumulates snapshots etc., some questions might be asked about this policy. I've suggested we can do continuous data replication to a secondary server by sending incremental snapshot streams, but if the CDR had to be suspended for a significant time (days) to allow a resilver, this would be a real problem, at least until the 6343667 bugfix from snv_94 finds its way into a Solaris 10 patch (will this happen?).

Does the severity of the problem depend on the access / write patterns used? That is, would a simple archiving system where data is only ever added in a sequential fashion be less susceptible to slow rebuilds than a system where data is written, snapshotted, moved, modified, deleted, etc.?

Does the time taken to scrub the pool give some indication of the likely resilvering time, or does that process walk a different kind of tree?

Graham
--
This message posted from opensolaris.org
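For readers unfamiliar with the replication scheme Graham mentions, one cycle of sending an incremental snapshot stream can be sketched as follows. The dataset name `tank/archive`, the snapshot names, the host `backuphost`, and the use of ssh as transport are all hypothetical; the thread does not say how the streams would actually be shipped. The sketch only assembles the command lines and does not run them.

```python
import shlex


def incremental_replication_commands(dataset, prev_snap, new_snap, remote_host):
    """Return shell command lines for one incremental replication cycle.

    Nothing is executed here; the caller decides how to run the commands.
    All names passed in are illustrative placeholders.
    """
    # Take a new snapshot of the dataset.
    take = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    # Send only the changes between the previous and the new snapshot.
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # Receive the stream on the secondary server (ssh transport is an assumption).
    recv = ["ssh", remote_host, "zfs", "receive", dataset]
    return [shlex.join(take), f"{shlex.join(send)} | {shlex.join(recv)}"]


cmds = incremental_replication_commands(
    "tank/archive", "hourly-01", "hourly-02", "backuphost"
)
for line in cmds:
    print(line)
```

The secondary must already hold the previous snapshot for the incremental stream to apply, which is why a long pause in replication is a concern in its own right.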
Bob Friesenhahn
2008-Oct-29  12:59 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
On Wed, 29 Oct 2008, Nils Goroll wrote:

> My understanding is that the resilver speed is tied to the fact that the current
> resilver implementation follows the ZFS on-disk structures, which requires
> random-like I/O operations, while a traditional RAID rebuild issues sequential
> I/O only. Simply put, the former is very slow while the latter is very fast with
> respect to the amount of data that has to be touched.

If this is indeed an issue, then the pool itself is severely fragmented and will be slow in normal use. This could happen if the pool is allowed to become overly full.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2008-Oct-29  13:06 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
On Wed, 29 Oct 2008, Graham McArdle wrote:

> Maybe the discussions you recall aren't fully indexed for searching
> on these keywords or they were in another forum, but thanks for
> giving me the gist of it. It is potentially quite an Achilles heel
> for ZFS though. I've argued locally to migrate our main online data
> archive (currently 3.5TB) to ZFS, but if the recovery time for disk
> failures keeps getting slower as the archive grows and accumulates
> snapshots etc., some questions might be asked about this policy.

The simple solution is to use a reasonable pool design. Limit the maximum LUN size to a size which can be resilvered in reasonable time. Don't do something silly like building a mirror across two 10TB LUNs. Huge disks will take longer to resilver, and a secondary failure is more likely to occur during resilvering. Manage your potential losses by managing the size of a loss, and therefore the time to recover.

Another rule is to control how full a pool is allowed to become. If you fill it to 99% you can be assured that the pool will become fragmented and resilvers will take much longer. The pool will be slower in normal use as well.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Graham McArdle
2008-Oct-29  16:48 UTC
[zfs-discuss] [osol-bugs] what's the story with bug #6592835?
We have a 24-disk server, so the current design is a 2-disk root mirror and 2x 11-disk RAIDZ2 vdevs. I suppose another solution could have been to have 3x 7-disk vdevs plus a hot spare, but the capacity starts to get compromised. Using 1TB disks in our current config will give us growth capacity to 16TiB. Obviously 3.5TiB is a small starting point, but we are facing an exponential growth curve.

It seems like the recommendation is to keep expanding out in disk quantity rather than upgrading disk size to meet growth requirements. Perhaps a 1TB SATA disk is already too big a lump for ZFS resilver.

It looks like there is also some hope that if "bp rewrite" becomes a reality, it will also help by allowing online defragmentation:

http://www.opensolaris.org/jive/thread.jspa?messageID=186582

NEWS: Actually, I've just noticed that Matt Ahrens has updated the status of this bug to "need more information" and added a comment that the bugfix for 6343667 involved a major rewrite, which might have removed the problem. Has anyone out there got a large zpool running under snv_94 or higher, and if so, have you had to rebuild a disk yet? If this has fixed both bugs then I'm definitely hoping for an early backport to Solaris 10.
--
This message posted from opensolaris.org
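The capacity trade-off between the two layouts in the message above can be checked with a little arithmetic. This is a minimal sketch that assumes the 3x 7-disk alternative would also use RAIDZ2 (the message does not say), that a "1TB" disk holds 10^12 bytes, and that metadata and other filesystem overhead can be ignored.

```python
def raidz_usable_tib(vdevs, disks_per_vdev, parity, disk_tb=1.0):
    """Approximate usable capacity in TiB, ignoring metadata and other overhead."""
    data_disks = vdevs * (disks_per_vdev - parity)
    return data_disks * disk_tb * 10**12 / 2**40


# Current design: 2x 11-disk RAIDZ2, so 18 data disks.
current = raidz_usable_tib(vdevs=2, disks_per_vdev=11, parity=2)

# Alternative: 3x 7-disk vdevs, assumed RAIDZ2, so 15 data disks.
alternative = raidz_usable_tib(vdevs=3, disks_per_vdev=7, parity=2)

print(f"2x 11-disk RAIDZ2: {current:.2f} TiB")
print(f"3x 7-disk RAIDZ2:  {alternative:.2f} TiB")
```

Under these assumptions the current layout comes out at about 16.4 TiB, which matches the 16TiB figure quoted, while the three-vdev alternative gives about 13.6 TiB in exchange for narrower vdevs and a hot spare.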