Even though ZFS is "the last word" in filesystems, is there something more that an application can do when writing large files sequentially to ensure that the data is stored as contiguously as possible? Does this notion even make sense, given that ZFS load-shares large blocks across a set of disks?

It seems that with some filesystems, doing an ftruncate() to the intended length may help, but with ZFS and its copy-on-write semantics, this may actually make the problem worse and slow down the writes.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
As a followup, I see that there is an optional posix_fallocate() function defined in the POSIX standard (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html), with some Linux-related discussion at http://lwn.net/Articles/226710/.

Recent Linux (2.6.23) has implemented this function and some of its filesystems support it (XFS and ext4). I found an rsync list posting here http://www.mail-archive.com/rsync at lists.samba.org/msg20875.html which shows that for the XFS filesystem, there is substantial advantage (in terms of fragmentation) to using it.

Assuming that this functionality is not already in ZFS, ZFS would implement it by pre-allocating all of the requested filesystem blocks, but marking them in such a way that their content is unassigned and therefore the expensive copy-on-write semantics are avoided for the first update. The allocation of the blocks should optimize future read or append access. The system call permanently assigns these blocks to the file even if they are not yet used.

Bob
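For readers who want to try the call described above, here is a minimal sketch in Python, whose os.posix_fallocate() wraps the POSIX function. The helper name write_preallocated and its parameters are made up for illustration. The sketch assumes the platform provides posix_fallocate() and falls back to plain sequential writes when it does not. Whether a given filesystem (ZFS included) actually lays the blocks out contiguously as a result is not something this thread establishes.

```python
import errno
import os


def write_preallocated(path, chunks, total_size):
    """Reserve total_size bytes for path, then write chunks sequentially.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        if total_size > 0 and hasattr(os, "posix_fallocate"):
            try:
                # Ask the filesystem to allocate the whole range up front.
                os.posix_fallocate(fd, 0, total_size)
            except OSError as exc:
                # Assumption: treat "not supported here" as non-fatal and
                # continue without preallocation; re-raise everything else
                # (for example ENOSPC).
                if exc.errno not in (errno.EINVAL, errno.EOPNOTSUPP):
                    raise
        written = 0
        for chunk in chunks:
            view = memoryview(chunk)
            while len(view) > 0:
                # os.write may write fewer bytes than requested.
                n = os.write(fd, view)
                view = view[n:]
                written += n
        return written
    finally:
        os.close(fd)
```

Note that after a successful preallocation the file length is already total_size, so an application that ends up writing less may want to ftruncate() the file back down afterwards.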
On Fri, Jun 13, 2008 at 6:17 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> As a followup, I see that there is an optional posix_fallocate()
> function defined in the POSIX standard
> (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
> With some Linux-related discussion at http://lwn.net/Articles/226710/.
>
> Recent Linux (2.6.23) has implemented this function and some of its
> filesystems support it (XFS and ext4). I found an rsync list posting
> here http://www.mail-archive.com/rsync at lists.samba.org/msg20875.html
> which shows that for the XFS filesystem, there is substantial
> advantage (in terms of fragmentation) to using it.
>
> Assuming that this functionality is not already in ZFS, ZFS would
> implement it by pre-allocating all of the requested filesystem blocks,
> but marking them in such a way that their content is unassigned and
> therefore the expensive copy-on-write semantics are avoided for the
> first update. The allocation of the blocks should optimize future
> read or append access. The system call permanently assigns these
> blocks to the file even if they are not yet used.
>
> Bob

I'll defer to the Team ZFS Gods for corrections - but, in general, the overarching philosophy for ZFS is to work automatically and transparently and not require the user (or user application) to "tell" the underlying filesystem *anything*. IOW - treat it as a "black box" storage sub-system. Currently, ZFS determines if the access pattern is random or sequential and there is no mechanism to provide it with "hints".

Yes - this is a lofty and worthy goal and it would appear, at first blush, that providing a "hints" facility would make sense - but Team ZFS are the ultimate look-to-the-future designers and believe that, if the current implementation is deficient, the next implementation will simply be better. This philosophy means that the user will never have to change a single line of code or a learned (keyboard) behavioral pattern.

Given that machines keep getting faster and that more and more CPU cycles can be dedicated to "letting the machine do the work" - this philosophy is the correct approach IMHO. But it is counterintuitive for those of us used to "explaining" our intent to a dumb, resource-starved computing platform.

Regards,

--
Al Hopper
Logical Approach Inc, Plano, TX
al at logical-approach.com
Voice: 972.379.2133
Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
On Fri, 13 Jun 2008, Al Hopper wrote:
> storage sub-system. Currently, ZFS determines if the access pattern
> is random or sequential and there is no mechanism to provide it with
> "hints".

Right. But this untunable generality may prevent it from being used for real-time uncompressed 2K resolution video server use where it does not take many disk seeks to cause stutters.

> Yes - this is a lofty and worthy goal and it would appear, upon first
> blush, that to provide a "hints" facility would make sense - but Team
> ZFS are the ultimate look-to-the-future designers and believe, that if
> the current implement is deficient, the next implementation will
> simply be better. This philosophy means that the user will never have
> to change a single line of code or a learned (keyboard) behavioral
> pattern.

This POSIX interface simply requests to reserve space in advance from the filesystem. Even if fragmentation is not considered, there is a worthy purpose to this since an error message can be returned more quickly to the user rather than taking minutes or hours. It assures that the task will not fail mid-way due to lack of disk space. Of course when compression is enabled, it is not possible to accurately reserve space.

> Given that machines keep getting faster and that more and more CPU
> cycles can be dedicated to "letting the machine do the work" - this
> philosophy is the correct approach IMHO. But is counter-intuitative

Maybe once all rotating disk drives are replaced with solid state devices, then CPU cycles would make a difference but currently we are facing disk seeks which are not getting much faster.

Bob
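The "fail early instead of mid-way" use of the interface mentioned above can be sketched as follows. This is only an illustration: reserve_or_fail is a made-up helper name, and it assumes a platform where Python exposes os.posix_fallocate(). As noted above, on a filesystem with compression enabled the reservation may not reflect the space the data will really need.

```python
import errno
import os


def reserve_or_fail(fd, size):
    """Try to reserve size bytes for the open file descriptor fd.

    Hypothetical helper: returns True when the space was reserved and
    False when the filesystem reports that it is out of space.
    """
    try:
        os.posix_fallocate(fd, 0, size)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Not enough free space: report it now, before any data
            # has been written.
            return False
        raise
    return True
```

An application could call this right after opening the output file and report the shortage to the user immediately, rather than discovering it after a long write has already run for minutes or hours.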