Hello all,

I understand that relatively high fragmentation is inherent to ZFS due to its COW and possible intermixing of metadata and data blocks (of which metadata path blocks are likely to expire and get freed relatively quickly).

I believe it was sometimes implied on this list that such fragmentation for "static" data can currently be combatted only by zfs send-ing existing pools' data to other pools on some reserved hardware, and then clearing the original pools and sending the data back. This is time-consuming, disruptive and requires lots of extra storage idling for this task (or at best - for backup purposes).

I wonder how resilvering works, namely - does it write blocks "as they were" or in an optimized (defragmented) fashion, in two use cases:
1) Resilvering from a healthy array (vdev) onto a spare drive in order to replace one of the healthy drives in the vdev;
2) Resilvering a degraded array from existing drives onto a new drive in order to repair the array and make it redundant again.

Also, are these two modes different at all? I.e. if I were to ask ZFS to replace a working drive with a spare in case (1), can I do it at all, and would its data simply be copied over, or reconstructed from other drives, or some mix of these two operations?

Finally, what would the gurus say - does fragmentation pose a heavy problem on nearly-filled-up pools made of spinning HDDs (I believe so, at least judging from those performance degradation problems writing to 80+%-filled pools), and can fragmentation be effectively combatted on ZFS at all (with or without BP rewrite)?

For example, can (does?) metadata live "separately" from data in some "dedicated" disk areas, while data blocks are written as contiguously as they can be?

Many Windows defrag programs group files into several "zones" on the disk based on their last-modify times, so that old WORM files remain defragmented for a long time. There are thus some empty areas reserved for new writes as well as for moving newly discovered WORM files to the WORM zones (free space permitting)... I wonder if this is viable with ZFS (COW and snapshots involved) when BP rewrites are implemented? Perhaps such zoned defragmentation could be done based on block creation date (TXG number) and the knowledge that some blocks in a certain order comprise at least one single file (maybe more due to clones and dedup) ;)

What do you think? Thanks,
//Jim Klimov
Edward Ned Harvey
2012-Jan-07 15:34 UTC
[zfs-discuss] zfs defragmentation via resilvering?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I understand that relatively high fragmentation is inherent
> to ZFS due to its COW and possible intermixing of metadata
> and data blocks (of which metadata path blocks are likely
> to expire and get freed relatively quickly).
>
> I believe it was sometimes implied on this list that such
> fragmentation for "static" data can be currently combatted
> only by zfs send-ing existing pools data to other pools at
> some reserved hardware, and then clearing the original pools
> and sending the data back. This is time-consuming, disruptive
> and requires lots of extra storage idling for this task (or
> at best - for backup purposes).

It can be combated by sending & receiving, but that's not the only way. You can defrag (or apply/remove dedup and/or compression, or any of the other stuff that's dependent on BP rewrite) by using any technique which sequentially reads the existing data and writes it back to disk again. For example, if you "cp -p file1 file2 && mv file2 file1", then you have effectively defragged file1 (or added/removed dedup or compression). But of course it's a requirement that file1 is sufficiently "not being used" right now.

> I wonder how resilvering works, namely - does it write
> blocks "as they were" or in an optimized (defragmented)
> fashion, in two usecases:

Resilver goes in temporal order. While this might sometimes yield a slightly better organization (if a whole bunch of small writes were previously spread out over a large period of time on a largely idle system, they will now be write-aggregated to sequential blocks), usually resilvering recreates fragmentation similar to the pre-existing fragmentation.

In fact, even if you zfs send | zfs receive while preserving snapshots, you're still recreating the data in something like temporal order, because it will do all the blocks of the oldest snapshot, and then all the blocks of the second oldest snapshot, etc. So by preserving the old snapshots, you might sometimes be recreating a significant amount of fragmentation anyway.

> 1) Resilvering from a healthy array (vdev) onto a spare drive
> in order to replace one of the healthy drives in the vdev;
> 2) Resilvering a degraded array from existing drives onto a
> new drive in order to repair the array and make it redundant
> again.

Same behavior either way. Unless... if your old disks are small and very full, and your new disks are bigger, then sometimes in the past you may have suffered fragmentation due to lack of available sequential unused blocks. So resilvering onto new *larger* disks might make a difference.

> Finally, what would the gurus say - does fragmentation
> pose a heavy problem on nearly-filled-up pools made of
> spinning HDDs

Yes. But that's not unique to ZFS or COW. No matter what your system, if your disk is nearly full, you will suffer from fragmentation.

> and can fragmentation be effectively combatted
> on ZFS at all (with or without BP rewrite)?

With BP rewrite, yes, you can effectively combat fragmentation. Unfortunately it doesn't exist. :-/

Without BP rewrite... Define "effectively." ;-) I have successfully defragged, compressed, and enabled/disabled dedup on pools before, by using zfs send | zfs receive... Or by asking users, "Ok, we're all in agreement, this weekend, nobody will be using the "a" directory. Right?" So then I sudo rm -rf a, and restore from the latest snapshot. Or something along those lines. Next weekend, we'll do the "b" directory...
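A minimal Python sketch of the copy-and-rename rewrite described above, assuming the file is quiescent while it runs; the paths and temp-file name are placeholders, not anything from the thread:

    # Hedged sketch of the "cp -p file1 file2 && mv file2 file1" idea:
    # sequentially re-reading a file and writing it back makes ZFS
    # allocate fresh (hopefully more contiguous) blocks for it, under
    # whatever compression/dedup settings are currently active.
    # Only safe if nothing else has the file open; note that blocks
    # still referenced by existing snapshots are not freed by this.
    import os
    import shutil

    def rewrite_in_place(path):
        tmp = path + ".rewrite-tmp"       # placeholder temp name
        shutil.copy2(path, tmp)           # copy data plus permissions/times, like cp -p
        os.replace(tmp, path)             # atomic rename over the original, like mv

    rewrite_in_place("/tank/data/file1")  # placeholder path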
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2012-Jan-07 16:10 UTC
[zfs-discuss] zfs defragmentation via resilvering?
It seems that S11 shadow migration can help :-)

On 1/7/2012 9:50 AM, Jim Klimov wrote:
> Hello all,
>
> I understand that relatively high fragmentation is inherent
> to ZFS due to its COW and possible intermixing of metadata
> and data blocks [...]

--
Hung-Sheng Tsao Ph.D.
Founder & Principal, HopBit GridComputing LLC
cell: 9734950840
http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/
On Sat, 7 Jan 2012, Jim Klimov wrote:

> I understand that relatively high fragmentation is inherent
> to ZFS due to its COW and possible intermixing of metadata
> and data blocks (of which metadata path blocks are likely
> to expire and get freed relatively quickly).

To put things in proper perspective, with 128K filesystem blocks, the worst case file fragmentation as a percentage is 0.39% (100*1/((128*1024)/512)). On a Microsoft Windows system, the defragger might suggest that defragmentation is not warranted at this percentage level.

> Finally, what would the gurus say - does fragmentation
> pose a heavy problem on nearly-filled-up pools made of
> spinning HDDs (I believe so, at least judging from those
> performance degradation problems writing to 80+%-filled
> pools), and can fragmentation be effectively combatted
> on ZFS at all (with or without BP rewrite)?

There are different types of fragmentation. The fragmentation which causes a slowdown when writing to an almost full pool is fragmentation of the free list/area (causing zfs to take longer to find free space to write to), as opposed to fragmentation of the files themselves. The files themselves will still not be fragmented any more severely than the zfs blocksize.

However, there are seeks and there are *seeks*, and some seeks take longer than others, so some forms of fragmentation are worse than others. When the free space is fragmented into smaller blocks, there is necessarily more file fragmentation when the file is written.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
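For reference, the arithmetic behind that figure as a small Python sketch; the 512-byte sector size is the assumption built into the formula, and the 8K case quoted later in the thread falls out of the same calculation:

    # Worst-case file fragmentation percentage as defined above:
    # at most one discontinuity per filesystem block, expressed
    # against the number of disk sectors per block.
    def worst_case_fragmentation_pct(recordsize_bytes, sector_bytes=512):
        return 100.0 / (recordsize_bytes / sector_bytes)

    print(worst_case_fragmentation_pct(128 * 1024))  # 0.390625 for 128K records
    print(worst_case_fragmentation_pct(8 * 1024))    # 6.25 for 8K records (see below)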
Edward Ned Harvey
2012-Jan-09 13:44 UTC
[zfs-discuss] zfs defragmentation via resilvering?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> To put things in proper perspective, with 128K filesystem blocks, the
> worst case file fragmentation as a percentage is 0.39%
> (100*1/((128*1024)/512)). On a Microsoft Windows system, the
> defragger might suggest that defragmentation is not warranted for this
> percentage level.

I don't think that's correct...

Suppose you write a 1G file to disk. It is a database store. Now you start running your db server. It starts performing transactions all over the place. It overwrites the middle 4k of the file, and it overwrites 512b somewhere else, and so on. Since this is COW, each one of these little writes in the middle of the file will actually get mapped to unused sectors of disk. Depending on how quickly they're happening, they may be aggregated as writes... But that's not going to help the sequential read speed of the file later, when you stop your db server and try to sequentially copy your file for backup purposes.

In the pathological worst case, you would write a file that takes up half of the disk. Then you would snapshot it, and overwrite it in random order, using the smallest possible block size. Now your disk is 100% full, and if you read that file, you will be performing worst-case random I/O spanning 50% of the total disk space. Granted, this is not a very realistic case, but it is the worst case, and it's really really really bad for read performance.
On Jan 9, 2012, at 5:44 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>>
>> To put things in proper perspective, with 128K filesystem blocks, the
>> worst case file fragmentation as a percentage is 0.39%
>> (100*1/((128*1024)/512)). On a Microsoft Windows system, the
>> defragger might suggest that defragmentation is not warranted for this
>> percentage level.
>
> I don't think that's correct...
> Suppose you write a 1G file to disk. It is a database store. Now you start
> running your db server. It starts performing transactions all over the
> place. It overwrites the middle 4k of the file, and it overwrites 512b
> somewhere else, and so on.

It depends on the database, but many (e.g. Oracle Database) are COW and write fixed block sizes, so your example does not apply.

> Since this is COW, each one of these little
> writes in the middle of the file will actually get mapped to unused sectors
> of disk. Depending on how quickly they're happening, they may be aggregated
> as writes... But that's not going to help the sequential read speed of the
> file, later when you stop your db server and try to sequentially copy your
> file for backup purposes.

Those who expect to get sequential performance out of HDDs usually end up being sad :-(

Interestingly, if you run Oracle Database on top of ZFS on top of SSDs, then you have COW over COW over COW. Now all we need is a bull! :-)
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
On Mon, 9 Jan 2012, Edward Ned Harvey wrote:

> I don't think that's correct...

But it is! :-)

> Suppose you write a 1G file to disk. It is a database store. Now you start
> running your db server. It starts performing transactions all over the
> place. It overwrites the middle 4k of the file, and it overwrites 512b
> somewhere else, and so on. Since this is COW, each one of these little
> writes in the middle of the file will actually get mapped to unused sectors
> of disk. Depending on how quickly they're happening, they may be aggregated

Oops. I see an error in the above. Other than tail blocks, or due to compression, zfs will not write a COW data block smaller than the zfs filesystem blocksize. If the blocksize was 128K, then updating just one byte in that 128K block results in writing a whole new 128K block. This is pretty significant write amplification, but the resulting fragmentation is still limited by the 128K block size. Remember that any fragmentation calculation needs to be based on the disk's minimum read (i.e. sector) size.

However, it is worth remembering that it is common to set the block size to a much smaller value than the default (e.g. 8K) if the filesystem is going to support a database. In that case it is possible for there to be fragmentation for every 8K of data. The worst case fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25% (100*1/((8*1024)/512)). That would be a high enough percentage that Microsoft Windows defrag would recommend defragging the disk.

Metadata chunks can not be any smaller than the disk's sector size (e.g. 512 bytes or 4K bytes). Metadata can be seen as contributing to fragmentation, which is why it is so valuable to cache it. If the metadata is not conveniently close to the data, then it may result in a big ugly disk seek (same impact as data fragmentation) to read it.

In summary, with zfs's default 128K block size, data fragmentation is not a significant issue. If the zfs filesystem block size is reduced to a much smaller value (e.g. 8K) then it can become a significant issue. As Richard Elling points out, a database layered on top of zfs may already be fragmented by design.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-09 19:14, Bob Friesenhahn wrote:

> In summary, with zfs's default 128K block size, data fragmentation is
> not a significant issue. If the zfs filesystem block size is reduced to
> a much smaller value (e.g. 8K) then it can become a significant issue.
> As Richard Elling points out, a database layered on top of zfs may
> already be fragmented by design.

I THINK there is some fallacy in your discussion: I've seen 128K referred to as the maximum filesystem block size, i.e. for large "streaming" writes. For smaller writes ZFS adapts with smaller blocks. I am not sure how it would rewrite a few bytes inside a larger block - split it into many smaller ones, or COW all 128K. Intermixing variable-sized indivisible blocks can in turn lead to more fragmentation than would otherwise be expected/possible ;) Fixed block sizes are used (only?) for volume datasets.

> If the metadata is not conveniently close to the data, then it may
> result in a big ugly disk seek (same impact as data fragmentation)
> to read it.

Also I'm not sure about this argument. If VDEV prefetch does not slurp in data blocks, then by the time metadata is discovered in read-from-disk blocks and the data block locations are determined, the disk may have rotated away from the head, so at least one rotational delay is incurred even if metadata is immediately followed by its referred data... no?

//Jim
Edward Ned Harvey
2012-Jan-13 04:15 UTC
[zfs-discuss] zfs defragmentation via resilvering?
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> > Suppose you write a 1G file to disk. It is a database store. Now you start
> > running your db server. It starts performing transactions all over the
> > place. It overwrites the middle 4k of the file, and it overwrites 512b
> > somewhere else, and so on. Since this is COW, each one of these little
> > writes in the middle of the file will actually get mapped to unused sectors
> > of disk. Depending on how quickly they're happening, they may be
> > aggregated
>
> Oops. I see an error in the above. Other than tail blocks, or due to
> compression, zfs will not write a COW data block smaller than the zfs
> filesystem blocksize. If the blocksize was 128K, then updating just
> one byte in that 128K block results in writing a whole new 128K block.

Before anything else, let's define what "fragmentation" means in this context, or more importantly, why anyone would care. Fragmentation, in this context, is a measurement of how many blocks exist sequentially aligned on disk, such that a sequential read will not suffer a seek/latency penalty. So the reason somebody would care is a function of performance - disk work payload versus disk work wasted in overhead time.

But wait! There are different types of reads. If you read using a scrub or a zfs send, then it will read the blocks in temporal order, so anything which was previously write-coalesced (even from many different files) will again be read-coalesced (which is nice). But if you read a file using something like tar or cp or cat, then it reads the file in sequential file order, which would be different from temporal order unless the file was originally written sequentially and never overwritten by COW.

Suppose you have a 1G file open, and a snapshot of this file is on disk from a previous point in time.

for ( i=0 ; i<1trillion ; i++ ) {
    seek(random integer in range [0 to 1G]);
    write(4k);
}

Something like this would quickly write a bunch of separate and scattered 4k blocks at different offsets within the file. Every 32 of these 4k writes would be write-coalesced into a single 128k on-disk block.

Sometime later, you read the whole file sequentially, such as with cp or tar or cat. The first 4k come from this 128k block... The next 4k come from another 128k block... The next 4k come from yet another 128k block... Essentially, the file has become very fragmented and scattered about on the physical disk. Every 4k read results in a random disk seek.

> The worst case
> fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25%
> ((100*1/((8*1024)/512))).

You seem to be assuming that reading a 512b disk sector and its neighboring 512b sector count as contiguous blocks. And since there are guaranteed to be exactly 256 sectors in every 128k filesystem block, there is no fragmentation for 256 contiguous sectors, guaranteed. Unfortunately, the 512b sector size is just an arbitrary number (and variable - actually 4k on modern disks), and the resultant percentage of fragmentation is equally arbitrary.

To produce a number that actually matters, what you need to do is calculate the percentage of time the disk is able to deliver payload, versus the percentage of time the disk is performing time-wasting "overhead" operations - seek and latency. Suppose your disk speed is 1Gbit/sec while actively engaging the head, and suppose the average random access (seek & latency) is 10ms. Suppose you wish for 99% efficiency.
The 10ms must be 1% of the time, and the head must be engaged for 99% of the time, which is 990ms, which is very nearly 1Gbit, or approximately 123MB of sequential data for every random disk access. You need 123MB of sequential payload for every random disk access. That's 944 times larger than the largest 128k block size currently in zfs, and obviously larger still compared to what you mentioned - 4k or 8k recordsizes or 512b disk sectors...

Suppose you have 128k blocks written to disk, all scattered about in random order. Your disk must seek & rotate for 10ms, and then it will be engaged for 1.3ms reading the 128k, and then it will seek & rotate again for 10ms... I would call that 13% payload and 87% wasted time. Fragmentation at this level hurts you really badly.

Suppose there is a TXG flush every 5 seconds. You write a program which will write a single byte to disk once every 5.1 seconds. Then you leave that program running for a very very long time. You now have millions of 128k blocks written on disk, scattered about in random order. You start a scrub. It will read 128k, and then random seek, and then read 128k, etc. I would call that 100% fragmentation, because there are no contiguously aligned sequential blocks on disk anywhere.

But again, any measure of "percent fragmentation" is purely arbitrary, unless you know (a) which type of read behavior is being measured (temporal or file order), (b) the sequential engaged disk speed, and (c) the average random access time.
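Working through that arithmetic in a short Python sketch; the 1Gbit/sec transfer rate and 10ms access time are the assumed figures from the message above, not measurements:

    # Payload vs. overhead arithmetic from the message above.
    # Assumed figures: 1 Gbit/s transfer rate while the head is
    # engaged, 10 ms average random access (seek + latency).
    SEQ_RATE_BITS_PER_S = 1e9
    RANDOM_ACCESS_S = 0.010

    def payload_bytes_per_seek(target_efficiency):
        """Sequential bytes needed per random access so that seeking
        wastes only (1 - target_efficiency) of the total time."""
        engaged_s = RANDOM_ACCESS_S * target_efficiency / (1 - target_efficiency)
        return engaged_s * SEQ_RATE_BITS_PER_S / 8

    def payload_fraction(chunk_bytes):
        """Fraction of time spent transferring data when every chunk
        of this size costs one random access."""
        engaged_s = chunk_bytes * 8 / SEQ_RATE_BITS_PER_S
        return engaged_s / (engaged_s + RANDOM_ACCESS_S)

    print(payload_bytes_per_seek(0.99) / 1e6)  # ~123.75 MB per seek for 99% efficiency
    print(payload_fraction(128 * 1024))        # ~0.095, the ballpark of the ~13% above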
On Thu, 12 Jan 2012, Edward Ned Harvey wrote:

> Suppose you have a 1G file open, and a snapshot of this file is on disk from
> a previous point in time.
> for ( i=0 ; i<1trillion ; i++ ) {
>     seek(random integer in range [0 to 1G]);
>     write(4k);
> }
>
> Something like this would quickly write a bunch of separate and
> scattered 4k blocks at different offsets within the file. Every 32 of these
> 4k writes would be write-coalesced into a single 128k on-disk block.
>
> Sometime later, you read the whole file sequentially, such as with cp or tar or
> cat. The first 4k come from this 128k block... The next 4k come from
> another 128k block... The next 4k come from yet another 128k block...
> Essentially, the file has become very fragmented and scattered about on the
> physical disk. Every 4k read results in a random disk seek.

Are you talking about some other filesystem, or are you talking about zfs? Because zfs does not work like that ...

However, I did ignore the additional fragmentation due to using raidz type formats. These break the 128K block into smaller chunks and so there can be more fragmentation.

>> The worst case
>> fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25%
>> ((100*1/((8*1024)/512))).
>
> You seem to be assuming that reading a 512b disk sector and its neighboring
> 512b sector count as contiguous blocks. And since there are guaranteed to
> be exactly 256 sectors in every 128k filesystem block, there is no
> fragmentation for 256 contiguous sectors, guaranteed. Unfortunately, the
> 512b sector size is just an arbitrary number (and variable, actually 4k on
> modern disks), and the resultant percentage of fragmentation is equally
> arbitrary.

Yes, I am saying that zfs writes its data in contiguous chunks (filesystem blocksize in the case of mirrors).

> To produce a number that actually matters, what you need to do is calculate
> the percentage of time the disk is able to deliver payload, versus the
> percentage of time the disk is performing time-wasting "overhead" operations
> - seek and latency.

Yes, latency is the critical factor.

> That's 944 times larger than the largest 128k block size currently in zfs,
> and obviously larger still compared to what you mentioned - 4k or 8k
> recordsizes or 512b disk sectors...

Yes, fragmentation is still important even with 128K chunks.

> I would call that 100% fragmentation, because there are no contiguously
> aligned sequential blocks on disk anywhere. But again, any measure of
> "percent fragmentation" is purely arbitrary, unless you know (a) which type

I agree that the notion of percent fragmentation is arbitrary. I used one that I invented, which is based on underlying disk sectors rather than filesystem blocks.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey
2012-Jan-16 02:12 UTC
[zfs-discuss] zfs defragmentation via resilvering?
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> On Thu, 12 Jan 2012, Edward Ned Harvey wrote:
> > Suppose you have a 1G file open, and a snapshot of this file is on disk from
> > a previous point in time.
> > for ( i=0 ; i<1trillion ; i++ ) {
> >     seek(random integer in range [0 to 1G]);
> >     write(4k);
> > }
> >
> > Something like this would quickly write a bunch of separate and
> > scattered 4k blocks at different offsets within the file. Every 32 of these
> > 4k writes would be write-coalesced into a single 128k on-disk block.
> >
> > Sometime later, you read the whole file sequentially, such as with cp or tar or
> > cat. The first 4k come from this 128k block... The next 4k come from
> > another 128k block... The next 4k come from yet another 128k block...
> > Essentially, the file has become very fragmented and scattered about on
> > the physical disk. Every 4k read results in a random disk seek.
>
> Are you talking about some other filesystem or are you talking about
> zfs? Because zfs does not work like that ...

In what way? I've only described the behavior of COW and write coalescing. Which part are you saying is un-ZFS-like?

Before answering, let's do some test work: Create a new pool, with a single disk, no compression or dedup or anything, called "junk". Run this script. All it does is generate some data in a file sequentially, and then randomly overwrite random pieces of the file in random order, creating snapshots all along the way... many times, until the file has been completely overwritten many times over. This should be a fragmentation nightmare.

http://dl.dropbox.com/u/543241/fragmenter.py

Then reboot, to ensure the cache is clear. And see how long it takes to sequentially read the original sequential file, as compared to the highly fragmented one:

cat /junk/.zfs/snapshot/sequential-before/out.txt | pv > /dev/null ; cat /junk/.zfs/snapshot/random1399/out.txt | pv > /dev/null

While I'm waiting for this to run, I'll make some predictions: The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading the initial sequential file should take ~16 sec. After fragmentation, it should be essentially random 4k fragments (32768 bits). I figure each time the head is able to find useful data, it takes 32us to read the 4kb, followed by 10ms random access time... The disk is doing useful work 0.3% of the time and wasting 99.7% of the time doing random seeks. It should take about 300x longer to read the fragmented file.

... (Ding!) ... The test is done. Thank you for patiently waiting during this time warp. ;-)

Actual result: 15s and 45s. So it was 3x longer, not 300x. Either way it proves the point - but I want to see results that are at least 100x worse due to fragmentation, to REALLY drive home the point that fragmentation matters.

I hypothesize that the mere 3x performance degradation is because I have only a single 2G file in a 2T pool, and no other activity, and no other files. So all my supposedly randomly distributed data might reside very close together on the platter... The combination of short stroking & the read prefetcher could be doing wonders in this case.

So now I'll repeat that test, but this time... I allow the sequential data to be written sequentially again just like before, but after it starts the random rewriting, I'll run a couple of separate threads writing and removing other junk to the pool, so the write coalescing will include other files, spread more across a larger percentage of the total disk, getting closer to the worst-case random distribution on disk... (destroy & recreate the pool in between test runs...)

Actual result: 15s and 104s. So it's only a 6.9x performance degradation. That's the worst I can do without hurting myself. It proves the point, but not to the magnitude that I expected.
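The linked fragmenter.py is not reproduced in the archive; purely as an illustration of the workload described above (sequential write, then random 4k overwrites with periodic snapshots), a hedged sketch might look like the following. The pool name, file path, and loop counts are assumptions for illustration, not the actual script:

    #!/usr/bin/env python
    # Hypothetical sketch of the workload described above: write a file
    # sequentially, then overwrite random 4k pieces of it in random
    # order, snapshotting along the way. This is NOT the actual
    # fragmenter.py from the link; names, sizes and counts are assumed.
    import os
    import random
    import subprocess

    POOL = "junk"                        # assumed pool name from the test
    PATH = "/%s/out.txt" % POOL          # assumed file path
    FILE_SIZE = 2 * 1024**3              # 2 GB file, as in the test above
    CHUNK = 4096                         # 4k random overwrites

    def snapshot(name):
        subprocess.check_call(["zfs", "snapshot", "%s@%s" % (POOL, name)])

    # Phase 1: sequential write.
    with open(PATH, "wb") as f:
        for _ in range(FILE_SIZE // (1024 * 1024)):
            f.write(os.urandom(1024 * 1024))
    snapshot("sequential-before")

    # Phase 2: scattered COW overwrites; each snapshot pins the old
    # block versions so they cannot be reused.
    offsets = list(range(0, FILE_SIZE, CHUNK))
    with open(PATH, "r+b") as f:
        for i in range(1400):            # matches snapshot names random0..random1399
            for off in random.sample(offsets, 2000):   # assumed batch size
                f.seek(off)
                f.write(os.urandom(CHUNK))
            f.flush()
            os.fsync(f.fileno())
            snapshot("random%d" % i)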
On Sun, 15 Jan 2012, Edward Ned Harvey wrote:

> While I'm waiting for this to run, I'll make some predictions:
> The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading
> the initial sequential file should take ~16 sec.
> After fragmentation, it should be essentially random 4k fragments (32768
> bits). I figure each time the head is able to find useful data, it takes

The 4k fragments is the part I don't agree with. Zfs does not do that. If you were to run raidzN over a wide enough array of disks you could end up with 4K fragments (distributed across the disks), but then you would always have 4K fragments. Zfs writes linear strips of data in units of the zfs blocksize, unless it is sliced-n-diced by raidzN for striping across disks.

If part of a zfs filesystem block is overwritten, then the underlying block is read, modified in memory, and then the whole block is written to a new location. The need to read the existing block is a reason why the zfs ARC is so vitally important to write performance.

If the filesystem has compression enabled, then the blocksize is still the same, but the data written may be shorter (due to compression). File tail blocks may also be shorter.

There are dtrace tools you can use to observe low level I/O and see the size of the writes.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
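As a back-of-the-envelope illustration of that read-modify-write behavior (the record and write sizes here are just example values):

    # Rough write amplification for COW full-record rewrites: every
    # small write that dirties a record causes the whole record to be
    # rewritten. Ignores compression, tail blocks, raidz layout, and
    # multiple writes coalescing into one TXG, so it is an upper
    # bound, not a measurement.
    def write_amplification(recordsize_bytes, app_write_bytes):
        return recordsize_bytes / app_write_bytes

    print(write_amplification(128 * 1024, 1))     # 131072x for a 1-byte update
    print(write_amplification(128 * 1024, 4096))  # 32x for 4k writes
    print(write_amplification(8 * 1024, 4096))    # 2x with an 8K recordsize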
2012-01-16 8:39, Bob Friesenhahn wrote:

> On Sun, 15 Jan 2012, Edward Ned Harvey wrote:
>>
>> While I'm waiting for this to run, I'll make some predictions:
>> The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading
>> the initial sequential file should take ~16 sec.
>> After fragmentation, it should be essentially random 4k fragments (32768
>> bits). I figure each time the head is able to find useful data, it takes
>
> The 4k fragments is the part I don't agree with. Zfs does not do that.
> If you were to run raidzN over a wide enough array of disks you could
> end up with 4K fragments (distributed across the disks), but then you
> would always have 4K fragments.

I think that in order to create a truly fragmented ZFS layout, Edward needs to do sync writes (without a ZIL?) so that every block and its metadata go to disk (coalesced as they may be) and no two blocks of the file would be sequenced on disk together. Although creating snapshots should give that effect... He would have to fight hard to defeat ZFS's anti-fragmentation attempts overall - while this is possible on very full pools ;)

Hint: pre-fill Ed's test pool to 90%, then run the tests :)

I think that to go forward with discussing defragmentation tools, we should define a metric of fragmentation - as Bob and Edward have often brought up. This implies accounting for the effects on the end-user of some mix of factors like:

1) Size of "free" reads and writes, i.e. cheap prefetch of a HDD's track as opposed to seeking; reads of an SSD block (those 256KB that are sliced into 4/8KB pages) as opposed to random reads of pages from separate SSD blocks. Seeks to neighboring tracks may be faster than full-disk seeks, but they are slower than no seeks at all. For optimal read performance, we might want to prefetch whole tracks/blocks (not 64KB from the position of ZFS's wanted block, but the whole track including this block, knowing instead the sector numbers of its start and end).

Effect: we might not need to fully defragment data, but rather make long-enough ranges "correctly" positioned on the media. These may span neighboring tracks/blocks. We do need to know the media's performance characteristics to do this optimally (i.e. which disk tracks have which byte-lengths, and where each circle starts in terms of LBA offsets). Also, disks' internal reallocation to spare blocks may lead to uncontrollable random seeks, degrading performance over time, but an FS is unlikely to have control or knowledge of that.

Metric: start addresses and lengths of fastest-read locations (i.e. whole tracks or SSD blocks) on leaf storage. May be variable within the storage device.

2) In the case of ZFS - reads of contiguously allocated and monotonically increasing block numbers of data from a file's or zvol's most current version (the live dataset, as opposed to the block change history in snapshots and the monotonic TXG number increase in on-disk blocks). This may be in unresolvable conflict with clones and deduplication, so some files or volumes can not be made contiguous without breaking the contiguity of others. Still, some "overall contiguousness" can be optimised. For users it might also be important to have many files from some directory stored close to each other, especially if these are small files used together somehow (source code, thumbnails, whatever).

Effect: fast reads of most-current datasets.

Metric: length of contiguous (DVA) stretches of current logical block numbers of userdata divided by total data size (see the sketch after this message).
The number of separate fragments should somehow be included too ;)

3) In the case of ZFS - fast access to metadata, especially branches of the current blockpointer tree in sequence of increasing TXG numbers.

Effect: fast reads of metadata, i.e. scrubbing.

Metric: length of contiguous (DVA) stretches of current block pointer trees in same-or-increasing TXG numbers divided by the total size of the tree (branch).

There is likely no absolute fragmentation or defragmentation, but there are some optimisations. For example, ZFS's attempts to coalesce 10MB of data during one write into one metaslab may suffice. And we do actually see performance hits when it can't find stretches long enough (quickly enough), with pools over the empirical 80% fill-up. Defragmentation might set the aim of clearing up enough 10MB-long stretches of free space and relocating smaller fragments of current userdata or {monotonic BP tree} metadata into these clearings.

In particular, even if we have old data in snapshots, but it is stored in long 10MB+ contiguous stretches, we might just leave it there. It is already about as good as it gets. Also, as I proposed elsewhere, the metadata might be stored in separate stretches of physical disk space - thus the different aims of defragmenting userdata and metadata (and free space) would not conflict.

What do you think?
//Jim
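Purely as an illustration of the metric in item 2 above, a hedged sketch under one possible reading (longest contiguous stretch divided by total size); the block-list format is invented for the example, and real DVAs would have to come from zdb or the block pointer tree:

    # Hypothetical sketch of the contiguity metric from item 2.
    # Blocks are modeled as (logical_offset, dva_offset, length)
    # tuples; a real tool would have to extract these from zdb or the
    # on-disk block pointer tree, which is not shown here.

    def contiguity(blocks):
        """Longest run of blocks whose ascending logical order is also
        gap-free ascending physical (DVA) order, over total size."""
        blocks = sorted(blocks)              # sort by logical offset
        if not blocks:
            return 1.0
        total = sum(length for _, _, length in blocks)
        longest = current = blocks[0][2]
        for (l0, d0, len0), (l1, d1, len1) in zip(blocks, blocks[1:]):
            if l0 + len0 == l1 and d0 + len0 == d1:
                current += len1              # run continues on disk
            else:
                current = len1               # discontinuity starts a new run
            longest = max(longest, current)
        return longest / total

    # A fully contiguous file vs. the same blocks scattered on disk:
    seq = [(i * 128, i * 128, 128) for i in range(8)]
    scattered = [(i * 128, (7 - i) * 4096, 128) for i in range(8)]
    print(contiguity(seq))        # 1.0
    print(contiguity(scattered))  # 0.125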
On Mon, 16 Jan 2012, Jim Klimov wrote:

> I think that in order to create a truly fragmented ZFS layout,
> Edward needs to do sync writes (without a ZIL?) so that every
> block and its metadata go to disk (coalesced as they may be)
> and no two blocks of the file would be sequenced on disk together.
> Although creating snapshots should give that effect...

Creating snapshots does not in itself cause fragmentation, since COW would cause that level of fragmentation to exist anyway. However, snapshots cause old blocks to be retained, so the disk becomes more full, fresh blocks may be less appropriately situated, and disk seeks may become more expensive due to needing to seek over more tracks.

In my experience, most files on Unix systems are re-written from scratch. For example, when one edits a file in an editor, the editor loads the file into memory, performs the edit, and then writes out the whole file. Given sufficient free disk space, these files are unlikely to be fragmented. The cases of slowly written log files or random-access databases are the worst cases for causing fragmentation.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Mon, Jan 16, 2012 at 09:13:03AM -0600, Bob Friesenhahn wrote:

> On Mon, 16 Jan 2012, Jim Klimov wrote:
>>
>> I think that in order to create a truly fragmented ZFS layout,
>> Edward needs to do sync writes (without a ZIL?) so that every
>> block and its metadata go to disk (coalesced as they may be)
>> and no two blocks of the file would be sequenced on disk together.
>> Although creating snapshots should give that effect...
>
> In my experience, most files on Unix systems are re-written from
> scratch. For example, when one edits a file in an editor, the editor
> loads the file into memory, performs the edit, and then writes out
> the whole file. Given sufficient free disk space, these files are
> unlikely to be fragmented.
>
> The cases of slowly written log files or random-access databases are
> the worst cases for causing fragmentation.

The case I've seen was with an IMAP server with many users. E-mail folders were represented as ZFS directories, and e-mail messages as files within those directories. New messages arrived randomly in the INBOX folder, so those files were written all over the place on the storage. Users also deleted many messages from their INBOX folder, but the files were retained in snapshots for two weeks.

On IMAP session startup, the server typically had to read all of the messages in the INBOX folder, making this portion slow. The server also had to refresh the folder whenever new messages arrived, making that portion slow as well. Performance degraded when the storage became 50% full. It would improve markedly when the oldest snapshot was deleted.

--
-Gary Mills-        -refurb-        -Winnipeg, Manitoba, Canada-