Hello all,

I understand that relatively high fragmentation is inherent to ZFS due to its COW and possible intermixing of metadata and data blocks (of which metadata path blocks are likely to expire and get freed relatively quickly).

I believe it was sometimes implied on this list that such fragmentation for "static" data can currently be combatted only by zfs send-ing existing pools' data to other pools on some reserved hardware, and then clearing the original pools and sending the data back. This is time-consuming, disruptive and requires lots of extra storage idling for this task (or at best - for backup purposes).

I wonder how resilvering works, namely - does it write blocks "as they were" or in an optimized (defragmented) fashion, in two use cases:
1) Resilvering from a healthy array (vdev) onto a spare drive in order to replace one of the healthy drives in the vdev;
2) Resilvering a degraded array from existing drives onto a new drive in order to repair the array and make it redundant again.

Also, are these two modes different at all? I.e. if I were to ask ZFS to replace a working drive with a spare in case (1), can I do it at all, and would its data simply be copied over, or reconstructed from other drives, or some mix of these two operations?

Finally, what would the gurus say - does fragmentation pose a heavy problem on nearly-filled-up pools made of spinning HDDs (I believe so, at least judging from those performance degradation problems writing to 80+%-filled pools), and can fragmentation be effectively combatted on ZFS at all (with or without BP rewrite)?

For example, can (does?) metadata live "separately" from data in some "dedicated" disk areas, while data blocks are written as contiguously as they can be?

Many Windows defrag programs group files into several "zones" on the disk based on their last-modify times, so that old WORM files remain defragmented for a long time. There are thus some empty areas reserved for new writes as well as for moving newly discovered WORM files to the WORM zones (free space permitting)... I wonder if this is viable with ZFS (COW and snapshots involved) when BP rewrites are implemented? Perhaps such zoned defragmentation could be done based on block creation date (TXG number) and the knowledge that some blocks in a certain order comprise at least one single file (maybe more due to clones and dedup) ;)

What do you think? Thanks,
//Jim Klimov
Edward Ned Harvey
2012-Jan-07 15:34 UTC
[zfs-discuss] zfs defragmentation via resilvering?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I understand that relatively high fragmentation is inherent
> to ZFS due to its COW and possible intermixing of metadata
> and data blocks (of which metadata path blocks are likely
> to expire and get freed relatively quickly).
>
> I believe it was sometimes implied on this list that such
> fragmentation for "static" data can be currently combatted
> only by zfs send-ing existing pools data to other pools at
> some reserved hardware, and then clearing the original pools
> and sending the data back. This is time-consuming, disruptive
> and requires lots of extra storage idling for this task (or
> at best - for backup purposes).

It can be combated by sending & receiving, but that's not the only way. You can defrag (or apply/remove dedup and/or compression, or any of the other stuff that's dependent on BP rewrite) by using any technique which sequentially reads the existing data and writes it back to disk again. For example, if you "cp -p file1 file2 && mv file2 file1", then you have effectively defragged file1 (or added/removed dedup or compression). But of course it's a requirement that file1 is sufficiently "not being used" right now.

> I wonder how resilvering works, namely - does it write
> blocks "as they were" or in an optimized (defragmented)
> fashion, in two usecases:

Resilver goes in temporal order. While this might sometimes yield a slightly better organization (if a whole bunch of small writes were previously spread out over a large period of time on a largely idle system, they will now be write-aggregated to sequential blocks), usually resilvering recreates fragmentation similar to the pre-existing fragmentation.

In fact, even if you zfs send | zfs receive while preserving snapshots, you're still recreating the data in something like temporal order, because it will do all the blocks of the oldest snapshot, and then all the blocks of the second oldest snapshot, etc. So by preserving the old snapshots, you might sometimes be recreating a significant amount of fragmentation anyway.

> 1) Resilvering from a healthy array (vdev) onto a spare drive
> in order to replace one of the healthy drives in the vdev;
> 2) Resilvering a degraded array from existing drives onto a
> new drive in order to repair the array and make it redundant
> again.

Same behavior either way. Unless... if your old disks are small and very full, and your new disks are bigger, then sometimes in the past you may have suffered fragmentation due to lack of available sequential unused blocks. So resilvering onto new *larger* disks might make a difference.

> Finally, what would the gurus say - does fragmentation
> pose a heavy problem on nearly-filled-up pools made of
> spinning HDDs

Yes. But that's not unique to ZFS or COW. No matter what your system, if your disk is nearly full, you will suffer from fragmentation.

> and can fragmentation be effectively combatted
> on ZFS at all (with or without BP rewrite)?

With BP rewrite, yes, you can effectively combat fragmentation. Unfortunately it doesn't exist. :-/

Without BP rewrite... Define "effectively." ;-) I have successfully defragged, compressed, and enabled/disabled dedup on pools before, by using zfs send | zfs receive... Or by asking users, "Ok, we're all in agreement, this weekend, nobody will be using the "a" directory. Right?" So then I sudo rm -rf a, and restore from the latest snapshot. Or something along those lines. Next weekend, we'll do the "b" directory...
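A minimal Python sketch of the copy-and-rename rewrite described above, assuming the file is quiescent while it runs; the paths and temp-file name are placeholders, not anything from the thread:

    # Hedged sketch of the "cp -p file1 file2 && mv file2 file1" idea:
    # sequentially re-reading a file and writing it back makes ZFS
    # allocate fresh (hopefully more contiguous) blocks for it, under
    # whatever compression/dedup settings are currently active.
    # Only safe if nothing else has the file open; note that blocks
    # still referenced by existing snapshots are not freed by this.
    import os
    import shutil

    def rewrite_in_place(path):
        tmp = path + ".rewrite-tmp"       # placeholder temp name
        shutil.copy2(path, tmp)           # copy data plus permissions/times, like cp -p
        os.replace(tmp, path)             # atomic rename over the original, like mv

    rewrite_in_place("/tank/data/file1")  # placeholder path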
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2012-Jan-07 16:10 UTC
[zfs-discuss] zfs defragmentation via resilvering?
It seems that S11 shadow migration can help :-)

On 1/7/2012 9:50 AM, Jim Klimov wrote:
> Hello all,
>
> I understand that relatively high fragmentation is inherent
> to ZFS due to its COW and possible intermixing of metadata
> and data blocks [...]

--
Hung-Sheng Tsao Ph.D.
Founder & Principal, HopBit GridComputing LLC
cell: 9734950840
http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/
On Sat, 7 Jan 2012, Jim Klimov wrote:

> I understand that relatively high fragmentation is inherent
> to ZFS due to its COW and possible intermixing of metadata
> and data blocks (of which metadata path blocks are likely
> to expire and get freed relatively quickly).

To put things in proper perspective, with 128K filesystem blocks, the worst case file fragmentation as a percentage is 0.39% (100*1/((128*1024)/512)). On a Microsoft Windows system, the defragger might suggest that defragmentation is not warranted at this percentage level.

> Finally, what would the gurus say - does fragmentation
> pose a heavy problem on nearly-filled-up pools made of
> spinning HDDs (I believe so, at least judging from those
> performance degradation problems writing to 80+%-filled
> pools), and can fragmentation be effectively combatted
> on ZFS at all (with or without BP rewrite)?

There are different types of fragmentation. The fragmentation which causes a slowdown when writing to an almost full pool is fragmentation of the free list/area (causing zfs to take longer to find free space to write to), as opposed to fragmentation of the files themselves. The files themselves will still not be fragmented any more severely than the zfs blocksize.

However, there are seeks and there are *seeks*, and some seeks take longer than others, so some forms of fragmentation are worse than others. When the free space is fragmented into smaller blocks, there is necessarily more file fragmentation when the file is written.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
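For reference, the arithmetic behind that figure as a small Python sketch; the 512-byte sector size is the assumption built into the formula, and the 8K case quoted later in the thread falls out of the same calculation:

    # Worst-case file fragmentation percentage as defined above:
    # at most one discontinuity per filesystem block, expressed
    # against the number of disk sectors per block.
    def worst_case_fragmentation_pct(recordsize_bytes, sector_bytes=512):
        return 100.0 / (recordsize_bytes / sector_bytes)

    print(worst_case_fragmentation_pct(128 * 1024))  # 0.390625 for 128K records
    print(worst_case_fragmentation_pct(8 * 1024))    # 6.25 for 8K records (see below)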
Edward Ned Harvey
2012-Jan-09 13:44 UTC
[zfs-discuss] zfs defragmentation via resilvering?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> To put things in proper perspective, with 128K filesystem blocks, the
> worst case file fragmentation as a percentage is 0.39%
> (100*1/((128*1024)/512)). On a Microsoft Windows system, the
> defragger might suggest that defragmentation is not warranted for this
> percentage level.

I don't think that's correct...

Suppose you write a 1G file to disk. It is a database store. Now you start running your db server. It starts performing transactions all over the place. It overwrites the middle 4k of the file, and it overwrites 512b somewhere else, and so on. Since this is COW, each one of these little writes in the middle of the file will actually get mapped to unused sectors of disk. Depending on how quickly they're happening, they may be aggregated as writes... But that's not going to help the sequential read speed of the file later, when you stop your db server and try to sequentially copy your file for backup purposes.

In the pathological worst case, you would write a file that takes up half of the disk. Then you would snapshot it, and overwrite it in random order, using the smallest possible block size. Now your disk is 100% full, and if you read that file, you will be performing worst-case random I/O spanning 50% of the total disk space. Granted, this is not a very realistic case, but it is the worst case, and it's really really really bad for read performance.
On Jan 9, 2012, at 5:44 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>>
>> To put things in proper perspective, with 128K filesystem blocks, the
>> worst case file fragmentation as a percentage is 0.39%
>> (100*1/((128*1024)/512)). On a Microsoft Windows system, the
>> defragger might suggest that defragmentation is not warranted for this
>> percentage level.
>
> I don't think that's correct...
> Suppose you write a 1G file to disk. It is a database store. Now you start
> running your db server. It starts performing transactions all over the
> place. It overwrites the middle 4k of the file, and it overwrites 512b
> somewhere else, and so on.

It depends on the database, but many (e.g. Oracle Database) are COW and write fixed block sizes, so your example does not apply.

> Since this is COW, each one of these little
> writes in the middle of the file will actually get mapped to unused sectors
> of disk. Depending on how quickly they're happening, they may be aggregated
> as writes... But that's not going to help the sequential read speed of the
> file, later when you stop your db server and try to sequentially copy your
> file for backup purposes.

Those who expect to get sequential performance out of HDDs usually end up being sad :-(

Interestingly, if you run Oracle Database on top of ZFS on top of SSDs, then you have COW over COW over COW. Now all we need is a bull! :-)
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
On Mon, 9 Jan 2012, Edward Ned Harvey wrote:

> I don't think that's correct...

But it is! :-)

> Suppose you write a 1G file to disk. It is a database store. Now you start
> running your db server. It starts performing transactions all over the
> place. It overwrites the middle 4k of the file, and it overwrites 512b
> somewhere else, and so on. Since this is COW, each one of these little
> writes in the middle of the file will actually get mapped to unused sectors
> of disk. Depending on how quickly they're happening, they may be aggregated

Oops. I see an error in the above. Other than tail blocks, or due to compression, zfs will not write a COW data block smaller than the zfs filesystem blocksize. If the blocksize was 128K, then updating just one byte in that 128K block results in writing a whole new 128K block. This is pretty significant write amplification, but the resulting fragmentation is still limited by the 128K block size. Remember that any fragmentation calculation needs to be based on the disk's minimum read (i.e. sector) size.

However, it is worth remembering that it is common to set the block size to a much smaller value than the default (e.g. 8K) if the filesystem is going to support a database. In that case it is possible for there to be fragmentation for every 8K of data. The worst case fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25% (100*1/((8*1024)/512)). That would be a high enough percentage that Microsoft Windows defrag would recommend defragging the disk.

Metadata chunks can not be any smaller than the disk's sector size (e.g. 512 bytes or 4K bytes). Metadata can be seen as contributing to fragmentation, which is why it is so valuable to cache it. If the metadata is not conveniently close to the data, then it may result in a big ugly disk seek (same impact as data fragmentation) to read it.

In summary, with zfs's default 128K block size, data fragmentation is not a significant issue. If the zfs filesystem block size is reduced to a much smaller value (e.g. 8K) then it can become a significant issue. As Richard Elling points out, a database layered on top of zfs may already be fragmented by design.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-09 19:14, Bob Friesenhahn wrote:

> In summary, with zfs's default 128K block size, data fragmentation is
> not a significant issue. If the zfs filesystem block size is reduced to
> a much smaller value (e.g. 8K) then it can become a significant issue.
> As Richard Elling points out, a database layered on top of zfs may
> already be fragmented by design.

I THINK there is some fallacy in your discussion: I've seen 128K referred to as the maximum filesystem block size, i.e. for large "streaming" writes. For smaller writes ZFS adapts with smaller blocks. I am not sure how it would rewrite a few bytes inside a larger block - split it into many smaller ones, or COW all 128K. Intermixing variable-sized indivisible blocks can in turn lead to more fragmentation than would otherwise be expected/possible ;) Fixed block sizes are used (only?) for volume datasets.

> If the metadata is not conveniently close to the data, then it may
> result in a big ugly disk seek (same impact as data fragmentation)
> to read it.

Also I'm not sure about this argument. If VDEV prefetch does not slurp in data blocks, then by the time metadata is discovered in read-from-disk blocks and the data block locations are determined, the disk may have rotated away from the head, so at least one rotational delay is incurred even if metadata is immediately followed by its referred data... no?

//Jim
Edward Ned Harvey
2012-Jan-13 04:15 UTC
[zfs-discuss] zfs defragmentation via resilvering?
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> > Suppose you write a 1G file to disk. It is a database store. Now you start
> > running your db server. It starts performing transactions all over the
> > place. It overwrites the middle 4k of the file, and it overwrites 512b
> > somewhere else, and so on. Since this is COW, each one of these little
> > writes in the middle of the file will actually get mapped to unused sectors
> > of disk. Depending on how quickly they're happening, they may be
> > aggregated
>
> Oops. I see an error in the above. Other than tail blocks, or due to
> compression, zfs will not write a COW data block smaller than the zfs
> filesystem blocksize. If the blocksize was 128K, then updating just
> one byte in that 128K block results in writing a whole new 128K block.

Before anything else, let's define what "fragmentation" means in this context, or more importantly, why anyone would care. Fragmentation, in this context, is a measurement of how many blocks exist sequentially aligned on disk, such that a sequential read will not suffer a seek/latency penalty. So the reason somebody would care is a function of performance - disk work payload versus disk work wasted in overhead time.

But wait! There are different types of reads. If you read using a scrub or a zfs send, then it will read the blocks in temporal order, so anything which was previously write-coalesced (even from many different files) will again be read-coalesced (which is nice). But if you read a file using something like tar or cp or cat, then it reads the file in sequential file order, which would be different from temporal order unless the file was originally written sequentially and never overwritten by COW.

Suppose you have a 1G file open, and a snapshot of this file is on disk from a previous point in time.

for ( i=0 ; i<1trillion ; i++ ) {
    seek(random integer in range [0 to 1G]);
    write(4k);
}

Something like this would quickly write a bunch of separate and scattered 4k blocks at different offsets within the file. Every 32 of these 4k writes would be write-coalesced into a single 128k on-disk block.

Sometime later, you read the whole file sequentially, such as with cp or tar or cat. The first 4k come from this 128k block... The next 4k come from another 128k block... The next 4k come from yet another 128k block... Essentially, the file has become very fragmented and scattered about on the physical disk. Every 4k read results in a random disk seek.

> The worst case
> fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25%
> ((100*1/((8*1024)/512))).

You seem to be assuming that reading a 512b disk sector and its neighboring 512b sector count as contiguous blocks. And since there are guaranteed to be exactly 256 sectors in every 128k filesystem block, there is no fragmentation for 256 contiguous sectors, guaranteed. Unfortunately, the 512b sector size is just an arbitrary number (and variable - actually 4k on modern disks), and the resultant percentage of fragmentation is equally arbitrary.

To produce a number that actually matters, what you need to do is calculate the percentage of time the disk is able to deliver payload, versus the percentage of time the disk is performing time-wasting "overhead" operations - seek and latency. Suppose your disk speed is 1Gbit/sec while actively engaging the head, and suppose the average random access (seek & latency) is 10ms. Suppose you wish for 99% efficiency.
The 10ms must be 1% of the time, and the head must be engaged for 99% of the time, which is 990ms, which is very nearly 1Gbit, or approximately 123MB of sequential data for every random disk access. You need 123MB of sequential payload for every random disk access. That's 944 times larger than the largest 128k block size currently in zfs, and obviously larger still compared to what you mentioned - 4k or 8k recordsizes or 512b disk sectors...

Suppose you have 128k blocks written to disk, all scattered about in random order. Your disk must seek & rotate for 10ms, and then it will be engaged for 1.3ms reading the 128k, and then it will seek & rotate again for 10ms... I would call that 13% payload and 87% wasted time. Fragmentation at this level hurts you really badly.

Suppose there is a TXG flush every 5 seconds. You write a program which will write a single byte to disk once every 5.1 seconds. Then you leave that program running for a very very long time. You now have millions of 128k blocks written on disk, scattered about in random order. You start a scrub. It will read 128k, and then random seek, and then read 128k, etc. I would call that 100% fragmentation, because there are no contiguously aligned sequential blocks on disk anywhere.

But again, any measure of "percent fragmentation" is purely arbitrary, unless you know (a) which type of read behavior is being measured (temporal or file order), (b) the sequential engaged disk speed, and (c) the average random access time.
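Working through that arithmetic in a short Python sketch; the 1Gbit/sec transfer rate and 10ms access time are the assumed figures from the message above, not measurements:

    # Payload vs. overhead arithmetic from the message above.
    # Assumed figures: 1 Gbit/s transfer rate while the head is
    # engaged, 10 ms average random access (seek + latency).
    SEQ_RATE_BITS_PER_S = 1e9
    RANDOM_ACCESS_S = 0.010

    def payload_bytes_per_seek(target_efficiency):
        """Sequential bytes needed per random access so that seeking
        wastes only (1 - target_efficiency) of the total time."""
        engaged_s = RANDOM_ACCESS_S * target_efficiency / (1 - target_efficiency)
        return engaged_s * SEQ_RATE_BITS_PER_S / 8

    def payload_fraction(chunk_bytes):
        """Fraction of time spent transferring data when every chunk
        of this size costs one random access."""
        engaged_s = chunk_bytes * 8 / SEQ_RATE_BITS_PER_S
        return engaged_s / (engaged_s + RANDOM_ACCESS_S)

    print(payload_bytes_per_seek(0.99) / 1e6)  # ~123.75 MB per seek for 99% efficiency
    print(payload_fraction(128 * 1024))        # ~0.095, the ballpark of the ~13% above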
On Thu, 12 Jan 2012, Edward Ned Harvey wrote:

> Suppose you have a 1G file open, and a snapshot of this file is on disk from
> a previous point in time.
> for ( i=0 ; i<1trillion ; i++ ) {
>     seek(random integer in range [0 to 1G]);
>     write(4k);
> }
>
> Something like this would quickly write a bunch of separate and
> scattered 4k blocks at different offsets within the file. Every 32 of these
> 4k writes would be write-coalesced into a single 128k on-disk block.
>
> Sometime later, you read the whole file sequentially, such as with cp or tar or
> cat. The first 4k come from this 128k block... The next 4k come from
> another 128k block... The next 4k come from yet another 128k block...
> Essentially, the file has become very fragmented and scattered about on the
> physical disk. Every 4k read results in a random disk seek.

Are you talking about some other filesystem, or are you talking about zfs? Because zfs does not work like that ...

However, I did ignore the additional fragmentation due to using raidz type formats. These break the 128K block into smaller chunks and so there can be more fragmentation.

>> The worst case
>> fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25%
>> ((100*1/((8*1024)/512))).
>
> You seem to be assuming that reading a 512b disk sector and its neighboring
> 512b sector count as contiguous blocks. And since there are guaranteed to
> be exactly 256 sectors in every 128k filesystem block, there is no
> fragmentation for 256 contiguous sectors, guaranteed. Unfortunately, the
> 512b sector size is just an arbitrary number (and variable, actually 4k on
> modern disks), and the resultant percentage of fragmentation is equally
> arbitrary.

Yes, I am saying that zfs writes its data in contiguous chunks (filesystem blocksize in the case of mirrors).

> To produce a number that actually matters, what you need to do is calculate
> the percentage of time the disk is able to deliver payload, versus the
> percentage of time the disk is performing time-wasting "overhead" operations
> - seek and latency.

Yes, latency is the critical factor.

> That's 944 times larger than the largest 128k block size currently in zfs,
> and obviously larger still compared to what you mentioned - 4k or 8k
> recordsizes or 512b disk sectors...

Yes, fragmentation is still important even with 128K chunks.

> I would call that 100% fragmentation, because there are no contiguously
> aligned sequential blocks on disk anywhere. But again, any measure of
> "percent fragmentation" is purely arbitrary, unless you know (a) which type

I agree that the notion of percent fragmentation is arbitrary. I used one that I invented, which is based on underlying disk sectors rather than filesystem blocks.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey
2012-Jan-16 02:12 UTC
[zfs-discuss] zfs defragmentation via resilvering?
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> On Thu, 12 Jan 2012, Edward Ned Harvey wrote:
> > Suppose you have a 1G file open, and a snapshot of this file is on disk from
> > a previous point in time.
> > for ( i=0 ; i<1trillion ; i++ ) {
> >     seek(random integer in range [0 to 1G]);
> >     write(4k);
> > }
> >
> > Something like this would quickly write a bunch of separate and
> > scattered 4k blocks at different offsets within the file. Every 32 of these
> > 4k writes would be write-coalesced into a single 128k on-disk block.
> >
> > Sometime later, you read the whole file sequentially, such as with cp or tar or
> > cat. The first 4k come from this 128k block... The next 4k come from
> > another 128k block... The next 4k come from yet another 128k block...
> > Essentially, the file has become very fragmented and scattered about on
> > the physical disk. Every 4k read results in a random disk seek.
>
> Are you talking about some other filesystem or are you talking about
> zfs? Because zfs does not work like that ...

In what way? I've only described the behavior of COW and write coalescing. Which part are you saying is un-ZFS-like?

Before answering, let's do some test work: Create a new pool, with a single disk, no compression or dedup or anything, called "junk". Run this script. All it does is generate some data in a file sequentially, and then randomly overwrite random pieces of the file in random order, creating snapshots all along the way... many times, until the file has been completely overwritten many times over. This should be a fragmentation nightmare.

http://dl.dropbox.com/u/543241/fragmenter.py

Then reboot, to ensure the cache is clear. And see how long it takes to sequentially read the original sequential file, as compared to the highly fragmented one:

cat /junk/.zfs/snapshot/sequential-before/out.txt | pv > /dev/null ; cat /junk/.zfs/snapshot/random1399/out.txt | pv > /dev/null

While I'm waiting for this to run, I'll make some predictions: The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading the initial sequential file should take ~16 sec. After fragmentation, it should be essentially random 4k fragments (32768 bits). I figure each time the head is able to find useful data, it takes 32us to read the 4kb, followed by 10ms random access time... The disk is doing useful work 0.3% of the time and wasting 99.7% of the time doing random seeks. It should take about 300x longer to read the fragmented file.

... (Ding!) ... The test is done. Thank you for patiently waiting during this time warp. ;-)

Actual result: 15s and 45s. So it was 3x longer, not 300x. Either way it proves the point - but I want to see results that are at least 100x worse due to fragmentation, to REALLY drive home the point that fragmentation matters.

I hypothesize that the mere 3x performance degradation is because I have only a single 2G file in a 2T pool, and no other activity, and no other files. So all my supposedly randomly distributed data might reside very close together on the platter... The combination of short stroking & the read prefetcher could be doing wonders in this case.

So now I'll repeat that test, but this time... I allow the sequential data to be written sequentially again just like before, but after it starts the random rewriting, I'll run a couple of separate threads writing and removing other junk to the pool, so the write coalescing will include other files, spread more across a larger percentage of the total disk, getting closer to the worst-case random distribution on disk... (destroy & recreate the pool in between test runs...)

Actual result: 15s and 104s. So it's only a 6.9x performance degradation. That's the worst I can do without hurting myself. It proves the point, but not to the magnitude that I expected.
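The linked fragmenter.py is not reproduced in the archive; purely as an illustration of the workload described above (sequential write, then random 4k overwrites with periodic snapshots), a hedged sketch might look like the following. The pool name, file path, and loop counts are assumptions for illustration, not the actual script:

    #!/usr/bin/env python
    # Hypothetical sketch of the workload described above: write a file
    # sequentially, then overwrite random 4k pieces of it in random
    # order, snapshotting along the way. This is NOT the actual
    # fragmenter.py from the link; names, sizes and counts are assumed.
    import os
    import random
    import subprocess

    POOL = "junk"                        # assumed pool name from the test
    PATH = "/%s/out.txt" % POOL          # assumed file path
    FILE_SIZE = 2 * 1024**3              # 2 GB file, as in the test above
    CHUNK = 4096                         # 4k random overwrites

    def snapshot(name):
        subprocess.check_call(["zfs", "snapshot", "%s@%s" % (POOL, name)])

    # Phase 1: sequential write.
    with open(PATH, "wb") as f:
        for _ in range(FILE_SIZE // (1024 * 1024)):
            f.write(os.urandom(1024 * 1024))
    snapshot("sequential-before")

    # Phase 2: scattered COW overwrites; each snapshot pins the old
    # block versions so they cannot be reused.
    offsets = list(range(0, FILE_SIZE, CHUNK))
    with open(PATH, "r+b") as f:
        for i in range(1400):            # matches snapshot names random0..random1399
            for off in random.sample(offsets, 2000):   # assumed batch size
                f.seek(off)
                f.write(os.urandom(CHUNK))
            f.flush()
            os.fsync(f.fileno())
            snapshot("random%d" % i)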
On Sun, 15 Jan 2012, Edward Ned Harvey wrote:

> While I'm waiting for this to run, I'll make some predictions:
> The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading
> the initial sequential file should take ~16 sec.
> After fragmentation, it should be essentially random 4k fragments (32768
> bits). I figure each time the head is able to find useful data, it takes

The 4k fragments is the part I don't agree with. Zfs does not do that. If you were to run raidzN over a wide enough array of disks you could end up with 4K fragments (distributed across the disks), but then you would always have 4K fragments. Zfs writes linear strips of data in units of the zfs blocksize, unless it is sliced-n-diced by raidzN for striping across disks.

If part of a zfs filesystem block is overwritten, then the underlying block is read, modified in memory, and then the whole block is written to a new location. The need to read the existing block is a reason why the zfs ARC is so vitally important to write performance.

If the filesystem has compression enabled, then the blocksize is still the same, but the data written may be shorter (due to compression). File tail blocks may also be shorter.

There are dtrace tools you can use to observe low level I/O and see the size of the writes.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
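As a back-of-the-envelope illustration of that read-modify-write behavior (the record and write sizes here are just example values):

    # Rough write amplification for COW full-record rewrites: every
    # small write that dirties a record causes the whole record to be
    # rewritten. Ignores compression, tail blocks, raidz layout, and
    # multiple writes coalescing into one TXG, so it is an upper
    # bound, not a measurement.
    def write_amplification(recordsize_bytes, app_write_bytes):
        return recordsize_bytes / app_write_bytes

    print(write_amplification(128 * 1024, 1))     # 131072x for a 1-byte update
    print(write_amplification(128 * 1024, 4096))  # 32x for 4k writes
    print(write_amplification(8 * 1024, 4096))    # 2x with an 8K recordsize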
2012-01-16 8:39, Bob Friesenhahn wrote:

> On Sun, 15 Jan 2012, Edward Ned Harvey wrote:
>>
>> While I'm waiting for this to run, I'll make some predictions:
>> The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading
>> the initial sequential file should take ~16 sec.
>> After fragmentation, it should be essentially random 4k fragments (32768
>> bits). I figure each time the head is able to find useful data, it takes
>
> The 4k fragments is the part I don't agree with. Zfs does not do that.
> If you were to run raidzN over a wide enough array of disks you could
> end up with 4K fragments (distributed across the disks), but then you
> would always have 4K fragments.

I think that in order to create a truly fragmented ZFS layout, Edward needs to do sync writes (without a ZIL?) so that every block and its metadata go to disk (coalesced as they may be) and no two blocks of the file would be sequenced on disk together. Although creating snapshots should give that effect... He would have to fight hard to defeat ZFS's anti-fragmentation attempts overall - while this is possible on very full pools ;)

Hint: pre-fill Ed's test pool to 90%, then run the tests :)

I think that to go forward with discussing defragmentation tools, we should define a metric of fragmentation - as Bob and Edward have often brought up. This implies accounting for the effects on the end-user of some mix of factors like:

1) Size of "free" reads and writes, i.e. cheap prefetch of a HDD's track as opposed to seeking; reads of an SSD block (those 256KB that are sliced into 4/8KB pages) as opposed to random reads of pages from separate SSD blocks. Seeks to neighboring tracks may be faster than full-disk seeks, but they are slower than no seeks at all. For optimal read performance, we might want to prefetch whole tracks/blocks (not 64KB from the position of ZFS's wanted block, but the whole track including this block, knowing instead the sector numbers of its start and end).

Effect: we might not need to fully defragment data, but rather make long-enough ranges "correctly" positioned on the media. These may span neighboring tracks/blocks. We do need to know the media's performance characteristics to do this optimally (i.e. which disk tracks have which byte-lengths, and where each circle starts in terms of LBA offsets). Also, disks' internal reallocation to spare blocks may lead to uncontrollable random seeks, degrading performance over time, but an FS is unlikely to have control or knowledge of that.

Metric: start addresses and lengths of fastest-read locations (i.e. whole tracks or SSD blocks) on leaf storage. May be variable within the storage device.

2) In the case of ZFS - reads of contiguously allocated and monotonically increasing block numbers of data from a file's or zvol's most current version (the live dataset, as opposed to the block change history in snapshots and the monotonic TXG number increase in on-disk blocks). This may be in unresolvable conflict with clones and deduplication, so some files or volumes can not be made contiguous without breaking the contiguity of others. Still, some "overall contiguousness" can be optimised. For users it might also be important to have many files from some directory stored close to each other, especially if these are small files used together somehow (source code, thumbnails, whatever).

Effect: fast reads of most-current datasets.

Metric: length of contiguous (DVA) stretches of current logical block numbers of userdata divided by total data size (see the sketch after this message).
The number of separate fragments should somehow be included too ;)

3) In the case of ZFS - fast access to metadata, especially branches of the current blockpointer tree in sequence of increasing TXG numbers.

Effect: fast reads of metadata, i.e. scrubbing.

Metric: length of contiguous (DVA) stretches of current block pointer trees in same-or-increasing TXG numbers divided by the total size of the tree (branch).

There is likely no absolute fragmentation or defragmentation, but there are some optimisations. For example, ZFS's attempts to coalesce 10MB of data during one write into one metaslab may suffice. And we do actually see performance hits when it can't find stretches long enough (quickly enough), with pools over the empirical 80% fill-up. Defragmentation might set the aim of clearing up enough 10MB-long stretches of free space and relocating smaller fragments of current userdata or {monotonic BP tree} metadata into these clearings.

In particular, even if we have old data in snapshots, but it is stored in long 10MB+ contiguous stretches, we might just leave it there. It is already about as good as it gets. Also, as I proposed elsewhere, the metadata might be stored in separate stretches of physical disk space - thus the different aims of defragmenting userdata and metadata (and free space) would not conflict.

What do you think?
//Jim
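Purely as an illustration of the metric in item 2 above, a hedged sketch under one possible reading (longest contiguous stretch divided by total size); the block-list format is invented for the example, and real DVAs would have to come from zdb or the block pointer tree:

    # Hypothetical sketch of the contiguity metric from item 2.
    # Blocks are modeled as (logical_offset, dva_offset, length)
    # tuples; a real tool would have to extract these from zdb or the
    # on-disk block pointer tree, which is not shown here.

    def contiguity(blocks):
        """Longest run of blocks whose ascending logical order is also
        gap-free ascending physical (DVA) order, over total size."""
        blocks = sorted(blocks)              # sort by logical offset
        if not blocks:
            return 1.0
        total = sum(length for _, _, length in blocks)
        longest = current = blocks[0][2]
        for (l0, d0, len0), (l1, d1, len1) in zip(blocks, blocks[1:]):
            if l0 + len0 == l1 and d0 + len0 == d1:
                current += len1              # run continues on disk
            else:
                current = len1               # discontinuity starts a new run
            longest = max(longest, current)
        return longest / total

    # A fully contiguous file vs. the same blocks scattered on disk:
    seq = [(i * 128, i * 128, 128) for i in range(8)]
    scattered = [(i * 128, (7 - i) * 4096, 128) for i in range(8)]
    print(contiguity(seq))        # 1.0
    print(contiguity(scattered))  # 0.125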
On Mon, 16 Jan 2012, Jim Klimov wrote:

> I think that in order to create a truly fragmented ZFS layout,
> Edward needs to do sync writes (without a ZIL?) so that every
> block and its metadata go to disk (coalesced as they may be)
> and no two blocks of the file would be sequenced on disk together.
> Although creating snapshots should give that effect...

Creating snapshots does not in itself cause fragmentation, since COW would cause that level of fragmentation to exist anyway. However, snapshots cause old blocks to be retained, so the disk becomes more full, fresh blocks may be less appropriately situated, and disk seeks may become more expensive due to needing to seek over more tracks.

In my experience, most files on Unix systems are re-written from scratch. For example, when one edits a file in an editor, the editor loads the file into memory, performs the edit, and then writes out the whole file. Given sufficient free disk space, these files are unlikely to be fragmented. The cases of slowly written log files or random-access databases are the worst cases for causing fragmentation.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Mon, Jan 16, 2012 at 09:13:03AM -0600, Bob Friesenhahn wrote:

> On Mon, 16 Jan 2012, Jim Klimov wrote:
>>
>> I think that in order to create a truly fragmented ZFS layout,
>> Edward needs to do sync writes (without a ZIL?) so that every
>> block and its metadata go to disk (coalesced as they may be)
>> and no two blocks of the file would be sequenced on disk together.
>> Although creating snapshots should give that effect...
>
> In my experience, most files on Unix systems are re-written from
> scratch. For example, when one edits a file in an editor, the editor
> loads the file into memory, performs the edit, and then writes out
> the whole file. Given sufficient free disk space, these files are
> unlikely to be fragmented.
>
> The cases of slowly written log files or random-access databases are
> the worst cases for causing fragmentation.

The case I've seen was with an IMAP server with many users. E-mail folders were represented as ZFS directories, and e-mail messages as files within those directories. New messages arrived randomly in the INBOX folder, so those files were written all over the place on the storage. Users also deleted many messages from their INBOX folder, but the files were retained in snapshots for two weeks.

On IMAP session startup, the server typically had to read all of the messages in the INBOX folder, making this portion slow. The server also had to refresh the folder whenever new messages arrived, making that portion slow as well. Performance degraded when the storage became 50% full. It would improve markedly when the oldest snapshot was deleted.

--
-Gary Mills-        -refurb-        -Winnipeg, Manitoba, Canada-