While trying some things earlier to figure out how zpool iostat is supposed to be interpreted, I noticed that ZFS behaves kind of oddly when writing data. Not to say that it's bad, just interesting. I wrote 160MB of zeroed data with dd, with zpool iostat running at a one-second interval.

dd actually finished before the disk activity started, so I suppose ZFS does aggressive write caching. Following the iostats, however, ZFS wrote for two seconds at what I suppose is full speed (roughly 35-40MB/s) and then continued for 26.5 seconds at 3.2MB/s. Adding all the values up, I get to the 160MB.

I found this interesting. Is this intended? What's the rationale behind it? Wouldn't this put huge data writes in jeopardy, if dragged out like this?

Thanks.
-mg
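(For reference, the test amounted to something like the following; the pool name and target file are placeholders for whatever you actually use:)

    # watch pool-level IO once per second while the write runs
    zpool iostat tank 1 &

    # write 160MB of zeros into a file on the pool
    dd if=/dev/zero of=/tank/fs/zerofile bs=1024k count=160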
On 5/8/07, Mario Goebbels <me at tomservo.cc> wrote:
> I wrote 160MB of zeroed data with dd, with zpool iostat running at a
> one-second interval. [...] Is this intended? What's the rationale behind
> it? Wouldn't this put huge data writes in jeopardy, if dragged out like
> this?

ZFS will interpret zero'd sectors as holes, so it won't really write them to
disk; it just adjusts the file size accordingly.

James Dickens
uadmin.blogspot.com
I've noticed similar behavior in my writes. ZFS seems to write in bursts of
around 5 seconds. I assume it's just something to do with caching?

I was watching the drive lights on the T2000s with a 3-disk raidz, and the
disks all blink for a couple of seconds and then are solid for a few seconds.
Is this behavior OK? It seems it would be better to have the disks writing
the whole time instead of in bursts. On my thumper (the burst samples are
marked with a trailing *):

pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
vault1      10.7T  8.32T    108    561  7.23M  24.8M
vault1      10.7T  8.32T    108    152  2.68M  5.90M
vault1      10.7T  8.32T    143    177  6.49M  11.4M
vault1      10.7T  8.32T    147    429  6.59M  27.0M
vault1      10.7T  8.32T    111  3.89K  2.84M   131M  *
vault1      10.7T  8.32T     74    151   460K  6.72M
vault1      10.7T  8.32T    103    180  1.71M  7.21M
vault1      10.7T  8.32T    119    144   832K  5.69M
vault1      10.7T  8.32T    110    185  2.51M  4.75M
vault1      10.7T  8.32T     94  2.17K  1.07M   137M  *
vault1      10.7T  8.32T     36  2.87K   354K  24.9M  *
vault1      10.7T  8.32T     69    140  3.36M  6.00M
vault1      10.7T  8.32T     60    177  4.78M  12.9M
vault1      10.7T  8.32T     90    198  2.82M  5.22M
vault1      10.7T  8.32T     94  1.12K  2.22M  18.1M  *
vault1      10.7T  8.32T     37  3.79K  2.06M   130M  *
vault1      10.7T  8.32T     88    254  2.43M  10.2M
vault1      10.7T  8.32T    137    147  3.64M  7.05M
vault1      10.7T  8.32T    307    415  5.84M  9.38M
vault1      10.7T  8.32T    132  4.13K  2.26M   158M  *
vault1      10.7T  8.32T     57  1.45K  1.89M  13.2M  *
vault1      10.7T  8.32T     78    148   577K  8.47M
vault1      10.7T  8.32T     17    159   749K  6.26M
vault1      10.7T  8.32T     74    248   598K  6.56M
vault1      10.7T  8.32T    178  1.20K  1.62M  23.8M  *
vault1      10.7T  8.32T     46  5.23K  1.01M   168M  *
On Fri, 2007-05-11 at 09:00 -0700, lonny wrote:
> I've noticed similar behavior in my writes. ZFS seems to write in bursts
> of around 5 seconds. I assume it's just something to do with caching?

Yep - the ZFS equivalent of fsflush. Runs more often so the pipes don't
get as clogged. We've had lots of rain here recently, so I'm sort of
sensitive to stories of clogged pipes.

> Is this behavior OK? It seems it would be better to have the disks writing
> the whole time instead of in bursts.

Perhaps - although not in all cases (probably not in most cases). Wouldn't
it be cool to actually do some nice sequential writes to the sweet spot of
the disk bandwidth curve, but not depend on it so much that a single random
I/O here and there throws you for a loop?

Human analogy - it's often more wise to work smarter than harder :-)

Directly to your question - are you seeing any anomalies in file system
read or write performance (bandwidth or latency)?

Bob
On May 11, 2007, at 9:09 AM, Bob Netherton wrote:
> Perhaps - although not in all cases (probably not in most cases). [...]
>
> Directly to your question - are you seeing any anomalies in file system
> read or write performance (bandwidth or latency)?

No performance problems so far; the thumper and ZFS seem to handle
everything we throw at them. On the T2000 internal disks we were seeing a
bottleneck when using a single disk for our apps, but moving to a 3-disk
raidz alleviated that.

The only issue is that when using iostat commands, the bursts make it a
little harder to gauge performance. Is it safe to assume that if those
bursts were to reach the upper performance limit, the writes would spread
out a bit more?

thanks
lonny
Neil.Perrin at Sun.COM
2007-May-11 17:53 UTC
[zfs-discuss] Re: How does ZFS write data to disks?
lonny wrote:
> No performance problems so far; the thumper and ZFS seem to handle
> everything we throw at them. On the T2000 internal disks we were seeing a
> bottleneck when using a single disk for our apps, but moving to a 3-disk
> raidz alleviated that.
>
> The only issue is that when using iostat commands, the bursts make it a
> little harder to gauge performance. Is it safe to assume that if those
> bursts were to reach the upper performance limit, the writes would spread
> out a bit more?

The burst of activity every 5 seconds is when the transaction group is
committed. Batching up the writes in this way can lead to a number of
efficiencies (as Bob hinted). With heavier activity the writes will not get
spread out, but will just take longer. Another way to look at the gaps of IO
inactivity is that they indicate under-utilisation.

Neil.
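If you want to see the commits line up with the bursts, something like this
DTrace one-liner should do it (a sketch; it assumes the fbt provider can see
spa_sync(), the routine that writes each transaction group out to disk):

    # print a line each time a transaction group sync starts
    dtrace -qn 'fbt::spa_sync:entry { printf("%Y  txg %d syncing\n", walltimestamp, arg1); }'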
> The only issue is that when using iostat commands, the bursts make it a
> little harder to gauge performance. Is it safe to assume that if those
> bursts were to reach the upper performance limit, the writes would spread
> out a bit more?

I think it's also important to note _how_ one measures performance (which is
black magic at the best of times). I personally like to see averages, since
doing

#iostat -xnz 10

doesn't tell me anything really. Since zfs likes to "bundle and flush", I
want my (very expensive ;) Sun storage to give me all it's got. I'm not too
concerned if a 5-second flush gives the disk subsystem a good workout, but
when I/O utilization is around 100% with service times of 30+ ms over a
period of an hour... then I might want to wheel the drawing board into the
architect's office.

My 2c :)
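For example (a sketch; the intervals and counts are arbitrary), short samples
show the bursts while a long interval averages them away:

    # ten one-second samples: write bandwidth bounces between idle and a burst
    iostat -xnz 1 10

    # five one-minute samples: the bursts blend into the sustained average
    iostat -xnz 60 5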
Hello James,

Thursday, May 10, 2007, 11:12:57 PM, you wrote:

> zfs will interpret zero'd sectors as holes, so it won't really write them
> to disk; it just adjusts the file size accordingly.

It does that only with compression turned on.

--
Best regards,
Robert                         mailto:rmilkowski@task.gda.pl
                               http://milek.blogspot.com
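A quick way to see the difference (a sketch; pool and dataset names are
placeholders):

    zfs create tank/nocomp
    zfs create tank/comp
    zfs set compression=on tank/comp

    # write 160MB of zeros into each dataset
    dd if=/dev/zero of=/tank/nocomp/zeros bs=1024k count=160
    dd if=/dev/zero of=/tank/comp/zeros bs=1024k count=160

    # both files report the same logical size...
    ls -l /tank/nocomp/zeros /tank/comp/zeros
    # ...but only the uncompressed copy actually consumes ~160MB on disk
    du -k /tank/nocomp/zeros /tank/comp/zeros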
Writes to ZFS objects have significant data and meta-data implications, based
on the ZFS copy-on-write implementation. As data is written into a file
object, for example, this update must eventually be written to a new location
on physical disk, and all of the meta-data (from the uberblock down to this
object) must be updated and re-written to a new location as well. While in
cache, the changes to these objects can be consolidated, but once written out
to disk, any further changes would make this recent write obsolete and
require it all to be written once again to yet another new location on the
disk.

Batching transactions for 5 seconds (the trigger discussed in the ZFS
documentation) is essential to limiting the amount of redundant re-writing
that takes place to physical disk. Keeping a disk busy 100% of the time by
writing mostly the same data over and over makes far less sense than
collecting a group of changes in cache and writing them efficiently every
trigger period.

Even with this optimization, our experience with small, sequential writes
(4KB or less) to zvols that have been previously written (to ensure the
mapping of real space on the physical disk) shows bandwidth values that are
less than 10% of comparable larger (128KB or larger) writes. You can see this
behavior dramatically if you compare the amount of host-initiated write data
(front-end data) to the actual amount of IO performed to the physical disks
(both reads and writes) to handle the host's front-end request. For example,
doing sequential 1MB writes to a (previously written) zvol (a simple
catenation of 5 FC drives in a JBOD) and writing 2GB of data induced more
than 4GB of IO to the drives (with smaller write sizes this ratio gets
progressively worse).
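Roughly the kind of comparison involved (a sketch, not the exact test
harness; pool, zvol and device names are placeholders):

    # front-end: 2GB of sequential 1MB writes issued by the host
    dd if=/dev/zero of=/dev/zvol/rdsk/tank/testvol bs=1024k count=2048

    # back-end: watch the pool's member disks while the dd runs; summing the
    # kr/s and kw/s columns over the run gives the physical IO actually done,
    # which can come to well over the 2GB the host wrote
    iostat -xn 5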
Bill Moloney wrote:
> for example, doing sequential 1MB writes to a (previously written) zvol
> (simple catenation of 5 FC drives in a JBOD) and writing 2GB of data
> induced more than 4GB of IO to the drives (with smaller write sizes this
> ratio gets progressively worse)

How did you measure this? This would imply that rewriting a zvol would be
limited to below 50% of disk bandwidth, not something I'm seeing at all.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com       http://blogs.sun.com/barts
Robert Milkowski
2007-May-16 22:18 UTC
[zfs-discuss] Re: How does ZFS write data to disks?
Hello Bart,

Wednesday, May 16, 2007, 6:07:36 PM, you wrote:

BS> Bill Moloney wrote:
>> for example, doing sequential 1MB writes to a (previously written) zvol
>> (simple catenation of 5 FC drives in a JBOD) and writing 2GB of data
>> induced more than 4GB of IO to the drives (with smaller write sizes this
>> ratio gets progressively worse)

BS> How did you measure this? This would imply that rewriting
BS> a zvol would be limited to below 50% of disk bandwidth, not
BS> something I'm seeing at all.

Perhaps the zvol was created with the default 128k block size and smaller
writes were then issued. Perhaps lowering volblocksize to 8k, or to whatever
average (or constant?) IO size he is using, would help?

--
Best regards,
Robert                         mailto:rmilkowski at task.gda.pl
                               http://milek.blogspot.com
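For reference (pool and volume names are placeholders), volblocksize can only
be set when the zvol is created, via -b:

    # create a 10GB zvol with an 8k block size to match the workload
    zfs create -b 8k -V 10g tank/vol8k

    # confirm the setting
    zfs get volblocksize tank/vol8k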
Bill Moloney
2007-May-17 18:21 UTC
[zfs-discuss] Re: Re[2]: Re: How does ZFS write data to disks?
This is not a problem we're trying to solve, but part of a characterization
study of the ZFS implementation. We're currently using the default 8KB
blocksize for our zvol deployment, and we're performing tests using write
block sizes as small as 4KB and as large as 1MB, as previously described
(including an 8KB write aligned to logical zvol block zero, for a perfect
match to the zvol blocksize). In all cases we see at least twice the IO to
the disks that we generate from our test program (and it's much worse for
smaller write block sizes).

We're not exactly caught in read-modify-write hell (except when we write the
4KB blocks that are smaller than the zvol blocksize); it's more like
modify-write hell, since the original meta-data that maps the 2GB region
we're writing is probably just read once and kept in cache for the duration
of the test. The large amount of back-end IO is almost entirely write
operations, but these write operations include the re-writing of meta-data
that has to change to reflect the re-location of newly written data
(remember, no in-place writes ever occur for data or meta-data).

Using the default zvol block size of 8KB, ZFS requires, in just block-pointer
meta-data, about 1.5% of the total 2GB write region (this is a large
percentage vs other file systems like UFS, for example, because ZFS uses a
128-byte block pointer vs a UFS 8-byte block pointer).

As new data is written over the old data, the leaves of the meta-data tree
are necessarily changed to point to the new on-disk locations of the new
data, but any new leaf block-pointer requires that a new block of leaf
pointers be allocated and written, which requires that the next indirect
level up from these leaves point to this new set of leaf pointers, so it must
be rewritten itself, and so on up the tree (and remember, meta-data is
subject to being written in up to 3 copies - the default is 2 - any time any
of it is written to disk).

The indirect pointer blocks closer to the root of the tree may only see a
single pointer change over the course of a 5-second consolidation (based on
the size of the zvol, the size of the block allocation unit in the zvol and
the amount of data actually written to the zvol in 5 seconds), but a complete
new indirect block must be created and written to disk (all the way back to
the uberblock) on each transaction group write. This means that some of these
meta-data blocks are written to disk over and over again with only small
changes from their previous composition. Consolidating for more than 5
seconds would help to mitigate this situation, but longer consolidation
periods put more data at risk of being lost in case of a power failure.

This is not particularly a problem, just a manifestation of the need to never
write in place, a rather large block pointer size, and the possible writing
of multiple copies of meta-data (of course this block pointer carries
checksums and the addresses of up to 3 duplicate blocks, providing the
excellent data and meta-data protection ZFS is so well known for). The
original thread that this reply addressed was the characteristic 5-second
delay in writes, which I tried to explain in the context of copy-on-write
consolidation, but it's clear that even this delay cannot prevent the
modification and re-writing of the same basic meta-data many times with small
modifications.
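For what it's worth, the ~1.5% figure falls straight out of the block counts
(a back-of-the-envelope check that counts only the leaf block pointers,
ignoring the higher indirect levels and the ditto copies):

    # 2 GiB region divided into 8 KiB blocks = number of leaf block pointers
    echo $((2 * 1024 * 1024 * 1024 / 8192))    # 262144 blocks
    # 262144 pointers at 128 bytes each = the leaf block-pointer meta-data
    echo $((262144 * 128))                     # 33554432 bytes (32 MiB)
    # 32 MiB of pointers / 2048 MiB of data is roughly 1.6% of the region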