I moved my main workspaces over to ZFS a while ago and noticed that my disk got really noisy (yes, one of those subjective measurements). It sounded like the head was being bounced around a lot at the end of each transaction group. Today I grabbed the iosnoop DTrace script (from <http://www.opensolaris.org/os/community/dtrace/scripts/>) and looked a little at the output. It's strange; it looks as if the blocks are being written to disk in nearly random order.

I have a two-vdev pool, just plain disk slices, no mirroring etc. (I'm not using whole disks because I've just got the two disks in my workstation and my root is still on UFS.) If I use 'dd' to create a 1MB file out of 1KB writes and wait for it to be pushed to disk, one of the two disks sees a block stream like:

  27610929:1
  27610930:3
  27610933:9
  27610942:13
  39425458:13   <-- huh?
  27565952:16   <-- now we've gone backwards
  39400576:16
  27463484:4
  39342412:4
  27581454:2
  39382602:2
  27581456:2
  ...

So the head of this disk is happily bouncing back and forth at this point (well, they're FC disks with a reasonably deep queue, so it's not so bad as it could be, but it's still not great). The other disk is behaving a little better, but still moving back and forth between two block ranges.

Before I find some time to go dig into the intricacies of the I/O scheduler, any hints as to why this might be happening? My intuition would be that we ought to be able to write the blocks out in arbitrary order, since it's only the überblock write which commits them, so we should be able to use an always-move-forward ordering (and, of course, let the disk do its own scheduling within that).

Also, why the very small adjacent writes? Those first four writes in the snoop pushed out 13K of data using 4 separate write operations, which is wasteful. (There are others too, e.g. towards the end of the excerpt above we're doing two 1K writes to adjacent blocks.) Does the scheduler attempt to perform coalescing as well?

(I should mention that this is S10U2, so there have certainly been fixes since.)
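A minimal sketch (illustrative only, not part of iosnoop or any existing tool) of post-processing a trace like the excerpt above to count backward head movements. It assumes the trace has already been reduced to bare block:length lines, with the arrow annotations stripped:

/*
 * Illustrative only: scans a trace reduced to bare "block:length" lines
 * (like the excerpt above, annotations removed) and flags every write
 * that starts at a lower block address than the previous one, i.e. a
 * backward head movement on that disk.
 */
#include <stdio.h>

int
main(void)
{
    unsigned long long block, len, prev = 0;
    int first = 1, backward = 0, total = 0;

    while (scanf("%llu:%llu", &block, &len) == 2) {
        int back = (!first && block < prev);

        printf("%12llu:%-6llu%s\n", block, len,
            back ? "  <-- backward" : "");
        backward += back;
        prev = block;
        first = 0;
        total++;
    }
    printf("%d of %d writes moved the head backwards\n", backward, total);
    return (0);
}

Fed the twelve lines of the excerpt above (annotations removed), it reports four of the twelve writes as moving the head backwards.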
So while I'm feeling optimistic :-) we really ought to be able to do this in two I/O operations. If we have, say, 500K of data to write (including all of the metadata), we should be able to allocate a contiguous 500K block on disk and write that with a single operation. Then we update the überblock.

The only inherent problem preventing this right now is that we don't have general scatter/gather at the driver level (ugh). This is a bug that should be fixed, IMO. Then ZFS just needs to delay choosing physical block locations until they're being written as part of a group.

(Of course, as NetApp points out in their WAFL papers, the goal of optimizing writes can conflict with the goal of optimizing reads, so taken to an extreme, this optimization isn't always desirable.)
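As an analogy only (user-space C, not the driver interface under discussion): writev(2) already gives applications the "gather" half of scatter/gather, so several non-contiguous memory buffers can go out as one contiguous write. The complaint above is that the equivalent is missing one layer down, at the block driver, so ZFS cannot hand its separately allocated in-memory blocks to the disk as a single contiguous write. The file name below is arbitrary.

/*
 * Analogy only: three separate in-memory buffers go out as one
 * contiguous write via writev(2).  The point of the message above is
 * that nothing equivalent exists at the block-driver level.
 */
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char data0[16384], data1[16384], meta[4096];
    struct iovec iov[3];
    int fd;
    ssize_t n;

    memset(data0, 'a', sizeof (data0));
    memset(data1, 'b', sizeof (data1));
    memset(meta,  'm', sizeof (meta));

    /* Three non-contiguous memory buffers ... */
    iov[0].iov_base = data0; iov[0].iov_len = sizeof (data0);
    iov[1].iov_base = data1; iov[1].iov_len = sizeof (data1);
    iov[2].iov_base = meta;  iov[2].iov_len = sizeof (meta);

    fd = open("/tmp/sg-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        return (1);
    }

    /* ... issued as a single contiguous 36K write. */
    n = writev(fd, iov, 3);
    printf("wrote %ld bytes in one call\n", (long)n);
    close(fd);
    return (0);
}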
On Tue, Anton B. Rang wrote:

> So while I'm feeling optimistic :-) we really ought to be able to do this
> in two I/O operations. If we have, say, 500K of data to write (including
> all of the metadata), we should be able to allocate a contiguous 500K
> block on disk and write that with a single operation. Then we update the
> überblock.
>
> The only inherent problem preventing this right now is that we don't have
> general scatter/gather at the driver level (ugh).

Fixing this bug would help the NFS server significantly, given the general lack of contiguity of incoming write data (split at mblk boundaries).

Spencer
Anton B. Rang wrote:

> I have a two-vdev pool, just plain disk slices

If the vdevs are from the same disk, you are doomed.

ZFS tries to spread the load among the vdevs, so if the vdevs are from the same disk, you will have a seek hell.

I would suggest joining the slices using SVM (Solaris Volume Manager) and creating the ZFS pool over that new virtual device. Using ZFS over SVM is undocumented, but seems to work fine. Make sure the ZFS pool is accessible after a machine reboot, nevertheless.

--
Jesus Cea Avion <jcea at argo.es>
http://www.argo.es/~jcea/
Hello Jesus,

Wednesday, August 9, 2006, 2:21:24 PM, you wrote:

JC> Anton B. Rang wrote:
JC> > I have a two-vdev pool, just plain disk slices
JC>
JC> If the vdevs are from the same disk, you are doomed.
JC>
JC> ZFS tries to spread the load among the vdevs, so if the vdevs are from
JC> the same disk, you will have a seek hell.
JC>
JC> I would suggest joining the slices using SVM (Solaris Volume
JC> Manager) and creating the ZFS pool over that new virtual device.
JC>
JC> Using ZFS over SVM is undocumented, but seems to work fine. Make sure
JC> the ZFS pool is accessible after a machine reboot, nevertheless.

Then create a zvol and put UFS on top of it :))))))))))))))))

OK, just kidding :)

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
> So while I'm feeling optimistic :-) we really ought to be able to do this
> in two I/O operations. If we have, say, 500K of data to write (including
> all of the metadata), we should be able to allocate a contiguous 500K
> block on disk and write that with a single operation. Then we update the
> überblock.
>
> The only inherent problem preventing this right now is that we don't have
> general scatter/gather at the driver level (ugh). This is a bug that
> should be fixed, IMO. Then ZFS just needs to delay choosing physical
> block locations until they're being written as part of a group.
>
> (Of course, as NetApp points out in their WAFL papers, the goal of
> optimizing writes can conflict with the goal of optimizing reads, so
> taken to an extreme, this optimization isn't always desirable.)

Hi Anton, Optimistic a little, yes.

The data blocks should have aggregated quite well into near-recordsize I/Os; are you sure they did not? No O_DSYNC in here, right?

Once the data blocks are on disk, we have the information necessary to update the indirect blocks iteratively up to the ueberblock. Those are the smaller I/Os; I guess that because of ditto blocks they go to physically separate locations, by design. All of these, though, are normally done asynchronously to applications, unless the disks are flooded.

But I follow you in that it may be remotely possible to reduce the number of iterations in the process by assuming that the I/O will all succeed, then if some fails, fix up the consequences, and when all is done, update the ueberblock. I would not hold my breath quite yet for that.

-r
Robert Milkowski wrote:

> JC> Using ZFS over SVM is undocumented, but seems to work fine. Make sure
> JC> the ZFS pool is accessible after a machine reboot, nevertheless.
>
> Then create a zvol and put UFS on top of it :))))))))))))))))
>
> OK, just kidding :)

Don't joke about this. I'm still waiting for an answer to http://groups.yahoo.com/group/solarisx86/message/37310 };-)

(Constructive) comments will be very appreciated :-)

--
Jesus Cea Avion <jcea at argo.es>
http://www.argo.es/~jcea/
Jesus Cea wrote:

> Anton B. Rang wrote:
> > I have a two-vdev pool, just plain disk slices
>
> If the vdevs are from the same disk, you are doomed.
>
> ZFS tries to spread the load among the vdevs, so if the vdevs are from
> the same disk, you will have a seek hell.

It is not clear to me that this is a problem. I used to believe that it was always a problem. However, with modern drives using CTQ or NCQ, it is not as bad as you would expect. Add drive caching to the mix and it becomes very difficult to make a stand one way or the other.

Also, you should understand that the on-disk format for ZFS places uberblocks at the beginning and end of the drive. Also, ditto blocks are spread about using a diversity algorithm. The best way to analyse such systems is to use statistical methods. Predictions based upon simple seek models will no longer work.

Which leads me to... mirroring on one drive is better than not having data protection. Ditto blocks for data will be better than mirroring on one drive. Even when mirroring on one drive, the performance isn't too bad for casual use.

-- richard
On Aug 9, 2006, at 8:18 AM, Roch wrote:

> > So while I'm feeling optimistic :-) we really ought to be
> > able to do this in two I/O operations. If we have, say, 500K
> > of data to write (including all of the metadata), we should
> > be able to allocate a contiguous 500K block on disk and
> > write that with a single operation. Then we update the
> > überblock.
>
> Hi Anton, Optimistic a little, yes.
>
> The data blocks should have aggregated quite well into near-recordsize
> I/Os; are you sure they did not? No O_DSYNC in here, right?

When I repeated this with just 512K written in 1K chunks via dd, I saw six 16K writes. Those were the largest. The others were around 1K-4K. No O_DSYNC.

  dd if=/dev/zero of=xyz bs=1k count=512

So some writes are being aggregated, but we're missing a lot.

> Once the data blocks are on disk, we have the information
> necessary to update the indirect blocks iteratively up to
> the ueberblock. Those are the smaller I/Os; I guess that
> because of ditto blocks they go to physically separate
> locations, by design.

We shouldn't have to wait for the data blocks to reach disk, though. We know where they're going in advance. One of the key advantages of the überblock scheme is that we can, in a sense, speculatively write to disk. We don't need the tight ordering that UFS requires to avoid security exposures and allow the file system to be repaired. We can lay out all of the data and metadata, write them all to disk, choose new locations if the writes fail, etc., and not worry about any ordering or state issues, because the on-disk image doesn't change until we commit it.

You're right, the ditto block mechanism will mean that some writes will be spread around (at least when using a non-redundant pool like mine), but then we should have at most three writes followed by the überblock update, assuming three degrees of replication.

> All of these, though, are normally done asynchronously to
> applications, unless the disks are flooded.

Which is a good thing (I think they're asynchronous anyway, unless the cache is full).

> But I follow you in that it may be remotely possible to
> reduce the number of iterations in the process by assuming
> that the I/O will all succeed, then if some fails, fix up
> the consequences, and when all is done, update the ueberblock.
> I would not hold my breath quite yet for that.

Hmmm. I guess my point is that we shouldn't need to iterate at all. There are no dependencies between these writes; only between the complete set of writes and the überblock update.

-- Anton
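A minimal sketch (again illustrative only, not an existing tool) of tallying the write-length distribution from the same reduced block:length trace format, bucketed by powers of two in the spirit of DTrace's quantize() aggregation, to show how much of the data was coalesced into large writes and how much went out in small pieces:

/*
 * Illustrative only: buckets write lengths from a reduced "block:length"
 * trace by power of two, in whatever units the trace reports, so the
 * amount of aggregation is visible at a glance.
 */
#include <stdio.h>

int
main(void)
{
    unsigned long long block, len;
    unsigned long long bucket[32] = { 0 };
    int b;

    while (scanf("%llu:%llu", &block, &len) == 2) {
        for (b = 0; b < 31 && (1ULL << (b + 1)) <= len; b++)
            ;
        bucket[b]++;    /* len falls in [2^b, 2^(b+1)) */
    }
    for (b = 0; b < 32; b++) {
        if (bucket[b] != 0)
            printf("length %10llu - %-10llu : %llu writes\n",
                1ULL << b, (1ULL << (b + 1)) - 1, bucket[b]);
    }
    return (0);
}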
On Fri, Aug 11, 2006 at 11:04:06AM -0500, Anton Rang wrote:

> > Once the data blocks are on disk, we have the information
> > necessary to update the indirect blocks iteratively up to
> > the ueberblock. Those are the smaller I/Os; I guess that
> > because of ditto blocks they go to physically separate
> > locations, by design.
>
> We shouldn't have to wait for the data blocks to reach disk,
> though. We know where they're going in advance. One of the
> key advantages of the überblock scheme is that we can, in a
> sense, speculatively write to disk. We don't need the tight
> ordering that UFS requires to avoid security exposures and
> allow the file system to be repaired. We can lay out all of
> the data and metadata, write them all to disk, choose new
> locations if the writes fail, etc., and not worry about any
> ordering or state issues, because the on-disk image doesn't
> change until we commit it.
>
> You're right, the ditto block mechanism will mean that some
> writes will be spread around (at least when using a
> non-redundant pool like mine), but then we should have at
> most three writes followed by the überblock update, assuming
> three degrees of replication.

The problem is that you don't know the actual *contents* of the parent block until *all* of its children have been written to their final locations. (This is because the block pointer's value depends on the final location.) The ditto blocks don't really affect this, since they can all be written out in parallel. So you end up with the current N phases: data, its parents, its parents, ..., uberblock.

> > But I follow you in that it may be remotely possible to
> > reduce the number of iterations in the process by assuming
> > that the I/O will all succeed, then if some fails, fix up
> > the consequences, and when all is done, update the ueberblock.
> > I would not hold my breath quite yet for that.
>
> Hmmm. I guess my point is that we shouldn't need to iterate
> at all. There are no dependencies between these writes; only
> between the complete set of writes and the überblock update.

Again, there is; if a block write fails, you have to re-write it and all of its parents. So the best you could do would be (sketched in code after this message):

 1. assign locations for all blocks, and update the space bitmaps
    as necessary.
 2. update all of the blocks (other than the uberblock) with their actual
    contents (which requires calculating checksums on all of the
    child blocks)
 3. write everything out in parallel.
 3a. if any write fails, re-do 1+2 for that block, and 2 for all of its
     parents, then start over at 3 with all of the changed blocks.
 4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but certainly seems possible.

Cheers,
- jonathan

(this is only my understanding of how ZFS works; I could be mistaken)

--
Jonathan Adams, Solaris Kernel Development
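A rough model of the four steps above, with hypothetical types and helpers (blk_t, assign_location(), fill_block_pointer(), issue_writes(), write_failed(), write_uberblock()) standing in for the real ZFS code paths; this is a sketch of the proposed flow, not the actual spa/zio implementation:

/*
 * Sketch of the flow described above, NOT the actual ZFS implementation.
 * All types and helper functions here are hypothetical stand-ins for the
 * real allocation, checksum and I/O code.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct blk {
    struct blk *parent;     /* block holding the pointer + checksum for this one */
    bool        need_write; /* still has to reach stable storage */
    bool        failed;     /* this round's write failed */
} blk_t;

extern void assign_location(blk_t *b);            /* step 1: pick an LBA, update space maps */
extern void fill_block_pointer(blk_t *p, blk_t *c); /* step 2: record child address + checksum */
extern void issue_writes(blk_t **set, size_t n);  /* step 3: writes every need_write block, in parallel */
extern bool write_failed(const blk_t *b);         /* did this block's write fail? */
extern void write_uberblock(void);                /* step 4: the commit point */

/* 'blocks' is assumed to be ordered children-before-parents, root last. */
void
txg_sync_sketch(blk_t **blocks, size_t n)
{
    size_t i;
    bool retry;

    /* Steps 1 and 2: everything is placed and filled in before any I/O starts. */
    for (i = 0; i < n; i++) {
        blk_t *b = blocks[i];

        assign_location(b);
        if (b->parent != NULL)
            fill_block_pointer(b->parent, b);
        b->need_write = true;
        b->failed = false;
    }

    do {
        /* Step 3: one parallel write of everything still outstanding. */
        issue_writes(blocks, n);
        retry = false;

        /* Record this round's outcome before patching anything up. */
        for (i = 0; i < n; i++) {
            blk_t *b = blocks[i];

            b->failed = b->need_write && write_failed(b);
            if (!b->failed)
                b->need_write = false;
        }

        /*
         * Step 3a: a failed block gets a new location; its ancestors keep
         * their locations but need new contents (pointer + checksum).
         */
        for (i = 0; i < n; i++) {
            blk_t *b = blocks[i];

            if (!b->failed)
                continue;
            retry = true;
            assign_location(b);
            b->need_write = true;
            for (blk_t *p = b->parent, *c = b; p != NULL; c = p, p = p->parent) {
                fill_block_pointer(p, c);
                p->need_write = true;
            }
        }
    } while (retry);

    /* Step 4: nothing on disk is visible until the uberblock points at the new tree. */
    write_uberblock();
}

The key property is that steps 1 and 2 complete entirely in memory, so the single parallel write in step 3 carries no ordering constraints; only step 4 makes anything visible.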
On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:

> The problem is that you don't know the actual *contents* of the parent
> block until *all* of its children have been written to their final
> locations. (This is because the block pointer's value depends on the
> final location.)

But I know where the children are going before I actually write them. There is a dependency of the parent's contents on the *address* of its children, but not on the actual write. We can compute everything that we are going to write before we start to write. (Yes, in the event of a write failure we have to recover; but that's very rare, and can easily be handled -- we just start over, since no visible state has been changed.)

> The ditto blocks don't really affect this, since they can all be written
> out in parallel.

The reason they affect my desire of turning the update into a two-phase commit (make all the changes, then update the überblock) is because the ditto blocks are deliberately spread across the disk, so we can't collect them into a single write (for a non-redundant pool, or at least a one-disk pool -- presumably they wind up on different disks for a two-disk pool, in which case we can still do a single write per disk).

> Again, there is; if a block write fails, you have to re-write it and
> all of its parents. So the best you could do would be:
>
>  1. assign locations for all blocks, and update the space bitmaps
>     as necessary.
>  2. update all of the blocks (other than the uberblock) with their actual
>     contents (which requires calculating checksums on all of the
>     child blocks)
>  3. write everything out in parallel.
>  3a. if any write fails, re-do 1+2 for that block, and 2 for all of its
>      parents, then start over at 3 with all of the changed blocks.
>  4. once everything is on stable storage, update the uberblock.
>
> That's a lot more complicated than the current model, but certainly
> seems possible.

(3a could actually be simplified to simply "mark the bad blocks as unallocatable, and go to 1", but it's more efficient as you describe.)

The eventual advantage, though, is that we get the performance of a single write (plus, always, the überblock update). In a heavily loaded system, the current approach (lots of small writes) won't scale so well. (Actually we'd probably want to limit the size of each write to some small value, like 16 MB, simply to allow the first write to start earlier under fairly heavy loads.)

As I pointed out earlier, this would require getting scatter/gather support through the storage subsystem, but the potential win should be quite large. Something to think about for the future. :-)

Incidentally, this is part of how QFS gets its performance for streaming I/O. We use an "allocate forward" policy, allow very large allocation blocks, and separate the metadata from data. This allows us to write (or read) data in fairly large I/O requests, without unnecessary disk head motion.

Anton
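The "allocate forward" policy Anton describes (and the always-move-forward ordering he asked about at the start of the thread) amounts to a cursor that only advances through a region of free space, so blocks written in allocation order never require a backward seek within that region. A minimal sketch with a hypothetical region structure, not the QFS or ZFS allocator:

/*
 * Hypothetical "allocate forward" region: a cursor walks forward, so
 * each allocation starts at or after the previous one, and writes issued
 * in allocation order never move the head backwards within the region.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct fwd_region {
    uint64_t start;     /* first block of the region */
    uint64_t end;       /* one past the last block */
    uint64_t cursor;    /* next block to hand out */
} fwd_region_t;

/* Returns the starting block of an nblocks extent, or UINT64_MAX if the region is full. */
static uint64_t
fwd_alloc(fwd_region_t *r, uint64_t nblocks)
{
    if (r->cursor + nblocks > r->end)
        return (UINT64_MAX);    /* a real allocator would move to another region */
    uint64_t blk = r->cursor;
    r->cursor += nblocks;
    return (blk);
}

int
main(void)
{
    fwd_region_t r = { .start = 27610929, .end = 27700000, .cursor = 27610929 };

    /* Successive allocations come out in strictly increasing block order. */
    printf("%llu\n", (unsigned long long)fwd_alloc(&r, 1));
    printf("%llu\n", (unsigned long long)fwd_alloc(&r, 3));
    printf("%llu\n", (unsigned long long)fwd_alloc(&r, 9));
    printf("%llu\n", (unsigned long long)fwd_alloc(&r, 13));
    return (0);
}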
> Incidentally, this is part of how QFS gets its performance for streaming
> I/O. We use an "allocate forward" policy, allow very large allocation
> blocks, and separate the metadata from data. This allows us to write (or
> read) data in fairly large I/O requests, without unnecessary disk head
> motion.

I believe ZFS allocates forward for both data and metadata, but I don't think this causes unnecessary disk head motion.

-r