I moved my main workspaces over to ZFS a while ago and noticed that my disk got really noisy (yes, one of those subjective measurements). It sounded like the head was being bounced around a lot at the end of each transaction group.

Today I grabbed the iosnoop dtrace script (from <http://www.opensolaris.org/os/community/dtrace/scripts/>) and looked a little at the output. It's strange, it looks as if the blocks are being written to disk in nearly random order.

I have a two-vdev pool, just plain disk slices, no mirroring etc. (I'm not using whole disks because I've just got the two disks in my workstation and my root is still on UFS.)

If I use 'dd' to create a 1MB file out of 1KB writes and wait for it to be pushed to disk, one of the two disks sees a block stream like:

    27610929:1
    27610930:3
    27610933:9
    27610942:13
    39425458:13   <-- huh?
    27565952:16   <-- now we've gone backwards
    39400576:16
    27463484:4
    39342412:4
    27581454:2
    39382602:2
    27581456:2
    ...

So the head of this disk is happily bouncing back and forth at this point (well, they're FC disks with a reasonably deep queue, so it's not so bad as it could be, but it's still not great). The other disk is behaving a little better, but still moving back and forth between two block ranges.

Before I find some time to go dig into the intricacies of the I/O scheduler, any hints as to why this might be happening? My intuition would be that we ought to be able to write the blocks out in arbitrary order since it's only the überblock write which commits them, so we should be able to use an always-move-forward ordering (and, of course, let the disk do its own scheduling within that).

Also, why the very small adjacent writes? Those first four writes in the snoop pushed out 13K of data using 4 separate write operations, which is wasteful. (There are others too, e.g. towards the end of the excerpt above we're doing two 1K writes to adjacent blocks.) Does the scheduler attempt to perform coalescing as well?
(I should mention that this is S10U2 so there have certainly been fixes since.) This message posted from opensolaris.org
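To make the two complaints above concrete (backward head movement and uncoalesced adjacent writes), here is a minimal Python sketch that replays the twelve entries from the excerpt. It assumes each entry is `start_block:length` in 512-byte blocks, which is inferred from the remark that the first four writes total 13K; the function names are made up for this illustration and are not part of iosnoop or ZFS.

```python
# Each entry is (starting block, length in blocks), copied from the excerpt
# above in the order the disk saw the writes.
TRACE = [
    (27610929, 1), (27610930, 3), (27610933, 9), (27610942, 13),
    (39425458, 13), (27565952, 16), (39400576, 16), (27463484, 4),
    (39342412, 4), (27581454, 2), (39382602, 2), (27581456, 2),
]

BLOCK_SIZE = 512  # bytes; an assumption inferred from "13K" over 26 blocks


def backward_seeks(trace):
    """Count writes that start at a lower block than the previous write."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if cur[0] < prev[0])


def coalesce(trace):
    """Sort by starting block and merge writes that touch end to start."""
    merged = []
    for start, length in sorted(trace):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((start, length))
    return merged


if __name__ == "__main__":
    print("backward seeks as issued:", backward_seeks(TRACE))
    for start, length in coalesce(TRACE):
        print(start, length * BLOCK_SIZE, "bytes")
```

Under these assumptions, an always-move-forward ordering with coalescing would turn the twelve writes into eight, with the first four collapsing into a single 13K write.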
So while I'm feeling optimistic :-) we really ought to be able to do this in two I/O operations. If we have, say, 500K of data to write (including all of the metadata), we should be able to allocate a contiguous 500K block on disk and write that with a single operation. Then we update the überblock.

The only inherent problem preventing this right now is that we don't have general scatter/gather at the driver level (ugh). This is a bug that should be fixed, IMO. Then ZFS just needs to delay choosing physical block locations until they're being written as part of a group.

(Of course, as NetApp points out in their WAFL papers, the goal of optimizing writes can conflict with the goal of optimizing reads, so taken to an extreme, this optimization isn't always desirable.)
On Tue, Anton B. Rang wrote:
> So while I'm feeling optimistic :-) we really ought to be able to do this in two I/O operations. If we have, say, 500K of data to write (including all of the metadata), we should be able to allocate a contiguous 500K block on disk and write that with a single operation. Then we update the überblock.
>
> The only inherent problem preventing this right now is that we don't have general scatter/gather at the driver level (ugh).

Fixing this bug would help the NFS server significantly given the general lack of continuity of incoming write data (split at mblk boundaries).

Spencer
Anton B. Rang wrote:
> I have a two-vdev pool, just plain disk slices

If the vdevs are from the same disk, you are doomed.

ZFS tries to spread the load among the vdevs, so if the vdevs are on the same disk, you will be in seek hell.

I would suggest joining the slices using SVM (Solaris Volume Manager) and creating the ZFS pool on top of that new virtual device.

Using ZFS over SVM is undocumented, but seems to work fine. Nevertheless, make sure the ZFS pool is accessible after a machine reboot.

--
Jesus Cea Avion
jcea at argo.es  http://www.argo.es/~jcea/
jabber / xmpp:jcea at jabber.org
"Things are not so easy"
"My name is Dump, Core Dump"
"Love is placing your happiness in the happiness of another" - Leibniz
Hello Jesus,

Wednesday, August 9, 2006, 2:21:24 PM, you wrote:

JC> Anton B. Rang wrote:
>> I have a two-vdev pool, just plain disk slices

JC> If the vdevs are from the same disk, you are doomed.

JC> ZFS tries to spread the load among the vdevs, so if the vdevs are from
JC> the same disk, you will have a seek hell.

JC> I would suggest you to join the slices using SVM (Solaris Volume
JC> manager) and create the ZFS pool over that new virtual device.

JC> Using ZFS over SVM is undocumented, but seems to work fine. Make sure
JC> the zfs pool is accessible after a machine reboot, nevertheless.

Then create a zvol and put UFS on top of it :))))))))))))))))

ok, just kidding :)

--
Best regards,
Robert  mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
> So while I'm feeling optimistic :-) we really ought to be
> able to do this in two I/O operations. If we have, say, 500K
> of data to write (including all of the metadata), we should
> be able to allocate a contiguous 500K block on disk and
> write that with a single operation. Then we update the
> Uberblock.
>
> The only inherent problem preventing this right now is
> that we don't have general scatter/gather at the driver
> level (ugh). This is a bug that should be fixed, IMO. Then
> ZFS just needs to delay choosing physical block locations
> until they're being written as part of a group.
>
> (Of course, as NetApp points out in their WAFL papers, the goal
> of optimizing writes can conflict with the goal of
> optimizing reads, so taken to an extreme, this optimization
> isn't always desirable.)

Hi Anton, a little optimistic, yes.

The data blocks should have aggregated quite well into near
recordsize I/Os; are you sure they did not? No O_DSYNC in
here, right?

Once the data blocks are on disk we have the information
necessary to update the indirect blocks iteratively up to
the ueberblock. Those are the smaller I/Os; I guess that
because of ditto blocks they go to physically separate
locations, by design.

All of these, though, are normally done asynchronously to
applications, unless the disks are flooded.

But I follow you in that it may be remotely possible to
reduce the number of iterations in the process by assuming
that the I/Os will all succeed, then, if some fail, fixing up
the consequences and, when all is done, updating the ueberblock.
I would not hold my breath quite yet for that.

-r
Robert Milkowski wrote:
> JC> Using ZFS over SVM is undocumented, but seems to work fine. Make sure
> JC> the zfs pool is accessible after a machine reboot, nevertheless.
>
> Then create zvol and put UFS on top of it :))))))))))))))))
>
> ok, just kidding :)

Don't joke about this. I'm still waiting for an answer to http://groups.yahoo.com/group/solarisx86/message/37310 };-)

(Constructive) comments will be much appreciated :-)

--
Jesus Cea Avion
Jesus Cea wrote:
> Anton B. Rang wrote:
>> I have a two-vdev pool, just plain disk slices
>
> If the vdevs are from the same disk, you are doomed.
>
> ZFS tries to spread the load among the vdevs, so if the vdevs are from
> the same disk, you will have a seek hell.

It is not clear to me that this is a problem. I used to believe that it was always a problem. However, with modern drives using CTQ or NCQ, it is not as bad as you would expect. Add drive caching to the mix and it becomes very difficult to make a stand one way or the other.

Also, you should understand that the on-disk format for ZFS places uberblocks at the beginning and end of the drive. Also, ditto blocks are spread about using a diversity algorithm.

The best way to analyse such systems is to use statistical methods. Predictions based upon simple seek models will no longer work.

Which leads me to... mirroring on one drive is better than not having data protection. Ditto blocks for data will be better than mirroring on one drive. Even when mirroring on one drive, the performance isn't too bad for casual use.

-- richard
On Aug 9, 2006, at 8:18 AM, Roch wrote:
> > So while I'm feeling optimistic :-) we really ought to be able to do this in two I/O operations. If we have, say, 500K of data to write (including all of the metadata), we should be able to allocate a contiguous 500K block on disk and write that with a single operation. Then we update the Uberblock.
>
> Hi Anton, a little optimistic, yes.
>
> The data blocks should have aggregated quite well into near recordsize I/Os; are you sure they did not? No O_DSYNC in here, right?

When I repeated this with just 512K written in 1K chunks via dd, I saw six 16K writes. Those were the largest. The others were around 1K-4K. No O_DSYNC.

    dd if=/dev/zero of=xyz bs=1k count=512

So some writes are being aggregated, but we're missing a lot.

> Once the data blocks are on disk we have the information necessary to update the indirect blocks iteratively up to the ueberblock. Those are the smaller I/Os; I guess that because of ditto blocks they go to physically separate locations, by design.

We shouldn't have to wait for the data blocks to reach disk, though. We know where they're going in advance. One of the key advantages of the überblock scheme is that we can, in a sense, speculatively write to disk. We don't need the tight ordering that UFS requires to avoid security exposures and allow the file system to be repaired. We can lay out all of the data and metadata, write them all to disk, choose new locations if the writes fail, etc. and not worry about any ordering or state issues, because the on-disk image doesn't change until we commit it.

You're right, the ditto block mechanism will mean that some writes will be spread around (at least when using a non-redundant pool like mine), but then we should have at most three writes followed by the überblock update, assuming three degrees of replication.

> All of these, though, are normally done asynchronously to applications, unless the disks are flooded.

Which is a good thing (I think they're asynchronous anyway, unless the cache is full).

> But I follow you in that it may be remotely possible to reduce the number of iterations in the process by assuming that the I/Os will all succeed, then, if some fail, fixing up the consequences and, when all is done, updating the ueberblock. I would not hold my breath quite yet for that.

Hmmm. I guess my point is that we shouldn't need to iterate at all. There are no dependencies between these writes; only between the complete set of writes and the überblock update.

-- Anton
On Fri, Aug 11, 2006 at 11:04:06AM -0500, Anton Rang wrote:
> > Once the data blocks are on disk we have the information necessary to update the indirect blocks iteratively up to the ueberblock. Those are the smaller I/Os; I guess that because of ditto blocks they go to physically separate locations, by design.
>
> We shouldn't have to wait for the data blocks to reach disk, though. We know where they're going in advance. One of the key advantages of the überblock scheme is that we can, in a sense, speculatively write to disk. We don't need the tight ordering that UFS requires to avoid security exposures and allow the file system to be repaired. We can lay out all of the data and metadata, write them all to disk, choose new locations if the writes fail, etc. and not worry about any ordering or state issues, because the on-disk image doesn't change until we commit it.
>
> You're right, the ditto block mechanism will mean that some writes will be spread around (at least when using a non-redundant pool like mine), but then we should have at most three writes followed by the überblock update, assuming three degrees of replication.

The problem is that you don't know the actual *contents* of the parent block until *all* of its children have been written to their final locations. (This is because the block pointer's value depends on the final location.) The ditto blocks don't really affect this, since they can all be written out in parallel.

So you end up with the current N phases: data, its parents, their parents, ..., uberblock.

> > But I follow you in that it may be remotely possible to reduce the number of iterations in the process by assuming that the I/Os will all succeed, then, if some fail, fixing up the consequences and, when all is done, updating the ueberblock. I would not hold my breath quite yet for that.
>
> Hmmm. I guess my point is that we shouldn't need to iterate at all. There are no dependencies between these writes; only between the complete set of writes and the überblock update.

Again, there is; if a block write fails, you have to re-write it and all of its parents. So the best you could do would be:

    1. assign locations for all blocks, and update the space bitmaps as necessary.
    2. update all of the non-Uberdata blocks with their actual contents (which requires calculating checksums on all of the child blocks)
    3. write everything out in parallel.
       3a. if any write fails, re-do 1+2 for that block, and 2 for all of its parents, then start over at 3 with all of the changed blocks.
    4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but certainly seems possible.

Cheers,
- jonathan

(this is only my understanding of how ZFS works; I could be mistaken)

--
Jonathan Adams, Solaris Kernel Development
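The numbered steps in the message above can be sketched as a toy Python model. This is only an illustration under stated assumptions, not ZFS code: `Block`, `ToyDisk`, `fill` and `commit` are hypothetical names, the "disk" is a dictionary, a block pointer is reduced to an address plus a SHA-256 checksum, and space bitmaps and ditto copies are left out.

```python
import hashlib


class Block:
    """A node in a toy block tree: data leaves at the bottom, root on top."""

    def __init__(self, payload=b"", children=()):
        self.payload = payload
        self.children = list(children)
        self.parent = None
        self.addr = None       # chosen in step 1
        self.contents = None   # computed in step 2
        for child in self.children:
            child.parent = self


class ToyDisk:
    """In-memory stand-in for a disk; writes to addresses in `bad` fail."""

    def __init__(self, bad=()):
        self.bad = set(bad)
        self.blocks = {}
        self.uberblock = None
        self.next_free = 0

    def allocate(self):
        addr, self.next_free = self.next_free, self.next_free + 1
        return addr

    def write(self, addr, data):
        if addr in self.bad:
            return False
        self.blocks[addr] = data
        return True


def checksum(data):
    return hashlib.sha256(data).hexdigest().encode()


def fill(block):
    """Step 2: a block's contents embed each child's address and checksum."""
    pointers = b"".join(b"%d:%s;" % (c.addr, checksum(c.contents))
                        for c in block.children)
    block.contents = block.payload + pointers


def postorder(block):
    """Yield children before parents."""
    for child in block.children:
        yield from postorder(child)
    yield block


def commit(root, disk):
    """Run steps 1-4 and return the number of write rounds needed."""
    order = list(postorder(root))
    for b in order:                      # step 1: assign locations
        b.addr = disk.allocate()
    for b in order:                      # step 2: compute contents in memory
        fill(b)
    pending, rounds = order, 0
    while pending:
        rounds += 1                      # step 3: one batch, no ordering inside
        failed = {b for b in pending if not disk.write(b.addr, b.contents)}
        dirty = set()
        for b in failed:                 # step 3a: move block, redo ancestors
            b.addr = disk.allocate()
            node = b.parent
            while node is not None:
                dirty.add(node)
                node = node.parent
        for b in order:                  # children are refilled before parents
            if b in dirty:
                fill(b)
        pending = [b for b in order if b in failed or b in dirty]
    disk.uberblock = (root.addr, checksum(root.contents))  # step 4: commit
    return rounds
```

With no write failures the model needs one write round plus the uberblock update, whatever the depth of the tree; a failed write costs one extra round covering only the moved block and its ancestors.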
On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:
> The problem is that you don't know the actual *contents* of the parent block until *all* of its children have been written to their final locations. (This is because the block pointer's value depends on the final location.)

But I know where the children are going before I actually write them. There is a dependency of the parent's contents on the *address* of its children, but not on the actual write. We can compute everything that we are going to write before we start to write.

(Yes, in the event of a write failure we have to recover; but that's very rare, and can easily be handled -- we just start over, since no visible state has been changed.)

> The ditto blocks don't really affect this, since they can all be written out in parallel.

The reason they affect my desire to turn the update into a two-phase commit (make all the changes, then update the überblock) is that the ditto blocks are deliberately spread across the disk, so we can't collect them into a single write (for a non-redundant pool, or at least a one-disk pool -- presumably they wind up on different disks for a two-disk pool, in which case we can still do a single write per disk).

> Again, there is; if a block write fails, you have to re-write it and all of its parents. So the best you could do would be:
>
>     1. assign locations for all blocks, and update the space bitmaps as necessary.
>     2. update all of the non-Uberdata blocks with their actual contents (which requires calculating checksums on all of the child blocks)
>     3. write everything out in parallel.
>        3a. if any write fails, re-do 1+2 for that block, and 2 for all of its parents, then start over at 3 with all of the changed blocks.
>     4. once everything is on stable storage, update the uberblock.
>
> That's a lot more complicated than the current model, but certainly seems possible.

(3a could actually be simplified to simply "mark the bad blocks as unallocatable, and go to 1", but it's more efficient as you describe.)

The eventual advantage, though, is that we get the performance of a single write (plus, always, the überblock update). In a heavily loaded system, the current approach (lots of small writes) won't scale so well. (Actually we'd probably want to limit the size of each write to some small value, like 16 MB, simply to allow the first write to start earlier under fairly heavy loads.)

As I pointed out earlier, this would require getting scatter/gather support through the storage subsystem, but the potential win should be quite large. Something to think about for the future. :-)

Incidentally, this is part of how QFS gets its performance for streaming I/O. We use an "allocate forward" policy, allow very large allocation blocks, and separate the metadata from data. This allows us to write (or read) data in fairly large I/O requests, without unnecessary disk head motion.

Anton
> Incidentally, this is part of how QFS gets its performance for streaming I/O. We use an "allocate forward" policy, allow very large allocation blocks, and separate the metadata from data. This allows us to write (or read) data in fairly large I/O requests, without unnecessary disk head motion.

I believe ZFS allocates both data and metadata forward, but I don't think this causes unnecessary disk head motion.

-r
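As a rough illustration of the "allocate forward" idea combined with the 16 MB cap on write size suggested earlier in the thread, the following Python sketch lays a transaction group's buffers out back to back and groups them into bounded writes. The function `plan_writes` and its return shape are hypothetical, chosen only for this example; each planned write would still need a gather list over scattered in-memory buffers, which is the driver-level support the thread says is missing.

```python
MAX_WRITE = 16 * 1024 * 1024  # the 16 MB cap suggested in the thread


def plan_writes(buffer_sizes, start_offset, max_write=MAX_WRITE):
    """Lay buffers out back to back and group them into bounded writes.

    Returns a list of (disk_offset, total_bytes, buffer_indexes) tuples.
    A buffer larger than max_write still gets a write of its own.
    """
    writes = []
    offset = start_offset
    for index, size in enumerate(buffer_sizes):
        if not writes or writes[-1][1] + size > max_write:
            writes.append([offset, 0, []])   # start a new write here
        writes[-1][1] += size
        writes[-1][2].append(index)
        offset += size                       # allocate forward, never back
    return [tuple(w) for w in writes]


if __name__ == "__main__":
    # 500 buffers of 1K each, as in the "500K of data" example.
    for disk_offset, total, indexes in plan_writes([1024] * 500, 4096):
        print(disk_offset, total, len(indexes))
```

In this model the 500K example from earlier in the thread becomes one contiguous write, and larger groups split into consecutive writes at ascending offsets so the first one can be issued early.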