Nathan Kroenert
2012-Nov-20  13:29 UTC
[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
Hi folks, (Long time no post...)

Only starting to get into this one, so apologies if I'm light on detail, but...

I have a shiny SSD I'm using to help make some VirtualBox stuff I'm doing go fast.

I have a 240GB Intel 520 series jobbie. Nice.

I chopped it into a few slices - p0 (partition table), p1 128GB, p2 60GB.

As part of my work, I have used it both as a RAW device (cxtxdxp1) and wrapped partition 1 with a VirtualBox-created VMDK linkage, and it works like a champ. :) Very happy with that.

I then tried creating a new zpool using partition 2 of the disk (zpool create c2d0p2) and then carved a zvol out of that (30GB), and wrapped *that* in a vmdk.

Still works OK and speed is good(ish) - but there are a couple of things in particular that disturb me:
- Sync writes are pretty slow - only about 1/10th of what I thought I might get (about 15MB/s). Async writes are fast - up to 150MB/s or more.
- More worryingly, it seems that writes are amplified by 2X, in that if I write 100MB at the guest level, the underlying bare metal ZFS writes 200MB, as observed by iostat. This doesn't happen on the VMs that are using RAW slices.

Anyone have any thoughts on what might be happening here?

I can appreciate that if everything comes through as a sync write, it goes to the ZIL first, then to its final resting place - but it seems a little over the top that it really is double.

I have also had a play with sync=, primarycache settings and a few other things, but it doesn't seem to change the behaviour.

Again - I'm looking for thoughts here - as I have only really just started looking into this. Should I happen across anything interesting, I'll follow up this post.

Cheers,

Nathan. :)
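To put a number on the doubling described in the post above, here is a minimal sketch in POSIX shell. The helper name write_ratio is hypothetical and not part of any tool; both figures are assumed to be collected by hand, the host side from iostat and the guest side from the size of the test write.

```shell
#!/bin/sh
# write_ratio HOST_MB GUEST_MB
# Prints how many MB the host wrote for each MB the guest wrote.
# Hypothetical helper: both numbers are gathered manually
# (host side from iostat, guest side from the size of the test write).
write_ratio() {
    awk -v host="$1" -v guest="$2" 'BEGIN { printf "%.2f\n", host / guest }'
}

# Figures from the post: 100MB written in the guest, 200MB seen by iostat.
write_ratio 200 100
```

A ratio near 1.00 would match what the post reports for the RAW slices; a ratio near 2.00 matches the zvol case.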
Peter Tripp
2012-Nov-20  14:50 UTC
[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
Hi Nathan,

You've misunderstood how the ZIL works and why it reduces write latency for synchronous writes.

Since you've partitioned a single SSD into two slices, one as pool storage and one as ZIL for that pool, all sync writes will be 2X amplified. There's no way around it. ZFS will write to the ZIL while simultaneously (or with up to a couple of seconds' delay) writing to the slice you're using to persistently store pool data. This doesn't happen when you expose the raw partition to the VM because those writes don't go through the ZIL... hence no write amplification.

Since you've put the ZIL physically on the same device as the pool storage, the ZIL serves no purpose other than to slow things down. The purpose of a ZIL is to confirm sync writes as fast as possible even if they haven't hit the actual pool storage (usually slow HDs) yet; it confirms the write once it's hit the ZIL, and then ZFS has a moment (up to 30sec IIRC) to bundle multiple IOs before committing them to persistent pool storage.

Remove the cache slice from the pool where you've carved out this zvol. Test again. Your writes will be faster. They likely won't be as fast as your async writes (150MB/sec), but they will certainly be faster than the 15MB/sec you're getting now, when you're unintentionally doing synchronous writes to the ZIL slice and async writes to the pool storage slice simultaneously. I'd bet the zvol solution will approach the speed of using a raw partition.

-Pete

P.S. Be careful using the term write amplification when talking about SSDs... people usually use that to refer to what happens within the SSD. Specifically, before a write (especially a small write) can be written, other nearby data must be read so that an entire block can be rewritten.
http://en.wikipedia.org/wiki/Write_amplification
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-20  17:07 UTC
[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Nathan Kroenert
>
> I chopped into a few slices - p0 (partition table), p1 128GB, p2 60gb.
>
> As part of my work, I have used it both as a RAW device (cxtxdxp1) and
> wrapped partition 1 with a virtualbox created VMDK linkage, and it works
> like a champ. :) Very happy with that.
>
> I then tried creating a new zpool using partition 2 of the disk (zpool
> create c2d0p2) and then carved a zvol out of that (30GB), and wrapped
> *that* in a vmdk.

Why are you partitioning, then creating a zpool, and then creating a zvol? I think you should make the whole disk a zpool unto itself, and then carve out the 128G zvol and 60G zvol. For that matter, why are you carving out multiple zvols? Does your guest VM really want multiple virtual disks for some reason?

Side note: Assuming you *really* just want a single guest to occupy the whole disk and run as fast as possible... If you want to snapshot your guest, you should make the whole disk one zpool, and then carve out a zvol which is significantly smaller than 50%; say perhaps 40% or 45% might do the trick. The zvol will immediately reserve all the space it needs, and if you don't have enough space left over to completely replicate the zvol, you won't be able to create the snapshot. If your pool ever gets over 90% used, your performance will degrade, so a 40% zvol is what I would recommend.

Back to the topic: Given that you're on the SSD, there is no faster nonvolatile storage you can use for a ZIL log device. So you should leave the default ZIL inside the pool... Don't try adding any separate slice or anything as a log device... But as you said, sync writes will hit the disk twice. I would have to guess it's a good idea for you to tune ZFS to immediately flush transactions whenever there's a sync write. I forget how this is done - there's some tunable that indicates any sync write over a certain size should be immediately flushed...
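The 40% to 45% sizing guideline from the side note above is simple arithmetic; a minimal sketch in POSIX shell follows. The helper name zvol_size_gb is hypothetical, and 240 is just the nominal capacity of the SSD in this thread (the usable pool size will be somewhat lower).

```shell
#!/bin/sh
# zvol_size_gb POOL_GB PERCENT
# Prints PERCENT of POOL_GB, rounded down to a whole number of GB.
# Hypothetical helper for illustration only.
zvol_size_gb() {
    echo $(( $1 * $2 / 100 ))
}

# Nominal 240GB pool (the whole SSD) at the suggested 40%:
zvol_size_gb 240 40
```

The result is what would be passed to "zfs create -V" for a non-sparse zvol, leaving room in the pool for a snapshot to hold a full copy of the volume.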
Fajar A. Nugraha
2012-Nov-20  21:33 UTC
[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
On Wed, Nov 21, 2012 at 12:07 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> Why are you partitioning, then creating zpool,

In the common case it's often because they use the disk for something else as well (e.g. the OS), not only for ZFS.

> and then creating zvol?

Because it lets you do other stuff more easily and faster (e.g. copying files from the host) compared to using plain disk image files (vmdk/vdi/vhd/whatever).

> I think you should make the whole disk a zpool unto itself, and then carve out the 128G zvol and 60G zvol. For that matter, why are you carving out multiple zvol's? Does your Guest VM really want multiple virtual disks for some reason?
>
> Side note: Assuming you *really* just want a single guest to occupy the whole disk and run as fast as possible... If you want to snapshot your guest, you should make the whole disk one zpool, and then carve out a zvol which is significantly smaller than 50%, say perhaps 40% or 45% might do the trick.

... or use sparse zvols, e.g. "zfs create -V 10G -s tank/vol1"

Of course, that's assuming you KNOW that you will never max out storage use on that zvol. If you don't have control over that, then using a smaller zvol size is indeed preferable.

--
Fajar
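The sparse zvol suggestion above can be tried out with something like the following sketch. The run wrapper is a hypothetical dry-run helper added here so the commands are only printed unless DRY_RUN=0 is set; tank/vol1 is the placeholder name from the post, and the property names are the standard volsize and refreservation.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical): prints each command instead of
# executing it, unless the environment sets DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a sparse (thin) 10G zvol: -s skips the up-front space reservation.
run zfs create -V 10G -s tank/vol1

# Compare the nominal size with what is actually reserved in the pool.
run zfs get volsize,refreservation tank/vol1
```

Because nothing is reserved up front, a snapshot can be taken even when the pool could not hold a second full copy of the volume; the trade-off is that the pool can fill up underneath the guest.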
nathan
2012-Nov-21  02:21 UTC
[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
Hi folks,

some extra thoughts:

1. Don't question why. :) I'm playing and observing, so I ultimately know and understand the best way to do things! heh.

2. In fairness, asking why is entirely valid. ;) I'm not doing things to best practice just yet - I wanted the best performance for my VMs, which are all testing/training/playing VMs. I got *great* performance from the first RAW PARTITION I gave to VirtualBox. I wanted to do the same again, but due to the way it wraps partitions, and because Solaris complains that there is more than one Solaris2 partition on the disk when I try to install the second instance, I thought I'd give zvols a go.

3. The device I wrap as a VMDK is the RAW device. sigh. Of course, all writes will go through the ZIL, and of course we'll have to write twice as much. I should have seen that straight away, but was lacking sleep.

4. Note: I don't have a separate ZIL. The first partition I made was given directly to VirtualBox. The second was used to create the zpool.

I'm going to have a play with using LVM md devices instead and see how that goes as well.

Overall, the pain of the doubling of bandwidth requirements seems like a big downer for *my* configuration, as I have just the one SSD, but I'll persist and see what I can get out of it.

Thanks for the thoughts thus far!

Cheers,

Nathan.
Jim Klimov
2012-Nov-21  05:24 UTC
[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
On 2012-11-21 03:21, nathan wrote:
> Overall, the pain of the doubling of bandwidth requirements seems like a
> big downer for *my* configuration, as I have just the one SSD, but I'll
> persist and see what I can get out of it.

I might also speculate that for each rewritten block of userdata in the VM image, you have a series of metadata block updates in ZFS. If you keep the zvol blocks relatively small, you might get the effective doubling of writes for the userdata updates.

As for ZIL - even if it is used with the in-pool variant, I don't think your setup needs any extra steps to disable it (as Edward likes to suggest), and most other setups don't need to disable it either. It also shouldn't add much to your writes - the in-pool ZIL blocks are then referenced as userdata when the TXG commit happens (I think).

I also think that with a VM in a raw partition you don't get any snapshots - neither ZFS as underlying storage ('cause it's not), nor hypervisor snaps of the VM. So while faster, this is also some trade-off :)

//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-21  11:58 UTC
[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> As for ZIL - even if it is used with the in-pool variant, I don't
> think your setup needs any extra steps to disable it (as Edward likes
> to suggest), and most other setups don't need to disable it either.

No, no - I know I often suggest disabling the ZIL, because so many people rule it out on principle (the evil tuning guide says "disable the zil (don't!)").

But in this case, I was suggesting precisely the opposite of disabling it. I was suggesting making it more aggressive.

But now that you mention it - if he's looking for maximum performance, perhaps disabling the ZIL would be best for him. ;-)

Nathan, it will do you some good to understand when it's OK or not OK to disable the ZIL (zfs set sync=disabled). If this is a guest VM in your laptop or something like that, then it's definitely safe. If the guest VM is a database server, with a bunch of external clients (on the LAN or network or whatever), then it's definitely *not* safe. Basically, if anything external to the VM is monitoring or depending on the state of the VM, then it's not OK. But if the VM were to crash and go back in time by a few seconds, and there are no clients that would care about that, then it's safe to disable the ZIL. And that is the highest-performance thing you can possibly do.

> It also shouldn't add much to your writes - the in-pool ZIL blocks
> are then referenced as userdata when the TXG commit happens (I think).

I would like to get some confirmation of that - because it's the opposite of what I thought. I thought the ZIL is used like a circular buffer. The same blocks will be overwritten repeatedly. But if there's a sync write over a certain size, then it skips the ZIL and writes immediately to main zpool storage, so it doesn't have to get written twice.

> I also think that with a VM in a raw partition you don't get any
> snapshots - neither ZFS as underlying storage ('cause it's not),
> not hypervisor snaps of the VM. So while faster, this is also some
> trade-off :)

Oh - but not faster than zvol. I am currently a fan of wrapping a zvol inside a vmdk, so I get maximum performance and also snapshots.
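For the "zfs set sync=disabled" step mentioned above, a minimal sketch in POSIX shell follows. The run and set_sync helpers are hypothetical wrappers, not part of ZFS; they only print the command unless DRY_RUN=0 is set. The dataset name tank/vol1 is a placeholder borrowed from earlier in the thread, and the three accepted values are the standard settings of the sync property.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical): prints each command instead of
# executing it, unless the environment sets DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# set_sync MODE DATASET
# Sets the ZFS "sync" property after checking MODE is a valid value:
#   standard - honour sync requests (the default)
#   always   - treat every write as synchronous
#   disabled - acknowledge sync writes before they reach stable storage
set_sync() {
    case "$1" in
        standard|always|disabled) run zfs set "sync=$1" "$2" ;;
        *) echo "unknown sync mode: $1" >&2; return 1 ;;
    esac
}

# Only appropriate when nothing outside the VM depends on its state.
set_sync disabled tank/vol1
```

Running set_sync standard on the same dataset puts the default behaviour back.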