Ian Pratt
2006-Aug-29 14:42 UTC
RE: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
> Right, but we are interested in CoW between two block
> devices. That's why we have created a special CoW format
> (qcow wastes a lot of space to do sparse allocation, which is
> pointless and slow on a block device). We're looking to support
> things like using two iSCSI block devices to do CoW.

What data structure / format are you using for storing the CoW data?

Ian
Dan Smith
2006-Aug-29 14:51 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
IP> What data structure / format are you using for storing the CoW
IP> data?

Currently, I've got a really simple format that just forms an identity
mapping by using a bitmap. The first block (or blocks) on the
destination is reserved for metadata. After that, if bit X is set,
then block X is located at block X+1 in the CoW space.

It's fast, it doesn't waste space, and it's easy to predict a mapping
and flush it later to avoid metadata reservations.

--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
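To make the identity-mapping scheme above concrete, here is a minimal C
sketch; the single metadata block and the function names are
illustrative assumptions, not the actual dscow plugin code.

#include <stdint.h>

#define METADATA_BLOCKS 1   /* assumed: the bitmap fits in the first block */

/* Is original block 'blk' already remapped into the CoW device? */
static int dscow_block_mapped(const uint8_t *bitmap, uint64_t blk)
{
        return (bitmap[blk / 8] >> (blk % 8)) & 1;
}

/* Where does original block 'blk' live on the CoW device?  With an
 * identity mapping this is just the block number shifted past the
 * reserved metadata area. */
static uint64_t dscow_cow_block(uint64_t blk)
{
        return blk + METADATA_BLOCKS;
}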
Julian Chesterfield
2006-Aug-29 23:47 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
On 29 Aug 2006, at 15:51, Dan Smith wrote:

> IP> What data structure / format are you using for storing the CoW
> IP> data?
>
> Currently, I've got a really simple format that just forms an identity
> mapping by using a bitmap. The first block (or blocks) on the
> destination is reserved for metadata. After that, if bit X is set,
> then block X is located at block X+1 in the CoW space.

You indicated recently that you are focused on block- as opposed to
file-backed virtual disks. Your new CoW driver, however, doesn't handle
allocation; it just assumes a CoW volume that's as big as the original
disk and uses a bitmap to optimize lookups. Given that you seem to be
assuming that the block device is providing sparse allocation and
dynamic disk resizing for you, isn't it likely that such devices would
already provide low-level support for CoW and disk snapshotting? Qcow
provides both sparse support and CoW functionality.

> It's fast, it doesn't waste space, and it's easy to predict a mapping
> and flush it later to avoid metadata reservations.

What's the policy on metadata writes - are metadata writes synchronised
with the acknowledgement of block IO requests?

- Julian
Dan Smith
2006-Aug-30 00:41 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
JC> Your new CoW driver, however, doesn't handle allocation; it just
JC> assumes a CoW volume that's as big as the original disk

Correct, it does not do sparse allocation. Sparse allocation does not
make any sense on a fixed-size block device. Note that this has nothing
to do with how the rest of the dm-userspace/cowd system works. When
using the prototype qcow plugin to cowd, sparse allocation is performed
as you would expect.

JC> and uses a bitmap to optimize lookups.

It does not use a bitmap to optimize lookups; it uses a bitmap as the
metadata to record which blocks have been mapped. The bitmap also
determines where the block is located in the CoW volume, since the
location can be computed from the block size and the bit number.

JC> Given that you seem to be assuming that the block device is
JC> providing sparse allocation and dynamic disk resizing for you,

No, I'm not assuming this. With this current format plugin, we are just
assuming a fully-allocated block device (such as an LVM volume or a
simple iSCSI device).

JC> isn't it likely that such devices would already provide low-level
JC> support for CoW and disk snapshotting?

Some do, but plenty do not. However, being able to do snapshots, CoW,
rollback, etc. while easily synchronizing with other pieces of Xen is
something that will be simpler with dm-userspace than with a hardware
device. It also allows us to provide the same advanced features no
matter what device we are working on top of, independent of whether or
not it supports them.

JC> Qcow provides both sparse support and CoW functionality.

Sparse allocation and control of block devices (i.e. growing an LVM
volume as we need it for growing CoW storage) is definitely something
that is on our radar. We are just not there yet.

I should mention that I am not fighting for a particular format here.
The plugin that is the default in cowd right now (dscow) implements a
very simple format geared at speed and simplicity. It has limitations
and we understand that. It is not intended to be the CoW format of the
future :)

JC> What's the policy on metadata writes - are metadata writes
JC> synchronised with the acknowledgement of block IO requests?

Yes. Ian asked for this the last time we posted our code. We have
worked hard to implement this ability in dm-userspace/cowd between the
time we posted our original version and our recent post.

While the normal persistent domain case definitely needs this to be
"correct", there are other usage models for virtual machines that do
not necessarily need to have a persistent disk store. We are able to
disable the metadata syncing (and the metadata writing altogether if
desired) and regain a lot of speed.

--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
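A rough sketch of the metadata-write ordering described above, using
plain POSIX calls: the data block is written first, then the bitmap is
flushed, and only then would the original I/O be acknowledged. Block
size, on-disk layout, and function names are assumptions for
illustration, not the actual cowd implementation.

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE      4096
#define METADATA_BLOCKS 1           /* assumed: the bitmap lives in block 0 */

static int cow_write_block(int fd, uint8_t *bitmap, uint64_t blk,
                           const void *data, int sync_metadata)
{
        off_t data_off = (off_t)(blk + METADATA_BLOCKS) * BLOCK_SIZE;

        /* 1. Copy the new data into the CoW volume first. */
        if (pwrite(fd, data, BLOCK_SIZE, data_off) != BLOCK_SIZE)
                return -1;

        /* 2. Record the mapping in the in-memory bitmap. */
        bitmap[blk / 8] |= 1 << (blk % 8);

        /* 3. If metadata syncing is enabled, flush the bitmap block
         *    before the request is acknowledged; skipping this trades
         *    safety for speed, as described above for non-persistent
         *    domains. */
        if (sync_metadata) {
                if (pwrite(fd, bitmap, BLOCK_SIZE, 0) != BLOCK_SIZE)
                        return -1;
                if (fdatasync(fd) != 0)
                        return -1;
        }

        /* 4. The caller may now acknowledge the original block I/O. */
        return 0;
}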
Andrew Warfield
2006-Aug-30 14:37 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
Hi Dan,

> Some do, but plenty do not. However, being able to do snapshots, CoW,
> rollback, etc. while easily synchronizing with other pieces of Xen is
> something that will be simpler with dm-userspace than with a hardware
> device. It also allows us to provide the same advanced features no
> matter what device we are working on top of, independent of whether or
> not it supports them.

Okay -- I think there may have been a disconnect on what the
assumptions driving dscow's design were. Based on your clarifying
emails these seem to be that an administrator has a block device that
they want to apply CoW to, and that they have oodles of space. They'll
just hard-allocate a second block device of the same size as the
original on a per-CoW basis, and use this (plus the bitmap header) to
write updates into.

All of the CoW formats that I've seen use some form of pagetable-style
lookup hierarchy to represent sparseness, frequently a combination of a
lookup tree and leaf bitmaps -- your scheme is just the extreme of
this... a zero-level tree. It seems like a possibly useful thing to
have in some environments to use as a fast-image-copy operation,
although it would be cool to have something that ran in the background
and lazily copied all the other blocks over and eventually resulted in
a fully linear disk image. Perhaps you'd consider adding that and
porting the format as a plugin to the blktap tools as well? ;)

> JC> What's the policy on metadata writes - are metadata writes
> JC> synchronised with the acknowledgement of block IO requests?
>
> Yes. Ian asked for this the last time we posted our code. We have
> worked hard to implement this ability in dm-userspace/cowd between the
> time we posted our original version and our recent post.
>
> While the normal persistent domain case definitely needs this to be
> "correct", there are other usage models for virtual machines that do
> not necessarily need to have a persistent disk store. We are able to
> disable the metadata syncing (and the metadata writing altogether if
> desired) and regain a lot of speed.

Yes -- we've seen comments from users who are very pleased with the
better-than-disk write throughput that they achieve with the loopback
driver ;) -- basically the same effect of letting the buffer cache step
in and play things a little more "fast and loose".

I've got a few questions about your driver code -- sorry for taking a
while to get back to you on this:

1. You are allocating a per-disk mapping cache in the driver. Do you
have any sense of how big this needs to be to be useful for common
workloads? Generally speaking, this seems like a strange thing to add
-- I understand the desire for an in-kernel cache to avoid context
switches, but why make it implicit and LRU? Wouldn't it simplify the
code considerably to allow the userspace stuff to manage the size and
contents of the cache so that they can do replacement based on their
knowledge of the block layout?

2. There are a heck of a lot of spin locks in that driver. Did you run
into a lot of stability problems that led to aggressively conservative
locking? Would removing the cache and/or switching some of the locks to
refcounts simplify things at all?

I haven't looked at the userspace stuff all that closely. Time
permitting I'll peek at that and get back to you with some comments in
the next little while.

best,
a.
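For contrast with the zero-level scheme, a minimal sketch of the
one-level pagetable-style lookup Andrew mentions; the table size and
names are illustrative assumptions, not any particular on-disk format.

#include <stdint.h>

#define L1_ENTRIES  4096            /* assumed table size, for illustration */

struct sparse_cow {
        /* one table entry per virtual block; 0 means "not yet
         * allocated", anything else is the physical cluster handed out
         * on first write -- this indirection is what buys sparseness */
        uint64_t l1[L1_ENTRIES];
};

/* One-level lookup: returns 0 if the block has never been written.
 * Real formats add more levels (and leaf bitmaps); dscow is the
 * degenerate case where the table vanishes and only a bitmap plus a
 * fixed identity mapping remain. */
static uint64_t sparse_lookup(const struct sparse_cow *c, uint64_t blk)
{
        return blk < L1_ENTRIES ? c->l1[blk] : 0;
}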
Dan Smith
2006-Aug-30 18:55 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
AW> Okay -- I think there may have been a disconnect on what the
AW> assumptions driving dscow's design were. Based on your clarifying
AW> emails these seem to be that an administrator has a block device
AW> that they want to apply CoW to, and that they have oodles of
AW> space. They'll just hard-allocate a second block device of the
AW> same size as the original on a per-CoW basis, and use this (plus
AW> the bitmap header) to write updates into.

Correct. Again, let me reiterate that I am not claiming that dscow is
the best format for anything other than a few small situations that we
are currently targeting :) We definitely want to eventually develop a
sparse allocation method that will allow us to take a big block store
and carve it up (like LVM does) for on-demand CoW volumes.

AW> All of the CoW formats that I've seen use some form of
AW> pagetable-style lookup hierarchy to represent sparseness,
AW> frequently a combination of a lookup tree and leaf bitmaps -- your
AW> scheme is just the extreme of this... a zero-level tree.

Indeed, many use a pagetable approach. Development of the above idea
would definitely require it. FWIW, I believe cowloop uses a format
similar to dscow.

AW> It seems like a possibly useful thing to have in some environments
AW> to use as a fast-image-copy operation, although it would be cool
AW> to have something that ran in the background and lazily copied all
AW> the other blocks over and eventually resulted in a fully linear
AW> disk image.

Yes, I have discussed this on-list a few times, in reference to
live-copy of LVMs and building a local version of a network-accessible
image, such as an nbd device.

AW> Perhaps you'd consider adding that and porting the format as a
AW> plugin to the blktap tools as well? ;)

I do not really see the direct value of that. If the functionality
exists with dm-userspace and cowd, then dm-userspace could be used to
slowly build the image, while blktap could provide access to that image
for a domain (in direct mode, as Julian pointed out). Building the
functionality into dm-userspace would allow it to be generally
applicable to vanilla Linux systems. Why build it into a Xen-specific
component?

AW> Yes -- we've seen comments from users who are very pleased with
AW> the better-than-disk write throughput that they achieve with the
AW> loopback driver ;) -- basically the same effect of letting the
AW> buffer cache step in and play things a little more "fast and
AW> loose".

Heh, right. I was actually talking about increased performance against
a block device. However, for this kind of transient domain model, a
file will work as well.

AW> 1. You are allocating a per-disk mapping cache in the driver. Do
AW> you have any sense of how big this needs to be to be useful for
AW> common workloads?

My latest version (which we will post soon) puts a cap on the number of
remaps each device can maintain. Changing from a 4096-map limit to a
16384-map limit makes some difference, but it does not appear to be
significant. We will post concrete numbers when we send the latest
version.

AW> Generally speaking, this seems like a strange thing to add -- I
AW> understand the desire for an in-kernel cache to avoid context
AW> switches, but why make it implicit and LRU?

Well, if you do not keep that kind of data in the kernel, I think
performance would suffer significantly. The idea here is to have, at
steady state, a block device that behaves almost exactly like a
device-mapper device (read: LVM) does right now. All block redirections
happen in-kernel.

Remember that the userspace side can invalidate any mapping cached in
the kernel at any time. If userspace wanted to do cache management, it
could do so. I have also discussed the possibility of feeding some
coarse statistics back to userspace so it can make more informed
decisions.

I would not say that the caching is implicit. If you set the
DMU_FLAG_TEMPORARY bit on a response, the kernel will not remember the
mapping and thus will fault the next access back to userspace again.

AW> Wouldn't it simplify the code considerably to allow the userspace
AW> stuff to manage the size and contents of the cache so that they
AW> can do replacement based on their knowledge of the block layout?

I am not sure why this would be much better than letting the kernel
manage it. The kernel knows two things that userspace does not:
low-memory pressure and access statistics. I do not see why it would
make sense to have the kernel collect and communicate access statistics
for each block to userspace and then rely on it to evict unused
mappings. Further, the kernel can run without the userspace component
if no unmapped blocks are accessed. This allows a restart or upgrade of
the userspace component without disturbing the device.

It is entirely possible that I do not understand your point, so feel
free to correct me :)

AW> 2. There are a heck of a lot of spin locks in that driver. Did
AW> you run into a lot of stability problems that led to aggressively
AW> conservative locking?

I think I have said this before, but no performance analysis has been
done on dm-userspace to identify areas of contention. The use of
spinlocks was the best way (for me) to get things working and stable.
Most of the spinlocks are used to protect linked lists, which I think
is relatively valid. I should also point out that the structure of the
entire thing has been a moving target up until recently. It is
definitely possible that some of the places where a spinlock was used
could be refactored under the current model.

AW> Would removing the cache and/or switching some of the locks to
AW> refcounts simplify things at all?

Moving to something other than spinlocks for a few of the data
structures may be possible; we can investigate and post some numbers on
the next go-round.

--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
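A sketch of the cache-control behaviour described above: the userspace
daemon answers a mapping fault and either lets the kernel cache the
remap or marks it temporary so the next access faults back to
userspace. Only the DMU_FLAG_TEMPORARY name comes from the mail; the
structure layout and helper are hypothetical.

#include <stdint.h>

#define DMU_FLAG_TEMPORARY  (1u << 0)   /* do not cache this mapping */

struct dmu_map_response {               /* hypothetical layout */
        uint64_t org_block;             /* block the guest accessed   */
        uint64_t new_block;             /* where it lives on the CoW  */
        uint32_t flags;
};

/* Build a response; 'cache' reflects the policy choice discussed
 * above -- cached remaps are handled in-kernel on later accesses,
 * temporary ones fault back to the userspace daemon every time. */
static void fill_response(struct dmu_map_response *rsp,
                          uint64_t org_block, uint64_t new_block,
                          int cache)
{
        rsp->org_block = org_block;
        rsp->new_block = new_block;
        rsp->flags = cache ? 0 : DMU_FLAG_TEMPORARY;
}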