Ian Pratt
2006-Aug-29 14:42 UTC
RE: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
> Right, but we are interested in CoW between two block
> devices. That's why we have created a special CoW format
> (qcow wastes a lot of space to do sparse allocation, which is
> pointless and slow on a block device). We're looking to support
> things like using two iSCSI block devices to do CoW.

What data structure / format are you using for storing the CoW data?

Ian
Dan Smith
2006-Aug-29 14:51 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
IP> What data structure / format are you using for storing the CoW
IP> data?

Currently, I've got a really simple format that just forms an identity
mapping by using a bitmap. The first block (or blocks) on the
destination is reserved for metadata. After that, if bit X is set,
then block X is located at block X+1 in the CoW space.

It's fast, it doesn't waste space, and it's easy to predict a mapping
and flush it later to avoid metadata reservations.

--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
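To make the identity-mapping scheme above concrete, here is a minimal C
sketch; the single metadata block and the function names are
illustrative assumptions, not the actual dscow plugin code.

#include <stdint.h>

#define METADATA_BLOCKS 1   /* assumed: the bitmap fits in the first block */

/* Is original block 'blk' already remapped into the CoW device? */
static int dscow_block_mapped(const uint8_t *bitmap, uint64_t blk)
{
        return (bitmap[blk / 8] >> (blk % 8)) & 1;
}

/* Where does original block 'blk' live on the CoW device?  With an
 * identity mapping this is just the block number shifted past the
 * reserved metadata area. */
static uint64_t dscow_cow_block(uint64_t blk)
{
        return blk + METADATA_BLOCKS;
}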
Julian Chesterfield
2006-Aug-29 23:47 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
On 29 Aug 2006, at 15:51, Dan Smith wrote:

> IP> What data structure / format are you using for storing the CoW
> IP> data?
>
> Currently, I've got a really simple format that just forms an identity
> mapping by using a bitmap. The first block (or blocks) on the
> destination is reserved for metadata. After that, if bit X is set,
> then block X is located at block X+1 in the CoW space.

You indicated recently that you are focused on block- as opposed to
file-backed virtual disks. Your new CoW driver, however, doesn't handle
allocation; it just assumes a CoW volume that's as big as the original
disk and uses a bitmap to optimize lookups. Given that you seem to be
assuming that the block device is providing sparse allocation and
dynamic disk resizing for you, isn't it likely that such devices would
already provide low-level support for CoW and disk snapshotting? Qcow
provides both sparse support and CoW functionality.

> It's fast, it doesn't waste space, and it's easy to predict a mapping
> and flush it later to avoid metadata reservations.

What's the policy on metadata writes - are metadata writes synchronised
with the acknowledgement of block IO requests?

- Julian
Dan Smith
2006-Aug-30 00:41 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
JC> Your new CoW driver, however, doesn't handle allocation; it just
JC> assumes a CoW volume that's as big as the original disk

Correct, it does not do sparse allocation. Sparse allocation does not
make any sense on a fixed-size block device. Note that this has nothing
to do with how the rest of the dm-userspace/cowd system works. When
using the prototype qcow plugin to cowd, sparse allocation is performed
as you would expect.

JC> and uses a bitmap to optimize lookups.

It does not use a bitmap to optimize lookups; it uses a bitmap as the
metadata to record which blocks have been mapped. The bitmap also
determines where the block is located in the CoW volume, since the
location can be computed from the block size and the bit number.

JC> Given that you seem to be assuming that the block device is
JC> providing sparse allocation and dynamic disk resizing for you,

No, I'm not assuming this. With this current format plugin, we are just
assuming a fully-allocated block device (such as an LVM volume or a
simple iSCSI device).

JC> isn't it likely that such devices would already provide low-level
JC> support for CoW and disk snapshotting?

Some do, but plenty do not. However, being able to do snapshots, CoW,
rollback, etc. while easily synchronizing with other pieces of Xen is
something that will be simpler with dm-userspace than with a hardware
device. It also allows us to provide the same advanced features no
matter what device we are working on top of, independent of whether or
not it supports them.

JC> Qcow provides both sparse support and CoW functionality.

Sparse allocation and control of block devices (i.e. growing an LVM
volume as we need it for growing CoW storage) is definitely something
that is on our radar. We are just not there yet.

I should mention that I am not fighting for a particular format here.
The plugin that is the default in cowd right now (dscow) implements a
very simple format geared at speed and simplicity. It has limitations
and we understand that. It is not intended to be the CoW format of the
future :)

JC> What's the policy on metadata writes - are metadata writes
JC> synchronised with the acknowledgement of block IO requests?

Yes. Ian asked for this the last time we posted our code. We have
worked hard to implement this ability in dm-userspace/cowd between the
time we posted our original version and our recent post.

While the normal persistent domain case definitely needs this to be
"correct", there are other usage models for virtual machines that do
not necessarily need to have a persistent disk store. We are able to
disable the metadata syncing (and the metadata writing altogether if
desired) and regain a lot of speed.

--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
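A rough sketch of the metadata-write ordering described above, using
plain POSIX calls: the data block is written first, then the bitmap is
flushed, and only then would the original I/O be acknowledged. Block
size, on-disk layout, and function names are assumptions for
illustration, not the actual cowd implementation.

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE      4096
#define METADATA_BLOCKS 1           /* assumed: the bitmap lives in block 0 */

static int cow_write_block(int fd, uint8_t *bitmap, uint64_t blk,
                           const void *data, int sync_metadata)
{
        off_t data_off = (off_t)(blk + METADATA_BLOCKS) * BLOCK_SIZE;

        /* 1. Copy the new data into the CoW volume first. */
        if (pwrite(fd, data, BLOCK_SIZE, data_off) != BLOCK_SIZE)
                return -1;

        /* 2. Record the mapping in the in-memory bitmap. */
        bitmap[blk / 8] |= 1 << (blk % 8);

        /* 3. If metadata syncing is enabled, flush the bitmap block
         *    before the request is acknowledged; skipping this trades
         *    safety for speed, as described above for non-persistent
         *    domains. */
        if (sync_metadata) {
                if (pwrite(fd, bitmap, BLOCK_SIZE, 0) != BLOCK_SIZE)
                        return -1;
                if (fdatasync(fd) != 0)
                        return -1;
        }

        /* 4. The caller may now acknowledge the original block I/O. */
        return 0;
}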
Andrew Warfield
2006-Aug-30 14:37 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
Hi Dan,

> Some do, but plenty do not. However, being able to do snapshots, CoW,
> rollback, etc. while easily synchronizing with other pieces of Xen is
> something that will be simpler with dm-userspace than with a hardware
> device. It also allows us to provide the same advanced features no
> matter what device we are working on top of, independent of whether or
> not it supports them.

Okay -- I think there may have been a disconnect on what the
assumptions driving dscow's design were. Based on your clarifying
emails these seem to be that an administrator has a block device that
they want to apply CoW to, and that they have oodles of space. They'll
just hard-allocate a second block device of the same size as the
original on a per-CoW basis, and use this (plus the bitmap header) to
write updates into.

All of the CoW formats that I've seen use some form of pagetable-style
lookup hierarchy to represent sparseness, frequently a combination of a
lookup tree and leaf bitmaps -- your scheme is just the extreme of
this... a zero-level tree. It seems like a possibly useful thing to
have in some environments to use as a fast-image-copy operation,
although it would be cool to have something that ran in the background
and lazily copied all the other blocks over and eventually resulted in
a fully linear disk image. Perhaps you'd consider adding that and
porting the format as a plugin to the blktap tools as well? ;)

> JC> What's the policy on metadata writes - are metadata writes
> JC> synchronised with the acknowledgement of block IO requests?
>
> Yes. Ian asked for this the last time we posted our code. We have
> worked hard to implement this ability in dm-userspace/cowd between the
> time we posted our original version and our recent post.
>
> While the normal persistent domain case definitely needs this to be
> "correct", there are other usage models for virtual machines that do
> not necessarily need to have a persistent disk store. We are able to
> disable the metadata syncing (and the metadata writing altogether if
> desired) and regain a lot of speed.

Yes -- we've seen comments from users who are very pleased with the
better-than-disk write throughput that they achieve with the loopback
driver ;) -- basically the same effect of letting the buffer cache step
in and play things a little more "fast and loose".

I've got a few questions about your driver code -- sorry for taking a
while to get back to you on this:

1. You are allocating a per-disk mapping cache in the driver. Do you
have any sense of how big this needs to be to be useful for common
workloads? Generally speaking, this seems like a strange thing to add
-- I understand the desire for an in-kernel cache to avoid context
switches, but why make it implicit and LRU? Wouldn't it simplify the
code considerably to allow the userspace stuff to manage the size and
contents of the cache so that they can do replacement based on their
knowledge of the block layout?

2. There are a heck of a lot of spin locks in that driver. Did you run
into a lot of stability problems that led to aggressively conservative
locking? Would removing the cache and/or switching some of the locks to
refcounts simplify things at all?

I haven't looked at the userspace stuff all that closely. Time
permitting I'll peek at that and get back to you with some comments in
the next little while.

best,
a.
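For contrast with the zero-level scheme, a minimal sketch of the
one-level pagetable-style lookup Andrew mentions; the table size and
names are illustrative assumptions, not any particular on-disk format.

#include <stdint.h>

#define L1_ENTRIES  4096            /* assumed table size, for illustration */

struct sparse_cow {
        /* one table entry per virtual block; 0 means "not yet
         * allocated", anything else is the physical cluster handed out
         * on first write -- this indirection is what buys sparseness */
        uint64_t l1[L1_ENTRIES];
};

/* One-level lookup: returns 0 if the block has never been written.
 * Real formats add more levels (and leaf bitmaps); dscow is the
 * degenerate case where the table vanishes and only a bitmap plus a
 * fixed identity mapping remain. */
static uint64_t sparse_lookup(const struct sparse_cow *c, uint64_t blk)
{
        return blk < L1_ENTRIES ? c->l1[blk] : 0;
}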
Dan Smith
2006-Aug-30 18:55 UTC
Re: [Xen-devel] Re: [PATCH 0 of 6] dm-userspace xen integration patches
AW> Okay -- I think there may have been a disconnect on what the
AW> assumptions driving dscow's design were. Based on your clarifying
AW> emails these seem to be that an administrator has a block device
AW> that they want to apply CoW to, and that they have oodles of
AW> space. They'll just hard-allocate a second block device of the
AW> same size as the original on a per-CoW basis, and use this (plus
AW> the bitmap header) to write updates into.

Correct. Again, let me reiterate that I am not claiming that dscow is
the best format for anything other than a few small situations that we
are currently targeting :) We definitely want to eventually develop a
sparse allocation method that will allow us to take a big block store
and carve it up (like LVM does) for on-demand CoW volumes.

AW> All of the CoW formats that I've seen use some form of
AW> pagetable-style lookup hierarchy to represent sparseness,
AW> frequently a combination of a lookup tree and leaf bitmaps -- your
AW> scheme is just the extreme of this... a zero-level tree.

Indeed, many use a pagetable approach. Development of the above idea
would definitely require it. FWIW, I believe cowloop uses a format
similar to dscow.

AW> It seems like a possibly useful thing to have in some environments
AW> to use as a fast-image-copy operation, although it would be cool
AW> to have something that ran in the background and lazily copied all
AW> the other blocks over and eventually resulted in a fully linear
AW> disk image.

Yes, I have discussed this on-list a few times, in reference to
live-copy of LVMs and building a local version of a network-accessible
image, such as an nbd device.

AW> Perhaps you'd consider adding that and porting the format as a
AW> plugin to the blktap tools as well? ;)

I do not really see the direct value of that. If the functionality
exists with dm-userspace and cowd, then dm-userspace could be used to
slowly build the image, while blktap could provide access to that image
for a domain (in direct mode, as Julian pointed out). Building the
functionality into dm-userspace would allow it to be generally
applicable to vanilla Linux systems. Why build it into a Xen-specific
component?

AW> Yes -- we've seen comments from users who are very pleased with
AW> the better-than-disk write throughput that they achieve with the
AW> loopback driver ;) -- basically the same effect of letting the
AW> buffer cache step in and play things a little more "fast and
AW> loose".

Heh, right. I was actually talking about increased performance against
a block device. However, for this kind of transient domain model, a
file will work as well.

AW> 1. You are allocating a per-disk mapping cache in the driver. Do
AW> you have any sense of how big this needs to be to be useful for
AW> common workloads?

My latest version (which we will post soon) puts a cap on the number of
remaps each device can maintain. Changing from a 4096-map limit to a
16384-map limit makes some difference, but it does not appear to be
significant. We will post concrete numbers when we send the latest
version.

AW> Generally speaking, this seems like a strange thing to add -- I
AW> understand the desire for an in-kernel cache to avoid context
AW> switches, but why make it implicit and LRU?

Well, if you do not keep that kind of data in the kernel, I think
performance would suffer significantly. The idea here is to have, at
steady state, a block device that behaves almost exactly like a
device-mapper device (read: LVM) does right now. All block redirections
happen in-kernel.

Remember that the userspace side can invalidate any mapping cached in
the kernel at any time. If userspace wanted to do cache management, it
could do so. I have also discussed the possibility of feeding some
coarse statistics back to userspace so it can make more informed
decisions.

I would not say that the caching is implicit. If you set the
DMU_FLAG_TEMPORARY bit on a response, the kernel will not remember the
mapping and thus will fault the next access back to userspace again.

AW> Wouldn't it simplify the code considerably to allow the userspace
AW> stuff to manage the size and contents of the cache so that they
AW> can do replacement based on their knowledge of the block layout?

I am not sure why this would be much better than letting the kernel
manage it. The kernel knows two things that userspace does not:
low-memory pressure and access statistics. I do not see why it would
make sense to have the kernel collect and communicate access statistics
for each block to userspace and then rely on it to evict unused
mappings. Further, the kernel can run without the userspace component
if no unmapped blocks are accessed. This allows a restart or upgrade of
the userspace component without disturbing the device.

It is entirely possible that I do not understand your point, so feel
free to correct me :)

AW> 2. There are a heck of a lot of spin locks in that driver. Did
AW> you run into a lot of stability problems that led to aggressively
AW> conservative locking?

I think I have said this before, but no performance analysis has been
done on dm-userspace to identify areas of contention. The use of
spinlocks was the best way (for me) to get things working and stable.
Most of the spinlocks are used to protect linked lists, which I think
is relatively valid. I should also point out that the structure of the
entire thing has been a moving target up until recently. It is
definitely possible that some of the places where a spinlock was used
could be refactored under the current model.

AW> Would removing the cache and/or switching some of the locks to
AW> refcounts simplify things at all?

Moving to something other than spinlocks for a few of the data
structures may be possible; we can investigate and post some numbers on
the next go-round.

--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
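A sketch of the cache-control behaviour described above: the userspace
daemon answers a mapping fault and either lets the kernel cache the
remap or marks it temporary so the next access faults back to
userspace. Only the DMU_FLAG_TEMPORARY name comes from the mail; the
structure layout and helper are hypothetical.

#include <stdint.h>

#define DMU_FLAG_TEMPORARY  (1u << 0)   /* do not cache this mapping */

struct dmu_map_response {               /* hypothetical layout */
        uint64_t org_block;             /* block the guest accessed   */
        uint64_t new_block;             /* where it lives on the CoW  */
        uint32_t flags;
};

/* Build a response; 'cache' reflects the policy choice discussed
 * above -- cached remaps are handled in-kernel on later accesses,
 * temporary ones fault back to the userspace daemon every time. */
static void fill_response(struct dmu_map_response *rsp,
                          uint64_t org_block, uint64_t new_block,
                          int cache)
{
        rsp->org_block = org_block;
        rsp->new_block = new_block;
        rsp->flags = cache ? 0 : DMU_FLAG_TEMPORARY;
}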