Andrew Warfield
2006-Jun-19 16:19 UTC
[Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Attached to this email is a patch containing the (new and improved) blktap Linux driver and associated userspace tools for Xen. In addition to being more flavourful, containing half the fat, and removing stains twice as well as the old driver, this stuff adds a userspace block backend and lets you use raw (without loopback), qcow, and vmdk-based image files for your domUs. There's also a fun little driver that provides a shared-memory block device which, in combination with OCFS2, represents a cheap-and-cheerful fast shared filesystem between multiple domUs.

This code has been (somewhat lackadaisically) developed over the past few years at Cambridge and has recently enjoyed massive improvements thanks to the considerable efforts of Julian Chesterfield.

The code "works for us" and has been tested on a grand total of about three machines. We would love feedback from a broader audience, both in trying out the tools and in inspecting the code. We plan to release new patches at about one-week intervals based on comments.

Performance is quite good, and we intend to focus on it a bit more over the next few weeks, releasing updated patches as they are available.
Bonnie results this morning are as follows (64-bit results compare against linux blkback+loopback file; Julian can follow up with loopback results for 32-bit later if anyone's interested):

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
64-bit:
xen0     4096 40115 93.4 41067 12.7 22757  1.2 32532 56.7 53724  0.4 121.4  0.0
img-sp   4096 20291 86.0 38091 18.1 19939  8.2 30854 69.0 47779  4.2  95.3  0.4
loop-sp  4096 33421 77.6 33663 13.1 18546  5.1 28606 59.2 46659  6.0  85.2  0.1

32-bit:
xen0     1024 33857 94.0 45804  9.0 23269  0.0 25825 52.0 55628  0   185.0  0.0
img-sp   1448 32743 92.0 40703  8.0 23281  0.0 31139 75.0 56585  0   208.1  0.0

The patch is against cset 0426:840f33e54054, but is unlikely to conflict with anything recent. You'll need libaio and libaio-devel on your build machine for the tools to compile.

Blktap readme follows.

Thanks!
a.

---

Blktap Userspace Tools + Library
================================

Andrew Warfield and Julian Chesterfield
16th June 2006

{firstname.lastname}@cl.cam.ac.uk

The blktap userspace toolkit provides a user-level disk I/O interface. The blktap mechanism involves a kernel driver that acts similarly to the existing Xen/Linux blkback driver, and a set of associated user-level libraries. Using these tools, blktap allows virtual block devices presented to VMs to be implemented in userspace and to be backed by raw partitions, files, network, etc.

The key benefit of blktap is that it makes it easy and fast to write arbitrary block backends, and that these user-level backends actually perform very well. Specifically:

- Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
  formats and other compression features can be easily implemented.
  O_DIRECT and libaio allow high-performance implementation of even
  sparse image formats such as QCoW, while still preserving the safe
  ordering of metadata and data writes to ensure data integrity. (As
  opposed to, for instance, the loopback driver and LVM snapshots, both
  of which have very dangerous failure cases.)

- Accessing file-based images from userspace avoids the problems with
  flushing dirty pages that are present in the Linux loopback driver.
  (Specifically, doing a large number of writes to an NFS-backed image
  doesn't result in the OOM killer going berserk.)

- Per-disk handler processes enable easier userspace policing of block
  resources, and process-granularity QoS techniques (disk scheduling
  and related tools) may be trivially applied to block devices.

- It's very easy to take advantage of userspace facilities such as
  networking libraries, compression utilities, peer-to-peer
  file-sharing systems and so on to build more complex block backends.

- Crashes are contained -- incremental development/debugging is very
  fast.

- All block data is forwarded in a zero-copy fashion, allowing for
  low-overhead userspace implementations.

How it works (in one paragraph):

Working in conjunction with the kernel blktap driver, all disk I/O requests from VMs are passed to the userspace daemon (using a shared memory interface) through a character device. Each active disk is mapped to an individual device node, allowing per-disk processes to implement individual block devices where desired. The userspace drivers are implemented using asynchronous (Linux libaio), O_DIRECT-based calls to preserve the unbuffered, batched and asynchronous request dispatch achieved with the existing blkback code. We provide a simple, asynchronous virtual disk interface that makes it quite easy to add new disk implementations.
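As a rough illustration of the submit-then-complete shape of that interface, here is a toy sketch in Python. The real plugins are C (see tools/blktap_user/block-*.c) and none of these names come from the actual code; this just shows the pattern of queueing requests with completion callbacks and draining a batch, as a libaio-based driver would with io_submit/io_getevents.

```python
# Illustrative sketch only: the names RamDisk, queue_read, queue_write
# and submit are invented here, not taken from the blktap sources.

class RamDisk:
    """Toy async backend: requests are queued and complete via callback,
    mirroring the batched dispatch of a libaio-based driver."""
    def __init__(self, sectors, sector_size=512):
        self.sector_size = sector_size
        self.data = bytearray(sectors * sector_size)
        self.pending = []          # queued (callback, result) pairs

    def queue_read(self, sector, count, callback):
        off = sector * self.sector_size
        buf = bytes(self.data[off:off + count * self.sector_size])
        self.pending.append((callback, buf))   # defer completion

    def queue_write(self, sector, buf, callback):
        off = sector * self.sector_size
        self.data[off:off + len(buf)] = buf
        self.pending.append((callback, len(buf)))

    def submit(self):
        """Deliver completions, like draining an io_getevents() batch."""
        batch, self.pending = self.pending, []
        for callback, result in batch:
            callback(result)

results = []
disk = RamDisk(sectors=16)
disk.queue_write(0, b"\x01" * 512, results.append)
disk.queue_read(0, 1, results.append)
disk.submit()
assert results[0] == 512               # write completion: bytes written
assert results[1] == b"\x01" * 512     # read completion: the data
```

A new disk format plugs in by implementing the same queue/complete pair; the dispatch loop never needs to know how the backend stores its blocks.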
As of June 2006 the currently supported disk formats are:

- Raw images (both on partitions and in image files)
- File-backed Qcow disks (sparse qcow overlay on a raw image/partition)
- Standalone sparse Qcow disks (sparse disks, not backed by a parent image)
- Fast shareable RAM disk between VMs (requires some form of cluster-based
  filesystem support, e.g. OCFS2, in the guest kernel)
- Some VMDK images - your mileage may vary

Raw and Qcow images have asynchronous backends and so should perform fairly well. VMDK is based directly on the qemu vmdk driver, which is synchronous (a.k.a. slow).

The qcow backends support existing qcow disks. There is also a set of tools to generate and convert qcow images. With these tools (and driver support), we maintain the qcow file format but adjust parameters for higher performance with Xen -- using a larger segment size (4096 instead of 512) and more coarsely allocating metadata regions. We are continuing to improve this work and expect qcow performance to improve a great deal over the next few weeks.

Build and Installation Instructions
===================================

You will need libaio >= 0.3.104 on your target system to build the tools (if you are installing RPMs, this means libaio and libaio-devel).

Make sure to configure the blktap backend driver in your dom0 kernel. It will cooperate fine with the existing backend driver, so you can experiment with tap disks without breaking existing VM configs.

To build the tools separately, run "make && make install" in tools/blktap_user.

Using the Tools
===============

Prepare the image for booting. For qcow files, use the qcow utilities installed earlier: qcow-create generates a blank standalone image or a file-backed CoW image; img2qcow takes an existing image or partition and creates a sparse, standalone qcow-based file.

Start the userspace disk agent either on system boot (e.g. via an init script) or manually => 'blktapctrl'

Customise the VM config file to use the 'tap' handler, followed by the driver type, e.g. for a raw image such as a file or partition:

disk = ['tap:aio:<FILENAME>,sda1,w']

e.g. for a qcow image:

disk = ['tap:qcow:<FILENAME>,sda1,w']

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
NAHieu
2006-Jun-19 16:51 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Wonderful!! Now we have dm-userspace and blktap, and these two seem to do similar things. So what are the pros/cons of blktap compared to dm-userspace? Perhaps blktap will have better performance? Do you have any benchmarks comparing dm-userspace and blktap?

Thanks.
H

On 6/20/06, Andrew Warfield <andrew.warfield@cl.cam.ac.uk> wrote:
> Attached to this email is a patch containing the (new and improved)
> blktap Linux driver and associated userspace tools for Xen.
> [...]
Andrew Warfield
2006-Jun-19 17:22 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
> Wonderful!! Now we have dm-userspace and blktap, and these two seem
> to do similar things. So what are the pros/cons of blktap compared
> to dm-userspace?

I'm sure that Dan can comment on this as well. The main technical difference is that (as I understand it, at least) dm-userspace doesn't bring block data through userspace, just the block request addresses, which may be redirected. The current tap code maps the entire request up, so you can potentially change the data, and you can issue block I/O using normal unix file access functions.

My intuition is that an approach like dm-userspace can be made more efficient in the long run, but right now it's going to be slower, as you need to copy guest data pages as requests go through the device-mapper kernel code. This should be fixable though. I'm also not sure how carefully dm-u watches block completion responses to ensure safety of metadata updates relative to data writes. This too should be fixable -- I just don't know if the user-level tools can currently request completion notifications on requests that they've processed.

A benefit of the dm-user patch is that it is more of a Linux approach than a Xen+Linux approach. dm-user will be generally useful in the Linux tree, whereas our stuff takes advantage of Xen-specific things to get high performance (i.e. zero-copy data movement). dm-user also has the benefit of being able to map images directly in dom0, which is very useful for tools and is something we haven't yet added.

Similarly though, one downside of dm-user, which is absolutely no fault of the developers, is the dependency on the Linux loopback driver, which has some bad failure characteristics: data can be acknowledged as written even though it hasn't been, and the OOM killer can go insane. I think some fixes to loop probably need to be applied in the near future, given how much people are generally depending on that code with VMs.
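The distinction between the two approaches can be sketched in a few lines of illustrative Python. Neither class matches the real dm-userspace or blktap interfaces (all names here are invented): a dm-userspace-style backend only answers "where does virtual block N live?", while a blktap-style backend sees, and may transform, the data itself.

```python
# Illustrative sketch only: contrasts the two shapes of userspace backend.

class RemapBackend:
    """dm-userspace style: userspace sees only block addresses.
    Data never leaves the kernel; we just redirect sector N."""
    def __init__(self):
        self.cow_map = {}       # virtual sector -> copied-out sector
        self.next_free = 1000   # hypothetical CoW allocation cursor

    def remap_write(self, sector):
        # Copy-on-write: the first write to a sector allocates a new home.
        if sector not in self.cow_map:
            self.cow_map[sector] = self.next_free
            self.next_free += 1
        return self.cow_map[sector]

class DataBackend:
    """blktap style: the full request, data included, is mapped up, so
    the backend can transform it (compress, encrypt, hash, ...)."""
    def __init__(self):
        self.store = {}         # sector -> bytes

    def write(self, sector, data):
        self.store[sector] = data.upper()   # stand-in for any transform

    def read(self, sector):
        return self.store[sector]

remap = RemapBackend()
assert remap.remap_write(7) == 1000   # first write allocates a copy
assert remap.remap_write(7) == 1000   # later writes reuse it

tap = DataBackend()
tap.write(7, b"abc")
assert tap.read(7) == b"ABC"          # backend saw and changed the data
```

The remap style keeps data movement in the kernel, which is why it can ultimately be cheaper; the data style is what makes compression, encryption, and content-addressable storage possible in the backend.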
Blktap is a bit of a bigger hammer -- requests are moved to userspace, and the current backends do everything there. This gives you a lot more flexibility in terms of developing virtual block devices. Take a look at tools/blktap_user/block-*.c to see what plugins look like; they're pretty tidy, imo. ;) The current code has the immediate benefit of being fully integrated with the tools and so on, so it should be easy to play with and extend. Having access to block contents also makes it possible to do things like compression, encryption, content-addressable storage, and memory-backed block devices.

I suspect that the ideal answer lies somewhere in between the two. Julian and I have talked about extending the tap driver to combine it with blkback and allow block address translation without access to request contents.

That's my biased view ;) -- I'm sure Dan can clear things up a bit. Now that we have this code on the list, Julian and I are hoping to take a closer look at dm-user and get a better sense of the relative merits of the two approaches.

a.
NAHieu
2006-Jun-19 18:41 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Andrew, I am compiling the code, but I got the error below:

......
make[3]: Entering directory `/home/hieu/projects/xen/blktap/tools/blktap_user/aiotools'
gcc -O2 -fomit-frame-pointer -DNDEBUG -m32 -march=i686 -Wall -Wstrict-prototypes -Wdeclaration-after-statement -D__XEN_TOOLS__ -fPIC -Wall -Werror -Wno-unused -g3 -fno-strict-aliasing -I ../../../tools/libxc -I.. -I. -I../../xenstore -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_GNU_SOURCE -Wp,-MD,.tapdisk.d -o tapdisk -L../../../tools/libxc \
-L../../../tools/xenstore -lxenstore -lblktap block-aio.o block-sync.o block-vmdk.o block-ram.o block-qcow.o aes.o tapdisk.c -L. -L.. -laio -lz
tapdisk.c:19:23: error: db.h: No such file or directory
make[3]: *** [tapdisk] Error 1
make[3]: Leaving directory `/home/hieu/projects/xen/blktap/tools/blktap_user/aiotools'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/home/hieu/projects/xen/blktap/tools/blktap_user'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/home/hieu/projects/xen/blktap/tools'
make: *** [install-tools] Error 2

I have no idea what the file db.h is. Did you miss something?

Thanks.
H
Anthony Liguori
2006-Jun-19 18:55 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Hi Andy,

> Performance is quite good, and we intend to focus on this a bit more
> over the next few weeks, releasing updated patches as they are
> available. Bonnie results this morning are as follows (64-bit results
> compare against linux blkback+loopback file [...])
> [...]
> img-sp  4096 20291 86.0 38091 18.1 19939 8.2 30854 69.0 47779 4.2  95.3 0.4
> loop-sp 4096 33421 77.6 33663 13.1 18546 5.1 28606 59.2 46659 6.0  85.2 0.1

What is img-sp? Is this blktap + a physical device, or is this blktap with something like qcow? The numbers are a tad worse than I'd expect them to be if it was a physical device. Theoretically, linux-aio is inserting requests directly into the backend. I expect there to be a certain amount of CPU overhead from context switching, but since it's still zero-copy, I wouldn't expect less CPU usage and less throughput. Any idea why this is, or am I just totally misunderstanding how things should behave? :-)

> Working in conjunction with the kernel blktap driver, all disk I/O
> requests from VMs are passed to the userspace daemon (using a shared
> memory interface) through a character device.
> [...]

I very much like the idea of a userspace block device backend. Have you considered what it would take to completely replace blkback with a userspace backend? I'm also curious why you chose a character device to interact with the ring queue instead of just attaching to the ring queue directly in userspace.

I think the whole discussion of COW support is orthogonal to a userspace backend, FWIW, so I'll save that part of the discussion for another thread :-)

Regards,
Anthony Liguori
Anthony Liguori
2006-Jun-19 19:15 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
A couple of general comments on the code:

Please don't introduce more (ab)uses of /proc. Sure, it's just for debugging, but there's no reason not to make that sysfs.

I'm not an expert here, but the nopage handlers that I've seen return NOPAGE_SIGBUS instead of manually raising a SIGBUS on current.

I think it's better to use C99 designated initializers than the GCC extension: owner: ..., => .owner = ...,

Some of the indenting is a bit off from Linux CodingStyle -- stuff like if( => if ( and some random spaces after a (. There's some code commented out with C++ comments too.

What's the significance of /**BLKTAP**/ and /**TAPEND**/?

I'm a little surprised to see these conversion tools too. Wouldn't it be easier to just add some parameters to qemu-img?

Pretty interesting stuff, thanks for posting.

Regards,
Anthony Liguori

Andrew Warfield wrote:
> Attached to this email is a patch containing the (new and improved)
> blktap Linux driver and associated userspace tools for Xen.
> [...]
> > Performance is quite good, and we intend to focus on this a bit more > over the next few weeks, releasing updated patches as they are > available. Bonnie results this morning are as follows (64-bit results > compare against linux blkback+loopback file, Julian can follow up with > loopback results for 32-bit later if anyone''s interested): > > -------Sequential Output-------- ---Sequential Input-- > --Random-- > -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- > --Seeks--- > Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU > /sec %CPU > 64-bit: > xen0 4096 40115 93.4 41067 12.7 22757 1.2 32532 56.7 53724 0.4 > 121.4 0.0 > img-sp 4096 20291 86.0 38091 18.1 19939 8.2 30854 69.0 47779 4.2 > 95.3 0.4 > loop-sp 4096 33421 77.6 33663 13.1 18546 5.1 28606 59.2 46659 6.0 > 85.2 0.1 > > 32-Bit: > xen0 1024 33857 94.0 45804 9.0 23269 0.0 25825 52.0 55628 0 > 185.0 0.0 > img-sp 1448 32743 92.0 40703 8.0 23281 0.0 31139 75.0 56585 0 > 208.1 0.0 > > The patch is against cset 0426:840f33e54054 -- but is unlikely to > conflict with anything recent. You''ll need libaio and libaio-devel on > your build machine for the tools to compile. > > > Blktap readme follows.) > > Thanks! > a. > > --- > > > Blktap Userspace Tools + Library > ===============================> > Andrew Warfield and Julian Chesterfield > 16th June 2006 > > {firstname.lastname}@cl.cam.ac.uk > > The blktap userspace toolkit provides a user-level disk I/O > interface. The blktap mechanism involves a kernel driver that acts > similarly to the existing Xen/Linux blkback driver, and a set of > associated user-level libraries. Using these tools, blktap allows > virtual block devices presented to VMs to be implemented in userspace > and to be backed by raw partitions, files, network, etc. > > The key benefit of blktap is that it makes it easy and fast to write > arbitrary block backends, and that these user-level backends actually > perform very well. 
Specifically: > > - Metadata disk formats such as Copy-on-Write, encrypted disks, sparse > formats and other compression features can be easily implemented. > O_DIRECT and libaio allow high-performance implementation of even > sparse image formats such as QCoW, while still preserving the safe > ordering of metadata and data writes to ensure data integrity. > (As opposed to, for instance, both the loopback driver and LVM snaps > which both have very dangerous failure cases.) > > - Accessing file-based images from userspace avoids problems related > to flushing dirty pages which are present in the Linux loopback > driver. (Specifically, doing a large number of writes to an > NFS-backed image don''t result in the OOM killer going berserk.) > > - Per-disk handler processes enable easier userspace policing of block > resources, and process-granularity QoS techniques (disk scheduling > and related tools) may be trivially applied to block devices. > > - It''s very easy to take advantage of userspace facilities such as > networking libraries, compression utilities, peer-to-peer > file-sharing systems and so on to build more complex block backends. > > - Crashes are contained -- incremental development/debugging is very > fast. > > - All block data is forwarded in a zero-copy fashion, allowing for > low-overhead userspace implementations. > > How it works (in one paragraph): > > Working in conjunction with the kernel blktap driver, all disk I/O > requests from VMs are passed to the userspace deamon (using a shared > memory interface) through a character device. Each active disk is > mappd to an individual device node, allowing per-disk processes to > implement individual block devices where desired. The userspace > drivers are implemented using asynchronous (Linux libaio), > O_DIRECT-based calls to preserve the unbuffered, batched and > asynchronous request dispatch achieved with the existing blockback > code. 
We provide a simple, asynchronous virtual disk interface that > makes it quite easy to add new disk implementations. > > > As of June 2006 the current supported disk formats are: > > - Raw Images (both on partitions and in image files) > - File-backed Qcow disks (sparse qcow overlay on a raw image/patrition). > - Standalone sparse Qcow disks (sparse disks, not backed by a parent > image). > - Fast shareable RAM disk between VMs (requires some form of > cluster-based > filesystem support e.g. OCFS2 in the guest kernel) > - Some VMDK images - your mileage may vary > > Raw and QCow images have asynchronous backends and so should perform > fairly well. VMDK is based directly on the qemu vmdk driver, which is > synchronous (a.k.a. slow). > > The qcow backends support existing qcow disks. There are also a set > of tools to generate and convert qcow images. With these tools (and > driver support), we maintain the qcow file format but adjust > parameters for higher performance with Xen -- using a larger segment > size (4096 instead of 512) and more coarsely allocating metadata > regions. We are continuing to improve this work and expect qcow > performance to improve a great deal over the newxt few weeks. > > Build and Installation Instructions > ==================================> > You will need libaio >= 0.3.104 on your target system to build the > tools (if you are installing RPMs, this means libaio and > libaio-devel). > > Make to configure the blktap backend driver in your dom0 kernel. It > will cooperate fine with the existing backend driver, so you can > experiment with tap disks without breaking existing VM configs. > > To build the tools separately, "make && make install" in > tools/blktap_user. > > > Using the Tools > ==============> > Prepare the image for booting. For qcow files use the qcow utilities > installed earlier. e.g. qcow-create generates a blank standalone image > or a file-backed CoW image. 
img2qcow takes an existing image or partition and creates a sparse,
standalone qcow-based file.

Start the userspace disk agent either on system boot (e.g. via an init
script) or manually => 'blktapctrl'

Customise the VM config file to use the 'tap' handler, followed by the
driver type, e.g. for a raw image such as a file or partition:

disk = ['tap:aio:<FILENAME>,sda1,w']

e.g. for a qcow image:

disk = ['tap:qcow:<FILENAME>,sda1,w']
------------------------------------------------------------------------

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
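Putting the pieces above together, a minimal domU config file using the tap handler might look like the following (paths, guest name and memory size are illustrative, not taken from the patch):

```
# Illustrative domU config using the blktap userspace backend
kernel = "/boot/vmlinuz-2.6-xenU"
memory = 256
name   = "qcow-guest"
# qcow image served by the userspace tapdisk process:
disk   = ['tap:qcow:/var/images/guest.qcow,sda1,w']
root   = "/dev/sda1 ro"
```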
Andrew Warfield
2006-Jun-19 19:22 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Hi Anthony,

> What is img-sp?  Is this blktap + a physical device or is this blktap
> with something like qcow?

Oops, good question.  This is blktap backed off of a sparse image file
(generated with something along the lines of "dd if=/dev/zero
of=./scratch.img bs=1024 seek=<big number>").  So this test is
directly comparable to blkback with a loopback-mounted image, which is
what's shown on the next line.

> The numbers are a tad worse than I'd expect them to be if it was a
> physical device.  Theoretically, linux-aio is inserting requests
> directly into the backend.  I expect there to be a certain amount of
> CPU overhead from context switching but since it's still zero-copy, I
> wouldn't expect less CPU usage and less throughput.
>
> Any idea why this is or am I just totally misunderstanding how things
> should behave :-)

Performance on raw devices is certainly better than on images -- I
didn't have a spare partition to work with on my test box this morning
(maybe I can use that as an excuse to get an extra disk), but will get
some results posted on this asap.

> I very much like the idea of a userspace block device backend.  Have
> you considered what it would take to completely replace blkback with
> a userspace backend?  I'm also curious why you chose a character
> device to interact with the ring queue instead of just attaching to
> the ring queue directly in userspace.

The current blktap code has functional parity with blkback.  Just
change 'phys:' to 'tap:aio:' in your vm config files and you're set.

> I think the whole discussion of COW support is orthogonal to a
> userspace backend FWIW so I'll save that part of the discussion for
> another thread :-)

Fair enough, I'll look forward to it. ;)

a.
Andrew Warfield
2006-Jun-19 19:26 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
> I very much like the idea of a userspace block device backend.  Have
> you considered what it would take to completely replace blkback with
> a userspace backend?  I'm also curious why you chose a character
> device to interact with the ring queue instead of just attaching to
> the ring queue directly in userspace.

Oops (again), missed answering your char device question.  We just use
a char device to pin up a region of virtual address space for each
disk as it's presented in userspace.  Anyone familiar with blkback
will recognise the technique.  In our case, the first page is a
request/response ring between tap driver and application, and the
remainder is a sparsely populated address space where data pages are
mapped as they fly through.  We signal down with ioctl()s, and up
using poll().

a.
Andrew Warfield
2006-Jun-19 19:31 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Excellent comments, thanks.

> Please don't introduce more (ab)uses of /proc.  Sure it's just for
> debugging but there's no reason to not make that sysfs.
>
> I'm not an expert here, but the nopage handlers that I've seen return
> NOPAGE_SIGBUS instead of manually causing a SIGBUS on current.
>
> I think it's better to use C99 initialization than GCC:
>
> owner: ..., => .owner = ...,
>
> Some of the indenting is a bit off from Linux CodingStyle.  Stuff like
> if( => if ( and some random spaces after an (.
>
> There's some code commented out with C++ comments too.

All good -- I'll take a pass through and fix all these this week.

> What's the significance of /**BLKTAP**/ and /**TAPEND**/?

For a while we were maintaining the kernel tap driver as a diff
against blkback, to pick up fixes quickly -- those markers were just
to mark differing regions.  I think the current code has diverged
enough to make this approach untenable.

> I'm a little surprised to see these conversion tools too.  Wouldn't it
> be easier to just add some parameters to qemu-img?

The image tools use our plugins (rather than qemu's) to build disks --
most importantly they adjust the layout to build better-performing
images.  Fair point though; I think Julian was going to look and see
if we could get away with just using the qemu tools.

> Pretty interesting stuff, thanks for posting.

Thanks for the feedback!

a.
Anthony Liguori
2006-Jun-19 19:51 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Andrew Warfield wrote:
>> I very much like the idea of a userspace block device backend.  Have
>> you considered what it would take to completely replace blkback with
>> a userspace backend?  I'm also curious why you chose a character
>> device to interact with the ring queue instead of just attaching to
>> the ring queue directly in userspace.
>
> Oops (again), missed answering your char device question.  We just use
> a char device to pin up a region of virtual address space for each
> disk as it's presented in userspace.

Is this strictly needed though?  My current understanding (which may be
totally off) of this device is that it contains:

- first page is ring/queue
- rest of file is mmap()'able and as requests come in over the blkfront
  queue, you map them into that address space
- poll/ioctl is used for event channel notification

Couldn't you do all of this in pure userspace though with privcmd and
evtchn?

Regards,

Anthony Liguori

> Anyone familiar with blkback will recognise the technique.  In our
> case, the first page is a request/response ring between tap driver and
> application, and the remainder is a sparsely populated address space
> where data pages are mapped as they fly through.  We signal down with
> ioctl()s, and up using poll().
>
> a.
Andrew Warfield
2006-Jun-19 21:07 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
> Andrew, I am compiling the code, but I got the below error:
> ...
> tapdisk.c:19:23: error: db.h: No such file or directory

Oops -- that's there from an old version of the code that used
Berkeley DB as a test.  Just remove that line and you should be in
business -- it should be completely unnecessary.

a.

> ......
> make[3]: Entering directory
> `/home/hieu/projects/xen/blktap/tools/blktap_user/aiotools'
> gcc -O2 -fomit-frame-pointer -DNDEBUG -m32 -march=i686 -Wall
> -Wstrict-prototypes -Wdeclaration-after-statement -D__XEN_TOOLS__
> -fPIC -Wall -Werror -Wno-unused -g3 -fno-strict-aliasing -I
> ../../../tools/libxc -I.. -I. -I../../xenstore -D_FILE_OFFSET_BITS=64
> -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_GNU_SOURCE
> -Wp,-MD,.tapdisk.d -o tapdisk -L../../../tools/libxc \
> -L../../../tools/xenstore -lxenstore -lblktap block-aio.o
> block-sync.o block-vmdk.o block-ram.o block-qcow.o aes.o tapdisk.c
> -L. -L.. -laio -lz
> make[3]: *** [tapdisk] Error 1
> make[3]: Leaving directory
> `/home/hieu/projects/xen/blktap/tools/blktap_user/aiotools'
> make[2]: *** [all] Error 2
> make[2]: Leaving directory `/home/hieu/projects/xen/blktap/tools/blktap_user'
> make[1]: *** [install] Error 2
> make[1]: Leaving directory `/home/hieu/projects/xen/blktap/tools'
> make: *** [install-tools] Error 2
>
> I have no idea what the file db.h is.  Did you miss something?
>
> Thanks.
> H
Dan Smith
2006-Jun-19 21:16 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
AW> I'm sure that Dan can comment on this as well.  The main technical
AW> difference is that (as I understand it at least) dm-userspace
AW> doesn't bring block data through userspace, just the block request
AW> addresses, which may be redirected.  The current tap code maps the
AW> entire request up, so you can potentially change the data and you
AW> can issue block I/O using normal unix file access functions.

Yup, that's a correct assessment.

AW> My intuition is that an approach like dm-userspace can be made
AW> more efficient in the long run, but right now it's going to be
AW> slower as you need to do copies of guest data pages as requests go
AW> through the device mapper kernel code.

Why do you say that?  I would imagine that blkback provides the domU
pages as the target pages in the request, is that right?  In that
case, the data coming off of the disk should go directly into the
domU page.  Remember that dm-userspace doesn't do anything other than
rewriting of the destination device and sector of a request.  So,
however it works for blkback now is how it works with dm-userspace in
the mix.

AW> This should be fixable though.  I'm also not sure how carefully
AW> dm-u watches block completion responses to ensure safety of
AW> metadata updates relative to data writes.  This too should be
AW> fixable -- I just don't know if the user-level tools can currently
AW> request completion notifications on requests that they've
AW> processed.

So, right now, we're a little optimistic about metadata writing.  It
will be relatively easy to hijack the callback routine for the disk
request (a technique which is heavily used in the rest of the block
layer) to get a completion trigger.  We can then notify userspace for
the metadata write and then trigger the original callback routine for
completion.

AW> A benefit to the dm-user patch is that it is more of a linux
AW> approach than a xen+linux approach.  Dm-user will be generally
AW> useful in the linux tree.

Right, this is a huge advantage, I think.  Being able to mount images
as if they were disks will be quite helpful.  Another benefit is the
ability to easily convert between formats.  Converting a vmdk to a
qcow is as easy as mounting both and doing a "cp -R" between them.

AW> Similarly though, one downside of dm-user, that is absolutely no
AW> fault of the developers, is the dependency on the linux loopback
AW> driver ...

Just a clarification: this is only if file images are used.  If using
LVM volumes or partitions or some other block device, we don't use
the loop driver.

AW> ... which has some bad failure characteristics which can result in
AW> both data being acknowledged as written even though it hasn't
AW> been, and the OOM killer going insane.  I think some fixes to loop
AW> probably need to be applied in the near future given how much
AW> people are generally depending on the code with VMs.

Can you elaborate about what specifically is wrong with the loop
driver?

AW> Julian and I have talked about extending the tap driver to combine
AW> it with blkback and allow block address translation without access
AW> to request contents.

Since the kernel already has a block address translation solution
(i.e. device-mapper), is there a benefit to adding another
xen-specific one?

Another question I have is this: doesn't the dependence on libaio
limit you to certain filesystems?  For example, the page for libaio
doesn't mention reiserfs as supported.  Does that mean that SLES
users won't be able to use ublkback?

Thanks for posting your code Andrew!

-- 
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
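For reference, the kind of sector remapping device-mapper performs can be expressed as a static table (values here are illustrative; dm-userspace supplies such mappings dynamically from a userspace process rather than from a fixed table):

```
# dmsetup table: <logical start> <length> linear <device> <offset>
# sectors 0-1023 of the virtual device map to /dev/sdb1 at offset 0,
# the next 1047552 sectors map to /dev/sdb2 at offset 0
0    1024    linear /dev/sdb1 0
1024 1047552 linear /dev/sdb2 0
```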
Julian Chesterfield
2006-Jun-19 21:42 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
> On 19/6/06 8:15 pm, "Anthony Liguori" <aliguori@us.ibm.com> wrote:
>
>> Couple general comments on the code:
>>
>> Please don't introduce more (ab)uses of /proc.  Sure it's just for
>> debugging but there's no reason to not make that sysfs.
>>
[...]
>>
>> I'm a little surprised to see these conversion tools too.  Wouldn't
>> it be easier to just add some parameters to qemu-img?

Thanks for the comments Anthony.  When we initially played with qcow
images it was easier to knock up our own frontend to the plugins for
converting between the different image types and testing features
like image sparseness.  We added an optimisation feature in the Xen
qcow plugin which will allocate full extents for non-backing-file
based images, as well as the asynchronous callback architecture to
enable request batching for AIO.

We could certainly adapt qemu-img to use these and other features.
Not sure what the best approach for keeping the toolsets in sync
between the two projects would be, though.

Thanks,
Julian Chesterfield
Anthony Liguori
2006-Jun-19 21:56 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Julian Chesterfield wrote:
> [...]
> Thanks for the comments Anthony.  When we initially played with qcow
> images it was easier to knock up our own frontend to the plugins for
> converting between the different image types and testing features
> like image sparseness.  We added an optimisation feature in the Xen
> qcow plugin which will allocate full extents for non-backing-file
> based images, as well as the asynchronous callback architecture to
> enable request batching for AIO.
>
> We could certainly adapt qemu-img to use these and other features.
> Not sure what the best approach for keeping the toolsets in sync
> between the two projects would be, though.

It may be worth just bringing up the changes on qemu-devel.  I know
why you'd want to change the cluster size (it's a pain to work with
clusters < block size).  I saw another comment about making metadata
more coarse.  Can you clarify the reasons for that?

I can't imagine there would be that much push back in changing the
default cluster size in qemu-img from 512 to 4096.  Most OSes are
going to write in that granularity anyway I imagine :-)

Regards,

Anthony Liguori
Julian Chesterfield
2006-Jun-20 13:44 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
> On 19/6/06 10:56 pm, "Anthony Liguori" <aliguori@us.ibm.com> wrote:
>
>> It may be worth just bringing up the changes on qemu-devel.  I know
>> why you'd want to change the cluster size (it's a pain to work with
>> clusters < block size).  I saw another comment about making metadata
>> more coarse.  Can you clarify the reasons for that?

We've been thinking about an enhancement to the qcow driver to use
smarter readahead on the request ring in order to speculatively limit
the number of metadata writes where request batching is used.  This
is an advantage of having access to the full frontend request queue,
which enables the userspace agent to make smart decisions regarding
caching and safe but minimal metadata writes.

(Not sure which comment you'd read, but hope this may answer it!)

- Julian
Julian Chesterfield
2006-Jun-20 13:57 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
> Another question I have is this: doesn't the dependence on libaio
> limit you to certain filesystems?  For example, the page for libaio
> doesn't mention reiserfs as supported.  Does that mean that SLES
> users won't be able to use ublkback?

Dan,

I've just tested blktap with a reiserfs base filesystem.  There were
no errors opening files with O_DIRECT, and the performance is
proportionally similar to ext3 (average block reads close to native,
average block writes above 80% under bonnie++).  I'll explore
further; however, it seems that O_DIRECT is supported.

Thanks,
Julian
Rusty Russell
2006-Jun-29 03:35 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
On Mon, 2006-06-19 at 09:19 -0700, Andrew Warfield wrote:
> Attached to this email is a patch containing the (new and improved)
> blktap Linux driver and associated userspace tools for Xen.
[...]

Hi Andrew,

I like the idea of block servers in userspace, but I'm curious.  When
I wrote the simple share block server I couldn't see an obvious
justification for multiple outstanding requests (with AIO/threads and
all that entails), so I went for the trivial single-request approach.
It seems to me that the backend doesn't have much information the
front end doesn't have.

Just wondered if you'd tried a naive approach first...

Thanks!
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
Andrew Warfield
2006-Jun-29 05:24 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
> I like the idea of block servers in userspace, but I'm curious.  When
> I wrote the simple share block server I couldn't see an obvious
> justification for multiple outstanding requests (with AIO/threads and
> all that entails), so I went for the trivial single-request approach.
> It seems to me that the backend doesn't have much information the
> front end doesn't have.

Hi Rusty,

not sure I see what you are asking.  A very early version of the code
just did synchronous dispatch (one blocking request at a time) and
was, as you might expect, very slow.  You clearly want to keep the
block request queues as full as possible to amortize seeks...  AIO
just lets me issue batches of requests at once, and so minimizes
context switching through userland -- which was something I was
worried about causing overhead on x86_64.  I don't really think it
adds that much complexity.

The process-per-disk thing is optional in the current code; you could
just as easily build a single-threaded user backend.  The current
model hopefully buys you a bit of resiliency against crashes and maps
per-disk request streams in a fairly clean way down onto the block
scheduler.

Am I missing your point?

a.
Rusty Russell
2006-Jun-29 06:31 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
On Wed, 2006-06-28 at 22:24 -0700, Andrew Warfield wrote:
> not sure I see what you are asking.  A very early version of the code
> just did synchronous dispatch (one blocking request at a time) and
> was, as you might expect, very slow.  You clearly want to keep the
> block request queues as full as possible to amortize seeks...

Last I looked at the blkif front end, it uses a noop I/O scheduler,
which means that the only one doing scheduling is the backend.  I can
easily imagine that if the backend is synchronous, this would be
slow.

However, it's not clear to me that doing scheduling in the backend
will generally be faster than doing it in the front end.  I suppose
it should be, if the backend domain were serving multiple frontends
from the same device.

> AIO just lets me issue batches of requests at once, and so minimizes
> context switching through userland -- which was something I was
> worried about causing overhead on x86_64.  I don't really think it
> adds that much complexity.

Sure, I would have used a pool of processes because I'm
old-fashioned, but AIO is probably a better choice for multiple
requests at once.

Cheers!
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
Anthony Liguori
2006-Jun-29 11:49 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Rusty Russell wrote:
> On Mon, 2006-06-19 at 09:19 -0700, Andrew Warfield wrote:
>
>> Attached to this email is a patch containing the (new and improved)
>> blktap Linux driver and associated userspace tools for Xen.  In
>> addition to being more flavourful, containing half the fat, and
>> removing stains twice as well as the old driver, this stuff adds a
>> userspace block backend and lets you use raw (without loopback), qcow,
>> and vmdk-based image files for your domUs.  There's also a fun little
>> driver that provides a shared-memory block device which, in
>> combination with OCFS2, represents a cheap-and-cheerful fast shared
>> filesystem between multiple domUs.
>
> Hi Andrew,
>
> I like the idea of block servers in userspace, but I'm curious.  When I
> wrote the simple share block server I couldn't see an obvious
> justification for multiple outstanding requests (with AIO/threads and
> all that entails),

Are you thinking of posix-aio?  posix-aio is "emulated" with threads and
normal read/select calls.  The performance isn't that great.

I believe blktap is using linux-aio, which doesn't use threads (it uses
the Linux-specific interface).  I've seen a number of benchmarks where
linux-aio is significantly better than posix-aio.

Regards,

Anthony Liguori

> so I went for the trivial single request approach.
> It seems to me that the backend doesn't have much information the front
> end doesn't have.
>
> Just wondered if you'd tried a naive approach first...
>
> Thanks!
> Rusty.
Laurent Vivier
2006-Jun-29 12:26 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Anthony Liguori wrote:
> Rusty Russell wrote:
>> On Mon, 2006-06-19 at 09:19 -0700, Andrew Warfield wrote:
>>
>>> Attached to this email is a patch containing the (new and improved)
>>> blktap Linux driver and associated userspace tools for Xen.  In
>>> addition to being more flavourful, containing half the fat, and
>>> removing stains twice as well as the old driver, this stuff adds a
>>> userspace block backend and lets you use raw (without loopback), qcow,
>>> and vmdk-based image files for your domUs.  There's also a fun little
>>> driver that provides a shared-memory block device which, in
>>> combination with OCFS2, represents a cheap-and-cheerful fast shared
>>> filesystem between multiple domUs.
>>
>> Hi Andrew,
>>
>> I like the idea of block servers in userspace, but I'm curious.  When I
>> wrote the simple share block server I couldn't see an obvious
>> justification for multiple outstanding requests (with AIO/threads and
>> all that entails),
>
> Are you thinking of posix-aio?  posix-aio is "emulated" with threads and
> normal read/select calls.  The performance isn't that great.

Hi,

We have developed another implementation of POSIX AIO for Linux, with
better performance, based on the Linux kernel AIO interface.  Have a
look at:

http://www.bullopensource.org/posix/index.html

Laurent
--
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4
Andrew Warfield
2006-Jun-29 14:34 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
> Last I looked at the blkif front end, it used a noop I/O scheduler, which
> means that the only one doing scheduling is the backend.  I can easily
> imagine that if the backend is synchronous, this would be slow.
>
> However, it's not clear to me that doing scheduling in the backend will
> generally be faster than doing it in the front end.  I suppose it should
> be, if the backend domain were serving multiple frontends from the same
> device.

Well, scheduling across multiple VM request streams is certainly one
reason for exposing as big a request aperture to the physical (backend)
disk scheduler as possible.  The fact that the frontend doesn't
necessarily have any idea how its blocks are actually laid out on the
disk is another -- in the case of file-backed images, for instance.

> > AIO just
> > lets me issue batches of requests at once, and so minimizes context
> > switching through userland -- which was something I was worried about
> > causing overhead on x86_64.  I don't really think it adds that much
> > complexity.
>
> Sure, I would have used a pool of processes because I'm old-fashioned,
> but AIO is probably a better choice for multiple requests at once.

My older code was written without the benefit of working AIO for Xen
Linux.  I knocked up a thread pool to improve performance and it worked
reasonably well, although I found that you needed a fairly large number
of threads to saturate the disk (with blocking I/O, which was a little
naive ;) ), and it represented a fairly large chunk of unnecessary
moving parts.

The Linux libaio stuff is pretty good, actually.  Requests map rather
directly down onto the kernel bio interface, so with AIO the userland
block back code is doing a very similar thing to the in-kernel driver.
As Anthony points out, libaio is unthreaded: you just fill out a batch
of request structs and shove it down.  It's very fast indeed and quite
low-overhead.
My only real complaint is that despite a couple of years of discussion
on libaio-devel, the AIO developers haven't settled on a unified way to
poll on AIO completions and normal file handles, which is a bit of an
inconvenience when you want to do both.

a.
Stephen C. Tweedie
2006-Jun-30 13:35 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
Hi,

On Thu, 2006-06-29 at 07:34 -0700, Andrew Warfield wrote:
> The Linux libaio stuff is pretty good, actually.  Requests map rather
> directly down onto the kernel bio interface, so with AIO the userland
> block back code is doing a very similar thing to the in-kernel driver.

Yep.  I noticed that the blktap patch includes adding EPOLL to kernel
aio, though, and that has not (yet) been accepted upstream; is that
something that is absolutely necessary for blktap, or could you live
without it?

Is there any movement towards getting that upstream?  Otherwise we're
introducing dependencies on core kernel infrastructure that is not
guaranteed to persist upstream.  It looks like the sort of thing that
would be entirely reasonable upstream: EPOLL for aio seems to make a
ton of sense.

Cheers,
Stephen
Julian Chesterfield
2006-Jun-30 14:17 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
On 30 Jun 2006, at 14:35, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, 2006-06-29 at 07:34 -0700, Andrew Warfield wrote:
>
>> The Linux libaio stuff is pretty good, actually.  Requests map rather
>> directly down onto the kernel bio interface, so with AIO the userland
>> block back code is doing a very similar thing to the in-kernel driver.
>
> Yep.  I noticed that the blktap patch includes adding EPOLL to kernel
> aio, though, and that has not (yet) been accepted upstream; is that
> something that is absolutely necessary for blktap, or could you live
> without it?

Without the completion event poll, the alternatives were to block on
io_getevents until the batch completed, or to periodically test for
queued responses.  The epoll approach was definitely preferable since
it fits very nicely into the asynch architecture we were working
towards for the userspace drivers.  Without the completion poll,
performance would most likely degrade, although we haven't done any
tests to measure by how much.

> Is there any movement towards getting that upstream, since otherwise
> we're introducing dependencies on core kernel infrastructure that is
> not guaranteed to persist upstream?  It looks like the sort of thing
> that would be entirely reasonable upstream: EPOLL for aio seems to
> make a ton of sense.

Agreed.  We'd like to see the EPOLL facility adopted in the mainstream
AIO architecture.  The current patch was submitted on the linux-aio
list in response to a query we sent about a month ago; however, I don't
believe there has been any movement to officially add it.  It's on our
agenda to follow up with the AIO folks, since it definitely belongs in
the mainline kernel rather than as a Xen patch.

- Julian
Jeff Moyer
2006-Jun-30 18:41 UTC
Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC)
==> Regarding Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC); Julian Chesterfield <jac90@cl.cam.ac.uk> adds:

jac90> Agreed.  We'd like to see the EPOLL facility adopted in the
jac90> mainstream AIO architecture.  The current patch was submitted on
jac90> the linux-aio list in response to a query we sent about a month
jac90> ago; however, I don't believe there has been any movement to
jac90> officially add it.  It's on our agenda to follow up with the AIO
jac90> folks, since it definitely belongs in the mainline kernel rather
jac90> than as a Xen patch.

Are you planning on sending it along soonish?  When you post it, I'll
set aside some time to kick the tires and provide feedback.

Thanks!
Jeff