I have installed Xen and Linux 2.6.10 on three different machines. The slowest of them is my computer at home, an Athlon XP 1600+ (1.4 GHz) with 256 MB RAM.

My problem is reduced file-system performance in domU guests. These guests run faster when I use loopback files on Dom0 than they do when I use real partitions and populate them with a Linux system.

I found out that dom0 file-system IO and raw IO (using dd as a tool to test throughput from the disk) is almost exactly the same as with a standard Linux kernel without Xen. But the raw IO from DomU to an unused disk (a second disk in the system) is limited to forty percent of the speed I get within Dom0. The ratio is about the same when doing real file-system IO. I found this symptom on all of the systems I installed.

An early paper about Xen states that the penalty when using VBDs is close to zero and negligible. This conflicts with the results I got, and I believe it indicates that something in my configuration is wrong (at least I hope so). I have the drivers for my chipset linked into the kernel, and hdparm tells me that DMA is enabled for the disks in use (using hdparm under Dom0).

What worries me is that the results within Dom0 are completely satisfactory, while those in DomU are not. Do I have to change the kernel config for DomU? Or is there any special option I have to set in the kernel configuration for Dom0, or even for Xen itself? I have compiled version 2.0.5 - the newest available, to my knowledge.

Any hints?
> I found out that dom0 does file-system IO and raw IO (using dd as a tool
> to test throughput from the disk) is about exactly the same as when using
> a standard Linux kernel without Xen. But the raw IO from DomU to an unused
> disk (a second disk in the system) is limited to forty percent of the
> speed I get within Dom0.

Just to be clear: you're doing a dd performance test within dom0 to the exact same partition on the 2nd disk that you're using when you start the domU, and finding that the domU dd performance is 40% of the dom0 performance?

I've not heard of anyone else having problems like this. What happens if you use a partition on the 1st disk?

What chipset is the IDE controller? What device (e.g. sda1) are you exporting the disk partition into the domU as?

Are you sure dom0 is idle when doing the dd test in the domU?

Ian
Ian Pratt <m+Ian.Pratt <at> cl.cam.ac.uk> writes:

> Just to be clear: you're doing a dd performance test within dom0 to the
> exact same partition on the 2nd disk that you're using when you start
> the domU, and finding that the domU dd performance is 40% of the dom0
> performance?
>
> I've not heard of anyone else having problems like this. What happens if
> you use a partition on the 1st disk?
>
> What chipset is the IDE controller? What device (e.g. sda1) are you
> exporting the disk partition into the domU as?
>
> Are you sure dom0 is idle when doing the dd test in the domU?

Yes, I have tried various partitions on both disks, from both Dom0 and DomU, and the result has always been a performance ratio of 2.5 between Dom0 and DomU. Yes, I used dd for the test, but I originally came across this problem doing IO into the file system. I was surprised that switching from a loopback file as the "device" for DomU to a real device brought not just no improvement but a performance degradation. That is when I started to test raw IO performance using dd.

I am sure that the device was not busy and dom0 was idle when I did the tests. There were no busy jobs in dom0, neither CPU- nor IO-bound.

I don't know which chipset the IDE controller is; my mainboard is an MSI KT7 board. I am currently not at home and must look up what the IDE controller is. The devices I exported have been hda1 and hdb6 on my computer at home, and hdg5 in the office. In the latter case the disk is attached to a Promise202 RAID controller.

Is there any description of what I have to do to configure my system adequately to run efficiently using Xen? If such were available I might be able to locate the problem myself.

I have not yet done a dd performance test using loopback files as devices; I have only used them as file systems.

Thanks in advance,
Peter Bier
Ian Pratt <m+Ian.Pratt <at> cl.cam.ac.uk> writes:

> Just to be clear: you're doing a dd performance test within dom0 to the
> exact same partition on the 2nd disk that you're using when you start
> the domU, and finding that the domU dd performance is 40% of the dom0
> performance?

Yes, I do the performance testing using dd. It is only a simple "benchmark", but its results seem to indicate a fundamental issue. I did the tests with the same partitions from Dom0 as from DomU. I used both disks, and Dom0 achieved 2.5 times the transfer rate of DomU in all experiments.

I do not know the chipset of the IDE controller in my computer at home, while I know that in the office it is a Promise RAID controller (I am neither at home nor in the office at the moment). I am sure that the system was idle during all tests, meaning that only the standard system was running, with no busy jobs and no user program consuming CPU or IO resources.

I am very interested in Xen, but I need to fix this problem. If there is any "checklist" on how to configure Xen efficiently, I might be able to fix the problem myself.

Thanks,
Peter Bier
> > I found out that dom0 does file-system IO and raw IO (using dd as a tool
> > to test throughput from the disk) is about exactly the same as when using
> > a standard Linux kernel without Xen. But the raw IO from DomU to an
> > unused disk (a second disk in the system) is limited to forty percent of
> > the speed I get within Dom0.

OK, this looks like a performance bug that's crept into the 2.6 dom0 somewhere along the way. I'm surprised no-one else has spotted it.

Please can you confirm that performance is OK if you use 2.4 as a dom0? (It doesn't matter what you use as guests.)

Thanks,
Ian
On Monday 28 March 2005 12:55, Ian Pratt wrote:
> > > I found out that dom0 does file-system IO and raw IO (using dd as a
> > > tool to test throughput from the disk) is about exactly the same as
> > > when using a standard Linux kernel without Xen. But the raw IO from
> > > DomU to an unused disk (a second disk in the system) is limited to
> > > forty percent of the speed I get within Dom0.

Is the second disk exactly the same as the first one? I'll try an IO test here on the same disk array with dom0 and domU and see what I get.

-Andrew

> OK, this looks like a performance bug that's crept into the 2.6 dom0
> somewhere along the way. I'm surprised no-one else has spotted it.
>
> Please can you confirm that performance is OK if you use 2.4 as a dom0?
> (It doesn't matter what you use as guests.)
>
> Thanks,
> Ian
> > > > I found out that dom0 does file-system IO and raw IO (using dd as a
> > > > tool to test throughput from the disk) is about exactly the same as
> > > > when using a standard Linux kernel without Xen. But the raw IO from
> > > > DomU to an unused disk (a second disk in the system) is limited to
> > > > forty percent of the speed I get within Dom0.
>
> Is the second disk exactly the same as the first one? I'll try an IO test
> here on the same disk array with dom0 and domU and see what I get.

I've reproduced the problem and it's a real issue.

It only affects reads, and is almost certainly down to how the blkback driver passes requests down to the actual device.

Does anyone on the list actually understand the changes made to Linux block IO between 2.4 and 2.6?

In the 2.6 blkfront there is no run_task_queue() to flush requests to the lower layer, and we use submit_bio() instead of 2.4's generic_make_request(). It looks like this is happening synchronously rather than queueing multiple requests. What should we be doing to cause things to be batched?

Thanks,
Ian
On Monday 28 March 2005 14:14, Ian Pratt wrote:
> I've reproduced the problem and it's a real issue.
>
> It only affects reads, and is almost certainly down to how the blkback
> driver passes requests down to the actual device.
>
> Does anyone on the list actually understand the changes made to Linux
> block IO between 2.4 and 2.6?
>
> In the 2.6 blkfront there is no run_task_queue() to flush requests to
> the lower layer, and we use submit_bio() instead of 2.4's
> generic_make_request(). It looks like this is happening synchronously
> rather than queueing multiple requests. What should we be doing to cause
> things to be batched?

There are multiple IO schedulers in 2.6. Do you know which one is being used? It should say somewhere in the boot log. Some read-ahead code also changed in the 2.6.10-11 range.

So far I have not been able to reproduce this in xen-unstable with 2.6. I am building xen-2.0.5 for a look.

-Andrew
On Monday 28 March 2005 14:14, Ian Pratt wrote:
> In the 2.6 blkfront there is no run_task_queue() to flush requests to
> the lower layer, and we use submit_bio() instead of 2.4's
> generic_make_request(). It looks like this is happening synchronously
> rather than queueing multiple requests. What should we be doing to cause
> things to be batched?

To my knowledge you cannot queue multiple bio requests at once; the IO schedulers should batch them up before submitting to the actual devices.

I tried xen-2.0.5 and xen-unstable with a sequential read test using a 256k request size and 8 reader threads with O_DIRECT on an LVM RAID-0 SCSI array (no HW cache) and got:

xen-2-dom0-2.6.10: 177 MB/sec
xen-2-domU-2.6.10: 185 MB/sec
xen-3-dom0-2.6.11: 177 MB/sec
xen-3-domU-2.6.11: 185 MB/sec

Better results with VBD :) I am wondering if going through two layers of IO schedulers streams the IO better. I was using the AS scheduler; I am going to try the noop scheduler and see what I get.

What block size were you using with dd?

-Andrew
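[For reference, a minimal single-threaded O_DIRECT sequential-read tester along these lines might look like the sketch below. The device path, request size and total size are placeholders; this is not the actual benchmark behind the numbers quoted above.]

	/*
	 * Sketch of an O_DIRECT sequential-read throughput test.
	 * Reads 1 GB from a block device in fixed-size requests and
	 * prints the achieved bandwidth.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/time.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
	    const char *dev = (argc > 1) ? argv[1] : "/dev/hdb6"; /* placeholder device */
	    size_t reqsz = 256 * 1024;              /* 256k per read */
	    long long total = 1024LL * 1024 * 1024; /* read 1 GB in total */
	    long long done = 0;
	    struct timeval t0, t1;
	    double secs;
	    void *buf;
	    int fd;

	    /* O_DIRECT requires an aligned buffer. */
	    if (posix_memalign(&buf, 4096, reqsz) != 0) {
	        perror("posix_memalign");
	        return 1;
	    }

	    fd = open(dev, O_RDONLY | O_DIRECT);
	    if (fd < 0) {
	        perror("open");
	        return 1;
	    }

	    gettimeofday(&t0, NULL);
	    while (done < total) {
	        ssize_t n = read(fd, buf, reqsz);
	        if (n <= 0)
	            break;
	        done += n;
	    }
	    gettimeofday(&t1, NULL);

	    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	    printf("%lld bytes in %.2f s = %.1f MB/s\n", done, secs, done / secs / 1e6);

	    close(fd);
	    return 0;
	}

Because every read bypasses the page cache, the request size passed on the command line directly controls how much IO is outstanding against the device, which is what makes this kind of test useful for exposing the latency issue discussed in this thread.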
> I tried xen-2.0.5 and xen-unstable with a sequential read test using a
> 256k request size and 8 reader threads with O_DIRECT on an LVM RAID-0
> SCSI array (no HW cache) and got:
>
> xen-2-dom0-2.6.10: 177 MB/sec
> xen-2-domU-2.6.10: 185 MB/sec
> xen-3-dom0-2.6.11: 177 MB/sec
> xen-3-domU-2.6.11: 185 MB/sec

Please can you try a simple 'dd if=/dev/sdaXX of=/dev/null bs=1024k count=4096' to read 4GB from the partition, both in dom0 and domU.

When booting, I get the following output, which I presume is the default?

  elevator: using anticipatory as default io scheduler

Thanks,
Ian
Andrew Theurer <habanero <at> us.ibm.com> writes:

> Better results with VBD :) I am wondering if going through two layers of
> IO schedulers streams the IO better. I was using the AS scheduler; I am
> going to try the noop scheduler and see what I get.
>
> What block size were you using with dd?

My dd command was always the same: "dd if=/dev/hdb6 bs=64k count=1000". It took 1.6 seconds on hdb6 and 2.2 seconds on hda1 when running in Dom0, and it took 4.6 seconds on hdb6 and 5.8 seconds on hda1 when running in DomU. I did one experiment with count=10000 and it took ten times as long in each of the four cases.

I have done the following tests:

DomU : dd if=/dev/hdb6 of=/dev/null bs=1024k count=4000 ; duration 301 sec
DomU : dd if=/dev/hdb6 of=/dev/null bs=1024k count=4000 ; duration 370 sec

Dom0 : dd if=/dev/hdb6 of=/dev/null bs=1024k count=4000 ; duration 115 sec
Dom0 : dd if=/dev/hda1 of=/dev/null bs=1024k count=4000 ; duration 140 sec

Peter
> My dd command was always the same: "dd if=/dev/hdb6 bs=64k count=1000".
>
> I have done the following tests:
>
> DomU : dd if=/dev/hdb6 of=/dev/null bs=1024k count=4000 ; duration 301 sec
> DomU : dd if=/dev/hdb6 of=/dev/null bs=1024k count=4000 ; duration 370 sec
>
> Dom0 : dd if=/dev/hdb6 of=/dev/null bs=1024k count=4000 ; duration 115 sec
> Dom0 : dd if=/dev/hda1 of=/dev/null bs=1024k count=4000 ; duration 140 sec

OK, I have reproduced this with both dd and O_DIRECT now. For O_DIRECT I used what was the effective dd request size (128k) and got similar results. My numbers are much worse because I am driving 14 disks:

dom0: 153.5 MB/sec
domU:  12.7 MB/sec

It looks like there might be a problem where we are not getting a timely response back from the dom0 VBD driver that an IO request is complete, which limits the number of outstanding requests to a level that cannot keep the disk well utilized. If you drive enough outstanding IO requests (which can be done either with O_DIRECT and large requests, or with a much larger readahead setting for buffered IO), it's not an issue.

In the domU, can you try setting the readahead size to a much larger value using hdparm? Something like hdparm -a 2028, then run dd?

-Andrew
On Sun, Mar 27, 2005 at 06:41:27PM +0100, Ian Pratt wrote:
> Just to be clear: you're doing a dd performance test within dom0 to the
> exact same partition on the 2nd disk that you're using when you start
> the domU, and finding that the domU dd performance is 40% of the dom0
> performance?
>
> I've not heard of anyone else having problems like this. What happens if
> you use a partition on the 1st disk?

I reported the same kind of problem earlier too. A 2.4 domU is really slow (1/3 the speed of a 2.6 dom0); a 2.6 domU is faster, but not even close to the speed of a 2.6 dom0. My tests were on top of LVM over software RAID-5.

-- 
Pasi Kärkkäinen
> It looks like there might be a problem where we are not getting a timely
> response back from the dom0 VBD driver that an IO request is complete,
> which limits the number of outstanding requests to a level that cannot
> keep the disk well utilized. If you drive enough outstanding IO requests
> (which can be done either with O_DIRECT and large requests, or with a much
> larger readahead setting for buffered IO), it's not an issue.

Andrew, please could you try this with a 2.4 dom0 and a 2.6 domU.

Thanks,
Ian
Ian Pratt <m+Ian.Pratt <at> cl.cam.ac.uk> writes:
> When booting, I get the following output, which I presume is the default?
>
>   elevator: using anticipatory as default io scheduler

Yes, the output is:

  elevator: using anticipatory as default io scheduler

Peter
Andrew Theurer <habanero <at> us.ibm.com> writes:
> In the domU, can you try setting the readahead size to a much larger value
> using hdparm? Something like hdparm -a 2028, then run dd?

It's Tuesday now, and I am working in the office with my two machines with the Promise controller. The two differ in that one uses IDE disks, while the other, newer one has SATA disks. I have restricted myself to the older computer. It has one disk, a Maxtor 6Y120L0, 120 GB with a 2048 KB cache. On that machine the disk is hde and the exported slice is hde1. The slice is not otherwise in use, and I am running the OS from a loopback file as the root fs.

I have done a

  "dd if=/dev/hde1 of=/dev/null bs=1024k count=1024"

in domU. hdparm told me that the default setup was 256k readahead. I have tested the performance with the following readahead settings:

readahead    | duration
128 sectors  | 160 sec
256 sectors  | 76 sec
512 sectors  | 18.5 sec
1024 sectors | 19.5 sec
2048 sectors | 786 sec
1536 sectors | 775 sec
1200 sectors | 457 sec
1000 sectors | 20 sec
800 sectors  | 18.5 sec
600 sectors  | 18.5 sec

dom0 takes 18.0 sec no matter what the readahead setting in Dom0 is.

Peter
Hi,

On Mon, Mar 28, 2005 at 09:14:10PM +0100, Ian Pratt wrote:
> I've reproduced the problem and it's a real issue.
>
> It only affects reads, and is almost certainly down to how the blkback
> driver passes requests down to the actual device.

Two points to look at:

* Block size (file systems normally set this to 4k; the default is 1k).
* Readahead (you need to do it, otherwise you end up issuing tiny
  requests). You can tune it in sysfs.

Regards,
--
Kurt Garloff, Director SUSE Labs, Novell Inc.
> "dd if=/dev/hde1 of=/dev/null bs=1024k count=1024" > > in domU. > > hdparm told that the default setup was 256k readahead.Do you mean KB or sectors?> I have tested the performance with the following readahead settings: > > readahead | duration > 128 sectors | 160 sec > 256 sectors | 76 sec > 512 sectors | 18.5 sec > 1024 sectors | 19.5 sec > 1200 sectors | 457 sec > dom0 takes 18.0 secs no matter of the readahead setting in Dom0 is.Would you mind repeating these experiments with a 2.4 dom0 and a 2.6domU ? The performance cliff below 512 and above 1024 sectors is spectacular. This is all rather confusing, but at least we know it can be made to work fast. Changing the domU readahead is unlikely to be the right fix. We just need to figure out how to keep it on the sweet spot... Thanks, Ian _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> Would you mind repeating these experiments with a 2.4 dom0 and a 2.6 domU?

Also, please could you try exporting the device into the domU as a SCSI device, e.g. sda1, rather than as an IDE device, hde1 or hda1. [Yes, I know this shouldn't make any difference, but I have a suspicion it will.]

Thanks,
Ian
Has anyone looked into using the other schedulers? Potentially noop or deadline for Dom0 with deadline or anticipatory for DomUs, or the other way around: noop/deadline in the DomUs and cfq/deadline in Dom0?

It actually makes some sense that DomU performance would be degraded relative to Dom0's, as the request has to go through the DomU's scheduler (which was typically designed to run asynchronously with some "fairness queuing") which is limited by Xen's BVT scheduler, and then through Dom0's disk I/O scheduler, which is also limited by Xen's BVT scheduler (doesn't it?). Enabling/disabling the preemptible kernel may also shed some light on the situation.

If this becomes an item open to modification for performance reasons, I'd prefer to have Dom0 set the performance of the DomUs. It wouldn't really matter for the moment, but once DomUs get to boot their own kernel (as in hosting services providing Xen'd servers/services where the client can compile their own kernel - which has been talked about), this will become a requirement/feature request.

I was actually going to do some testing in these areas, but my test box (AMD 3000+, water cooled) overheated and fried the northbridge/memory. Oh, the joys of living in the tropics ;-) A new motherboard (upgraded to AMD64) should arrive at the end of the week or early next week, so I can test then if no one else gets around to it.

Regards,
Brian.

On Tue, 2005-03-29 at 09:38, Ian Pratt wrote:
> The performance cliff below 512 and above 1024 sectors is spectacular.
> This is all rather confusing, but at least we know it can be made to
> work fast. Changing the domU readahead is unlikely to be the right fix;
> we just need to figure out how to keep it on the sweet spot...
Ian Pratt <m+Ian.Pratt <at> cl.cam.ac.uk> writes:
> Also, please could you try exporting the device into the domU as a SCSI
> device, e.g. sda1, rather than as an IDE device, hde1 or hda1. [Yes, I
> know this shouldn't make any difference, but I have a suspicion it will.]

Ian,

I will do the tests you asked for, but today is my wife's birthday and I am already at home, so I have no access to my test computers.

I have done some testing with the second, newer host with SATA disks. Changing the readahead had no effect on the reduced throughput there. I do not remember the exact ratio, but I think it was quite similar to the IDE disks with a readahead of 256 sectors. I will report on that tomorrow in a more detailed fashion, and I will do the tests with Linux 2.4 as domU.

Peter
On Tuesday 29 March 2005 02:13, Ian Pratt wrote:
> Andrew, please could you try this with a 2.4 dom0 and a 2.6 domU.

2.4 might take a little while for me, as I am running Fedora Core 3 with udev. If anyone has an easy way to get around the hotplug/udev stuff, then I can do this.

I did run a sequential read on a single disk again (using the noop IO scheduler in both domains) with various request sizes with O_DIRECT while capturing iostat output. The results are interesting. I have included the data in a file because it would just line-wrap and be unreadable in this email. Notice the service commit times for the domU tests: it's like the IO request queue is being plugged for a minimum of 10 ms in dom0. Merges happening for >4K requests in dom0 (while hosting domU's IO) seem to support this.

-Andrew
> 2.4 might take a little while for me, as I am running Fedora Core 3 with
> udev. If anyone has an easy way to get around the hotplug/udev stuff, then
> I can do this.

You can run a populated /dev "underneath" the udev stuff quite happily; e.g. if you boot into FC3 with udev, do:

  cd /dev/
  tar zcpf /root/foo.tgz .

If you can boot from a rescue CD or similar, just mount your FC3 partition and untar the device nodes. Works just fine.

> I did run a sequential read on a single disk again (using the noop IO
> scheduler in both domains) with various request sizes with O_DIRECT while
> capturing iostat output. The results are interesting. Notice the service
> commit times for the domU tests: it's like the IO request queue is being
> plugged for a minimum of 10 ms in dom0. Merges happening for >4K requests
> in dom0 (while hosting domU's IO) seem to support this.

Ah - thanks for this -- will take a detailed look shortly.

cheers,

S.
Hi Ian,

On Tue, Mar 29, 2005 at 07:09:50PM +0100, Ian Pratt wrote:
> We'd really appreciate your help on this, or from someone else at SuSE
> who actually understands the Linux block layer?

I'm Cc'ing Jens ...

> In the 2.6 blkfront driver, what scheduler should we be registering
> with? What should we be setting as max_sectors? Are there other
> parameters we should be setting that we aren't? (block size?)

I think noop is a good choice for secondary domains, as you don't want to be too clever there, otherwise you stack a clever scheduler on top of a clever scheduler. noop basically only does front- and back-merging to make the request sizes larger.

But you probably should initialize the readahead sectors.

Please test the attached patch.

It fixed the problem for me, but my testing was very limited; I only had a small loopback-mounted root fs to test with quickly.

Note that initializing to 256 (128k) would be OK as well (and might be the better default); it seems to be set to 256 (128k) by default, but it's not ... If you explicitly set it to 256, the performance still increases tremendously.

> In the blkback driver that actually issues the IOs in dom0, is there
> something we should be doing to cause IOs to get batched? In 2.4 we used
> a task_queue to push the IO through to the disk having queued it with
> generic_make_request(). In 2.6 we're currently using submit_bio() and
> just hoping that batching happens.

I don't think the blkback driver does anything wrong here.

Regards,
--
Kurt Garloff, Director SUSE Labs, Novell Inc.
On Tuesday 29 March 2005 16:45, Kurt Garloff wrote:
> But you probably should initialize the readahead sectors.
>
> Please test the attached patch.

This should help the case where one is doing buffered IO (so readahead gets used), but for O_DIRECT I still think we will have a problem. On Dom0 I can drive 58 MB/sec of sequential reads with O_DIRECT with just a 32k request size, but on domU with the same request size I can only get ~6 MB/sec.

I am still wondering if something is up with the backend driver. It appears that the backend driver only submits requests to the actual device every 10 ms. With a much larger request size (for O_DIRECT) or a large readahead, 10 ms is often enough to keep the disk streaming data. With smaller request sizes or a small readahead, the disk just doesn't read efficiently.

-Andrew
Hi Andrew,

On Tue, Mar 29, 2005 at 04:59:18PM -0600, Andrew Theurer wrote:
> This should help the case where one is doing buffered IO (so readahead
> gets used), but for O_DIRECT I still think we will have a problem. On
> Dom0 I can drive 58 MB/sec of sequential reads with O_DIRECT with just a
> 32k request size, but on domU with the same request size I can only get
> ~6 MB/sec.

I can't reproduce this. Does this depend on whether your domU root is a loopback-mounted file or a real partition/LVM device?

> I am still wondering if something is up with the backend driver. It
> appears that the backend driver only submits requests to the actual
> device every 10 ms. With a much larger request size (for O_DIRECT) or a
> large readahead, 10 ms is often enough to keep the disk streaming data.
> With smaller request sizes or a small readahead, the disk just doesn't
> read efficiently.

We might have a problem with unplugging then.

Regards,
--
Kurt Garloff, Director SUSE Labs, Novell Inc.
On Tuesday 29 March 2005 17:19, Kurt Garloff wrote:
> I can't reproduce this. Does this depend on whether your domU root is a
> loopback-mounted file or a real partition/LVM device?

I am not sure. What program are you using for O_DIRECT reads? I use a real LVM device for domU root and then another whole disk for the read tests.

> We might have a problem with unplugging then.

That's what I suspect, but I do not know the driver code well enough to say for sure.

-Andrew
On Wed, Mar 30 2005, Kurt Garloff wrote:
> But you probably should initialize the readahead sectors.
>
> Please test the attached patch.
>
> From: Kurt Garloff <garloff@suse.de>
> Subject: Initialize readahead in vbd Q init code
>
> The domU read performance is poor without readahead, so
> better make sure we initialize this value.
>
> Signed-off-by: Kurt Garloff <garloff@suse.de>
>
> Index: linux-2.6.11/drivers/xen/blkfront/vbd.c
> ===================================================================
> --- linux-2.6.11.orig/drivers/xen/blkfront/vbd.c
> +++ linux-2.6.11/drivers/xen/blkfront/vbd.c
> @@ -268,8 +268,11 @@ static struct gendisk *xlvbd_get_gendisk
>          xlbd_blk_queue, BLKIF_MAX_SEGMENTS_PER_REQUEST);
>
>      /* Make sure buffer addresses are sector-aligned. */
>      blk_queue_dma_alignment(xlbd_blk_queue, 511);
> +
> +    /* Set readahead */
> +    blk_queue_max_sectors(xlbd_blk_queue, 512);

This isn't read-ahead, it's the max request size setting. The actual read-ahead setting is in q->backing_dev_info.ra_pages.

There is a helper function for this type of stacking, blk_queue_stack_limits(). You call it after setting up your own queue:

    blk_queue_stack_limits(my_queue, bottom_queue);

I'll check the xen block driver to see if there's anything else that sticks out.

--
Jens Axboe
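[A rough sketch of the stacking approach Jens describes, for a 2.6-era driver that can see the queue of the device it sits on. The function and variable names here are made up for illustration, and this only applies where the lower queue is visible in the same kernel, which is not the case across the domU/dom0 split.]

	#include <linux/blkdev.h>

	/*
	 * Illustrative only: inherit hard limits from the queue underneath
	 * and carry over its readahead setting.
	 */
	static void init_stacked_queue(request_queue_t *my_queue,
	                               request_queue_t *bottom_queue)
	{
		/* Copies max_sectors, segment limits, alignment etc. from below. */
		blk_queue_stack_limits(my_queue, bottom_queue);

		/* The real readahead knob lives in backing_dev_info, in pages. */
		my_queue->backing_dev_info.ra_pages =
			bottom_queue->backing_dev_info.ra_pages;
	}

The point of the distinction is that blk_queue_max_sectors() bounds how large a single request may grow, while ra_pages controls how far the VM reads ahead of the application; only the latter is the "readahead" being tuned with hdparm in the tests above.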
On Wed, Mar 30, 2005 at 12:45:03AM +0200, Kurt Garloff wrote:
> Please test the attached patch.

Delete it; blk_queue_max_sectors() is called a bit above. Adding printk()s now to see what's going on there.

Regards,
--
Kurt Garloff, Director SUSE Labs, Novell Inc.
> I'll check the xen block driver to see if there's anything else that
> sticks out.
>
> Jens Axboe

Jens, I'd really appreciate this.

The blkfront/blkback drivers have rather evolved over time, and I don't think any of the core team fully understand the block-layer differences between 2.4 and 2.6.

There's also some junk left in there from when the backend was in Xen itself, back in the days of 1.2, though Vincent has prepared a patch to clean this up, make 'refreshing' of VBDs work (for size changes), and allow the blkfront driver to import whole disks rather than partitions. We had this functionality on 2.4 but lost it in the move to 2.6.

My bet is that the 2.6 backend is where the true performance bug lies. Using a 2.6 domU blkfront talking to a 2.4 dom0 blkback seems to give good performance under a wide variety of circumstances, while using a 2.6 dom0 is far more pernickety. I agree with Andrew: I suspect it's the work queue changes that are biting us when we don't have many outstanding requests.

Thanks,
Ian
Ian Pratt <m+Ian.Pratt <at> cl.cam.ac.uk> writes:
> My bet is that the 2.6 backend is where the true performance bug lies.
> Using a 2.6 domU blkfront talking to a 2.4 dom0 blkback seems to give good
> performance under a wide variety of circumstances, while using a 2.6 dom0
> is far more pernickety. I agree with Andrew: I suspect it's the work queue
> changes that are biting us when we don't have many outstanding requests.

I have done my simple dd on hde1 with two different settings of readahead: 256 sectors and 512 sectors. These are the iostat results:

DOM0, readahead 512 sectors:
Device:  rrqm/s     wrqm/s  r/s      w/s   rsec/s     wsec/s  rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
hde      115055.40  2.00    592.40   0.80  115647.80  22.40   57823.90  11.20  194.99    2.30      3.88   1.68   99.80
hda      0.00       0.00    0.00     0.00  0.00       0.00    0.00      0.00   0.00      0.00      0.00   0.00   0.00
avg-cpu: %user 0.20  %nice 0.00  %system 31.60  %iowait 14.20  %idle 54.00

DOMU, readahead 512 sectors:
Device:  rrqm/s     wrqm/s  r/s       w/s   rsec/s     wsec/s  rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
hda1     0.00       0.20    0.00      0.00  0.00       3.20    0.00      1.60   0.00      0.00      0.00   0.00   0.00
hde1     102301.40  0.00    11571.00  0.00  113868.80  0.00    56934.40  0.00   9.84      68.45     5.92   0.09   100.00
avg-cpu: %user 0.00  %nice 0.00  %system 35.00  %iowait 65.00  %idle 0.00

DOM0, readahead 256 sectors:
Device:  rrqm/s     wrqm/s  r/s      w/s   rsec/s    wsec/s  rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
hde      28289.20   1.80    126.80   0.40  28416.00  17.60   14208.00  8.80   223.53    1.06      8.32   7.85   99.80
hda      0.00       0.00    0.00     0.00  0.00      0.00    0.00      0.00   0.00      0.00      0.00   0.00   0.00
avg-cpu: %user 0.20  %nice 0.00  %system 1.60  %iowait 5.60  %idle 92.60

DOMU, readahead 256 sectors:
Device:  rrqm/s     wrqm/s  r/s      w/s   rsec/s    wsec/s  rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
hda1     0.00       0.20    0.00     0.40  0.00      4.80    0.00      2.40   12.00     0.00      0.00   0.00   0.00
hde1     25085.60   0.00    3330.40  0.00  28416.00  0.00    14208.00  0.00   8.53      30.54     9.17   0.30   100.00
avg-cpu: %user 0.20  %nice 0.00  %system 1.40  %iowait 98.40  %idle 0.00

What surprises me is that the service time for requests in DOM0 decreases dramatically when readahead is increased from 256 to 512 sectors. If the output of iostat is reliable, it tells me that requests in DOMU are assembled to about 8 to 10 sectors in size, while DOM0 merges them to about 200 sectors or even more. A readahead of 256 sectors results in an average queue size of about 1, while changing the readahead to 512 sectors results in an average queue size of slightly above 2 on DOM0. With a readahead of 256 sectors, the service times in DOM0 are in the range of the typical seek time of a modern IDE disk, while they are significantly lower with a readahead of 512 sectors.

As I have mentioned, this is the system with only one installed disk; that accounts for the write activity on the disk. The two write requests per second go to a different partition, and those result in four required seeks per second. That should not be a reason for all requests to take about a seek time as their service time.

I have done a number of further tests on various systems. In most cases I failed to achieve service times below 8 ms in Dom0; the only counterexample is reported above. It seems to me that at low readahead values the amount of data requested from the disk per request is simply the readahead amount, and such a request takes about one seek time; thus I get lower performance when I work with small readahead values.

What I do not understand at all is why throughput collapses with large readahead sizes. I found in mm/readahead.c that the readahead size for a file is reduced if the readahead is not efficient; I suspect that this mechanism might lead to readahead being switched off for this file. With readahead set to 2048 sectors, the product of avgqu-sz and avgrq-sz reported by iostat drops to 4 to 5 physical pages.

Peter
peter bier wrote:
> I have done my simple dd on hde1 with two different settings of readahead:
> 256 sectors and 512 sectors.

I added a counter, incremented every time the blkback daemon was woken up, and ran the read test in domU. With 32k and 320k request sizes (O_DIRECT), I consistently got 200 wake-ups/second. I expected 100/second, the same interval as the minimum service (svctm) times I am seeing, but either way 200/sec is way too low for small request sizes. I think this confirms the latency issue. Not sure yet why it cannot wake up more frequently.

-Andrew
On Wed, Mar 30 2005, Ian Pratt wrote:
> My bet is that the 2.6 backend is where the true performance bug lies.
> Using a 2.6 domU blkfront talking to a 2.4 dom0 blkback seems to give good
> performance under a wide variety of circumstances, while using a 2.6 dom0
> is far more pernickety. I agree with Andrew: I suspect it's the work queue
> changes that are biting us when we don't have many outstanding requests.

You never schedule the queues you submit the IO against on the 2.6 kernel; you only have a tq_disk run for 2.4 kernels. This basically puts you at the mercy of the timeout unplugging, which is really suboptimal unless you can keep the IO queue of the target busy at all times.

You need to either mark the last bio going to that device as BIO_SYNC, or do a blk_run_queue() on the target queue after having submitted all the IO in this batch for it.

--
Jens Axboe
On Thu, Mar 31 2005, Jens Axboe wrote:
> You need to either mark the last bio going to that device as BIO_SYNC,
> or do a blk_run_queue() on the target queue after having submitted all
> the IO in this batch for it.

Here is a temporary work-around, this should bring you close to 100% performance at the cost of some extra unplugs. Uncompiled.

--- blkback.c~	2005-03-31 09:06:16.000000000 +0200
+++ blkback.c	2005-03-31 09:09:27.000000000 +0200
@@ -481,7 +481,6 @@
     for ( i = 0; i < nr_psegs; i++ )
     {
         struct bio *bio;
-        struct bio_vec *bv;
 
         bio = bio_alloc(GFP_ATOMIC, 1);
         if ( unlikely(bio == NULL) )
@@ -494,17 +493,12 @@
         bio->bi_private = pending_req;
         bio->bi_end_io  = end_block_io_op;
         bio->bi_sector  = phys_seg[i].sector_number;
-        bio->bi_rw      = operation;
 
-        bv = bio_iovec_idx(bio, 0);
-        bv->bv_page   = virt_to_page(MMAP_VADDR(pending_idx, i));
-        bv->bv_len    = phys_seg[i].nr_sects << 9;
-        bv->bv_offset = phys_seg[i].buffer & ~PAGE_MASK;
+        bio_add_page(bio, virt_to_page(MMAP_VADDR(pending_idx, i)),
+                     phys_seg[i].nr_sects << 9,
+                     phys_seg[i].buffer & ~PAGE_MASK);
 
-        bio->bi_size    = bv->bv_len;
-        bio->bi_vcnt++;
-
-        submit_bio(operation, bio);
+        submit_bio(operation | (1 << BIO_RW_SYNC), bio);
     }
 #endif

--
Jens Axboe
On 31 Mar 2005, at 08:10, Jens Axboe wrote:

> Here is a temporary work-around, this should bring you close to 100%
> performance at the cost of some extra unplugs. Uncompiled.

Yep, this does the job for me. Thanks!

Avoiding the extra unplugs is harder than it sounds, as each request in a batch may go to a different request queue. To minimise the number of unplugs per batch we'd need to add code to remember which queues we had used in the current batch, then kick them at the end of the batch. Is there likely to be any measurable benefit from reducing the number of unplugs?

-- Keir
On Thu, Mar 31 2005, Keir Fraser wrote:
> Yep, this does the job for me. Thanks! Avoiding the extra unplugs is
> harder than it sounds, as each request in a batch may go to a different
> request queue. To minimise the number of unplugs per batch we'd need to
> add code to remember which queues we had used in the current batch,
> then kick them at the end of the batch.

Or just keep track of the previous queue: if it has changed, unplug the
previous queue and update the previous-queue variable.

> Is there likely to be any measurable benefit from reducing the number
> of unplugs?

Probably not, since the plugging happened at the front end as well. So
you should get a nice stream of I/O either way.

--
Jens Axboe
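To make the suggestion concrete, here is a minimal, uncompiled sketch of
the "track the previous queue" idea using 2.6-era block-layer names
(request_queue_t, bdev_get_queue(), blk_run_queue()). The helpers
submit_one() and end_of_batch() are invented for illustration; this is
not the patch that was eventually checked in.

/* Remember the last queue we submitted to; kick it whenever the next
 * bio targets a different queue, and once more at the end of the batch. */
static request_queue_t *plugged_queue;

static void submit_one(int operation, struct bio *bio)
{
    request_queue_t *q = bdev_get_queue(bio->bi_bdev);

    if ( q != plugged_queue )
    {
        if ( plugged_queue != NULL )
            blk_run_queue(plugged_queue);   /* unplug the previous queue */
        plugged_queue = q;
    }
    submit_bio(operation, bio);
}

static void end_of_batch(void)
{
    /* Kick whatever queue is still pending at the end of the batch. */
    if ( plugged_queue != NULL )
    {
        blk_run_queue(plugged_queue);
        plugged_queue = NULL;
    }
}

As discussed further down the thread, the backend later moved from
blk_run_queue() to calling the queue's own unplug_fn, but the tracking
logic is the same.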
Rumor has it that on Thu, Mar 31, 2005 at 10:19:01AM +0200 Jens Axboe said:
> On Thu, Mar 31 2005, Keir Fraser wrote:
> > Is there likely to be any measurable benefit from reducing the number
> > of unplugs?
>
> Probably not, since the plugging happened at the front end as well. So
> you should get a nice stream of I/O either way.

This affects merging though, right? I don't think the front end has done
any merging.

Also, the BIO_RW_SYNC bit is sometimes ignored in __make_request due to
the bad queue-locking interactions with scsi_request_fn. The bio can be
completed before the bio_sync() test in __make_request. Since there is no
other reference to the bio, it can be freed and reused by the time it is
tested for BIO_RW_SYNC.

Cheers,

Phil

--
Philip R. Auld, Ph.D.          Egenera, Inc.
Software Architect             165 Forest St.
(508) 858-2628                 Marlboro, MA 01752
Hi,

On Thu, Mar 31, 2005 at 09:33:12AM -0500, Philip R Auld wrote:
> This affects merging though, right? I don't think the front end has
> done any merging.

The noop elevator does front and back merging. My understanding is that
it's used in the frontend driver.

Otherwise, unplugging on every block would indeed be quite bad ...

Regards,
--
Kurt Garloff, Director SUSE Labs, Novell Inc.
On Thu, Mar 31 2005, Kurt Garloff wrote:
> On Thu, Mar 31, 2005 at 09:33:12AM -0500, Philip R Auld wrote:
> > This affects merging though, right? I don't think the front end has
> > done any merging.
>
> The noop elevator does front and back merging. My understanding is that
> it's used in the frontend driver.
>
> Otherwise, unplugging on every block would indeed be quite bad ...

Not necessarily - either your I/O rate is not fast enough to sustain a
substantial queue depth, in which case you get plugging on basically
every I/O anyway; or the I/O rate is high enough to maintain a queue
depth > 1, in which case the plugging will never take place because the
queue never empties.

So all in all, I don't think the temporary work-around is such a bad
idea. I would still rather implement the queue tracking, though; it
should not be more than a few lines of code.

And Philip, I will get the bio_sync() change merged :-)

--
Jens Axboe
On Thu, Mar 31 2005, Jens Axboe wrote:
> So all in all, I don't think the temporary work-around is such a bad
> idea. I would still rather implement the queue tracking, though; it
> should not be more than a few lines of code.

There are still cases where it will be suboptimal, of course; I didn't
intend to claim it will always be as fast as queue tracking! If you are
unlucky enough that the first request reaches the target device and gets
started before the next one, you will have a small and a large part of
any given request executed. This isn't good for performance, naturally.
But queueing is so fast that I would be surprised if this happened much
in the real world.

--
Jens Axboe
On 31 Mar 2005, at 16:39, Jens Axboe wrote:
> So all in all, I don't think the temporary work-around is such a bad
> idea. I would still rather implement the queue tracking, though; it
> should not be more than a few lines of code.

I've checked in something along the lines of what you described into
both the 2.0-testing and the unstable trees. It looks to have identical
performance to the original simple patch, at least for a bulk 'dd'.

 -- Keir
Keir Fraser wrote:
> I've checked in something along the lines of what you described into
> both the 2.0-testing and the unstable trees. It looks to have identical
> performance to the original simple patch, at least for a bulk 'dd'.

I'll do a pull of unstable and see what I get with O_DIRECT, thanks.

-Andrew
Jens Axboe wrote:
> But queueing is so fast that I would be surprised if this happened much
> in the real world.

Although the usual answer to "which scheduling algorithm is best?" is
almost always "it depends on the workload", it was suggested to me that
CFQ was still the best option to go with. What do people feel about
that? (Or is AS going to remain the default?)

Also, we're making the assumption here that guest OS = virtual
driver/device. I would rather we not make that assumption always. This
may be moot because I was also told there might be a patch floating
around (-mm?) that allows you to select the scheduling algorithm on a
per-device basis. Anyone know if this is going to come in anytime soon?

thanks,
Nivedita
Rumor has it that on Thu, Mar 31, 2005 at 05:34:49PM +0200 Kurt Garloff said:
> On Thu, Mar 31, 2005 at 09:33:12AM -0500, Philip R Auld wrote:
> > This affects merging though, right? I don't think the front end has
> > done any merging.
>
> The noop elevator does front and back merging.
> My understanding is that it's used in the frontend driver.

If that is the case, it can only merge things that are machine
contiguous. Current guests know this mapping, but can they get it when
running unmodified with VT-x? My experience showed very little, if any,
multipage I/O coming out of the front end.

> Otherwise, unplugging on every block would indeed be quite bad ...

Seems to be somewhat moot anyway, given the current change planned :)

Cheers,

Phil

--
Philip R. Auld, Ph.D.          Egenera, Inc.
Software Architect             165 Forest St.
(508) 858-2628                 Marlboro, MA 01752
Rumor has it that on Thu, Mar 31, 2005 at 05:39:26PM +0200 Jens Axboe said:
> And Philip, I will get the bio_sync() change merged :-)

Thanks! It's good to be transparent ;)

Phil

--
Philip R. Auld, Ph.D.          Egenera, Inc.
Software Architect             165 Forest St.
(508) 858-2628                 Marlboro, MA 01752
On Thu, Mar 31 2005, Nivedita Singhvi wrote:
> Although the usual answer to "which scheduling algorithm is best?" is
> almost always "it depends on the workload", it was suggested to me that
> CFQ was still the best option to go with. What do people feel about
> that? (Or is AS going to remain the default?)

Really the only one that you should not use is AS; anything else will be
fine. AS should only ever be used at the bottom of the stack, if on a
single-spindle backing. CFQ will be fine, as will deadline and noop.

> Also, we're making the assumption here that guest OS = virtual
> driver/device. I would rather we not make that assumption always. This
> may be moot because I was also told there might be a patch floating
> around (-mm?) that allows you to select the scheduling algorithm on a
> per-device basis. Anyone know if this is going to come in anytime soon?

That patch has been in mainline since 2.6.10. You can change schedulers
by echoing the preferred scheduler to
/sys/block/<device>/queue/scheduler - reading that file will show you
which schedulers are available.

--
Jens Axboe
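For completeness, the per-device switch Jens describes can also be driven
from a program rather than a shell echo. The following is only a small
userspace sketch that writes the scheduler name to that sysfs attribute;
the device "hda" and scheduler "cfq" are example values, not anything
mandated by the thread.

/* Sketch: select the I/O scheduler for a block device via sysfs. */
#include <stdio.h>

static int set_scheduler(const char *dev, const char *sched)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    /* Writing the scheduler name switches the elevator for this queue. */
    fprintf(f, "%s\n", sched);
    fclose(f);
    return 0;
}

int main(void)
{
    if (set_scheduler("hda", "cfq") != 0)
        perror("set_scheduler");
    return 0;
}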
On Thu, Mar 31 2005, Keir Fraser wrote:
> I've checked in something along the lines of what you described into
> both the 2.0-testing and the unstable trees. It looks to have identical
> performance to the original simple patch, at least for a bulk 'dd'.

Can you post the patch here for review? Or just point me somewhere I can
view it.

--
Jens Axboe
> > I've checked in something along the lines of what you described into
> > both the 2.0-testing and the unstable trees. It looks to have
> > identical performance to the original simple patch, at least for a
> > bulk 'dd'.
>
> Can you post the patch here for review? Or just point me somewhere I
> can view it.

Jens,

Thanks for your help on this.

Here's Keir's updated patch:
http://xen.bkbits.net:8080/xen-2.0-testing.bk/gnupatch@424c1abd7LgWMiaskLEEAAX7ffdkXQ

Which is based on this earlier patch from you:
http://xen.bkbits.net:8080/xen-2.0-testing.bk/gnupatch@424bba4091aV1FuNksY_4w_z4Tvr3g

Best,
Ian

diff -Naru a/linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c b/linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c
--- a/linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c	2005-03-31 09:52:27 -08:00
+++ b/linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c	2005-03-31 09:52:27 -08:00
@@ -481,7 +481,6 @@
     for ( i = 0; i < nr_psegs; i++ )
     {
         struct bio *bio;
-        struct bio_vec *bv;
 
         bio = bio_alloc(GFP_ATOMIC, 1);
         if ( unlikely(bio == NULL) )
@@ -494,17 +493,14 @@
         bio->bi_private = pending_req;
         bio->bi_end_io  = end_block_io_op;
         bio->bi_sector  = phys_seg[i].sector_number;
-        bio->bi_rw      = operation;
 
-        bv = bio_iovec_idx(bio, 0);
-        bv->bv_page   = virt_to_page(MMAP_VADDR(pending_idx, i));
-        bv->bv_len    = phys_seg[i].nr_sects << 9;
-        bv->bv_offset = phys_seg[i].buffer & ~PAGE_MASK;
+        bio_add_page(
+            bio,
+            virt_to_page(MMAP_VADDR(pending_idx, i)),
+            phys_seg[i].nr_sects << 9,
+            phys_seg[i].buffer & ~PAGE_MASK);
 
-        bio->bi_size    = bv->bv_len;
-        bio->bi_vcnt++;
-
-        submit_bio(operation, bio);
+        submit_bio(operation | (1 << BIO_RW_SYNC), bio);
     }
 #endif

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/03/31 09:52:16+01:00 kaf24@firebug.cl.cam.ac.uk
#   Backport of Jens blkdev performance patch. I accidentally applied it
#   first to unstable.
#
# linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c
#   2005/03/31 09:52:15+01:00 kaf24@firebug.cl.cam.ac.uk +6 -10
#   Backport of Jens blkdev performance patch. I accidentally applied it
#   first to unstable.
#

diff -Naru a/linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c b/linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c
--- a/linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c	2005-03-31 09:54:46 -08:00
+++ b/linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c	2005-03-31 09:54:46 -08:00
@@ -66,6 +66,19 @@
 
 #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,0)
 static kmem_cache_t *buffer_head_cachep;
+#else
+static request_queue_t *plugged_queue;
+void bdev_put(struct block_device *bdev)
+{
+    request_queue_t *q = plugged_queue;
+    /* We might be giving up last reference to plugged queue. Flush if so. */
+    if ( (q != NULL) &&
+         (q == bdev_get_queue(bdev)) &&
+         (cmpxchg(&plugged_queue, q, NULL) == q) )
+        blk_run_queue(q);
+    /* It's now safe to drop the block device. */
+    blkdev_put(bdev);
+}
 #endif
 
 static int do_block_io_op(blkif_t *blkif, int max_to_do);
@@ -176,9 +189,15 @@
             blkif_put(blkif);
         }
 
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,0)
         /* Push the batch through to disc. */
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,0)
         run_task_queue(&tq_disk);
+#else
+        if ( plugged_queue != NULL )
+        {
+            blk_run_queue(plugged_queue);
+            plugged_queue = NULL;
+        }
 #endif
     }
 }
@@ -481,6 +500,7 @@
     for ( i = 0; i < nr_psegs; i++ )
     {
         struct bio *bio;
+        request_queue_t *q;
 
         bio = bio_alloc(GFP_ATOMIC, 1);
         if ( unlikely(bio == NULL) )
@@ -500,7 +520,14 @@
             phys_seg[i].nr_sects << 9,
             phys_seg[i].buffer & ~PAGE_MASK);
 
-        submit_bio(operation | (1 << BIO_RW_SYNC), bio);
+        if ( (q = bdev_get_queue(bio->bi_bdev)) != plugged_queue )
+        {
+            if ( plugged_queue != NULL )
+                blk_run_queue(plugged_queue);
+            plugged_queue = q;
+        }
+
+        submit_bio(operation, bio);
     }
 #endif

diff -Naru a/linux-2.6.11-xen-sparse/drivers/xen/blkback/common.h b/linux-2.6.11-xen-sparse/drivers/xen/blkback/common.h
--- a/linux-2.6.11-xen-sparse/drivers/xen/blkback/common.h	2005-03-31 09:54:46 -08:00
+++ b/linux-2.6.11-xen-sparse/drivers/xen/blkback/common.h	2005-03-31 09:54:46 -08:00
@@ -30,8 +30,10 @@
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
 typedef struct rb_root rb_root_t;
 typedef struct rb_node rb_node_t;
+extern void bdev_put(struct block_device *bdev);
 #else
 struct block_device;
+#define bdev_put(_b) ((void)0)
 #endif
 
 typedef struct blkif_st {

diff -Naru a/linux-2.6.11-xen-sparse/drivers/xen/blkback/vbd.c b/linux-2.6.11-xen-sparse/drivers/xen/blkback/vbd.c
--- a/linux-2.6.11-xen-sparse/drivers/xen/blkback/vbd.c	2005-03-31 09:54:46 -08:00
+++ b/linux-2.6.11-xen-sparse/drivers/xen/blkback/vbd.c	2005-03-31 09:54:46 -08:00
@@ -150,7 +150,7 @@
     {
         DPRINTK("vbd_grow: device %08x doesn't exist.\n", x->extent.device);
         grow->status = BLKIF_BE_STATUS_EXTENT_NOT_FOUND;
-        blkdev_put(x->bdev);
+        bdev_put(x->bdev);
         goto out;
     }
 
@@ -255,7 +255,7 @@
     *px = x->next; /* ATOMIC: no need for vbd_lock. */
 
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
-    blkdev_put(x->bdev);
+    bdev_put(x->bdev);
 #endif
 
     kfree(x);
@@ -307,7 +307,7 @@
     {
         t = x->next;
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
-        blkdev_put(x->bdev);
+        bdev_put(x->bdev);
 #endif
         kfree(x);
         x = t;
@@ -335,7 +335,7 @@
     {
         t = x->next;
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
-        blkdev_put(x->bdev);
+        bdev_put(x->bdev);
 #endif
         kfree(x);
         x = t;

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/03/31 16:43:57+01:00 kaf24@firebug.cl.cam.ac.uk
#   Backport of batched request_queue unplugging in blkback driver.
#   Signed-off-by: Keir Fraser <keir@xensource.com>
#
# linux-2.6.11-xen-sparse/drivers/xen/blkback/blkback.c
#   2005/03/31 16:43:56+01:00 kaf24@firebug.cl.cam.ac.uk +29 -2
#   Backport of batched request_queue unplugging in blkback driver.
#   Signed-off-by: Keir Fraser <keir@xensource.com>
#
# linux-2.6.11-xen-sparse/drivers/xen/blkback/common.h
#   2005/03/31 16:43:56+01:00 kaf24@firebug.cl.cam.ac.uk +2 -0
#   Backport of batched request_queue unplugging in blkback driver.
#   Signed-off-by: Keir Fraser <keir@xensource.com>
#
# linux-2.6.11-xen-sparse/drivers/xen/blkback/vbd.c
#   2005/03/31 16:43:56+01:00 kaf24@firebug.cl.cam.ac.uk +4 -4
#   Backport of batched request_queue unplugging in blkback driver.
#   Signed-off-by: Keir Fraser <keir@xensource.com>
#
On Thu, Mar 31 2005, Philip R Auld wrote:
> > > This affects merging though, right? I don't think the front end
> > > has done any merging.
> >
> > The noop elevator does front and back merging.
> > My understanding is that it's used in the frontend driver.
>
> If that is the case, it can only merge things that are machine
> contiguous. Current guests know this mapping, but can they get it when
> running unmodified with VT-x?
>
> My experience showed very little, if any, multipage I/O coming out of
> the front end.

There aren't that many users of multipage I/O yet. Direct I/O will use
it, and ext2 will as well; IIRC, -mm has patches for ext3 too. So it's
definitely improving :-)

--
Jens Axboe
On Thu, Mar 31 2005, Ian Pratt wrote:
> Here's Keir's updated patch:
> http://xen.bkbits.net:8080/xen-2.0-testing.bk/gnupatch@424c1abd7LgWMiaskLEEAAX7ffdkXQ
>
> Which is based on this earlier patch from you:
> http://xen.bkbits.net:8080/xen-2.0-testing.bk/gnupatch@424bba4091aV1FuNksY_4w_z4Tvr3g

I cannot immediately see if you call bdev_put() right after queueing the
I/O? If so, I think the patch looks fine. If not, you are missing the
last unplug :-)

--
Jens Axboe
Hi Niv,

On Thu, Mar 31, 2005 at 08:27:30AM -0800, Nivedita Singhvi wrote:
> Although the usual answer to "which scheduling algorithm is best?" is
> almost always "it depends on the workload", it was suggested to me that
> CFQ was still the best option to go with. What do people feel about
> that? (Or is AS going to remain the default?)

This is a different discussion. But, yes, I would agree that CFQ (v3) is
the best default choice.

Jens, should we maybe make sure that the blockback driver uses different
(fake) UIDs for the domains that it serves, to provide fairness between
them? The next step would be to allow tweaking of I/O priorities. Or, to
make it more general, add a parameter (call it uid) that a block driver
can pass down to the I/O scheduler and that would normally be
current->uid but may be set differently?

> Also, we're making the assumption here that guest OS = virtual
> driver/device. I would rather we not make that assumption always. This
> may be moot because I was also told there might be a patch floating
> around (-mm?) that allows you to select the scheduling algorithm on a
> per-device basis.

It's part of 2.6.11:

garloff@tpkurt:~ [0]$ cat /sys/block/hda/queue/scheduler
noop anticipatory deadline [cfq]

Regards,
--
Kurt Garloff, Director SUSE Labs, Novell Inc.
Rumor has it that on Thu, Mar 31, 2005 at 08:01:52PM +0200 Jens Axboe said:
> On Thu, Mar 31 2005, Philip R Auld wrote:
> > My experience showed very little, if any, multipage I/O coming out
> > of the front end.
>
> There aren't that many users of multipage I/O yet. Direct I/O will use
> it, and ext2 will as well; IIRC, -mm has patches for ext3 too. So it's
> definitely improving :-)

Sorry, I was being sloppy with terminology :)

What I was getting at was that the backend will split requests up and
issue each physical segment as a separate bio (at least in the 2.0.5
tree I have in front of me), and that none of these physical segments
was more than one page.

So the request merging in the backend OS is important, no?

Cheers,

Phil

--
Philip R. Auld, Ph.D.          Egenera, Inc.
Software Architect             165 Forest St.
(508) 858-2628                 Marlboro, MA 01752
On 31 Mar 2005, at 19:04, Jens Axboe wrote:
> I cannot immediately see if you call bdev_put() right after queueing
> the I/O? If so, I think the patch looks fine. If not, you are missing
> the last unplug :-)

That's not the job of bdev_put(): the final unplug is done at the end of
blkio_schedule -- the same place that I do a run_task_queue() when
compiling for Linux 2.4.

Cheers,
Keir
> What I was getting at was that the backend will split requests up and
> issue each physical segment as a separate bio (at least in the 2.0.5
> tree I have in front of me), and that none of these physical segments
> was more than one page.
>
> So the request merging in the backend OS is important, no?

Ah, this reminds me I have one more question for Jens.

Since all the bios that I queue up in a single invocation of
dispatch_rw_block_io() will actually be adjacent to each other (because
they're all from the same scatter-gather list), can I actually do
something like (very roughly):

  bio = bio_alloc(GFP_KERNEL, nr_psegs);
  for ( i = 0; i < nr_psegs; i++ )
      bio_add_page(bio, blah...);
  submit_bio(operation, bio);

Each of the biovecs that I queue may not be a full page in size (but
won't straddle a page boundary, of course). This would avoid the bios
having to be merged again later.

 -- Keir
On 31 Mar 2005, at 20:07, Keir Fraser wrote:
> Since all the bios that I queue up in a single invocation of
> dispatch_rw_block_io() will actually be adjacent to each other
> (because they're all from the same scatter-gather list)

I should add: I know that the code makes it look like each s-g element
might map somewhere entirely different from the previous one, but we no
longer support that mode of operation. Each VBD now always maps onto a
single, entire block device or partition.

 -- Keir
On Thu, Mar 31 2005, Keir Fraser wrote:
> Since all the bios that I queue up in a single invocation of
> dispatch_rw_block_io() will actually be adjacent to each other
> (because they're all from the same scatter-gather list), can I
> actually do something like (very roughly):
>
>   bio = bio_alloc(GFP_KERNEL, nr_psegs);
>   for ( i = 0; i < nr_psegs; i++ )
>       bio_add_page(bio, blah...);
>   submit_bio(operation, bio);
>
> Each of the biovecs that I queue may not be a full page in size (but
> won't straddle a page boundary, of course).

Yes, this is precisely what you should do; the current method is pretty
suboptimal. Basically, allocate a bio with nr_psegs bio_vecs and call
bio_add_page() for each page until it returns _less_ than the number of
bytes you requested. When it does that, submit that bio for I/O and
allocate a new bio with nr_psegs - submitted_segs bio_vecs attached.
Continue until you are done.

--
Jens Axboe
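For illustration, here is an uncompiled sketch of the batching loop Jens
describes, using the names that appear in the earlier patches (nr_psegs,
phys_seg[], pending_req, pending_idx, MMAP_VADDR(), end_block_io_op());
the bdev field on phys_seg is assumed purely for the sketch, and this is
not the code that went into the tree.

/* Pack as many physical segments as possible into each bio; start a new
 * bio only when bio_add_page() refuses to take the next segment. */
int i = 0;

while ( i < nr_psegs )
{
    /* Room for all remaining segments; bio_add_page() decides how many
     * actually fit in this bio. */
    struct bio *bio = bio_alloc(GFP_ATOMIC, nr_psegs - i);

    bio->bi_bdev    = phys_seg[i].bdev;
    bio->bi_private = pending_req;
    bio->bi_end_io  = end_block_io_op;
    bio->bi_sector  = phys_seg[i].sector_number;

    while ( i < nr_psegs )
    {
        unsigned int len = phys_seg[i].nr_sects << 9;

        /* Stop growing this bio as soon as less than 'len' is accepted;
         * the first add into a fresh bio is assumed to succeed. */
        if ( bio_add_page(bio, virt_to_page(MMAP_VADDR(pending_idx, i)),
                          len, phys_seg[i].buffer & ~PAGE_MASK) < len )
            break;
        i++;
    }

    submit_bio(operation, bio);   /* unplugging is handled separately */
}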
On Thu, Mar 31 2005, Philip R Auld wrote:
> What I was getting at was that the backend will split requests up and
> issue each physical segment as a separate bio (at least in the 2.0.5
> tree I have in front of me), and that none of these physical segments
> was more than one page.
>
> So the request merging in the backend OS is important, no?

I suppose it always is, since the merge criteria may have changed from
when the I/O was initially queued. If requests are always split into
single pages, then it becomes very important to merge at the backend.

--
Jens Axboe
On Thu, Mar 31 2005, Keir Fraser wrote:
> That's not the job of bdev_put(): the final unplug is done at the end
> of blkio_schedule -- the same place that I do a run_task_queue() when
> compiling for Linux 2.4.

Thanks for confirming, that sounds fine.

--
Jens Axboe
On Thursday 31 March 2005 11:55, Ian Pratt wrote:
> Thanks for your help on this.

BTW, I am now getting this with xen-unstable:

Process xenblkd (pid: 730, threadinfo=f7cc4000 task=f7c42510)
Stack: c022d172 f44b1a08 f363c6f0 f7cc4000 c046d40c c02849f8 f44b1a08 00000010
       00000000 f7c42510 c0115b0a 00000000 00000000 f7c42510 c17f1e48 c01092e6
       00000000 f7c42510 c0115b0a 00100100 00200200 00000000 00000000 00000000
Call Trace:
 [<c022d172>] blk_run_queue+0x38/0x91
 [<c02849f8>] blkio_schedule+0x126/0x149
 [<c0115b0a>] default_wake_function+0x0/0x12
 [<c01092e6>] ret_from_fork+0x6/0x1c
 [<c0115b0a>] default_wake_function+0x0/0x12
 [<c02848d2>] blkio_schedule+0x0/0x149
 [<c0107571>] kernel_thread_helper+0x5/0xb
Code: Bad EIP value.
 <6>note: xenblkd[730] exited with preempt_count 1
On 31 Mar 2005, at 21:49, Andrew Theurer wrote:
> BTW, I am now getting this with xen-unstable:
>
> Process xenblkd (pid: 730, threadinfo=f7cc4000 task=f7c42510)
> Call Trace:
>  [<c022d172>] blk_run_queue+0x38/0x91
>  [<c02849f8>] blkio_schedule+0x126/0x149

I wonder if blk_run_queue() is not the right thing to call. For example,
it ignores whether the queue has been forcibly stopped by the underlying
driver and doesn't check whether there are any requests that actually
require pushing. Plus, various drivers (swraid and probably lvm) have
their own unplug function, and blk_run_queue doesn't handle that.

Could you try again, but replace calls to blk_run_queue(plugged_queue)
in blkback.c with:

  if ( plugged_queue->unplug_fn )
      plugged_queue->unplug_fn(plugged_queue);

This looks like a better match with what various other drivers do (e.g.
swraid).

Thanks,
Keir
> Could you try again, but replace calls to blk_run_queue(plugged_queue)
> in blkback.c with:
>
>   if ( plugged_queue->unplug_fn )
>       plugged_queue->unplug_fn(plugged_queue);
>
> This looks like a better match with what various other drivers do
> (e.g. swraid).

Sorry, it looks like I did not get you the whole output earlier, but I
can try what you suggested:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
00000000
*pde = ma 00000000 pa 55555000
 [<c02691f1>] blkio_schedule+0x126/0x14c
 [<c01147b5>] default_wake_function+0x0/0x12
 [<c01147b5>] default_wake_function+0x0/0x12
 [<c02690cb>] blkio_schedule+0x0/0x14c
 [<c01071f1>] kernel_thread_helper+0x5/0xb
Oops: 0000 [#1]
Modules linked in: ipt_MASQUERADE iptable_nat ip_conntrack ip_tables qla2300 qla2xxx scsi_transport_fc mptscsih mptbase
CPU:    0
EIP:    0061:[<00000000>]    Not tainted VLI
EFLAGS: 00010282   (2.6.11-xen0-up)
EIP is at 0x0
eax: 00000000   ebx: f5ec9b70   ecx: f3e13b64   edx: 00000000
esi: 00000000   edi: c044540c   ebp: f7d8dfc0   esp: f7d8df84
ds: 007b   es: 007b   ss: 0069
Process xenblkd (pid: 730, threadinfo=f7d8c000 task=f7d43020)
Stack: c0217f7d f5ec9b70 f3fee6f0 f7d8c000 c02691f1 f5ec9b70 00000010 00000000
       f7d43020 c01147b5 00000000 00000000 fbffc000 00000000 f7d8c000 00000000
       f7d43020 c01147b5 00100100 00200200 00000000 00000000 00000000 c02690cb
Call Trace:
 [<c0217f7d>] blk_run_queue+0x24/0x47
 [<c02691f1>] blkio_schedule+0x126/0x14c
 [<c01147b5>] default_wake_function+0x0/0x12
 [<c01147b5>] default_wake_function+0x0/0x12
 [<c02690cb>] blkio_schedule+0x0/0x14c
 [<c01071f1>] kernel_thread_helper+0x5/0xb
Code: Bad EIP value.
> Could you try again, but replace calls to blk_run_queue(plugged_queue)
> in blkback.c with:
>
>   if ( plugged_queue->unplug_fn )
>       plugged_queue->unplug_fn(plugged_queue);
>
> This looks like a better match with what various other drivers do
> (e.g. swraid).

This patch is required to make it work with LVM. 2.0-testing and
unstable will be updated shortly...

Ian
Kurt Garloff wrote:
> Hi Niv,
>
> On Thu, Mar 31, 2005 at 08:27:30AM -0800, Nivedita Singhvi wrote:
> > Although the usual answer to "which scheduling algorithm is best?" is
> > almost always "it depends on the workload", it was suggested to me
> > that CFQ was still the best option to go with. What do people feel
> > about that? (Or is AS going to remain the default?)
>
> This is a different discussion.

Yes, I did change the subject a little ;).

> But, yes, I would agree that CFQ (v3) is the best default choice.

Yep, even though some of the complications in the Xen environment (as
you point out below) will have to be addressed.

> Jens, should we maybe make sure that the blockback driver uses
> different (fake) UIDs for the domains that it serves, to provide
> fairness between them? The next step would be to allow tweaking of I/O
> priorities.

> It's part of 2.6.11:
>
> garloff@tpkurt:~ [0]$ cat /sys/block/hda/queue/scheduler
> noop anticipatory deadline [cfq]

I just saw Jens' reply as well. This is much goodness :). Very handy
indeed!

thanks,
Nivedita
On Thursday 31 March 2005 15:32, Ian Pratt wrote:
> > Could you try again, but replace calls to
> > blk_run_queue(plugged_queue) in blkback.c with:
> >
> >   if ( plugged_queue->unplug_fn )
> >       plugged_queue->unplug_fn(plugged_queue);

OK, the changes worked for me, but I still see some added latency here
(much better than before, though):

        reqsze      MB/sec    svcmt

xenU      16k      6266.67     1.25
          32k     12618.67     1.20
          64k     25002.67     1.28
         128k     49322.67     1.35
         256k     58538.67     3.15

xen0      16k     13818.67     1.15
          32k     27573.33     1.16
          64k     54784.00     1.16
         128k     58581.33     2.18
         256k     58453.33     4.38

noXen     16k     58679.19     0.27
          32k     58453.33     0.54
          64k     58713.04     1.08
         128k     58174.09     2.17
         256k     58820.07     4.36
>         reqsze      MB/sec    svcmt
>
> xen0      16k     13818.67     1.15
>           32k     27573.33     1.16
>           64k     54784.00     1.16
>          128k     58581.33     2.18
>          256k     58453.33     4.38

These figures for xen0 are interesting. It's odd that we tail off so
badly for short requests. What interrupt rates are occurring when you do
these tests?

Thanks,
Ian
On Thursday 31 March 2005 16:36, Ian Pratt wrote:
> These figures for xen0 are interesting. It's odd that we tail off so
> badly for short requests. What interrupt rates are occurring when you
> do these tests?

I just ran again, and for some reason it looks fine now... I have no
idea what I did to get the lower numbers initially; perhaps an
inadvertent I/O scheduler change. Service commit times are 0.28 ms and I
can drive ~58 MB/sec with just 16k requests on xen0. I'll do some more
tests to get a more consistent picture.

-Andrew
On Thu, Mar 31 2005, Keir Fraser wrote:
> I wonder if blk_run_queue() is not the right thing to call. For
> example, it ignores whether the queue has been forcibly stopped by the
> underlying driver and doesn't check whether there are any requests
> that actually require pushing. Plus, various drivers (swraid and
> probably lvm) have their own unplug function, and blk_run_queue
> doesn't handle that.
>
> Could you try again, but replace calls to blk_run_queue(plugged_queue)
> in blkback.c with:
>
>   if ( plugged_queue->unplug_fn )
>       plugged_queue->unplug_fn(plugged_queue);
>
> This looks like a better match with what various other drivers do
> (e.g. swraid).

Yes, you are right, you really want to just unplug it. That should work
correctly in all cases. Remember that ->unplug_fn must not be called
with any locks held.

--
Jens Axboe
Ian Pratt <m+Ian.Pratt <at> cl.cam.ac.uk> writes:
> Here's Keir's updated patch:
> http://xen.bkbits.net:8080/xen-2.0-testing.bk/gnupatch <at> 424c1abd7LgWMiaskLEEAAX7ffdkXQ
>
> Which is based on this earlier patch from you:
> http://xen.bkbits.net:8080/xen-2.0-testing.bk/gnupatch <at> 424bba4091aV1FuNksY_4w_z4Tvr3g

I have applied the patch in blkback.c for xen0 and have gotten good
results now. I have tested two systems, one with a standard IDE disk and
another with two SATA disks.

I stumbled over this issue when I was doing filesystem I/O and wanted to
check the efficiency of xen-linux. It was then that I went to raw I/O on
block devices and found that it didn't perform as I hoped.

Now I have switched back to the filesystem operations. I do this by
copying a "/usr" subtree from a Slackware 10.0 installation containing
about 750 MB in 2200 directories and 37000 files. Copying these files
with the target directory on the same device as the source directory, I
get between 90 and 93% of the Dom0 performance when I work in DomU. When
copying from a directory on one device into a directory on another
device, DomU performance lags further behind Dom0: it is only 50 to 60
percent of the Dom0 performance, and less than it is when using only one
disk. I found that the sum of the busy percentages of the two disks, as
reported by iostat in Dom0, is always slightly above 100%. Does this
reflect that the reading and the writing both go through the VBD driver?
Neither device is ever 100% busy.

Any explanations?

Thanks in advance

Peter
> Now I have switched back to the filesystem operations. I do this by
> copying a "/usr" subtree from a Slackware 10.0 installation containing
> about 750 MB in 2200 directories and 37000 files. When copying from a
> directory on one device into a directory on another device, DomU
> performance lags further behind Dom0: it is only 50 to 60 percent of
> the Dom0 performance, and less than it is when using only one disk. I
> found that the sum of the busy percentages of the two disks, as
> reported by iostat in Dom0, is always slightly above 100%. Does this
> reflect that the reading and the writing both go through the VBD
> driver? Neither device is ever 100% busy.

The latest 2.0-testing tree has some further blk queue plugging
enhancements, along with a fix for another nasty performance bug. It
would be interesting to know whether that improves things.

It's possible that the blkring currently just isn't big enough if you're
trying to drive multiple devices with independent requests.

Ian
> I just ran again, and for some reason it looks fine now... I have no
> idea what I did to get the lower numbers initially; perhaps an
> inadvertent I/O scheduler change. Service commit times are 0.28 ms and
> I can drive ~58 MB/sec with just 16k requests on xen0. I'll do some
> more tests to get a more consistent picture.

I still experience bad performance in domU with the latest xen-testing
dom0. Here's my setup:

Xen       : 2.0.5
Dom0      : 2.6.11-xen-testing (20050401 ~22h CEST) running Debian Sarge
DomU      : 2.6.10-xen-2.0.5 (8G LVM-backed VBDs exported as hda1) running Gentoo
Processor : AthlonXP 1800+
Chipset   : VIA KT600
Drive     : Seagate ST380013AS 80G SATA

And my results:

Dom0 : 51 MB/s
DomU : 36 MB/s

I've tried with request sizes from 128k to 1024k, reading the entire
volume, and always obtained the same results. Changing the scheduler on
Dom0 and/or DomU doesn't change anything.

I can give you more info if needed.

--
Cédric Schieli
There have been some changes to the frontend driver too: you might want
to try using the 2.0-testing kernel in domU as well.

Also, a really nasty CPU performance bug got fixed earlier this evening,
so you should make sure you have the latest tree.

Ian

> I still experience bad performance in domU with the latest xen-testing
> dom0.
>
> Dom0 : 51 MB/s
> DomU : 36 MB/s
I've just tried with the latest testing tree Dom0 and DomU and got the
same results.

On Saturday 02 April 2005 at 00:22 +0100, Ian Pratt wrote:
> There have been some changes to the frontend driver too: you might
> want to try using the 2.0-testing kernel in domU as well.
>
> Also, a really nasty CPU performance bug got fixed earlier this
> evening, so you should make sure you have the latest tree.
> > > Xen       : 2.0.5
> > > Dom0      : 2.6.11-xen-testing (20050401 ~22h CEST) running Debian Sarge
> > > DomU      : 2.6.10-xen-2.0.5 (8G LVM-backed VBDs exported as hda1) running Gentoo
> > > Processor : AthlonXP 1800+
> > > Chipset   : VIA KT600
> > > Drive     : Seagate ST380013AS 80G SATA
> > >
> > > And my results:
> > >
> > > Dom0 : 51 MB/s
> > > DomU : 36 MB/s
> > >
> > > I've tried with request sizes from 128k to 1024k, reading the
> > > entire volume, and always obtained the same results. Changing the
> > > scheduler on Dom0 and/or DomU doesn't change anything.

Are you sure you're reading from the exact same part of the disk in both
instances? How are you doing the bandwidth measurements? 'dd'?

Ian
> Are you sure you're reading from the exact same part of the disk in
> both instances? How are you doing the bandwidth measurements? 'dd'?

I have this line in my DomU config:

disk = [ 'phy:vg/gentoo-root,hda1,w', 'phy:vg/gentoo-swap,hda2,w' ]

I make my measurements with:

Dom0 : dd if=/dev/vg/gentoo-root of=/dev/null bs={128|256|...}k
DomU : dd if=/dev/hda1 of=/dev/null bs={128|256|...}k

In all cases I get the same results: 50-52 MB/s on Dom0, 34-37 MB/s on
DomU. I've tried every combination of schedulers.

I will try with the latest xen-testing hypervisor (I still use 2.0.5 for
the moment), but I don't think this should have much impact.
Cédric Schieli <cedric <at> schieli.dyndns.org> writes:
> I've just tried with the latest testing tree Dom0 and DomU and got the
> same results.

I just stumbled across the fact that you are using a SATA disk, Cédric.
This is exactly the "dd" behaviour that my system containing SATA disks
still shows, but it applies only to "dd" (which, admittedly, is
read-only). It does not apply to the performance figures I got when
copying my "/usr" tree - as described in a previous post here - from one
location on the disk to another location on the same disk (which, of
course, is combined read-write on the same device). Hence it is possible
that my limited performance copying from one disk to another is in fact
an effect of reduced read performance in DomU on a SATA disk.

I suspect that this might be an effect specific to SATA disks. I will
verify this on Monday - when I have access to my computers in the office
- by repeating it on a system with two IDE disks. I will report back
then, if your problem is still open, and will describe the exact
configuration of the systems (motherboard, I/O controller, etc.).

Peter
I can confirm the problem only occurs on SATA. I've added an old IDE
UDMA66 drive, created an LVM volume from it and ran the same dd tests:

Dom0 : 12 MB/s
DomU : 12 MB/s

> I suspect that this might be an effect specific to SATA disks. I will
> verify this on Monday - when I have access to my computers in the
> office - by repeating it on a system with two IDE disks.
> I can confirm the problem only occurs on SATA. I've added an old IDE
> UDMA66 drive, created an LVM volume from it and ran the same dd tests:
>
> Dom0 : 12 MB/s
> DomU : 12 MB/s

SATA works fine for me on 2.0-testing. I get 50 MB/s reading from a raw
partition in both cases using:

time dd if=/dev/sda6 of=/dev/null bs=1024k count=1024

Can you try a raw partition rather than LVM?

Thanks,
Ian
> SATA works fine for me on 2.0-testing. I get 50 MB/s reading from a
> raw partition in both cases using:
>
> time dd if=/dev/sda6 of=/dev/null bs=1024k count=1024

I've tried with a raw partition (the same one that holds the LVM volume)
and got the same results: 51 MB/s on Dom0 and 37 MB/s on DomU.

I don't know if it is of importance, but I need to add
ignorebiostables=1 to my boot parameters in order to make the SATA work
(the kernel hangs on drive detection without it). The SATA controller is
a VIA one.

Cédric Schieli
> > SATA works fine for me on 2.0-testing. I get 50 MB/s reading from a
> > raw partition in both cases using:
> > time dd if=/dev/sda6 of=/dev/null bs=1024k count=1024
>
> I've tried with a raw partition (the same one that holds the LVM
> volume) and got the same results: 51 MB/s on Dom0 and 37 MB/s on DomU.
>
> I don't know if it is of importance, but I need to add
> ignorebiostables=1 to my boot parameters in order to make the SATA
> work (the kernel hangs on drive detection without it). The SATA
> controller is a VIA one.

It doesn't sound like Xen is too happy on your system, but it's not
clear how this would explain the performance difference between dom0 and
domU.

When the IOAPIC patches are checked in it will be interesting to see
whether this fixes it. Try the unstable tree in a week or so.

Best,
Ian
Some simple, non-scientific additions to the performance numbers.
IBM x335 / MPT SCSI.

Previously, on 2.0.5/Testing with 2.6.10:

[nic@stateless:~/sys/xen] sudo hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads:   2884 MB in 2.00 seconds = 1442.00 MB/sec
 Timing buffered disk reads:  100 MB in 3.05 seconds = 32.79 MB/sec

Not completely happy with a buffered disk read figure of ~33 MB/s. I'm
putting together a new x205 server with Xen later today; I'll try to do
some native vs Xen testing while I'm at it.

dom0:
[nic@stateless:~/tmp] time sudo cp db-svn.tgz db-svn-bak.tgz
real    0m13.058s
user    0m0.030s
sys     0m0.530s

domU:
[nic@base:/export/bak] time sudo cp db-svn.tgz db-svn-bak.tgz
real    0m23.574s
user    0m0.010s
sys     0m0.060s

[nic@stateless:~/tmp] ls -l db-svn.tgz
-rw-r--r--  1 nic nic 188247603 2005-04-04 21:06 db-svn.tgz

With today's 2.0.6/Testing on 2.6.11.6:

[nic@stateless:~] sudo hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads:   2748 MB in 2.00 seconds = 1374.00 MB/sec
 Timing buffered disk reads:  102 MB in 3.00 seconds = 34.00 MB/sec

[nic@stateless:~/tmp] time sudo cp db-svn.tgz db-svn-bak.tgz
real    0m10.468s
user    0m0.010s
sys     0m0.070s

[nic@base:/export/bak] time sudo cp db-svn.tgz db-svn-bak.tgz
real    0m11.243s
user    0m0.000s
sys     0m0.040s

Both filesystems are based on XFS/LVM2. These numbers are from one run
right after boot, with (in the domU case) just one domU running.

So, a definite improvement.

Nicholas
I am sorry to return to this issue after quite a long interruption.

As I mentioned in an earlier post, I came across this problem when I was
testing filesystem performance. After the problems with raw sequential
I/O appeared to have been fixed in the testing release, I turned back to
my original problem.

I did a simple test that, despite its simplicity, seems to put the I/O
subsystem under considerable stress. I took the /usr tree of my system
and copied it five times into different directories on a slice of
disk 1. This tree consists of 36000 files holding about 750 MB of data.
Then I started to copy each of these copies recursively onto disk 2
(each to its own location on that disk, of course). I ran these copies
in parallel; the processes took about 6 to 7 minutes in Dom0, while they
needed between 14.6 and 15.9 minutes in DomU.

Essentially, this means that under this heavy I/O load I am back to the
40% ratio between DomU and Dom0 I/O performance that I initially
reported. This may just be coincidence, but it is probably worth
mentioning.

I monitored the disk and block-I/O activity with iostat. The full output
is too large to post here, so I will only include a few representative
lines. The first snapshot shows the activity while copying in DomU,
during a phase of relatively high throughput (DomU):

Device:  rrqm/s   wrqm/s     r/s    w/s    rsec/s    wsec/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
hde        0.00  2748.00    1.60  71.20     12.80  22561.60      6.40  11280.80    310.09      1.78  23.96   4.73  34.40
hdg     2571.00     5.00  126.80   9.60  21580.80    115.20  10790.40     57.60    159.06      5.48  40.38   6.61  90.20

avg-cpu:  %user   %nice  %system  %iowait   %idle
           0.20    0.00     6.20     0.20   93.40

This is a snapshot of a phase with relatively low throughput (DomU):

Device:  rrqm/s   wrqm/s     r/s    w/s    rsec/s    wsec/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
hde        0.00   676.40    0.00  33.00      0.00   5678.40      0.00   2839.20    172.07      1.76  53.45   4.91  16.20
hdg      335.80    11.00  315.00   3.40   5206.40    115.20   2603.20     57.60     16.71      4.15  13.02   2.76  87.80

avg-cpu:  %user   %nice  %system  %iowait   %idle
           0.20    0.00     9.00     0.00   90.80

I suspect that the reported iowait in the CPU usage is not entirely
correct, but I am not sure about it.

The next two snapshots show iostat output during the copying in Dom0.
Again, the first was taken in a phase of relatively high throughput
(Dom0):

Device:  rrqm/s   wrqm/s     r/s    w/s    rsec/s    wsec/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  svctm   %util
hde        0.00  5845.40    1.40  110.20    11.20  47812.80      5.60  23906.40    428.53    105.96  772.63   8.96  100.00
hdg       46.20    24.80  389.80    2.20  47628.80    216.00  23814.40    108.00    122.05      7.12   18.23   3.30  129.40

avg-cpu:  %user   %nice  %system  %iowait   %idle
           2.40    0.00    40.20    57.40    0.00

The next was taken in a phase of relatively low throughput (Dom0):

Device:  rrqm/s   wrqm/s     r/s    w/s    rsec/s    wsec/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  svctm   %util
hde        0.00   903.40    0.20  106.80     3.20   7972.80      1.60   3986.40     74.54     20.77  217.91   4.06   43.40
hdg        0.00    24.00  746.60    1.20   9302.40    200.00   4651.20    100.00     12.71      4.96    6.67   1.34  100.00

avg-cpu:  %user   %nice  %system  %iowait   %idle
           3.40    0.00    44.00    52.60    0.00

The problem seems to be the reading. Device hde, which contains the
slice the data is copied onto, is almost never really busy when the copy
runs in DomU. The ratio of kB/s written to utilisation seems to show
that writing from DomU is just as efficient as writing from Dom0
(writing can be buffered in both cases, after all). Yet the information
on reading shows a different picture.

The block I/O layer merges requests throughout, resulting in request
sizes that are approximately equal in both cases. Yet the service times
for DomU requests are about twice those for Dom0 requests.

I do not know whether such a scenario is simply inadequate for virtual
systems, at least under Xen. We are thinking about running a mail
gateway on top of a protected and secured Dom0 system, and potentially
offering other network services in separate domains. We want to avoid
corruption of Dom0 while being able to offer "insecure" services in
non-privileged domains. We know that mail servicing can potentially put
an intense load onto the filesystem - admittedly more on inodes (create
and delete) than on data throughput.

Do I simply have to accept that, under heavy I/O load, domains using
VBDs to access storage devices will lag behind Dom0 and native Linux
systems, or is there a chance to fix this?

My reported test was done on a Fujitsu-Siemens RX100 system with a
2.0 GHz Celeron CPU and a total of only 256 MB of memory; Dom0 had
128 MB and DomU 100 MB. The disks were plain IDE disks. I did the same
test on a system with 1.25 GB of RAM, with both domains having 0.5 GB of
memory. It contains SATA disks, and the results are essentially the
same; the only difference is that both processes are slower due to lower
random-access throughput from the disks.

Any advice or help?

Thanks in advance

Peter
> I ran these copies in parallel; the processes took about 6 to 7
> minutes in Dom0, while they needed between 14.6 and 15.9 minutes in
> DomU.
>
> Essentially, this means that under this heavy I/O load I am back to
> the 40% ratio between DomU and Dom0 I/O performance that I initially
> reported. This may just be coincidence, but it is probably worth
> mentioning.

It's possible that the dom0 doing prefetch, as well as the domU, is
messing up random I/O performance. Do the iostat numbers suggest dom0 is
reading more data overall when doing it on behalf of a domU?

We'll need a simpler way of reproducing this if any headway is to be
made debugging it. It might be worth writing a program to do
pseudo-random I/O reads to a partition, both in DIRECT and normal mode,
then run it in dom0 and domU.

[Chris: you have such a program already, right? Can you post it, thanks]

Ian
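Along the lines Ian suggests, here is a rough sketch of such a tester
(this is not Chris's program; the device path, block size and read count
are example values only). It reads fixed-size blocks at pseudo-random
offsets from a partition, with or without O_DIRECT, and reports the
achieved bandwidth.

/* Pseudo-random read tester: ./randread /dev/hdb6 [direct] */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define BLKSZ   (64 * 1024)
#define NREADS  1024

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/hdb6";
    int use_direct  = (argc > 2) && (strcmp(argv[2], "direct") == 0);
    int fd = open(dev, O_RDONLY | (use_direct ? O_DIRECT : 0));
    void *buf;
    off_t dev_bytes, off;
    struct timeval t0, t1;
    double secs;
    int i;

    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires a suitably aligned buffer. */
    if (posix_memalign(&buf, 4096, BLKSZ) != 0) return 1;

    dev_bytes = lseek(fd, 0, SEEK_END);
    if (dev_bytes < BLKSZ) { fprintf(stderr, "device too small\n"); return 1; }

    srand(42);                        /* fixed seed => repeatable offsets */

    gettimeofday(&t0, NULL);
    for (i = 0; i < NREADS; i++) {
        /* Pick a random block-aligned offset within the device. */
        off = ((off_t)rand() % (dev_bytes / BLKSZ)) * BLKSZ;
        if (pread(fd, buf, BLKSZ, off) != BLKSZ) { perror("pread"); return 1; }
    }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d x %dKB random reads in %.2fs (%.2f MB/s)\n",
           NREADS, BLKSZ / 1024, secs, NREADS * (BLKSZ / 1048576.0) / secs);

    free(buf);
    close(fd);
    return 0;
}

Running the same binary against the same partition in dom0 and in a domU
(once in normal buffered mode, once with the "direct" argument) should
show whether the gap is in the raw VBD path or in the guest-side caching
and readahead.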