Hi.

I've been running a couple of benchmarks on a Xen-3.0 installation lately. Part of those compared SMP and CMP configurations on a 2x2 Intel Woodcrest (i.e. two sockets). All tests were performed between a UP dom0 (on core 0) and a UP domU, with the domU VCPU pinned to either core 1 (processor 0) or core 3 (processor 1).

Switching from SMP to CMP, netperf -t TCP_STREAM gets me 1686.64 vs. 2673.20 Mbit/s. Lower IPI latency, shared caches: all as one would expect, I believe.

Now, trying the same for block I/O may sound strange, but it can be done: I created a 3 GB ramdisk in dom0 and fed that to domU. Peak throughput with 'hdparm -t' is 759.37 MB/s on SMP.

The fun part (for me; fun is probably a personal thing) is that throughput is higher than with TCP. That may be due to the block layer being much thinner than TCP/IP networking, or to the fact that transfers use the whole 4 KB page size for sequential reads. Possibly some of both; I didn't try. This is not my question.

What strikes me is that for the blkdev interface, the CMP setup is 13% *slower* than SMP, at 661.99 MB/s.

Now, any ideas? I'm mildly familiar with both netback and blkback, and I'd never expected something like that. Any hint appreciated.

Thanks,
Daniel

--
Daniel Stodden
LRR - Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München
D-85748 Garching
http://www.lrr.in.tum.de/~stodden    mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33 3D80 457E 82AE B0D8 735B
> The fun part (for me; fun is probably a personal thing) is that
> throughput is higher than with TCP. That may be due to the block layer
> being much thinner than TCP/IP networking, or to the fact that transfers
> use the whole 4 KB page size for sequential reads. Possibly some of
> both; I didn't try.

The big thing is that on network RX it is currently dom0 that does the copy. In the CMP case this leaves the data in the shared cache, ready to be accessed by the guest. In the SMP case it doesn't help at all. In netchannel2 we're moving the copy to the guest CPU, and trying to eliminate it with smart hardware.

Block IO doesn't require a copy at all.

> This is not my question. What strikes me is that for the blkdev
> interface, the CMP setup is 13% *slower* than SMP, at 661.99 MB/s.
>
> Now, any ideas? I'm mildly familiar with both netback and blkback, and
> I'd never expected something like that. Any hint appreciated.

How stable are your results with hdparm? I've never really trusted it as a benchmarking tool.

The ramdisk isn't going to be able to DMA data into the domU's buffer on a read, so it will have to copy it. The hdparm running in domU probably doesn't actually look at any of the data it requests, so the data stays local to the dom0 CPU's cache (unlike with a real app). Doing all that copying in dom0 is going to beat up the domU in the shared cache in the CMP case, but won't affect it as much in the SMP case.

Ian
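The cache effect Ian describes can be sketched outside Xen. The following is a minimal user-space illustration, not Xen code; the CPU numbers assume the Woodcrest layout from the original post (cores 0 and 1 share an L2, core 3 sits on the other socket). A "dom0" thread pinned to core 0 performs the copy, then a "domU" thread pinned elsewhere reads the result; the consumer's pass should be noticeably cheaper in the shared-cache case.

/* cache_locality.c: user-space sketch of the shared-cache argument,
 * not Xen code.  A "dom0" thread pinned to CPU_COPIER memcpy()s a
 * buffer (the RX copy); a "domU" thread pinned to CPU_CONSUMER then
 * reads it.  With the Woodcrest numbering assumed here, CPU_CONSUMER=1
 * shares the L2 with CPU 0 (the CMP case), CPU_CONSUMER=3 is on the
 * other socket (the SMP case).  Build: gcc -O2 -pthread (add -lrt on
 * older glibc).
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (1 << 20)   /* 1 MB, comfortably inside a 4 MB L2 */
#define CPU_COPIER   0
#define CPU_CONSUMER 1         /* set to 3 to model the SMP case */

static char src[BUF_SIZE], dst[BUF_SIZE];

static void pin_self(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *copier(void *arg)
{
    pin_self(CPU_COPIER);
    memcpy(dst, src, BUF_SIZE);          /* the "dom0 does the copy" step */
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec t0, t1;
    volatile uint64_t sum = 0;

    memset(src, 0xab, BUF_SIZE);

    pthread_create(&t, NULL, copier, NULL);
    pthread_join(t, NULL);               /* copy has now happened on CPU_COPIER */

    pin_self(CPU_CONSUMER);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_SIZE; i += 64)   /* touch one byte per cache line */
        sum += dst[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
              (t1.tv_nsec - t0.tv_nsec);
    printf("consumer pass: %ld ns (checksum %llu)\n",
           ns, (unsigned long long)sum);
    return 0;
}

Only the consumer's pass is timed, mirroring the RX path: whether the copy lands in a cache the consumer shares is what separates the CMP numbers from the SMP ones.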
On Sun, 2008-03-09 at 20:07 +0000, Ian Pratt wrote:

> > The fun part (for me; fun is probably a personal thing) is that
> > throughput is higher than with TCP. That may be due to the block layer
> > being much thinner than TCP/IP networking, or to the fact that
> > transfers use the whole 4 KB page size for sequential reads. Possibly
> > some of both; I didn't try.
>
> The big thing is that on network RX it is currently dom0 that does the
> copy. In the CMP case this leaves the data in the shared cache, ready to
> be accessed by the guest. In the SMP case it doesn't help at all. In
> netchannel2 we're moving the copy to the guest CPU, and trying to
> eliminate it with smart hardware.
>
> Block IO doesn't require a copy at all.

Well, not in blkback by itself, but certainly from the in-memory disk image. Unless I misunderstood Keir's recent post, page flipping is basically dead code, so I thought the numbers should at least point in roughly the same direction.

> > This is not my question. What strikes me is that for the blkdev
> > interface, the CMP setup is 13% *slower* than SMP, at 661.99 MB/s.
> >
> > Now, any ideas? I'm mildly familiar with both netback and blkback, and
> > I'd never expected something like that. Any hint appreciated.
>
> How stable are your results with hdparm? I've never really trusted it as
> a benchmarking tool.

So far, all the experiments I've done look fairly reasonable. The standard deviation is low, and since I've been tracing netback reads I'm fairly confident that the volume hadn't been left in domU memory somewhere.

I'm not so much interested in bio or physical disk performance, but in the relative performance: how much can be squeezed through the buffer ring before and after applying some changes. It's hardly a physical disk benchmark, but it's simple, and for the purpose given it seems okay.

> The ramdisk isn't going to be able to DMA data into the domU's buffer on
> a read, so it will have to copy it.

Right...

> The hdparm running in domU probably doesn't actually look at any of the
> data it requests, so the data stays local to the dom0 CPU's cache
> (unlike with a real app).

hdparm performs sequential 2 MB read()s over a 3 s period. It's not calling the block layer directly or anything like that. That'll certainly hit domU caches?

> Doing all that copying in dom0 is going to beat up the domU in the
> shared cache in the CMP case, but won't affect it as much in the SMP
> case.

Well, I could live with blaming L2 footprint. I just wanted to hear whether someone has a different explanation. And I would expect similar results on net RX then, but I may be mistaken.

Furthermore, I need to apologize: I failed to use netperf correctly and managed to report the TX path in my original post :P. The real numbers are 885.43 (SMP) vs. 1295.46 (CMP) Mbit/s, but the difference compared to blk reads stays the same.

regards,
daniel
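What "sequential 2 MB read()s over a 3 s period" boils down to is roughly the following. This is a rough stand-in for 'hdparm -t', not hdparm itself; the device path is an assumption, and unlike hdparm it does not flush the buffer cache (hdparm issues a flush before timing), so it only approximates the numbers discussed above.

/* read_tput.c: crude approximation of 'hdparm -t'.
 * Sequential 2 MB read()s from a block device for ~3 seconds,
 * reporting MB/s.  Usage: ./read_tput /dev/xvdb   (device name is an
 * assumption; use whatever the ramdisk-backed VBD appears as in domU).
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK   (2 * 1024 * 1024)
#define SECONDS 3.0

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    struct timespec t0, now;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long long total = 0;
    double elapsed;

    do {
        ssize_t n = read(fd, buf, CHUNK);
        if (n <= 0)                      /* end of device: wrap around */
            lseek(fd, 0, SEEK_SET);
        else
            total += n;
        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed = (now.tv_sec - t0.tv_sec) +
                  (now.tv_nsec - t0.tv_nsec) / 1e9;
    } while (elapsed < SECONDS);

    printf("%.2f MB/s\n", total / elapsed / (1024.0 * 1024.0));
    free(buf);
    close(fd);
    return 0;
}

Because the data comes back through an ordinary read() into a user buffer, it is at least copied into domU memory on the way, which is the point Daniel is making; whether the benchmark then looks at that data is a separate question.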
> > The big thing is that on network RX it is currently dom0 that does the
> > copy. In the CMP case this leaves the data in the shared cache, ready
> > to be accessed by the guest. In the SMP case it doesn't help at all. In
> > netchannel2 we're moving the copy to the guest CPU, and trying to
> > eliminate it with smart hardware.
> >
> > Block IO doesn't require a copy at all.
>
> Well, not in blkback by itself, but certainly from the in-memory disk
> image. Unless I misunderstood Keir's recent post, page flipping is
> basically dead code, so I thought the numbers should at least point in
> roughly the same direction.

Blkback has always DMA-ed directly into guest memory when reading data from the disk drive (the normal use case), in which case there's no copy - I think that was Ian's point. In contrast, the netback driver has to do a copy in the normal case.

If you're using a ramdisk then there must be a copy somewhere, although I'm not sure exactly where it happens!

Cheers,
Mark

--
Push Me Pull You - Distributed SCM tool (http://www.cl.cam.ac.uk/~maw48/pmpu/)
On Sun, 2008-03-16 at 21:15 +0000, Mark Williamson wrote:

> > > Block IO doesn't require a copy at all.
> >
> > Well, not in blkback by itself, but certainly from the in-memory disk
> > image. Unless I misunderstood Keir's recent post, page flipping is
> > basically dead code, so I thought the numbers should at least point in
> > roughly the same direction.
>
> Blkback has always DMA-ed directly into guest memory when reading data
> from the disk drive (the normal use case), in which case there's no copy
> - I think that was Ian's point. In contrast, the netback driver has to
> do a copy in the normal case.
>
> If you're using a ramdisk then there must be a copy somewhere, although
> I'm not sure exactly where it happens!

I checked; this is comparatively easy to find. Since DMA-or-not is ultimately up to the driver, it's that single memcpy() in rd.c. Looks rather straightforward.

In theory, such a pseudo-device could make use of the host DMA engine embedded in newer Intel chipsets to save a few cycles. But looking at the source, that does not seem to be the case (and typical usage scenarios for the ramdisk driver (4 MB default size, iirc) would hardly justify the effort).

Blkdev peak throughput at 500 MB/s is certainly not a usability issue :) I just asked because I hoped someone who has spent more time on the PV drivers than I have might have experienced (or even profiled) similar effects already, and could explain them.

Thanks and greetings,
Daniel
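For illustration, here is a toy user-space analogue of that read path. It is not the actual rd.c code, only its shape under the assumption that the backing store is plain memory: the "disk" is a buffer, so servicing a read request bottoms out in a single memcpy() into the requester's buffer, which is the copy (and the cache footprint it drags along) being discussed above.

/* ramdisk_copy.c: toy analogue of a RAM-backed block device's read
 * path, not the real rd.c.  There is no device to DMA from, so the
 * only way to complete a read is to memcpy() from the in-memory image
 * into the requester's buffer.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512
#define DISK_SECTORS (8 * 1024)          /* a 4 MB toy disk */

struct ramdisk {
    unsigned char *data;                 /* the in-memory disk image */
    size_t sectors;
};

/* The whole "I/O" path: no DMA possible, just a CPU copy. */
static int ramdisk_read(const struct ramdisk *rd, size_t sector,
                        size_t nsectors, void *buf)
{
    if (sector + nsectors > rd->sectors)
        return -1;
    memcpy(buf, rd->data + sector * SECTOR_SIZE, nsectors * SECTOR_SIZE);
    return 0;
}

int main(void)
{
    struct ramdisk rd;
    rd.sectors = DISK_SECTORS;
    rd.data = calloc(rd.sectors, SECTOR_SIZE);
    if (!rd.data) { perror("calloc"); return 1; }

    unsigned char buf[8 * SECTOR_SIZE];  /* one 4 KB request */
    if (ramdisk_read(&rd, 0, 8, buf) == 0)
        printf("read 4 KB, first byte %u\n", buf[0]);

    free(rd.data);
    return 0;
}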