thr3ads.net - Xen users - Investigating memory performance: bare metal vs. xen-pv vs. xen-hvm [Aug 2013]


Tudor-Ioan Salomie [PERSONAL]

2013-Aug-28 15:16 UTC

Investigating memory performance: bare metal vs. xen-pv vs. xen-hvm

I've been trying to compare memory access speed between bare-metal, xen-pv
and xen-pvhvm (hvm with pv drivers). In all 3 setups I'm running the same
kernel (3.6.6), built with support for xen, on a 64-core AMD Opteron 6378
machine.
The output of xm info (relevant parts):
machine                : x86_64
nr_cpus                : 64
nr_nodes               : 8
cores_per_socket       : 16
threads_per_core       : 1
cpu_mhz                : 2400
hw_caps                : 178bf3ff:2fd3fbff:00000000:00001710:32983203:00000000:01ebbfff:00000008
virt_caps              : hvm
total_memory           : 524262
free_memory            : 498318
free_cpus              : 0
xen_major              : 4
xen_minor              : 1
xen_extra              : .2
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : Tue Apr 10 10:50:08 2012 +0200 12:efd10c64454c
xen_commandline        : auto BOOT_IMAGE=user-xen root=801 placeholder no-bootscrub dom0_mem=4096M dom0_max_vcpus=16 dom0_vcpus_pin root=/dev/sda1 noreboot
cc_compiler            : gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

To avoid as much noise as possible, I've got a kernel module in which I'm
performing the following straightforward test:
for (j = 2; j < 18; j++) {
        cpysize = 1 << j;               /* chunk size: 4 bytes .. 128 KB */
        cpycnt = size / cpysize;        /* chunks needed to cover the buffer */
        do_gettimeofday(&tvb);
        for (i = 0; i < loops; ++i) {
                for (k = 0; k < cpycnt; ++k) {
                        /* memcpy() takes (dest, src, n): this copies dst into src */
                        memcpy(src+k*cpysize, dst+k*cpysize, cpysize);
                }
        }
        do_gettimeofday(&tve);
        msec = timevaldiff(&tvb, &tve);
        printk(KERN_INFO "Did the loops in %ld msec for cpysize = %d \n",
               msec, cpysize);
}
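
For anyone who wants to try the same loop outside the kernel, here is a rough user-space sketch. It is not the module from this post: timevaldiff_ms() and time_chunked_copy() are hypothetical names, gettimeofday() stands in for do_gettimeofday(), and the memcpy() arguments are written in (dest, src, n) order. Numbers from it are not directly comparable to the in-kernel ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical stand-in for the post's timevaldiff(): elapsed milliseconds. */
static long timevaldiff_ms(const struct timeval *start, const struct timeval *end)
{
    long long usec = (long long)(end->tv_sec - start->tv_sec) * 1000000LL
                   + (end->tv_usec - start->tv_usec);
    return (long)(usec / 1000);
}

/*
 * Copy `size` bytes from src to dst in chunks of `cpysize` bytes,
 * repeated `loops` times. Returns the elapsed wall-clock time in msec.
 */
static long time_chunked_copy(char *dst, const char *src, size_t size,
                              size_t cpysize, int loops)
{
    struct timeval tvb, tve;
    size_t cpycnt, k;
    int i;

    assert(cpysize > 0 && cpysize <= size);
    cpycnt = size / cpysize;

    gettimeofday(&tvb, NULL);
    for (i = 0; i < loops; ++i)
        for (k = 0; k < cpycnt; ++k)
            memcpy(dst + k * cpysize, src + k * cpysize, cpysize);
    gettimeofday(&tve, NULL);

    return timevaldiff_ms(&tvb, &tve);
}
```

To mirror the test above, one would allocate two 16 MB buffers, zero src, and call time_chunked_copy() with loops = 100 for each cpysize from 1 << 2 up to 1 << 17.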

Here src and dst were allocated with vmalloc, each of size "size", and src
is initialized with 0s.
Since it's the same code, run within the kernel, I would have expected the
obvious: bare metal to be the fastest, then pvhvm and then pv. Curiously,
the numbers I get (summarized in the table below) contradict my assumption:


time (msec) - 100 loops over 16MB

chunk size (cpysize)   bare metal   pvhvm     pv
                   4         5827    4606   4971
                   8         3865    3030   3448
                  16         3177    2134   2270
                  32         3241    2216   2059
                  64         1009     943    925
                 128          760     599    566
                 256          767     592    559
                 512          727     587    544
                1024          701     575    524
                2048          688     570    507
                4096          678     566    498
                8192          662     552    489
               16384          652     542    480
               32768          646     539    478
               65536          644     535    474
              131072          643     535    473

The peculiar observations:
==========================
1. bare metal seems to be slower in all cases (???)
2. pvhvm is faster than pv, but only for small chunks (up to 16 bytes)
3. for large chunks, the order is the reverse of what I would have
anticipated: pv (fastest), then pvhvm, then bare metal (slowest).
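
To put the gap in more familiar units, the times can be converted to copy throughput. This is only a sketch of the arithmetic: it assumes "16MB" in the table header means 16 MiB and that every row copies the whole buffer 100 times, and the helper names are made up for illustration.

```c
#include <assert.h>

/* Assumption: "16MB" means 16 MiB, copied LOOPS times per table row. */
#define BUF_MIB 16
#define LOOPS   100

/* Throughput in MiB/s for one table entry, given its elapsed msec. */
static double throughput_mib_s(long msec)
{
    assert(msec > 0);
    return (double)BUF_MIB * LOOPS * 1000.0 / (double)msec;
}

/* How many times longer run A took than run B (e.g. bare metal vs. pv). */
static double slowdown(long msec_a, long msec_b)
{
    assert(msec_b > 0);
    return (double)msec_a / (double)msec_b;
}
```

For the 131072-byte row this gives roughly 2488 MiB/s on bare metal (643 msec) against roughly 3383 MiB/s under pv (473 msec), so bare metal takes about 1.36 times as long.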

Does anyone have any ideas why this might be happening?
Am I missing something?

Cheers,
-- Tudor.


