I've been working on upstreaming some patches from our XenServer patchqueue relating to credit1. One set of patches deals with an issue I call "concurrency hazard" (which I shall define below). As part of the upstreaming process, I did extensive testing to determine which of the patches helped, and how much. I've decided to upstream one of the patches and shelve the other two. I'm writing this e-mail to explain the nature of the original problem, the results of my testing, and why I've chosen the one I have, so that community members can make informed review decisions.

So, to start with the basics: what is concurrency hazard? Operating systems typically use "spinning" primitives to synchronize between processors. Spinlocks are the most common example, but there are variations on those (e.g., queued/ticketed spinlocks), as well as other techniques, such as sending an IPI to another processor asking it to do something, and spinning while waiting for it to be done.

In a virtualized environment, not all vcpus of a VM are scheduled all of the time. It's possible for a vcpu to grab a spinlock and immediately be preempted, for example. In that case, if another vcpu of the same VM attempts to grab the lock, it may spend its whole timeslice (up to 30ms in the default scheduler) just spinning, waiting for the lock. Not only is this time that could be used by another VM for useful processing; in our analyses, it's not uncommon for the vcpu holding the lock to be on the same cpu as, but at a lower priority than, the vcpu trying to grab the lock. In that case, the time spent spinning actively prevents the vcpu holding the lock from doing what it needs to do and releasing the lock.

To give you an idea of the potential problem, consider the following setup:
* 4-core box
* Several VMs, each with 4 vcpus, running Windows
* Install the Windows DDK, and build the example drivers, timing how long it takes.

This "ddk-build" test is similar to the Linux kernbench (or kernel-build) test: it involves a lot of disk I/O, file creation and destruction, and process creation and destruction, and takes a number of locks along the way.

The most obvious measure is the average time to complete a build. Since this alone doesn't give you an idea of overall system performance, I also calculate a "throughput" metric. The throughput of an individual VM is the somewhat arbitrary measure of "number of ddk builds per 100,000 seconds". The advantage of this is that we can sum the throughput of the individual VMs and see what the total throughput of the system is as we increase the number of VMs (see the small snippet after the setup notes below). In an ideal system, the throughput of 1 VM maxing out at least one resource (either cpu or disk) would be equal to the aggregate throughput of 2 VMs both maxing out the same physical resource.

If only a single VM is running, and the disk cache is warm, the time to build on my system averaged 71 seconds -- an aggregate throughput of 1402 (== 100000/71). In an ideal system, if we add a second VM, we'd expect the average build time to be no more than about twice that (around 142 seconds). What we actually get is 190 seconds, and the aggregate throughput drops to 1049 (== (100000/t1) + (100000/t2), where t1 and t2 are the times of the individual runs). Things continue to get worse as we go up:

Base (no patches)
VMs | Ave time | Tput
 1  |   71s    | 1402
 2  |  190s    | 1049
 3  |  371s    |  805
 4  |  559s    |  665

(For completeness, the full setup for this and the other tests:
* Each VM runs Windows Server 2008 R2 with 1G of RAM, on a 32G box.
* Before starting testing, 8 VMs are booted. Tests are run with the specified number of VMs, with the rest of them idling.
* Each test is run 6 times. The first run is only to warm up the guest's disk cache, and its result is discarded. The remaining 5 runs are used to calculate the average run time / throughput reported here.)
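As an aside, the "throughput" arithmetic is trivial to reproduce. The following standalone snippet is purely illustrative -- it is not part of the test harness, and the build times in it are just the example figures quoted above -- but it shows how the per-VM and aggregate values are derived:

    #include <stdio.h>

    /* Illustrative only: compute the per-VM and aggregate "throughput"
     * metric used above, i.e. ddk builds per 100,000 seconds, summed
     * across VMs. */
    int main(void)
    {
        /* Example: average build times (in seconds) for a 2-VM run,
         * using the measured 190s average quoted above for both VMs. */
        double times[] = { 190.0, 190.0 };
        int nr_vms = sizeof(times) / sizeof(times[0]);
        double aggregate = 0.0;
        int i;

        for (i = 0; i < nr_vms; i++)
        {
            double tput = 100000.0 / times[i];  /* builds per 100,000s */
            printf("VM %d: %.0fs -> throughput %.0f\n", i, times[i], tput);
            aggregate += tput;
        }

        printf("Aggregate throughput: %.0f\n", aggregate);
        return 0;
    }

(Because the real t1 and t2 differ slightly between runs, the aggregate this prints, about 1053, is a little different from the 1049 quoted above.)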
Previous analysis had shown that the VMs spent an increasingly significant chunk of their time in spinlock routines -- in other words, suffering from concurrency hazard. To address this issue, we pursued two lines of attack.

The first was pure scheduling changes, with no changes to the guest. The main point of these was to try to keep vcpus from the same VM off of the same physical processor. Our analysis showed that it was not uncommon for two vcpus from the same VM to end up on the same processor. In that case, it was common for v0 to grab a spinlock and then be preempted by v1, which would then try to grab the same spinlock. v1 would spin for a full timeslice waiting for the lock, when if it had simply blocked and let v0 run, it could have gotten the lock in short order. The two patches we came up with here were:

* "pick": Add knowledge of domains to cpu_pick(), so that cpu_pick() tries to avoid putting vcpus from the same VM on the same processor if possible.
* "push": In the course of chaotic load balancing, vcpus often end up on the same cpu anyway. Periodically go through and "push" vcpus which have become co-located off to different cpus.

The second line of attack was paravirtualization. For Windows 2003 / XP, we patched the spinlock routines to spin for a fixed number of cycles and then make a SCHED_yield hypercall (the general shape of this is sketched below). The Viridian extensions include a hypervisor callback for missed spinlocks; we re-directed this callback to SCHED_yield as well.

The existing implementation of "yield" was simply to call into the scheduler again. Unfortunately, for the credit scheduler, the result is almost always that it picks the same vcpu to run again: after all, that vcpu has the highest priority, so as far as the scheduler is concerned it's the one that should run. So yield is effectively a no-op. Hence the third patch tested:

* "yield": Actually implement yield in the credit scheduler. If a vcpu yields, and there are lower-priority vcpus on the runqueue, put it behind one of the lower-priority vcpus (a rough sketch of this idea is also included below). The runqueue is sorted by priority every 30ms, so the longest this inversion can last is 30ms.

The results when adding all of these together were impressive:

Pick,Push,Yield
VMs | Ave time | Tput
 1  |   74s    | 1351
 2  |  129s    | 1540
 3  |  196s    | 1527
 4  |  267s    | 1492

Note that the average runtime for 2 VMs is *less* than twice the number for 1 VM -- the total system throughput is higher, as the resources are being used more efficiently. Also note, however, that the throughput for a single VM is lower -- by almost 4%. So adding these mechanisms can help systems suffering from concurrency hazard, but they also have a negative performance impact on systems without it.

To minimize that impact, I did some tests to find out which patch was helping, and by how much. The pick and push patches both made significant contributions on their own. However, the yield patch outshone both of them:

Yield
VMs | Ave time | Tput
 1  |   72s    | 1389
 2  |  132s    | 1514
 3  |  198s    | 1508
 4  |  271s    | 1470

The impact of the yield patch on single-VM throughput was much smaller -- less than 1%. The throughput for higher numbers of VMs was lower than with all of the patches combined, but only by a marginal amount -- less than 2% in all cases. So of these three patches, I believe that the best option is to add only the yield patch.
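For reference, the paravirtualized spin loop mentioned above has roughly the following shape. This is only a sketch, not the actual Windows changes: SPIN_THRESHOLD, hypervisor_yield() and the lock itself are placeholders, and the real changes go through the patched spinlock routines / the Viridian missed-spinlock callback and issue a SCHED_yield hypercall.

    #include <stdatomic.h>
    #include <stdio.h>

    /* Illustrative sketch of a "spin a while, then yield to the
     * hypervisor" lock.  All names and values here are placeholders. */

    #define SPIN_THRESHOLD 1000   /* arbitrary example value */

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void hypervisor_yield(void)
    {
        /* In a guest this would be the SCHED_yield hypercall; it is
         * just a stub here so the example compiles and runs. */
    }

    static void pv_spin_lock(void)
    {
        for ( ; ; )
        {
            int spins;

            for (spins = 0; spins < SPIN_THRESHOLD; spins++)
            {
                if (!atomic_flag_test_and_set(&lock))
                    return;       /* got the lock */
            }
            /* The lock holder is probably preempted: give up the (v)cpu
             * instead of burning the rest of the timeslice. */
            hypervisor_yield();
        }
    }

    static void pv_spin_unlock(void)
    {
        atomic_flag_clear(&lock);
    }

    int main(void)
    {
        pv_spin_lock();
        printf("lock acquired\n");
        pv_spin_unlock();
        return 0;
    }

The only point here is the shape of the loop: spin a bounded amount, then hand the cpu back rather than spinning for the whole timeslice.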
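Similarly, to make the yield insertion policy a little more concrete, here is a toy sketch. This is emphatically not the actual patch: the structures and names (toy_runq, runq_insert_yield, and so on) are invented for illustration, and the real code operates on credit1's per-cpu runqueues. It only demonstrates the rule described above: a yielding vcpu lets one lower-priority vcpu (hopefully the preempted lock holder) go ahead of it, and the periodic re-sort bounds how long that inversion can last.

    #include <stdio.h>

    #define MAX_RUNQ 8

    /* Toy model only -- not Xen code. */
    struct toy_vcpu {
        int id;
        int pri;                  /* higher value == higher priority */
    };

    struct toy_runq {
        struct toy_vcpu *q[MAX_RUNQ];
        int len;
    };

    /* Normal insertion: keep the queue sorted by priority, so a vcpu
     * goes in front of anything with strictly lower priority. */
    static void runq_insert(struct toy_runq *rq, struct toy_vcpu *v)
    {
        int i = 0, j;

        while (i < rq->len && rq->q[i]->pri >= v->pri)
            i++;
        for (j = rq->len; j > i; j--)
            rq->q[j] = rq->q[j - 1];
        rq->q[i] = v;
        rq->len++;
    }

    /* Yield insertion: find the normal slot, then deliberately step
     * past one lower-priority vcpu (if there is one), so that it gets
     * to run before the yielder. */
    static void runq_insert_yield(struct toy_runq *rq, struct toy_vcpu *v)
    {
        int i = 0, j;

        while (i < rq->len && rq->q[i]->pri >= v->pri)
            i++;
        if (i < rq->len)          /* there is a lower-priority vcpu */
            i++;
        for (j = rq->len; j > i; j--)
            rq->q[j] = rq->q[j - 1];
        rq->q[i] = v;
        rq->len++;
    }

    int main(void)
    {
        struct toy_runq rq = { .len = 0 };
        struct toy_vcpu lockholder = { .id = 0, .pri = 1 };  /* lower priority */
        struct toy_vcpu yielder    = { .id = 1, .pri = 2 };  /* higher priority */
        int i;

        runq_insert(&rq, &lockholder);
        /* Without the patch, the yielder would be re-inserted in front of
         * the lock holder; with yield insertion it goes behind it. */
        runq_insert_yield(&rq, &yielder);

        for (i = 0; i < rq.len; i++)
            printf("slot %d: vcpu %d (pri %d)\n", i, rq.q[i]->id, rq.q[i]->pri);

        return 0;
    }

Running this prints the lower-priority lock holder in slot 0, ahead of the higher-priority yielder -- exactly the bounded inversion the patch aims for.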
I'm attaching the other two patches to this e-mail for posterity, as well as a gnumeric spreadsheet with all of the combinations tested. (For those reading, "ppy" is "pick+push+yield".) I'll be sending the "yield" patch series shortly.

 -George