Dan Magenheimer
2009-Aug-28 19:56 UTC
[Xen-devel] rdtsc: correctness vs performance on Xen (and KVM?)
(Sorry for the repost. The original was accidentally posted as a reply to an existing thread and, despite a different subject field, gets threaded with that thread rather than starting a new one. Please reply to this thread rather than to the previous post on the old thread.)

To summarize: Xen and KVM currently allow rdtsc to be executed directly by userland. As a result, apps that use rdtsc smartly and effectively on (some) physical machines may break badly in Xen or KVM because of the disassociation of physical and virtual cpus. (Readers not familiar with why rdtsc is a problem can read e.g. http://en.wikipedia.org/wiki/Rdtsc)

VMware always emulates rdtsc, for both kernel and userland rdtsc's. (I don't know what HyperV does.) Xen currently has a boot option to always emulate rdtsc in HVM guests and just added code such that the same boot option will always emulate rdtsc for userland-only in PVM guests. There is some agreement in the Xen community that rdtsc emulation should always be the default, though the default is currently off. KVM is having a similar discussion and, I'm told, has also come to the conclusion that emulating rdtsc is a necessary evil.

The problem is that emulating rdtsc is slow. On my dual-core Conroe, rdtsc is about 72 cycles, and emulating rdtsc (returning a fixed-frequency 1GHz Xen monotonic system time) is over 15x slower. This is a big hit for apps that do tens to hundreds of thousands of rdtsc's per processor per second. (And yes, these apps are more common than one might think.)

VMware has the advantage of binary translation; rdtsc can be translated to return a "conforming" value in ~200 cycles (on an older processor, so probably faster if you are comparing against my dual-core Conroe numbers above). This value is "stale" (not linear with wallclock time). For VMs that need rdtsc to more accurately reflect wallclock time, full emulation can be optionally enabled for a VM.
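
To put the emulation cost in perspective, here is a rough back-of-envelope sketch. It uses the 72-cycle figure and the 15x factor quoted above (the factor is "over 15x", so this is a lower bound). The 2.4GHz clock rate and the name rdtsc_overhead are assumptions made purely for illustration; the clock rate of the test machine is not stated.

```python
NATIVE_CYCLES = 72        # native rdtsc cost quoted above
EMULATION_FACTOR = 15     # "over 15x", so the emulated cost is a lower bound
ASSUMED_CPU_HZ = 2.4e9    # ASSUMPTION: clock rate is not given in the post


def rdtsc_overhead(calls_per_sec, cycles_per_call, cpu_hz=ASSUMED_CPU_HZ):
    """Return the fraction of one processor's cycles spent executing rdtsc."""
    return calls_per_sec * cycles_per_call / cpu_hz


if __name__ == "__main__":
    # "Tens to hundreds of thousands" of rdtsc's per processor per second.
    for rate in (10_000, 100_000, 500_000):
        native = rdtsc_overhead(rate, NATIVE_CYCLES)
        emulated = rdtsc_overhead(rate, NATIVE_CYCLES * EMULATION_FACTOR)
        print(f"{rate:>7} rdtsc/s: native {native:.3%}, emulated {emulated:.2%}")
```

Under these assumptions the native cost stays well below 1% of a processor across that range, while the emulated cost reaches several percent at 100,000 calls per second and grows linearly from there.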
I'm searching for alternatives that provide the correctness of emulation, but better performance than emulation.

Jeremy points out that the pvclock mechanism in upstream Linux works well, but the pvclock data is currently only exposed to the kernel... and exposing it to userland still requires apps-using-rdtsc to be rewritten. But Jeremy claims that all apps-that-use-rdtsc MUST be rewritten because using rdtsc is unsafe, and that they should be rewritten to use gettimeofday (or actually vgettimeofday). But on older OSes (including the vast majority of installed units) and on machines where tsc is "unsafe", gettimeofday can be MUCH slower than emulating rdtsc. So telling app writers to convert all uses of rdtsc to gettimeofday is not an acceptable solution for these apps in the short term.

My current thinking is that we (the Linux and Xen and KVM community) should architect a userland API using the pvclock mechanism. The underlying implementation of this API would utilize Linux only to "register" the mechanism, preferably via a module so that it, like disk and network frontends, could easily be bolted on to shipping OSes. Individual uses of "pvclock_read" would need no syscall... like the kernel pvclock mechanism, they need only access memory to get the necessary scaling and offset data. Once the mechanism is registered, rdtsc is executed directly by the app as part of the pvclock protocol. If it is never registered, rdtsc would always be trapped and emulated.

I realize this idea is half-baked, but would like to invite other TSC/time experts to determine if some or all of the idea might be used to achieve a fully-baked solution.
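
A minimal sketch of what a userland "pvclock_read" might look like, modelled in Python for readability. A real implementation would be C, with memory barriers around the version checks and an actual rdtsc instruction. The TimeInfo fields are modelled on the scaling and offset data the kernel pvclock mechanism uses, but the class name, the read_tsc callable, and the scale_delta helper are all assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass


@dataclass
class TimeInfo:
    """Hypothetical per-vcpu record shared read-only with userland."""
    version: int            # odd while the hypervisor is updating the record
    tsc_timestamp: int      # TSC value when system_time was sampled
    system_time: int        # nanoseconds at tsc_timestamp (the offset)
    tsc_to_system_mul: int  # 32-bit fixed-point multiplier (the scale)
    tsc_shift: int          # signed pre-shift applied to the TSC delta


def scale_delta(delta, mul, shift):
    """Convert a TSC delta to nanoseconds with the shift/multiply scheme."""
    if shift < 0:
        delta >>= -shift
    else:
        delta <<= shift
    return (delta * mul) >> 32


def pvclock_read(info, read_tsc):
    """Return nanoseconds, using memory reads plus one TSC read (no syscall)."""
    while True:
        version = info.version
        if version & 1:
            continue  # record is mid-update; try again
        tsc_timestamp = info.tsc_timestamp
        system_time = info.system_time
        mul = info.tsc_to_system_mul
        shift = info.tsc_shift
        tsc = read_tsc()  # stands in for a directly executed rdtsc
        if info.version == version:
            # The snapshot was consistent, so the result can be trusted.
            return system_time + scale_delta(tsc - tsc_timestamp, mul, shift)


if __name__ == "__main__":
    info = TimeInfo(version=2, tsc_timestamp=1_000, system_time=5_000_000,
                    tsc_to_system_mul=1 << 31, tsc_shift=1)
    print(pvclock_read(info, read_tsc=lambda: 3_000))
```

The version check is what lets the app detect that the scaling and offset data changed underneath it (for example, because the virtual cpu moved to a different physical cpu) and retry, without ever entering the kernel.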