Some offlist email with the Intel team may have solved the thorniest issues for my earlier proposal, explained here:

http://lists.xensource.com/archives/html/xen-devel/2009-08/msg01209.html

The key showstoppers for this were that the pvclock mechanism as used by a guest kernel depends on knowing the vcpu number and fetching the proper adjustment values, which differ depending on the vcpu. AND even if you could get the vcpu number and proper adjustment values, there is no way to ensure that a context switch doesn't occur right in the middle of the pvclock computation that changes what values must be used.

Jun pointed out that on the majority of Intel processors and Intel-based systems that Xen runs on today AND the VAST majority of new Intel and AMD systems being shipped that Xen will run on tomorrow, TSC is "reliable", meaning it is synchronized within a reasonable error range across all processors.

So suppose there is a single page for all processors that contains a flag field: "TSC_is_reliable". If TSC_is_reliable is set (by Xen), then the app can safely use the remaining fields to perform the pvclock computation. If it is NOT set, the app must use a much slower system call (or a somewhat-slower yet-to-be-designed userland hypercall).

A remaining hard problem is that this single "userland-accessible shared page" must be somehow made available to apps (I suggested a rdmsr emulated by Xen so that it works in userland) and must be mapped into the app address space without kernel changes. I think someone (Keir?) suggested this problem was solvable before we got sidetracked on the need-vcpu-number-in-userland problem.

Comments?

Thanks,
Dan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
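A minimal sketch of the fast path proposed above, offered only as an illustration: the page layout, the field names (including `tsc_is_reliable`) and both function names are assumptions, not an existing Xen interface. The scaling step follows the usual pvclock arithmetic (shift the TSC delta, then multiply by a 32.32 fixed-point factor); the version check against concurrent updates is left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the single userland-visible page. The field
 * names are illustrative and not an existing Xen ABI. */
struct user_pvclock_page {
    uint32_t tsc_is_reliable;    /* set by Xen when TSC is synchronized */
    uint32_t version;            /* update counter (unused in this sketch) */
    uint64_t tsc_timestamp;      /* TSC value at the last update */
    uint64_t system_time_ns;     /* system time at the last update */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;          /* shift applied before the multiply */
};

/* Scale a TSC delta to nanoseconds: shift, then multiply by a 32.32
 * fixed-point factor and keep the integer part. */
static uint64_t scale_tsc_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    assert(shift > -64 && shift < 64);  /* larger shifts are undefined in C */
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    /* (delta * mul) >> 32, split so that no 128-bit type is needed */
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Returns 0 and stores nanoseconds on the fast path. Returns -1 when the
 * flag is clear and the caller must fall back to the slower system call. */
static int user_pvclock_read(const struct user_pvclock_page *p,
                             uint64_t tsc_now, uint64_t *ns_out)
{
    if (!p->tsc_is_reliable)
        return -1;
    *ns_out = p->system_time_ns +
              scale_tsc_delta(tsc_now - p->tsc_timestamp,
                              p->tsc_to_system_mul, p->tsc_shift);
    return 0;
}
```

In a real app `tsc_now` would come from rdtsc; it is a parameter here so the arithmetic can be checked with fixed inputs.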
On 17/09/2009 18:58, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> A remaining hard problem is that this single "userland-accessible shared page" must be somehow made available to apps (I suggested a rdmsr emulated by Xen so that it works in userland) and must be mapped into the app address space without kernel changes. I think someone (Keir?) suggested this problem was solvable before we got sidetracked on the need-vcpu-number-in-userland problem.

I don't think mapping things into application address space is really possible without guest kernel changes. The guest kernel owns and manages the pte that you'd be overwriting. Just blatting the pte would not be good form.

 -- Keir
Keir Fraser wrote on Thu, 17 Sep 2009 at 12:03:42:

> On 17/09/2009 18:58, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
>> A remaining hard problem is that this single "userland-accessible shared page" must be somehow made available to apps (I suggested a rdmsr emulated by Xen so that it works in userland) and must be mapped into the app address space without kernel changes. I think someone (Keir?) suggested this problem was solvable before we got sidetracked on the need-vcpu-number-in-userland problem.
>
> I don't think mapping things into application address space is really possible without guest kernel changes. The guest kernel owns and manages the pte that you'd be overwriting. Just blatting the pte would not be good form.

Maybe you can write a device driver in the guest that sets up mapping against the (virtual) physical memory, then use mmap() in the app?

Jun
___
Intel Open Source Technology Center
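For illustration, the application side of the driver-plus-mmap() idea might look roughly like the sketch below. No such driver is known to exist, so the function name and any device path passed to it are hypothetical; only the POSIX calls themselves are real.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one read-only page exported by a guest driver's device node.
 * The device path is supplied by the caller and is hypothetical.
 * Returns NULL on failure. */
static const void *map_clock_page(const char *dev_path)
{
    assert(dev_path != NULL);

    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return NULL;

    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return NULL;

    void *page = mmap(NULL, (size_t)page_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */

    return page == MAP_FAILED ? NULL : page;
}
```

The driver would have to implement the matching mmap handler in the guest kernel; that part is not sketched here.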
On 17/09/2009 20:13, "Nakajima, Jun" <jun.nakajima@intel.com> wrote:

>>> A remaining hard problem is that this single "userland-accessible shared page" must be somehow made available to apps (I suggested a rdmsr emulated by Xen so that it works in userland) and must be mapped into the app address space without kernel changes. I think someone (Keir?) suggested this problem was solvable before we got sidetracked on the need-vcpu-number-in-userland problem.
>>
>> I don't think mapping things into application address space is really possible without guest kernel changes. The guest kernel owns and manages the pte that you'd be overwriting. Just blatting the pte would not be good form.
>
> Maybe you can write a device driver in the guest that sets up mapping against the (virtual) physical memory, then use mmap() in the app?

Yeah, that'd work. Doesn't even really need any Xen changes -- the stable-TSC flag is in CPUID already, and the guest driver can concoct a pvclock page I would think (e.g., from its own Xen pvclock TSC scaling parameters).

 -- Keir
On 09/17/09 12:13, Nakajima, Jun wrote:

> Maybe you can write a device driver in the guest that sets up mapping against the (virtual) physical memory, then use mmap() in the app?

Dan is trying to avoid making guest kernel changes, on the grounds that waiting for enterprise distros to catch up would take too long.

Once you're making kernel changes then you can update the pvclock mechanism to use the xen clock algorithm, obviating the need for usermode ABI changes.

However, if it's really the case that the tsc is guaranteed synchronized, then the guest can determine that for itself by looking at cpuid and/or /proc/cpuinfo (and presumably doing some sanity checking) and then just directly use rdtsc, with no need to change either Xen or the kernel.

    J
> On 09/17/09 12:13, Nakajima, Jun wrote:
> > Maybe you can write a device driver in the guest that sets up mapping against the (virtual) physical memory, then use mmap() in the app?
>
> Dan is trying to avoid making guest kernel changes, on the grounds that waiting for enterprise distros to catch up would take too long.

Well, as I've said all along, a driver in a dynamically loadable module is OK. Whether sensible or not, customers don't seem to care about that, they only care if you change the kernel bits that get loaded.

> Once you're making kernel changes then you can update the pvclock mechanism to use the xen clock algorithm, obviating the need for usermode ABI changes.

Is that working yet (fast vsyscall under Xen)? >:-)

> However, if it's really the case that the tsc is guaranteed synchronized, then the guest can determine that for itself by looking at cpuid and/or /proc/cpuinfo (and presumably doing some sanity checking) and then just directly use rdtsc, with no need to change either Xen or the kernel.

That's exactly what the app is doing when on bare metal.

But in virtual, unless it gets some kind of notification on migration (which would be cool, but would also require kernel changes?), it can't determine the appropriate scaling factor and offset, or that they need to change. (The userland pvclock algorithm would need to keep a version indicator just like the kernel pvclock does.) So that's what the userland-accessible shared page is needed for.

Dan
On 09/17/09 12:45, Dan Magenheimer wrote:

> Well, as I've said all along, a driver in a dynamically loadable module is OK. Whether sensible or not, customers don't seem to care about that, they only care if you change the kernel bits that get loaded.

A loadable driver wouldn't be able to claim fixmap slots or anything like that, but, yes, it could present an mmapable device.

>> Once you're making kernel changes then you can update the pvclock mechanism to use the xen clock algorithm, obviating the need for usermode ABI changes.
>
> Is that working yet (fast vsyscall under Xen)? >:-)

It's just a matter of someone spending some time on it. My frustration with the pvtsc scheme is that it's incredibly niche and is only going to serve a very small number of users; a similar amount of effort spent on a vsyscall solution will have a much larger payoff.

>> However, if it's really the case that the tsc is guaranteed synchronized, then the guest can determine that for itself by looking at cpuid and/or /proc/cpuinfo (and presumably doing some sanity checking) and then just directly use rdtsc, with no need to change either Xen or the kernel.
>
> That's exactly what the app is doing when on bare metal.

Sure. And they still need to deal with discontinuities resulting from suspend/resume on bare-metal, so dealing with discontinuities caused by migration is no different. And on bare metal it would need to compute the tsc frequency for itself, so it doesn't really need Xen support for that either.

> But in virtual unless it gets some kind of notification on migration (which would be cool, but would also require kernel changes?), it can't determine the appropriate scaling factor and offset, or that they need to change.

Well, they have access to xenbus, so they could get the tsc parameters that way, along with notifications about changes. It wouldn't be completely synchronous, but you'd need to implement the pvclock algorithm to do that.

But as I say, it has to cope with all this on bare metal anyway, so it doesn't really need anything from Xen.

> (The userland pvclock algorithm would need to keep a version indicator just like the kernel pvclock does.) So that's what the userland-accessible shared page is needed for.

Sure. The mechanism is the same as you'd need to do vsyscall.

    J
> > Well, as I've said all along, a driver in a dynamically loadable module is OK. Whether sensible or not, customers don't seem to care about that, they only care if you change the kernel bits that get loaded.
>
> A loadable driver wouldn't be able to claim fixmap slots or anything like that, but, yes, it could present an mmapable device.

Hopefully that would be all that would be necessary.

> > > Once you're making kernel changes then you can update the pvclock mechanism to use the xen clock algorithm, obviating the need for usermode ABI changes.
> >
> > Is that working yet (fast vsyscall under Xen)? >:-)
>
> It's just a matter of someone spending some time on it. My frustration with the pvtsc scheme is that it's incredibly niche and is only going to serve a very small number of users; a similar amount of effort spent on a vsyscall solution will have a much larger payoff.

Well we at Oracle hope that "users of Oracle DB and Oracle JRockit on Oracle VM" is not a small niche. Whereas until the enterprise distros pick up vsyscall+pvclock into a widely distributed kernel, a system call won't ever be used by any app that cares about performance.

Think of it as a critical bug fix. Today's Xen fails to correctly provide the defined behavior of a user instruction used on physical machines by certain important apps. Latest xen-unstable fixes the correctness issue, but at a significant loss of performance. Vsyscall+pvclock is the right answer for the long term, but apps that try to use it today will see an even larger loss of performance on the vast majority of machines for the foreseeable future. (AND it won't even be faster than emulation unless the administrator can guarantee that all apps running on the guest are fixed to not use rdtsc, so emulation can be disabled.)

That's why I am so persistently trying to find another solution.

> > > However, if it's really the case that the tsc is guaranteed synchronized, then the guest can determine that for itself by looking at cpuid and/or /proc/cpuinfo (and presumably doing some sanity checking) and then just directly use rdtsc, with no need to change either Xen or the kernel.
> >
> > That's exactly what the app is doing when on bare metal.
>
> Sure. And they still need to deal with discontinuities resulting from suspend/resume on bare-metal, so dealing with discontinuities caused by migration is no different. And on bare metal it would need to compute the tsc frequency for itself, so it doesn't really need Xen support for that either.

Yes, the discontinuities resulting from migration ARE different from suspend/resume on bare metal. If they weren't, I'd agree that no Xen support is needed.

> > But in virtual unless it gets some kind of notification on migration (which would be cool, but would also require kernel changes?), it can't determine the appropriate scaling factor and offset, or that they need to change.
>
> Well, they have access to xenbus, so they could get the tsc parameters that way, along with notifications about changes. It wouldn't be completely synchronous, but you'd need to implement the pvclock algorithm to do that.

An interesting thought, but unless we can guarantee that a xenbus notifier is always delivered and processed "immediately" (e.g. before any userland "pv-rdtsc instructions" are executed) it won't work.

Dan
>>> Keir Fraser <keir.fraser@eu.citrix.com> 17.09.09 21:03 >>>
>On 17/09/2009 18:58, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
>> A remaining hard problem is that this single "userland-accessible shared page" must be somehow made available to apps (I suggested a rdmsr emulated by Xen so that it works in userland) and must be mapped into the app address space without kernel changes. I think someone (Keir?) suggested this problem was solvable before we got sidetracked on the need-vcpu-number-in-userland problem.
>
>I don't think mapping things into application address space is really possible without guest kernel changes. The guest kernel owns and manages the pte that you'd be overwriting. Just blatting the pte would not be good form.

Unless they sit in Xen's virtual space.

Jan
On 18/09/2009 08:29, "Jan Beulich" <JBeulich@novell.com> wrote:

>> I don't think mapping things into application address space is really possible without guest kernel changes. The guest kernel owns and manages the pte that you'd be overwriting. Just blatting the pte would not be good form.
>
> Unless they sit in Xen's virtual space.

Oh yes, I remember we talked about that before. That is possible, but the design fell down on other points. I think guest kernel involvement, even if only as a kernel driver, should make this more tractable.

 -- Keir
Jeremy Fitzhardinge
2009-Sep-19 00:33 UTC
Re: [Xen-devel] Re: pvclock in userland (reprise)
On 09/18/09 01:06, Keir Fraser wrote:

> On 18/09/2009 08:29, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>>> I don't think mapping things into application address space is really possible without guest kernel changes. The guest kernel owns and manages the pte that you'd be overwriting. Just blatting the pte would not be good form.
>>
>> Unless they sit in Xen's virtual space.
>
> Oh yes, I remember we talked about that before. That is possible, but the design fell down on other points. I think guest kernel involvement, even if only as a kernel driver, should make this more tractable.

Xen's memory isn't mappable in a 32-bit compat domain, so you'd need to come up with something else there. Does Xen still claim the top part of the 32-bit address space?

    J
On 19/09/2009 01:33, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

>> Oh yes, I remember we talked about that before. That is possible, but the design fell down on other points. I think guest kernel involvement, even if only as a kernel driver, should make this more tractable.
>
> Xen's memory isn't mappable in a 32-bit compat domain, so you'd need to come up with something else there. Does Xen still claim the top part of the 32-bit address space?

Yes, it puts the M2P there.

 -- Keir
>>> Jeremy Fitzhardinge <jeremy@goop.org> 19.09.09 02:33 >>>
>On 09/18/09 01:06, Keir Fraser wrote:
>> On 18/09/2009 08:29, "Jan Beulich" <JBeulich@novell.com> wrote:
>>
>>>> I don't think mapping things into application address space is really possible without guest kernel changes. The guest kernel owns and manages the pte that you'd be overwriting. Just blatting the pte would not be good form.
>>>
>>> Unless they sit in Xen's virtual space.
>>
>> Oh yes, I remember we talked about that before. That is possible, but the design fell down on other points. I think guest kernel involvement, even if only as a kernel driver, should make this more tractable.
>
>Xen's memory isn't mappable in a 32-bit compat domain, so you'd need to come up with something else there.

Why not?

>Does Xen still claim the top part of the 32-bit address space?

Sure - currently just for the compat M2P table. I can't see why other things couldn't be mapped there (the compat M2P table is at most 128M in size, hence there's plenty of virtual space available).

Jan
Jeremy Fitzhardinge
2009-Sep-21 18:54 UTC
Re: [Xen-devel] Re: pvclock in userland (reprise)
On 09/21/09 01:20, Jan Beulich wrote:

>> Does Xen still claim the top part of the 32-bit address space?
>
> Sure - currently just for the compat M2P table. I can't see why other things couldn't be mapped there (the compat M2P table is at most 128M in size, hence there's plenty of virtual space available).

Yep, we could alias the pvclock info there, though I'm coming to like the idea of doing it via syscall/hypercall rather than having the guest explicitly get access to shared memory.

    J