Dan Magenheimer
2009-Sep-18 16:30 UTC
[Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
Xen doesn't appear to support the rdtscp instruction. Should it? (And specifically I'm wondering whether it should be emulated whenever rdtsc is emulated, but see below for another intriguing possibility.)

Rdtscp is unprivileged and we have apps that are using it on bare metal, after validating that the CPU supports it. The instruction is available on most (all?) recent AMD CPUs and Intel's Nehalem supports it.

For an OS to support rdtscp properly, the OS must (once at boot) wrmsr a different value for each cpu to a "TSC_AUX" register and this register is read along with the TSC when the rdtscp instruction is executed. This allows an app to determine if two consecutive rdtscp's are (or are not) executed on the same CPU.

It appears that all recent RHEL kernels write to TSC_AUX if the CPU supports rdtscp. I'm told Windows 2008 notably does not. Don't know about SLES or other Windoze.

It's not clear to me if/how rdtscp can/should be virtualized. To do it properly, the value written to the TSC_AUX msr would become part of the vcpu's state, and would need to be changed whenever a vcpu->pcpu mapping changes. To meet only the current use model of the instruction, Xen could write TSC_AUX for each pcpu on Xen boot and always ignore guest OS writes to TSC_AUX. (This assumes that no OS ever reads TSC_AUX and attempts to match it with the value that it thought it wrote to TSC_AUX; and assumes that

One solution is for Xen to deny the existence of rdtscp even when Xen is running on hardware that supports it. Is that exactly what is happening?

Now thinking creatively, could TSC_AUX be used similar to the pvclock version number... Xen bumps it whenever a migration occurs which would prompt an app to go out and reread new values for scaling and offset (possibly via specially-handled-by-Xen usermode rdmsr)? Hmmm... I think it might be the answer I've been looking for!
(Go ahead, shoot me down :-)

Dan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
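To make the "same CPU" use of TSC_AUX concrete, here is a minimal sketch in Python. It cannot execute the real instruction, so `rdtscp` is a hypothetical callable standing in for it that returns a `(tsc, tsc_aux)` pair; `elapsed_ticks_same_cpu` and `make_fake_rdtscp` are likewise names invented for this illustration, not anything from Xen or Linux.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical stand-in for the rdtscp instruction: returns (tsc, tsc_aux).
Rdtscp = Callable[[], Tuple[int, int]]


def elapsed_ticks_same_cpu(rdtscp: Rdtscp) -> Optional[int]:
    """Return the TSC delta between two reads, or None when the two
    reads reported different TSC_AUX values (i.e. different CPUs)."""
    tsc0, aux0 = rdtscp()
    tsc1, aux1 = rdtscp()
    if aux0 != aux1:
        # The thread moved between reads; the delta is not meaningful.
        return None
    return tsc1 - tsc0


def make_fake_rdtscp(samples: Iterable[Tuple[int, int]]) -> Rdtscp:
    """Build a fake rdtscp that replays pre-recorded (tsc, aux) samples."""
    it = iter(samples)
    return lambda: next(it)
```

With a fake that replays `(100, 3)` then `(160, 3)` the function reports a delta of 60 ticks; if the second sample carries a different TSC_AUX it returns `None` and the caller would simply retry.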
Dan Magenheimer
2009-Sep-18 20:27 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
OK, here's the long version (/me crosses fingers and hopes to get away from this for at least some of the weekend)...

Proposal ("pv rdtscp"):

The rdtscP instruction was added to the x86 architecture by AMD a couple of years ago and Intel added it starting at Nehalem. It is essentially the same as an rdtsc except in addition it copies the value of a privileged MSR "TSC_AUX" into the ECX register. There is a CPUID bit that can be checked to determine if the processor supports the rdtscp instruction. Xen currently does not expose hardware support for rdtscp to guests.

I propose to paravirtualize support for rdtscp as follows:

If guest vm.cfg has vrdtscp=0 (default):
  rdtscp is emulated and returns nsec since guest
  boot (same as emulated rdtsc), value returned
  for TSC_AUX is -1

If guest vm.cfg has vrdtscp=1:
  If underlying hardware has rdtscp support:
    rdtscp is directly executed by hardware,
    value returned for TSC_AUX is non-zero
    (see below)
  Else: (no hardware rdtscp support)
    rdtscp is emulated and returns nsec since
    guest boot, value returned for TSC_AUX is 0

How it works from the app point-of-view:

Guest app must have some capability of getting 64-bit pvclock parameters directly from Xen without OS changes, e.g. emulated userland wrmsr, userland hypercall, or userland mapped shared page. (This will be done rarely so need not be fast! But it does create a new userland<->Xen ABI that must be kept compatible.)

On first rdtscp, app records returned TSC_AUX value, verifies that it is neither 0 nor -1, fetches pvclock parameters from Xen, executes another rdtscp. If TSC_AUX matches previous value, app applies pvclock algorithm to tsc value to obtain nsec since guest boot. If TSC_AUX is zero or -1, tsc value IS nsec since guest boot. If TSC_AUX differs from last recorded value, fetch pvclock parameters from Xen again.

On subsequent rdtscp's, app compares returned TSC_AUX against the previous one, and fetches pvclock parameters from Xen only if it differs (which should be rare).

What Xen needs to do:

Xen must record the setting for each guest's vrdtscp config variable and ensure that it persists across save/restore and migration. If the guest has vrdtscp=1, a vrdtscp "version" number is also part of the guest's state and must persist across save/restore/migration.

Xen must know whether or not it is running on a machine where TSC is reliable. If TSC is NOT reliable AND rdtscp is supported by hardware, Xen must ensure that TSC_AUX is -1 on all pcpus that are running a guest with vrdtscp=0, and 0 on all pcpus that are running a guest where vrdtscp=1 (and must enable CR4.TSD on those pcpus if it wasn't already). If TSC is NOT reliable AND rdtscp is NOT supported by hardware, Xen must emulate rdtscp (e.g. return Xen system time) and emulate the same behavior for TSC_AUX. If TSC IS reliable, Xen sets TSC_AUX to the guest's vrdtscp version number on all pcpus that are running the guest. Finally, when a guest transitions from one "TSC domain" to another (restore/migrate/NUMA) it increments the vrdtscp version number.

I think this will work even for a NUMA machine provided Xen always schedules all the vcpus for one guest on pcpus in the same NUMA node, and increments the version number when the guest is rescheduled from one NUMA node to another (assuming TSC on each node is reliable).

I think this pv-rdtscp mechanism will work for both PV and HVM (with minor additional work in Xen for HVM); it will be very fast on any hardware that supports rdtscp in hardware (which for Intel only includes Nehalem+ but that provides even more incentive for customers to upgrade).

Apps that currently use rdtscp will continue to work (as long as they don't have some wild use model that I don't know about).

Pvclock algorithm in the OS would need to be changed to use rdtscp (instead of rdtsc) and check for TSC_AUX=0 to do the right thing. If not changed, it will continue to work but slower (whether or not rdtsc is emulated, because when emulated it returns the hardware TSC when the instruction was attempted in kernel mode).

The only problem I can see is that when vrdtscp==1, other apps that are running on that guest that use rdtsc (no p) directly (i.e. haven't been modified to use pv-rdtscp) will continue to have the same kinds of failure on save/restore/migration. But this is true of all the solutions proposed so far: Xen can only turn on emulation guest-wide, not per-app.

Also even on machines where TSC is reliable, there is a small chance that consecutive TSC values read will be from different processors and so TSC might appear to go backwards by some small amount. So apps must still put raw TSC values through a "monotonicity filter". (Xen already does this for emulated reads of TSC.)

Comments?
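The app-side flow in the proposal can be sketched roughly as follows, in Python. Everything here is an assumption made for illustration: `rdtscp` and `fetch_params` are hypothetical callables (the proposal leaves the userland<->Xen interface open), the field names mirror the usual pvclock scale/shift parameters, and -1 is represented as the 32-bit value 0xFFFFFFFF.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumed sentinels: TSC_AUX of 0 or -1 (as a 32-bit value) means "emulated".
AUX_EMULATED = (0, 0xFFFFFFFF)


@dataclass
class PvclockParams:
    tsc_timestamp: int      # TSC value when the parameters were sampled
    system_time_ns: int     # nsec since guest boot at that TSC value
    tsc_to_system_mul: int  # 32.32 fixed-point tick-to-nsec multiplier
    tsc_shift: int          # pre-scaling shift applied to the TSC delta


def pvclock_ns(tsc: int, p: PvclockParams) -> int:
    """Apply the pvclock scale-and-offset computation to a raw TSC value."""
    delta = tsc - p.tsc_timestamp
    if p.tsc_shift >= 0:
        delta <<= p.tsc_shift
    else:
        delta >>= -p.tsc_shift
    return p.system_time_ns + ((delta * p.tsc_to_system_mul) >> 32)


class PvRdtscpClock:
    """Caches pvclock parameters and refetches them only when TSC_AUX changes."""

    def __init__(self,
                 rdtscp: Callable[[], Tuple[int, int]],
                 fetch_params: Callable[[], PvclockParams]) -> None:
        self._rdtscp = rdtscp
        self._fetch = fetch_params
        self._aux: Optional[int] = None
        self._params: Optional[PvclockParams] = None

    def now_ns(self) -> int:
        tsc, aux = self._rdtscp()
        if aux in AUX_EMULATED:
            return tsc  # emulated: the value already is nsec since boot
        # Refetch until the TSC_AUX read after the fetch matches the one
        # recorded before it.
        while aux != self._aux or self._params is None:
            self._aux = aux
            self._params = self._fetch()
            tsc, aux = self._rdtscp()
            if aux in AUX_EMULATED:
                return tsc
        return pvclock_ns(tsc, self._params)
```

In the common case `now_ns()` costs one rdtscp plus a multiply and shift; the (presumably slow) parameter fetch happens only on the first call and after a version change.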
Jeremy Fitzhardinge
2009-Sep-18 22:55 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
On 09/18/09 13:27, Dan Magenheimer wrote:
> If guest vm.cfg has vrdtscp=0 (default):
>   rdtscp is emulated and returns nsec since guest
>   boot (same as emulated rdtsc), value returned
>   for TSC_AUX is -1
>
> If guest vm.cfg has vrdtscp=1:
>   If underlying hardware has rdtscp support:
>     rdtscp is directly executed by hardware,
>     value returned for TSC_AUX is non-zero
>     (see below)
>   Else: (no hardware rdtscp support)
>     rdtscp is emulated and returns nsec since
>     guest boot, value returned for TSC_AUX is 0
>

Why do you need to distinguish between the two emulated rdtscp cases? Special-casing a version of '0' is awkward because it would arise naturally from version wraparound (after 2^31 time parameter updates, but still).

If the hardware doesn't support rdtscp, how should an app know whether or not to use it? Should it just try running rdtscp being prepared to handle a SIGILL?

> How it works from the app point-of-view:
>
> Guest app must have some capability of getting 64-bit
> pvclock parameters directly from Xen without OS changes,
> e.g. emulated userland wrmsr, userland hypercall,
> or userland mapped shared page. (This will be done
> rarely so need not be fast! But it does create
> a new userland<->Xen ABI that must be kept compatible.)
>
> On first rdtscp, app records returned TSC_AUX value,
> verifies that it is neither 0 nor -1,
> fetches pvclock parameters from Xen, executes
> another rdtscp. If TSC_AUX matches previous value,
> app applies pvclock algorithm to tsc value to
> obtain nsec since guest boot. If TSC_AUX is
> zero or -1, tsc value IS nsec since guest boot.
> If TSC_AUX differs from last recorded value,
> fetch pvclock parameters from Xen again.
>
> On subsequent rdtscp's, app compares
> returned TSC_AUX against the previous one,
> and fetches pvclock parameters from Xen only
> if it differs (which should be rare).
>

Presumably the pvclock would contain the same version number which must match; if not it keeps iterating (rdtscp, get-timing-parameters) until they do.

> What Xen needs to do:
>
> Xen must record the setting for each guest's vrdtscp
> config variable and ensure that it persists across
> save/restore and migration. If the guest has
> vrdtscp=1, a vrdtscp "version" number is also
> part of the guest's state and must persist
> across save/restore/migration.
>
> Xen must know whether or not it is running on a
> machine where TSC is reliable. If TSC is NOT
> reliable AND rdtscp is supported by hardware,
> Xen must ensure that TSC_AUX is -1 on all pcpu's
> that are running a guest with vrdtscp=0, and 0
> on all pcpu's that are running a guest where
> vrdtscp=1 (and must enable CR4.TSD on those
> pcpus if it wasn't already).

If rdtscp is not reliable but Xen has accurate tsc parameter info, then the algorithm above will still work efficiently.

> If TSC is NOT
> reliable AND rdtscp is NOT supported by hardware,
> Xen must emulate rdtscp (e.g.
> return Xen system time) and emulate the
> same behavior for TSC_AUX. If TSC IS reliable,
> Xen sets TSC_AUX to the guest's vrdtscp version
> number on all pcpu's that are running the guest.
> Finally, when a guest transitions from one
> "TSC domain" to another (restore/migrate/NUMA)
> it increments the vrdtscp version number.
>

Well, it just needs to increment it whenever Xen knows the tsc has changed, as the current pvclock code does. It could be more frequently than restore/migrate if tsc changes on power events.

> The only problem I can see is that when
> vrdtscp==1, other apps that are running on that guest
> that use rdtsc (no p) directly (i.e. haven't been
> modified to use pv-rdtscp) will continue to
> have the same kinds of failure on save/restore/
> migration. But this is true of all the solutions
> proposed so far: Xen can only turn on emulation
> guest-wide, not per-app.
>

Linux already reserves rdtscp for use as part of vsyscall, where TSC_AUX contains the NUMA node and the CPU number, so there should be no "naked" users of rdtscp.

> Also even on machines where TSC is reliable,
> there is a small chance that consecutive
> TSC values read will be from different
> processors and so TSC might appear to go
> backwards by some small amount. So apps
> must still put raw TSC values through
> a "monotonicity filter". (Xen already
> does this for emulated reads of TSC.)
>

Why? I thought "reliable" tscs were supposed to be synced between cores?

J
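For reference, the Linux use of TSC_AUX mentioned above can be illustrated with a small Python sketch. The bit layout is an assumption based on how the Linux vgetcpu code of that era is commonly described (CPU number in the low 12 bits, NUMA node in the bits above); the function names are made up for this example.

```python
CPU_BITS = 12                    # assumed width of the CPU-number field
CPU_MASK = (1 << CPU_BITS) - 1


def encode_tsc_aux(node: int, cpu: int) -> int:
    """Pack a NUMA node and CPU number into a 32-bit TSC_AUX value."""
    if not 0 <= cpu <= CPU_MASK:
        raise ValueError("cpu number does not fit in the CPU field")
    return ((node << CPU_BITS) | cpu) & 0xFFFFFFFF


def decode_tsc_aux(aux: int) -> tuple:
    """Split a TSC_AUX value back into (node, cpu)."""
    return aux >> CPU_BITS, aux & CPU_MASK
```

If a guest kernel relies on a layout like this, any scheme in which Xen overrides TSC_AUX with a version number would change what such a vsyscall reports, which is worth keeping in mind when reading the proposal.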
Dan Magenheimer
2009-Sep-19 15:34 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
> Why do you need to distinguish between the two emulated rdtscp cases?
> Special-casing a version of '0' is awkward because it would arise
> naturally from version wraparound (after 2^31 time parameter updates,
> but still).

You're right, I don't need to differentiate between the two emulated cases. I was trying to overload an extra piece of information that I really don't need to overload.

However, I do need one special case to indicate emulation vs non-emulation, so wraparound is still a problem.

Fortunately, wraparound should only occur impossibly rarely (see below), probably less frequently than TSC wraparound.

> If the hardware doesn't support rdtscp, how should an app know whether
> or not to use it? Should it just try running rdtscp being prepared to
> handle a SIGILL?

Yes, that's the plan. I think this scheme always works, but only works fast if the hardware supports rdtscp and constant_tsc.

> If rdtscp is not reliable but Xen has accurate tsc parameter
> info, then
> the algorithm above will still work efficiently.
> :
> Well, it just needs to increment it whenever Xen knows the tsc has
> changed, as the current pvclock code does. It could be more
> frequently
> than restore/migrate if tsc changes on power events.

I've restricted the scheme to constant_tsc as I think it breaks down due to nasty races if running on a machine where the pvclock parameters differ across different pcpus. I think the races can only be avoided if Xen sets the TSC_AUX for all of the pcpus running a pvrdtscp domain while all are idle. Is there a scheme that avoids the races?

Fortunately, this also has the effect of greatly reducing the version increase frequency.

> > Also even on machines where TSC is reliable,
> > there is a small chance that consecutive
> > TSC values read will be from different
> > processors and so TSC might appear to go
> > backwards by some small amount. So apps
> > must still put raw TSC values through
> > a "monotonicity filter". (Xen already
> > does this for emulated reads of TSC.)
>
> Why? I thought "reliable" tscs were supposed to be synced
> between cores?

The rate is synced but the values may not be. Since software (BIOS or Xen) sets tsc on each processor it is essentially impossible to ensure they are identical. The rendezvous algorithm should be able to set them so that they are "unobservably" different, but I keep hearing "within 2usec". (It would be interesting to measure this across a broad set of machines.) So it's probably prudent to recommend that apps be prepared for the possibility even if it never happens.

Dan
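A "monotonicity filter" of the kind recommended above can be as small as the following Python sketch. The class name is invented for illustration, and this single-threaded version ignores the locking or atomic compare-and-swap a real multi-threaded app would need.

```python
class MonotonicFilter:
    """Clamp a stream of timestamps so it never appears to go backwards."""

    def __init__(self) -> None:
        self._last = 0

    def filter(self, value: int) -> int:
        # A reading that is slightly behind (e.g. taken on a CPU whose TSC
        # is a little behind the previous one) is replaced by the last
        # value handed out.
        if value > self._last:
            self._last = value
        return self._last
```

Feeding it 100, 105, 103, 110 yields 100, 105, 105, 110: the small backwards step is hidden at the cost of time appearing to stand still for one reading.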
Jan Beulich
2009-Sep-21 08:17 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 18.09.09 22:27 >>>
> Guest app must have some capability of getting 64-bit
> pvclock parameters directly from Xen without OS changes,
> e.g. emulated userland wrmsr, userland hypercall,
> or userland mapped shared page. (This will be done
> rarely so need not be fast! But it does create
> a new userland<->Xen ABI that must be kept compatible.)

Are you sure this will indeed be infrequent enough? On my supposedly constant-TSC AMD box, I see Xen quite frequently apply small error correction factors to keep TSC from running ahead of HPET/PMTIMER.

> I think this will work even for a NUMA machine
> provided Xen always schedules all the vcpus
> for one guest on pcpus in the same NUMA node,
> and increments the version number when
> the guest is rescheduled from one NUMA node to
> another (assuming TSC on each node is reliable).

I think this is an improper assumption: Any guest with more vCPU-s than there are pCPU-s on a single node will likely benefit from being run on two (or more) nodes (compared to its vCPU-s competing amongst themselves for pCPU-s on a single node).

Jan
Dan Magenheimer
2009-Sep-21 14:04 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
Hi Jan --

Thanks for the feedback!

> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 18.09.09 22:27 >>>
> >Guest app must have some capability of getting 64-bit
> >pvclock parameters directly from Xen without OS changes,
> >e.g. emulated userland wrmsr, userland hypercall,
> >or userland mapped shared page. (This will be done
> >rarely so need not be fast! But it does create
> >a new userland<->Xen ABI that must be kept compatible.)
>
> Are you sure this will indeed be infrequent enough? On my supposedly
> constant-TSC AMD box, I see Xen quite frequently apply small error
> correction factors to keep TSC from running ahead of HPET/PMTIMER.

I'd like to hear from Keir on this, but I'd guess that this would be either a bug or a remnant of or inaccuracy in an old algorithm.

Also if you could provide more information, I'd like to see if I can reproduce it on my Intel constant_tsc machines.

> >I think this will work even for a NUMA machine
> >provided Xen always schedules all the vcpus
> >for one guest on pcpus in the same NUMA node,
> >and increments the version number when
> >the guest is rescheduled from one NUMA node to
> >another (assuming TSC on each node is reliable).
>
> I think this is an improper assumption: Any guest with more
> vCPU-s than
> there are pCPU-s on a single node will likely benefit from
> being run on
> two (or more) nodes (compared to its vCPU-s competing amongst
> themselves for pCPU-s on a single node).

Any guest that has some vcpus running on pcpus in one "TSC domain" and other vcpus running on pcpus in another "TSC domain" would have to be handled the same as running on a machine with tsc_NOT_constant.

This does raise a challenge for multi-socket machines: Xen has to be able to determine and record which pcpus are within a TSC domain boundary, which may or may not be the same as a NUMA boundary.
Jan Beulich
2009-Sep-21 14:18 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 21.09.09 16:04 >>>
>> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 18.09.09 22:27 >>>
>> >Guest app must have some capability of getting 64-bit
>> >pvclock parameters directly from Xen without OS changes,
>> >e.g. emulated userland wrmsr, userland hypercall,
>> >or userland mapped shared page. (This will be done
>> >rarely so need not be fast! But it does create
>> >a new userland<->Xen ABI that must be kept compatible.)
>>
>> Are you sure this will indeed be infrequent enough? On my supposedly
>> constant-TSC AMD box, I see Xen quite frequently apply small error
>> correction factors to keep TSC from running ahead of HPET/PMTIMER.
>
>I'd like to hear from Keir on this, but I'd
>guess that this would be either a bug or a
>remnant of or inaccuracy in an old algorithm.
>
>Also if you could provide more information, I'd
>like to see if I can reproduce it on my Intel
>constant_tsc machines.

Not sure what further detail you mean - all you would want to look for are cases where error_factor is non-zero in local_time_calibration() (or local time getting warped forward in the same function). I can only say for sure that the former does happen not infrequently in terms of the percentage of executions of local_time_calibration(); of course, that function itself doesn't run very frequently.

Jan
Dan Magenheimer
2009-Sep-21 14:47 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
> > Why do you need to distinguish between the two emulated
> rdtscp cases?
> > Special-casing a version of '0' is awkward because it would arise
> > naturally from version wraparound (after 2^31 time
> parameter updates,
> > but still).
>
> You're right, I don't need to differentiate between
> the two emulated cases. I was trying to overload
> an extra piece of information that I really don't
> need to overload.
>
> However, I do need one special case to indicate
> emulation vs non-emulation, so wraparound is
> still a problem.
>
> Fortunately, wraparound should only occur impossibly
> rarely (see below), probably less frequently than
> TSC wraparound.

I realized later that since Xen controls the values placed in TSC_AUX, it can easily skip any special-cased values. Then wraparound is not a problem as long as the app tests for "version number is different" rather than "version number is greater."
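The "skip the special values" idea can be sketched in a few lines of Python. The reserved set (0 and the 32-bit representation of -1) follows the proposal earlier in the thread; the function names and the 32-bit width are assumptions for illustration.

```python
AUX_MASK = 0xFFFFFFFF                    # TSC_AUX is assumed to be 32 bits
RESERVED = frozenset({0, 0xFFFFFFFF})    # values that signal "emulated"


def next_version(current: int) -> int:
    """Advance the version number, wrapping at 32 bits and skipping
    any value reserved as a special-case marker."""
    nxt = (current + 1) & AUX_MASK
    while nxt in RESERVED:
        nxt = (nxt + 1) & AUX_MASK
    return nxt


def params_stale(seen_version: int, current_version: int) -> bool:
    """App-side test: compare for inequality, never for 'greater than',
    so a wrapped version number is still detected as a change."""
    return seen_version != current_version
```

Here `next_version(0xFFFFFFFE)` wraps straight to 1, and an app that cached version 0xFFFFFFFE still notices the change because it only checks for inequality.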
Dan Magenheimer
2009-Sep-21 15:25 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
> >> Are you sure this will indeed be infrequent enough? On my
> supposedly
> >> constant-TSC AMD box, I see Xen quite frequently apply small error
> >> correction factors to keep TSC from running ahead of HPET/PMTIMER.
> >
> >I'd like to hear from Keir on this, but I'd
> >guess that this would be either a bug or a
> >remnant of or inaccuracy in an old algorithm.
> >
> >Also if you could provide more information, I'd
> >like to see if I can reproduce it on my Intel
> >constant_tsc machines.
>
> Not sure what further detail you mean - all that it is you
> would want to
> look for are cases where error_factor is non-zero in
> local_time_calibration() (or local time getting warped forward in the
> same function; but I can only say for sure that the former does happen
> not infrequently in terms of the percentage of executions of
> local_time_calibration() - of course, that function itself doesn't run
> very frequently).

OK, I think I see the problem.

Since cs19506, "consistent_tscs" is a Xen boot parameter that defaults to disabled. If the boot parameter is enabled and the boot cpu does NOT have X86_FEATURE_CONSTANT_TSC set, consistent_tscs gets re-disabled.

For my pvrdtscp scheme to work, consistent_tscs would need to be changed so that it defaults to enabled.

Jan, could you confirm that this solves the problem on your constant-TSC AMD box?

Keir, is there any reason that consistent_tscs shouldn't default to enabled?

Thanks,
Dan
Keir Fraser
2009-Sep-21 15:41 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
On 21/09/2009 16:25, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> OK, I think I see the problem.
>
> Since cs19506 "consistent_tscs" is a Xen boot parameter that
> defaults to disabled. If the boot parameter is enabled and
> the boot cpu does NOT have X86_FEATURE_CONSTANT_TSC set,
> consistent_tscs gets re-disabled.
>
> For my pvrdtscp scheme to work, consistent_tscs would need to
> be changed so that it defaults to enabled.
>
> Jan, could you confirm that this solves the problem on your
> constant-TSC AMD box?
>
> Keir, is there any reason that consistent_tscs shouldn't
> default to enabled?

There was some question whether it means what you think it means across NUMA nodes. If we are sure that it does guarantee consistency across NUMA nodes -- or if we don't care about that -- then it can be enabled by default.

-- Keir
Keir Fraser
2009-Sep-21 15:53 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
On 21/09/2009 16:41, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>> Keir, is there any reason that consistent_tscs shouldn't
>> default to enabled?
>
> There was some question whether it means what you think it means across NUMA
> nodes. If we are sure that it does guarantee consistency across NUMA nodes
> -- or if we don't care about that -- then it can be enabled by default.

There is a question mark over this, since it's not really clear what the CONSTANT_TSC feature flag actually means. For example, it is set if CPUID:0x80000007:EDX:8 is set, and that flag merely means that this particular core's TSC rate is invariant across all Cx/Px/Tx power-saving states. It doesn't directly say anything about TSC consistency across cores or sockets unless we are prepared to assume a couple of things: primarily that all packages run their TSCs at the same rate, and that they are clocked from the same mainboard oscillator. Is that reasonable to assume? We at least know the latter is not likely to be true for big-iron NUMA systems, across NUMA nodes.

-- Keir
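The CPUID test being discussed is a single bit check. As a small Python sketch (Python cannot execute CPUID itself, so the caller is assumed to supply the EDX value already read from leaf 0x80000007; the function name is hypothetical):

```python
INVARIANT_TSC_BIT = 8  # CPUID leaf 0x80000007, EDX bit 8


def has_invariant_tsc(edx_leaf_80000007: int) -> bool:
    """True if the EDX value from CPUID leaf 0x80000007 has bit 8 set.

    As noted above, this only describes the TSC rate of the core that
    executed CPUID; it says nothing by itself about agreement between
    cores, sockets or NUMA nodes.
    """
    return bool((edx_leaf_80000007 >> INVARIANT_TSC_BIT) & 1)
```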
Jan Beulich
2009-Sep-21 16:03 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 21.09.09 17:25 >>>
>Jan, could you confirm that this solves the problem on your
>constant-TSC AMD box?

Based on Keir's responses I don't think there's a point trying.

Jan
Dan Magenheimer
2009-Sep-21 16:55 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
> >> Keir, is there any reason that consistent_tscs shouldn't
> >> default to enabled?
>
> There is a question mark over this, since it's not really
> clear what the
> CONSTANT_TSC feature flag actually means. For example, it is set if
> CPUID:0x80000007:EDX:8 is set, and that flag merely means that this
> particular core's TSC rate is invariant across all Cx/Px/Tx
> power-saving
> states. It doesn't directly say anything about TSC
> consistency across cores
> or sockets unless we are prepared to assume a couple of
> things: primarily
> that all packages run their TSCs at the same rate, and that
> they are clocked
> from the same mainboard oscillator. Is that reasonable to
> assume? We at
> least know the latter is not likely to be true for big-iron
> NUMA systems,
> across NUMA nodes.

Both Intel and AMD have confirmed that constant_tsc means that TSC is consistent across all cores and even across multiple sockets; and at least one major system vendor (HP) with multi-enclosure "big iron" AMD-based NUMA systems has confirmed that TSC is consistent across all nodes. So by applying the Xen rendezvous-sync algorithm (that writes tsc every second) on such machines, Xen has actually been creating a tsc-sync problem, not alleviating one!

I've cc'ed key AMD/Intel/HP experts who can confirm or correct/clarify any misassumptions I might have.

I *think* "CPU reports tsc_is_constant but it's not really constant across all sockets/enclosures/nodes" does exist, but may be limited to a few older exceptions such as IBM Summit systems. Upstream Linux now assumes that constant_tsc applies across all CPUs unless the kernel is compiled with CONFIG_X86_NUMAQ (note NOT CONFIG_X86_NUMA), so Linux has now embraced constant_tsc.

So I'm thinking we should treat consistent_tscs as the rule rather than the exception, and place the onus on "broken" systems to disable consistent_tscs with the boot option when necessary.

To be extremely safe, we could also add some code in time_calibration_std_rendezvous() to check for "significant" tsc differences and report it (and maybe even auto-disable consistent_tscs).

(One minor correction also: constant_tsc does NOT guarantee tsc continues to increment across deep C-states... that requires nonstop_tsc. But Xen already has the logic to correct deep C-states in cstate_restore_tsc().)

Dan
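The suggested safety check could look something like the Python sketch below. It is not Xen's rendezvous code: the per-cpu samples are assumed to have been taken as close to simultaneously as the rendezvous allows, and the function names and the threshold are hypothetical.

```python
from typing import Mapping


def tsc_spread(samples: Mapping[int, int]) -> int:
    """Largest difference between any two per-cpu TSC samples,
    keyed by cpu number."""
    values = samples.values()
    return max(values) - min(values)


def tscs_look_consistent(samples: Mapping[int, int],
                         threshold_ticks: int) -> bool:
    """Return False when the spread exceeds the threshold, in which case
    the caller might log a warning and fall back to treating the TSCs
    as unsynchronized."""
    return tsc_spread(samples) <= threshold_ticks
```

Choosing the threshold is the hard part: it has to be larger than the sampling jitter of the rendezvous itself, otherwise the check would disable consistent_tscs on perfectly healthy machines.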
Keir Fraser
2009-Sep-21 17:02 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
On 21/09/2009 17:55, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> Both Intel and AMD have confirmed that constant_tsc means
> that TSC is consistent across all cores and even across
> multiple sockets; and at least one major system vendor (HP)
> with multi-enclosure "big iron" AMD-based NUMA systems has
> confirmed that TSC is consistent across all nodes. So
> by applying the Xen rendezvous-sync algorithm (that writes
> tsc every second) on such machines, Xen has actually been
> creating a tsc-sync problem, not alleviating one!

Constant_tsc is not even directly a hardware flag. It's a synthetic value that Linux derives for itself and we inherited. Are vendors really making guarantees about a flag which they do not directly provide?

-- Keir
Dan Magenheimer
2009-Sep-21 17:56 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
> Are vendors really making guarantees about a flag
> which they do not directly provide?

Sorry, I was overly terse and had lost some of my context due to a machine crash over the weekend.

By constant_tsc I mean that CPUID:0x80000007:EDX:8 is set. Upstream Linux (2.6.30) now uses the term X86_FEATURE_TSC_RELIABLE to indicate that tsc is consistent across cores and sockets, X86_FEATURE_NONSTOP_TSC to indicate that it doesn't stop in deep C-states (which Xen compensates for), and X86_FEATURE_CONSTANT_TSC to indicate that it stays running across P/T state transitions. On Intel systems, CPUID:0x80000007:EDX:8 enables all of these feature flags. (Interestingly, on AMD systems, X86_FEATURE_TSC_RELIABLE is *not* set by this bit... so my information from AMD is not represented in Linux (yet)). Note also that in linux-2.6.30/arch/x86/kernel/cpu/vmware.c, both X86_FEATURE_CONSTANT_TSC and X86_FEATURE_TSC_RELIABLE get set.

Some of this is explained nicely here:

http://lkml.indiana.edu/hypermail/linux/kernel/0811.2/00837.html
https://lists.ubuntu.com/archives/kernel-team/2008-October/004279.html
https://lists.ubuntu.com/archives/kernel-team/2008-October/004282.html

(This last one also reinforces my answer to Jeremy as to why users of the proposed pvrdtscp interface would still need to post-filter rdtscp values to guarantee no time-going-backwards problems.)

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Monday, September 21, 2009 11:02 AM
> To: Dan Magenheimer; Jan Beulich
> Cc: JeremyFitzhardinge; Xen-Devel (E-mail); Kurt Hackel; Langsdorf,
> Mark; Nakajima, Jun; Alex Williamson
> Subject: Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer
> I've been looking for)
>
> On 21/09/2009 17:55, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
> > Both Intel and AMD have confirmed that constant_tsc means
> > that TSC is consistent across all cores and even across
> > multiple sockets; and at least one major system vendor (HP)
> > with multi-enclosure "big iron" AMD-based NUMA systems has
> > confirmed that TSC is consistent across all nodes. So
> > by applying the Xen rendezvous-sync algorithm (that writes
> > tsc every second) on such machines, Xen has actually been
> > creating a tsc-sync problem, not alleviating one!
>
> Constant_tsc is not even directly a hardware flag. It's a synthetic value
> that Linux derives for itself and we inherited. Are vendors really making
> guarantees about a flag which they do not directly provide?
>
> -- Keir
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
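For readers who want to check the bit referred to above from userland, here is a minimal sketch. It assumes a GCC or Clang toolchain that ships `<cpuid.h>` with `__get_cpuid()`; the helper names `cpuid_bit`, `cpu_has_invariant_tsc` and `cpu_has_rdtscp` are made up for illustration. The rdtscp check uses CPUID:0x80000001:EDX:27, which is the architectural feature bit for that instruction. On non-x86 builds both probes simply report false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (toolchain assumption) */
#endif

/* Test one bit of a 32-bit CPUID output register. */
static bool cpuid_bit(uint32_t reg, unsigned bit)
{
    assert(bit < 32);
    return ((reg >> bit) & 1u) != 0;
}

/* Read EDX of an extended CPUID leaf; returns false if the leaf is absent. */
static bool cpuid_ext_edx(uint32_t leaf, uint32_t *edx_out)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(leaf, &eax, &ebx, &ecx, &edx))
        return false;
    *edx_out = edx;
    return true;
#else
    (void)leaf;
    (void)edx_out;
    return false;
#endif
}

/* CPUID:0x80000007:EDX:8, the bit called constant_tsc in this thread. */
static bool cpu_has_invariant_tsc(void)
{
    uint32_t edx;
    return cpuid_ext_edx(0x80000007u, &edx) && cpuid_bit(edx, 8);
}

/* CPUID:0x80000001:EDX:27, the rdtscp feature bit. */
static bool cpu_has_rdtscp(void)
{
    uint32_t edx;
    return cpuid_ext_edx(0x80000001u, &edx) && cpuid_bit(edx, 27);
}
```

Note that this only reports what CPUID says to the caller; under a hypervisor the guest-visible CPUID may differ from the hardware's.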
Keir Fraser
2009-Sep-21 18:17 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
On 21/09/2009 18:56, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> By constant_tsc I mean that CPUID:0x80000007:EDX:8
> is set.

Well, if it is at least true for 99% of systems, then it might be worth enabling constant_tsc support by default, detecting TSC divergence at runtime, and disabling it dynamically. I think that's what Linux does (i.e., it has a fallback at runtime if its TSC assumptions turn out to be wrong).

 -- Keir

> Upstream Linux (2.6.30) now uses the term
> X86_FEATURE_TSC_RELIABLE to indicate that tsc is
> consistent across cores and sockets and
> X86_FEATURE_NONSTOP_TSC to indicate that it
> doesn't stop in deep C-states (which Xen compensates
> for) and X86_FEATURE_CONSTANT_TSC to indicate that
> it stays running across P/T state transitions.
> On Intel systems, CPUID:0x80000007:EDX:8 enables
> all of these feature flags. (Interestingly, on
> AMD systems, X86_FEATURE_TSC_RELIABLE is *not*
> set by this bit... so my information from AMD is
> not represented in Linux (yet)). Note also that
> in linux-2.6.30/arch/x86/kernel/cpu/vmware.c, both
> X86_FEATURE_CONSTANT_TSC and X86_FEATURE_TSC_RELIABLE
> get set.
Jeremy Fitzhardinge
2009-Sep-21 18:36 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
On 09/19/09 08:34, Dan Magenheimer wrote:
> You're right, I don't need to differentiate between
> the two emulated cases. I was trying to overload
> an extra piece of information that I really don't
> need to overload.
>
> However, I do need one special case to indicate
> emulation vs non-emulation, so wraparound is
> still a problem.

I was assuming you'd just repurpose the existing version number scheme, which is always even and therefore can never equal -1.

>> If the hardware doesn't support rdtscp, how should an app know whether
>> or not to use it? Should it just try running rdtscp being prepared to
>> handle a SIGILL?
>
> Yes, that's the plan. I think this scheme always
> works, but only works fast if the hardware supports
> rdtscp and constant_tsc

What's the full algorithm for detecting this feature? Usermode has to establish:

   1. It is running under Xen (or not, if you expect this to be implemented on multiple hypervisors)
   2. rdtscp is available
   3. the ABI is actually being implemented, ie:
         1. the tsc_aux value actually has the correct meaning
         2. it has a working mechanism for getting the tsc scaling parameters
         3. (accommodate ways to evolve the ABI in a back-compatible way)

before it can do anything else.

If nothing else, it's probably worth removing the rdtscp feature from the logical guest cpuid, so that nothing else tries to use it for its own purposes; in other words, you're exclusively claiming rdtscp for this ABI. Or you could disable this ABI if a guest kernel tries to set TSC_AUX.

> I've restricted the scheme to constant_tsc as I think
> it breaks down due to nasty races if running on a
> machine where the pvclock parameters differ across
> different pcpus. I think the races can only be
> avoided if Xen sets the TSC_AUX for all of the
> pcpus running a pvrdtscp domain while all are idle.
>
> Is there a scheme that avoids the races?

rdtscp makes it quite easy to avoid races because you get the tsc and metadata about the tsc atomically. You just need to encode enough info in the metadata to do the conversion.

The obvious thing to do is to pack a version number and pcpu number into TSC_AUX. Usermode would maintain an array of pv_clock parameters, one for each pcpu. If the version number matches, then it uses the parameters it has; if not it fetches new parameters and repeats the rdtscp. There's no need to worry about either thread or vcpu context switches because you get the (tsc,params) tuple atomically, which is the tricky bit without rdtscp.

(The version number would be truncated wrt the normal pvclock version number, but it just needs to be large enough to avoid aliasing from wrapping; I'm assuming something like 24 bits version and 8 bits cpu number.)

> Fortunately, this also has the effect of greatly
> reducing the version increase frequency.

I don't think that's going to be a huge issue; fetching time parameters with a syscall/hypercall would be on the same order as doing an emulated rdtsc, and would only need to happen, say, once per timeslice (100Hz?) at the outside.

> The rate is synced but the values may not be. Since
> software (BIOS or Xen) sets tsc on each processor
> it is essentially impossible to ensure they are
> identical. The rendezvous algorithm should be able
> to set them so that they are "unobservably" different,
> but I keep hearing "within 2usec". (It would be
> interesting to measure this across a broad set
> of machines.) So it's probably prudent to recommend
> that apps be prepared for the possibility even if
> it never happens.

You don't need to guarantee anything stronger than they'd see on bare hardware. You also need to be more precise about exactly what you're guaranteeing.

Are you saying that a single thread will never see regressing tscs? That just requires making sure that Xen gets the tscs synced closer than the context switch time of a thread between cpus, which should be possible.

Or are you making the stronger guarantee that two threads running concurrently on different cpus doing rdtsc will see monotonically increasing tscs with respect to the ordering of all their operations? That would require arbitrarily close syncing (well, within the time it takes a cacheline to bounce I guess).

    J
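The 24-bit version plus 8-bit cpu split suggested above could be sketched as follows. The message gives only the field widths; putting the cpu number in the top byte, and the names `pv_aux_pack`, `pv_aux_cpu` and `pv_aux_version`, are assumptions made for illustration, not an existing Xen interface.

```c
#include <assert.h>
#include <stdint.h>

#define PV_VERSION_BITS 24u
#define PV_VERSION_MASK ((1u << PV_VERSION_BITS) - 1u)

/* Hypervisor side (hypothetical): build the 32-bit TSC_AUX value.
 * The version is truncated to its low 24 bits, as the message describes. */
static uint32_t pv_aux_pack(uint32_t pcpu, uint32_t version)
{
    assert(pcpu < 256u);   /* only 8 bits are available for the cpu number */
    return (pcpu << PV_VERSION_BITS) | (version & PV_VERSION_MASK);
}

/* Usermode side: split the value returned by rdtscp. */
static uint32_t pv_aux_cpu(uint32_t aux)
{
    return aux >> PV_VERSION_BITS;
}

static uint32_t pv_aux_version(uint32_t aux)
{
    return aux & PV_VERSION_MASK;
}
```

With this layout a usermode reader compares `pv_aux_version(aux)` against the version stored with its cached parameters for `pv_aux_cpu(aux)`, and refetches on a mismatch.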
Dan Magenheimer
2009-Sep-21 21:47 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
> > By constant_tsc I mean that CPUID:0x80000007:EDX:8
> > is set.
>
> Well, if it is at least true for 99% of systems, then it
> might be worth

Well I'm not sure how to count, but I'd venture to guess that close to 99% of servers out there (that are new enough to have this CPUID bit set) are single socket. So as long as constant_tsc applies across all cores in a socket, your 99% test applies. But according to Intel and AMD, it should also apply across multiple sockets, and according to HP, it applies on one big NUMA machine even across enclosures.

> enabling constant_tsc support by default, and detect TSC divergence at
> runtime and disable dynamically. I think that's what Linux
> does (i.e., it
> has a fallback at runtime if its TSC assumptions turn out to
> be wrong).

Indeed Linux does, and the code looks easy to leverage. See arch/x86/kernel/tsc_sync.c where check_tsc_sync_* is defined, used by start_secondary() and native_cpu_up() in arch/x86/kernel/smpboot.c.

It may actually test too aggressively for TSC reliability, as it can fail if TSCs differ by more than "a cacheline bounce", which is a lot more restrictive than Xen cares about (or any userland algorithm that post-processes for monotonicity).

In fact, Linux no longer does any kind of write_tsc(0) at processor boot but apparently instead assumes that the BIOS has done the synchronization. I don't know if/how the BIOS could do a better job than Xen's current rendezvous algorithm, but if it does, Xen's code may not only be superfluous but also making problems worse. We should probably test for divergence and only write_tsc if the test fails?

P.S. I was looking at 2.6.30 and 2.6.31, though it looks like check_tsc_sync has been around since at least 2.6.24: http://lwn.net/Articles/211051/
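The userland post-processing for monotonicity mentioned above can be as small as a shared "largest value seen so far" filter. This is a sketch only, using C11 atomics; the name `tsc_monotonic_filter` is made up, and a real application might keep one such filter per clock rather than a single global.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Largest timestamp handed out so far, shared by all threads. */
static _Atomic uint64_t last_seen_tsc;

/* Return 'raw' if it moves time forward, otherwise the previous maximum.
 * Callers therefore never observe a value smaller than one already returned. */
static uint64_t tsc_monotonic_filter(uint64_t raw)
{
    uint64_t prev = atomic_load(&last_seen_tsc);

    while (raw > prev) {
        /* On failure the compare-exchange reloads 'prev' with the current
         * maximum, so the loop re-checks against the newer value. */
        if (atomic_compare_exchange_weak(&last_seen_tsc, &prev, raw))
            return raw;
    }
    return prev;
}
```

The cost is one shared cacheline touched on every read, which is the price of hiding small cross-cpu tsc skew from the application.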
Dan Magenheimer
2009-Sep-21 22:20 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
> > However, I do need one special case to indicate
> > emulation vs non-emulation, so wraparound is
> > still a problem.
>
> I was assuming you'd just repurpose the existing version number scheme
> which is always even, and therefore can never equal -1.

That wasn't my plan but if it can be made to work (see below), it probably saves code in Xen.

> What's the full algorithm for detecting this feature? Usermode has to
> establish:
>
>    1. It is running under Xen (or not, if you expect this to be
>       implemented on multiple hypervisors)
>    2. rdtscp is available
>    3. the ABI is actually being implemented, ie:
>          1. the tsc_aux value actually has the correct meaning
>          2. it has a working mechanism for getting the tsc scaling
>             parameters
>          3. (accommodate ways to evolve the ABI in a
>             back-compatible way)
>
> before it can do anything else.

Yes, that's what I was thinking. I was planning on prototyping these checks with "userland-rdmsr" but userland-hypercall or userland-shared-page could work also.

> If nothing else, it's probably worth removing the rdtscp
> feature from the
> logical guest cpuid, so that nothing else tries to use it for its own
> purposes; in other words, you're exclusively claiming rdtscp for this
> ABI. Or you could disable this ABI if a guest kernel tries
> to set TSC_AUX.

I was thinking that setting pvrdtscp=1 would override any kernel use of rdtscp/TSC_AUX, but disabling the cpuid has_rdtscp flag and using a different userland detection mechanism (than checking cpuid for has_rdtscp) would be a better way to avoid possible conflict.

> > I've restricted the scheme to constant_tsc as I think
> > it breaks down due to nasty races if running on a
> > machine where the pvclock parameters differ across
> > different pcpus. I think the races can only be
> > avoided if Xen sets the TSC_AUX for all of the
> > pcpus running a pvrdtscp domain while all are idle.
> >
> > Is there a scheme that avoids the races?
>
> rdtscp makes it quite easy to avoid races because you get the tsc and
> metadata about the tsc atomically. You just need to encode
> enough info
> in the metadata to do the conversion.

Yes but I don't think there are enough bits for encoding it all (32 bits in TSC_AUX, right?).

> The obvious thing to do is to pack a version number and pcpu
> number into
> TSC_AUX. Usermode would maintain an array of pv_clock parameters, one
> for each pcpu. If the version number matches, then it uses the
> parameters it has; if not it fetches new parameters and repeats the
> rdtscp. There's no need to worry about either thread or vcpu context
> switches because you get the (tsc,params) tuple atomically,
> which is the
> tricky bit without rdtscp.
>
> (The version number would be truncated wrt the normal pvclock version
> number, but it just needs to be large enough to avoid aliasing from
> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
> number.)

I think a race occurs if the vcpu switches pcpu TWICE from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp each time on pcpu-A but reads one or more pvclock parameters (that are too big to be encoded in TSC_AUX) on pcpu-B. If Xen can atomically bump/change TSC_AUX on *all* pcpus running a guest vcpu, the race can be avoided. But I suspect that is too expensive (some kind of rendezvous required for each bump on any processor).

> > Fortunately, this also has the effect of greatly
> > reducing the version increase frequency.
>
> I don't think that's going to be a huge issue; fetching time
> parameters
> with a syscall/hypercall would be on the same order as doing
> an emulated
> rdtsc, and would only need to happen, say, once per timeslice (100Hz?)
> at the outside.

Even if my assumption of the race (above) is incorrect, 32 bits is not very much time at 100Hz. But the version bump needs to occur synchronously with every P/C-state transition for pvclock to work on non_constant_tsc machines, doesn't it? How frequently can those transitions occur?

> > The rate is synced but the values may not be. Since
> > software (BIOS or Xen) sets tsc on each processor
> > it is essentially impossible to ensure they are
> > identical. The rendezvous algorithm should be able
> > to set them so that they are "unobservably" different,
> > but I keep hearing "within 2usec". (It would be
> > interesting to measure this across a broad set
> > of machines.) So it's probably prudent to recommend
> > that apps be prepared for the possibility even if
> > it never happens.
>
> You don't need to guarantee anything stronger than they'd see on bare
> hardware. You also need to be more precise about exactly what you're
> guaranteeing.
>
> Are you saying that a single thread will never see regressing tscs?
> That just requires making sure that Xen gets the tscs synced
> closer than
> the context switch time of a thread between cpus, which
> should be possible.
>
> Or are you making the stronger guarantee that two threads running
> concurrently on different cpus doing rdtsc will see monotonically
> increasing tscs with respect to the ordering of all their operations?
> That would require arbitrarily close syncing (well, within
> the time it
> takes a cacheline to bounce I guess).

I guess this all depends on what Xen is capable of guaranteeing. If Xen can provide a "cacheline bounce guarantee", the app shouldn't have to care.

Linux now seems to provide a cacheline bounce guarantee for itself, but afaik has no way to communicate that to an app using raw rdtsc{,p} and all the relevant syscalls have a monotonicity option and/or have insufficient resolution to matter.
Jeremy Fitzhardinge
2009-Sep-21 22:50 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
On 09/21/09 15:20, Dan Magenheimer wrote:
>>> However, I do need one special case to indicate
>>> emulation vs non-emulation, so wraparound is
>>> still a problem.
>>
>> I was assuming you'd just repurpose the existing version number scheme
>> which is always even, and therefore can never equal -1.
>
> That wasn't my plan but if it can be made to work (see
> below), it probably saves code in Xen.
>
>> What's the full algorithm for detecting this feature? Usermode has to
>> establish:
>>
>>    1. It is running under Xen (or not, if you expect this to be
>>       implemented on multiple hypervisors)
>>    2. rdtscp is available
>>    3. the ABI is actually being implemented, ie:
>>          1. the tsc_aux value actually has the correct meaning
>>          2. it has a working mechanism for getting the tsc scaling
>>             parameters
>>          3. (accommodate ways to evolve the ABI in a
>>             back-compatible way)
>>
>> before it can do anything else.
>
> Yes, that's what I was thinking. I was planning on prototyping
> these checks with "userland-rdmsr" but userland-hypercall or
> userland-shared-page could work also.
>
>> If nothing else, it's probably worth removing the rdtscp
>> feature from the
>> logical guest cpuid, so that nothing else tries to use it for its own
>> purposes; in other words, you're exclusively claiming rdtscp for this
>> ABI. Or you could disable this ABI if a guest kernel tries
>> to set TSC_AUX.
>
> I was thinking that setting pvrdtscp=1 would override
> any kernel use of rdtscp/TSC_AUX, but disabling the
> cpuid has_rdtscp flag and using a different userland
> detection mechanism (than checking cpuid for has_rdtscp)
> would be a better way to avoid possible conflict.
>
>>> I've restricted the scheme to constant_tsc as I think
>>> it breaks down due to nasty races if running on a
>>> machine where the pvclock parameters differ across
>>> different pcpus. I think the races can only be
>>> avoided if Xen sets the TSC_AUX for all of the
>>> pcpus running a pvrdtscp domain while all are idle.
>>>
>>> Is there a scheme that avoids the races?
>>
>> rdtscp makes it quite easy to avoid races because you get the tsc and
>> metadata about the tsc atomically. You just need to encode
>> enough info
>> in the metadata to do the conversion.
>
> Yes but I don't think there is enough bits for encoding
> it all (32-bits in TSC_AUX, right?).
>
>> The obvious thing to do is to pack a version number and pcpu
>> number into
>> TSC_AUX. Usermode would maintain an array of pv_clock parameters, one
>> for each pcpu. If the version number matches, then it uses the
>> parameters it has; if not it fetches new parameters and repeats the
>> rdtscp. There's no need to worry about either thread or vcpu context
>> switches because you get the (tsc,params) tuple atomically,
>> which is the
>> tricky bit without rdtscp.
>>
>> (The version number would be truncated wrt the normal pvclock version
>> number, but it just needs to be large enough to avoid aliasing from
>> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
>> number.)
>
> I think a race occurs if the vcpu switches pcpu TWICE
> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> each time on pcpu-A but reads one or more pvclock parameters
> (that are too big to be encoded in TSC_AUX) on pcpu-B.

That shouldn't matter. Once the process has (tsc,cpu,version) it can use its own local copy of cpu's pvclock parameters to compute the tsc->ns conversion. Once it has that triple, it doesn't matter if it gets context-switched; the time computation doesn't depend on what CPU is currently running.

It only needs to iterate if it gets a version mismatch. You can potentially get a livelock if the version is constantly changing between the rdtscp and the get-pvclock-params, and exacerbated if the process keeps bouncing between cpus between the two. But given that the rdtsc+get-params should take no more than a couple of microseconds, it seems very unlikely the process is sustaining a megahertz CPU migration rate.

And even if it fails, the process always has to be prepared to go to some other time source.

> If Xen can atomically bump/change
> TSC_AUX on *all* pcpus running a guest vcpu, the race
> can be avoided. But I suspect that is too expensive (some
> kind of rendezvous required for each bump on any processor).

Right. Any synchronized cross-cpu call is going to be very expensive, and can't be done atomically without some kind of stop-the-world which is even worse.

> Even if my assumption of the race (above) is incorrect,
> 32-bits is not very much time at 100Hz. But the version
> bump needs to occur synchronously with every P/C-state
> transition for pvclock to work on non_constant_tsc machines
> doesn't it? How frequent can those transitions occur?

24 bits at 100Hz is 46ish hours. So there's a potential alias problem if the program reads the tsc at precisely 46.603 (ish) hours after its previous read. One workaround would be to force a re-read of the timing parameters every X secs/mins/hours to guarantee that there's no wrap for some expected rate of param updates.

That said, the standard pvclock algorithm is only 128 times better than that, and I don't think it has ever been considered to be a problem. I've never seen an update rate of more than once every few seconds.

Also, Xen need only update the version number if something has actually read that version; if nobody had read the current parameters, there's no need to update the version when updating them to a new value. That would help mitigate the case of rapid param updates and a low rate of reading.

> I guess this all depends on what Xen is capable of
> guaranteeing. If Xen can provide a "cacheline
> bounce guarantee", the app shouldn't have to care.

It can't, in principle, sync the tscs at a finer grain than the app can measure. It only has the same resources to play with, and there'll always be some error margin.

> Linux now seems to provide a cacheline bounce guarantee for
> itself, but afaik has no way to communicate that to an app
> using raw rdtsc{,p} and all the relevant syscalls have a
> monotonicity option and/or have insufficient resolution
> to matter.

It's a detail that a usermode app can't rely on anyway.

    J
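The wrap-time figures above can be reproduced with a few lines of arithmetic. This sketch assumes one version bump per update at the stated rate; the "128 times better" figure is read here as comparing 24 bits with the 31 effective bits of a 32-bit version field whose stable values are always even, which is an interpretation rather than something the message spells out. The helper name `version_wrap_hours` is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hours until a version counter with 'version_bits' usable bits wraps,
 * if it is bumped 'bumps_per_sec' times per second. */
static double version_wrap_hours(unsigned version_bits, double bumps_per_sec)
{
    assert(version_bits < 64);
    assert(bumps_per_sec > 0.0);

    double distinct_values = (double)((uint64_t)1 << version_bits);
    return distinct_values / bumps_per_sec / 3600.0;
}
```

For 24 bits at 100Hz this gives 2^24 / 100 = 167772.16 seconds, or about 46.603 hours; 31 bits at the same rate gives 128 times that, roughly 248.6 days.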
Dan Magenheimer
2009-Sep-21 23:29 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
> > I think a race occurs if the vcpu switches pcpu TWICE
> > from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> > each time on pcpu-A but reads one or more pvclock parameters
> > (that are too big to be encoded in TSC_AUX) on pcpu-B.
>
> That shouldn't matter. Once the process has (tsc,cpu,version) it can
> use its own local copy of cpu's pvclock parameters to compute the
> tsc->ns conversion. Once it has that triple, it doesn't matter if it
> gets context-switched; the time computation doesn't depend on what CPU
> is currently running.
>
> It only needs to iterate if it gets a version mismatch. You can
> potentially get a livelock if the version is constantly
> changing between
> the rdtscp and the get-pvclock-params, and exacerbated if the process
> keeps bouncing between cpus between the two. But given that the
> rdtsc+get-params should take no more than a couple of microseconds, it
> seems very unlikely the process is sustaining a megahertz CPU
> migration
> rate.

Yes, I neglected an important pre-condition. ASSUME the first rdtscp on pcpu-A gets a version mismatch so that it must fetch the parameters again. Then: the vcpu switches pcpu TWICE from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp each time on pcpu-A but reads one or more pvclock parameters (that are too big to be encoded in TSC_AUX) on pcpu-B.

I agree that this is vanishingly low probability but on a pcpu-oversubscribed machine I think it only takes one vcpu-to-pcpu reschedule and then a poorly timed interrupt that causes the vcpu to be unscheduled, and then later rescheduled on the original processor.

> And even if it fails, the process always has to be prepared to go to
> some other time source.

And the issue is that there's no way to recognize failure. Unless... wait... are you assuming that every unscheduled period results in an adjustment of the pvclock offset parameter? That results in "nanoseconds since guest boot during which any vcpu is running" rather than "nanoseconds since guest boot even when all vcpus are idle", right? That's different than what I had in mind, but I suppose it works.
Jeremy Fitzhardinge
2009-Sep-21 23:55 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
On 09/21/09 16:29, Dan Magenheimer wrote:
>>> I think a race occurs if the vcpu switches pcpu TWICE
>>> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
>>> each time on pcpu-A but reads one or more pvclock parameters
>>> (that are too big to be encoded in TSC_AUX) on pcpu-B.
>>
>> That shouldn't matter. Once the process has (tsc,cpu,version) it can
>> use its own local copy of cpu's pvclock parameters to compute the
>> tsc->ns conversion. Once it has that triple, it doesn't matter if it
>> gets context-switched; the time computation doesn't depend on what CPU
>> is currently running.
>>
>> It only needs to iterate if it gets a version mismatch. You can
>> potentially get a livelock if the version is constantly
>> changing between
>> the rdtscp and the get-pvclock-params, and exacerbated if the process
>> keeps bouncing between cpus between the two. But given that the
>> rdtsc+get-params should take no more than a couple of microseconds, it
>> seems very unlikely the process is sustaining a megahertz CPU
>> migration
>> rate.
>
> Yes, I neglected an important pre-condition. ASSUME the first
> rdtscp on pcpu-A gets a version mismatch so that it must fetch
> the parameters again. Then: the vcpu switches pcpu TWICE
> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> each time on pcpu-A but reads one or more pvclock parameters
> (that are too big to be encoded in TSC_AUX) on pcpu-B.
>
> I agree that this is vanishingly low probability but on
> a pcpu-oversubscribed machine I think it only takes one
> vcpu-to-pcpu reschedule and then a poorly timed interrupt that
> causes the vcpu to be unscheduled, and then later rescheduled
> on the original processor.

Sure. It just has to keep iterating until it gets consistency. If it iterates too long (10 times? 100? 1000?) it should give up and assume something is inherently broken.

>> And even if it fails, the process always has to be prepared to go to
>> some other time source.
>
> And the issue is that there's no way to recognize
> failure.

Yeah, that's a basic problem with using naked tsc as a timebase. Any app using it needs to be prepared to test the tsc sanity against some other time reference regularly.

On the other hand, using the tsc as part of a larger ABI works reliably.

This rdtscp proposal is basically the latter, as a variant of the pvclock algorithm. I'm mostly interested in it as an implementation for vsyscall etc, rather than something that apps would use directly.

> Unless... wait... are you assuming that
> every unscheduled period results in an adjustment
> of the pvclock offset parameter? That results in
> "nanoseconds since guest boot during which any
> vcpu is running" rather than "nanoseconds since
> guest boot even when all vcpus are idle", right?
> That's different than what I had in mind, but I
> suppose it works.

Not following you here.

    J
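The "iterate until consistent, but give up after a bound" idea above can be shown with a small simulation. Everything here is a stand-in: `struct fake_clock` models the version a real reader would get from rdtscp and the version of its cached parameters, and `fake_refresh` models a parameter fetch during which the hypervisor may bump the version again. None of these names exist in Xen or Linux.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated environment (hypothetical, for illustration only). */
struct fake_clock {
    uint32_t hw_version;      /* version a fresh rdtscp would report   */
    uint32_t cached_version;  /* version of the app's cached params    */
    int      bumps_left;      /* how many more times the params change */
};

/* Refetch parameters; the version may move again right afterwards. */
static void fake_refresh(struct fake_clock *c)
{
    c->cached_version = c->hw_version;
    if (c->bumps_left > 0) {
        c->bumps_left--;
        c->hw_version += 2;   /* stable pvclock versions stay even */
    }
}

/* Returns true once the cached parameters match what rdtscp reports.
 * Returns false after max_tries attempts, at which point the caller
 * should fall back to some other time source. */
static bool read_with_bounded_retry(struct fake_clock *c, int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        if (c->hw_version == c->cached_version)
            return true;
        fake_refresh(c);
    }
    return false;
}
```

The bound itself (10, 100 or 1000 in the message) is a policy choice; the sketch only shows that failure becomes detectable once a bound exists.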
Dan Magenheimer
2009-Sep-22 00:11 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
> > Yes, I neglected an important pre-condition. ASSUME the first
> > rdtscp on pcpu-A gets a version mismatch so that it must fetch
> > the parameters again. Then: the vcpu switches pcpu TWICE
> > from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> > each time on pcpu-A but reads one or more pvclock parameters
> > (that are too big to be encoded in TSC_AUX) on pcpu-B.
> >
> > I agree that this is vanishingly low probability but on
> > a pcpu-oversubscribed machine I think it only takes one
> > vcpu-to-pcpu reschedule and then a poorly timed interrupt that
> > causes the vcpu to be unscheduled, and then later rescheduled
> > on the original processor.
>
> Sure. It just has to keep iterating until it gets consistency. If it
> iterates too long (10 times? 100? 1000?) it should give up and assume
> something is inherently broken.

No, I'm not talking about iteration. In the scenario I'm trying to describe, the version number hasn't changed on pcpu-A so the algorithm doesn't iterate.

> On the other hand, using the tsc as part of a larger ABI
> works reliably.
>
> This rdtscp proposal is basically the latter, as a variant of the
> pvclock algorithm. I'm mostly interested in it as an
> implementation for
> vsyscall etc, rather than something that apps would use directly.
>
> > Unless... wait... are you assuming that
> > every unscheduled period results in an adjustment
> > of the pvclock offset parameter? That results in
> > "nanoseconds since guest boot during which any
> > vcpu is running" rather than "nanoseconds since
> > guest boot even when all vcpus are idle", right?
> > That's different than what I had in mind, but I
> > suppose it works.
>
> Not following you here.

I realized after I sent this that I'm not really sure I understand the pvclock implementation, particularly under what circumstances the version number changes or doesn't. And if this is different in any way than the versions you are proposing that the app would see. So I'm not positive we are considering the same cases.

Dan
Jeremy Fitzhardinge
2009-Sep-22 00:42 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I''ve been looking for)
On 09/21/09 17:11, Dan Magenheimer wrote:
>>> Yes, I neglected an important pre-condition.  ASSUME the first
>>> rdtscp on pcpu-A gets a version mismatch so that it must fetch
>>> the parameters again.  Then: the vcpu switches pcpu TWICE
>>> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
>>> each time on pcpu-A but reads one or more pvclock parameters
>>> (that are too big to be encoded in TSC_AUX) on pcpu-B.
>>>
>>> I agree that this is vanishingly low probability but on
>>> a pcpu-oversubscribed machine I think it only takes one
>>> vcpu-to-pcpu reschedule and then a poorly timed interrupt that
>>> causes the vcpu to be unscheduled, and then later rescheduled
>>> on the original processor.
>>
>> Sure.  It just has to keep iterating until it gets consistency.  If it
>> iterates too long (10 times? 100? 1000?) it should give up and assume
>> something is inherently broken.
>
> No, I'm not talking about iteration.  In the scenario I'm
> trying to describe, the version number hasn't changed on
> pcpu-A so the algorithm doesn't iterate.

Well, not "change" so much as "not updated".  If the program keeps doing
a rdtsc which shows that its local copy of the parameters is out of
date, but its attempts to get up-to-date parameters keep failing
(because it keeps migrating cpus), then it will keep iterating without
converging.

Specifically, the algorithm would be:

    u64 tsc, time_ns;
    u32 aux;
    unsigned int version, cpu;

again:
    rdtscp(&tsc, &aux);
    cpu = aux >> 24;                      /* physical cpu */
    version = aux & ((1 << 24) - 1);

    /* At this point tsc and cpu+version are all fetched atomically
       and consistent, so context switch doesn't matter here;
       apply_fixup is not dependent on currently executing cpu. */

    /* note that this prob. needs some local synchronization if the
       usermode program is multithreaded... */
    if (unlikely(version != pvclockinfo[cpu].version)) {
        struct pvclock info;
        int curcpu;                       /* again, physical cpu */

        /* Always fetches current cpu parameters, and tells us which
           cpu it is for.  If we switched cpus since the rdtscp we
           won't end up updating the out-of-date info we detected
           but that doesn't matter because... */
        curcpu = get_new_pvclock_info(&info);
        pvclockinfo[curcpu] = info;

        /* ...we repeat assuming that we're almost certainly still on
           the same cpu when we do rdtscp again */
        goto again;
    }

    time_ns = apply_fixup(tsc, &pvclockinfo[cpu]);

get_new_pvclock_info() can either be a syscall, hypercall or some other
mechanism which can get a good atomic snapshot of the params along with
cpu number from a shared memory region.

> I realized after I sent this that I'm not really sure
> I understand the pvclock implementation, particularly
> under what circumstances the version number changes
> or doesn't.  And if this is different in any way
> than the versions you are proposing that the app
> would see.  So I'm not positive we are considering
> the same cases.

The pvclock algorithm only changes the version if either the tsc
offset or scale have changed.  In the standard pvclock algorithm, a
vcpu sees its own pvclock version change if either the pcpu undergoes
some change which affects the tsc, *or* if the vcpu gets scheduled on a
new pcpu (which could have different offset/scale).

In the case we're talking about above, the code isn't pinned to a
particular pcpu or vcpu (as it is usermode code with no real control
over the kernel or xen schedulers), so it has to cope with preemption
at any point.  That's simplified by having the tsc and metadata fetch
atomic, so it can revalidate its parameters every time it fetches the
tsc.  In that case, Xen need only update its internal version numbers
when there's an actual change to the tsc's offset/scale without regard
to vcpu scheduling.

(And of course if the offset/scale end up being constant, then it will
never need to update the offset, and usermode will only ever end up
fetching it once per cpu.)

    J
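A minimal sketch of the same loop with the give-up bound suggested in the quoted text ("10 times? 100? 1000?"), so that a caller which keeps migrating gets an error instead of spinning forever. The `sim_*` functions and variables are simulated stand-ins invented here so the sketch can run anywhere; they are not real Xen or kernel interfaces. `MAX_TRIES`, `MAX_PCPUS` and `read_tsc_checked` are likewise hypothetical names, and the 24/8 split is the one assumed in the message.

```c
#include <stdbool.h>
#include <stdint.h>

#define VERSION_BITS 24
#define VERSION_MASK ((1u << VERSION_BITS) - 1)
#define MAX_PCPUS    256   /* 8-bit cpu tag, per the 24/8 split */
#define MAX_TRIES    100   /* arbitrary give-up bound (assumption) */

struct pvclock {
    uint32_t version;      /* offset/scale fields omitted in this sketch */
};

static struct pvclock pvclockinfo[MAX_PCPUS];

/* --- simulated platform: stand-ins, not real interfaces --- */
static unsigned sim_cpu = 3;        /* pcpu we pretend to run on */
static uint32_t sim_version = 7;    /* version Xen placed in TSC_AUX */
static bool     sim_stale = false;  /* true: fetched info never catches up */
static uint64_t sim_tsc = 1000;

/* Stand-in for the rdtscp instruction: returns tsc and TSC_AUX together. */
static void sim_rdtscp(uint64_t *tsc, uint32_t *aux)
{
    *tsc = sim_tsc++;
    *aux = ((uint32_t)sim_cpu << VERSION_BITS) | (sim_version & VERSION_MASK);
}

/* Stand-in for get_new_pvclock_info(): fills info, returns the pcpu it is for. */
static int sim_get_new_pvclock_info(struct pvclock *info)
{
    info->version = sim_stale ? sim_version - 1 : sim_version;
    return (int)sim_cpu;
}

/* Returns true once the cached parameters for the pcpu reported by rdtscp
   match the version reported by the same rdtscp; false if the loop does
   not converge within MAX_TRIES attempts. */
static bool read_tsc_checked(uint64_t *tsc, unsigned *cpu)
{
    for (int tries = 0; tries < MAX_TRIES; tries++) {
        uint32_t aux;
        struct pvclock info;
        int curcpu;

        sim_rdtscp(tsc, &aux);
        *cpu = aux >> VERSION_BITS;
        if ((aux & VERSION_MASK) == pvclockinfo[*cpu].version)
            return true;

        /* Refresh whichever pcpu we are on now, then try again. */
        curcpu = sim_get_new_pvclock_info(&info);
        pvclockinfo[curcpu] = info;
    }
    return false;
}
```

On success the caller would go on to apply the fixup using `pvclockinfo[*cpu]`; on failure it can fall back to a slower path such as a plain syscall.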
Jan Beulich
2009-Sep-22 07:39 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
>>> Jeremy Fitzhardinge <jeremy@goop.org> 21.09.09 20:36 >>>
>What's the full algorithm for detecting this feature?  Usermode has to
>establish:
>
>   1. It is running under Xen (or not, if you expect this to be
>      implemented on multiple hypervisors)
>   2. rdtscp is available
>   3. the ABI is actually being implemented, ie:
>         1. the tsc_aux value actually has the correct meaning
>         2. it has a working mechanism for getting the tsc scaling
>            parameters

This sub-2 can certainly be assumed to imply the respective sub-1.

>         3. (accommodate ways to evolve the ABI in a back-compatible way)
>
>before it can do anything else.

>The obvious thing to do is to pack a version number and pcpu number into
>TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
>for each pcpu.  If the version number matches, then it uses the
>parameters it has; if not it fetches new parameters and repeats the
>rdtscp.  There's no need to worry about either thread or vcpu context
>switches because you get the (tsc,params) tuple atomically, which is the
>tricky bit without rdtscp.
>
>(The version number would be truncated wrt the normal pvclock version
>number, but it just needs to be large enough to avoid aliasing from
>wrapping; I'm assuming something like 24 bits version and 8 bits cpu
>number.)

I continue to think that it would be fundamentally wrong to use pCPU
numbers here: Not only do you share information with the app that it
shouldn't really care about, but you also push scalability issues to it
that the kernel is supposed to abstract out for apps. In particular,

- the interface must not imply an upper bound for the number of
  pCPU-s (i.e. a fixed 8-/24-bit separation won't work, but reducing the
  version to significantly below 24 bits may cause issues),
- the app must not assume the number of pCPU-s is bounded in any way
  (since, due to migration or CPU hotplug, it may grow).

While both can be addressed, this really isn't something an app should
(have to) care about.

Jan
Jan Beulich
2009-Sep-22 07:44 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.09.09 01:29 >>>
>Yes, I neglected an important pre-condition.  ASSUME the first
>rdtscp on pcpu-A gets a version mismatch so that it must fetch
>the parameters again.  Then: the vcpu switches pcpu TWICE
>from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
>each time on pcpu-A but reads one or more pvclock parameters
>(that are too big to be encoded in TSC_AUX) on pcpu-B.

This fundamentally depends on how the pvclock parameters are being
read: While app-accessible MSRs inherently require each of the necessary
RDMSRs to be executed on the correct {p,v}CPU (unless you encode the
CPU number in the RDMSR input), an app accessible shared memory region
wouldn't have that property.

Jan
Dan Magenheimer
2009-Sep-22 15:00 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.09.09 01:29 >>>
> >Yes, I neglected an important pre-condition.  ASSUME the first
> >rdtscp on pcpu-A gets a version mismatch so that it must fetch
> >the parameters again.  Then: the vcpu switches pcpu TWICE
> >from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> >each time on pcpu-A but reads one or more pvclock parameters
> >(that are too big to be encoded in TSC_AUX) on pcpu-B.
>
> This fundamentally depends on how the pvclock parameters are being
> read: While app-accessible MSRs inherently require each of the
> necessary RDMSRs to be executed on the correct {p,v}CPU (unless you
> encode the CPU number in the RDMSR input), an app accessible shared
> memory region wouldn't have that property.

Hmmm... I think a shared memory region still does have that property.
To avoid any possibility of a race, there must be a way to atomically
fetch the full set of values:

{ tsc, tsc_aux, pvclock parameters }.

(How many bits total in pvclock parameters?)

Jeremy's proposal of a userland hypercall ("get_new_pvclock_info")
can do that, but I don't see how a shared memory region can.
But a userland hypercall that writes to userland memory seems
risky.  An app can mmap memory; if it fails to do so (either
accidentally or maliciously), bad things can happen, correct?

Pardon my x86 ignorance again: If we define a userland rdmsr,
it could overwrite more than just EDX:EAX.  If it overwrites
all registers that can safely be changed by the calling
convention, which registers (how many bits) can it "return"?
I suspect this isn't enough for 32-bit guests, but maybe
it is for 64-bit guests?

Dan
Jan Beulich
2009-Sep-22 15:16 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.09.09 17:00 >>>
>> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.09.09 01:29 >>>
>> >Yes, I neglected an important pre-condition.  ASSUME the first
>> >rdtscp on pcpu-A gets a version mismatch so that it must fetch
>> >the parameters again.  Then: the vcpu switches pcpu TWICE
>> >from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
>> >each time on pcpu-A but reads one or more pvclock parameters
>> >(that are too big to be encoded in TSC_AUX) on pcpu-B.
>>
>> This fundamentally depends on how the pvclock parameters are being
>> read: While app-accessible MSRs inherently require each of the
>> necessary RDMSRs to be executed on the correct {p,v}CPU (unless you
>> encode the CPU number in the RDMSR input), an app accessible shared
>> memory region wouldn't have that property.
>
>Hmmm... I think a shared memory region still does have that property.
>To avoid any possibility of a race, there must be a way to atomically
>fetch the full set of values:
>
>{ tsc, tsc_aux, pvclock parameters }.
>
>(How many bits total in pvclock parameters?)

Of course the expectation would be that the in-memory values are also
tagged with a version.

>Jeremy's proposal of a userland hypercall ("get_new_pvclock_info")
>can do that, but I don't see how a shared memory region can.
>But a userland hypercall that writes to userland memory seems
>risky.  An app can mmap memory, if it fails to do so (either
>accidentally or maliciously), bad things can happen, correct?

No, I don't think that's more risky than writing to kernel memory - Xen
would get a page fault, and skip the write (and return -EFAULT).

>Pardon my x86 ignorance again: If we define a userland rdmsr,
>it could overwrite more than just EDX:EAX.  If it overwrites
>all registers that can safely be changed by the calling
>convention, which registers (how many bits) can it "return"?
>I suspect this isn't enough for 32-bit guests, but maybe
>it is for 64-bit guests?

On 32-bit you have 3 registers if you don't want to touch callee saved
ones.
On 64-bit you have 7 of them (considering the differences between
Unix and Windows calling conventions, and hoping there's no third
set in use somewhere).

Jan
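One way to read "the in-memory values are also tagged with a version" is a seqlock-style reader: sample the version, copy the parameters, sample the version again, and retry if the two differ or the writer was mid-update. The sketch below uses that technique with C11 atomics. It is an assumption about how such a region could be consumed, not a description of an existing Xen interface; the struct names, the field set, and the odd-means-updating convention are illustrative.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout of one per-cpu record in the shared region. */
struct shared_params {
    atomic_uint      version;            /* odd while the writer is updating */
    _Atomic uint64_t tsc_timestamp;
    _Atomic uint64_t system_time;
    _Atomic uint32_t tsc_to_system_mul;
};

/* Private, consistent copy handed back to the caller. */
struct params_snapshot {
    uint32_t version;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
};

/* Copy the record, retrying until the version is even and unchanged
   across the copy, so the snapshot never mixes two generations. */
static void snapshot_params(struct shared_params *s,
                            struct params_snapshot *out)
{
    unsigned v0, v1;

    do {
        v0 = atomic_load_explicit(&s->version, memory_order_acquire);
        out->tsc_timestamp =
            atomic_load_explicit(&s->tsc_timestamp, memory_order_relaxed);
        out->system_time =
            atomic_load_explicit(&s->system_time, memory_order_relaxed);
        out->tsc_to_system_mul =
            atomic_load_explicit(&s->tsc_to_system_mul, memory_order_relaxed);
        /* Order the data loads before the second version load. */
        atomic_thread_fence(memory_order_acquire);
        v1 = atomic_load_explicit(&s->version, memory_order_relaxed);
    } while ((v0 & 1) || v0 != v1);

    out->version = v0;
}
```

The snapshot is internally consistent wherever the thread happens to be running. The caller still has to compare `out->version` with the version field from rdtscp to know whether the snapshot matches the tsc it read.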
Jeremy Fitzhardinge
2009-Sep-22 17:15 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
On 09/22/09 08:16, Jan Beulich wrote:
>> Pardon my x86 ignorance again: If we define a userland rdmsr,
>> it could overwrite more than just EDX:EAX.  If it overwrites
>> all registers that can safely be changed by the calling
>> convention, which registers (how many bits) can it "return"?
>> I suspect this isn't enough for 32-bit guests, but maybe
>> it is for 64-bit guests?
>>
> On 32-bit you have 3 registers if you don't want to touch callee
> saved ones.
> On 64-bit you have 7 of them (considering the differences between
> Unix and Windows calling conventions, and hoping there's no third
> set in use somewhere).

It doesn't really matter what registers you choose (but 3 is not enough;
you need around 200 bits of state for the pvclock params).  This special
rdtsc would presumably be done in the same way as the Xen cpuid, with
the XEN_EMULATE_PREFIX, and would need to be carefully emitted in an
inline asm, which can do whatever other fixups are required to save
registers and move values into the right place (gcc inline asm will
pretty much automate this).

But I think doing this direct from usermode is a bad idea; interactions
with Xen should be mediated by the kernel, even if just via a
/dev/xen/pvclock driver.

    J
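The "around 200 bits" figure can be checked by adding up the fields that Xen's existing `vcpu_time_info` record carries (version, tsc_timestamp, system_time, tsc_to_system_mul, tsc_shift) and comparing the total with the register budgets quoted above. The sketch below does that arithmetic and shows one plausible `apply_fixup`, assuming the usermode parameters are exactly those fields and that scaling follows the usual pvclock shift-then-multiply scheme. The struct layout and enum names are illustrative.

```c
#include <stdint.h>

/* Assumed parameter set, modelled on the fields of Xen's vcpu_time_info. */
struct pvclock {
    uint32_t version;
    uint64_t tsc_timestamp;      /* tsc value when system_time was sampled */
    uint64_t system_time;        /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;          /* shift applied before the multiply */
};

enum {
    /* version + tsc_timestamp + system_time + mul + shift */
    PARAM_BITS  = 32 + 64 + 64 + 32 + 8,
    REG_BITS_32 = 3 * 32,        /* 3 caller-clobbered registers, 32-bit */
    REG_BITS_64 = 7 * 64,        /* 7 caller-clobbered registers, 64-bit */
};

/* Convert a raw tsc reading to nanoseconds using one parameter set. */
static uint64_t apply_fixup(uint64_t tsc, const struct pvclock *p)
{
    uint64_t delta = tsc - p->tsc_timestamp;
    uint64_t hi, lo;

    if (p->tsc_shift >= 0)
        delta <<= p->tsc_shift;
    else
        delta >>= -p->tsc_shift;

    /* (delta * mul) >> 32, split so it needs no 128-bit arithmetic. */
    hi = (delta >> 32) * p->tsc_to_system_mul;
    lo = ((delta & 0xffffffffu) * p->tsc_to_system_mul) >> 32;

    return p->system_time + hi + lo;
}
```

With these numbers the parameters come to 200 bits, which exceeds the 96 bits available in three 32-bit registers and fits within the 448 bits of seven 64-bit registers.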
Jeremy Fitzhardinge
2009-Sep-22 17:26 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
On 09/22/09 00:39, Jan Beulich wrote:
>>    1. It is running under Xen (or not, if you expect this to be
>>       implemented on multiple hypervisors)
>>    2. rdtscp is available
>>    3. the ABI is actually being implemented, ie:
>>          1. the tsc_aux value actually has the correct meaning
>>          2. it has a working mechanism for getting the tsc scaling
>>             parameters
>>
> This sub-2 can certainly be assumed to imply the respective sub-1.

Yeah, they're the minimum requirements of a "working ABI".  But I think
we should also have something workable if only rdtsc is available.

>> The obvious thing to do is to pack a version number and pcpu number into
>> TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
>> for each pcpu.  If the version number matches, then it uses the
>> parameters it has; if not it fetches new parameters and repeats the
>> rdtscp.  There's no need to worry about either thread or vcpu context
>> switches because you get the (tsc,params) tuple atomically, which is the
>> tricky bit without rdtscp.
>>
>> (The version number would be truncated wrt the normal pvclock version
>> number, but it just needs to be large enough to avoid aliasing from
>> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
>> number.)
>>
> I continue to think that it would be fundamentally wrong to use pCPU
> numbers here: Not only do you share information with the app that it
> shouldn't really care about, but you also push scalability issues to it
> that the kernel is supposed to abstract out for apps.

As far as usermode is concerned, they're just tags to distinguish
distinct sets of parameters.  We could remap them from actual pcpu
numbers to some other key space, but I don't see much point in doing
so.  The numbers have no inherent meaning to usermode.

(Of course we could add some inherent structure to them, like adding
node numbers for NUMA systems, so that usermode has at least some idea
of how it is being mapped to hardware, at least at that instant.  But
that's a whole other discussion.)

> In particular,
> - the interface must not imply an upper bound for the number of
> pCPU-s (i.e. a fixed 8-/24-bit separation won't work, but reducing the
> version to significantly below 24 bits may cause issues),

Yeah.  I was considering a mechanism whereby the version/cpu split was a
runtime option fetched from Xen.  Running out of space for CPU numbers
would be a disaster, but a smaller version space can be dealt with by
making sure that there's a new pvclock param update before the version
wraps (which you can achieve by requiring an update every X units of
wallclock time, where X is less than the expected minimum time of a wrap).

> - the app must not imply the number of pCPU-s is bounded in any way
> (since, due to migration or CPU hotplug, it may grow).

Usermode might have to use a more flexible structure than a simple array
to handle arbitrary parameter keys (aka pcpu numbers).

> While both can be addressed, this really isn't something an app should
> (have to) care about.

I agree.  All this machinery should be wrapped up in the form of a
vsyscall.  That would simplify many aspects of this discussion.

    J
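A small sketch of the runtime-selected split: usermode learns once how many high bits of TSC_AUX carry the cpu tag and decodes accordingly, and the wrap bound falls out of the remaining version bits and a worst-case update rate. `struct aux_layout`, the function names, and the idea of fetching `cpu_bits` from Xen are hypothetical, used here only to make the arithmetic concrete.

```c
#include <stdint.h>

/* Hypothetical layout descriptor, fetched once from Xen in this sketch. */
struct aux_layout {
    unsigned cpu_bits;   /* high bits of TSC_AUX holding the tag; 1..31 */
};

/* Parameter-set tag (pcpu number or remapped key) from TSC_AUX. */
static unsigned aux_cpu(const struct aux_layout *l, uint32_t aux)
{
    return aux >> (32 - l->cpu_bits);
}

/* Truncated pvclock version from the remaining low bits of TSC_AUX. */
static uint32_t aux_version(const struct aux_layout *l, uint32_t aux)
{
    return aux & ((UINT32_C(1) << (32 - l->cpu_bits)) - 1);
}

/* Shortest time, in whole seconds, in which the version field could wrap
   if it is bumped at most max_updates_per_sec times a second (nonzero).
   A forced refresh interval X has to stay below this value. */
static uint64_t min_wrap_seconds(const struct aux_layout *l,
                                 uint64_t max_updates_per_sec)
{
    return (UINT64_C(1) << (32 - l->cpu_bits)) / max_updates_per_sec;
}
```

For example, at an assumed worst case of 1000 updates per second, an 8-bit tag leaves 24 version bits and about 16777 seconds before a wrap, while a 12-bit tag leaves 20 version bits and about 1048 seconds.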
Dan Magenheimer
2009-Sep-22 19:36 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> This rdtscp proposal is basically the latter, as a variant of the
> pvclock algorithm.  I'm mostly interested in it as an implementation
> for vsyscall etc, rather than something that apps would use directly.

> From: Jan Beulich [mailto:JBeulich@novell.com]
> I continue to think that it would be fundamentally wrong to use pCPU
> numbers here: Not only do you share information with the app that it
> shouldn't really care about, but you also push scalability issues to
> it that the kernel is supposed to abstract out for apps.

While I have been hopeful that we can identify a solution
that can solve both problems (vsyscall+pvclock and pvrdtscp),
I have been concerned we might run into a fundamental
conflict since both of us may be attempting to use TSC_AUX
for somewhat different purposes.

Then in taking a step back to think about this, I realized
we may be farther apart in our objectives than I first
thought.  So I thought it would be a good idea to revisit
some assumptions.

I am assuming that rdtsc and rdtscp are always emulated,
but for some "high frequency timestamp apps" (HFTSAs) I am
trying to define a mechanism where rdtsc/rdtscp are always
correct AND, in certain constrained environments, also
fast (non-emulated).

Any userland pvclock algorithm still requires a rdtsc
(or rdtscp) instruction which -- EXCEPT in those certain
constrained environments -- is emulated.  But the whole
point of pvclock is to be faster than entering the
hypervisor, right?

Are you (Jeremy) still assuming that rdtsc/rdtscp are NOT
emulated?  Or are you trying to define a vsyscall+pvclock
mechanism for the same constrained environments
so that HFTSAs have a choice of using clock_gettime
instead of pvrdtsc, either of which will be fast?

Or am I missing another option altogether?

Dan
Jeremy Fitzhardinge
2009-Sep-22 19:52 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
On 09/22/09 12:36, Dan Magenheimer wrote:
> Are you (Jeremy) still assuming that rdtsc/rdtscp are NOT
> emulated?  Or are you trying to define a vsyscall+pvclock
> mechanism for the same constrained environments
> so that HFTSAs have a choice of using clock_gettime
> instead of pvrdtsc, either of which will be fast?

Yes, I'm assuming they're not emulated.  If you're emulating them
there's no reason to add any extra complexity to usermode by adding any
other ABI: rdtsc can be rdtsc and rdtscp can be rdtscp with no
Xen/ABI-imposed constraints on TSC_AUX.

Once you're talking about layering another ABI onto the tsc, then
there's no need to consider emulation because you can do all the
necessary correction to get a canonical timestamp without it.

    J
Dan Magenheimer
2009-Sep-22 20:22 UTC
RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
> On 09/22/09 12:36, Dan Magenheimer wrote:
> > Are you (Jeremy) still assuming that rdtsc/rdtscp are NOT
> > emulated?  Or are you trying to define a vsyscall+pvclock
> > mechanism for the same constrained environments
> > so that HFTSAs have a choice of using clock_gettime
> > instead of pvrdtsc, either of which will be fast?
>
> Yes, I'm assuming they're not emulated.

OK, that's what I feared.  I don't know how this decision
will be made, but any pvclock and pvrdtsc design work
is very dependent on the decision.

> If you're emulating them
> there's no reason to add any extra complexity to usermode by
> adding any other ABI: rdtsc can be rdtsc and rdtscp can be
> rdtscp with no Xen/ABI-imposed constraints on TSC_AUX.

The reason is to improve performance while preserving
correctness for applications that need to do tens-to-hundreds
of thousands "timestamp reads" without changing the underlying
OS.  Whether this is a GOOD reason is subject to interpretation,
but it is a reason.

> Once you're talking about layering another ABI onto the tsc, then
> there's no need to consider emulation because you can do all the
> necessary correction to get a canonical timestamp without it.

But only at the cost of losing correctness for apps (whether you
consider them fundamentally broken or not) that depend
on the rdtsc instruction to deliver the architecturally-defined
functionality and may silently fail or corrupt data
if rdtsc silently doesn't behave as defined.

Dan
Jeremy Fitzhardinge
2009-Sep-22 22:18 UTC
Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
On 09/22/09 13:22, Dan Magenheimer wrote:
> The reason is to improve performance while preserving
> correctness for applications that need to do tens-to-hundreds
> of thousands "timestamp reads" without changing the underlying
> OS.  Whether this is a GOOD reason is subject to interpretation,
> but it is a reason.

I don't think there's anything new to add to this line of discussion.

    J