Dan Magenheimer
2009-Oct-02 17:51 UTC
[Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
============
Premise 1: A large and growing percentage of servers running Xen have a "reliable" TSC, and Xen can determine conclusively whether a server does or does not have a reliable TSC.
============

The truth of this statement has been vociferously challenged in other threads, so I'd LOVE TO GET FEEDBACK OR CONFIRMATION FROM PROCESSOR AND SERVER VENDORS.

The rest of this is long though hopefully educational, but if you have no interest in the rdtsc instruction or timestamping, please move on to [2 of 4].

Since my overall premise is a bit vague, I need to first very clearly define my terms. And to define those terms clearly, I need to provide some more background. As far as I can find, there is no publication which clearly describes all of these concepts.

The rdtsc instruction was at one time the easiest, cheapest and most precise method for "approximating the passage of time"; as such rdtsc was widely used by x86 performance practitioners and high-end apps that needed to provide extensive metrics. When commodity SMP x86 systems emerged, rdtsc fell into disfavor because: (a) it was difficult for different CPU packages to share a crystal or to ensure different crystals were synchronized or increasing at precisely the same rate, and (b) SMP apps were oblivious to which CPU their thread(s) were running on, so two rdtsc instructions in the same thread might execute on different CPUs and thus unwittingly use different crystals, resulting in strange things like the appearance that time went backwards (sometimes by a large amount) or events appearing to take different amounts of time depending on whether they were running on processor A or processor B. We will call this the "inconsistent TSC" problem.

Processor and system vendors attempted to fix the inconsistent TSC problem by providing a new class of "platform timers" (e.g. HPET), but these proved to be slow and difficult to use, especially for apps that required frequent fine metrics.

Processor and system vendors eventually figured out how to synchronize TSC with the same crystal, but then a new set of problems emerged: power features sometimes caused the clock on one processor to slow down or even stop, thus destroying the synchrony with other processors. This was fixed first by ensuring that the tick rate did not change ("constant TSC") and later that it did not stop ("nonstop TSC"), unless ALL of the TSCs on all of the processors stopped. Nearly all of the most recent generations of server processors support these capabilities, so as a result on most recent servers, the TSC on all processors/cores/sockets is driven by the same crystal, always ticks at the same rate, and doesn't stop independently of other processors' TSCs. This is what we call a "reliable TSC".

But we're not done yet. What does a reliable TSC provide? We need to define a few more terms.

A "perfect TSC" would be one where a magic logic analyzer with a cesium clock could confirm that the TSCs on every processor increment at precisely the same femtosecond. Both the speed of light and the pricing models of commodity processors make a perfect TSC unlikely :-)

How close is good enough? We define two TSCs as being "unobservably different" if code running on the two processors can never see time going backwards, because the difference between their TSCs is smaller than the memory access overhead due to cache synchronization. (This is sometimes called a "cache bounce".) To wit, suppose processor A does a rdtsc and writes the result into memory; meanwhile processor B is spinning until it sees that the memory location has changed, reads A's value from memory and then does its own rdtsc. If B's rdtsc is NEVER less than OR equal to A's rdtsc, we will call this an "optimal TSC".

A reliable TSC is not guaranteed to be optimal; it may just be very close to optimal, meaning the difference between two TSCs may sometimes be observable but it will always be very small. (As far as I know, processor and server vendors will not guarantee exactly how small.) To simulate an optimal TSC with a reliable TSC, a software wrapper can be placed around the reads from a reliable TSC to catch and "fix" the rare circumstances where time goes backwards. If this wrapper ensures that time never goes backwards AND ensures that time always moves forward, we call this a monotonically-increasing wrapper. If it instead ensures that time never goes backwards BUT may appear to stop, we call this a monotonically-non-decreasing wrapper.

Note also that a reliable TSC is not guaranteed to never stop; it is just guaranteed that if the TSC on one processor is stopped, the TSC on all other processors will also be stopped. As a result, a reliable TSC cannot be used as a wallclock, at least without other software support that can properly adjust the TSC on all processors when all processors awaken.

Last, there is the issue of whether or not Xen can conclusively determine if the TSC is reliable. This is still an open challenge. There exists a CPUID bit which purports to do this, but it is not known with certainty if there are exceptions. Notably, there is concern about whether certain newer, larger NUMA servers will truly provide a reliable TSC across all system processors even if the CPUID bit on each CPU package says the package does provide a reliable TSC. One large server vendor claims that this is not a problem anymore, but ideally we would like to test this dynamically, and there is GPL code available to do exactly that. This code is used in Linux in some circumstances once at boot-time to test for an "optimal TSC". But in some cases the CPUID bit defuses this test. And in any case a boot-time test may not catch all problems, such as a power event that doesn't handle TSC quite properly. So without some form of ongoing post-boot-time test, we just don't know.
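
[Editorial sketch, not part of the original post.] The two wrappers described above could look roughly like the following in C11. The function names, the caller-supplied `last` slot and the use of compare-and-swap are assumptions made for illustration; nothing here is taken from Xen or Linux. `raw` stands for whatever rdtsc returned on the calling processor.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Monotonically-non-decreasing wrapper: never returns a value smaller than
 * one already returned through `last`, but may return the same value more
 * than once (time "appears to stop"). */
uint64_t stamp_non_decreasing(_Atomic uint64_t *last, uint64_t raw)
{
    assert(last != NULL);
    uint64_t prev = atomic_load(last);
    for (;;) {
        if (raw <= prev)
            return prev;            /* raw is stale: reuse the newest stamp */
        /* On failure, prev is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(last, &prev, raw))
            return raw;
    }
}

/* Monotonically-increasing wrapper: each successful call publishes a value
 * strictly greater than the previously published one.
 * (Counter overflow at UINT64_MAX is ignored in this sketch.) */
uint64_t stamp_increasing(_Atomic uint64_t *last, uint64_t raw)
{
    assert(last != NULL);
    uint64_t prev = atomic_load(last);
    for (;;) {
        uint64_t next = (raw > prev) ? raw : prev + 1;
        if (atomic_compare_exchange_weak(last, &prev, next))
            return next;
    }
}
```

Fed the raw readings 100, 90, 101 in that order (starting from a zeroed slot), the non-decreasing wrapper returns 100, 100, 101 and the increasing wrapper returns 100, 101, 102. The shared slot is itself a cache line that bounces between processors, so a wrapper like this trades some of rdtsc's speed for the ordering guarantee.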
_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-07 21:07 UTC
RE: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
FYI, I finally found a published source describing the TSC Invariant bit in Nehalem. See 2.2.6 in:

http://www.intel.com/Assets/PDF/appnote/241618.pdf

"In the Core i7 AND FUTURE PROCESSOR GENERATIONS [my emphasis] the TSC will continue to run in the deepest C-states. Therefore, the TSC will run at a constant rate in all ACPI P-, C-, and T-states. Support for this feature is indicated by CPUID.0x8000_0007.EDX[8]. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."

Linux upstream now does exactly that; if this bit is set (on Intel processors), tsc is utilized as the system clocksource and afaict there is NO path that will test or revert this decision.

Admittedly, this doesn't guarantee that a multi-socket platform obeys invariance, but apparently this feature utilizes a crystal available externally to the socket, so it is easy to leverage in a system design to ensure invariance across multiple sockets, or even across multiple enclosures that are all on a QPI link. So system designers (other than perhaps for the very largest superNUMA machines) would be silly to not use it.

So, I'd recommend that:

1) On (Intel, maybe later AMD) systems where this bit is set, the mechanisms enabled by the Xen consistent_tscs boot option should be enabled automatically for Xen.
2) The time_calibration_tsc_rendezvous loop in timer.c could/should be rewritten or removed and certainly should NOT write_tsc().

Keir, I know you are very sensitive around this code, so thought I'd check before messing with it. Or feel free to do it yourself.

Thanks,
Dan
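
[Editorial sketch, not part of the original message.] The bit quoted from the Intel application note, CPUID.0x8000_0007.EDX[8], can be read from ordinary C code along these lines. The sketch assumes a GCC- or Clang-style toolchain that ships `<cpuid.h>` with `__get_cpuid()`; the macro and function names are made up for illustration and this is not the check Xen or Linux performs. It only reports what the executing package advertises, so it says nothing about synchronization across sockets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (toolchain assumption) */
#endif

#define CPUID_LEAF_APM        0x80000007u  /* advanced power management leaf */
#define CPUID_APM_EDX_INVTSC  (1u << 8)    /* EDX[8]: invariant TSC */

static_assert(CPUID_APM_EDX_INVTSC == 0x100u, "invariant TSC is EDX bit 8");

/* Pure decode step: does this EDX value from leaf 0x80000007 advertise
 * an invariant TSC? */
bool edx_has_invariant_tsc(uint32_t edx)
{
    return (edx & CPUID_APM_EDX_INVTSC) != 0;
}

/* Query the processor this thread happens to be running on. Returns false
 * when the leaf is not implemented or the build target is not x86. */
bool cpu_has_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(CPUID_LEAF_APM, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_invariant_tsc(edx);
#else
    return false;
#endif
}
```

Splitting the decode from the query keeps the bit test checkable on any machine, whatever the local processor reports.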
Keir Fraser
2009-Oct-08 06:45 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
On 07/10/2009 22:07, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> So, I'd recommend that:
>
> 1) On (Intel, maybe later AMD) systems where this bit is set, the mechanisms enabled by the Xen consistent_tscs boot option should be enabled automatically for Xen.
> 2) The time_calibration_tsc_rendezvous loop in timer.c could/should be rewritten or removed and certainly should NOT write_tsc().
>
> Keir, I know you are very sensitive around this code, so thought I'd check before messing with it. Or feel free to do it yourself.

Feel free to make a patch.

K.
Keir Fraser
2009-Oct-08 06:54 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
On 08/10/2009 07:45, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> 1) On (Intel, maybe later AMD) systems where this bit is set, the mechanisms enabled by the Xen consistent_tscs boot option should be enabled automatically for Xen.
>> 2) The time_calibration_tsc_rendezvous loop in timer.c could/should be rewritten or removed and certainly should NOT write_tsc().
>>
>> Keir, I know you are very sensitive around this code, so thought I'd check before messing with it. Or feel free to do it yourself.
>
> Feel free to make a patch.

At least, make a patch for (1). I don't think (2) can be easily removed in all cases. For example, Intel's method for rate-invariant TSC which stops on deep sleeps does involve rewriting TSC values to forcibly keep them in sync. Perhaps change code to never write_tsc() just in the case of TSC_RELIABLE, or whatever you call it? Or perhaps just do (1) for now.

 -- Keir
Tim Deegan
2009-Oct-08 09:13 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
At 22:07 +0100 on 07 Oct (1254953275), Dan Magenheimer wrote:

> Admittedly, this doesn't guarantee that a multi-socket platform obeys invariance, but apparently this feature utilizes a crystal available externally to the socket so it is easy to leverage in a system design to ensure invariance across multiple sockets, or even across multiple enclosures that are all on a QPI link. So system designers (other than perhaps for the very largest superNUMA machines) would be silly to not use it.

Oh, that's reassuring. System designers would never do something that silly. :)

If linux relies on it, that's a good sign, but surely we shouldn't get rid of any existing correction mechanisms.

Tim.

--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]
Keir Fraser
2009-Oct-08 09:22 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
On 08/10/2009 10:13, "Tim Deegan" <Tim.Deegan@eu.citrix.com> wrote:

>> So system designers (other than perhaps for the very largest superNUMA machines) would be silly to not use it.
>
> Oh, that's reassuring. System designers would never do something that silly. :)
>
> If linux relies on it, that's a good sign, but surely we shouldn't get rid of any existing correction mechanisms.

I think at the very least this new 'reliable tsc' mode must be self contained, not impact the existing modes, and continue to be switchable via a boot parameter.

 -- Keir
Dan Magenheimer
2009-Oct-08 16:24 UTC
RE: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
> On 08/10/2009 10:13, "Tim Deegan" <Tim.Deegan@eu.citrix.com> wrote:
>
> >> So system designers (other than perhaps for the very largest superNUMA machines) would be silly to not use it.
> >
> > Oh, that's reassuring. System designers would never do something that silly. :)

Tongue-in-cheek noted. ;-) But seriously, what I'm proposing is that now that this is architected by the processor, poorly designed systems (or extremely large systems) should be the rare exception, not the rule.

Specifically I'm proposing that (at least for Intel... AMD TBD) if the architectural bit is set Xen should trust it by default, but provide a boot-time parameter (e.g. "tsc_broken") to override the default for any rare poorly-designed or superNUMA systems.

> > If linux relies on it, that's a good sign, but surely we shouldn't get rid of any existing correction mechanisms.

Unfortunately, Xen has no existing detection mechanism so also has no existing correction mechanism. Xen currently blindly assumes tsc is wrong and overwrites all tscs at boot-time, after deep C-state, and at 1Hz if the boot-time consistent_tscs option is set.

> I think at the very least this new 'reliable tsc' mode must be self contained, not impact the existing modes, and continue to be switchable via a boot parameter.

OK, let me suggest the following taxonomy of tsc "safeness":

A) unsafe (neither constant nor power-invariant)
B) semi-safe (constant = P-,T-state invariant, C-state may stop)
C) safe (constant+non-stop = P-,T-,and C-state invariant)
D) false-positive safe (CPUs safe, system-wide is not)

Xen currently assumes A. This is sufficient for Xen's needs, and for the pvclock algorithm, but insufficient for my plans to expose "TSC reliability" to usermode.

B (constant) is now determined in Xen by checking family ids but only used to override consistent_tscs if constant is NOT set.

C is architecturally-defined by a cpuid bit but Xen doesn't currently use it. Intel guarantees TSC invariance across P-, T-, and C-states when it is set (AMD TBD).

I'm proposing that:

1) for case C, Xen shall never overwrite TSC
2) for case D, a new "tsc_broken" boot option must be specified when Xen is booted on a broken machine
3) for case B, always use it when the hardware supports it (unless overridden by "tsc_broken")

We are also investigating whether the write_tsc() in the cstate recovery code obviates the need for the write_tsc in time_calibration_tsc_rendezvous.

Comments?
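
[Editorial sketch, not part of the original message.] One way to read the A/B/C/D taxonomy and the proposed "tsc_broken" override is as a small classification function. The enum, the function names and the treatment of "tsc_broken" on a constant-only machine (falling back to A) are assumptions made for illustration; the message leaves that last case open, and none of this is Xen code.

```c
#include <assert.h>
#include <stdbool.h>

/* The four classes from the taxonomy, in the order A..D. */
enum tsc_class {
    TSC_UNSAFE,          /* A: neither constant nor power-invariant   */
    TSC_SEMI_SAFE,       /* B: constant, but may stop in deep C-state */
    TSC_SAFE,            /* C: constant and non-stop                  */
    TSC_FALSE_POSITIVE   /* D: CPUs claim safe, system-wide is not    */
};

/* constant / nonstop: what the hardware advertises.
 * tsc_broken: the administrator's boot-time override. Class D cannot be
 * detected from CPU flags, so it only arises when the override is given. */
enum tsc_class classify_tsc(bool constant, bool nonstop, bool tsc_broken)
{
    if (!constant)
        return TSC_UNSAFE;
    if (tsc_broken)   /* assumed reading: override distrusts every claim */
        return nonstop ? TSC_FALSE_POSITIVE : TSC_UNSAFE;
    return nonstop ? TSC_SAFE : TSC_SEMI_SAFE;
}

/* Proposal 1 only: in class C the hypervisor never overwrites the TSC. */
bool never_write_tsc(enum tsc_class c)
{
    assert(c >= TSC_UNSAFE && c <= TSC_FALSE_POSITIVE);
    return c == TSC_SAFE;
}
```

Under this reading, trusting the hardware is the default and distrust has to be requested explicitly, which is the point the replies in this thread disagree on.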
Tim Deegan
2009-Oct-09 09:34 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
At 17:24 +0100 on 08 Oct (1255022685), Dan Magenheimer wrote:

> Tongue-in-cheek noted. ;-) But seriously, what I'm proposing is that now that this is architected by the processor, poorly designed systems (or extremely large systems) should be the rare exception, not the rule.

That seems like unwarranted optimism, but we'll just have to wait and see. I've seen enough bugs that boiled down to reputable system builders doing things that software engineers thought would surely never happen.

> A) unsafe (neither constant nor power-invariant)
> B) semi-safe (constant = P-,T-state invariant, C-state may stop)
> C) safe (constant+non-stop = P-,T-,and C-state invariant)
> D) false-positive safe (CPUs safe, system-wide is not)

OK; for the record I believe C should be assumed to be D.

> Xen currently assumes A.

That's what I meant by detection and correction.

> This is sufficient for Xen's needs, and for the pvclock algorithm, but insufficient for my plans to expose "TSC reliability" to usermode.

Your plans for usermode<-->hypervisor direct TSC integration seem to me to be an unpleasant hack. I understand that you have good business reasons for wanting it (even if you're not allowed to tell us explicitly what they are) and we've seen the justifications enough times that we don't need to cover them again here, but it's still a hack.

I'm unhappy with the idea of kicking around the Xen timekeeping code (and introducing the usual bug-tail) to support introducing a usermode TSC. If there is to be a new mode for this, it should default to the current (works for everyone except the engineering team of a not-to-be-named enterprise application) behaviour.

> I'm proposing that:
> 1) for case C, Xen shall never overwrite TSC
> 2) for case D, a new "tsc_broken" boot option must be specified when Xen is booted on a broken machine

Might as well call it "application_broken" and default it the other way. :) The system builders are entirely within their rights to have separate clocks for separate sockets.

Cheers,

Tim.

> 3) for case B, always use it when the hardware supports it (unless overridden by "tsc_broken")
>
> We are also investigating whether the write_tsc() in the cstate recovery code obviates the need for the write_tsc in time_calibration_tsc_rendezvous.
>
> Comments?

--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]
Dan Magenheimer
2009-Oct-09 14:38 UTC
RE: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
Hi Tim --

Thanks for your comments!

> At 17:24 +0100 on 08 Oct (1255022685), Dan Magenheimer wrote:
> > Tongue-in-cheek noted. ;-) But seriously, what I'm proposing is that now that this is architected by the processor, poorly designed systems (or extremely large systems) should be the rare exception, not the rule.
>
> That seems like unwarranted optimism, but we'll just have to wait and see. I've seen enough bugs that boiled down to reputable system builders doing things that software engineers thought would surely never happen.

Well, app providers have been beating up on processor and system vendors for years to "fix the d*mn timestamp problem". They finally have, and have even made it architectural. I can think of one large enterprise software provider that would gladly redlist systems that regress in this area. So color me optimistic that the problem is solved or at least that system vendors will only sin for a very good reason; and their indiscretions will be public enough that their need for a special boot-time Xen option will not be a closely-guarded secret.

Now all I'm trying to do is ensure that Xen virtual machines don't suffer their own "d*mn timestamp problem", especially given that VMware doesn't have one.

> > A) unsafe (neither constant nor power-invariant)
> > B) semi-safe (constant = P-,T-state invariant, C-state may stop)
> > C) safe (constant+non-stop = P-,T-,and C-state invariant)
> > D) false-positive safe (CPUs safe, system-wide is not)
>
> OK; for the record I believe C should be assumed to be D.

What?!? And waste all that hard work by processor and system vendors to finally fix the problem? ;-)

I admit that I have some reservations as well, so would like Xen to verify "safeness" at each boot, and preferably periodically for the life of the system. Verification turns out to be quite ugly though, and probably even more so for those superNUMA systems that might be most likely to fail the test.

> > Xen currently assumes A.
>
> That's what I meant by detection and correction.

IMHO, the road to software performance hell is paved with least-common-denominator solutions. (And, yes, to take the words right out of your mouth before you say them, the road to software maintenance hell is paved with never-used special cases.)

> > This is sufficient for Xen's needs, and for the pvclock algorithm, but insufficient for my plans to expose "TSC reliability" to usermode.
>
> Your plans for usermode<-->hypervisor direct TSC integration seem to me to be an unpleasant hack.

Yes, I admit it offends my aesthetics some. But I defend it to myself by believing that this is just a first step in a long road of closer collaboration between hypervisor and apps. Really the whole point of paravirtualization is to benefit from knowing that the underlying platform is virtual. Why should apps be excluded from the party?

> I understand that you have good business reasons for wanting it (even if you're not allowed to tell us explicitly what they are) and we've seen the justifications enough times that we don't need to cover them again here, but it's still a hack.

I think I've been very explicit: Some very large apps, both Oracle and non-Oracle, need a way to get a timestamp at a high frequency in a way that is both correct and very fast and works across a range of hardware/software environments, INCLUDING running under Xen. I AM exposed to some other companies' confidential information, so any appearance that I am hiding something is due to my clumsy attempts to dance around that in a public forum.

> I'm unhappy with the idea of kicking around the Xen timekeeping code (and introducing the usual bug-tail) to support introducing a usermode TSC. If there is to be a new mode for this, it should default to the current (works for everyone except the engineering team of a not-to-be-named enterprise application) behaviour.

This isn't a new mode, it's a new (not-so-new for AMD) hardware feature that Xen has yet to make proper use of. And I'm not introducing a usermode TSC... Intel did that years ago. And if, by "new mode" you're referring to rdtsc emulation, that's certainly not for Oracle's benefit.

> > I'm proposing that:
> > 1) for case C, Xen shall never overwrite TSC
> > 2) for case D, a new "tsc_broken" boot option must be specified when Xen is booted on a broken machine
>
> Might as well call it "application_broken" and default it the other way. :) The system builders are entirely within their rights to have separate clocks for separate sockets.

If you agree with Jeremy's opinion that "any app that uses rdtsc is fundamentally broken", your syntax makes sense. As you know, I disagree, especially as it applies to future hardware and software.

Dan

P.S. I'll have infrequent access to email for the next week.
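
[Editorial sketch, not part of the original message.] The comparison at the core of a "safeness" verification can be written down independently of the ugly part (pinning threads and handing a cache line back and forth between processors). The sketch below assumes the samples were already collected in causal order, that is, each reading was taken only after the previous reading, from another processor, had been observed in memory. The struct and function names are hypothetical, and this is not the Linux or Xen warp check. A handoff counts as a warp when the later reading is less than or equal to the earlier one, matching the "optimal TSC" definition used earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct warp_report {
    size_t   warps;     /* handoffs where time failed to move forward */
    uint64_t max_warp;  /* largest backwards jump seen, in ticks      */
};

/* Scan causally ordered TSC samples and count the handoffs at which the
 * later reading is not strictly greater than the earlier one. */
struct warp_report check_warp(const uint64_t *samples, size_t n)
{
    struct warp_report r = { 0, 0 };

    assert(samples != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (samples[i] <= samples[i - 1]) {
            uint64_t back = samples[i - 1] - samples[i];
            r.warps++;
            if (back > r.max_warp)
                r.max_warp = back;
        }
    }
    return r;
}
```

For the samples 10, 20, 15, 15, 40 this reports two warps (20 to 15, and 15 to 15) with a maximum backwards jump of 5 ticks. A zero-warp result over one run shows only that no warp was observed during that run, which is why the message above asks for periodic rather than one-off verification.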
Jeremy Fitzhardinge
2009-Oct-09 20:28 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
On 10/09/09 02:34, Tim Deegan wrote:

> Your plans for usermode<-->hypervisor direct TSC integration seem to me to be an unpleasant hack. I understand that you have good business reasons for wanting it (even if you're not allowed to tell us explicitly what they are) and we've seen the justifications enough times that we don't need to cover them again here, but it's still a hack.
>
> I'm unhappy with the idea of kicking around the Xen timekeeping code (and introducing the usual bug-tail) to support introducing a usermode TSC. If there is to be a new mode for this, it should default to the current (works for everyone except the engineering team of a not-to-be-named enterprise application) behaviour.

I'm seeing an approx 12x performance improvement with gettimeofday() and clock_gettime() on systems with my vsyscall support patches (~1200ns/call -> ~100ns[*]). I think that should go a long way towards mitigating the performance concerns using standard APIs.

There's probably some scope for improving those numbers on systems with better-than-baseline tsc support (i.e. rdtscp and/or guaranteed synced tscs), but I think it's enough to get started with, especially given the broad applicability and relatively simple engineering.

[*] With native tsc; emulated tsc makes that 1700 -> 500, or only ~3.3x improvement.

    J
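
[Editorial sketch, not part of the original message.] Per-call figures like the ones above are typically obtained by timing a tight loop of timestamp calls and dividing by the iteration count. The sketch below does that with the C11 `timespec_get()` call as a portable stand-in for gettimeofday()/clock_gettime(); whether the C library routes it through the same fast path is an assumption, the function names are hypothetical, and this is not the benchmark used for the numbers quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current wall-clock time in nanoseconds, via C11 timespec_get(). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);

    assert(base == TIME_UTC);
    (void)base;
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Average cost, in nanoseconds, of one timestamp call, measured over
 * `iters` back-to-back calls. */
double ns_per_timestamp_call(unsigned long iters)
{
    volatile uint64_t sink = 0;   /* keeps the loop from being optimized out */
    uint64_t start, end;

    assert(iters > 0);
    start = now_ns();
    for (unsigned long i = 0; i < iters; i++)
        sink = now_ns();
    end = now_ns();
    (void)sink;
    return (double)(end - start) / (double)iters;
}
```

Calling `ns_per_timestamp_call(1000000)` inside and outside a guest gives a rough comparison; the result depends heavily on the machine, the kernel and the clocksource in use, and a wall-clock step during the run would distort it.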
Dan Magenheimer
2009-Oct-09 21:35 UTC
RE: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
Excellent! This is an extremely important piece of the puzzle now filled in.

Just for completeness, on your machine, what is the measurement for raw rdtsc?

(And if anybody believes this is the ONLY piece of the puzzle that is necessary, I would be happy to expand further.)
Jeremy Fitzhardinge
2009-Oct-10 00:22 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
On 10/09/09 14:35, Dan Magenheimer wrote:
> Excellent! This is an extremely important piece
> of the puzzle now filled in.
>
> Just for completeness, on your machine, what is
> the measurement for raw rdtsc?

A naked inline rdtsc is about 30ns, so only about a factor of 3 better. Which is a surprisingly small improvement given that the full gettimeofday path has ~150 instructions, including a couple of multiplies, quite a few jumps and two "lsl" instructions for vgetcpu (which each cost about 10ns). rdtsc is an expensive instruction...

J
Dan Magenheimer
2009-Oct-10 02:36 UTC
RE: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
> On 10/09/09 14:35, Dan Magenheimer wrote:
> > Excellent! This is an extremely important piece
> > of the puzzle now filled in.
> >
> > Just for completeness, on your machine, what is
> > the measurement for raw rdtsc?
>
> A naked inline rdtsc is about 30ns, so only about a factor of 3 better.
> Which is a surprisingly small improvement given that the full
> gettimeofday path has ~150 instructions, including a couple of
> multiplies, quite a few jumps and two "lsl" instructions for vgetcpu
> (which each cost about 10ns). rdtsc is an expensive instruction...
>
> J

Very nice!

One more measurement if you haven't already torn down your test environment: If you are at xen-unstable tip, with tsc emulation on, please try something like:

    for i in {0..100}; do
        xm debug-key s; xm dmesg | tail; sleep 1;
    done

to get an idea of the number of rdtsc's being done per second (and also divide by the number of cores so we have rdtsc's/sec/core). This is of course unloaded, so if you have a favorite load to throw on it, that would be very interesting also.

(Note that the s debug-key may be slow because xen is also now running check_tsc_warp each time.)

Thanks,
Dan
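Editorial aside: the "divide by the number of cores" step above can be sketched as follows. This is only an illustration under stated assumptions: it assumes the rdtsc counter reported by the debug key is cumulative (if it is reset on each read, the differencing step is unnecessary), the function name rates_per_core is hypothetical, and the sample values are made up for the example rather than taken from any real run. The sampling interval in the loop above is also slightly longer than 1 second because the xm commands themselves take time.

```python
def rates_per_core(cumulative_counts, interval_s, cores):
    """Convert successive cumulative rdtsc counts into rdtsc/sec/core.

    cumulative_counts: counter values read once per sampling interval
                       (ASSUMED cumulative; not confirmed by the thread).
    interval_s:        seconds between samples.
    cores:             number of cores to normalize by.
    """
    if interval_s <= 0 or cores <= 0:
        raise ValueError("interval_s and cores must be positive")
    # Difference adjacent samples to get the count per interval.
    deltas = [b - a for a, b in zip(cumulative_counts, cumulative_counts[1:])]
    # Normalize to a per-second, per-core rate.
    return [d / interval_s / cores for d in deltas]


# Hypothetical samples, one per second, on a 2-core machine.
samples = [0, 2000, 4400, 6400]
rates = rates_per_core(samples, interval_s=1.0, cores=2)
print(rates)
```

Taking the median of the resulting list, rather than the mean, would make the figure less sensitive to a single slow sample.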
Jeremy Fitzhardinge
2009-Oct-10 05:55 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
On 10/09/09 19:36, Dan Magenheimer wrote:
> Very nice!
>
> One more measurement if you haven't already torn down
> your test environment: If you are at xen-unstable tip,
> with tsc emulation on, please try something like:
>
> for i in {0..100}; do
> xm debug-key s; xm dmesg | tail; sleep 1;
> done
>
> to get an idea of the number of rdtsc's being
> done per second (and also divide by the number
> of cores so we have rdtsc's/sec/core). This is
> of course unloaded, so if you have a favorite
> load to throw on it, that would be very interesting
> also.

The kernel does between about 400k and 1.4M/sec, median around ~600k, for a git pull (which I think is single-threaded), and about 200k-500k/sec for a kernel compile (-j4 on 2 vcpus). Usermode is a much lower rate; around 1000/sec for the kernel compile.

Baseline idle is around 1000/sec kernel, 10/sec user.

Also, my inline naked rdtsc benchmark shows that the emulated rdtsc is taking around 465ns (vs 30, a 15x slowdown).

J
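Editorial aside: the per-call figures quoted so far in this thread can be lined up as simple ratios. The sketch below only restates numbers already given by Jeremy (all approximate, in nanoseconds); the dictionary labels and the helper name ratio are illustrative, not anything from the patches.

```python
# Approximate per-call costs in nanoseconds, as reported in this thread.
# Each entry is (slower case, faster case).
measurements = {
    "gettimeofday, native tsc (without -> with vsyscall patches)": (1200, 100),
    "gettimeofday, emulated tsc (without -> with vsyscall patches)": (1700, 500),
    "inline rdtsc (emulated -> native)": (465, 30),
}


def ratio(slow_ns, fast_ns):
    """How many times slower the first case is than the second."""
    return slow_ns / fast_ns


ratios = {name: ratio(*pair) for name, pair in measurements.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1f}x")
```

The three ratios correspond to the "approx 12x", "~3.3x" and "15x" figures quoted in the messages, which were rounded by their author.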
Keir Fraser
2009-Oct-10 06:35 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
On 10/10/2009 06:55, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
> The kernel does between about 400k and 1.4M/sec, median around ~600k,
> for a git pull (which I think is single-threaded), and about
> 200k-500k/sec for a kernel compile (-j4 on 2 vcpus). Usermode is a much
> lower rate; around 1000/sec for the kernel compile.
>
> Baseline idle is around 1000/sec kernel, 10/sec user.
>
> Also, my inline naked rdtsc benchmark shows that the emulated rdtsc is
> taking around 465ns (vs 30, a 15x slowdown).

Hmmm... So at 600k/sec, the kernel spends an appreciable amount of time (1-2%) doing RDTSCs? And with emulation that'll be more like 25-30%.

It's quite a surprisingly high rate.

-- Keir
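Editorial aside: the 1-2% and 25-30% estimates above follow from multiplying the call rate by the per-call cost. A minimal sketch, assuming all 600k calls/sec land on a single CPU (the git pull was thought to be single-threaded) and using the ~30ns native and ~465ns emulated costs quoted earlier; the function name is hypothetical.

```python
def rdtsc_time_fraction(calls_per_sec, ns_per_call):
    """Fraction of one CPU-second spent executing rdtsc."""
    return calls_per_sec * ns_per_call * 1e-9


# Median kernel rate during the git pull, with the two measured costs.
native = rdtsc_time_fraction(600_000, 30)     # native rdtsc, ~30ns
emulated = rdtsc_time_fraction(600_000, 465)  # emulated rdtsc, ~465ns

print(f"native: {native:.1%}, emulated: {emulated:.1%}")
```

At the 1.4M/sec peak rate the same arithmetic gives a proportionally larger share, so the emulated case would be well above the median estimate.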
Dan Magenheimer
2009-Oct-10 14:22 UTC
RE: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
> On 10/10/2009 06:55, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>
> > The kernel does between about 400k and 1.4M/sec, median around ~600k,
> > for a git pull (which I think is single-threaded), and about
> > 200k-500k/sec for a kernel compile (-j4 on 2 vcpus). Usermode is a much
> > lower rate; around 1000/sec for the kernel compile.
> >
> > Baseline idle is around 1000/sec kernel, 10/sec user.
> >
> > Also, my inline naked rdtsc benchmark shows that the emulated rdtsc is
> > taking around 465ns (vs 30, a 15x slowdown).
>
> Hmmm... So at 600k/sec, the kernel spends an appreciable amount of time
> (1-2%) doing RDTSCs? And with emulation that'll be more like 25-30%.
>
> It's quite a surprisingly high rate.
>
> -- Keir

I'm trying a kernel compile (-j4, 2 vcpus, 2 pcpus) and seeing about 1300/sec kernel and 500/sec user. My "idle" rate appears to be about 400/sec (kernel, and every now and then a handful of user rdtscs). That's with a cpu-only load....

    # while true; do i=i+1; done && while true; do i=i+1; done

It seems to be about 100/sec for a truly idle domain.

With an NFS untar I am seeing higher numbers though (~10K/sec).

(All these loads are on an EL5u2 32-bit PV guest.)

Jeremy, were you maybe measuring per hundred seconds, or per minute? Or, on the git pull, maybe your VNIC throughput is much much higher than mine and there is a getnstimeofday() call for each packet?

Another scary thought... what is gcc doing using rdtsc? Might it be randomly sensitive to the rdtsc discontinuities one will encounter with migration and we've just not seen it yet? In other words, is gcc a "fundamentally broken" app? ;-) And what is that other usermode app (service?) that seems to use a handful of rdtsc's when the system is cpu-only loaded? Is it using rdtsc safely?

The time ratio of emulated rdtsc to native rdtsc matches what I measured on my machine (360 vs 22), so 15x seems like a safe multiplier estimate to use.

Curiouser and curiouser...

Dan
Tim Deegan
2009-Oct-12 09:51 UTC
Re: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
Hi,

At 15:38 +0100 on 09 Oct (1255102732), Dan Magenheimer wrote:
> So color me optimistic that the problem is solved or
> at least that system vendors will only sin for a very
> good reason;

OK, we disagree, that's fine.

> Now all I'm trying to do is ensure that Xen virtual machines
> don't suffer their own "d*mn timestamp problem", especially
> given that VMware doesn't have one.

Jeremy seems to be taking care of this, AFAICS, using the existing APIs. That way everybody wins, not just people who are clued in enough to find and use a new xen-specific API. Also, AFAICS, without needing changes to Xen's own timekeeping/TSC code. There seem to be some other bugs on the HVM side but that's a separate discussion, I think.

> Yes, I admit it offends my aesthetics some. But I defend
> it to myself by believing that this is just a first step
> in a long road of closer collaboration between hypervisor
> and apps. Really the whole point of paravirtualization
> is to benefit from knowing that the underlying platform
> is virtual. Why should apps be excluded from the party?

In this case I don't think it helps. The OS should and can provide a fast and reliable time source to user space without needing a new hypervisor-to-application API for it, with all the portability and maintenance fun that that would bring.

> I think I've been very explicit: Some very large apps, both
> Oracle and non-Oracle, need a way to get a timestamp
> at a high frequency in a way that is both correct and
> very fast and works across a range of hardware/software
> environments, INCLUDING running under Xen.

There's a further requirement (which you have mentioned before) that people are unwilling/unable to accept kernel changes. I think that's a bit unreasonable.

> I AM exposed to some other companies' confidential
> information, so any appearance that I am hiding something
> is due to my clumsy attempts to dance around that
> in a public forum.

Understood; I don't blame you for the situation you find yourself in. But it doesn't change what you're asking for: an unpleasant hack and a reshuffle of the core timekeeping code to support unnamed third parties.

> > Might as well call it "application_broken" and default it the other
> > way. :) The system builders are entirely within their rights to have
> > separate clocks for separate sockets.
>
> If you agree with Jeremy's opinion that "any app that uses
> rdtsc is fundamentally broken", your syntax makes sense.

I'll stick with "many apps that use rdtsc are broken": it's harder than most people think and it doesn't do what some people want. (But no, I wouldn't seriously use that as the option name.)

Cheers,

Tim.

--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]