Is it "legal" to write to the TSC, e.g. via wrmsr(0x10,x,y), in a PV kernel? Assuming this were executed and would cause a GPF, I can't find the code in Xen that would handle it, or even ignore it.

There are uses of write_tsc in linux-2.6.18-xen... perhaps that code never gets executed?

Thanks,
Dan
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On 08/25/09 14:54, Dan Magenheimer wrote:
> Is it "legal" to write to the TSC, e.g. via wrmsr(0x10,x,y),
> in a PV kernel? Assuming this were executed and would cause
> a GPF, I can't find the code in Xen that would handle it, or
> even ignore it.

arch/x86/traps.c:emulate_privileged_op(), case 0x30. It looks like writing to 0x10 would be silently ignored.

Allowing it would require careful handling to avoid screwing up timekeeping (you'd need to update the timekeeping parameters), but it would also be fairly pointless because it would only affect the pcpu that the vcpu happens to be running on at that moment.

J
> > Is it "legal" to write to the TSC, e.g. via wrmsr(0x10,x,y),
> > in a PV kernel? Assuming this were executed and would cause
> > a GPF, I can't find the code in Xen that would handle it, or
> > even ignore it.
>
> arch/x86/traps.c:emulate_privileged_op(), case 0x30. It looks like
> writing to 0x10 would be silently ignored.

Hmmm... maybe I am misreading the code, but it looks like the default case will end up with "goto fail", which will not update IP and so will loop forever trapping on that instruction.

It appears that write_tsc calls are made in linux-2.6.18 (though apparently never get executed) but disappear somewhere before 2.6.24 and don't exist in 2.6.30 either. So perhaps write_tsc has never been executed in a PV guest and just doesn't work.

> Allowing it would require careful handling to avoid screwing up
> timekeeping (you'd need to update the timekeeping parameters),
> but also fairly pointless because it would only affect the pcpu
> that the vcpu happens to be running on at that moment.

I'm still working on TSC emulation which will return Xen system time. The physical TSC won't get changed, but maintaining an offset is necessary if it's possible for TSC to be "written". I guess I will ignore that possibility for now.

Hmmm... what about save/restore/migration? For pvclock to work properly across save/restore/migration, a Xen system time offset must already be handled, so I'm thinking I don't need to worry about that case.
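To make the "maintaining an offset" point concrete, here is a minimal sketch of an emulated TSC that returns system time plus a per-guest offset. It is an illustration only, not the patch under discussion: toy_vtsc, toy_vtsc_read and toy_vtsc_write are hypothetical names, and the assumption that the emulated counter ticks in nanoseconds of system time is taken from the description above.

```c
#include <stdint.h>

/* Hypothetical per-guest state for an emulated TSC (not a Xen structure). */
struct toy_vtsc {
    uint64_t offset; /* guest-visible value minus system time, modulo 2^64 */
};

/* Emulated read: system time plus whatever offset the guest has set up. */
static uint64_t toy_vtsc_read(const struct toy_vtsc *t, uint64_t system_time_ns)
{
    return system_time_ns + t->offset;
}

/* Emulated write: the physical TSC is untouched; only the offset changes,
 * so that a read at the same instant returns new_value. */
static void toy_vtsc_write(struct toy_vtsc *t, uint64_t system_time_ns,
                           uint64_t new_value)
{
    t->offset = new_value - system_time_ns;
}
```

If writes are defined to be ignored, the offset simply stays at zero and only the read path is needed. A restore on a host with a different system time could reuse the write path to carry the saved guest value across, which is the same kind of offset handling pvclock already depends on.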
On 26/08/2009 00:09, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>> arch/x86/traps.c:emulate_privileged_op(), case 0x30. It looks like
>> writing to 0x10 would be silently ignored.
>
> Hmmm... maybe I am misreading the code but it looks like the
> default case will end up with "goto fail" which will not
> update IP and so will infinite loop trapping on that instruction.
>
> It appears that write_tsc calls are made in linux-2.6.18 (though
> apparently never get executed) but disappear somewhere before
> 2.6.24 and don't exist in 2.6.30 either. So perhaps write_tsc
> has never been executed in a PV guest and just doesn't work.

Jeremy is correct. The TSC MSR cannot be written. The most that will happen is that Xen will print a warning message, but the WRMSR instruction will always be skipped over.

 -- Keir
> >> arch/x86/traps.c:emulate_privileged_op(), case 0x30. It looks like
> >> writing to 0x10 would be silently ignored.
> >
> > Hmmm... maybe I am misreading the code but it looks like the
> > default case will end up with "goto fail" which will not
> > update IP and so will infinite loop trapping on that instruction.
> >
> > It appears that write_tsc calls are made in linux-2.6.18 (though
> > apparently never get executed) but disappear somewhere before
> > 2.6.24 and don't exist in 2.6.30 either. So perhaps write_tsc
> > has never been executed in a PV guest and just doesn't work.
>
> Jeremy is correct. The TSC MSR cannot be written. Most that will
> happen is that Xen will print a warning message, but the WRMSR
> instruction will always be skipped over.

OK, I see, wrmsr_hypervisor_regs(0x10) and mce_wrmsr(0x10) and rdmsr_safe(0x10) all return 0, so the code at "invalid:" is executed and a warning is printk'd. So in the current implementation, write_tsc is skipped over.

But ARCHITECTURALLY does Xen consider write_tsc to be a no-op for PV domains, or is this just a case that's never been encountered before? In other words, if a future PV OS had a good reason to write_tsc, would we implement it (and make the necessary adjustments to Xen's usages of tsc) or just say, sorry, not allowed?
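The behaviour described above (no handler accepts the write, a warning is printed, and the instruction is skipped rather than retried) can be modelled roughly as follows. This is a toy sketch and not Xen's code: toy_vcpu, toy_msr_handled and toy_emulate_wrmsr are hypothetical names, and the real logic lives in emulate_privileged_op(). The only hard facts used are that the TSC is MSR 0x10 and that WRMSR is the two-byte opcode 0F 30.

```c
#include <stdint.h>
#include <stdio.h>

#define MSR_TSC 0x10u /* architectural MSR index of the time-stamp counter */

/* Toy guest state: just an instruction pointer and a warning counter. */
struct toy_vcpu {
    uint64_t ip;
    unsigned warnings;
};

/* Hypothetical stand-in for the chain of MSR-specific handlers.
 * Returns 1 if a handler accepted the write, 0 otherwise. */
static int toy_msr_handled(uint32_t msr)
{
    (void)msr;
    return 0; /* in this toy model, nothing claims MSR 0x10 */
}

/* Emulate a trapped WRMSR: warn if nobody handled it, then step over it. */
static void toy_emulate_wrmsr(struct toy_vcpu *v, uint32_t msr, uint64_t val)
{
    if (!toy_msr_handled(msr)) {
        v->warnings++;
        fprintf(stderr, "toy: ignoring WRMSR %#x <- %#llx\n",
                (unsigned)msr, (unsigned long long)val);
    }
    v->ip += 2; /* WRMSR is 0F 30: advance past it so the guest does not re-trap */
}
```

The point of the model is the last line: because the instruction pointer always advances, an unhandled write is a warning plus a no-op and not an infinite trap loop.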
On 26/08/2009 16:42, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> OK, I see, wrmsr_hypervisor_regs(0x10) and mce_wrmsr(0x10) and
> rdmsr_safe(0x10) all return 0, so the code at "invalid:" is
> executed and a warning is printk'd. So in the current
> implementation, write_tsc is skipped over.
>
> But ARCHITECTURALLY does Xen consider write_tsc to be a no-op
> for PV domains, or is this just a case that's never been
> encountered before? In other words, if a future PV OS had a
> good reason to write_tsc, would we implement it (and make
> the necessary adjustments to Xen's usages of tsc) or just say,
> sorry, not allowed?

There'd have to be a good argument for supporting it. I don't think we ever will.

 -- Keir
On 08/26/09 08:42, Dan Magenheimer wrote:
> But ARCHITECTURALLY does Xen consider write_tsc to be a no-op
> for PV domains, or is this just a case that's never been
> encountered before? In other words, if a future PV OS had a
> good reason to write_tsc, would we implement it (and make
> the necessary adjustments to Xen's usages of tsc) or just say,
> sorry, not allowed?

You can think of it this way: a Xen PV VCPU has no tsc. There is a register that can be read with "rdtsc", but that's purely part of Xen's time ABI and is not independently useful. The ABI includes no notion of writing to that register. Usermode code can execute "rdtsc", but without access to the rest of the time parameters it just returns some undefined bits with no relationship to time.

J
> On 08/26/09 08:42, Dan Magenheimer wrote:
> > But ARCHITECTURALLY does Xen consider write_tsc to be a no-op
> > for PV domains, or is this just a case that's never been
> > encountered before? In other words, if a future PV OS had a
> > good reason to write_tsc, would we implement it (and make
> > the necessary adjustments to Xen's usages of tsc) or just say,
> > sorry, not allowed?
>
> You can think of it this way: a Xen PV VCPU has no tsc. There is a
> register that can be read with "rdtsc", but that's purely part of Xen's
> time ABI and is not independently useful. The ABI includes no notion of
> writing to that register. Usermode code can execute "rdtsc", but
> without access to the rest of the time parameters it just returns some
> undefined bits with no relationship to time.

While I think I understand entirely why you would want to think of it that way, there are thousands (millions?) of applications out there that would beg to differ. They DO assume that rdtsc bears "some" relationship to time. Indeed Linux itself does.

Exactly what that relationship to time is defined to be is open to debate, and whether Xen supports whatever relationship is defined is also debatable (especially in the presence of migration). But defining rdtsc as returning random bits is not an acceptable solution for Xen. Dom0 won't even boot if rdtsc returns random bits, so Xen must already be guaranteeing that rdtsc has "some" relationship to time.

We've been lucky so far with allowing rdtsc to execute directly in hardware, but we really do need to fix it properly.

But since applications cannot WRITE to tsc and Xen has some control over the OS->Xen PV API, it might be safe to define that write_tsc is a no-op.
On 08/26/09 13:23, Dan Magenheimer wrote:
>> You can think of it this way: a Xen PV VCPU has no tsc. There is a
>> register that can be read with "rdtsc", but that's purely part of Xen's
>> time ABI and is not independently useful. The ABI includes no notion of
>> writing to that register. Usermode code can execute "rdtsc", but
>> without access to the rest of the time parameters it just returns some
>> undefined bits with no relationship to time.
>
> While I think I understand entirely why you would want to
> think of it that way, there's thousands (millions?) of applications
> out there that would beg to differ. They DO assume that
> rdtsc bears "some" relationship to time.

They are wrong. Linux doesn't offer the tsc to usermode for its use. The closest it gets is vgettimeofday, which we could implement better.

> Indeed Linux itself does.

A pv linux guest doesn't have a TSC in the same way that it doesn't have a TSS or any number of other CPU features. It would be a grave error for the kernel to use a tsc-based clocksource rather than the Xen pv clocksource. A Xen PV VCPU bears a passing resemblance to an Intel x86 CPU, but should not be confused with one.

> Exactly what that relationship to time is defined to be is
> open to debate, and whether Xen supports whatever relationship
> is defined is also debatable (especially in the presence of
> migration). But defining rdtsc as returning random bits
> is not an acceptable solution for Xen. Dom0 won't even
> boot if rdtsc returns random bits so Xen must already be
> guaranteeing that rdtsc has "some" relationship to time.

No, it really doesn't. It provides a PV clock, which includes "rdtsc" as part of its ABI. It is not a general tsc. You can't meaningfully execute "rdtsc" without also being (indirectly) aware of what pcpu it's running on and applying the appropriate corrections to turn it into system monotonic time. Executing rdtsc willy-nilly gets you useless results; fortunately no PV Xen kernel does that.

> We've been lucky so far with allowing rdtsc to execute directly
> in hardware, but we really do need to fix it properly.

No, that's false. The current Xen time model works fine for all guests using it correctly.

Emulating rdtsc for hvm guests is another question entirely.

> But since applications cannot WRITE to tsc and Xen has some
> control over the OS->Xen PV API, it might be safe to define that
> write_tsc is a no-op.

No, write_tsc is meaningless, and anyone trying to execute it is not even wrong.

J
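As an illustration of "applying the appropriate corrections", the sketch below shows the usual pvclock-style arithmetic: take the TSC delta since the last hypervisor update for the current VCPU, scale it, and add it to the system time recorded at that update. It is a simplified model under stated assumptions: toy_time_info and toy_tsc_to_ns are hypothetical names, the struct only loosely follows the shape of a per-VCPU time record (it is not the exact ABI layout), and the version counter that a real reader must re-check around the read is omitted.

```c
#include <stdint.h>

/* Simplified, hypothetical per-VCPU time record (not the exact ABI layout). */
struct toy_time_info {
    uint64_t tsc_timestamp;     /* TSC value when system_time was sampled */
    uint64_t system_time;       /* system time in ns at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* fixed-point multiplier, 32 fractional bits */
    int8_t   tsc_shift;         /* power-of-two pre-scale applied to the delta */
};

/* Convert a raw TSC reading into system time using one VCPU's parameters. */
static uint64_t toy_tsc_to_ns(const struct toy_time_info *ti, uint64_t tsc)
{
    uint64_t delta = tsc - ti->tsc_timestamp;

    if (ti->tsc_shift < 0)
        delta >>= -ti->tsc_shift;
    else
        delta <<= ti->tsc_shift;

    /* (delta * mul) >> 32 without a 128-bit type: split delta into halves. */
    uint64_t hi = delta >> 32;
    uint64_t lo = delta & 0xffffffffu;
    uint64_t ns = hi * ti->tsc_to_system_mul
                + ((lo * ti->tsc_to_system_mul) >> 32);

    return ti->system_time + ns;
}
```

The parameters are only valid for the processor the reading was taken on, which is why a bare rdtsc from code that cannot see them tells it very little.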
> > While I think I understand entirely why you would want to
> > think of it that way, there's thousands (millions?) of applications
> > out there that would beg to differ. They DO assume that
> > rdtsc bears "some" relationship to time.
>
> They are wrong. Linux doesn't offer the tsc to usermode for its use.
> The closest it gets is vgettimeofday, which we could implement better.

Linux doesn't have to offer it. The Intel x86 CPU does. It's a legal instruction for an app to use and (quote from Intel SDM) "is guaranteed to return a monotonically increasing unique value whenever executed except for 64-bit wraparound." While that's not precisely a "relationship" to time, mere mortals programming are likely to interpret it that way.

(Keir, please note that it says monotonically-increasing, not monotonically-non-decreasing, so I think the current softtsc implementation for HVM is incorrect.)

> > Indeed Linux itself does.
>
> A pv linux guest doesn't have a TSC in the same way that it doesn't have
> a TSS or any number of other CPU features. It would be a grave error
> for the kernel to use a tsc-based clocksource rather than the Xen pv
> clocksource. A Xen PV VCPU bears a passing resemblance to an Intel x86
> CPU, but should not be confused with one.

So are you going to guarantee that 2.6.31 Linux when running on Xen has no uses or dependencies on rdtsc delivering anything other than a random value?

> > Exactly what that relationship to time is defined to be is
> > open to debate, and whether Xen supports whatever relationship
> > is defined is also debatable (especially in the presence of
> > migration). But defining rdtsc as returning random bits
> > is not an acceptable solution for Xen. Dom0 won't even
> > boot if rdtsc returns random bits so Xen must already be
> > guaranteeing that rdtsc has "some" relationship to time.
>
> No, it really doesn't. It provides a PV clock, which includes "rdtsc"
> as part of its ABI. It is not a general tsc. You can't meaningfully
> execute "rdtsc" without also being (indirectly) aware of what pcpu its
> running on and applying the appropriate corrections to turn it into
> system monotonic time. Executing rdtsc willy-nilly gets you useless
> results; fortunately no PV Xen kernel does that.

While what you are saying may seem reasonable, I think you will find by looking at linux-2.6.18-xen that it is not true in reality. If you trap kernel uses of rdtsc and return random values, dom0 will not boot.

> > We've been lucky so far with allowing rdtsc to execute directly
> > in hardware, but we really do need to fix it properly.
>
> No, that's false. The current Xen time model works fine for all guests
> using it correctly.
>
> Emulating rdtsc for hvm guests is another question entirely.

In the end, I don't care if rdtsc's in the kernel are emulated (and the patch I submitted earlier doesn't emulate them other than to do a "slow" rdtsc). But apps don't care if they are running on an HVM or a PVM, so if they use rdtsc, even if you believe that usage of rdtsc is incorrect, rdtsc must deliver what the Intel ABI guarantees.

> > But since applications cannot WRITE to tsc and Xen has some
> > control over the OS->Xen PV API, it might be safe to define that
> > write_tsc is a no-op.
>
> No, write_tsc is meaningless, and anyone trying to execute it is not
> even wrong.

In that case, are you saying it is an illegal instruction for a PV guest to execute? If so, we should not ignore it, we should fail the guest. But that would be unfortunate for the RHEL5-64bit PV guests that actually DO use it.

Dan
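The distinction drawn above between "monotonically increasing" and "monotonically non-decreasing" matters for any emulated counter whose underlying clock is coarse enough to return the same value twice. One way to get strictly increasing results is sketched below. This is an assumption-laden illustration, not the softtsc code: toy_mono and toy_mono_next are hypothetical names, and the sketch is single-threaded (a real implementation shared between VCPUs would need an atomic compare-and-swap on the last value).

```c
#include <stdint.h>

/* Hypothetical state: the last value handed out to the guest. */
struct toy_mono {
    uint64_t last;
};

/* Return a value strictly greater than any previously returned value.
 * If the raw clock has not advanced (or went backwards), bump by one. */
static uint64_t toy_mono_next(struct toy_mono *m, uint64_t raw)
{
    if (raw <= m->last)
        raw = m->last + 1;
    m->last = raw;
    return raw;
}
```

The cost of this approach is that values returned during a stall run slightly ahead of the underlying clock until it catches up.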
Dan Magenheimer wrote:
> In that case, are you saying it is an illegal instruction for a PV
> guest to execute? If so, we should not ignore it, we should fail
> the guest. But that would be unfortunate for the RHEL5-64bit
> PV guests that actually DO use it.

Wait, what? Could you point out where this is in RHEL-5 64-bit PV? The only case of write_tsc() I see in the code is in arch/i386/kernel/smpboot.c, which is not used by the Xen PV implementation in RHEL-5. Where else in the PV implementation does a write_tsc?

--
Chris Lalancette
> as part of its ABI. It is not a general tsc. You can't meaningfully
> execute "rdtsc" without also being (indirectly) aware of what pcpu its
> running on and applying the appropriate corrections to turn it into
> system monotonic time. Executing rdtsc willy-nilly gets you useless
> results; fortunately no PV Xen kernel does that.

Actually for user space this isn't at all true. You can use rdtsc directly and sample the data for things like profiling, then correct for things like spikes and skews from processor switches by filtering.

> No, write_tsc is meaningless, and anyone trying to execute it is not
> even wrong.

Writing to the tsc is perfectly reasonable providing the tsc is an advertised feature. Being able to use the tsc becomes much more relevant with newer processors which have sane tsc implementations in the architecture, however.

Unfortunately, if you hide the tsc and hide the tsc flag in the cpu info, lots of stuff doesn't run due to crap coding 8(
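A rough sketch of the filtering approach mentioned above: given pairs of raw counter readings taken around the code being profiled, discard intervals that run backwards (a likely sign of a switch to a processor with a skewed counter) and intervals that are implausibly long (a spike from preemption or migration), then average what remains. The function name toy_filtered_mean and the fixed max_plausible threshold are assumptions for illustration; real profilers often use medians or percentile cut-offs instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Mean of the plausible intervals end[i] - start[i]; returns 0 if none survive.
 * Intervals that are zero, negative, or longer than max_plausible are dropped. */
static uint64_t toy_filtered_mean(const uint64_t *start, const uint64_t *end,
                                  size_t n, uint64_t max_plausible)
{
    uint64_t sum = 0;
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (end[i] <= start[i])
            continue;               /* went backwards: probably changed CPU */
        uint64_t d = end[i] - start[i];
        if (d > max_plausible)
            continue;               /* spike: preempted or migrated mid-sample */
        sum += d;
        kept++;
    }
    return kept ? sum / kept : 0;
}
```

This kind of post-processing needs no help from the kernel or hypervisor, but it only yields statistical answers. It cannot turn a single rdtsc reading into a timestamp.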
> Dan Magenheimer wrote:
> > In that case, are you saying it is an illegal instruction for a PV
> > guest to execute? If so, we should not ignore it, we should fail
> > the guest. But that would be unfortunate for the RHEL5-64bit
> > PV guests that actually DO use it.
>
> Wait, what? Could you point out where this is in RHEL-5 64-bit PV?
> The only case of write_tsc() I see in the code is in
> arch/i386/kernel/smpboot.c, which is not used by the Xen PV
> implementation in RHEL-5. Where else in the PV implementation does
> a write_tsc?

Hi Chris --

I was surprised also, and digging deeper it looks like I was mistaken.

I instrumented a hypervisor so that Xen would printk a console message if it was ignoring a wrmsr and was getting output when I launched a RHEL-5 PV guest. But I refined the printk and it is NOT wrmsr(0x10) so you're right, it is NOT a write_tsc.

Thanks for pointing out my error.

Dan
Dan Magenheimer wrote:
>> Dan Magenheimer wrote:
>>> In that case, are you saying it is an illegal instruction for a PV
>>> guest to execute? If so, we should not ignore it, we should fail
>>> the guest. But that would be unfortunate for the RHEL5-64bit
>>> PV guests that actually DO use it.
>> Wait, what? Could you point out where this is in RHEL-5 64-bit PV?
>> The only case of write_tsc() I see in the code is in
>> arch/i386/kernel/smpboot.c, which is not used by the Xen PV
>> implementation in RHEL-5. Where else in the PV implementation does
>> a write_tsc?
>
> Hi Chris --
>
> I was surprised also, and digging deeper it looks like I was mistaken.
>
> I instrumented a hypervisor so that Xen would printk a console
> message if it was ignoring a wrmsr and was getting output
> when I launched a RHEL-5 PV guest. But I refined the
> printk and it is NOT wrmsr(0x10) so you're right, it is
> NOT a write_tsc.
>
> Thanks for pointing out my error.

OK, cool, no problem. I just wanted to make sure I wasn't missing something.

Thanks,
--
Chris Lalancette
On 08/27/09 01:48, Alan Cox wrote:
>> as part of its ABI. It is not a general tsc. You can't meaningfully
>> execute "rdtsc" without also being (indirectly) aware of what pcpu its
>> running on and applying the appropriate corrections to turn it into
>> system monotonic time. Executing rdtsc willy-nilly gets you useless
>> results; fortunately no PV Xen kernel does that.
>
> Actually for user space this isn't at all true. You can use rdtsc
> directly and sample the data for things like profiling then correct for
> things like spikes and skews from processor switches by filtering.

If an app is sophisticated enough to do this correctly then it doesn't need any special assistance from a hypervisor to make the tsc well-behaved. It should continue to work even in a Xen guest where both the process can skip between VCPUs and the VCPUs can skip between PCPUs.

>> No, write_tsc is meaningless, and anyone trying to execute it is not
>> even wrong.
>
> Writing to the tsc is perfectly reasonable providing the tsc is an
> advertised feature. Being able to use the tsc becomes much more relevant
> with newer processors which have sane tsc implementations in the
> architecture however.

Apparently on some large servers the tsc is only synced and sane within a NUMA node, and not globally across all processors, so any app which assumed sane tsc behaviour would break when the hardware gets scaled up.

But in this case I'm talking specifically about a Xen PV guest, where the tsc is claimed for use by the Xen clocksource ABI.

> Unfortunately if you hide the tsc and hide the tsc flag in the cpu info
> lots of stuff doesn't run due to crap coding 8(

And you can't actually hide the TSC flag in cpuid without virtualization extensions.

J
> On 08/27/09 01:48, Alan Cox wrote:
> >> as part of its ABI. It is not a general tsc. You can't meaningfully
> >> execute "rdtsc" without also being (indirectly) aware of what pcpu its
> >> running on and applying the appropriate corrections to turn it into
> >> system monotonic time. Executing rdtsc willy-nilly gets you useless
> >> results; fortunately no PV Xen kernel does that.
> >
> > Actually for user space this isn't at all true. You can use rdtsc
> > directly and sample the data for things like profiling then correct for
> > things like spikes and skews from processor switches by filtering.
>
> If an app is sophisticated to do this correctly then it doesn't need any
> special assistance from a hypervisor to make the tsc well-behaved. It
> should continue to work even in a Xen guest where both the process can
> skip between VCPUs and the VCPUs can skip between PCPUs.

No, I don't think this is true. An enterprise app that binds processes to fixed physical processors on a physical machine can make assumptions about the results of rdtsc that aren't valid when the vcpus can skip between pcpus. Further, like Linux itself, applications may test assumptions about tsc at startup that are assumed to remain valid for the life of the app, which is perfectly reasonable on a physical machine and a bad mistake in a virtualized environment.

> >> No, write_tsc is meaningless, and anyone trying to execute it is not
> >> even wrong.
> >
> > Writing to the tsc is perfectly reasonable providing the tsc is an
> > advertised feature. Being able to use the tsc becomes much more relevant
> > with newer processors which have sane tsc implementations in the
> > architecture however.
>
> Apparently on some large servers the tsc is only synced and sane within
> a NUMA node, and not globally across all processors, so any app which
> assumed sane tsc behaviour would break when the hardware gets scaled up.

True, but any app that tries to run on a NUMA machine without being aware of the idiosyncrasies of a NUMA machine probably has worse problems to deal with than tsc sync. Further, there are many many apps that will likely never ever run on those machines. Are we going to penalize all apps all the time because some might run some of the time on a machine where tsc is not synced?

> But in this case I'm talking specifically about a Xen PV guest, where
> the tsc is claimed for use by the Xen clocksource ABI.

I just don't understand how you can say that a valid userland instruction is "claimed for use" by Xen (or Linux or both).
> No, I don't think this is true. An enterprise app that binds processes
> to fixed physical processors on a physical machine can make
> assumptions about the results of rdtsc that aren't valid when
> the vcpus can skip between pcpus. Further, like Linux itself,

They rarely make the right assumptions.

> applications may test assumptions about tsc at startup that are
> assumed to remain valid for the life of the app, which is
> perfectly reasonable on a physical machine

No, it isn't, because of things like suspend/resume.

> True, but any app that tries to run on a NUMA machine without
> being aware of the idiosyncracies of a NUMA machine probably
> has worse problems to deal with than tsc sync. Further, there

Disagree - this is true if your NUMA factor is large, but quite a few machines today are "vaguely NUMA" - the NUMA factor is low enough that the app doesn't need to care. Anyway, you don't need NUMA to see TSC skew between cores.
Hi Alan --

> > No, I don't think this is true. An enterprise app that binds processes
> > to fixed physical processors on a physical machine can make
> > assumptions about the results of rdtsc that aren't valid when
> > the vcpus can skip between pcpus. Further, like Linux itself,
>
> They rarely make the right assumptions

I freely admit that a high percentage of apps-that-use-rdtsc are at risk of being buggy if moved from a "tsc safe" machine to a "tsc unsafe" machine. But, echoing your earlier reply, there are some that are careful and smart about using rdtsc.

Jeremy's claim is that because some apps-that-use-rdtsc risk bugginess, Xen can claim rdtsc for its own use and effectively disallow all uses of rdtsc in any app by breaking the existing, sometimes-useful semantics of the instruction.

> > True, but any app that tries to run on a NUMA machine without
> > being aware of the idiosyncracies of a NUMA machine probably
> > has worse problems to deal with than tsc sync. Further, there
>
> Disagree - this is true if your NUMA factor is large but quite a few
> machines today are "vaguely NUMA" - the NUMA factor is low enough the
> app doesn't need to care. Anyway you don't need NUMA to see TSC
> skew between cores.

Yes, but I think we are agreeing here. My point, poorly made I admit, is that there are a lot of different machine topologies and we can't force all applications to conform to the lowest common denominator.
> Jeremy's claim is that because some apps-that-use-
> rdtsc risk bugginess, Xen can claim rdtsc for its own
> use and effectively disallow all uses of rdtsc in any
> app by breaking the existing, sometimes-useful semantics
> of the instruction.

If Xen is hiding the tsc cpu feature from the kernel/apps, it can. One problem is that a lot of grotty code simply explodes without rdtsc working.

The alternative is to virtualise the TSC as some other hypervisors do, but that has other impacts.
On 08/27/09 20:29, Dan Magenheimer wrote:
>> If an app is sophisticated to do this correctly then it doesn't need any
>> special assistance from a hypervisor to make the tsc well-behaved. It
>> should continue to work even in a Xen guest where both the process can
>> skip between VCPUs and the VCPUs can skip between PCPUs.
>
> No, I don't think this is true. An enterprise app that binds processes
> to fixed physical processors on a physical machine can make
> assumptions about the results of rdtsc that aren't valid when
> the vcpus can skip between pcpus.

You can bind a vcpu to a pcpu or group of pcpus with the right tsc properties. At this point you're talking about a specialized non-portable app with very sensitive dependencies on the system software and underlying hardware, so requiring some special effort to virtualize it doesn't seem like a big problem.

> Further, like Linux itself,
> applications may test assumptions about tsc at startup that are
> assumed to remain valid for the life of the app, which is
> perfectly reasonable on a physical machine and a bad mistake
> in a virtualized environment.

Not really. An app can't tell whether its initial test happened to be in a stable period that will be later upset by a power event, suspend/resume, migration via some other mechanism (like vserver/containers), etc, etc. An app making such assumptions will be very machine and system dependent, and not at all portable.

> True, but any app that tries to run on a NUMA machine without
> being aware of the idiosyncracies of a NUMA machine probably
> has worse problems to deal with than tsc sync. Further, there
> are many many apps that will likely never ever run on those
> machines.

Who can say? Effects caused by locality issues will only result in performance problems rather than outright correctness problems.

> Are we going to penalize all apps all the time
> because some might run some of the time on a machine where
> tsc is not synced?

They're already penalized. The population of machines with a tsc which can be used in the manner you're suggesting is very small, and even then there are strong caveats.

>> But in this case I'm talking specifically about a Xen PV guest, where
>> the tsc is claimed for use by the Xen clocksource ABI.
>
> I just don't understand how you can say that a valid userland
> instruction is "claimed for use" by Xen (or Linux or both).

Apps are free to try and use the tsc in any way they feel like, but it has never had any guaranteed properties. Some uses are completely reasonable (like using it as some entropy to seed an RNG, for example). At one point the kernel did disable the tsc for usermode use, but that was quickly reverted (or perhaps it never made it to mainline) because it's not for the kernel to break backwards compatibility for the sake of second-guessing usermode.

I think this is getting a bit repetitive.

J
> I think this is getting a bit repetitive.

True, and we are going down some unfortunate ratholes. So let's see if we can focus on the core of the disagreement.

> Apps are free to try and use the tsc in any way they
> feel like, but it has never had any
> GUARANTEED [djm's emphasis] properties.

I think this is the key difference of opinion which must be resolved. If what you say is true, your other positions make sense. If it is false, they make much less sense. (And unfortunately it is not a black and white issue.)

There ARE guaranteed properties specified by the Intel SDM for any _single_ processor, namely that rdtsc is "guaranteed to return a monotonically increasing unique value whenever executed, except for 64-bit counter wraparound. Intel guarantees that the time-stamp counter will not wrap-around within 10 years after being reset." Both uses of the word "guarantee" are quoted from the Intel SDM.

What is NOT guaranteed, but is widely and incorrectly assumed to be implied and has gotten us into this mess, is that the same properties apply across multiple processors. And there are notable examples of systems where the properties do NOT apply. So it is true that an app that does not know conclusively that certain threads are running on certain processors cannot always safely use rdtsc to obtain the single-processor-guaranteed results.

BUT some software systems (including VMware) do provide this guarantee across multiple processors. And recent families of both Intel and AMD multi-core have advanced to the point where the properties apply across all cores, so on the vast majority (but admittedly not all) of future physical systems, apps can and will use rdtsc and expect the properties to apply (whether guaranteed or not).

So in your opinion, some systems are broken so Xen should assume all future systems are broken. In my opinion, the problem is being fixed in hardware and has always been fixed in VMware, so Xen should look to the future not the past.

Does that sound like a good summary of this disagreement?

P.S. Summarizing the broader discussion on a new thread.
Dan Magenheimer
2009-Aug-28 17:49 UTC
[Xen-devel] rdtsc: correctness vs performance on Xen (and KVM?)
To summarize:

Xen and KVM currently allow rdtsc to be executed directly by userland. As a result, apps that use rdtsc smartly and effectively on (some) physical machines may break badly in Xen or KVM because of the disassociation of physical and virtual cpus. (Readers not familiar with why rdtsc is a problem can read e.g. http://en.wikipedia.org/wiki/Rdtsc)

VMware always emulates rdtsc, both for kernel and userland rdtsc's. (I don't know what HyperV does.)

Xen currently has a boot option to always emulate rdtsc in HVM guests and just added code such that the same boot option will always emulate rdtsc for userland-only in PVM guests. There is some agreement in the Xen community that rdtsc emulation should always be the default though the default is currently off. KVM is having a similar discussion and, I'm told, has also come to the conclusion that emulating rdtsc is a necessary evil.

The problem is that emulating rdtsc is slow. On my dual-core Conroe, rdtsc is about 72 cycles and emulating rdtsc (returning a fixed frequency 1GHz Xen monotonic system time) is over 15x slower. This is a big hit for apps that do tens to hundreds of thousands of rdtsc's per processor per second. (And yes these apps are more common than one might think.)

VMware has the advantage of binary translation; rdtsc can be translated to return a "conforming" value in ~200 cycles (on an older processor so probably faster if you are comparing against my dual-core Conroe numbers above). This value is "stale" (not linear with wallclock time). For VMs that need rdtsc to more accurately reflect wallclock time, full emulation can be optionally enabled for a VM.

I'm searching for alternatives that provide the correctness of emulation, but better performance than emulation.

Jeremy points out that the pvclock mechanism in upstream Linux works well, but the pvclock data is currently only exposed to the kernel... and exposing it to userland still requires apps-using-rdtsc to be rewritten.

But Jeremy claims that all apps-that-use-rdtsc MUST be rewritten because using rdtsc is unsafe, and that they should be rewritten to use gettimeofday (or actually vgettimeofday). But on older OS's (including the vast majority of installed units) and machines where tsc is "unsafe", gettimeofday can be MUCH slower than emulating rdtsc. So telling app writers to convert all uses of rdtsc to gettimeofday is not an acceptable solution for these apps in the short term.

My current thinking is that we (the Linux and Xen and KVM community) should architect a userland API using the pvclock mechanism. The underlying implementation of this API would utilize Linux only to "register" the mechanism, preferably via a module so that it, like disk and network frontends, could easily be bolted on to shipping OS's. Individual uses of "pvclock_read" would need no syscall... like the kernel pvclock mechanism, they need only access memory to get the necessary scaling and offset data. Once instantiated, rdtsc is executed directly by the app as part of the pvclock protocol. If never registered, rdtsc would always be trapped and emulated.

I realize this idea is half-baked, but would like to invite other TSC/time experts to determine if some or all of the idea might be used to achieve a fully-baked solution.
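For readers unfamiliar with the pvclock protocol, the read side described above (access memory for scaling and offset data, then combine it with a raw tsc reading) can be sketched roughly as follows. This is a simplified illustration, not the kernel's or Xen's code. The struct only mirrors the general shape of a per-vcpu time record, the names `pv_time_info`, `pv_scale_delta` and `pv_read_ns` are made up for this example, the raw tsc value is passed in as a parameter instead of being read with rdtsc inside the loop, and the read barriers a real implementation needs around the version checks are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu time record, modelled loosely on pvclock. */
struct pv_time_info {
    volatile uint32_t version;   /* odd while the hypervisor is updating */
    uint64_t tsc_timestamp;      /* tsc value when system_time was sampled */
    uint64_t system_time;        /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul;  /* fixed-point multiplier, scaled by 2^32 */
    int8_t   tsc_shift;          /* pre-shift applied to the tsc delta */
};

/* Convert a tsc delta to nanoseconds: shift, then multiply by mul / 2^32. */
static uint64_t pv_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    uint64_t hi, lo;

    assert(shift > -64 && shift < 64);
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    /* (delta * mul) >> 32, computed without a 128-bit intermediate. */
    hi = delta >> 32;
    lo = delta & 0xffffffffu;
    return hi * mul + ((lo * mul) >> 32);
}

/* Nanoseconds corresponding to the raw tsc reading tsc_now. */
static uint64_t pv_read_ns(const struct pv_time_info *ti, uint64_t tsc_now)
{
    uint32_t v;
    uint64_t ns;

    do {
        v = ti->version;
        ns = ti->system_time +
             pv_scale_delta(tsc_now - ti->tsc_timestamp,
                            ti->tsc_to_system_mul, ti->tsc_shift);
        /* Retry if an update was in progress or happened meanwhile. */
    } while ((v & 1) || v != ti->version);

    return ns;
}
```

The point of the version check is that the reader never takes a lock: it simply retries if the record changed underneath it, which is what would let an app do this from usermode without a syscall.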
> > Jeremy's claim is that because some apps-that-use-rdtsc
> > risk bugginess, Xen can claim rdtsc for its own
> > use and effectively disallow all uses of rdtsc in any
> > app by breaking the existing, sometimes-useful semantics
> > of the instruction.
>
> If Xen is hiding the tsc cpu feature from the
> kernel/apps it can.

True, it can, but Xen does not currently do so and there has been no proposal for Xen to do so. And given Xen's policy of supporting all existing applications, I don't expect that a proposal to hide the tsc cpu feature will fly.

> One problem there is a lot of grotty code simply
> explodes without rdtsc working.

Indeed. While it might be satisfying to legislate against stupidity, it rarely works. :-)

> The alternative is to virtualise the TSC as some other
> hypervisors do but that has other impacts.

Yes, this is where this whole discussion started. Let me summarize, but start a separate thread to do so.
On 08/28/09 10:49, Dan Magenheimer wrote:
>> Apps are free to try and use the tsc in any way they
>> feel like, but it has never had any
>> GUARANTEED [djm's emphasis] properties.
>
> I think this is the key difference of opinion which
> must be resolved. If what you say is true, your
> other positions make sense. If it is false,
> they make much less sense. (And unfortunately
> it is not a black and white issue.)
>
> There ARE guaranteed properties specified by
> the Intel SDM for any _single_ processor,
> namely that rdtsc is "guaranteed to return
> a monotonically increasing unique value whenever
> executed, except for 64-bit counter wraparound.
> Intel guarantees that the time-stamp counter
> will not wrap-around within 10 years after being
> reset." Both uses of the word "guarantee"
> are quoted from the Intel SDM.

Yes, but those are fairly weak guarantees. It does not guarantee that the tsc won't change rate arbitrarily, or stop outright between reads.

> What is NOT guaranteed, but is widely and
> incorrectly assumed to be implied and has
> gotten us into this mess, is that
> the same properties apply across multiple
> processors.

Yes, Linux offers even weaker guarantees than Intel. Aside from the processor migration issue, the tsc can jump arbitrarily as a result of suspend/resume (ie, it can be non-monotonic).

> And there are notable examples
> of systems where the properties do NOT apply.
> So it is true that an app that
> does not know conclusively that certain threads
> are running on certain processors cannot
> always safely use rdtsc to obtain the
> single-processor-guaranteed results.
>
> BUT some software systems (including VMware) do
> provide this guarantee across multiple processors.
> And recent families of both Intel and AMD
> multi-core have advanced to the point where
> the properties apply across all cores, so
> on the vast majority (but admittedly not all)
> of future physical systems, apps can and will
> use rdtsc and expect the properties to apply
> (whether guaranteed or not).

Even very recent processors with "constant" tscs (ie, they don't change rate with the core frequency) stop in certain power states. Any motherboard design which runs packages in different clock-domains will lose tsc-sync between those packages, regardless of what's in the packages.

The "sane tsc" properties are primarily for the benefit of kernels, to allow them to make better use of the tsc. They will have enough knowledge of the overall system architecture to know how and when the tsc can be trusted.

Usermode apps can try to piggyback onto this if they like, but they're in much more treacherous territory. They can never know what the underlying system design is, or whether it's really safe to trust the tsc's sanity. And without some explicit guarantees on Linux's part, the tsc will still be non-monotonic over suspend/resume (in all its many forms).

> So in your opinion, some systems are broken
> so Xen should assume all future systems are
> broken. In my opinion, the problem is being
> fixed in hardware and has always been fixed
> in VMware, so Xen should look to the future
> not the past.
>
> Does that sound like a good summary of this
> disagreement?

Not quite.

You are talking about three different cases:

   1. the reliability of the tsc in a PV guest in kernel mode
   2. the reliability of the tsc in a PV guest in user mode
   3. the reliability of the tsc in an HVM guest

I don't think 1. needs any attention. The current scheme works fine.

The only option for 3 is to try to make a best effort at tsc quality, which ranges from trapping every rdtsc to make them all give globally monotonic results, to using the other VT/SVM features to apply an offset from the raw tsc to a guest tsc, etc. Either way the situation isn't much different from running native (ie, apps will see basically the same tsc behaviour as in the native case, to some degree of approximation).

So, there's case 2: pv usermode. There are four classes of apps worth considering here:

   1. Old apps which make unwarranted assumptions about the behaviour of
      the tsc. They assume they're basically running on some equivalent
      of a P54, and so will get junk on any modernish system with SMP
      and/or power management. If people are still using such apps, it
      probably means their performance isn't critically dependent on the
      tsc.
   2. More sophisticated apps which know the tsc has some limitations
      and try to mitigate them by filtering discontinuities, using
      rdtscp, etc. They're best-effort, but they inherently lack enough
      information to do a complete job (they have to guess at where
      power transitions occurred, etc).
   3. New apps which know about modern processor capabilities, and
      attempt to rely on constant_tsc, forgoing all the best-effort
      filtering, etc.
   4. Apps which use gettimeofday() and/or clock_gettime() for all time
      measurement. They're guaranteed to get consistent time results,
      perhaps at the cost of a syscall. On systems which support it,
      they'll get vsyscall implementations which avoid the syscall while
      still using the best-possible clocksource. Even if they don't, a
      syscall will outperform an emulated rdtsc.

Class 1 apps are just broken. We can try to emulate a UP, no-PM processor for them, and that's probably best done in an HVM domain. There's no need to go to extraordinary efforts for them because the native hardware certainly won't.

Class 2 apps will work as well as ever in a Xen PV domain as-is. If they use rdtscp then they will be able to correlate the tsc to the underlying pcpu and manage consistency that way. If they pin threads to VCPUs, then they may also require VCPUs to be pinned to PCPUs. But there's no need to make deep changes to Xen's tsc handling to accommodate them.

Class 3 apps will get a bit of a rude surprise in a PV Xen domain. But they're also new enough to use another mechanism to get time. They're new enough to "know" that gettimeofday can be very efficient, and should not be going down the rathole of using rdtsc directly. And unless they're going to be restricted to a very narrow class of machines (for example, not my relatively new Core2 laptop which stops the "constant" tsc in deep sleep modes), they need to fall back to being a class 2 or 4 app anyway.

Class 4 apps are not well-served under Xen. I think the vsyscall mechanism will be disabled and they'll always end up doing a real syscall. However, I think it would be relatively easy to add a new vgettimeofday implementation which directly uses the pvclock mechanism from usermode (the same code would work equally well for Xen and KVM). There's no need to add a new usermode ABI to get quick, high-quality time in usermode. Performance-wise it would be more or less indistinguishable from using a raw rdtsc, but it has the benefit of getting full cooperation from the kernel and Xen, and can take into account all tsc variations (if any).

So if you want to address these problems, it seems to me you'll get most bang for the buck by fixing (v)gettimeofday to use pvclock, and convincing app writers to trust in gettimeofday.

    J
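The kind of best-effort filtering a class 2 app might apply can be sketched as below. This is purely illustrative: `mono_filter` and `mono_next` are hypothetical names, the state is not thread-safe as written, and clamping is only one possible policy. It hides backward jumps, but it cannot tell how much real time passed across one, which is the "lack enough information" problem described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical filter state: the largest timestamp handed out so far. */
struct mono_filter {
    uint64_t last;
};

/*
 * Return a timestamp that never goes backwards. If the raw reading is
 * behind the last value returned (for example after a move to another
 * cpu, or a resume), the last value is returned again instead.
 */
static uint64_t mono_next(struct mono_filter *f, uint64_t raw)
{
    assert(f != NULL);
    if (raw > f->last)
        f->last = raw;
    return f->last;
}
```

Note that a clamped stream can repeat values, so an app that needs unique timestamps for ordering would still need some tie-breaker on top of this.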
(Reordered with most important points first...)

> You are talking about three different cases:

I agree with your analysis for case 1 and case 3.

> So, there's case 2: pv usermode. There are four
> classes of apps worth considering here:

I agree with your classification. But a key point is that VMware provides correctness for all of these classes. AND provides it at much better performance than trap-and-emulate. AND provides correctness+performance regardless of the underlying OS (e.g. even "old" OS's such as RHEL4 and RHEL5). AND provides it regardless of whether the guest OS is 32-bit or 64-bit. AND, by the way, provides it for your case 1 (PV OS) and case 3 (HVM) as well.

> So if you want to address these problems, it seems to me
> you'll get most
> bang for the buck by fixing (v)gettimeofday to use pvclock, and
> convincing app writers to trust in gettimeofday.

(Partially irrelevant point, but gettimeofday returns microseconds which is not enough resolution for many cases where rdtsc has been used in apps. Clock_gettime is the relevant API I think.)

If we can come up with a way for a kernel-loadable module to handle some equivalent of clock_gettime so that the most widely used shipping PV OS's can provide a pvclock interface to apps, this might be workable. If we tell app providers and customers: "You can choose either performance OR correctness but not both, unless you upgrade to a new OS (that is not even available yet)", I don't think that will be acceptable.

Any ideas on how pvclock might be provided through a module that could be added to, e.g., RHEL4 or RHEL5?

> > There ARE guaranteed properties specified by
> > the Intel SDM for any _single_ processor...
>
> Yes, but those are fairly weak guarantees. It does not guarantee that
> the tsc won't change rate arbitrarily, or stop outright between reads.

They are weak guarantees only if one uses rdtsc to accurately track wallclock time. They are perfectly useful guarantees if one simply wants to either timestamp data to record ordering (e.g. for journaling or transaction replay), or approximate the passing of time to provide approximate execution metrics (e.g. for performance tools).

> > What is NOT guaranteed, but is widely and
> > incorrectly assumed to be implied and has
> > gotten us into this mess, is that
> > the same properties apply across multiple
> > processors.
>
> Yes, Linux offers even weaker guarantees than Intel. Aside from the
> processor migration issue, the tsc can jump arbitrarily as a result of
> suspend/resume (ie, it can be non-monotonic).

Please explain. Suspend/resume is an S state isn't it? Is it possible to suspend/resume one processor in an SMP system and not another processor? I think not. Your point is valid for C-states and P-states but those are what Intel/AMD has fixed in the most recent families of multi-core processors.

So I don't see how (in the most recent families of processors) tsc can be non-monotonic.

> Even very recent processors with "constant" tscs (ie, they
> don't change
> rate with the core frequency) stop in certain power states.

For the most recent families of processors, the TSC continues to run at a fixed rate even for all the P-states and C-states. We should confirm this with Intel and AMD.

> Any motherboard design which runs packages in different
> clock-domains will lose tsc-sync between those packages,
> regardless of what's in the packages.

I'm told this is not true for recent multi-socket systems where the sockets are on the same motherboard. And at least one large vendor that ships a new one-socket-per-motherboard NUMA-ish system claims that it is not even true when the sockets are on different motherboards.

Dan
I'm experimenting with clock_gettime(), gettimeofday(), and rdtsc with a 2.6.30 64-bit pvguest. I have tried both with kernel.vsyscall64 equal to 0 and 1 (but haven't seen any significant difference between the two). I have confirmed from sysfs that clocksource=xen.

I have yet to get a measurement of either syscall that is better than 2.5x WORSE than emulating rdtsc. On my dual-core Conroe (Intel E6850) with 64-bit Xen and 32-bit dom0, I get approximately:

rdtsc native:                  22ns
softtsc (rdtsc emulated):     360ns
gettime syscall w/softtsc:   1400ns
gettime syscall native tsc:   980ns
gettimeofday w/softtsc:      1750ns
gettimeofday native tsc:      900ns

I'm hoping this is either a bug in the 2.6.30 xen pvclock implementation or in my measurement methodology, so would welcome others measuring this.

A couple other minor observations:

1) The syscalls seem to be somewhat slower when usermode rdtscs are being emulated, by approximately the cost of emulating an rdtsc. I suppose this makes sense since vsyscalls are executed in userland and since vgettimeofday does a rdtsc. However it complicates strategy if emulating rdtsc is the default.

2) The syscall clock_getres() does not seem to reflect the fact that

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Saturday, August 29, 2009 11:52 AM
> To: Jeremy Fitzhardinge
> Cc: Alan Cox; Xen-Devel (E-mail); Keir Fraser
> Subject: RE: [Xen-devel] write_tsc in a PV domain?
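For anyone wanting to reproduce numbers like the ones above, one simple way to estimate a per-call cost is to time a large loop and divide by the iteration count. The sketch below is an assumption about methodology, not the harness actually used for those figures. `ts_to_ns` and `avg_call_ns` are made-up helpers, the result includes loop overhead, and it only measures clock_gettime(CLOCK_MONOTONIC).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a timespec to a single nanosecond count. */
static uint64_t ts_to_ns(const struct timespec *ts)
{
    return (uint64_t)ts->tv_sec * 1000000000ull + (uint64_t)ts->tv_nsec;
}

/*
 * Average cost in nanoseconds of one clock_gettime(CLOCK_MONOTONIC)
 * call, measured over `iters` back-to-back calls.
 */
static double avg_call_ns(unsigned long iters)
{
    struct timespec start, end, scratch;
    unsigned long i;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iters; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(ts_to_ns(&end) - ts_to_ns(&start)) / (double)iters;
}
```

Calling `avg_call_ns(2000000)` a few times and taking the lowest result is one way to reduce noise from scheduling. Swapping the loop body for gettimeofday() or an rdtsc intrinsic would give the other rows of a comparison like the one above.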
On 31/08/2009 19:11, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> I have yet to get a measurement of either syscall that
> is better than 2.5x WORSE than emulating rdtsc. On
> my dual-core Conroe (Intel E6850) with 64-bit Xen and
> 32-bit dom0, I get approximately:
>
> rdtsc native: 22ns
> softtsc (rdtsc emulated): 360ns

Trap-and-emulate in 360ns seems astoundingly good. Perhaps too good to be true?

 -- Keir
On 08/31/09 11:11, Dan Magenheimer wrote:
> I'm experimenting with clock_gettime(), gettimeofday(),
> and rdtsc with a 2.6.30 64-bit pvguest. I have tried both
> with kernel.vsyscall64 equal to 0 and 1 (but haven't seen
> any significant difference between the two). I have
> confirmed from sysfs that clocksource=xen

Yeah, as I said, I wouldn't expect vsyscall to work under Xen at the moment; the Xen clocksource will disable it. Clocksources can implement a "vread" method for use from a vsyscall, but from a quick look it didn't appear we could use it as-is (because the pvclock info isn't mapped into userspace, and the current vsyscall code assumes a single set of parameters rather than percpu).

    J
> > I have yet to get a measurement of either syscall that
> > is better than 2.5x WORSE than emulating rdtsc. On
> > my dual-core Conroe (Intel E6850) with 64-bit Xen and
> > 32-bit dom0, I get approximately:
> >
> > rdtsc native: 22ns
> > softtsc (rdtsc emulated): 360ns
>
> Trap-and-emulate in 360ns seems astoundingly good. Perhaps
> too good to be true?

I measured with the patch you checked in as 20128. I tried a couple of tests, first changing pv_soft_rdtsc to always return a value with the 4 LSB of the return value cleared, second with the 4 LSB of the return value set. Both were properly reflected by a userland rdtsc. So it looks like the correct emulation code is executing.

And get_s_time() always returns nanoseconds, correct? So consecutive emulated rdtsc's should return values that differ by the amount of nsec necessary to do the emulation, right? I ran 2 million rdtsc's in a loop and took the average so, ignoring loop and load/store overhead, the 360ns appears to be an accurate measurement.

A thousand cycles to trap, decode, call get_s_time, and return seems astoundingly good? Probably it's faster than a vmexit because there's so much less state to save. But still it's 15x slower than a raw rdtsc.

If you have ideas on how to test the measurement further, I'd be happy to give them a spin.

Dan
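The figures quoted in this exchange can be related with a little arithmetic. The sketch below assumes a 3.0 GHz clock for the E6850 (the thread does not state the clock rate) and uses the 360ns emulated-rdtsc cost reported above. The function names are made up for the example.

```c
#include <assert.h>

/* Cycles elapsed in `ns` nanoseconds at a clock rate of `hz`. */
static double ns_to_cycles(double ns, double hz)
{
    assert(hz > 0.0);
    return ns * hz / 1e9;
}

/*
 * Fraction of one cpu's time consumed by `calls_per_sec` calls
 * that each cost `cost_ns` nanoseconds.
 */
static double overhead_fraction(double calls_per_sec, double cost_ns)
{
    assert(calls_per_sec >= 0.0 && cost_ns >= 0.0);
    return calls_per_sec * cost_ns / 1e9;
}
```

Under that assumed clock rate, `ns_to_cycles(360.0, 3e9)` comes to about 1080 cycles, in line with "a thousand cycles", and `overhead_fraction(100000.0, 360.0)` is about 0.036, i.e. an app issuing 100K emulated rdtsc's per second on a core would spend roughly 3.6% of that core on emulation.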
Dan Magenheimer
2009-Aug-31 23:52 UTC
[Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> My current thinking is that we (the Linux and
> Xen and KVM community) should architect a
> userland API using the pvclock mechanism.

OK, here's a slightly refined proposal.

To reiterate, the problem is that Xen's current mechanism for handling the rdtsc instruction may silently provide incorrect results while alternative mechanisms are too slow (vs VMware which is both fast and correct).

My goal is to provide a paravirtualized tsc mechanism for apps running on Xen that is reliably correct, is not dependent on a particular OS or processor family, is approximately as fast as rdtsc (or at least much faster than emulated rdtsc), provides adequate (e.g. nanosecond) resolution, does not require recompilation to work both on Xen and bare metal, and works properly across: vcpu-to-pcpu rescheduling even on NUMA machines; system sleep/hibernation; and save/restore/migration between machines with dissimilar clock rates. Implementation requires changes in Xen and "the app" but no OS changes thus making it still viable on legacy OS's and possibly(?) HVM domains. Note that only apps that need to sample time on the order of >5-100K/core/second would use this; for other apps, rdtsc emulation overhead is probably negligible (<0.2%).

0) Xen implements rdtsc emulation by default
1) Guest OS is launched with pvtsc=1 in vm.cfg
2) App running on guest OS sets up a SIGILL handler
3) App executes a special rdmsr instruction or hypercall.
4a) If SIGILL results, not running on Xen at all, or on old Xen; app uses rdtsc at own risk. Done.
4b) Else, rdmsr/hypercall returns virtual address of special pvclock page ("pvclock_va").
5) App executes another special rdmsr instruction/hypercall to disable rdtsc emulation. This affects ALL execution for all processes in this VM.
6) Xen maintains mapping of pvclock_va to a different physical page for each processor and transparently handles TLB misses for pvclock_va
7) App uses (unemulated) rdtsc and applies pvclock algorithm (using values in memory at pvclock_va) resulting in pvtsc, which is nanoseconds since VM start. App can further apply local algorithms to enforce monotonicity or frequency scaling as desired.

Comments appreciated. I realize that this is hacky and ugly... better alternatives gladly solicited.

Thanks,
Dan

P.S. While it would be nice if we could just tell apps to use a fast vgettimeofday equivalent, this does not exist today and, even if it did, would not be widely available for years in the kernel running under most enterprise app deployments (and, even then, only on 64-bit Linux.)

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Friday, August 28, 2009 11:50 AM
> To: Xen-Devel (E-mail)
> Cc: Jeremy Fitzhardinge; Keir Fraser; Alan Cox
> Subject: rdtsc: correctness vs performance on Xen (and KVM?)
>
>
> To summarize:
>
> Xen and KVM currently allow rdtsc to be executed
> directly by userland. As a result, apps that
> use rdtsc smartly and effectively on (some) physical
> machines may break badly in Xen or KVM because of
> the disassociation of physical and virtual cpus.
> (Readers not familiar with why rdtsc is a problem,
> can read e.g. http://en.wikipedia.org/wiki/Rdtsc)
>
> VMware always emulates rdtsc, both for kernel and
> userland rdtsc's. (I don't know what HyperV does.)
>
> Xen currently has a boot option to always emulate
> rdtsc in HVM guests and just added code such that
> the same boot option will always emulate rdtsc for
> userland-only in PVM guests. There is some agreement
> in the Xen community that rdtsc emulation should
> always be the default though the default is currently
> off. KVM is having a similar discussion and, I'm
> told, has also come to the conclusion that emulating
> rdtsc is a necessary evil.
>
> The problem is that emulating rdtsc is slow. On
> my dual-core Conroe, rdtsc is about 72 cycles and
> emulating rdtsc (returning a fixed frequency 1GHz
> Xen monotonic system time) is over 15x slower.
> This is a big hit for apps that do tens to hundreds
> of thousands of rdtsc's per processor per second.
> (And yes these apps are more common than one
> might think.)
>
> VMware has the advantage of binary translation;
> rdtsc can be translated to return a "conforming"
> value in ~200 cycles (on an older processor so
> probably faster if you are comparing against my
> dual-core Conroe numbers above). This value
> is "stale" (not linear with wallclock time).
> For VMs that need rdtsc to more accurately reflect
> wallclock time, full emulation can be optionally
> enabled for a VM.
>
> I'm searching for alternatives that provide the
> correctness of emulation, but better performance
> than emulation. Jeremy points out that the
> pvclock mechanism in upstream Linux works well,
> but the pvclock data is currently only exposed
> to kernel... and exposing it to userland still
> requires apps-using-rdtsc to be rewritten.
> But Jeremy claims that all apps-that-use-rdtsc
> MUST be rewritten because using rdtsc is unsafe,
> and that they should be rewritten to use
> gettimeofday (or actually vgettimeofday).
> But on older OS's (including the vast majority
> of installed units) and machines where tsc is
> "unsafe", gettimeofday can be MUCH slower than
> emulating rdtsc. So telling app writers to
> convert all uses of rdtsc to gettimeofday is
> not an acceptable solution for these apps in
> the short term.
>
> My current thinking is that we (the Linux and
> Xen and KVM community) should architect a
> userland API using the pvclock mechanism.
> The underlying implementation of this API would
> utilize Linux only to "register" the mechanism,
> preferably via a module so that it, like
> disk and network frontends, could easily be
> bolted on to shipping OS's. Individual uses
> of "pvclock_read" would need no syscall... like
> the kernel pvclock mechanism, they need only
> access memory to get the necessary scaling
> and offset data. Once instantiated, rdtsc
> is executed directly by the app as part of the
> pvclock protocol. If never registered,
> rdtsc would always be trapped and emulated.
>
> I realize this idea is half-baked, but would like
> to invite other TSC/time experts to determine
> if some or all of the idea might be used to
> achieve a fully-baked solution.
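Step 7 of the proposal ("applies pvclock algorithm") can be sketched as a scale-and-offset computation. The struct below is modelled on the per-vcpu time record Xen exposes to guests, but its exact layout here should be treated as an assumption, and the names pvclock_info, mul_frac and pvclock_ns are illustrative only. Consistency checks against concurrent updates are left out of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, modelled on Xen's per-vcpu time info record. */
struct pvclock_info {
    uint32_t version;           /* odd while the hypervisor is updating */
    uint32_t pad0;
    uint64_t tsc_timestamp;     /* TSC value when system_time was sampled */
    uint64_t system_time;       /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point ns per (shifted) tick */
    int8_t   tsc_shift;         /* power-of-two pre-scale applied to delta */
    uint8_t  pad1[3];
};

/* Computes (delta * mul) >> 32 without needing a 128-bit type. */
uint64_t mul_frac(uint64_t delta, uint32_t mul)
{
    uint64_t hi = delta >> 32;
    uint64_t lo = delta & 0xffffffffu;
    return hi * mul + ((lo * mul) >> 32);
}

/* Convert a raw TSC reading into nanoseconds using one record. */
uint64_t pvclock_ns(const struct pvclock_info *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_timestamp;

    assert(p->tsc_shift > -64 && p->tsc_shift < 64);
    if (p->tsc_shift >= 0)
        delta <<= p->tsc_shift;
    else
        delta >>= -p->tsc_shift;
    return p->system_time + mul_frac(delta, p->tsc_to_system_mul);
}
```

With a multiplier of 0x80000000 (0.5 ns per tick) and no shift, a reading 2000 ticks past tsc_timestamp yields system_time plus 1000 ns. The only per-read cost is one rdtsc plus a few memory loads and multiplies, which is the point of the proposal.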
Jeremy Fitzhardinge
2009-Sep-01 00:22 UTC
[Xen-devel] Re: rdtsc: correctness vs performance on Xen (and KVM?)
On 08/31/09 16:52, Dan Magenheimer wrote:
> work both on Xen and bare metal, and works properly
> across: vcpu-to-pcpu rescheduling even on NUMA
> machines; system sleep/hibernation; and
> save/restore/migration between machines with
> dissimilar clock rates.

But it will only do this when running under Xen. If running on bare metal, there will be nothing providing the correction info to the app, and it will be no better than using raw rdtsc with all its limitations. In practice this means that the app will have to have some other code path anyway.

> Implementation requires
> changes in Xen and "the app" but no OS changes
> thus making it still viable on legacy OS's
> and possibly(?) HVM domains. Note that
> only apps that need to sample time on the
> order of >5-100K/core/second would use this;
> for other apps, rdtsc emulation overhead
> is probably negligible (<0.2%).
>
> 0) Xen implements rdtsc emulation by default
> 1) Guest OS is launched with pvtsc=1 in vm.cfg
> 2) App running on guest OS sets up a SIGILL handler
> 3) App executes a special rdmsr instruction or
> hypercall.
>

No way to do direct hypercalls from usermode, so it would need to be an illegal instruction (like cpuid). But really it should be a system-wide kernel setting, set via sysctl or something.

> 4a) If SIGILL results, not running on Xen at all,
> or on old Xen; app uses rdtsc at own risk. Done.
> 4b) Else, rdmsr/hypercall returns virtual address of
> special pvclock page ("pvclock_va").
>

This can't be done without changing the kernel; Xen can't just start sticking stuff into usermode mappings (how does Xen even know where a given OS's usermode is?). And again, usermode can't do hypercalls and I don't think we should start making fake rdmsrs start working in usermode.

> 5) App executes another special rdmsr instruction/
> hypercall to disable rdtsc emulation. This
> affects ALL execution for all processes in this VM.
>

Once enabled, it should just stay enabled.
System-wide is very coarse anyway (since there's no guarantee that all apps will use the mechanism).

> 6) Xen maintains mapping of pvclock_va to a
> different physical page for each processor
> and transparently handles TLB misses for
> pvclock_va
>

If you mean that a given VA has a per-cpu mapping, it requires percpu pagetables. That's not possible in Linux with PV pagetables (since two tasks/threads on different cpus sharing the same mm will use the same pagetable).

> 7) App uses (unemulated) rdtsc and applies
> pvclock algorithm (using values in memory
> at pvclock_va) resulting in pvtsc, which
> is nanoseconds since VM start. App can
> further apply local algorithms to enforce
> monotonicity or frequency scaling as desired.
>
> Comments appreciated. I realize that this is hacky
> and ugly... better alternatives gladly solicited.
>

In general even Linux's specialised APIs are entirely unused (sendfile, vmsplice, etc). Something as esoteric as this will be pretty much unused.

This can be entirely done within the vsyscall mechanism without any app changes. There's no reason not to.

> P.S. While it would be nice if we could just tell
> apps to use a fast vgettimeofday equivalent, this
> does not exist today and, even if it did, would not
> be widely available for years in the kernel running under
> most enterprise app deployments (and, even then,
> only on 64-bit Linux.)
>

These rationales are very unconvincing:

Making vsyscall work on 32bit is just a matter of doing it; apparently nobody has put the effort into it, but there's no fundamental reason why it wouldn't work. Besides, who runs enterprise apps on 32-bit these days? Anything requiring even moderate amounts of memory is better run on 64-bit.

Your mechanism will require kernel changes anyway, so there's no getting around that.

Once vsyscall does Xen/KVM properly, then every app will automatically do the right thing without modification.
There's no need for specialized APIs that nobody will end up using anyway. It only makes sense to go to this kind of effort if it ends up making a plain "rdtsc" have the properties you want it to have.

    J
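Steps 2 through 4a of the quoted proposal (install a SIGILL handler, try the special instruction, fall back if it faults) follow a common probing pattern. A minimal POSIX sketch is below. The function names are hypothetical, and since the thread never settles on which instruction would be used, the faulting probe here just raises SIGILL itself as a stand-in.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf probe_env;

static void on_sigill(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);   /* unwind out of the faulting probe */
}

/*
 * Run probe() with a temporary SIGILL handler installed.
 * Returns 1 if the probe completed, 0 if it raised SIGILL.
 */
int probe_survives(void (*probe)(void))
{
    struct sigaction sa, old;
    int ok;

    assert(probe != NULL);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGILL, &sa, &old) != 0)
        return 0;

    /* Second argument 1: restore the signal mask when jumping back. */
    if (sigsetjmp(probe_env, 1) == 0) {
        probe();
        ok = 1;
    } else {
        ok = 0;
    }
    sigaction(SIGILL, &old, NULL);  /* put the previous handler back */
    return ok;
}

/* Hypothetical stand-ins for "the special instruction". */
void probe_faults(void) { raise(SIGILL); }
void probe_works(void)  { }
```

An app would call probe_survives() once at startup and choose its time source from the result. Note that this only covers detection; it does not address the objection above about how the app would then obtain the pvclock data.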
On 31/08/2009 22:06, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> A thousand cycles to trap, decode, call get_s_time,
> and return seems astoundingly good? Probably it's
> faster than a vmexit because there's so much less state
> to save. But still it's 15x slower than a raw rdtsc.

A kernel trap used to take about a microsecond. Maybe it has got faster on new processors.

 -- Keir
Dan Magenheimer
2009-Sep-01 13:54 UTC
[Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
Hi Jeremy --

Thanks for the feedback!

> Making vsyscall work...

While I highly respect your opinion, and while vsyscall may be a fine choice in the future, it just doesn't solve the problem today and won't solve it ever for currently shipping PV OS's. If you can figure out a way to allow vsyscall to be installed as a module and still achieve its performance, it might be a possible solution, but otherwise we have to go around the OS to solve this problem.

The rdtsc instruction will be fully emulated by default in Xen 4.0, and before that release I need to find a fast alternative for those apps that are dependent on BOTH its correct functionality AND high performance.

> > work both on Xen and bare metal, and works properly
> > across: vcpu-to-pcpu rescheduling even on NUMA
> > machines; system sleep/hibernation; and
> > save/restore/migration between machines with
> > dissimilar clock rates.
>
> But it will only do this when running under Xen. If running on bare
> metal, there will be nothing providing the correction info to the app,
> and it will be no better than using raw rdtsc with all its
> limitations.
> In practice this means that the app will have to have some other code
> path anyway.

Yes, that's true. I'm not trying to legislate whether an app can use rdtsc or not on a physical machine, just trying to provide the same guarantees for a rdtsc executed in a virtual environment as already provided for a physical environment, but without significant performance cost.

> > 3) App executes a special rdmsr instruction or
> > hypercall.
>
> No way to do direct hypercalls from usermode, so it would
> need to be an illegal instruction (like cpuid).
> ...and I don't think we should
> start making fake rdmsrs start working in usermode.

I'm told (by Keir) that it might be possible to allow certain hypercalls to be executed from userland. I haven't investigated yet.
But a "fake rdmsr" might be a better answer anyway; enlightened Windows and HyperV already use a fake rdmsr, correct? But I'm not keen on it either and am open to alternatives.

> But really it should be a system-wide kernel setting, set via
> sysctl or something.

I'm not sure what you are suggesting here.

> > 4a) If SIGILL results, not running on Xen at all,
> > or on old Xen; app uses rdtsc at own risk. Done.
> > 4b) Else, rdmsr/hypercall returns virtual address of
> > special pvclock page ("pvclock_va").
> >
> This can't be done without changing the kernel; Xen can't just start
> sticking stuff into usermode mappings (how does Xen even know where a
> given OS's usermode is?).

It doesn't have to be a usermode mapping, it just needs to be a "magic" address; it can (for example) be in the virtual address space Xen has reserved for itself.

> > 5) App executes another special rdmsr instruction/
> > hypercall to disable rdtsc emulation. This
> > affects ALL execution for all processes in this VM.
>
> Once enabled, it should just stay enabled. System-wide is very coarse
> anyway (since there's no guarantee that all apps will use the
> mechanism).

Yes this is an ugly potential issue. Fortunately, many enterprise class apps essentially are the machine; and this may be even more true in a virtualized world. Again, I'm not keen on this either but I don't see an alternative.

> > 6) Xen maintains mapping of pvclock_va to a
> > different physical page for each processor
> > and transparently handles TLB misses for
> > pvclock_va
>
> If you mean that a given VA has a per-cpu mapping, it requires percpu
> pagetables. That's not possible in Linux with PV pagetables (since two
> tasks/threads on different cpus sharing the same mm will use the same
> pagetable).

What the OS can do is completely irrelevant. The mapping is handled entirely by Xen so the OS will never even see a page fault for this address. Note also that one-page-per-cpu is not needed.
The page is readonly and there is no sensitive information in a pvclock data structure so many per-cpu-pvclock-structs could be on the same page.

> In general even Linux's specialised APIs are entirely unused
> (sendfile, vmsplice, etc). Something as esoteric as this will be
> pretty much unused.

If apps are happy with the performance of emulated rdtsc, there's no reason for them to use it, so I would be happy if this pvtsc ABI never gets used. However, most enterprise apps are sensitive to a performance hit of several percent and will be eager to try alternatives.

> This can be entirely done within the vsyscall mechanism
> without any app changes. There's no reason not to.

Performance with app portability is the reason.

> > P.S. While it would be nice if we could just tell
> > apps to use a fast vgettimeofday equivalent, this
> > does not exist today and, even if it did, would not
> > be widely available for years in the kernel running under
> > most enterprise app deployments (and, even then,
> > only on 64-bit Linux.)
>
> These rationales are very unconvincing:
>
> Making vsyscall work on 32bit is just a matter of doing it; apparently
> nobody has put the effort into it, but there's no fundamental reason why
> it wouldn't work. Besides, who runs enterprise apps on 32-bit these
> days? Anything requiring even moderate amounts of memory is better run
> on 64-bit.

Many people run enterprise apps on 32-bit these days, and I'm not planning on forcing them to switch. But 32-bit vs 64-bit is a small parenthetical objection, not particularly relevant to the main issue.

> Your mechanism will require kernel changes anyway, so there's
> no getting around that.

I think that's exactly what the proposal does: gets around requiring kernel changes.
If kernel changes are required (other than bolting on a kernel loadable module), pvtsc is also not an acceptable solution.

> Once vsyscall does Xen/KVM properly, then every app will automatically
> do the right thing without modification. There's no need for
> specialized APIs that nobody will end up using anyway.

I fully agree that vsyscall is the right long-term answer but telling the app providers to switch to something that is non-existent in 100% of their deployments today, has not yet been implemented sufficiently to be measured, and probably won't exceed 50% of their deployments within five years... well I don't expect them to be convinced.

> It only makes sense to go to this kind of effort if it ends up
> making a plain "rdtsc" have the properties you want it to have.

Intel and AMD are responsible for making a plain rdtsc have the properties you want it to have in a physical environment and apparently they've done a good enough job that apps are using it today (albeit with an added layer of glue to handle certain SMP systems). Emulating rdtsc provides the same properties in a virtual environment but at a significant performance cost. pvtsc is only intended to retrieve some of that performance.

Thanks,
Dan
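For scale, the overhead figures used in this thread (under 0.2% at the low end, "several percent" at the high end) follow from multiplying the sampling rate by the per-read cost. A small sketch of that arithmetic, using the 22ns and 360ns measurements and the 5K to 100K reads per core per second range quoted earlier; the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Share of one core's time spent reading the clock, in parts per million. */
uint64_t rdtsc_overhead_ppm(uint64_t reads_per_sec, uint64_t ns_per_read)
{
    assert(ns_per_read == 0 || reads_per_sec <= UINT64_MAX / ns_per_read);
    /* ns consumed per second out of 1e9 available; 1e9 / 1e6 = 1000 */
    return reads_per_sec * ns_per_read / 1000;
}
```

At 5,000 reads per second and 360ns each this gives 1,800 ppm (0.18%); at 100,000 reads per second it gives 36,000 ppm (3.6%). The same 100,000 reads at the native 22ns cost 2,200 ppm (0.22%).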
Keir Fraser
2009-Sep-01 14:34 UTC
[Xen-devel] Re: rdtsc: correctness vs performance on Xen (and KVM?)
On 01/09/2009 14:54, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>> Making vsyscall work...
>
> While I highly respect your opinion, and while vsyscall
> may be a fine choice in the future, it just doesn't
> solve the problem today and won't solve it ever for
> currently shipping PV OS's. If you can figure out a
> way to allow vsyscall to be installed as a module and
> still achieve its performance, it
> might be a possible solution, but otherwise we have
> to go around the OS to solve this problem.

Do you believe there's a solution which doesn't involve PV kernel modifications? I think the suggestions you've made so far would require such modifications.

 -- Keir
Dan Magenheimer
2009-Sep-01 14:53 UTC
[Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> >> Making vsyscall work...
> >
> > While I highly respect your opinion, and while vsyscall
> > may be a fine choice in the future, it just doesn't
> > solve the problem today and won't solve it ever for
> > currently shipping PV OS's. If you can figure out a
> > way to allow vsyscall to be installed as a module and
> > still achieve its performance, it
> > might be a possible solution, but otherwise we have
> > to go around the OS to solve this problem.
>
> Do you believe there's a solution which doesn't involve PV kernel
> modifications? I think the suggestions you've made so far
> would require such modifications.

That is certainly my goal. I *think* the proposal does NOT require PV OS mods as the communication is strictly between an app and Xen. However, I'm really not familiar with all the subtleties of the x86 architecture so could be missing something.

I think these are the two key architectural dependencies that I'm not certain of:

1) fake rdmsr (or hypercall if it works) returns a virtual address within a range of addresses that is not "owned by" the OS (e.g. maybe in Xen address space?). The page is only readable outside of ring 0, but writeable in ring 0 (by Xen).
2) All TLB misses on this page are handled directly by Xen so the OS never sees the address/page.

If these are OK, and you see other parts of the proposal that require PV kernel mods, please point them out.

Thanks,
Dan
Keir Fraser
2009-Sep-01 15:08 UTC
[Xen-devel] Re: rdtsc: correctness vs performance on Xen (and KVM?)
On 01/09/2009 15:53, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> 1) fake rdmsr (or hypercall if it works) returns a virtual
> address within a range of addresses that is not "owned by"
> the OS (e.g. maybe in Xen address space?). The page is
> only readable outside of ring 0, but writeable in ring 0
> (by Xen).
> 2) All TLB misses on this page are handled directly by Xen
> so the OS never sees the address/page.

I think these are probably possible, at least for a 64-bit hypervisor which isn't playing segment limit tricks.

> If these are OK, and you see other parts of the proposal
> that require PV kernel mods, please point them out.

Won't the pvclock computation be per-cpu? How will you deal with that?

 -- Keir
Dan Magenheimer
2009-Sep-01 15:26 UTC
[Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> On 01/09/2009 15:53, "Dan Magenheimer"
> <dan.magenheimer@oracle.com> wrote:
>
> > 1) fake rdmsr (or hypercall if it works) returns a virtual
> > address within a range of addresses that is not "owned by"
> > the OS (e.g. maybe in Xen address space?). The page is
> > only readable outside of ring 0, but writeable in ring 0
> > (by Xen).
> > 2) All TLB misses on this page are handled directly by Xen
> > so the OS never sees the address/page.
>
> I think these are probably possible, at least for a 64-bit
> hypervisor which isn't playing segment limit tricks.

Will it work for pv32_on_64? (I don't care much about 32-bit hypervisor.)

> > If these are OK, and you see other parts of the proposal
> > that require PV kernel mods, please point them out.
>
> Won't the pvclock computation be per-cpu? How will you deal with
> that?

Hmmm... is it possible for the same virtual address/page to map to a different physical address/page on each processor?
Jan Beulich
2009-Sep-01 15:32 UTC
[Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:26 >>>
>> On 01/09/2009 15:53, "Dan Magenheimer"
>> <dan.magenheimer@oracle.com> wrote:
>>
>> > 1) fake rdmsr (or hypercall if it works) returns a virtual
>> > address within a range of addresses that is not "owned by"
>> > the OS (e.g. maybe in Xen address space?). The page is
>> > only readable outside of ring 0, but writeable in ring 0
>> > (by Xen).
>> > 2) All TLB misses on this page are handled directly by Xen
>> > so the OS never sees the address/page.
>>
>> I think these are probably possible, at least for a 64-bit
>> hypervisor which isn't playing segment limit tricks.
>
>Will it work for pv32_on_64? (I don't care much about
>32-bit hypervisor.)

It can be made to work - you just need to properly arrange this and the compatibility p2m table.

>> > If these are OK, and you see other parts of the proposal
>> > that require PV kernel mods, please point them out.
>>
>> Won't the pvclock computation be per-cpu? How will you deal with
>> that?
>
>Hmmm... is it possible for the same virtual address/page
>to map to a different physical address/page on each processor?

Not within today's Xen or Linux (which both assume a global kernel address space, in particular non-root page table entries mapping kernel space to be the same in all address spaces - you'd need separate entries at all levels for this).

Jan
Keir Fraser
2009-Sep-01 15:43 UTC
[Xen-devel] Re: rdtsc: correctness vs performance on Xen (and KVM?)
On 01/09/2009 16:26, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>> I think these are probably possible, at least for a 64-bit
>> hypervisor which isn't playing segment limit tricks.
>
> Will it work for pv32_on_64? (I don't care much about
> 32-bit hypervisor.)

It could do. Space is reserved at the top of 4GB for the M2P tables, and I suppose such a mapping could go there.

>> Won't the pvclock computation be per-cpu? How will you deal with
>> that?
>
> Hmmm... is it possible for the same virtual address/page
> to map to a different physical address/page on each processor?

Not without PV guest kernel support. The guest kernel manages the page directories. And Linux runs threads on exactly the same pagetables across different cpus. That would have to change.

 -- Keir
Dan Magenheimer
2009-Sep-01 15:56 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> >> > If these are OK, and you see other parts of the proposal
> >> > that require PV kernel mods, please point them out.
> >>
> >> Won't the pvclock computation be per-cpu? How will you deal with
> >> that?
> >
> >Hmmm... is it possible for the same virtual address/page
> >to map to a different physical address/page on each processor?
>
> Not within today's Xen or Linux (which both assume a global kernel
> address space, in particular non-root page table entries mapping kernel
> space to be the same in all address spaces - you'd need separate entries
> at all levels for this).

OK, I forgot: No software-accessible TLB.

Can you think of any trick (that doesn't require the cost of a trap/hypercall) to allow an app to determine what pcpu it is running on?
Jan Beulich
2009-Sep-01 16:04 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>>
>Can you think of any trick (that doesn't require the cost of a
>trap/hypercall) to allow an app to determine what pcpu
>it is running on?

Just like what is being used to allow apps to get the CPU number on native kernels (or the vCPU one on Xen-ified ones): Have a GDT entry the limit of which is the number you want, and have the app use the lsl instruction to get at it.

I am, however, always a little bit concerned when it comes to exposing information that shouldn't really be exposed, due to the possibility of overlooking potential misuses. In the specific case here, I can't see at all why you'd want the pCPU number exposed - after all the kernel can do what you want apps to do without having that information.

Jan
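To illustrate the "GDT entry the limit of which is the number you want" idea without touching a real GDT, the sketch below packs a number into the 20-bit limit field of an x86 segment descriptor and extracts it again, which is the value the lsl instruction reports for a byte-granular segment. The helper names are hypothetical, and the access byte chosen here is only one plausible encoding; the selector an app would pass to lsl is not specified anywhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build an 8-byte x86 segment descriptor whose 20-bit limit field carries
 * `value`. Base is 0 and granularity is bytes. The access byte 0xF1 marks
 * a present, DPL-3, read-only (accessed) data segment, so that ring 3
 * would be allowed to query it.
 */
uint64_t desc_with_limit(uint32_t value)
{
    assert(value <= 0xFFFFF);                     /* limit is 20 bits wide */
    uint64_t d = 0;
    d |= (uint64_t)(value & 0xFFFF);              /* limit 15:0  -> bits 0..15  */
    d |= (uint64_t)((value >> 16) & 0xF) << 48;   /* limit 19:16 -> bits 48..51 */
    d |= (uint64_t)0xF1 << 40;                    /* access byte -> bits 40..47 */
    return d;
}

/* What lsl would report for a byte-granular descriptor: the raw limit. */
uint32_t limit_of(uint64_t d)
{
    return (uint32_t)(d & 0xFFFF) | ((uint32_t)((d >> 48) & 0xF) << 16);
}
```

In a real deployment the descriptor would be installed per CPU by the kernel (or, for the pCPU case discussed here, by Xen), and the app would execute lsl on the corresponding selector rather than decode the descriptor itself.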
Keir Fraser
2009-Sep-01 16:06 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 01/09/2009 16:56, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>> Not within today's Xen or Linux (which both assume a global kernel
>> address space, in particular non-root page table entries mapping kernel
>> space to be the same in all address spaces - you'd need separate entries
>> at all levels for this).
>
> OK, I forgot: No software-accessible TLB.
>
> Can you think of any trick (that doesn't require the cost of a
> trap/hypercall) to allow an app to determine what pcpu
> it is running on?

I can't think of any that don't require kernel modifications. Which takes us back to considering vsyscall, perhaps.

 -- Keir
Dan Magenheimer
2009-Sep-01 16:41 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>>
> >Can you think of any trick (that doesn't require the cost of a
> >trap/hypercall) to allow an app to determine what pcpu
> >it is running on?
>
> Just like what is being used to allow apps to get the CPU number on native
> kernels (or the vCPU one on Xen-ified ones): Have a GDT entry the limit of
> which is the number you want, and have the app use the lsl instruction to
> get at it.

Can you explain more? Will this work for a userland process to get its current pcpu (not vcpu)?

> I am, however, always a little bit concerned when it comes to exposing
> information that shouldn't really be exposed, due to the possibility of
> overlooking potential misuses. In the specific case here, I can't see at all
> why you'd want the pCPU number exposed

There is one pvclock "struct" for each pcpu. We want an app to "see" the right one. If that's not possible, we want the app to see the whole array of them and be able to properly index into the array.

If possible, I'd like to see if we can identify a solution at all, and then discard it if the issues are too difficult to overcome.

> after all the kernel can do what
> you want apps to do without having that information.

In the current Linux 2.6.30 implementation of pvclock it can do it, but it can't do it fast. In versions of the kernel prior to 2.2.28(?), it can't do it at all, correct?

Thanks,
Dan
Dan Magenheimer
2009-Sep-01 16:55 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> >> Not within today's Xen or Linux (which both assume a global kernel
> >> address space, in particular non-root page table entries mapping kernel
> >> space to be the same in all address spaces - you'd need separate entries
> >> at all levels for this).
> >
> > OK, I forgot: No software-accessible TLB.
> >
> > Can you think of any trick (that doesn't require the cost of a
> > trap/hypercall) to allow an app to determine what pcpu
> > it is running on?
>
> I can't think of any that don't require kernel modifications.
> Which takes us back to considering vsyscall, perhaps.
>
> -- Keir

If a solution that doesn't require kernel mods is not possible, then I suspect apps will continue to use rdtsc as-is and suffer the emulation overhead. Requiring all customers to update the OS underlying these apps is a non-starter.

Also, it has yet to be proven that pvclock can work in a vsyscall. Doesn't the same per-cpu in userspace problem exist? Pvclock without vsyscall has been measured and is too slow, so until a vsyscall version of pvclock is implemented and measured (let alone upstream or available in distros), it's hard to call it an alternative to consider, even for the future.
Keir Fraser
2009-Sep-01 21:25 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 01/09/2009 17:04, "Jan Beulich" <JBeulich@novell.com> wrote:>>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>> >> Can you think of any trick (that doesn''t require the cost of a >> trap/hypercall) to allow an app to determine what pcpu >> it is running on? > > Just like what is being used to allow apps to get the CPU number on native > kernels (or the vCPU one on Xen-ified ones): Have a GDT entry the limit of > which is the number you want, and have the app use the lsl instruction to > get at it.Yes, that''s true. Xen could provide such a segment descriptor in its private area of the GDT. The issue then would be that, in a compound pvclock operation spanning multiple machine instructions, the pCPU number revealed by the LSL instruction can be stale by the time it is used later in the compound operation. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Sep-01 22:08 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> > Just like what is being used to allow apps to get the CPU > number on native > > kernels (or the vCPU one on Xen-ified ones): Have a GDT > entry the limit of > > which is the number you want, and have the app use the lsl > instruction to > > get at it. > > Yes, that''s true. Xen could provide such a segment descriptor > in its private > area of the GDT. The issue then would be that, in a compound pvclock > operation spanning multiple machine instructions, the pCPU > number revealed > by the LSL instruction can be stale by the time it is used > later in the > compound operation.The algorithm could check the pCPU number before and after reading the pvclock data and doing the rdtsc, and if they don''t match, start again. (Doesn''t the pvclock algorithm already do that with some versioning number in the pvclock data itself to ensure that the rest of the data didn''t change while it was being read?) I''m clueless about GDTs and the LSL instrution so would need some help prototyping this. Dan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Sep-01 22:21 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 09/01/09 15:08, Dan Magenheimer wrote:>>> Just like what is being used to allow apps to get the CPU >>> >> number on native >> >>> kernels (or the vCPU one on Xen-ified ones): Have a GDT >>> >> entry the limit of >> >>> which is the number you want, and have the app use the lsl >>> >> instruction to >> >>> get at it. >>> >> Yes, that''s true. Xen could provide such a segment descriptor >> in its private >> area of the GDT. The issue then would be that, in a compound pvclock >> operation spanning multiple machine instructions, the pCPU >> number revealed >> by the LSL instruction can be stale by the time it is used >> later in the >> compound operation. >> > The algorithm could check the pCPU number before and after > reading the pvclock data and doing the rdtsc, and if they > don''t match, start again. (Doesn''t the pvclock algorithm > already do that with some versioning number in the pvclock > data itself to ensure that the rest of the data didn''t > change while it was being read?) >There''s still a race there, if the thread switched PCPU twice during the operation: <running on PCPU A> get CPU # <switch to PCPU B> read tsc apply corrections from (from PCPU A) <switch to PCPU A> check CPU # is the same as we started with: all OK! note that the <switch to PCPU B> could either be a result of the Xen scheduler moving the VCPU *or* the Linux scheduler moving the thread to a different VCPU. In the former case, Xen could update a version counter to help detect the discontinuity, but it doesn''t really know about guest scheduling decisions. I guess the guest kernel could update the pvclock version counter itself.> I''m clueless about GDTs and the LSL instrution so would > need some help prototyping this. >It''s what vsyscall already uses. Your scheme is precisely analogous to what''s already there. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Sep-01 22:41 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> There's still a race there

Good point. Essentially we need to ensure that {{rdtsc and the pvclock struct values for PCPU-X}} are obtained atomically, and there's no way to guarantee that (at least without incurring overhead that's likely to exceed just emulating rdtsc to begin with).

> > I'm clueless about GDTs and the LSL instruction so would
> > need some help prototyping this.
>
> It's what vsyscall already uses. Your scheme is precisely analogous to
> what's already there.

(...except if it can be done entirely in the app with no OS dependencies)

Won't pvclock+vsyscall have the same race?
Jeremy Fitzhardinge
2009-Sep-01 23:26 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 09/01/09 15:41, Dan Magenheimer wrote:
> Won't pvclock+vsyscall have the same race?

Yes, it would need to be resolved either way.

J
Jan Beulich
2009-Sep-02 07:01 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
>>> Keir Fraser <keir.fraser@eu.citrix.com> 01.09.09 23:25 >>> >On 01/09/2009 17:04, "Jan Beulich" <JBeulich@novell.com> wrote: > >>>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>> >>> Can you think of any trick (that doesn''t require the cost of a >>> trap/hypercall) to allow an app to determine what pcpu >>> it is running on? >> >> Just like what is being used to allow apps to get the CPU number on native >> kernels (or the vCPU one on Xen-ified ones): Have a GDT entry the limit of >> which is the number you want, and have the app use the lsl instruction to >> get at it. > >Yes, that''s true. Xen could provide such a segment descriptor in its private >area of the GDT. The issue then would be that, in a compound pvclockAnd in fact there already is such a descriptor, just with DPL=0.>operation spanning multiple machine instructions, the pCPU number revealed >by the LSL instruction can be stale by the time it is used later in the >compound operation.Correct. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2009-Sep-02 07:05 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 18:41 >>> >> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>> >> >Can you think of any trick (that doesn''t require the cost of a >> >trap/hypercall) to allow an app to determine what pcpu >> >it is running on? >> >> Just like what is being used to allow apps to get the CPU >> number on native >> kernels (or the vCPU one on Xen-ified ones): Have a GDT entry >> the limit of >> which is the number you want, and have the app use the lsl >> instruction to >> get at it. > >Can you explain more? Will this work for a userland >process to get its current pcpu (not vcpu)?Sure, if the descriptor''s DPL is set to 3.>> I am, however, always a little bit concerned when it comes to exposing >> information that shouldn''t really be exposed, due to the >> possibility of >> overlooking potential misuses. In the specific case here, I >> can''t see at all >> why you''d the pCPU number exposed > >There is one pvclock "struct" for each pcpu. We want >an app to "see" the right one. If that''s not possible, >we want the app to see the whole array of them and be >able to properly index into the array.These pvclock structs should be per vCPU, shouldn''t they? The hypervisor ensures that the per-vCPU structure reflects the proper state on the pCPU that vCPU is currently running on.>> after all the kernel can do what >> you want apps to do without having that information. > >In the current Linux 2.6.30 implementation of pvclock >it can do it, but it can''t do it fast. In versions >of the kernel prior to 2.2.28(?), it can''t do it at >all, correct?I don''t think it ever uses a pCPU number. If it does, just point me to where this is happening. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2009-Sep-02 07:16 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 02.09.09 00:08 >>> >> > Just like what is being used to allow apps to get the CPU >> number on native >> > kernels (or the vCPU one on Xen-ified ones): Have a GDT >> entry the limit of >> > which is the number you want, and have the app use the lsl >> instruction to >> > get at it. >> >> Yes, that''s true. Xen could provide such a segment descriptor >> in its private >> area of the GDT. The issue then would be that, in a compound pvclock >> operation spanning multiple machine instructions, the pCPU >> number revealed >> by the LSL instruction can be stale by the time it is used >> later in the >> compound operation. > >The algorithm could check the pCPU number before and after >reading the pvclock data and doing the rdtsc, and if they >don''t match, start again. (Doesn''t the pvclock algorithm >already do that with some versioning number in the pvclock >data itself to ensure that the rest of the data didn''t >change while it was being read?)No, that won''t do - the underlying pCPU may change multiple times during that process.>I''m clueless about GDTs and the LSL instrution so would >need some help prototyping this.As said in another reply, such a descriptor already exists (PER_CPU_GDT_ENTRY). But as also already said, I doubt you really need this. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2009-Sep-02 07:20 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 02/09/2009 00:26, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> On 09/01/09 15:41, Dan Magenheimer wrote:
>> Won't pvclock+vsyscall have the same race?
>
> Yes, it would need to be resolved either way.

The problem is a bit easier with vsyscall potentially. For example, give each thread its own vsyscall clock data area (easy?), updated by kernel whenever the thread is scheduled, and increment a version counter, checked before and after by the vsyscall operation.

Well, I don't know how easy or fast that could actually be implemented, but I'm at least confident it could work. But it does need kernel assistance.

-- Keir
Jeremy Fitzhardinge
2009-Sep-02 21:44 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 09/02/09 00:20, Keir Fraser wrote:
> The problem is a bit easier with vsyscall potentially. For example, give
> each thread its own vsyscall clock data area (easy?), updated by kernel
> whenever the thread is scheduled, and increment a version counter, checked
> before and after by the vsyscall operation.

Yes. Perhaps the very simplest way would be to make the kernel update the pvclock version counter on context switch, the same way Xen does; that would allow the usermode vsyscall code to use exactly the same algorithm as the kernel code. Would Xen cope with that?

> Well, I don't know how easy or fast that could actually be implemented, but
> I'm at least confident it could work. But it does need kernel assistance.

Yes. I'm very uneasy about letting usermode have direct access to bits of Xen without the kernel's knowledge anyway. It suddenly means we need to not only maintain a Xen<->kernel ABI, but a Xen<->usermode ABI as well.

J
Keir Fraser
2009-Sep-02 21:50 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 02/09/2009 22:44, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> On 09/02/09 00:20, Keir Fraser wrote:
>> The problem is a bit easier with vsyscall potentially. For example, give
>> each thread its own vsyscall clock data area (easy?), updated by kernel
>> whenever the thread is scheduled, and increment a version counter, checked
>> before and after by the vsyscall operation.
>
> Yes. Perhaps the very simplest way would be to make the kernel update
> the pvclock version counter on context switch, the same way Xen does;
> that would allow the usermode vsyscall code to use exactly the same
> algorithm as the kernel code. Would Xen cope with that?

Yes, that's basically how I would envision it working. The main missing detail afaics is how to manage and access the required per-thread data.

-- Keir
Jeremy Fitzhardinge
2009-Sep-02 22:05 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 09/02/09 14:50, Keir Fraser wrote:
>> Yes. Perhaps the very simplest way would be to make the kernel update
>> the pvclock version counter on context switch, the same way Xen does;
>> that would allow the usermode vsyscall code to use exactly the same
>> algorithm as the kernel code. Would Xen cope with that?
>
> Yes, that's basically how I would envision it working. The main missing
> detail afaics is how to manage and access the required per-thread data.

I was imagining:

   1. Add a hypercall to set the desired location of the clock
      correction info rather than putting it in the shared-info area
      (akin to vcpu placement). KVM already has this; they write the
      address to a magic MSR.
   2. Pack all the clock structures into a single page, indexed by vcpu
      number
   3. Map that RO into userspace via fixmap, like the vsyscall page itself
   4. Use the lsl trick to get the current vcpu to index into the array,
      then compute a time value using tsc with corrections; iterate if
      version stamp changes under our feet.
   5. On context switch, the kernel would increment the version of the
      *old* vcpu clock structure, so that when the usermode code
      re-checks the version at the end of its time calculation, it can
      tell that it has a stale vcpu and it needs to iterate with a new
      vcpu+clock structure

J
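The reader side of steps 4 and 5 above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from any kernel or from Xen: `struct time_info` is a local copy whose fields mirror struct pvclock_vcpu_time_info, the `vcpu_fn` and `tsc_fn` hooks are hypothetical stand-ins for the lsl trick and rdtsc, and the `demo_*` helpers are invented stubs that simulate a context switch landing in the middle of a read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative copy of the per-vCPU record; the fields mirror
 * struct pvclock_vcpu_time_info, but this type is local to the sketch. */
struct time_info {
    volatile uint32_t version;  /* even when stable; bumped on every change */
    uint32_t pad0;
    uint64_t tsc_timestamp;     /* TSC value at the last update */
    uint64_t system_time;       /* nanoseconds at the last update */
    uint32_t tsc_to_system_mul; /* multiplier; result is (delta * mul) >> 32 */
    int8_t   tsc_shift;         /* pre-shift applied to the TSC delta */
    uint8_t  pad[3];
};

/* Hypothetical hooks standing in for the lsl trick and rdtsc. */
typedef unsigned (*vcpu_fn)(void);
typedef uint64_t (*tsc_fn)(void);

/* Shift the TSC delta, then compute (delta * mul) >> 32 without 128-bit math. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    assert(shift > -64 && shift < 64);
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Steps 4 and 5: pick the current vCPU's record, compute a time value,
 * and start over if the version changed or an update was in progress. */
static uint64_t read_clock(const struct time_info *table,
                           vcpu_fn cur_vcpu, tsc_fn tsc)
{
    for (;;) {
        const struct time_info *ti = &table[cur_vcpu()];
        uint32_t v = ti->version;
        if (v & 1)
            continue;                       /* update in progress */
        atomic_signal_fence(memory_order_seq_cst); /* compiler barrier */
        uint64_t delta = tsc() - ti->tsc_timestamp;
        uint64_t ns = ti->system_time +
                      scale_delta(delta, ti->tsc_to_system_mul, ti->tsc_shift);
        atomic_signal_fence(memory_order_seq_cst);
        if (ti->version == v)
            return ns;                      /* nothing changed under our feet */
    }
}

/* --- hypothetical stubs simulating one mid-read context switch --- */
static struct time_info demo_table[2];
static unsigned demo_cur;
static int demo_switched;

static unsigned demo_vcpu(void) { return demo_cur; }

static uint64_t demo_tsc(void)
{
    if (!demo_switched) {
        demo_table[demo_cur].version += 2;  /* kernel bumps the *old* vcpu */
        demo_cur = 1;                       /* thread now runs on vcpu 1 */
        demo_switched = 1;
    }
    return 3000;
}

uint64_t demo_read(void)
{
    demo_cur = 0;
    demo_switched = 0;
    demo_table[0] = (struct time_info){ .version = 2, .tsc_timestamp = 1000,
        .system_time = 5000, .tsc_to_system_mul = 0x80000000u };
    demo_table[1] = (struct time_info){ .version = 2, .tsc_timestamp = 2000,
        .system_time = 9000, .tsc_to_system_mul = 0x80000000u };
    return read_clock(demo_table, demo_vcpu, demo_tsc);
}
```

In the simulated run, the first pass reads vcpu 0's record, sees its version change, and retries against vcpu 1's record, so the stale result is discarded rather than returned. The compiler barriers are only sufficient on the assumption of x86 load ordering; other architectures would need real memory barriers.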
Jan Beulich
2009-Sep-03 08:23 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
>>> Jeremy Fitzhardinge <jeremy@goop.org> 03.09.09 00:05 >>> > 1. Add a hypercall to set the desired location of the clock > correction info rather than putting it in the shared-info area > (akin to vcpu placement). KVM already has this; they write the > address to a magic MSR.But this is already subject to placement, as it''s part of the vcpu_info structure. While of course you don''t want to make the whole vcpu_info visible to guests, it would seem awkward to further segregate the shared_info pieces. I''d rather consider adding a second (optional) copy of it, since the updating of this is rather little overhead in Xen, but using this in the kernel time handling code would eliminate the potential for accessing all the vcpu_info fields using percpu_read().> 2. Pack all the clock structures into a single page, indexed by vcpu > numberThat adds a scalability issue, albeit a relatively light one: You shouldn''t anymore assume there''s a limit on the number of vCPU-s.> 3. Map that RO into userspace via fixmap, like the vsyscall page itself > 4. Use the lsl trick to get the current vcpu to index into the array, > then compute a time value using tsc with corrections; iterate if > version stamp changes under our feet. > 5. On context switch, the kernel would increment the version of the > *old* vcpu clock structure, so that when the usermode code > re-checks the version at the end of its time calculation, it can > tell that it has a stale vcpu and it needs to iterate with a new > vcpu+clock structureI don''t think you can re-use the hypervisor updated version field here, unless you add a protocol on how the two updaters avoid collision. struct vcpu_time_info has a padding field, which might be designated as guest-kernel-version. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Sep-03 14:22 UTC
RE: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> I was imagining:
>
>    1. Add a hypercall to set the desired location of the clock
>       correction info rather than putting it in the shared-info area
>       (akin to vcpu placement). KVM already has this; they write the
>       address to a magic MSR.
>    2. Pack all the clock structures into a single page, indexed by vcpu
>       number
>    3. Map that RO into userspace via fixmap, like the vsyscall page itself
>    4. Use the lsl trick to get the current vcpu to index into the array,
>       then compute a time value using tsc with corrections; iterate if
>       version stamp changes under our feet.
>    5. On context switch, the kernel would increment the version of the
>       *old* vcpu clock structure, so that when the usermode code
>       re-checks the version at the end of its time calculation, it can
>       tell that it has a stale vcpu and it needs to iterate with a new
>       vcpu+clock structure

It would be nice to see a prototyped version of this so it could be confirmed that it works, the kernel impact can be evaluated, performance can be measured, and, if all looks good, distros can start putting it into their kernels.

Also, it would be nice if there is some way for apps to determine if it is present and working, e.g.

    if (clock_gettime_performance_doesnt_suck)
        t = clock_gettime();
    else {
        t = rdtsc();
        apply_post_processing(t);
    }

as apparently sysctl.vsyscall64==1 is not sufficient. In fact, if there can be agreement as to how this determination can be done (sysctl.fastpvclock==1??) apps could start getting ready.

Dan
Jeremy Fitzhardinge
2009-Sep-03 17:29 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 09/03/09 01:23, Jan Beulich wrote:>> 1. Add a hypercall to set the desired location of the clock >> correction info rather than putting it in the shared-info area >> (akin to vcpu placement). KVM already has this; they write the >> address to a magic MSR. >> > But this is already subject to placement, as it''s part of the vcpu_info > structure. While of course you don''t want to make the whole vcpu_info > visible to guests, it would seem awkward to further segregate the > shared_info pieces. I''d rather consider adding a second (optional) copy > of it, since the updating of this is rather little overhead in Xen,Hm, I guess that''s possible. Though once you''ve added a new "other time struct" pointer, it would be easier to just make Xen update that pointer rather than update two. I don''t think a guest is going to know/care about having two versions of the info (except that it opens the possibility of getting confused by looking at the wrong one). I''d propose that there''d be just one, and the non-valid pvclock structure have its version set to 0xffffffff, since a guest should never see a version in that state.> but > using this in the kernel time handling code would eliminate the > potential for accessing all the vcpu_info fields using percpu_read(). >I don''t think that''s a big concern. The kernel''s pvclock handing is common between Xen and KVM now, and it just gets a pointer to the structure; it never accesses it as a percpu variable.>> 2. Pack all the clock structures into a single page, indexed by vcpu >> number >> > That adds a scalability issue, albeit a relatively light one: You shouldn''t > anymore assume there''s a limit on the number of vCPU-s. >Well, that''s up to the kernel rather than Xen. If there a lot of CPUs it can span multiple pages. There''s no need to make them physically contiguous, since the kernel never needs to treat them as an array and we can map disjoint pages contiguously into userspace (it might take a chunk of fixmap slots). 
I guess one concern is that it ends up exposing the scheduling info about all the VCPUs to all usermode. I doubt that's a problem in itself, but who knows if it could be used as part of a larger attack.

>> 5. On context switch, the kernel would increment the version of the
>>    *old* vcpu clock structure, so that when the usermode code
>>    re-checks the version at the end of its time calculation, it can
>>    tell that it has a stale vcpu and it needs to iterate with a new
>>    vcpu+clock structure
>
> I don't think you can re-use the hypervisor updated version field here,
> unless you add a protocol on how the two updaters avoid collision.
> struct vcpu_time_info has a padding field, which might be designated
> as guest-kernel-version.

There's no padding. It would be an extension of the pvclock ABI, which KVM also implements, so we'd need to make sure they can cope too.

We only need to worry about Xen preempting a kernel update rather than the other way around. I think it ends up being very simple:

    void ctxtsw_update_pvclock(struct pvclock_vcpu_time_info *pvclock)
    {
        BUG_ON(preemptible());  /* Switching VCPUs would be a disaster */

        /*
         * We just need to update version; if Xen did it behind our back, then
         * that's OK with us. We should never see an update-in-progress because Xen
         * will always completely update the pvclock structure before rescheduling the
         * VCPU, so version should always be even. We don't care if Xen updates the
         * timing parameters here because we're not in the middle of a clock read.
         * Usermode might be in the middle of a read, but all it needs to see is version
         * changing to a new even number, even if this add gets preempted by Xen in
         * the middle. There are no cross-PCPU writes going on, so we don't need to
         * worry about bus-level atomicity.
         */
        pvclock->version += 2;
    }

Looks like this would work for KVM too.

J
Jan Beulich
2009-Sep-04 07:19 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
>>> Jeremy Fitzhardinge <jeremy@goop.org> 03.09.09 19:29 >>>
>On 09/03/09 01:23, Jan Beulich wrote:
>> I don't think you can re-use the hypervisor updated version field here,
>> unless you add a protocol on how the two updaters avoid collision.
>> struct vcpu_time_info has a padding field, which might be designated
>> as guest-kernel-version.
>
>There's no padding. It would be an extension of the pvclock ABI, which
>KVM also implements, so we'd need to make sure they can cope too.

struct pvclock_vcpu_time_info has a 'pad0' field afaics.

>We only need to worry about Xen preempting a kernel update rather than
>the other way around. I think it ends up being very simple:
>
>void ctxtsw_update_pvclock(struct pvclock_vcpu_time_info *pvclock)
>{
>    BUG_ON(preemptible());  /* Switching VCPUs would be a disaster */
>
>    /*
>     * We just need to update version; if Xen did it behind our back, then
>     * that's OK with us. We should never see an update-in-progress because Xen
>     * will always completely update the pvclock structure before rescheduling the
>     * VCPU, so version should always be even. We don't care if Xen updates the
>     * timing parameters here because we're not in the middle of a clock read.
>     * Usermode might be in the middle of a read, but all it needs to see is version
>     * changing to a new even number, even if this add gets preempted by Xen in
>     * the middle. There are no cross-PCPU writes going on, so we don't need to
>     * worry about bus-level atomicity.
>     */
>    pvclock->version += 2;
>}

No, that won't work as-is, because you can't guarantee the compiler to translate this to an add-with-memory-operand. While avoiding a bus lock here indeed seems possible (as long as it is clear that user mode will never be interested in reading other than the instance of the CPU it's currently running on), you won't get away without inline assembly.
Jan
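For illustration, the kind of inline assembly Jan refers to might look like the sketch below. It is an assumption-laden example rather than anything proposed verbatim in the thread: `bump_version` and `bump_demo` are hypothetical names, the GCC-style `__asm__` statement applies only to x86 builds with a GCC-compatible compiler, and the plain C fallback is there solely so the sketch compiles elsewhere. The add deliberately carries no lock prefix, matching the observation that a bus lock can be avoided.

```c
#include <assert.h>
#include <stdint.h>

/* Bump the version by 2 with a single add-to-memory instruction on x86,
 * so the update cannot be split into separate load and store steps. */
static inline void bump_version(volatile uint32_t *version)
{
    assert((*version & 1) == 0);    /* version is expected to be even here */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("addl $2, %0"
                         : "+m" (*version)
                         :
                         : "cc", "memory");
#else
    *version += 2;                  /* fallback: no single-instruction guarantee */
#endif
}

/* Hypothetical helper showing the effect on a local counter. */
uint32_t bump_demo(uint32_t start)
{
    volatile uint32_t v = start;
    bump_version(&v);
    return v;
}
```

A single instruction is uninterruptible with respect to the local CPU, which is the property at issue here; it gives no atomicity against writers on other CPUs.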
Jeremy Fitzhardinge
2009-Sep-04 15:44 UTC
Re: [Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
On 09/04/09 00:19, Jan Beulich wrote:
> struct pvclock_vcpu_time_info has a 'pad0' field afaics.

Ah, yes, I was looking at wall_clock. We could claim the padding for "local version", but it would require a 64-bit unpreemptible read, which is awkward on 32-bit.

>> We only need to worry about Xen preempting a kernel update rather than
>> the other way around. I think it ends up being very simple:
>>
>> void ctxtsw_update_pvclock(struct pvclock_vcpu_time_info *pvclock)
>> {
>>     BUG_ON(preemptible());  /* Switching VCPUs would be a disaster */
>>
>>     /*
>>      * We just need to update version; if Xen did it behind our back, then
>>      * that's OK with us. We should never see an update-in-progress because Xen
>>      * will always completely update the pvclock structure before rescheduling the
>>      * VCPU, so version should always be even. We don't care if Xen updates the
>>      * timing parameters here because we're not in the middle of a clock read.
>>      * Usermode might be in the middle of a read, but all it needs to see is version
>>      * changing to a new even number, even if this add gets preempted by Xen in
>>      * the middle. There are no cross-PCPU writes going on, so we don't need to
>>      * worry about bus-level atomicity.
>>      */
>>     pvclock->version += 2;
>> }
>
> No, that won't work as-is, because you can't guarantee the compiler to
> translate this to an add-with-memory-operand. While avoiding a bus
> lock here indeed seems possible (as long as it is clear that user mode will
> never be interested in reading other than the instance of the CPU it's
> currently running on), you won't get away without inline assembly.

I don't think that matters, even if the compiler generates a preemptable sequence: the end result will always be a changed version number. Even if we end up rolling back the version to a smaller number (because Xen did multiple pvclock updates while it preempted us) nothing will get confused because nothing observed those intermediate versions.
Xen itself doesn't care about the version number (it's effectively write-only). KVM is the same.

J