On Wed, Jul 31, 2013 at 06:25:04AM -0700, H. Peter Anvin wrote:
> On 07/31/2013 06:17 AM, Konrad Rzeszutek Wilk wrote:
> >>
> >> The big problem with pvops is that they are a permanent tax on future
> >> development -- a classic case of "the hooks problem."  As such it is
> >> important that there be a real, significant, use case with enough users
> >> to make the pain worthwhile.  With Xen looking at sunsetting PV support
> >> with a long horizon, it might currently be possible to remove pvops some
> >
> > PV MMU parts specifically.
> >
>
> Pretty much stuff that is driverized on plain hardware doesn't matter.
> What are you looking at with respect to the basic CPU control state?

CC-ing Mukesh here.

Let me iterate down what the experimental patch uses:

struct pv_init_ops pv_init_ops;
 [still use xen_patch, but I think that is not needed anymore]

struct pv_time_ops pv_time_ops;
 [we need that as we are using the PV clock source]

struct pv_cpu_ops pv_cpu_ops;
 [only end up using cpuid. This one is a tricky one. We could
 arguably remove it but it does do some filtering - for example
 THERM is turned off, or MWAIT if a certain hypercall tells us to
 disable that. Since this is now a trapped operation this could be
 handled in the hypervisor - but then it would be in charge of
 filtering certain CPUID - and this is at bootup - so there is no
 user interaction. This needs a bit more thinking]

struct pv_irq_ops pv_irq_ops;
 [none so far, we use normal sti/cli]

struct pv_apic_ops pv_apic_ops;
 [we over-write them with our own event channel logic for IPIs, etc.
 Though with virtualized APIC this could be done differently, and some
 Intel engineers told me that they have it on their roadmap]

struct pv_mmu_ops pv_mmu_ops;
 [we use two:
  - .flush_tlb_others (xen_flush_tlb_others) - and I think we can
    actually remove that. Mukesh, do you recall why we need it?
  - .pagetable_init - but that can be moved out, as the only reason it
    is there is to use a new address (__va) on the shared page (it
    switches from using the __kva to using the __va).]

struct pv_lock_ops pv_lock_ops;
 [still using that]

Please take this with a grain of salt. The patches are still
experimental, so we might be missing something, and none of this is
set in stone.
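
For readers who have not looked at pvops before, the "hooks" idea and the
CPUID filtering described above can be sketched in user-space C roughly as
follows. This is only an illustration under stated assumptions, not the
kernel's pv_cpu_ops: the struct, the function names and the stubbed CPUID
values are all made up here, and it assumes that "THERM" refers to the
thermal monitor (TM) flag. Only the bit positions (MONITOR/MWAIT is
CPUID.1:ECX bit 3, TM is CPUID.1:EDX bit 29) come from the architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural CPUID leaf 1 feature bits used by this sketch. */
#define FEAT_ECX_MWAIT (1u << 3)   /* MONITOR/MWAIT */
#define FEAT_EDX_TM    (1u << 29)  /* thermal monitor */

typedef void (*cpuid_fn)(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                         uint32_t *ecx, uint32_t *edx);

/* Hypothetical hook table: every caller goes through this pointer. */
struct sketch_cpu_ops {
    cpuid_fn cpuid;
};

/* Stand-in for the real instruction: pretend leaf 1 reports every feature. */
static void native_cpuid_stub(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                              uint32_t *ecx, uint32_t *edx)
{
    *eax = 0;
    *ebx = 0;
    *ecx = (leaf == 1) ? 0xffffffffu : 0;
    *edx = (leaf == 1) ? 0xffffffffu : 0;
}

/* Guest variant: call the native path, then hide features the guest must not use. */
static void guest_cpuid_filtered(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                                 uint32_t *ecx, uint32_t *edx)
{
    native_cpuid_stub(leaf, eax, ebx, ecx, edx);
    if (leaf == 1) {
        *ecx &= ~FEAT_ECX_MWAIT;
        *edx &= ~FEAT_EDX_TM;
    }
}

/* Default to the native implementation, as on bare metal. */
static struct sketch_cpu_ops cpu_ops = { .cpuid = native_cpuid_stub };

/* What a guest would do once, early in boot. */
static void install_guest_ops(void)
{
    cpu_ops.cpuid = guest_cpuid_filtered;
}

/* The single entry point the rest of the code would call. */
static void do_cpuid(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                     uint32_t *ecx, uint32_t *edx)
{
    assert(cpu_ops.cpuid != NULL);
    cpu_ops.cpuid(leaf, eax, ebx, ecx, edx);
}
```

The cost hpa refers to is visible even in this toy: every caller of
do_cpuid() pays for the indirection whether or not a guest ever installs a
hook, and the table has to be carried along through every future change.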
On Fri, Aug 02, 2013 at 03:09:34PM -0400, Konrad Rzeszutek Wilk wrote:
> [...]
> struct pv_cpu_ops pv_cpu_ops;
>  [only end up using cpuid. This one is a tricky one. We could
>  arguably remove it but it does do some filtering - for example
>  THERM is turned off, or MWAIT if a certain hypercall tells us to
>  disable that. Since this is now a trapped operation this could be
>  handled in the hypervisor - but then it would be in charge of
>  filtering certain CPUID - and this is at bootup - so there is no
>  user interaction. This needs a bit more thinking]
>

read_msr/write_msr in this one make all MSR accesses safe. IIRC there
are MSRs that Linux uses without checking CPUID bits.
IA32_PERF_CAPABILITIES, for instance, is used without checking the PDCM bit.

--
			Gleb.
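
A rough user-space sketch of what "safe" means in Gleb's point: the read
hook reports an error and returns 0 instead of taking a fault when the MSR
is not implemented. Everything here is an assumption for illustration. The
fake_hw_rdmsr() lookup stands in for the hardware (and for the exception
fixup the kernel would really use), and the function names are invented.
Only the MSR numbers are architectural (IA32_TIME_STAMP_COUNTER is 0x10,
IA32_PERF_CAPABILITIES is 0x345).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_TSC               0x10u
#define MSR_IA32_PERF_CAPABILITIES 0x345u

/*
 * Pretend hardware: only the TSC MSR exists. A negative return value
 * stands in for the #GP a real RDMSR would raise on an unknown MSR.
 */
static int fake_hw_rdmsr(uint32_t msr, uint64_t *val)
{
    if (msr == MSR_IA32_TSC) {
        *val = 0x1234u;   /* arbitrary value for the sketch */
        return 0;
    }
    return -EIO;
}

/*
 * "Safe" read: never faults. On failure it stores the error in *err and
 * returns 0, so a caller that forgot to check CPUID first still survives.
 */
static uint64_t read_msr_safe_sketch(uint32_t msr, int *err)
{
    uint64_t val = 0;

    assert(err != NULL);
    *err = fake_hw_rdmsr(msr, &val);
    return *err ? 0 : val;
}
```

With this shape, a caller that reads IA32_PERF_CAPABILITIES without first
checking the PDCM bit simply sees 0 and an error code on a platform that
does not implement the MSR.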
On Sun, Aug 04, 2013 at 03:37:08PM +0300, Gleb Natapov wrote:
> On Fri, Aug 02, 2013 at 03:09:34PM -0400, Konrad Rzeszutek Wilk wrote:
> > [...]
> > struct pv_cpu_ops pv_cpu_ops;
> > [...]
> >
> read_msr/write_msr in this one make all msr accesses safe. IIRC there
> are MSRs that Linux uses without checking cpuid bits.
> IA32_PERF_CAPABILITIES for instance is used without checking PDCM bit.

Right, those are needed as well. Completely forgot about them.
On 08/05/2013 09:50 AM, Konrad Rzeszutek Wilk wrote:
>>>
>>> Let me iterate down what the experimental patch uses:
>>>
>>> struct pv_init_ops pv_init_ops;
>>>  [still use xen_patch, but I think that is not needed anymore]
>>>
>>> struct pv_time_ops pv_time_ops;
>>>  [we need that as we are using the PV clock source]
>>>
>>> struct pv_cpu_ops pv_cpu_ops;
>>>  [only end up using cpuid. This one is a tricky one. We could
>>>  arguably remove it but it does do some filtering - for example
>>>  THERM is turned off, or MWAIT if a certain hypercall tells us to
>>>  disable that. Since this is now a trapped operation this could be
>>>  handled in the hypervisor - but then it would be in charge of
>>>  filtering certain CPUID - and this is at bootup - so there is no
>>>  user interaction. This needs a bit more thinking]
>>>
>> read_msr/write_msr in this one make all msr accesses safe. IIRC there
>> are MSRs that Linux uses without checking cpuid bits.
>> IA32_PERF_CAPABILITIES for instance is used without checking PDCM bit.
>
> Right, those are needed as well. Completely forgot about them.

CPUID is not too bad.  RDMSR/WRMSR is actually worse since there are
some MSRs which are performance-critical.  The really messy pvops are
the memory-related ones, as they don't match the hardware behavior.

Similarly, beyond pvops, what new assumptions does this code add to the
code base?

	-hpa
> >>> struct pv_cpu_ops pv_cpu_ops;
> >>>  [only end up using cpuid. This one is a tricky one. We could
> >>>  arguably remove it but it does do some filtering - for example
> >>>  THERM is turned off, or MWAIT if a certain hypercall tells us to
> >>>  disable that. Since this is now a trapped operation this could be
> >>>  handled in the hypervisor - but then it would be in charge of
> >>>  filtering certain CPUID - and this is at bootup - so there is no
> >>>  user interaction. This needs a bit more thinking]
> >>>
> >> read_msr/write_msr in this one make all msr accesses safe. IIRC there
> >> are MSRs that Linux uses without checking cpuid bits.
> >> IA32_PERF_CAPABILITIES for instance is used without checking PDCM bit.
> >
> > Right, those are needed as well. Completely forgot about them.
>
> CPUID is not too bad.  RDMSR/WRMSR is actually worse since there are
> some MSRs which are performance-critical.  The really messy pvops are
> the memory-related ones, as they don't match the hardware behavior.

Would you by any chance have a nice test case that demonstrates the
rdmsr/wrmsr paths which are performance-critical on bare metal?

>
> Similarly, beyond pvops, what new assumptions does this code add to the
> code base?

We have not yet narrowed down how to "negotiate" the GDT values, as
the VMX code in the hypervisor has already set those up before it loads
the kernel. I think Mukesh was thinking of extending the .Xen.note to
enumerate the ones that are needed, so that the hypervisor can somehow
slurp them in.
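
As a purely illustrative sketch of the note idea mentioned above: an ELF
note is a small header (namesz, descsz, type) followed by a name and a
descriptor, each padded to 4 bytes, so a loader can walk the notes and
pick out the ones it understands. The walker below follows that generic
layout. The note type HYPOTHETICAL_NOTE_GDT_SEL and the idea of storing a
selector value in the descriptor are assumptions invented for this
example. Nothing in the thread defines such a note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic ELF note header: three 32-bit words, then name, then desc. */
struct note_hdr {
    uint32_t namesz;   /* length of the name, including the trailing NUL */
    uint32_t descsz;   /* length of the descriptor payload */
    uint32_t type;     /* meaning is defined by the owner of the name */
};

/* Made-up note type for this sketch; not a real Xen note type. */
#define HYPOTHETICAL_NOTE_GDT_SEL 0x1000u

/* Name and descriptor are each padded to a 4-byte boundary. */
static size_t align4(size_t n)
{
    return (n + 3u) & ~(size_t)3u;
}

/*
 * Walk a buffer of notes and return a pointer to the descriptor of the
 * first note matching both name and type, or NULL if there is none.
 */
static const void *find_note(const uint8_t *buf, size_t len,
                             const char *name, uint32_t type,
                             uint32_t *descsz)
{
    size_t off = 0;

    assert(buf != NULL && name != NULL && descsz != NULL);
    while (len - off >= sizeof(struct note_hdr)) {
        struct note_hdr h;
        size_t name_off, desc_off, next;

        memcpy(&h, buf + off, sizeof h);
        name_off = off + sizeof h;
        desc_off = name_off + align4(h.namesz);
        next = desc_off + align4(h.descsz);
        if (next > len || next <= off)
            break;                      /* truncated or malformed note */
        if (h.type == type && h.namesz == strlen(name) + 1 &&
            memcmp(buf + name_off, name, h.namesz) == 0) {
            *descsz = h.descsz;
            return buf + desc_off;
        }
        off = next;
    }
    return NULL;
}
```

A hypervisor-side loader could use a lookup of this kind to collect the
values the kernel asks for before it sets up the guest state, which is one
way to read the "slurps them in" remark.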