Waiman Long
2017-Feb-10 16:35 UTC
[PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/10/2017 11:19 AM, Peter Zijlstra wrote:
> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote:
>> It was found when running fio sequential write test with a XFS ramdisk
>> on a VM running on a 2-socket x86-64 system, the %CPU times as reported
>> by perf were as follows:
>>
>> 69.75% 0.59% fio [k] down_write
>> 69.15% 0.01% fio [k] call_rwsem_down_write_failed
>> 67.12% 1.12% fio [k] rwsem_down_write_failed
>> 63.48% 52.77% fio [k] osq_lock
>> 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt
>> 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted
>>
> Thinking about this again, wouldn't something like the below also work?
>
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 099fcba4981d..6aa33702c15c 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
>  	local_irq_restore(flags);
>  }
>
> +#ifdef CONFIG_X86_32
>  __visible bool __kvm_vcpu_is_preempted(int cpu)
>  {
>  	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
>  }
>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>
> +#else
> +
> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
> +
> +asm(
> +".pushsection .text;"
> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> +"__raw_callee_save___kvm_vcpu_is_preempted:"
> +FRAME_BEGIN
> +"push %rdi;"
> +"push %rdx;"
> +"movslq %edi, %rdi;"
> +"movq $steal_time+16, %rax;"
> +"movq __per_cpu_offset(,%rdi,8), %rdx;"
> +"cmpb $0, (%rdx,%rax);"
> +"setne %al;"
> +"pop %rdx;"
> +"pop %rdi;"
> +FRAME_END
> +"ret;"
> +".popsection");
> +
> +#endif
> +
>  /*
>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>   */

That should work for now. I have done something similar for
__pv_queued_spin_unlock. However, this has the problem of creating a
dependency on the exact layout of the steal_time structure. Maybe the
constant 16 can be passed in as a parameter offsetof(struct
kvm_steal_time, preempted) to the asm call.

Cheers,
Longman
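[Editorial sketch] The suggestion above, replacing the literal 16 with offsetof(struct kvm_steal_time, preempted), can be tried out in user space. The sketch below is not kernel code: struct mock_kvm_steal_time and mock_is_preempted() are hypothetical names, and the field layout is an assumption about how struct kvm_steal_time looked at the time. On x86-64 with GCC or Clang it passes the offset to inline asm through an "i" constraint; elsewhere it falls back to plain C that reads the same byte.

```c
#include <assert.h>   /* static_assert */
#include <stdbool.h>
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for struct kvm_steal_time.
 * The field order and sizes are an assumption made for illustration.
 */
struct mock_kvm_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* The asm below uses cmpb, so the field must be exactly one byte wide. */
static_assert(sizeof(((struct mock_kvm_steal_time *)0)->preempted) == 1,
	      "cmpb compares exactly one byte");

/* Returns true when the preempted byte of *st is non-zero. */
static inline bool mock_is_preempted(const struct mock_kvm_steal_time *st)
{
#if defined(__x86_64__) && defined(__GNUC__)
	bool ret;

	/*
	 * %c[off] prints the constant without the '$' prefix, so it can
	 * serve as the displacement of the memory operand. The compiler
	 * computes the offset; no literal 16 appears in the source.
	 */
	__asm__ __volatile__("cmpb $0, %c[off](%[base])\n\t"
			     "setne %[ret]"
			     : [ret] "=q" (ret)
			     : [base] "r" (st),
			       [off] "i" (offsetof(struct mock_kvm_steal_time,
						   preempted))
			     : "cc", "memory");
	return ret;
#else
	/* Portable fallback: same byte, located with the same offsetof(). */
	const unsigned char *p = (const unsigned char *)st;

	return p[offsetof(struct mock_kvm_steal_time, preempted)] != 0;
#endif
}
```

With the assumed layout the offset still evaluates to 16, but a change to the structure would be picked up at build time instead of silently reading the wrong byte.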
Waiman Long
2017-Feb-10 17:00 UTC
[PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/10/2017 11:35 AM, Waiman Long wrote:
> On 02/10/2017 11:19 AM, Peter Zijlstra wrote:
>> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote:
>>> It was found when running fio sequential write test with a XFS ramdisk
>>> on a VM running on a 2-socket x86-64 system, the %CPU times as reported
>>> by perf were as follows:
>>>
>>> 69.75% 0.59% fio [k] down_write
>>> 69.15% 0.01% fio [k] call_rwsem_down_write_failed
>>> 67.12% 1.12% fio [k] rwsem_down_write_failed
>>> 63.48% 52.77% fio [k] osq_lock
>>> 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt
>>> 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted
>>>
>> Thinking about this again, wouldn't something like the below also work?
>>
>>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 099fcba4981d..6aa33702c15c 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
>>  	local_irq_restore(flags);
>>  }
>>
>> +#ifdef CONFIG_X86_32
>>  __visible bool __kvm_vcpu_is_preempted(int cpu)
>>  {
>>  	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
>> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
>>  }
>>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>>
>> +#else
>> +
>> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
>> +
>> +asm(
>> +".pushsection .text;"
>> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>> +"__raw_callee_save___kvm_vcpu_is_preempted:"
>> +FRAME_BEGIN
>> +"push %rdi;"
>> +"push %rdx;"
>> +"movslq %edi, %rdi;"
>> +"movq $steal_time+16, %rax;"
>> +"movq __per_cpu_offset(,%rdi,8), %rdx;"
>> +"cmpb $0, (%rdx,%rax);"
>> +"setne %al;"
>> +"pop %rdx;"
>> +"pop %rdi;"
>> +FRAME_END
>> +"ret;"
>> +".popsection");
>> +
>> +#endif
>> +
>>  /*
>>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>>   */
> That should work for now. I have done something similar for
> __pv_queued_spin_unlock. However, this has the problem of creating a
> dependency on the exact layout of the steal_time structure. Maybe the
> constant 16 can be passed in as a parameter offsetof(struct
> kvm_steal_time, preempted) to the asm call.
>
> Cheers,
> Longman

One more thing: that will improve KVM performance, but it won't help Xen.

I looked into the assembly code for rwsem_spin_on_owner. It needs to save
and restore 2 additional registers with my patch. Doing it your way
will transfer the save and restore overhead to the assembly code.
However, __kvm_vcpu_is_preempted() is called multiple times per
invocation of rwsem_spin_on_owner. That function is simple enough that
making __kvm_vcpu_is_preempted() callee-save won't produce much compiler
optimization opportunity. The outer function rwsem_down_write_failed()
does appear to be a bit bigger (from 866 bytes to 884 bytes) though.

Cheers,
Longman
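[Editorial sketch] The point above, that the preemption check runs on every pass of the spin loop rather than once per call, can be made concrete with a small user-space model. Everything here is hypothetical: mock_spin_on_owner() only loosely mirrors the shape of rwsem_spin_on_owner(), the max_spins bound exists solely so the model terminates, and mock_vcpu_is_preempted() just counts how often it is called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock owner's task state. */
struct mock_owner {
	int cpu;               /* CPU the owner last ran on */
	volatile bool on_cpu;  /* true while the owner is running */
};

/* Number of preemption checks performed so far. */
static unsigned long mock_preempt_checks;

/* Mock check: never reports preemption, only counts the call. */
static bool mock_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	mock_preempt_checks++;
	return false;
}

/*
 * Spin while the owner keeps running. Returns false as soon as spinning
 * is pointless (owner off-CPU or its vCPU preempted), true if the
 * max_spins budget ran out with the owner still running.
 */
static bool mock_spin_on_owner(const struct mock_owner *owner,
			       unsigned long max_spins)
{
	assert(owner);

	for (unsigned long i = 0; i < max_spins; i++) {
		/* One preemption check per iteration, not one per call. */
		if (!owner->on_cpu || mock_vcpu_is_preempted(owner->cpu))
			return false;
	}
	return true;
}
```

In this model a single spin of 1000 iterations performs 1000 checks, so any register save and restore cost inside the check is paid on every iteration, while the cost in the caller is paid once.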
Peter Zijlstra
2017-Feb-13 10:47 UTC
[PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On Fri, Feb 10, 2017 at 12:00:43PM -0500, Waiman Long wrote:
> >> +asm(
> >> +".pushsection .text;"
> >> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
> >> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> >> +"__raw_callee_save___kvm_vcpu_is_preempted:"
> >> +FRAME_BEGIN
> >> +"push %rdi;"
> >> +"push %rdx;"
> >> +"movslq %edi, %rdi;"
> >> +"movq $steal_time+16, %rax;"
> >> +"movq __per_cpu_offset(,%rdi,8), %rdx;"
> >> +"cmpb $0, (%rdx,%rax);"

Could we not put the $steal_time+16 displacement as an immediate in the
cmpb and save a whole register here?

That way we'd end up with something like:

asm("
push %rdi;
movslq %edi, %rdi;
movq __per_cpu_offset(,%rdi,8), %rax;
cmpb $0, %[offset](%rax);
setne %al;
pop %rdi;
" : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct kvm_steal_time, preempted)));

And if we could get rid of the sign extend on edi we could avoid all the
push-pop nonsense, but I'm not sure I see how to do that (then again,
this asm foo isn't my strongest point).

> >> +"setne %al;"
> >> +"pop %rdx;"
> >> +"pop %rdi;"
> >> +FRAME_END
> >> +"ret;"
> >> +".popsection");
> >> +
> >> +#endif
> >> +
> >>  /*
> >>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
> >>   */
> > That should work for now. I have done something similar for
> > __pv_queued_spin_unlock. However, this has the problem of creating a
> > dependency on the exact layout of the steal_time structure. Maybe the
> > constant 16 can be passed in as a parameter offsetof(struct
> > kvm_steal_time, preempted) to the asm call.

Yeah it should be well possible to pass that in. But ideally we'd have
GCC grow something like __attribute__((callee_saved)) or somesuch and it
would do all this for us.

> One more thing: that will improve KVM performance, but it won't help Xen.

People still use Xen? ;-) In any case, their implementation looks very
similar and could easily crib this.

> I looked into the assembly code for rwsem_spin_on_owner. It needs to save
> and restore 2 additional registers with my patch. Doing it your way
> will transfer the save and restore overhead to the assembly code.
> However, __kvm_vcpu_is_preempted() is called multiple times per
> invocation of rwsem_spin_on_owner. That function is simple enough that
> making __kvm_vcpu_is_preempted() callee-save won't produce much compiler
> optimization opportunity.

This is because of that noinline, right? Otherwise it would've been
folded and register pressure would be much higher.

> The outer function rwsem_down_write_failed()
> does appear to be a bit bigger (from 866 bytes to 884 bytes) though.

I suspect GCC is being clever and since all this is static it plays
games with the calling convention and pushes these clobbers out.
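[Editorial sketch] For readers less used to the asm, the address arithmetic discussed above (load the per-CPU offset for the given CPU, then test one byte at a constant displacement from it) can be modelled in plain C. This is a user-space approximation only: mock_steal_area, mock_pcpu_offset and mock_percpu_is_preempted() are hypothetical names, the structure layout is assumed, and a simple array stands in for the kernel's per-CPU areas.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NR_CPUS 4

/* Hypothetical stand-in for struct kvm_steal_time (layout assumed). */
struct mock_pcpu_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

#define MOCK_SZ sizeof(struct mock_pcpu_steal_time)

/* Element 0 plays the role of the steal_time symbol; one copy per CPU. */
static struct mock_pcpu_steal_time mock_steal_area[MOCK_NR_CPUS];

/* Byte distance from the symbol to each CPU's copy, like __per_cpu_offset[]. */
static const size_t mock_pcpu_offset[MOCK_NR_CPUS] = {
	0 * MOCK_SZ, 1 * MOCK_SZ, 2 * MOCK_SZ, 3 * MOCK_SZ,
};

static bool mock_percpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < MOCK_NR_CPUS);

	/*
	 * Models: movq __per_cpu_offset(,%rdi,8), %rax
	 * Indexing with an int is where the sign extension (movslq) comes from.
	 */
	size_t base = mock_pcpu_offset[cpu];

	/*
	 * Models: cmpb $0, <symbol + field offset>(%rax)
	 * The symbol address and the field offset fold into one constant
	 * displacement, so only the per-CPU base needs a register.
	 */
	const unsigned char *sym = (const unsigned char *)mock_steal_area;
	const unsigned char *byte =
		sym + offsetof(struct mock_pcpu_steal_time, preempted) + base;

	/* Models: setne %al */
	return *byte != 0;
}
```

The model shows why the second register in the original asm is avoidable: the only run-time quantity is the per-CPU base, and everything else is known when the code is built and linked.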