On 06/07/2016 14:08, Wanpeng Li wrote:
> 2016-07-06 18:44 GMT+08:00 Paolo Bonzini <pbonzini at redhat.com>:
>>
>> On 06/07/2016 08:52, Peter Zijlstra wrote:
>>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>>> changes from v1:
>>>>  a simpler definition of the default vcpu_is_preempted
>>>>  skip machine type check on ppc, and add config. remove dedicated macro.
>>>>  add one patch to drop the overloads of rwsem_spin_on_owner and mutex_spin_on_owner.
>>>>  add more comments
>>>>  thanks to Boqun's and Peter's suggestions.
>>>>
>>>> This patch set aims to fix lock holder preemption issues.
>>>>
>>>> test-case:
>>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>>
>>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>>
>>>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some
>>>> spin loops: osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>>> These spin_on_owner variants also caused RCU stalls before we applied this
>>>> patch set.
>>>
>>> Paolo, could you help out with an (x86) KVM interface for this?
>>
>> If it's just for spin loops, you can check whether the version field in the
>> steal time structure has changed.
>
> Steal time will not be updated until ahead of the next vmentry, except on a
> wrmsr of MSR_KVM_STEAL_TIME. So it can't tell us whether the vCPU is
> preempted right now, right?

Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
VCPU has been scheduled out since the last time the guest reset the bit.
The guest can use an xchg to test-and-clear it.  The bit can be
accessed at any time, independent of the version field.

Paolo
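In guest terms, the test-and-clear Paolo describes could look like the sketch
below. This is hand-written for illustration: KVM_ST_FLAG_PREEMPT and
KVM_ST_PAD_PREEMPT are hypothetical names for bit 0 of flags and for pad[0]
(no such constants exist in the uapi headers at this point), and steal_time
is the per-cpu struct kvm_steal_time the guest already registers in
arch/x86/kernel/kvm.c.

	#define KVM_ST_FLAG_PREEMPT	(1 << 0)	/* hypothetical: pad[0] is valid */
	#define KVM_ST_PAD_PREEMPT	0		/* hypothetical: index into pad[] */

	static bool kvm_vcpu_is_preempted(int cpu)
	{
		struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

		/* Host does not publish the field; report "running". */
		if (!(READ_ONCE(st->flags) & KVM_ST_FLAG_PREEMPT))
			return false;

		/*
		 * xchg reads and clears pad[0] in one atomic step, so a host
		 * update cannot be lost between the test and the clear.
		 */
		return xchg(&st->pad[KVM_ST_PAD_PREEMPT], 0) != 0;
	}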
2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini at redhat.com>:
> On 06/07/2016 14:08, Wanpeng Li wrote:
>> [...]
>> Steal time will not be updated until ahead of the next vmentry, except on a
>> wrmsr of MSR_KVM_STEAL_TIME. So it can't tell us whether the vCPU is
>> preempted right now, right?
>
> Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> VCPU has been scheduled out since the last time the guest reset the bit.
> The guest can use an xchg to test-and-clear it.  The bit can be
> accessed at any time, independent of the version field.

I will try to implement it tomorrow, thanks for your proposal. :)

Regards,
Wanpeng Li
2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini at redhat.com>:
> On 06/07/2016 14:08, Wanpeng Li wrote:
>> [...]
>> Steal time will not be updated until ahead of the next vmentry, except on a
>> wrmsr of MSR_KVM_STEAL_TIME. So it can't tell us whether the vCPU is
>> preempted right now, right?
>
> Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> VCPU has been scheduled out since the last time the guest reset the bit.
> The guest can use an xchg to test-and-clear it.  The bit can be
> accessed at any time, independent of the version field.

If one vCPU is preempted and the guest checks the field several times
before that vCPU is scheduled in again, the first check returns "vCPU is
preempted", but, since the xchg clears the field, every later check
returns "vCPU is running".

Do you mean we should call record_steal_time() in both kvm_sched_in()
and kvm_sched_out() to keep this field up to date? Btw, should we keep
both vcpu->preempted and kvm_steal_time's "vCPU preempted" field
present simultaneously?

Regards,
Wanpeng Li
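For context on the vcpu->preempted side of the question: the host already
records preemption in the sched-out preempt notifier in virt/kvm/kvm_main.c;
its shape around this time is roughly the following (quoted from memory, so
treat details as approximate). The open issue in the thread is only how to
mirror that state into the guest-visible steal-time area.

	/* virt/kvm/kvm_main.c: existing preempt notifier, for reference */
	static void kvm_sched_out(struct preempt_notifier *pn,
				  struct task_struct *next)
	{
		struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

		/* Only count involuntary preemption, not voluntary sleeps. */
		if (current->state == TASK_RUNNING)
			vcpu->preempted = true;
		kvm_arch_vcpu_put(vcpu);
	}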
On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini at redhat.com>:
> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> > VCPU has been scheduled out since the last time the guest reset the bit.
> > The guest can use an xchg to test-and-clear it.  The bit can be
> > accessed at any time, independent of the version field.
>
> If one vCPU is preempted and the guest checks the field several times
> before that vCPU is scheduled in again, the first check returns "vCPU is
> preempted", but, since the xchg clears the field, every later check
> returns "vCPU is running".
>
> Do you mean we should call record_steal_time() in both kvm_sched_in()
> and kvm_sched_out() to keep this field up to date? Btw, should we keep
> both vcpu->preempted and kvm_steal_time's "vCPU preempted" field
> present simultaneously?

I suspect you want something like the patch below; except it has holes
in it.

We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
after enabling it again, which means that if we get preempted in
between, the vCPU is reported as running even though it very much is
not. Fixing that requires much larger surgery.

---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2766723c951..117270df43b6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1997,8 +1997,29 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.pv_time_enabled = false;
 }

+static void update_steal_time_preempt(struct kvm_vcpu *vcpu)
+{
+	struct kvm_steal_time *st;
+
+	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+		return;
+
+	st = &vcpu->arch.st.steal;
+
+	st->pad[KVM_ST_PAD_PREEMPT] = 1; /* we've stopped running */
+
+	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		st, sizeof(struct kvm_steal_time));
+}
+
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
+	struct kvm_steal_time *st;
+
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;

@@ -2006,29 +2027,34 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;

-	if (vcpu->arch.st.steal.version & 1)
-		vcpu->arch.st.steal.version += 1; /* first time write, random junk */
+	st = &vcpu->arch.st.steal;
+
+	if (st->version & 1) {
+		st->flags = KVM_ST_FLAG_PREEMPT;
+		st->version += 1; /* first time write, random junk */
+	}

-	vcpu->arch.st.steal.version += 1;
+	st->version += 1;

 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));

 	smp_wmb();

-	vcpu->arch.st.steal.steal += current->sched_info.run_delay -
+	st->steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;

+	st->pad[KVM_ST_PAD_PREEMPT] = 0; /* we're about to start running */
+
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));

 	smp_wmb();

-	vcpu->arch.st.steal.version += 1;
+	st->version += 1;

 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));
 }

 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
@@ -6693,6 +6719,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

 	preempt_enable();

+	update_steal_time_preempt(vcpu);
+
 	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);

 	/*
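For reference, the consumer side from Pan Xinhui's cover letter sits in the
optimistic-spin loops. A sketch of the osq_lock() case follows; the loop
shape is from kernel/locking/osq_lock.c, while the vcpu_is_preempted() call
and the node_cpu() helper are the additions the series proposes, so treat
the exact wiring as an assumption rather than merged code.

	while (!READ_ONCE(node->locked)) {
		/*
		 * Bail if we need to reschedule, or if the previous queue
		 * entry's vCPU has been scheduled out: spinning on an owner
		 * that is not running cannot succeed and only wastes cycles.
		 */
		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
			goto unqueue;

		cpu_relax();
	}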