> +static void auto_switch_lbr_msrs(struct vcpu_vmx *vmx)
> +{
> +	int i;
> +	struct perf_lbr_stack lbr_stack;
> +
> +	perf_get_lbr_stack(&lbr_stack);
> +
> +	add_atomic_switch_msr(vmx, MSR_LBR_SELECT, 0, 0);
> +	add_atomic_switch_msr(vmx, lbr_stack.lbr_tos, 0, 0);
> +
> +	for (i = 0; i < lbr_stack.lbr_nr; i++) {
> +		add_atomic_switch_msr(vmx, lbr_stack.lbr_from + i, 0, 0);
> +		add_atomic_switch_msr(vmx, lbr_stack.lbr_to + i, 0, 0);
> +		if (lbr_stack.lbr_info)
> +			add_atomic_switch_msr(vmx, lbr_stack.lbr_info + i, 0,
> +					      0);
> +	}

That will be really expensive and add a lot of overhead to every entry/exit.
perf can already context switch the LBRs on task context switch. With that
you can just switch LBR_SELECT, which is *much* cheaper because there are
far fewer context switches than exits/entries.

It implies that when KVM is running it needs to prevent perf from enabling
LBRs in the context of KVM, but that should be straightforward.

-Andi
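
For comparison, a minimal sketch of what the suggestion above could look
like, reusing add_atomic_switch_msr() from the quoted patch; the function
name and the guest/host LBR_SELECT parameters are only illustrative, not
part of the posted series:

static void auto_switch_lbr_select(struct vcpu_vmx *vmx,
				   u64 guest_lbr_select, u64 host_lbr_select)
{
	/*
	 * Switch only LBR_SELECT on entry/exit; the from/to/info/TOS stack
	 * is left to perf's per-task LBR context switch, so the per-exit
	 * cost is one MSR pair instead of the whole LBR stack.
	 */
	add_atomic_switch_msr(vmx, MSR_LBR_SELECT, guest_lbr_select,
			      host_lbr_select);
}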
On 09/25/2017 10:57 PM, Andi Kleen wrote:
>> +static void auto_switch_lbr_msrs(struct vcpu_vmx *vmx)
>> +{
>> +	int i;
>> +	struct perf_lbr_stack lbr_stack;
>> +
>> +	perf_get_lbr_stack(&lbr_stack);
>> +
>> +	add_atomic_switch_msr(vmx, MSR_LBR_SELECT, 0, 0);
>> +	add_atomic_switch_msr(vmx, lbr_stack.lbr_tos, 0, 0);
>> +
>> +	for (i = 0; i < lbr_stack.lbr_nr; i++) {
>> +		add_atomic_switch_msr(vmx, lbr_stack.lbr_from + i, 0, 0);
>> +		add_atomic_switch_msr(vmx, lbr_stack.lbr_to + i, 0, 0);
>> +		if (lbr_stack.lbr_info)
>> +			add_atomic_switch_msr(vmx, lbr_stack.lbr_info + i, 0,
>> +					      0);
>> +	}
> That will be really expensive and add a lot of overhead to every entry/exit.
> perf can already context switch the LBRs on task context switch. With that
> you can just switch LBR_SELECT, which is *much* cheaper because there are
> far fewer context switches than exits/entries.
>
> It implies that when KVM is running it needs to prevent perf from enabling
> LBRs in the context of KVM, but that should be straightforward.

I kind of have a different thought here:

1) vCPU context switching and guest side task switching are not identical.
That is, when the vCPU is scheduled out, the guest task on the vCPU may not
have run out its time slice yet, so the task will continue to run when the
vCPU is scheduled back in by the host (the LBR state wasn't saved by the
guest task when the vCPU was scheduled out in this case).

It is possible for the vCPU which runs the guest task (which uses the LBRs)
to be scheduled out, followed by a new host task being scheduled in to run
on the pCPU. It is not guaranteed that the new host task does not use the
LBR feature on the pCPU.

2) Sometimes, people may want this usage: "perf record -b
./qemu-system-x86_64 ...", which needs the LBRs to be used in KVM as well.

I think one possible optimization would be to add the LBR MSRs to the
auto-switch list when the guest requests to enable the feature, and remove
them when it is disabled. This will need to trap guest accesses to
MSR_DEBUGCTL.

Best,
Wei
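
A rough sketch of that idea, with a few assumptions: auto_switch_lbr_msrs()
is the function from the quoted patch, clear_atomic_switch_lbr_msrs() is a
hypothetical counterpart that removes the same entries again, and
DEBUGCTLMSR_LBR / GUEST_IA32_DEBUGCTL are the existing kernel and VMCS
definitions. This is not the posted implementation, only an illustration of
the trap-and-toggle approach:

static void vmx_guest_debugctl_write(struct vcpu_vmx *vmx, u64 data)
{
	/* Only pay for the LBR auto switch while the guest has LBRs enabled. */
	if (data & DEBUGCTLMSR_LBR)
		auto_switch_lbr_msrs(vmx);		/* add LBR MSRs to the switch list */
	else
		clear_atomic_switch_lbr_msrs(vmx);	/* hypothetical removal helper */

	vmcs_write64(GUEST_IA32_DEBUGCTL, data);
}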
> 1) vCPU context switching and guest side task switching are not identical.
> That is, when the vCPU is scheduled out, the guest task on the vCPU may not
> have run out its time slice yet,

Guest task lifetime has nothing to do with this. It's completely independent
of what you do here on the VCPU level.

> so the task will continue to run when the vCPU is scheduled back in by the
> host (the LBR state wasn't saved by the guest task when the vCPU was
> scheduled out in this case).
>
> It is possible for the vCPU which runs the guest task (which uses the LBRs)
> to be scheduled out, followed by a new host task being scheduled in to run
> on the pCPU. It is not guaranteed that the new host task does not use the
> LBR feature on the pCPU.

Sure, it may use the LBRs, and the normal perf context switch will switch
them and everything works fine. It's like any other per-task LBR user.

> 2) Sometimes, people may want this usage: "perf record -b
> ./qemu-system-x86_64 ...", which needs the LBRs to be used in KVM as well.

In this obscure case you can disable LBR support for the guest. The common
case is far more important.

It sounds like you didn't do any performance measurements. I expect the
performance of your current solution to be terrible.

e.g. a normal perf PMI does at least 1 MSR read and 4+ MSR writes for a
single counter. With multiple counters it gets worse. For each of those
you'll need to exit.

Adding something to the entry/exit list is similar in cost to doing explicit
RD/WRMSRs.

On Skylake we have 32*3=96 MSRs for the LBRs. So with the 5 exits and
entries, you're essentially doing 5*2*96=960 extra MSR accesses for each
PMI. MSR access is 100+ cycles at least; for writes it is far more
expensive.

-Andi
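
To make the back-of-envelope numbers above concrete, here is a small
stand-alone calculation; the LBR depth, exits per PMI, and per-access cycle
cost are the assumed figures from the mail, not measurements:

#include <stdio.h>

int main(void)
{
	int lbr_depth = 32;			/* Skylake LBR stack depth */
	int msrs_per_entry = 3;			/* FROM, TO, INFO */
	int stack_msrs = lbr_depth * msrs_per_entry;	/* 96 MSRs */
	int exits_per_pmi = 5;			/* assumed exits (and matching entries) per PMI */
	int accesses = exits_per_pmi * 2 * stack_msrs;	/* 960 extra MSR accesses */
	int cycles_per_access = 100;		/* stated lower bound */

	printf("extra MSR accesses per PMI: %d (>= %d cycles)\n",
	       accesses, accesses * cycles_per_access);
	return 0;
}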