Hi all,

I am running 64-bit xen-3.4.2 on an AMD Phenom II quad-core processor, on which each core has its own performance counters. My research is to use the performance counters to track interesting events (say, L3 cache misses) for each VM, so I developed software multiplexing to support PMC tracking for individual pvops VMs. Each domain has its own logical performance counter. Every time a domain is scheduled, the performance counter for that domain is reloaded and started; when the domain (VCPU) is de-scheduled after 30 ms, the performance counter is stopped and its value is saved back to the logical counter. I modified context_switch() (in arch/x86/domain.c) by adding the statements below (the event-select value is unpacked field by field in a short sketch after the results):

#define MAX_DOMAIN_NUMBER 8

// Element 0 is reserved for dom0, element 7 is reserved for IDLE_DOMAIN_ID.
volatile uint64_t perfcounter[MAX_DOMAIN_NUMBER] = { 0, 0, 0, 0, 0, 0, 0, 0 };

// multiplexing the performance counter for more than 4 VMs
void startPMC(unsigned int pcpu_id, unsigned int domain_id)
{
    uint32_t eax, edx;

    /* reload performance counter for next dom */
    if ( domain_id == IDLE_DOMAIN_ID )
        wrmsrl(MSR_K7_PERFCTR0, perfcounter[7]);
    else
        wrmsrl(MSR_K7_PERFCTR0, perfcounter[domain_id]);

    /* L3 cache misses for accesses from a core (cpu): event 0x4E1,
     * user + kernel mode, counter enabled (bit 22) */
    edx = 0x4;
    eax = 0xE1 | (0x3 << 16) | (0x1 << (12 + pcpu_id)) | (0x7 << 8) | (0x1 << 22);
    wrmsr(MSR_K7_EVNTSEL0, eax, edx);
}

// multiplexing the performance counter for more than 4 VMs
void stopPMC(unsigned int pcpu_id, unsigned int domain_id)
{
    uint32_t eax, edx;

    /* same event selection, but with the enable bit (22) cleared */
    edx = 0x4;
    eax = (0xE1 | (0x3 << 16) | (0x1 << (12 + pcpu_id)) | (0x7 << 8)) & ~(0x1 << 22);
    wrmsr(MSR_K7_EVNTSEL0, eax, edx);

    /* save current performance counter */
    if ( domain_id == IDLE_DOMAIN_ID )
        rdmsrl(MSR_K7_PERFCTR0, perfcounter[7]);
    else
        rdmsrl(MSR_K7_PERFCTR0, perfcounter[domain_id]);
}

void context_switch(struct vcpu *prev, struct vcpu *next)
{
    unsigned int cpu = smp_processor_id();
    ......
    stopPMC(cpu, prev->domain->domain_id);
    startPMC(cpu, next->domain->domain_id);
    ......
}

In my experiment I run 6 pvops DomUs; each PV domain has one VCPU and 1 GB of allocated memory. Dom0 has four VCPUs and 1 GB of memory. The test program is very simple and listed below:

#include <stdlib.h>

int main(void)
{
    int i;
    char *buf;

    buf = (char *)malloc(400 * 1024 * 1024);
    for ( i = 0; i < 400 * 1024 * 1024; i++ )
        buf[i] = 1;
    return 0;
}

I run the test program only in DomUs and get inconsistent results across the different scenarios.

Results:

1) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1, ... and run the test program in a DomU whose vcpu is also pinned, the performance counter I read is around 6553600 (L3 cache misses). That means the DomU accessed about 400 MB of data (the cache line size is 64 B).

2) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1, ... but don't pin the DomU's vcpu, the performance counter is a little less than 6553600 (L3 cache misses).

3) If I don't pin Dom0's vcpus and still run the test program in a DomU, the performance counter I get is much less than 6553600 (L3 cache misses); the average value is 2949120.

Next I run the test program in 2 DomUs, and again the results are inconsistent.

4) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1, ... and pin each DomU vcpu to a different pcpu, the result is around 6553600 (L3 cache misses).

5) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1, ... and pin both DomU vcpus to the same pcpu, the result is much less than 6553600 (L3 cache misses).
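For reference, here is how the value written to MSR_K7_EVNTSEL0 in startPMC() decodes, together with the count case 1 is expected to produce. This is only my reading of the AMD Family 10h PERF_CTL layout (event 0x4E1 being the Northbridge "L3 Cache Misses" event, whose unit mask selects both the request types and the requesting core); it is a sketch to be checked against the BKDG, not an authoritative decode:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int pcpu_id = 0;                  /* example core                        */

        /* The same value startPMC() builds, from named fields. */
        uint32_t eax = 0xE1                        /* EventSelect[7:0]                    */
                     | (0x7 << 8)                  /* unit mask bits 2:0 -- read block    */
                                                   /*   exclusive/shared/modify           */
                     | (0x1 << (12 + pcpu_id))     /* unit mask bit 4+core -- count only  */
                                                   /*   requests from this core           */
                     | (0x3 << 16)                 /* USR + OS: user and kernel mode      */
                     | (0x1 << 22);                /* EN: counter enabled                 */
        uint32_t edx = 0x4;                        /* EventSelect[11:8] -> event 0x4E1    */

        /* Expected misses if every 64-byte line of the 400 MB buffer is
         * fetched exactly once: 400 * 1024 * 1024 / 64 = 6553600. */
        unsigned long expected = 400UL * 1024 * 1024 / 64;

        printf("EVNTSEL0: edx=%#x eax=%#x, expected L3 misses: %lu\n",
               edx, eax, expected);
        return 0;
    }

(stopPMC() writes the same value with bit 22 cleared, which stops the counter without otherwise changing the event selection.)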
To validate that each VM accesses the specified amount of data, I used a "page-fault approach", which is another way to track memory accesses, and found that each VM takes about 102400 page faults, which is equivalent to 400 MB (102400 x 4 KB). Since there is no page sharing between VMs, each VM should access the same 400 MB of data. I am not sure what is wrong with the performance counter multiplexing, so can anyone give me some suggestions?

Thank you,
Lei
University of Arizona
George Dunlap
2010-Nov-23 14:56 UTC
Re: [Xen-devel] Performance Monitoring Counter(PMC) Problem
So, as I understand it, you expect the 6.5M cache misses for cases 1, 2 and 4, but don't understand the much lower results for 3 and 5? And you suspect that the numbers you're getting aren't valid, but are the result of some mismanagement?

For one, you don't check in your {start,stop}PMC() functions whether the domain ID is less than MAX_DOMAIN_NUMBER-1.

You're aware also that context_switch() isn't called when switching from a

It might be worth adding traces for the values you're reading from and writing to the registers, and using xenalyze to see if you notice anything strange. The xenalyze source can be found here:

http://xenbits.xensource.com/ext/xenalyze.hg

You'd have to make xenalyze understand your performance trace record, but it will understand most everything else.

 -George

On 21/11/10 03:28, alex wrote:
> [original message quoted in full above; snipped]
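A minimal sketch of the bounds check suggested above; the choice to fold out-of-range IDs into slot 7 is mine, not something from the original patch (assuming IDLE_DOMAIN_ID is, as in current trees, a large reserved value, it lands in slot 7 just as before):

    /* Map the idle domain and any out-of-range domain ID onto slot 7 so
     * that perfcounter[] is never indexed past the end of the array. */
    static inline unsigned int pmc_slot(unsigned int domain_id)
    {
        return (domain_id < MAX_DOMAIN_NUMBER - 1) ? domain_id
                                                   : MAX_DOMAIN_NUMBER - 1;
    }

    /* startPMC()/stopPMC() would then use
     *     wrmsrl(MSR_K7_PERFCTR0, perfcounter[pmc_slot(domain_id)]);
     * and
     *     rdmsrl(MSR_K7_PERFCTR0, perfcounter[pmc_slot(domain_id)]);
     * in place of the explicit IDLE_DOMAIN_ID tests. */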
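And a rough sketch of the tracing George suggests. It assumes the TRACE_nD() macros in xen/include/xen/trace.h are available in 3.4 and that trace data words are 32-bit; the event ID below is purely illustrative (check xen/include/public/trace.h for a free class/slot), and xenalyze would need a matching decoder added:

    #include <xen/types.h>
    #include <xen/trace.h>

    /* Illustrative event ID only -- pick one that does not collide with the
     * classes already defined in xen/include/public/trace.h. */
    #define TRC_PMC_STOP 0x00099001u

    /* Call from stopPMC() right after the rdmsrl(): record which domain and
     * pcpu the sample belongs to, with the 64-bit counter split into two
     * 32-bit words so the post-processor can reassemble it. */
    static void trace_pmc_stop(unsigned int pcpu_id, unsigned int domain_id,
                               uint64_t ctr)
    {
        TRACE_4D(TRC_PMC_STOP, domain_id, pcpu_id,
                 (uint32_t)ctr, (uint32_t)(ctr >> 32));
    }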
George Dunlap
2010-Nov-23 15:04 UTC
Re: [Xen-devel] Performance Monitoring Counter(PMC) Problem
On Tue, Nov 23, 2010 at 2:56 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> You're aware also that context_switch() isn't called when switching from a

Oops -- please ignore this line, I thought I'd deleted it...

 -George