In the current xentrace configuration, xentrace buffers are all allocated in a single contiguous chunk, and then divided among logical cpus, one buffer per cpu. The size of an allocatable chunk is fairly limited, in my experience about 128 pages (512KiB). As the number of logical cores increases, this means a much smaller maximum trace buffer per cpu; on my dual-socket quad-core nehalem box with hyperthreading (16 logical cpus), that comes to 8 pages per logical cpu.

The attached patch addresses this issue by allocating per-cpu buffers separately. This allows larger trace buffers; however, it requires an interface change to xentrace, which is why I'm making a Request For Comments. (I'm not expecting this patch to be included in the 4.0 release.)

The old interface to get trace buffers was fairly simple: you ask for the info, and it gives you:
* the mfn of the first page in the buffer allocation
* the total size of the trace buffer

The tools then mapped [mfn,mfn+size), calculated where the per-pcpu buffers were, and went on to consume records from them.

-- Interface --

The proposed interface works as follows.

* XEN_SYSCTL_TBUFOP_get_info still returns an mfn and a size (so no changes to the library). However, the mfn now points to a trace buffer info area (t_info), allocated once at boot time. The trace buffer info area contains the mfns of the per-pcpu buffers.

* The t_info struct contains an array of "offset pointers", one per pcpu. Each is the offset, within the t_info data area, of that pcpu's array of mfns. So logically, the layout looks like this:

    struct {
        int16_t tbuf_size;        /* Number of pages per cpu */
        int16_t offset[NR_CPUS];  /* Offset into the t_info area of each cpu's mfn array */
        uint32_t mfn[NR_CPUS][TBUF_SIZE];
    };

  So if NR_CPUS were 16 and TBUF_SIZE were 32, we'd have:

    struct {
        int16_t tbuf_size;    /* Number of pages per cpu */
        int16_t offset[16];   /* Offset into the t_info area of each cpu's mfn array */
        uint32_t p0_mfn_list[32];
        uint32_t p1_mfn_list[32];
        ...
        uint32_t p15_mfn_list[32];
    };

* So the new way to map trace buffers is as follows (a rough C sketch follows at the end of this message):
  + Call TBUFOP_get_info to get the mfn and size of the t_info area, and map it.
  + Get the number of cpus.
  + For each cpu:
    - Calculate the start of that cpu's mfn list thus:
        uint32_t *mfn_list = ((uint32_t *)t_info) + t_info->offset[cpu];
    - Map t_info->tbuf_size mfns from mfn_list using xc_map_foreign_batch().

In the current implementation, the t_info size is fixed at 2 pages, allowing about 2000 pages total to be mapped (two pages give 8192 bytes; after the tbuf_size field and the offset array, that leaves room for roughly 2000 32-bit mfns). For a 32-way system, this would allow up to 63 pages per cpu (about 256KiB per cpu). Bumping this up to 4 pages would allow even larger systems if required.

The current implementation also allocates each trace buffer contiguously, since that's the easiest way to get contiguous virtual address space. But this interface allows Xen the flexibility, in the future, to allocate buffers in several chunks if necessary, without having to change the interface again.

-- Implementation notes --

The t_info area is allocated once at boot. Trace buffers are allocated either at boot (if a parameter is passed) or when TBUFOP_set_size is called. Due to the complexity of tracking pages mapped by dom0, unmapping or resizing trace buffers is not supported.

I introduced a new per-cpu spinlock guarding trace data and buffers. This allows per-cpu data to be safely accessed and modified without racing with in-progress tracing events. The per-cpu spinlock is grabbed whenever a trace event is generated; but in the (very very very) common case, the lock should be in the cache already. (A minimal sketch of this locking pattern also follows below.)
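To make the mapping steps above concrete, here is a rough consumer-side sketch. It is illustrative only: it assumes the t_info area has already been mapped via TBUFOP_get_info (the sysctl plumbing is elided), that offsets are in 32-bit words from the start of t_info, and that the trace pages are Xen heap pages (hence DOMID_XEN, as the existing xentrace tools do); error handling is omitted.

    #include <stdint.h>
    #include <sys/mman.h>   /* PROT_READ */
    #include <xenctrl.h>    /* xc_map_foreign_batch(), xen_pfn_t, DOMID_XEN */

    /* Logical view of the header of the t_info area described above. */
    struct t_info {
        int16_t tbuf_size;   /* number of pages per cpu */
        int16_t offset[];    /* per-cpu offsets into this area, in uint32_t units */
    };

    /* Map one cpu's trace buffer; returns a pointer to tbuf_size pages,
     * or NULL on failure. */
    static void *map_cpu_buffer(int xc_handle, struct t_info *tinfo, int cpu)
    {
        uint32_t *mfn_list = ((uint32_t *)tinfo) + tinfo->offset[cpu];
        int i, npages = tinfo->tbuf_size;
        xen_pfn_t pfns[npages];

        /* Copy into a scratch array: xc_map_foreign_batch() may rewrite
         * entries of its pfn array to flag per-page errors. */
        for ( i = 0; i < npages; i++ )
            pfns[i] = mfn_list[i];

        return xc_map_foreign_batch(xc_handle, DOMID_XEN, PROT_READ,
                                    pfns, npages);
    }

A consumer would call this once per cpu after reading the cpu count, then consume records from each mapped buffer as before.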
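And a minimal sketch of the per-cpu locking pattern from the implementation notes, assuming the usual Xen primitives (DEFINE_PER_CPU, this_cpu, spin_lock_irqsave); the identifiers t_lock and trace_event are made up for illustration and are not the actual names in the patch:

    /* One spinlock per cpu, guarding that cpu's trace data and buffer. */
    static DEFINE_PER_CPU(spinlock_t, t_lock);

    void trace_event(uint32_t event)
    {
        unsigned long flags;

        if ( !tb_init_done )
            return;

        /* Taken on every trace event, but normally uncontended and
         * already hot in this cpu's cache, so the common case is cheap. */
        spin_lock_irqsave(&this_cpu(t_lock), flags);
        /* ... format and insert the record into this cpu's buffer ... */
        spin_unlock_irqrestore(&this_cpu(t_lock), flags);
    }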
Feedback welcome.

-George
Keir, would you mind commenting on this new design in the next few days? If it looks like a good design, I'd like to do some more testing and get this into our next XenServer release.

-George

On Thu, Jan 7, 2010 at 3:13 PM, George Dunlap <dunlapg@umich.edu> wrote:
> [full proposal quoted above; snipped]
Oh, I'm fine with it. I wasn't sure about putting it in for 4.0.0, but actually plenty is going in for rc2. What do you think?

 -- Keir

On 20/01/2010 17:38, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> [snipped]
How long between rc2 and the expected release (if no other candidates are considered)? It's more of a debugging feature, so it's not going to screw over any production systems if it's got some subtle bugs. (The "tb_init_done" flag that turns it on or off is exactly the same.) I could try to put it through its paces this week and early next week, and if nothing turns up, it's probably fine to go in.

It will definitely require a tools rebuild if anyone's using xentrace, which people may not expect. :-)

-George

Keir Fraser wrote:
> [snipped]
Final release is still a few weeks away. It should probably go in for rc2 then.

 -- Keir

On 20/01/2010 18:06, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> [snipped]