In the current xentrace configuration, xentrace buffers are all allocated in a single contiguous chunk, and then divided among logical cpus, one buffer per cpu. The size of an allocatable chunk is fairly limited, in my experience about 128 pages (512KiB). As the number of logical cores increases, this means a much smaller maximum trace buffer per cpu; on my dual-socket quad-core nehalem box with hyperthreading (16 logical cpus), that comes to 8 pages per logical cpu.

The attached patch addresses this issue by allocating per-cpu buffers separately. This allows larger trace buffers; however, it requires an interface change to xentrace, which is why I'm making a Request For Comments. (I'm not expecting this patch to be included in the 4.0 release.)

The old interface to get trace buffers was fairly simple: you ask for the info, and it gives you:
* the mfn of the first page in the buffer allocation
* the total size of the trace buffer

The tools then mapped [mfn,mfn+size), calculated where the per-pcpu buffers were, and went on to consume records from them.

-- Interface --

The proposed interface works as follows.

* XEN_SYSCTL_TBUFOP_get_info still returns an mfn and a size (so no changes to the library). However, the mfn now points to a trace buffer info area (t_info), allocated once at boot time. The trace buffer info area contains the mfns of the per-pcpu buffers.

* The t_info struct contains an array of "offset pointers", one per pcpu. Each is the offset, within the t_info data area, of that pcpu's array of mfns. So logically, the layout looks like this:

    struct {
        int16_t tbuf_size;        /* Number of pages per cpu */
        int16_t offset[NR_CPUS];  /* Offset into the t_info area of each cpu's mfn array */
        uint32_t mfn[NR_CPUS][TBUF_SIZE];
    };

  So if NR_CPUS were 16 and TBUF_SIZE were 32, we'd have:

    struct {
        int16_t tbuf_size;    /* Number of pages per cpu */
        int16_t offset[16];   /* Offset into the t_info area of each cpu's mfn array */
        uint32_t p0_mfn_list[32];
        uint32_t p1_mfn_list[32];
        ...
        uint32_t p15_mfn_list[32];
    };

* So the new way to map trace buffers is as follows (a rough C sketch follows at the end of this message):
  + Call TBUFOP_get_info to get the mfn and size of the t_info area, and map it.
  + Get the number of cpus.
  + For each cpu:
    - Calculate the start of that cpu's mfn list thus:
        uint32_t *mfn_list = ((uint32_t *)t_info) + t_info->offset[cpu];
    - Map t_info->tbuf_size mfns from mfn_list using xc_map_foreign_batch().

In the current implementation, the t_info size is fixed at 2 pages, allowing about 2000 pages total to be mapped (two pages give 8192 bytes; after the tbuf_size field and the offset array, that leaves room for roughly 2000 32-bit mfns). For a 32-way system, this would allow up to 63 pages per cpu (about 256KiB per cpu). Bumping this up to 4 pages would allow even larger systems if required.

The current implementation also allocates each trace buffer contiguously, since that's the easiest way to get contiguous virtual address space. But this interface allows Xen the flexibility, in the future, to allocate buffers in several chunks if necessary, without having to change the interface again.

-- Implementation notes --

The t_info area is allocated once at boot. Trace buffers are allocated either at boot (if a parameter is passed) or when TBUFOP_set_size is called. Due to the complexity of tracking pages mapped by dom0, unmapping or resizing trace buffers is not supported.

I introduced a new per-cpu spinlock guarding trace data and buffers. This allows per-cpu data to be safely accessed and modified without racing with in-progress tracing events. The per-cpu spinlock is grabbed whenever a trace event is generated; but in the (very very very) common case, the lock should be in the cache already. (A minimal sketch of this locking pattern also follows below.)
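To make the mapping steps above concrete, here is a rough consumer-side sketch. It is illustrative only: it assumes the t_info area has already been mapped via TBUFOP_get_info (the sysctl plumbing is elided), that offsets are in 32-bit words from the start of t_info, and that the trace pages are Xen heap pages (hence DOMID_XEN, as the existing xentrace tools do); error handling is omitted.

    #include <stdint.h>
    #include <sys/mman.h>   /* PROT_READ */
    #include <xenctrl.h>    /* xc_map_foreign_batch(), xen_pfn_t, DOMID_XEN */

    /* Logical view of the header of the t_info area described above. */
    struct t_info {
        int16_t tbuf_size;   /* number of pages per cpu */
        int16_t offset[];    /* per-cpu offsets into this area, in uint32_t units */
    };

    /* Map one cpu's trace buffer; returns a pointer to tbuf_size pages,
     * or NULL on failure. */
    static void *map_cpu_buffer(int xc_handle, struct t_info *tinfo, int cpu)
    {
        uint32_t *mfn_list = ((uint32_t *)tinfo) + tinfo->offset[cpu];
        int i, npages = tinfo->tbuf_size;
        xen_pfn_t pfns[npages];

        /* Copy into a scratch array: xc_map_foreign_batch() may rewrite
         * entries of its pfn array to flag per-page errors. */
        for ( i = 0; i < npages; i++ )
            pfns[i] = mfn_list[i];

        return xc_map_foreign_batch(xc_handle, DOMID_XEN, PROT_READ,
                                    pfns, npages);
    }

A consumer would call this once per cpu after reading the cpu count, then consume records from each mapped buffer as before.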
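And a minimal sketch of the per-cpu locking pattern from the implementation notes, assuming the usual Xen primitives (DEFINE_PER_CPU, this_cpu, spin_lock_irqsave); the identifiers t_lock and trace_event are made up for illustration and are not the actual names in the patch:

    /* One spinlock per cpu, guarding that cpu's trace data and buffer. */
    static DEFINE_PER_CPU(spinlock_t, t_lock);

    void trace_event(uint32_t event)
    {
        unsigned long flags;

        if ( !tb_init_done )
            return;

        /* Taken on every trace event, but normally uncontended and
         * already hot in this cpu's cache, so the common case is cheap. */
        spin_lock_irqsave(&this_cpu(t_lock), flags);
        /* ... format and insert the record into this cpu's buffer ... */
        spin_unlock_irqrestore(&this_cpu(t_lock), flags);
    }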
Feedback welcome.

-George
Keir, would you mind commenting on this new design in the next few days? If it looks like a good design, I'd like to do some more testing and get this into our next XenServer release.

-George

On Thu, Jan 7, 2010 at 3:13 PM, George Dunlap <dunlapg@umich.edu> wrote:
> [full proposal quoted above; snipped]
Oh, I'm fine with it. I wasn't sure about putting it in for 4.0.0, but actually plenty is going in for rc2. What do you think?

 -- Keir

On 20/01/2010 17:38, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> [snipped]
How long between rc2 and the expected release (if no other candidates are considered)? It's more of a debugging feature, so it's not going to screw over any production systems if it's got some subtle bugs. (The "tb_init_done" flag that turns it on or off is exactly the same.) I could try to put it through its paces this week and early next week, and if nothing turns up, it's probably fine to go in.

It will definitely require a tools rebuild if anyone's using xentrace, which people may not expect. :-)

-George

Keir Fraser wrote:
> [snipped]
Final release is still a few weeks away. It should probably go in for rc2 then.

 -- Keir

On 20/01/2010 18:06, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> [snipped]