Tracing Fans,

I know it's been a long time in coming but the CPU Performance Counter (CPC) provider is almost here! The code is currently in for review and a proposed architecture document is attached here for review.

Any and all feedback/questions on the proposed implementation are welcome.

Thanks.

Jon.

[Attachment: cpc-provider-onepager.txt - http://mail.opensolaris.org/pipermail/dtrace-discuss/attachments/20080714/995c1b4b/attachment.txt]
Hi Jon,

Looks good!

I guess the thing which most people are going to pick up on is the sampling granularity. I just wonder whether you should always have to specify this, or whether you should pick up some sensible (safe) default; I guess there are pros and cons, the downside being that the default might change underneath you.

Also your third example; maybe I'm nitpicking but does this really do what it says:

    3. L2 cache misses, by function, generated by any running executables
       called 'brendan' on an AMD platform.

        cpc:::BU_fill_req_missed_L2-all-0x7-10000
        /execname == "brendan"/
        {
                @[ufunc(arg1)] = count();
        }

The filter's applied in the D, after the probe has fired, so surely the probe firing indicates there have been 10000 global L2 cache misses, and it just so happens this time the probe has fired whilst "brendan" is on CPU? It doesn't necessarily mean there have been 10000 firings whilst "brendan" was on CPU.

I could have misunderstood, it could be nitpicking, and I know it's a subtlety, but AIUI this is different to how cputrack (say) would work.

Regards,

--
Philip Beevers
Chief Architect - Fidessa
philip.beevers at fidessa.com
phone: +44 1483 206571
Hi Phil,

> Looks good!

Thanks!

> I guess the thing which most people are going to pick up on is the
> sampling granularity. I just wonder whether you should always have to
> specify this, or whether you should pick up some sensible (safe)
> default; I guess there are pros and cons, the downside being that the
> default might change underneath you.

I prefer a user to always have to specify an event frequency without having a default. I'm a big believer (in the tracing world anyway) in people getting what they ask for and having to ask for it to get it! I just like a consumer having to be explicit with the parameters they are requesting, and I shy away from hidden defaults as I think they can lead to confusion in what is quite a confusing area anyway. Do you think it really hurts to *have* to specify an event frequency?

> Also your third example; maybe I'm nitpicking but does this really do
> what it says:

No, you are correct. It's possibly misleading wording. See below.

> 3. L2 cache misses, by function, generated by any running executables
>    called 'brendan' on an AMD platform.
>
>     cpc:::BU_fill_req_missed_L2-all-0x7-10000
>     /execname == "brendan"/
>     {
>             @[ufunc(arg1)] = count();
>     }
>
> The filter's applied in the D, after the probe has fired, so surely the
> probe firing indicates there have been 10000 global L2 cache misses, and
> it just so happens this time the probe has fired whilst "brendan" is on
> CPU? It doesn't necessarily mean there have been 10000 firings whilst
> "brendan" was on CPU.

Absolutely correct. We are just sampling and the thread belonging to process "brendan" just happened to generate the 10000th L2 cache miss. How many of the previous 9999 events of this type were caused by "brendan" is anyone's guess. The wording isn't incorrect, just misleading. The overflow is generated by the executable "brendan" - it may not have generated all of the events that led to that overflow though.

The sampling technique used here is fundamental to the operation of the 'cpc' provider and has to be borne in mind when it's used. As with any sampling technique, the more samples the better. I probably need to change the wording on that example or, at least, be more verbose.

> I could have misunderstood, it could be nitpicking, and I know it's a
> subtlety, but AIUI this is different to how cputrack (say) would work.

It's a completely valid point and, yes, it's different to how cputrack works.

Thanks.

Jon.
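(As a quick illustration of the sampling point above - a sketch only, reusing the same AMD event: dropping the predicate and aggregating on execname shows how the overflow firings distribute across everything that happens to be on CPU, which is a better first step than assuming a single process owns the misses.)

        cpc:::BU_fill_req_missed_L2-all-0x7-10000
        {
                /* record which executable was on CPU at each overflow */
                @[execname] = count();
        }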
Hey, Jon,

On Mon, Jul 14, 2008 at 5:42 AM, Jon Haslam <Jonathan.Haslam at sun.com> wrote:

> 3. Co-existence with existing tools
>
> The provider has priority over per-LWP libcpc usage (i.e. cputrack)
> for access to counters. In the same manner as cpustat, enabling probes
> causes all existing per-LWP counter contexts to be invalidated. As long as
> these enablings remain active, the counters will remain unavailable to
> cputrack-type consumers.
>
> Only one of cpustat and DTrace may use the counter hardware at any one time.
> Ownership of the counters is given on a first-come, first-served basis.

I'm curious, how does DTrace interact with DTrace in this situation? Specifically, if two DTrace invocations (either separate D scripts or even two clauses in the same D script) specify the same hardware counter in a probe, will they both see (effectively) the same data, or will the first invocation be the only one to see useful data?

Chad
Hi Chad,

>> Only one of cpustat and DTrace may use the counter hardware at any one time.
>> Ownership of the counters is given on a first-come, first-served basis.
>
> I'm curious, how does DTrace interact with DTrace in this situation?
> Specifically, if two DTrace invocations (either separate D scripts or
> even two clauses in the same D script) specify the same hardware
> counter in a probe, will they both see (effectively) the same data, or
> will the first invocation be the only one to see useful data?

The key thing to remember here is that a probename encodes everything needed to program the performance counter hardware. Therefore, if you have two probes with the same event name but a different mode or event frequency then that is two different hardware configurations. If the hardware you're on can't tell which counter overflowed in the face of multiple active counters (like AMD or UltraSPARC) then I only allow a single enabling.

For example, a script that attempts to enable 'cpc:::IC_miss-user-10000' and 'cpc:::IC_miss-kernel-10000' on an AMD box will succeed for the first enabling but fail the second and the consumer will exit:

        # dtrace -n 'cpc:::IC_miss-user-10000' -n 'cpc:::IC_miss-kernel-10000'
        dtrace: description 'cpc:::IC_miss-user-10000' matched 1 probe
        dtrace: failed to enable 'cpc:::IC_miss-kernel-10000': Failed to enable probe
        #

If, however, two separate consumers attempted to enable different IC_miss incantations then the first one to execute would succeed and the second would be denied. However, if you were on a Niagara T2 based system then you could enable two cpc probes with different hardware configurations (the T2 only has two counters but it can tell you which one overflowed).

I tried to communicate this in section "B3 - Probe Availability". Give it another read and let me know if it needs improving at all (or if the above wasn't a correct understanding of your question!).

Jon.
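(Purely for illustration of the T2 case: a single consumer enabling two probes with different events, for example

        # dtrace -n 'cpc:::Instr_cnt-user-50000' -n 'cpc:::DC_miss-user-10000'

should be able to keep both enablings active on hardware that can attribute the overflow to a specific counter. The event names above are only examples, not taken from the proposal; the available events depend on the platform, and on AMD or UltraSPARC the second enabling would fail as shown above.)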
Hey, Jon,

On Tue, Jul 15, 2008 at 11:15 AM, Jon Haslam <Jonathan.Haslam at sun.com> wrote:

> The key thing to remember here is that a probename encodes everything
> needed to program the performance counter hardware. Therefore, if you
> have two probes with the same event name but a different mode or event
> frequency then that is two different hardware configurations. If the
> hardware you're on can't tell which counter overflowed in the face of
> multiple active counters (like AMD or UltraSPARC) then I only allow a
> single enabling.
>
> For example, a script that attempts to enable 'cpc:::IC_miss-user-10000'
> and 'cpc:::IC_miss-kernel-10000' on an AMD box will succeed for the
> first enabling but fail the second and the consumer will exit:
>
>         # dtrace -n 'cpc:::IC_miss-user-10000' -n 'cpc:::IC_miss-kernel-10000'
>         dtrace: description 'cpc:::IC_miss-user-10000' matched 1 probe
>         dtrace: failed to enable 'cpc:::IC_miss-kernel-10000': Failed to enable probe
>         #
>
> I tried to communicate this in section "B3 - Probe Availability".
> Give it another read and let me know if it needs improving at all
> (or if the above wasn't a correct understanding of your question!).

Okay, this makes sense. On a closer reading of section "B3 - Probe Availability", I can see that the information is there, it just wasn't clear on the first pass. The example you give above makes it very clear, both the case that you're describing and the failure mode, so I might suggest adding that example to that section for clarity.

Thanks,

Chad
On Jul 14, 2008, at 3:42 AM, Jon Haslam wrote:

> Tracing Fans,
>
> I know it's been a long time in coming but the CPU Performance
> Counter (CPC) provider is almost here! The code is currently in
> for review and a proposed architecture document is attached here
> for review.

Very cool :-).

> 4. Limiting Overflow Rate
>
> So as to not saturate the system with overflow interrupts, a default minimum
> of 10000 is imposed on the value that can be specified for the 'count'
> part of the probename (refer to section 'B1 - Probe Format'). This can be
> reduced explicitly by altering the 'dcpc_min_overflow' kernel variable with
> mdb(1) or by modifying the dcpc.conf driver configuration file and unloading
> and reloading the dcpc driver module.

You may need a "per type" of overflow limit. A 3-4GHz CPU will generate 300k-400k events per second when tracking cycles.

Some of the more exotic cache behavior counters might fire only 1-2k events per second, though. When you combine that with the need for the counter to fire while inside the application you are interested in, a 10k event minimum seems too small.

Does a minimum limit actually buy much here? Consider the case of:

        cpc:::FR_retired_x86_instr_w_excp_intr-user-10000

vs

        pid123:::

In this case, the cpc probe is actually less capable! If the pid provider can generate one event per instruction, without a safety limit/check, does the CPC provider really need a safety limit?

James M
On Jul 15, 2008, at 1:27 PM, James McIlree wrote:

> Some of the more exotic cache behavior counters might fire only 1-2k events
> per second, though. When you combine that with the need for the counter to
> fire while inside the application you are interested in, a 10k event minimum
> seems too small.

Err, "seems too large."

James M
Hi James,

> You may need a "per type" of overflow limit. A 3-4GHz CPU will generate
> 300k-400k events per second when tracking cycles.
>
> Some of the more exotic cache behavior counters might fire only 1-2k events
> per second, though. When you combine that with the need for the counter to
> fire while inside the application you are interested in, a 10k event minimum
> seems too small.

(I saw your other email and I figured that was what you meant :-) ).

Yeah, I had an off-line discussion with Phil Beevers about this. I understand that having a global limit for all types of events kind of sucks. I looked at implementing a mechanism for per-event overflow limits but I couldn't come up with something that I was happy with, and I made a trade-off that I think is reasonable for the first implementation. I may well revisit this at a later date though. The 10000 minimum may be too conservative, however, and may need revisiting (see below).

> Does a minimum limit actually buy much here?
>
> Consider the case of:
>
>         cpc:::FR_retired_x86_instr_w_excp_intr-user-10000
>
> vs
>
>         pid123:::
>
> In this case, the cpc probe is actually less capable! If the pid
> provider can generate one event per instruction, without a safety
> limit/check, does the CPC provider really need a safety limit?

But what happens if someone (by accident probably) does:

        cpc:::FR_retired_x86_instr_w_excp_intr-user-1

and we have userland code going at full tilt generating in the order of 1 billion insts/sec? We can't sustain that level of interrupt delivery. Even if you were sampling instructions at a lesser rate but you were measuring kernel cycles at the same time at a high frequency (assuming the processor was capable of accommodating multiple events at the same time) then extreme badness would ensue. Just measuring kernel instructions with a small count (~500) can hang a machine owing to us spending all our time processing interrupts (as they generate interrupts themselves...).

There may be a case for having a value lower than 10000 but I was being extremely conservative. I'll do some more experimentation in the next few days and let you know what I find. Thanks for raising it.

Jon.
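(Rough arithmetic on the rates being discussed: userland code retiring ~1 billion instructions/sec with a count of 1 would mean on the order of 1 billion overflow interrupts/sec, whereas a count of 10000 brings that down to roughly 100,000 interrupts/sec.)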
> Yeah, I had an off-line discussion with Phil Beevers about this.
> I understand that having a global limit for all types of
> events kind of sucks.

Just for the record, my concern was basically the same as James': that 10000 might be too small a limit to be safe for some events and too big a limit to be applicable for others.

My idea for addressing this was for the callback rate to be adaptive: you want to smooth out the interrupt rate, so for a given event it could start with a base frequency value and tune it to make sure the interrupts are happening not too quickly and not too slowly. Thus the number of events before an interrupt might not be constant (i.e. you wouldn't expect it to be specified as part of the probe name), so you need the number of events as an argument to the probe.

As well as being more complex to implement, I'd be the first to admit it's not that intuitive to script against. It has some nice properties, though - it means you don't have to manually tune to deal with the quantisation effects, but you don't kill yourself with interrupts.

For now I think Jon's approach is about the best compromise. It has the added advantage of being analogous with the performance counter monitoring in the Sun Studio analyzer, so at least some segment of the user population should understand what it's doing!

--
Philip Beevers
Chief Architect - Fidessa
philip.beevers at fidessa.com
phone: +44 1483 206571
I'd like to suggest adding something to the error to indicate that the problem is that another probe is already enabled that conflicts with this one. The current message "Failed to enable probe" is pretty much the same as "it didn't work" and will be doubly confusing because the problem may or may not be transient.

Brian Utterback
Hi Brian,

> I'd like to suggest adding something to the error to indicate that the
> problem is that another probe is already enabled that conflicts with
> this one. The current message "Failed to enable probe" is pretty much
> the same as "it didn't work" and will be doubly confusing because the
> problem may or may not be transient.

Yes, I agree that a "resource unavailable" type error may be more illuminating in that case than the usual failed-to-enable error. Somewhere in the murky past I had that on a todo list but it seems to have fallen out of view. Thanks for raising it. I'll take a look.

Jon.
Hi James,

I've done some further experimentation and I think I'm going to keep a minimum overflow rate but have it at 5000. The reason for this is that it gets fairly easy to drive a system off the rails by setting values much lower when measuring kernel cycles. I use cycles as that's obviously the fastest incrementing of all events. At rates not far under 5000 we get wedged up in interrupt code, servicing regular interrupts and piling in and out of overflow handler code. Without this limit it's just too easy for some well-meaning but keyboard-challenged user to put the lights out.

As you state in your original example though, we could employ a significantly lower overflow value for userland code, and we can even work with values in single digits. However, it goes without saying that with overflow values this low the forward progress of an application tends to zero. If someone wants to do this then that's fine and they can tune the minimum overflow rate down manually (a sketch of doing so is below).

If, when people start using the provider, I get feedback that this approach needs modifying then I'll happily revisit, but I'll implement it as I stated here for the first pass.

Jon.
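(For reference, a sketch of tuning the minimum down with mdb(1), assuming dcpc_min_overflow is a 32-bit kernel variable as the one-pager describes:

        # echo "dcpc_min_overflow/W 0t100" | mdb -kw

Per the one-pager, the value can also be set via the dcpc.conf driver configuration file followed by unloading and reloading the dcpc module; the exact property name used there isn't shown here.)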
> Tracing Fans,
>
> I know it's been a long time in coming but the CPU Performance
> Counter (CPC) provider is almost here! The code is currently in
> for review and a proposed architecture document is attached here
> for review.

Many thanks to all those that gave me feedback on this proposal. A revised version is attached which we'll hopefully submit shortly. The additions I've done to the original are really just to try and be a bit more verbose about the behaviour of the provider. For those that are interested, additions were made to sections "B1 - Probe Format" and "B3 - Probe Availability". Also, the default minimum overflow rate has been lowered from 10000 to 5000.

If any of the changes make you violently ill, please let me know.

Jon.

[Attachment: cpc-provider-onepager.txt - http://mail.opensolaris.org/pipermail/dtrace-discuss/attachments/20080723/dc46290c/attachment.txt]
Thanks for implementing this feature. A few comments simply on the presentation of the ideas.

1) Section 2 talks about args[0] and args[1], yet the examples use arg0 and arg1. This may be just cosmetic, but it might be worth being consistent. Also, should the description be specific about what data type is given for args[0] and args[1], or is that implicit by saying they are program counter values?

2) Example 3 (and hence 4) sounds unclear to me given the discussion last week. It still leaves open the interpretation that all the L2 cache misses that are being counted are caused by the executable "brendan". Given the discussion last week, it would be more clear to describe it as a sampling of what the brendan executable was doing each time the L2 cache miss counter hit the target. It might be useful to add a second clause to the example 3 script to count events that happened when some other executable was executing. This makes the "brendan" counts a more effective drill-down into the total set of L2 cache miss events. Did I understand last week's discussion correctly?

        cpc:::BU_fill_req_missed_L2-all-0x7-10000
        /execname == "brendan"/
        {
                @[ufunc(arg1)] = count();
        }

        cpc:::BU_fill_req_missed_L2-all-0x7-10000
        /execname != "brendan"/
        {
                @["OtherExecutable"] = count();
        }

3) It might also be helpful to have an example that keeps a running total of some performance counter, and then periodically samples that counter during some other event of interest. I.e. we could use the dtrace tools to keep a running count of L2 cache misses, and then wake up each several msec and sample both who is running and what the current counts are. (In such an application, we would just be using dtrace as a quick way of enabling and disabling the specific counters we want to track.) A rough sketch of this appears below.

Peter
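(A rough sketch of the idea in 3) - note the aggregation counts overflow firings, each representing roughly 10000 misses with this probe, rather than an exact miss total:)

        cpc:::BU_fill_req_missed_L2-all-0x7-10000
        {
                /* accumulate overflow firings per executable */
                @l2miss[execname] = count();
        }

        profile:::tick-1sec
        {
                /* print and clear the running counts every second */
                printf("\n%Y\n", walltimestamp);
                printa(@l2miss);
                trunc(@l2miss);
        }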
Hi Peter,

> 1) Section 2 talks about args[0] and args[1], yet the examples use
> arg0 and arg1. This may be just cosmetic, but it might be worth
> being consistent. Also, should the description be specific about
> what data type is given for args[0] and args[1], or is that implicit
> by saying they are program counter values?

Oops. I shouldn't refer to the typed argument array here as the arguments are not presented through it. That should be arg0 and arg1. Thanks.

> 2) Example 3 (and hence 4) sounds unclear to me given the discussion
> last week. It still leaves open the interpretation that all the L2
> cache misses that are being counted are caused by the executable
> "brendan". ...

I didn't alter these examples because I added a paragraph in section "B1 - Probe Format" which contains an example to cover this off. I explicitly mention the fact that the events may not all be generated by the executable. Also bear in mind that this document is for architecture review and it's a different thing to the user guide, where I'll be a lot more verbose.

> 3) It might also be helpful to have an example that keeps a running
> total of some performance counter, and then periodically samples that
> counter during some other event of interest. ...

The user guide chapter will have more and different examples. I'll have a play around with that idea but I'm not sure how useful it is to correlate values that we've been counting with a piece of data such as the current onproc thread. Still, you never know till you've tried it.

Thanks.

Jon.
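(To make the arg0/arg1 point concrete, a minimal sketch of kernel-side attribution, assuming - as the one-pager's examples suggest - that arg0 carries the kernel program counter and arg1 the userland program counter:)

        cpc:::IC_miss-kernel-10000
        {
                /* attribute each overflow to the kernel function executing at the time */
                @[func(arg0)] = count();
        }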