thr3ads.net - dtrace discuss - [dtrace-discuss] I/O bottleneck Root cause identification w Dtrace ?? (controller or IO bus) [Mar 2007]

Todd Jobson

2007-Mar-14 23:01 UTC

[dtrace-discuss] I/O bottleneck Root cause identification w Dtrace ?? (controller or IO bus)

DTrace and Performance Teams,

I have the following I/O performance questions. (I'm already savvy with
lockstat and the pre-DTrace utilities for performance analysis, but I need
details on pinpointing I/O bottlenecks at the controller or I/O bus.)

Q.A> Determining I/O saturation bottlenecks (beyond service times and
kernel contention)

I'm trying to quantify the % of resources consumed on the I/O side of
things. For the CPU, memory, kernel, etc., I can see where there is
contention and utilization. For the network NICs, I know the general
thresholds for pegging a "ce", etc., and what it looks like from lockstat,
interrupts, and kernel icsw when that happens.

However, for the I/O subsystem, how can I determine when either a
controller or the I/O bus is saturated? (What specifically would the
lockstat contention look like, where would it be spinning, and on which
system calls?) I know that for I/O contention in general you'll see blocked
kernel threads accumulating in vmstat, then likely some overhead for
smtx/csw/migr as fallout from the underlying high iostat wait or service
times for specific devices. But how do I identify whether it's the
controller, the I/O bus, or something beyond the controller (SAN, etc.)?

Q.B>

On that same topic, is there any reference that you're aware of that shows
the thresholds for specific storage or HBA drivers, the way the network team
has RX pkts/sec interrupt thresholds for the "ce" and "ge" NICs? Or is there
some easy tool or way to determine the % of I/O bandwidth that's being
consumed/available? (I think I already know the answer to this, but anything
is easier than trying to extrapolate from the bytes written/read and applying
a huge fudge factor for overhead, ACKs, etc. to do this by hand.)

Q.C>

Lastly, it would be a bonus if you also had any thoughts on identifying and
linking the hottest locks under contention with specific devices and/or
files. I didn't see anything canned in DTrace for this.

Any information on this would be greatly appreciated, even if it's a
handful of URL links or a cc to specific engineers, teams, or aliases within
Sun that can provide me this info.

Thanks in Advance,

Todd Jobson
Sr. Enterprise Architect
Sun Microsystems
todd.jobson at sun.com
908-391-2165

Jim Mauro

2007-Mar-15 00:52 UTC

IO saturation implies running at or near theoretical line speed.
For a 1Gbit FCAL channel, or 1Gbit NIC, let's call it running
at 100MBytes/sec, more or less. Scale up for 2 and 4 Gbit FCAL.

So your first test is to determine what the data rate through a given
controller is. This can be done with iostat(1M) or the io provider. You need
to do some math based on the controller numbers of the disks, summing the
per-disk rates for each controller. For networks, Brendan's nicstat tool
will tell you all you need to know.
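
As a rough sketch of that per-controller math (Python, not part of the
original post): the function below sums per-disk read and write rates by
controller, taking the controller from the leading cN of a cNtNdN device
name. The sample figures are invented for illustration, and the input is
assumed to be kr/s and kw/s values copied from one iostat interval.

```python
import re
from collections import defaultdict


def controller_throughput(samples):
    """Sum per-device read+write KB/s into per-controller MB/s.

    samples: iterable of (device_name, kr_per_s, kw_per_s) tuples, for
    example values copied from the kr/s and kw/s columns of one iostat
    interval. Device names are assumed to look like c1t0d0 or c0d0.
    """
    totals = defaultdict(float)
    for device, kr, kw in samples:
        match = re.match(r"(c\d+)[td]", device)
        if match is None:
            continue  # skip names that do not start with a controller number
        totals[match.group(1)] += (kr + kw) / 1024.0
    return dict(totals)


# Hypothetical sample data, not measurements from a real system.
samples = [
    ("c1t0d0", 20480.0, 10240.0),
    ("c1t1d0", 30720.0, 0.0),
    ("c2t0d0", 5120.0, 5120.0),
]

per_controller = controller_throughput(samples)
for ctrl in sorted(per_controller):
    print(f"{ctrl}: {per_controller[ctrl]:.1f} MB/s")
```

Each per-controller total can then be set against the nominal line speed of
that channel, such as the 100MBytes/sec figure above for 1Gbit FCAL.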

The other saturation level is IOPS (for disks) and pkts/sec (for networks).
More often than not, workloads are IOPS or pkt/sec constrained, and hit that
threshold before the data bandwidth threshold. Obviously, it is very
workload dependent. Again, use iostat(1M), the io provider, netstat(1M) or
nicstat to determine what the sustained rates are.

Bandwidth thresholds tend to be well documented. IOPS and pkt rates
are not. I have a 3 page paper I found years ago that provides limits for
ethernet at different speeds (but I don't have it handy - Google "ethernet
speeds" and you should find it). The number 140,000 IOPS for a 1Gbit
FCAL sticks in my head, but don't hold me to that. And of course the actual
IOPS limit is more often than not throttled by available spindle IOPS, not
the wire.
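
To illustrate why the IOPS ceiling is usually reached before the bandwidth
ceiling, here is a small Python sketch (not from the original post) that
converts each limit into an IOPS figure for a given average I/O size and
reports the lowest one. The 140,000 wire IOPS figure is the unverified number
quoted above, and 150 IOPS per spindle is purely a placeholder assumption.

```python
def first_ceiling(avg_io_kb, wire_mb_s, wire_iops, spindles, iops_per_spindle):
    """Return (limiting factor, max IOPS) for a given average I/O size.

    Every limit is expressed as an IOPS figure so they can be compared:
    the bandwidth limit is the line speed divided by the I/O size.
    """
    ceilings = {
        "wire bandwidth": wire_mb_s * 1024.0 / avg_io_kb,
        "wire IOPS": float(wire_iops),
        "spindle IOPS": float(spindles * iops_per_spindle),
    }
    name = min(ceilings, key=ceilings.get)
    return name, ceilings[name]


# Small random I/O: 8 KB transfers against 10 disks (assumed 150 IOPS each).
print(first_ceiling(8, 100, 140000, 10, 150))

# Large sequential I/O: 1 MB transfers against 100 disks.
print(first_ceiling(1024, 100, 140000, 100, 150))
```

With these assumed numbers the small-I/O case is capped by the spindles long
before the wire, while the large-I/O case is capped by wire bandwidth.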

Contention in the IO stack due to large numbers of threads hitting the
disk or network will typically manifest itself as long sleep times, and
you'll see the kernel IO code path in a kernel profile, potentially with
some hot locks.

In summary, there is not (that I know of) an out-of-the-box easy
way to do this, but it all starts with understanding the hardware
config, so at least you have your head around the theoretical limits.
Then collect the actual data, bandwidth and rates, and determine if you're
close enough to a limit to suspect it as a bottleneck source.
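
The final comparison can be sketched in a few lines of Python (again, not
from the original post). The 80% cutoff is an arbitrary assumption chosen for
illustration, not a documented threshold, and the observed values are made
up.

```python
def utilization(observed, limit):
    """Observed rate as a percentage of the theoretical limit."""
    return 100.0 * observed / limit


def near_limit(observed, limit, threshold_pct=80.0):
    """True when the observed rate is at or above threshold_pct of the limit."""
    return utilization(observed, limit) >= threshold_pct


# Hypothetical observations: 60 MB/s on a 100 MB/s channel,
# and 1800 IOPS against an estimated 2000 IOPS of spindle capacity.
bandwidth_pct = utilization(60.0, 100.0)
iops_pct = utilization(1800.0, 2000.0)
print(f"bandwidth: {bandwidth_pct:.0f}%  suspect={near_limit(60.0, 100.0)}")
print(f"IOPS:      {iops_pct:.0f}%  suspect={near_limit(1800.0, 2000.0)}")
```

The same check works for bandwidth and for IOPS or pkts/sec, as long as the
observed value and the limit are in the same units.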

Brendan's nicstat makes life easier for networks. There are other tools
in his DTraceToolkit (brendangregg.com/dtrace.html#DTraceToolkit),
like iotop and rwtop, that will help on the disk side in terms of
correlating disk traffic to processes.

/jim

Benoit

2007-Mar-15 01:19 UTC

And to add to what Jim said, the Ortera Atlas tool is all you ever
dreamed of for analyzing I/O contention.
Dave Fisk invented it. Adrian Cockcroft and I were some of the first
alpha users. We are using it in the Sun benchmark center.

See ortera.com

Regards,

Benoit Chaffanjon

Sun Solution Center
Customer benchmarking
12 Network Circle
Menlo Park CA 94025 - USA

Office : 1 - (650) 786-6177
Cell    : 1 - (510) 396-0104
Blog   : blogs.sun.com/MrBenchmark

"De la Tactique dans la Pratique"
-Bourvil


Todd Jobson

2007-Jul-31 18:06 UTC

Thanks Guys,

I was actually hoping to find some internal specs on saturation points for
specific I/O drivers, or more specific details on using busstat for various
backplanes and/or HBAs.

This is for my "sys_diag" performance profiling and analysis utility
(snapshot, correlation, analysis, and reporting, with a single color-coded
.html report with dashboard, TOC, etc.).

I had already gone the route of quantifying the NIC and HBA peak and average
throughput numbers in sys_diag, and I know some of the NIC saturation points
(inbound RX pkts/sec, which also relates to the interrupt issues and tuning
required for "ce", etc.), but I was hoping to get some real quantified
"thresholds" to report against.

I have been in discussions with Brendan Gregg (DTraceToolkit author) and got
his permission a while back to embed some of his DTrace scripts within
sys_diag, and then added some of my own to round things out.

If you haven't already checked out sys_diag, it is my one-stop performance
and config snapshot utility, and it honestly does save a lot of time in
profiling a system. That was the need in the field that it filled with a
single ksh script. It's available from both BigAdmin and SunFreeware.com at:

sun.com/bigadmin/jsp/descFile.jsp?url=descAll/sys_diag__solaris_c
and
sunfreeware.com

More comprehensive discussion and write-ups on the use of sys_diag can be
found at my Sun blog:
blogs.sun.com/toddjobson

And if you're internal to Sun, there's an article in this week's
Technocrat.

Thanks for the responses, guys. I'll check out the other tool you
mentioned.

Todd


--
This message posted from opensolaris.org
