G'Day,
On Mon, May 21, 2007 at 01:40:49PM -0700, Jeffrey Collyer wrote:
> 5 identical V440s, Solaris 10, storage on a Netapp via NFS, providing
> access to a mailstore (so 80% read, 20% write).
>
> During the day, randomly a machine will start to climb its load from the
> baseline of 2-3 up to 50-60. Under heavy loading, I've seen it go up
> to 300. All the time will be split almost 50/50 user and kernel, no idle,
> nothing in I/O (according to top).
Sounds like thundering herds of threads. The "iowait" that top reports has
been deliberately hardwired to zero, as it was a confusing and problematic
metric. I wouldn't pay much attention to top anyway - it is handy for a 10
second look, and then you use other tools (which is why I guess you are
posting to this list. :)
As a side question - is a 50-60 load average a problem? Is 300 a problem?
A good answer may be: "we have a monitoring tool that performs dummy client
activity and measures operation latency, archiving this data. We noticed
a correlation between high latency and load average".
A bad answer may be: "our baseline is 2-3, so 50-60 must be bad, right?"
> I'm suspecting NFS problems, but the Netapp and switch traffic
> graphs look clean and consistent. Nothing shows network errors, not nfsstat,
> not the switch ports, not the netapp.
It may be better to read network traffic from the server, rather than the
switch. Tim Cook and I wrote a tool called "nicstat" that prints network
interface utilization and saturation (easy to find on the Internet), which
may help.
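nicstat takes an interval argument, iostat style; from memory (check the
usage message of whichever version you grab), a five second interval is,
# nicstat 5
and the %Util and Sat columns are the ones to eyeball for an interface
that is running hot.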
> And like I mentioned, the problem moves. One day on machine 1, tomorrow on
> 4, etc. No real pattern.
>
> How would I go about trying to discover what the kernel is doing when this
> is happening?
A quick way to get started examining the kernel is,
1. Sample kernel stacks (hit Ctrl-C to end),
# dtrace -n 'profile-1001 /arg0/ { @num[stack()] = count(); }'
2. Interrupt CPU consumption (uses DTrace),
# intrstat 5
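Since your time is split roughly 50/50 user and kernel, the same trick can
be pointed at user-land as well; a sketch that aggregates by process name
and user stack is,
# dtrace -n 'profile-1001 /arg1/ { @num[execname, ustack()] = count(); }'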
> Some of the simple dtrace stuff I've tried has just shown me a lot
> of lwp_parks (the main app is heavily multithreaded, so that figures).
Heavily multithreaded, lwp_parks, and thundering herds: There is a good
chance that the load is caused by the application, in particular how the
application uses locks. The plockstat tool (which uses DTrace) can
measure lock behaviour.
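For example, to watch both lock contention and hold events on a running
process for ten seconds (a sketch - substitute the PID of your mail store
process),
# plockstat -A -e 10 -p PID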
As for "heavily" multithreaded - I'd also check if the application is
creating and destroying many worker threads, causing the thundering
herds. You could use,
# dtrace -n 'proc::: { @[probename] = count(); }'
There is some great documentation in the DTrace Guide on docs.sun.com
for what each of those proc::: probes means.
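If thread churn is the suspect, that one-liner can be narrowed down to just
thread creation and destruction, counted per process; roughly,
# dtrace -n 'proc:::lwp-create,proc:::lwp-exit { @[execname, probename] = count(); }'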
Anyhow, there are lots of things to try, including scripts in the
DTraceToolkit that will at least shed light on what is measurable from
DTrace. Explaining mysterious load averages is something that DTrace
can do tremendously well.
> Anyone got any key dtrace probes they look at for NFS or dnlc problems?
DNLC? There should be a Perl script for that in the CacheKit,
http://www.brendangregg.com/cachekit.html, or you can just eyeball the
raw values using,
# kstat -n dnlcstats 1
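The counters to compare should be "hits" and "misses" (hit rate being
hits / (hits + misses)); kstat's parseable output makes them easy to pick
out, roughly,
# kstat -p -n dnlcstats | egrep 'hits|misses'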
Anyhow, top may have thrown you down the wrong path by calling system time
"kernel". Usually, this is mostly CPU cycles in kernel code that were
requested by applications via syscalls, for example, those lwp_parks().
Although I still wouldn't rule out a performance problem with NFS in the
kernel - thank goodness DTrace can answer which it is!
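As a rough sketch for answering that, this sums the on-CPU time spent in
each syscall (vtimestamp doesn't count time blocked off-CPU),
# dtrace -n 'syscall:::entry { self->vts = vtimestamp; }
  syscall:::return /self->vts/ { @[probefunc] = sum(vtimestamp - self->vts);
  self->vts = 0; }'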
Brendan
--
Brendan
[CA, USA]