John White
2010-Oct-18 17:26 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
Hello Folks,

A while back (say 3 weeks ago) we started noticing extremely high loads (load average around 300 at times) on our OSSs when in production and serving IO. This cluster was, at the time, on 1.8.2 (we have since upgraded to 1.8.4 but the problem remains). The load increases fairly predictably as clients generate IO, but even 2 clients can produce a load average above 5.00. An identical file system of ours does not exhibit this behavior (it stays below a load average of 1.00 under even the heaviest IO load).

I've looked around bugzilla and haven't found anything. We've disabled heartbeat on the off-chance that it was generating the load (it's not), and we've tried a different client transport (o2ib->tcp); this did not solve the issue. There doesn't appear to be any specific non-kernel thread causing the high load. The only info in dmesg/syslog pertains to sporadic client evictions or sporadic slow setattr under heavy IO load (we've since tuned the number of OST threads). We're basically out of ideas to try.

For reference, this is a 1 MDS / 4 OSS cluster backed by a DDN 9900 couplet (15 tiers, 1:1 LUN mapping) running the lustre.org RPM-built kernel for 1.8.4. The MDS/OSSs are Dell R710s and the MDT is a Dell MD1000. Is this a common problem, or should a bug be filed? Any info available upon request. Thanks for your time.

----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720
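A note on the "no specific non-kernel thread" observation: on Linux the load average counts threads in uninterruptible (D) sleep as well as runnable ones, so a load of 300 can come entirely from blocked kernel service threads. A minimal sketch of how one might confirm that on an OSS, using only standard procps tools (nothing Lustre-specific):

    # count threads currently in D state, grouped by command name
    ps -eLo state,comm | awk '$1 == "D"' | sort | uniq -c | sort -rn

A large count of Lustre OST service threads in that output would point at threads stuck waiting on back-end I/O rather than at any userspace process.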
Paul Nowoczynski
2010-Oct-18 17:43 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
I wonder if there's some type of fault in the I/O path which is increasing the latency of individual I/Os? Something like this could affect the load, especially considering the number of kernel threads on the OST.

paul
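One way to test this hypothesis would be to watch per-device service latency on each OSS while the load is high. A rough sketch, assuming the sysstat package is installed and the DDN LUNs show up as sd* block devices (the device name below is a placeholder):

    # extended device statistics every 5 seconds; watch await (ms per request)
    # and avgqu-sz (requests queued on the device)
    iostat -x 5

    # or focus on a single suspect LUN
    iostat -x sdb 5

If one tier's await sits far above the others under the same load, that extra latency would keep OST service threads blocked longer and inflate the load average.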
Peter Kjellstrom
2010-Oct-18 17:49 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
On Monday 18 October 2010, John White wrote:
> The load increases fairly predictably as clients generate IO but even 2
> clients can produce a load avg above 5.00.

Does this impact performance, or does it only show up as an unexpectedly high number on the OSSes?

/Peter
Wojciech Turek
2010-Oct-18 18:55 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
Is this filesystem nearly full? Fragmentation can decrease back-end performance.

Also check the disk stats on the DDN; maybe you have a slow disk in one of your tiers.

Wojciech

--
Wojciech Turek
Senior System Architect
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
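Both checks are easy to make concrete. A sketch, with the client mount point as a placeholder (the exact histogram layout in brw_stats varies a little between Lustre versions, so treat the second command as an assumption):

    # per-OST space usage as seen from a client
    lfs df -h /mnt/lustre

    # on each OSS, the block I/O histograms hint at fragmentation: bulk writes
    # split into many small or discontiguous fragments suggest a fragmented OST
    lctl get_param obdfilter.*.brw_stats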
John White
2010-Oct-18 18:58 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
On Oct 18, 2010, at 10:49 AM, Peter Kjellstrom wrote:
> Does this impact performance or does it only show up as an unexpectedly
> high number on the OSSes?

We have gotten reports of scaling issues that we had not experienced before this problem cropped up. Throughput is certainly less predictable than before, but we are able to hit the same peaks.
John White
2010-Oct-18 18:59 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
We've thoroughly examined the back-end storage and the connections between the OSSs and the back-end; there are no faults as of now. Previously our couplet had lost cache sync, but that has since been resolved and the load issue remains.

On Oct 18, 2010, at 10:43 AM, Paul Nowoczynski wrote:
> I wonder if there's some type of fault in the I/O path which is increasing
> the latency of individual I/Os?
John White
2010-Oct-18 19:00 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
Far, far from it. All OSTs are at most 23% full. There appear to be no lagging disks.

On Oct 18, 2010, at 11:55 AM, Wojciech Turek wrote:
> Is this filesystem nearly full? Fragmentation can decrease back end
> performance.
Lawrence Sorrillo
2010-Oct-19 15:01 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
You should examine your kernel I/O scheduler. The deadline scheduler sometimes helps in these kinds of circumstances.

~Lawrence
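A minimal sketch of checking and changing the elevator on an OSS (the device name is a placeholder, and the echo only lasts until reboot):

    # the active scheduler is shown in brackets
    cat /sys/block/sdb/queue/scheduler

    # switch that device to deadline on the fly
    echo deadline > /sys/block/sdb/queue/scheduler

To make the change persistent, the kernel can instead be booted with elevator=deadline.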
Jason Hill
2010-Oct-19 15:21 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
Also, something to look at if you aren't having any luck with other avenues would be the debug log with RPC trace enabled. We do something like:

    echo +rpctrace > /proc/sys/lnet/debug
    lctl dk > /dev/null
    sleep 60
    lctl dk > /tmp/rpctrace
    echo -rpctrace > /proc/sys/lnet/debug

You'll need to know what all the opcodes are (that's available in the code, I believe), but that will give you a definite breakdown of every action that's happening. You may also want to look at +neterror, etc. More info is available from the manual or lustre.org, I'm sure.

-Jason

-------------------------------------------------
// Jason J. Hill //
// HPC Systems Administrator //
// National Center for Computational Sciences //
// Oak Ridge National Laboratory //
// e-mail: hilljj at ornl.gov //
// Phone: (865) 576-5867 //
-------------------------------------------------
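A hedged side note on the recipe above: under heavy RPC traffic the default debug buffer can wrap within the 60-second window, so enlarging it before sampling may help. Assuming the usual debug_mb tunable is present on this release:

    # grow the kernel debug buffer to 256 MB before capturing the trace
    echo 256 > /proc/sys/lnet/debug_mb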
Johann Lombardi
2010-Oct-19 16:21 UTC
[Lustre-discuss] help tracking down extremely high loads on OSSs
On Tue, Oct 19, 2010 at 11:21:25AM -0400, Jason Hill wrote:
> Also something to look at if you aren't having any luck with other avenues
> would be the debug log with RPC trace enabled.

FYI, a request history for each service is also available through /proc:

    # lctl get_param ost.OSS.*.req_history

For more details, see:
http://wiki.lustre.org/manual/LustreManual18_HTML/LustreDebugging.html#50593922_47732

HTH
Johann
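Since OST thread counts were already tuned earlier in the thread, the same /proc tree can also show how many service threads have been started and how long requests sit queued. A sketch, assuming the standard 1.8 layout for the ost_io service:

    # bulk I/O service thread counts
    lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max

    # per-service request statistics (queue depth, wait and service times)
    lctl get_param ost.OSS.ost_io.stats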