Hi all,

I'd like to get some Lustre info from my OSS/MDSs through SNMP. So I'm reading the Lustre manual, and it indicates [1] that the lustresnmp.so file should be provided by the "base Lustre RPM". But it's not. :) At least not in the 1.6.4.1 RHEL4 x86_64 RPMs.

So I was wondering if there was any plan to include the SNMP module back in future RPM versions?

Thanks,
--
Kilian

[1] http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-15-1.html
Hi Kilian,

I was asking that same question a few months ago. I can send you my 1.6.2 spec file for reference ... That version also did not bundle the SNMP library, so I ended up rebuilding the whole set of Lustre RPMs to get what I needed, and then just dropped the DSO in place.

I'm curious as to what metrics you see as useful -- I wasn't sure what to look for, so while I installed the module, I haven't yet thought of good things to ask of it.

cheers,
Klaus

On 3/7/08 5:01 PM, "Kilian CAVALOTTI" <kilian at stanford.edu> did etch on stone tablets:

> Hi all,
>
> I'd like to get some Lustre info from my OSS/MDSs through SNMP. So I'm
> reading the Lustre manual, and it indicates [1] that the lustresnmp.so
> file should be provided by the "base Lustre RPM". But it's not. :) At
> least not in the 1.6.4.1 RHEL4 x86_64 RPMs.
>
> So I was wondering if there was any plan to include the SNMP module back
> in future RPM versions?
>
> Thanks,
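P.S. For anyone else going down the rebuild route, the end state looks roughly like this. The build itself was mostly a matter of having net-snmp-devel installed when rebuilding the regular RPM set; the install path and dlmod module name below are what I recall the manual suggesting, so double-check them against your own tree:

# once the rebuilt RPMs are installed, the DSO should land somewhere like
#   /usr/lib64/lustre/snmp/lustresnmp.so   (lib/ instead of lib64/ on 32-bit)
# then load it into net-snmp via /etc/snmp/snmpd.conf:
dlmod lustresnmp /usr/lib64/lustre/snmp/lustresnmp.so

# restart the agent
service snmpd restart

# quick sanity check that the Lustre MIB answers (community string is an example;
# walking the whole enterprises subtree avoids having to remember the exact OID)
snmpwalk -v1 -c public localhost enterprises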
On Friday 07 March 2008 05:01:11 pm Kilian CAVALOTTI wrote:
> So I was wondering if there was any plan to include the SNMP module
> back in future RPM versions?

And in addition to that, is there any plan to add more stats through this SNMP module (the kind we find in /proc/fs/lustre/{llite,ost,mdt}/.../stats)? That'd be an excellent starting point to collect metrics and remotely monitor a Lustre setup from a central location.

Thanks,
--
Kilian
Hi Klaus,

On Friday 07 March 2008 05:52:51 pm Klaus Steden wrote:
> I was asking that same question a few months ago.

Yes, I remember you weren't overwhelmed by answers. :\

> I can send you my 1.6.2 spec file for reference ... That version also
> did not bundle the SNMP library, so I ended up rebuilding the whole
> set of Lustre RPMs to get what I needed, and then just dropped the
> DSO in place.

That's exactly what I did, finally.

> I'm curious as to what metrics you see as useful -- I wasn't sure
> what to look for, so while I installed the module, I haven't yet
> thought of good things to ask of it.

So, from what I've seen in the MIB, the current SNMP module mainly reports version numbers and free space information. I think it would also be useful to get "activity metrics", the same kind of information which is in /proc/fs/lustre/llite/*/stats on clients (so we can see reads/writes and fs operation rates), in /proc/fs/lustre/obdfilter/*/stats on OSSes and in /proc/fs/lustre/mds/*/stats on MDSes.

Actually, all the /proc/fs/lustre/*/**/stats files could be useful, but I guess which precise metric is the most useful heavily depends on what you want to see. :)

Cheers,
--
Kilian
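P.S. For anyone who hasn't looked at those files yet: they're plain-text counters, one per line, so even a plain cat from cron would be a start. Roughly (field layout from memory of my 1.6 systems, so treat it as indicative):

# on a client; the obdfilter/ and mds/ equivalents live on the OSS/MDS nodes
cat /proc/fs/lustre/llite/*/stats

# each line is roughly of the form
#   <counter> <count> samples [<unit>]
# with byte counters (read_bytes, write_bytes) also carrying min/max/sum columns,
# while open/close/getattr and friends are simple request counts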
On Mon, 2008-03-10 at 13:11 -0700, Kilian CAVALOTTI wrote:
> I think it would also be useful to get "activity metrics", the same kind
> of information which is in /proc/fs/lustre/llite/*/stats on clients (so
> we can see reads/writes and fs operation rates),
> in /proc/fs/lustre/obdfilter/*/stats on OSSes and
> in /proc/fs/lustre/mds/*/stats on MDSes.

I can't disagree with that, especially as Lustre installations get bigger and bigger. Apart from writing custom monitoring tools, there aren't a lot of "pre-emptive" monitoring options available. There are a few tools out there like collectl (never seen it, just heard about it) and LLNL have one on sourceforge, but I can certainly see the attraction of being able to monitor Lustre on your servers with the same tools as you are using to monitor the servers' health themselves.

> Actually, all the /proc/fs/lustre/*/**/stats files could be useful, but I
> guess which precise metric is the most useful heavily depends on what
> you want to see. :)

This could wind up becoming a lustre-devel@ discussion, but for now, it would be interesting to extend the interface(s) we use to introduce /proc (and what will soon be its replacement/augmentation) stats files so that they are automagically provided via SNMP.

You know, given the discussion in this thread:
http://lists.lustre.org/pipermail/lustre-devel/2008-January/001475.html
now would be a good time for the community (that perhaps might want to contribute) desiring SNMP access to get their foot in the door. Ideally, you get SNMP into the generic interface and then SNMP access to all current and future variables comes more or less free.

That all said, there are some /proc files which provide a copious amount of information, like brw_stats for instance. I don't know how well that sort of thing maps to SNMP, but having an SNMP manager watching something as useful as brw_stats for trends over time could be quite interesting.

b.
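P.S. In the meantime (and just thinking out loud), net-snmp's "pass" mechanism would let a site expose a few of those counters without touching Lustre at all. An untested sketch follows; the OID is a made-up placeholder under the net-snmp enterprise, and the stats field layout is from memory:

# /etc/snmp/snmpd.conf
pass .1.3.6.1.4.1.8072.9999 /usr/local/bin/lustre-snmp-pass.sh

#!/bin/sh
# /usr/local/bin/lustre-snmp-pass.sh -- answers a single GET for the total
# bytes read by the obdfilters on this OSS ("pass" calls us with -g <oid>,
# and expects three output lines: oid, type, value)
OID=".1.3.6.1.4.1.8072.9999.1"
[ "$1" = "-g" ] && [ "$2" = "$OID" ] || exit 0
# stats lines look like: read_bytes <count> samples [bytes] <min> <max> <sum>
SUM=$(awk '/^read_bytes/ {s += $7} END {printf "%d\n", s}' /proc/fs/lustre/obdfilter/*/stats)
echo "$OID"
echo "counter"
echo "$SUM"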
Hi Brian,

On Monday 10 March 2008 03:04:33 pm Brian J. Murrell wrote:
> I can't disagree with that, especially as Lustre installations get
> bigger and bigger. Apart from writing custom monitoring tools,
> there aren't a lot of "pre-emptive" monitoring options available.
> There are a few tools out there like collectl (never seen it, just
> heard about it)

collectl is very nice, but like dstat and such, it has to run on each and every host. It can provide its results via sockets though, so it could be used as a centralized monitoring system for a Lustre installation. And it provides detailed statistics too:

# collectl -sL -O R
waiting for 1 second sample...

# LUSTRE CLIENT DETAIL: READAHEAD
#Filsys   Reads ReadKB Writes WriteKB Pend Hits Misses NotCon MisWin LckFal Discrd ZFile ZerWin RA2Eof HitMax
home        100    192      0       0    0    0    100      0      0      0      0     0    100      0      0
scratch     100    192      0       0    0    0    100      0      0      0      0     0    100      0      0
home        102   6294     23     233    0    0     87      0      0      0      0     0     87      0      0
scratch     102   6294     23     233    0    0     87      0      0      0      0     0     87      0      0
home         95    158     22     222    0    0     81      0      0      0      0     0     81      0      0
scratch      95    158     22     222    0    0     81      0      0      0      0     0     81      0      0

# collectl -sL -O M
waiting for 1 second sample...

# LUSTRE CLIENT DETAIL: METADATA
#Filsys   Reads ReadKB Writes WriteKB Open Close GAttr SAttr Seek Fsync DrtHit DrtMis
home          0      0      0       0    0     0     0     0    0     0      0      0
scratch       0      0      0       0    0     0     2     0    0     0      0      0
home          0      0      0       0    0     0     0     0    0     0      0      0
scratch       0      0      0       0    0     0     0     0    0     0      0      0
home          0      0      0       0    0     0     0     0    0     0      0      0
scratch       0      0      0       0    0     0     1     0    0     0      0      0

# collectl -sL -O B
waiting for 1 second sample...

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#Ost            Rds RdK 1K 2K 4K 8K 16K 32K 64K 128K 256K Wrts WrtK 1K 2K 4K 8K 16K 32K 64K 128K 256K
home-OST0007      0   0  0  0  0  0   0   0   0    0    0    0    0  0  0  0  0   0   0   0    0    0
scratch-OST0007   0   0  9  0  0  0   0   0   0    0    0   12 3075  9  0  0  0   0   0   0    0    3
home-OST0007      0   0  0  0  0  0   0   0   0    0    0    0    0  0  0  0  0   0   0   0    0    0
scratch-OST0007   0   0  1  0  0  0   0   0   0    0    0    1    2  1  0  0  0   0   0   0    0    0
home-OST0007      0   0  0  0  0  0   0   0   0    0    0    0    0  0  0  0  0   0   0   0    0    0
scratch-OST0007   0   0  1  0  0  0   0   0   0    0    0    1    2  1  0  0  0   0   0   0    0    0

> and LLNL have one on sourceforge,

Last time I checked, it only supported 1.4 versions, but it's been a while, so I'm probably a bit behind.

> but I can certainly
> see the attraction of being able to monitor Lustre on your servers
> with the same tools as you are using to monitor the servers' health
> themselves.

Yes, that'd be a strong selling point.

> This could wind up becoming a lustre-devel@ discussion, but for now, it
> would be interesting to extend the interface(s) we use to
> introduce /proc (and what will soon be its replacement/augmentation)
> stats files so that they are automagically provided via SNMP.

That sounds like the way to proceed, indeed.

> You know, given the discussion in this thread:
> http://lists.lustre.org/pipermail/lustre-devel/2008-January/001475.html
> now would be a good time for the community (that perhaps might
> want to contribute) desiring SNMP access to get their foot in the
> door. Ideally, you get SNMP into the generic interface and then SNMP
> access to all current and future variables comes more or less free.

Oh, thanks for pointing this out. It looks like major underlying changes are coming. I think I'll subscribe to the lustre-devel ML to try to follow them.

> That all said, there are some /proc files which provide a copious
> amount of information, like brw_stats for instance. I don't know how
> well that sort of thing maps to SNMP, but having an SNMP manager
> watching something as useful as brw_stats for trends over time could
> be quite interesting.

Add some RRD graphs to keep historical variations, and you've got the all-in-one Lustre monitoring tool we sysadmins are all waiting for. ;)

Cheers,
--
Kilian
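P.S. To be clear about what I mean by "add some RRD graphs": nothing fancier than the standard rrdtool create/update/graph cycle, fed from whatever collects the stats. A minimal sketch; the file names, step and counter names are just examples:

# one RRD per OST, 10s step, keeping ~1 day of raw samples and ~1 year of hourly averages
rrdtool create scratch-OST0007.rrd --step 10 \
    DS:read_kb:COUNTER:30:0:U DS:write_kb:COUNTER:30:0:U \
    RRA:AVERAGE:0.5:1:8640 RRA:AVERAGE:0.5:360:8760

# feed it from the collector (read_kb/write_kb being the cumulative counters)
rrdtool update scratch-OST0007.rrd N:$read_kb:$write_kb

# graph the last day
rrdtool graph scratch-OST0007.png --start -1d \
    DEF:r=scratch-OST0007.rrd:read_kb:AVERAGE \
    DEF:w=scratch-OST0007.rrd:write_kb:AVERAGE \
    LINE1:r#0000ff:"read KB/s" LINE1:w#ff0000:"write KB/s"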
On Mon, 2008-03-10 at 16:58 -0700, Kilian CAVALOTTI wrote:
> Hi Brian,

Hi.

> Oh, thanks for pointing this out. It looks like major underlying changes
> are coming. I think I'll subscribe to the lustre-devel ML to try to
> follow them.

You could do that, but I suspect that if you want to see those developments include SNMP access to the stats, you are going to have to be more proactive than just following the current development. I don't have any more insight than what's in that thread about the plans underway, but I'd be very surprised if they currently include SNMP. I could be wrong, but I suspect that if you want to see SNMP availability you'd have to get active either with participating in the design and perhaps some hacking or voicing your desires through your sales channel.

> Add some RRD graphs to keep historical variations, and you've got the
> all-in-one Lustre monitoring tool we sysadmins are all waiting for. ;)

Heh. Indeed.

b.
On Tuesday 11 March 2008 01:52:33 am Brian J. Murrell wrote:
> You could do that, but I suspect that if you want to see those
> developments include SNMP access to the stats, you are going to have
> to be more proactive than just following the current development. I
> don't have any more insight than what's in that thread about the
> plans underway, but I'd be very surprised if they currently include
> SNMP. I could be wrong, but I suspect that if you want to see SNMP
> availability you'd have to get active

Gotcha. Bug #15197, "Feature request: expand SNMP scope"

> either with participating in
> the design and perhaps some hacking

I'm not sure I can be of any help in this area, unfortunately. But I've seen that some users expressed the same kind of need and rolled up their sleeves :)
http://lists.lustre.org/pipermail/lustre-devel/2008-January/001504.html

> or voicing your desires through
> your sales channel.

That I can do. :)

Thanks for the advice,
--
Kilian
Kilian CAVALOTTI <kilian at ...> writes:
> Hi Brian,
>
> On Monday 10 March 2008 03:04:33 pm Brian J. Murrell wrote:
> > I can't disagree with that, especially as Lustre installations get
> > bigger and bigger. Apart from writing custom monitoring tools,
> > there aren't a lot of "pre-emptive" monitoring options available.
> > There are a few tools out there like collectl (never seen it, just
> > heard about it)
>
> collectl is very nice, but like dstat and such, it has to run on each and
> every host. It can provide its results via sockets though, so it could
> be used as a centralized monitoring system for a Lustre installation.

I'm the author of collectl and so have a few opinions of my own 8-) Nice to see someone realized there's a method to my madness. I've been very frustrated by all the tools out there that come close to solving the distributed management problem and always seem to leave something out, be it handling the level of detail one needs to get their job done, or providing an ability to look at historical data, or supporting what I consider fine-grained monitoring, that is taking a sample once a second [or less! yes, collectl does support that]. My solution was to focus on one thing and do it well - collect local data and provide a rational methodology for archiving it and displaying it, which also includes being able to plot it. OK, guess that was more than one. But I had also hoped that by supplying hooks, others could build on my work rather than start all over again with yet another tool that stands alone.

> And it provides detailed statistics too:
>
> [collectl readahead, metadata and per-OST output snipped]
>
> > and LLNL have one on sourceforge,
>
> Last time I checked, it only supported 1.4 versions, but it's been a while,
> so I'm probably a bit behind.

Not sure if you're talking about collectl, but in fact I have only just begun to look at 1.6.4.3 and the good news is I only found a few minor things I need to fix. I do plan on releasing a new version of collectl in a couple of days.

> > but I can certainly
> > see the attraction of being able to monitor Lustre on your servers
> > with the same tools as you are using to monitor the servers' health
> > themselves.
>
> Yes, that'd be a strong selling point.

Not only is that a strong point, it's the main point! When you have multiple tools trying to track multiple resources and something goes wrong, how are you expected to do any correlation? To that point, collectl even tracks time to the msec, aligns to a whole second within a msec or two, and even does it across a cluster - assuming of course your clocks are synchronized.

> > This could wind up becoming a lustre-devel@ discussion, but for now, it
> > would be interesting to extend the interface(s) we use to
> > introduce /proc (and what will soon be its replacement/augmentation)
> > stats files so that they are automagically provided via SNMP.

Given my comments about focusing on one thing and doing it well, I could see if someone really wanted to export lustre data with snmp they could always use the --sexpr switch to collectl, telling it to write every sample as an s-expression which an snmp module could then pick up. That way you only have one piece of code worrying about collecting/parsing lustre data.

One other very key point here - at least I think it is - I often see cluster-based tools collecting data once every minute or even more. Alternatively they might choose to only take a small sample of data, the common theme being any large volume of data will overwhelm a management station. Right! For that reason, I could see one exporting a subset of data upstream while keeping more detail, and perhaps at a finer time granularity, locally for debug purposes.

> That sounds like the way to proceed, indeed.
>
> > You know, given the discussion in this thread:
> > http://lists.lustre.org/pipermail/lustre-devel/2008-January/001475.html
> > now would be a good time for the community (that perhaps might
> > want to contribute) desiring SNMP access to get their foot in the
> > door. Ideally, you get SNMP into the generic interface and then SNMP
> > access to all current and future variables comes more or less free.
>
> Oh, thanks for pointing this out. It looks like major underlying changes
> are coming. I think I'll subscribe to the lustre-devel ML to try to
> follow them.
>
> > That all said, there are some /proc files which provide a copious
> > amount of information, like brw_stats for instance. I don't know how
> > well that sort of thing maps to SNMP, but having an SNMP manager
> > watching something as useful as brw_stats for trends over time could
> > be quite interesting.
>
> Add some RRD graphs to keep historical variations, and you've got the
> all-in-one Lustre monitoring tool we sysadmins are all waiting for. ;)

Be careful here. You can certainly stick some data into an rrd, but certainly not all of it, especially if you want to collect a lot of it at a reasonable frequency. If you want accurate detail plots, you've gotta go to the data stored on each separate system. I just don't see any way around this, at least not yet...

As a final note, I've put together a tutorial on using collectl in a lustre environment and have uploaded a preliminary copy at http://collectl.sourceforge.net/Tutorial-Lustre.html in case anyone wants to preview it before I link it into the documentation. If nothing else, look at my very last example where I show what you can see by monitoring lustre at the same time as your network interface. Did I also mention that collectl is probably one of the few tools that can monitor your Infiniband traffic as well?

Sorry for being so long winded...
-mark
On Thursday 20 March 2008 01:15:04 pm Mark Seger wrote:
> not sure if you're talking about collectl

No, I wasn't; I was referring to the Lustre Monitoring Tool (LMT) from LLNL.

> Be careful here. You can certainly stick some data into an rrd, but
> certainly not all of it, especially if you want to collect a lot of
> it at a reasonable frequency. If you want accurate detail plots,
> you've gotta go to the data stored on each separate system. I just
> don't see any way around this, at least not yet...

Yes, you're absolutely right. Given its intrinsic multi-scale nature, an RRD is well suited for keeping historical data on large time scales. This could allow a very convenient graphical overview of the different system metrics, but would be pointless for debugging purposes, where you do need fine-grained data. That's where collectl is the most useful for me.

But what about both? I don't see any reason why collectl couldn't provide high-frequency, accurate data to diagnose problems locally, and at the same time aggregate less precise values into RRDs for global visualization of multi-host systems.

> As a final note, I've put together a tutorial on using collectl in a
> lustre environment and have uploaded a preliminary copy at
> http://collectl.sourceforge.net/Tutorial-Lustre.html in case anyone
> wants to preview it before I link it into the documentation.
> If nothing else, look at my very last example where I show what you
> can see by monitoring lustre at the same time as your network
> interface.

Very good, thanks for this. The readahead experiment is insightful.

> Did I also mention that collectl is probably one of the few tools
> that can monitor your Infiniband traffic as well?

That's why it rocks. :)

Now the only thing which still makes me want to use other monitoring software is the ability to get a global view. Centralized data collection and easy graphing (RRD feeding) are still what I need most of the time.

Cheers,
--
Kilian
>> Be careful here. You can certainly stick some data into an rrd, but
>> certainly not all of it, especially if you want to collect a lot of
>> it at a reasonable frequency. If you want accurate detail plots,
>> you've gotta go to the data stored on each separate system. I just
>> don't see any way around this, at least not yet...
>
> Yes, you're absolutely right. Given its intrinsic multi-scale nature, an
> RRD is well suited for keeping historical data on large time scales.
> This could allow a very convenient graphical overview of the different
> system metrics, but would be pointless for debugging purposes, where
> you do need fine-grained data. That's where collectl is the most useful
> for me.
>
> But what about both? I don't see any reason why collectl couldn't
> provide high-frequency, accurate data to diagnose problems locally, and
> at the same time aggregate less precise values into RRDs for global
> visualization of multi-host systems.

I agree 1000%... The mental model I've been building in my head is to tell collectl to log its data locally and also write out an s-expression using --sexpr. Then a daemon can periodically pick out the data it's interested in, at whatever frequency it's interested in, and forward it on up the line.

>> As a final note, I've put together a tutorial on using collectl in a
>> lustre environment and have uploaded a preliminary copy at
>> http://collectl.sourceforge.net/Tutorial-Lustre.html in case anyone
>> wants to preview it before I link it into the documentation.
>> If nothing else, look at my very last example where I show what you
>> can see by monitoring lustre at the same time as your network
>> interface.
>
> Very good, thanks for this. The readahead experiment is insightful.

It was to me when I first encountered the problem.

>> Did I also mention that collectl is probably one of the few tools
>> that can monitor your Infiniband traffic as well?
>
> That's why it rocks. :)
>
> Now the only thing which still makes me want to use other monitoring
> software is the ability to get a global view. Centralized data
> collection and easy graphing (RRD feeding) are still what I need most
> of the time.

I hear you here too. That's the main reason I put in the ability to generate data in plottable format. That's as close as I'm willing to go to providing a graphing capability in collectl itself. I'm trying real hard to bound its scope as I figure it already has more than enough switches... 9-)
-mark
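P.S. To make that a little more concrete, the sort of plumbing I have in mind looks like the following. The s-expression file name and location are placeholders (check the docs for where --sexpr actually writes), and scp is just standing in for whatever transport you prefer:

# run collectl as usual, logging locally at fine granularity and also
# dumping the latest sample as an s-expression
collectl -sL -i1 -f /var/log/collectl --sexpr &

# a coarse-grained forwarder: every 60s, grab the latest s-expression and
# ship it to a central host for graphing/RRD feeding
while true; do
    scp /var/log/collectl/lustre.sexpr central-host:/var/spool/metrics/$(hostname).sexpr
    sleep 60
done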
On Mar 20, 2008  18:32 -0400, Mark Seger wrote:
> >> As a final note, I've put together a tutorial on using collectl in a
> >> lustre environment and have uploaded a preliminary copy at
> >> http://collectl.sourceforge.net/Tutorial-Lustre.html in case anyone
> >> wants to preview it before I link it into the documentation.
> >> If nothing else, look at my very last example where I show what you
> >> can see by monitoring lustre at the same time as your network
> >> interface.
> >
> > Very good, thanks for this. The readahead experiment is insightful.
>
> It was to me when I first encountered the problem.

This is a very interesting example, and I wish we had known about collectl a year ago before we invested time in writing data gathering scripts which aren't as useful as what you have here.

One question - is this "over readahead" still a problem? I know there was a bug like this (anything over 8kB was considered to be sequential and invoked readahead because it generated 3 consecutive pages of IO), but I thought it had been fixed some time ago. There is a sanity.sh test_101 that exercises random reads and checks that there are no discarded pages.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
> This is a very interesting example, and I wish we had known about
> collectl a year ago before we invested time in writing data gathering
> scripts which aren't as useful as what you have here.

I had mentioned collectl/lustre in a couple of places before but I guess I wasn't loud enough. 8-) The important thing is I got your attention.

> One question - is this "over readahead" still a problem? I know there
> was a bug like this (anything over 8kB was considered to be sequential
> and invoked readahead because it generated 3 consecutive pages of IO),
> but I thought it had been fixed some time ago. There is a sanity.sh
> test_101 that exercises random reads and checks that there are no
> discarded pages.

Actually, as we speak I'm getting ready to release a new version of collectl (stay tuned). While I don't have any specific readahead needs, I believe there is still something not right. I also think the operations manual is misleading, because it says readahead is triggered after the second sequential read, and one could interpret that to mean that when you do your second read you invoke readahead, but it's really not until your third read. Furthermore, 'read' sounds like a read call when in fact it really means - as you stated above - the 3rd page, not call. And finally, when you say this has been fixed, what exactly does that mean? Does readahead work differently now?

Anyhow, getting back to some of my experiments, and these are on 1.6.4.3. First of all, I discovered my perl script that was doing the random reads was using the perl 'read' function rather than 'sysread', so there's some extra stuff happening behind the scenes I'm not really sure about. However, it's causing a lot of readahead (or at least excess network traffic), and that puzzles me. Here's an example of doing 8K reads using perl's read function:

[root at cag-dl145-172 disktests]# collectl -snl -OR -oT
#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
16:33:41     141    148     26     138     69    276      0       0     0     61
16:33:42     296    307     52     261     70    280      0       0     2     64
16:33:43     311    323     54     275     78    312      0       0     0     64
16:33:44     310    321     54     276     73    292      0       0     0     63
16:33:45     306    316     53     266     63    252      0       0     0     61
16:33:46     301    311     53     267     76    304      0       0     0     68

and you can clearly see the traffic on the network matches what lustre is delivering to the client. I also saw in the rpc stats that all the requests were for single pages when they should have been for 2. But now look what happens when I go to 9K:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
16:34:42   13017   8887    349    4597     39    156      0       0     0     48
16:34:43   15310  10443    418    5544     65    260      0       0     0     69
16:34:44   18801  12839    501    6601     58    232      0       0     0     62
16:34:45   19436  13263    522    6926     24     96      0       0     0     32

This is clearly generating a lot of network traffic compared to the client's data rate. Perhaps someone who is more familiar with the subtleties of the perl 'read' function will know.

Anyhow, when I changed my 'read' to 'sysread' things seemed to get better, so perhaps readahead indeed works differently now? If so, does that mean the current definition is wrong? If so, what should it be? In any event, playing around a little I kind of stumbled on this one. I ran my perl script to do a single sysread, sleep a second and then do another. While I couldn't see it doing any unexpected network traffic for 12K requests, look what happens for 50K ones:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
16:41:32      55     41      2      31      1     50      0       0    12      1
16:41:33      56     46      4      38      1     50      0       0    12      1
16:41:34      55     41      2      31      1     50      0       0    12      1
16:41:35      55     40      2      31      1     50      0       0    12      1
16:41:36    1122    766     30     408      1     50      0       0    12      1
16:41:37      55     41      2      31      1     50      0       0    12      1
16:41:38      55     40      2      31      1     50      0       0     0      1
16:41:39    1130    774     30     412      0      0      0       0    12      0

If not readahead, lustre is certainly doing something funky over the wire... And finally, if I remove the sleep and just do a bunch of 50K reads, here's what I see:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
16:45:35    2952   2061     98    1121     49   2450      0       0   564     47
16:45:36    4744   3296    149    1745     40   2000      0       0   468     39
16:45:37    5158   3562    153    1884     46   2300      0       0   541     43
16:45:38    5816   4027    177    2129     47   2350      0       0   552     46
16:45:39    3601   2520    120    1356     52   2600      0       0   610     50
16:45:40    4897   3405    155    1808     51   2550      0       0   564     47
16:45:41    5862   4061    178    2134     49   2450      0       0   588     49
16:45:42    4799   3336    151    1763     52   2600      0       0   588     49
16:45:43    5864   4067    179    2139     52   2600      0       0   573     48
16:45:44    4836   3362    153    1799     38   1900      0       0   444     37
16:45:45    4199   2913    130    1550     55   2750      0       0   587     47
16:45:46    6938   4789    204    2498     53   2650      0       0   600     50
16:45:47    4854   3373    153    1789     46   2300      0       0   494     38

On average it looks like 2-3 times more data is being sent over the network than the client is delivering. Any thoughts on what's going on in these cases? In any event, feel free to download collectl and check things out for yourself. I'll notify this list when the new release happens.

Sorry for the long reply...
-mark
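P.S. For anyone who wants to poke at this without collectl in the middle, the raw counters behind those numbers are easy to get at on a client (names as I see them on my 1.6.4.3 systems; I believe writing to the files resets the counters, which is handy between runs, but verify on your version first):

# per-OSC RPC histogram -- shows whether reads go out as 1-page or multi-page RPCs
cat /proc/fs/lustre/osc/*/rpc_stats

# client readahead counters -- hits, misses and pages discarded, which is where
# collectl's -OR columns come from
cat /proc/fs/lustre/llite/*/read_ahead_stats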
It's spring, and even though UConn just lost in the NCAA tournament opening round for the first time since 1979, it felt like the time was right for a new release of collectl, especially since this one supports Lustre 1.6.4.3. If you're not familiar with collectl, check it out at http://collectl.sourceforge.net/. The main reason you should care is that it's the one tool that lets you monitor lustre stats alongside virtually any other system metrics you may care to look at. It's also low overhead, which means interactive one-second monitoring intervals are no big deal. If you want to run it as a daemon, it will collect samples every 10 seconds for a pretty accurate longer-term view of the world.

If you want to get an idea of what collectl can actually do, have a look at the short tutorial complete with screen shots at http://collectl.sourceforge.net/.

One key feature of collectl, at least as far as lustre is concerned, is that it dynamically looks for and responds to configuration changes! In other words, if OSTs or file systems are added or removed, collectl recognizes this and reacts.

Anyhow, don't just take my word for it, check it out...
-mark
On Mar 21, 2008  16:28 -0400, Mark Seger wrote:
> > One question - is this "over readahead" still a problem? I know there
> > was a bug like this (anything over 8kB was considered to be sequential
> > and invoked readahead because it generated 3 consecutive pages of IO),
> > but I thought it had been fixed some time ago. There is a sanity.sh
> > test_101 that exercises random reads and checks that there are no
> > discarded pages.
>
> While I don't have any specific readahead needs, I believe there is
> still something not right. I also think the operations manual is
> misleading, because it says readahead is triggered after the second
> sequential read, and one could interpret that to mean that when you
> do your second read you invoke readahead, but it's really not until your
> third read.

Well, the _sequential_ part of that statement is important. The first read is just a read. When the second read is done you can determine if it is sequential or not, and likewise the third read will be the second _sequential_ read. I suppose we could clarify that a bit in any case.

> Furthermore, 'read' sounds like a read call when in fact it
> really means - as you stated above - the 3rd page, not call.

Note that my mention above was in the past tense. The readahead code now makes decisions based on the sys_read sizes and not the individual pages.

> And finally, when you say this has been fixed, what exactly does that mean?
> Does readahead work differently now?

The readahead detection code was fixed in 1.4.7 or so to make decisions based on sequential sys_read() requests, and does not decide based on individual sequential pages being read. This means that the readahead is done with a multiple of the sys_read() size, and isn't confused by sequential pages within a single read.

In the very latest code (upcoming 1.6.5 only, I think) there is also strided readahead, so that if a client is reading, say, 5x1MB every 100MB (a common HPC load) then the readahead will detect this and start readahead of 5MB every 100MB instead of continuing linear readahead for 40MB, detecting a seek, and then resetting the readahead.

> Anyhow, getting back to some of my experiments, and these are on 1.6.4.3.
> First of all, I discovered my perl script that was doing the random
> reads was using the perl 'read' function rather than 'sysread', so
> there's some extra stuff happening behind the scenes I'm not
> really sure about. However, it's causing a lot of readahead (or at
> least excess network traffic), and that puzzles me. Here's an example of
> doing 8K reads using perl's read function:
>
> [collectl output snipped]
>
> and you can clearly see the traffic on the network matches what lustre
> is delivering to the client. I also saw in the rpc stats that all the
> requests were for single pages when they should have been for 2.

Yes, pretty clearly this is a problem, and will go back to confusing the readahead, but at this stage there isn't much the readahead can do about it. Well, I suppose the chance of a program going from purely random reads to straight linear reads is uncommon. We might hint to the readahead that a random reader needs to do 4 or 5 sequential reads before it gets reset to doing readahead, instead of just 2.

> Anyhow, when I changed my 'read' to 'sysread' things seemed to get better,
> so perhaps readahead indeed works differently now? If so, does that mean
> the current definition is wrong? If so, what should it be? In any
> event, playing around a little I kind of stumbled on this one. I ran my
> perl script to do a single sysread, sleep a second and then do another.
> While I couldn't see it doing any unexpected network traffic for 12K
> requests, look what happens for 50K ones:
>
> [collectl output snipped]
>
> If not readahead, lustre is certainly doing something funky over the
> wire...

How big is the file being read here? There is a new feature in the readahead code that if the file size is < 2MB it will fetch the whole file instead of just the small read, because the overhead of doing 3 read RPCs in order to detect sequential readahead is high compared to the overhead of doing a larger read the first time. I don't know if that is the case or not.

> And finally, if I remove the sleep and just do a bunch of 50K
> reads, here's what I see:
>
> [collectl output snipped]
>
> On average it looks like 2-3 times more data is being sent over the
> network than the client is delivering. Any thoughts on what's going on
> in these cases?

One possibility (depending on file size and random number generator) is that you are occasionally getting sequential random numbers and this is triggering readahead? This would be easily detectable inside your test program.

> In any event, feel free to download collectl and check
> things out for yourself. I'll notify this list when the new release happens.

Yes, I've been meaning to take a look for a while now. It looks like a very powerful, useful, and also usable tool.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
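P.S. If you want to rule readahead in or out of experiments like these, the client-side tunable can be flipped on the fly. A quick sketch using the 1.6-era /proc names (verify the names and the default value on your own client before relying on this):

# note the current per-file readahead limit so it can be restored later
cat /proc/fs/lustre/llite/*/max_read_ahead_mb

# disable readahead for the duration of a test run
for f in /proc/fs/lustre/llite/*/max_read_ahead_mb; do echo 0 > $f; done

# rerun the random-read test, then compare hits/misses/discards
cat /proc/fs/lustre/llite/*/read_ahead_stats

# restore whatever value you recorded above (40 is a common default)
for f in /proc/fs/lustre/llite/*/max_read_ahead_mb; do echo 40 > $f; done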