Hi all,

I'd like to get some Lustre info from my OSS/MDSs through SNMP. So I'm reading the Lustre manual, and it indicates [1] that the lustresnmp.so file should be provided by the "base Lustre RPM". But it's not. :) At least not in the 1.6.4.1 RHEL4 x86_64 RPMs.

So I was wondering if there was any plan to include the SNMP module back in future RPM versions?

Thanks,
--
Kilian

[1] http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-15-1.html
Hi Kilian,

I was asking that same question a few months ago. I can send you my 1.6.2 spec file for reference ... That version also did not bundle the SNMP library, so I ended up rebuilding the whole set of Lustre RPMs to get what I needed, and then just dropped the DSO in place.

I'm curious as to what metrics you see as useful -- I wasn't sure what to look for, so while I installed the module, I haven't yet thought of good things to ask of it.

cheers,
Klaus

On 3/7/08 5:01 PM, "Kilian CAVALOTTI" <kilian at stanford.edu> did etch on stone tablets:

> Hi all,
>
> I'd like to get some Lustre info from my OSS/MDSs through SNMP. So I'm
> reading the Lustre manual, and it indicates [1] that the lustresnmp.so
> file should be provided by the "base Lustre RPM". But it's not. :) At
> least not in the 1.6.4.1 RHEL4 x86_64 RPMs.
>
> So I was wondering if there was any plan to include the SNMP module back
> in future RPM versions?
>
> Thanks,
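P.S. For anyone else going down the rebuild route, the end state looks roughly like this. The build itself was mostly a matter of having net-snmp-devel installed when rebuilding the regular RPM set; the install path and dlmod module name below are what I recall the manual suggesting, so double-check them against your own tree:

# once the rebuilt RPMs are installed, the DSO should land somewhere like
#   /usr/lib64/lustre/snmp/lustresnmp.so   (lib/ instead of lib64/ on 32-bit)
# then load it into net-snmp via /etc/snmp/snmpd.conf:
dlmod lustresnmp /usr/lib64/lustre/snmp/lustresnmp.so

# restart the agent
service snmpd restart

# quick sanity check that the Lustre MIB answers (community string is an example;
# walking the whole enterprises subtree avoids having to remember the exact OID)
snmpwalk -v1 -c public localhost enterprises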
On Friday 07 March 2008 05:01:11 pm Kilian CAVALOTTI wrote:
> So I was wondering if there was any plan to include the SNMP module
> back in future RPM versions?

And in addition to that, is there any plan to add more stats through this SNMP module (the kind we find in /proc/fs/lustre/{llite,ost,mdt}/.../stats)? That'd be an excellent starting point to collect metrics and remotely monitor a Lustre setup from a central location.

Thanks,
--
Kilian
Hi Klaus,

On Friday 07 March 2008 05:52:51 pm Klaus Steden wrote:
> I was asking that same question a few months ago.

Yes, I remember you weren't overwhelmed by answers. :\

> I can send you my 1.6.2 spec file for reference ... That version also
> did not bundle the SNMP library, so I ended up rebuilding the whole
> set of Lustre RPMs to get what I needed, and then just dropped the
> DSO in place.

That's exactly what I did, finally.

> I'm curious as to what metrics you see as useful -- I wasn't sure
> what to look for, so while I installed the module, I haven't yet
> thought of good things to ask of it.

So, from what I've seen in the MIB, the current SNMP module mainly reports version numbers and free space information. I think it would also be useful to get "activity metrics", the same kind of information which is in /proc/fs/lustre/llite/*/stats on clients (so we can see reads/writes and fs operation rates), in /proc/fs/lustre/obdfilter/*/stats on OSSes and in /proc/fs/lustre/mds/*/stats on MDSes.

Actually, all the /proc/fs/lustre/*/**/stats files could be useful, but I guess which precise metric is the most useful heavily depends on what you want to see. :)

Cheers,
--
Kilian
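P.S. For anyone who hasn't looked at those files yet: they're plain-text counters, one per line, so even a plain cat from cron would be a start. Roughly (field layout from memory of my 1.6 systems, so treat it as indicative):

# on a client; the obdfilter/ and mds/ equivalents live on the OSS/MDS nodes
cat /proc/fs/lustre/llite/*/stats

# each line is roughly of the form
#   <counter> <count> samples [<unit>]
# with byte counters (read_bytes, write_bytes) also carrying min/max/sum columns,
# while open/close/getattr and friends are simple request counts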
On Mon, 2008-03-10 at 13:11 -0700, Kilian CAVALOTTI wrote:
> I think it would also be useful to get "activity metrics", the same kind
> of information which is in /proc/fs/lustre/llite/*/stats on clients (so
> we can see reads/writes and fs operation rates),
> in /proc/fs/lustre/obdfilter/*/stats on OSSes and
> in /proc/fs/lustre/mds/*/stats on MDSes.

I can't disagree with that, especially as Lustre installations get bigger and bigger. Apart from writing custom monitoring tools, there aren't a lot of "pre-emptive" monitoring options available. There are a few tools out there like collectl (never seen it, just heard about it) and LLNL have one on sourceforge, but I can certainly see the attraction of being able to monitor Lustre on your servers with the same tools as you are using to monitor the servers' health themselves.

> Actually, all the /proc/fs/lustre/*/**/stats files could be useful, but I
> guess which precise metric is the most useful heavily depends on what
> you want to see. :)

This could wind up becoming a lustre-devel@ discussion, but for now, it would be interesting to extend the interface(s) we use to introduce /proc (and what will soon be its replacement/augmentation) stats files so that they are automagically provided via SNMP.

You know, given the discussion in this thread:
http://lists.lustre.org/pipermail/lustre-devel/2008-January/001475.html
now would be a good time for the community (that perhaps might want to contribute) desiring SNMP access to get their foot in the door. Ideally, you get SNMP into the generic interface and then SNMP access to all current and future variables comes more or less free.

That all said, there are some /proc files which provide a copious amount of information, like brw_stats for instance. I don't know how well that sort of thing maps to SNMP, but having an SNMP manager watching something as useful as brw_stats for trends over time could be quite interesting.

b.
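P.S. In the meantime (and just thinking out loud), net-snmp's "pass" mechanism would let a site expose a few of those counters without touching Lustre at all. An untested sketch follows; the OID is a made-up placeholder under the net-snmp enterprise, and the stats field layout is from memory:

# /etc/snmp/snmpd.conf
pass .1.3.6.1.4.1.8072.9999 /usr/local/bin/lustre-snmp-pass.sh

#!/bin/sh
# /usr/local/bin/lustre-snmp-pass.sh -- answers a single GET for the total
# bytes read by the obdfilters on this OSS ("pass" calls us with -g <oid>,
# and expects three output lines: oid, type, value)
OID=".1.3.6.1.4.1.8072.9999.1"
[ "$1" = "-g" ] && [ "$2" = "$OID" ] || exit 0
# stats lines look like: read_bytes <count> samples [bytes] <min> <max> <sum>
SUM=$(awk '/^read_bytes/ {s += $7} END {printf "%d\n", s}' /proc/fs/lustre/obdfilter/*/stats)
echo "$OID"
echo "counter"
echo "$SUM"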
Hi Brian,

On Monday 10 March 2008 03:04:33 pm Brian J. Murrell wrote:
> I can't disagree with that, especially as Lustre installations get
> bigger and bigger. Apart from writing custom monitoring tools,
> there aren't a lot of "pre-emptive" monitoring options available.
> There are a few tools out there like collectl (never seen it, just
> heard about it)

collectl is very nice, but like dstat and such, it has to run on each and every host. It can provide its results via sockets though, so it could be used as a centralized monitoring system for a Lustre installation. And it provides detailed statistics too:

# collectl -sL -O R
waiting for 1 second sample...

# LUSTRE CLIENT DETAIL: READAHEAD
#Filsys   Reads ReadKB Writes WriteKB Pend Hits Misses NotCon MisWin LckFal Discrd ZFile ZerWin RA2Eof HitMax
home        100    192      0       0    0    0    100      0      0      0      0     0    100      0      0
scratch     100    192      0       0    0    0    100      0      0      0      0     0    100      0      0
home        102   6294     23     233    0    0     87      0      0      0      0     0     87      0      0
scratch     102   6294     23     233    0    0     87      0      0      0      0     0     87      0      0
home         95    158     22     222    0    0     81      0      0      0      0     0     81      0      0
scratch      95    158     22     222    0    0     81      0      0      0      0     0     81      0      0

# collectl -sL -O M
waiting for 1 second sample...

# LUSTRE CLIENT DETAIL: METADATA
#Filsys   Reads ReadKB Writes WriteKB Open Close GAttr SAttr Seek Fsync DrtHit DrtMis
home          0      0      0       0    0     0     0     0    0     0      0      0
scratch       0      0      0       0    0     0     2     0    0     0      0      0
home          0      0      0       0    0     0     0     0    0     0      0      0
scratch       0      0      0       0    0     0     0     0    0     0      0      0
home          0      0      0       0    0     0     0     0    0     0      0      0
scratch       0      0      0       0    0     0     1     0    0     0      0      0

# collectl -sL -O B
waiting for 1 second sample...

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#Ost            Rds RdK 1K 2K 4K 8K 16K 32K 64K 128K 256K Wrts WrtK 1K 2K 4K 8K 16K 32K 64K 128K 256K
home-OST0007      0   0  0  0  0  0   0   0   0    0    0    0    0  0  0  0  0   0   0   0    0    0
scratch-OST0007   0   0  9  0  0  0   0   0   0    0    0   12 3075  9  0  0  0   0   0   0    0    3
home-OST0007      0   0  0  0  0  0   0   0   0    0    0    0    0  0  0  0  0   0   0   0    0    0
scratch-OST0007   0   0  1  0  0  0   0   0   0    0    0    1    2  1  0  0  0   0   0   0    0    0
home-OST0007      0   0  0  0  0  0   0   0   0    0    0    0    0  0  0  0  0   0   0   0    0    0
scratch-OST0007   0   0  1  0  0  0   0   0   0    0    0    1    2  1  0  0  0   0   0   0    0    0

> and LLNL have one on sourceforge,

Last time I checked, it only supported 1.4 versions, but it's been a while, so I'm probably a bit behind.

> but I can certainly
> see the attraction of being able to monitor Lustre on your servers
> with the same tools as you are using to monitor the servers' health
> themselves.

Yes, that'd be a strong selling point.

> This could wind up becoming a lustre-devel@ discussion, but for now, it
> would be interesting to extend the interface(s) we use to
> introduce /proc (and what will soon be its replacement/augmentation)
> stats files so that they are automagically provided via SNMP.

That sounds like the way to proceed, indeed.

> You know, given the discussion in this thread:
> http://lists.lustre.org/pipermail/lustre-devel/2008-January/001475.html
> now would be a good time for the community (that perhaps might
> want to contribute) desiring SNMP access to get their foot in the
> door. Ideally, you get SNMP into the generic interface and then SNMP
> access to all current and future variables comes more or less free.

Oh, thanks for pointing this out. It looks like major underlying changes are coming. I think I'll subscribe to the lustre-devel ML to try to follow them.

> That all said, there are some /proc files which provide a copious
> amount of information, like brw_stats for instance. I don't know how
> well that sort of thing maps to SNMP, but having an SNMP manager
> watching something as useful as brw_stats for trends over time could
> be quite interesting.

Add some RRD graphs to keep historical variations, and you've got the all-in-one Lustre monitoring tool we sysadmins are all waiting for. ;)

Cheers,
--
Kilian
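P.S. To be clear about what I mean by "add some RRD graphs": nothing fancier than the standard rrdtool create/update/graph cycle, fed from whatever collects the stats. A minimal sketch; the file names, step and counter names are just examples:

# one RRD per OST, 10s step, keeping ~1 day of raw samples and ~1 year of hourly averages
rrdtool create scratch-OST0007.rrd --step 10 \
    DS:read_kb:COUNTER:30:0:U DS:write_kb:COUNTER:30:0:U \
    RRA:AVERAGE:0.5:1:8640 RRA:AVERAGE:0.5:360:8760

# feed it from the collector (read_kb/write_kb being the cumulative counters)
rrdtool update scratch-OST0007.rrd N:$read_kb:$write_kb

# graph the last day
rrdtool graph scratch-OST0007.png --start -1d \
    DEF:r=scratch-OST0007.rrd:read_kb:AVERAGE \
    DEF:w=scratch-OST0007.rrd:write_kb:AVERAGE \
    LINE1:r#0000ff:"read KB/s" LINE1:w#ff0000:"write KB/s"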
On Mon, 2008-03-10 at 16:58 -0700, Kilian CAVALOTTI wrote:
> Hi Brian,

Hi.

> Oh, thanks for pointing this out. It looks like major underlying changes
> are coming. I think I'll subscribe to the lustre-devel ML to try to
> follow them.

You could do that, but I suspect that if you want to see those developments include SNMP access to the stats, you are going to have to be more proactive than just following the current development. I don't have any more insight than what's in that thread about the plans underway, but I'd be very surprised if they currently include SNMP. I could be wrong, but I suspect that if you want to see SNMP availability you'd have to get active either with participating in the design and perhaps some hacking or voicing your desires through your sales channel.

> Add some RRD graphs to keep historical variations, and you've got the
> all-in-one Lustre monitoring tool we sysadmins are all waiting for. ;)

Heh. Indeed.

b.
On Tuesday 11 March 2008 01:52:33 am Brian J. Murrell wrote:
> You could do that, but I suspect that if you want to see those
> developments include SNMP access to the stats, you are going to have
> to be more proactive than just following the current development. I
> don't have any more insight than what's in that thread about the
> plans underway, but I'd be very surprised if they currently include
> SNMP. I could be wrong, but I suspect that if you want to see SNMP
> availability you'd have to get active

Gotcha. Bug #15197, "Feature request: expand SNMP scope"

> either with participating in
> the design and perhaps some hacking

I'm not sure I can be of any help in this area, unfortunately. But I've seen that some users expressed the same kind of need and rolled up their sleeves :)
http://lists.lustre.org/pipermail/lustre-devel/2008-January/001504.html

> or voicing your desires through
> your sales channel.

That I can do. :)

Thanks for the advice,
--
Kilian
Kilian CAVALOTTI <kilian at ...> writes:
> Hi Brian,
>
> On Monday 10 March 2008 03:04:33 pm Brian J. Murrell wrote:
> > I can't disagree with that, especially as Lustre installations get
> > bigger and bigger. Apart from writing custom monitoring tools,
> > there aren't a lot of "pre-emptive" monitoring options available.
> > There are a few tools out there like collectl (never seen it, just
> > heard about it)
>
> collectl is very nice, but like dstat and such, it has to run on each and
> every host. It can provide its results via sockets though, so it could
> be used as a centralized monitoring system for a Lustre installation.

I'm the author of collectl and so have a few opinions of my own 8-) Nice to see someone realized there's a method to my madness. I've been very frustrated by all the tools out there that come close to solving the distributed management problem and always seem to leave something out, be it handling the level of detail one needs to get their job done, or providing an ability to look at historical data, or supporting what I consider fine-grained monitoring, that is taking a sample once a second [or less! yes, collectl does support that]. My solution was to focus on one thing and do it well - collect local data and provide a rational methodology for archiving it and displaying it, which also includes being able to plot it. OK, guess that was more than one. But I had also hoped that by supplying hooks, others could build on my work rather than start all over again with yet another tool that stands alone.

> And it provides detailed statistics too:
>
> [collectl readahead, metadata and per-OST output snipped]
>
> > and LLNL have one on sourceforge,
>
> Last time I checked, it only supported 1.4 versions, but it's been a while,
> so I'm probably a bit behind.

Not sure if you're talking about collectl, but in fact I have only just begun to look at 1.6.4.3 and the good news is I only found a few minor things I need to fix. I do plan on releasing a new version of collectl in a couple of days.

> > but I can certainly
> > see the attraction of being able to monitor Lustre on your servers
> > with the same tools as you are using to monitor the servers' health
> > themselves.
>
> Yes, that'd be a strong selling point.

Not only is that a strong point, it's the main point! When you have multiple tools trying to track multiple resources and something goes wrong, how are you expected to do any correlation? To that point, collectl even tracks time to the msec, aligns to a whole second within a msec or two, and even does it across a cluster - assuming of course your clocks are synchronized.

> > This could wind up becoming a lustre-devel@ discussion, but for now, it
> > would be interesting to extend the interface(s) we use to
> > introduce /proc (and what will soon be its replacement/augmentation)
> > stats files so that they are automagically provided via SNMP.

Given my comments about focusing on one thing and doing it well, I could see if someone really wanted to export lustre data with snmp they could always use the --sexpr switch to collectl, telling it to write every sample as an s-expression which an snmp module could then pick up. That way you only have one piece of code worrying about collecting/parsing lustre data.

One other very key point here - at least I think it is - I often see cluster-based tools collecting data once every minute or even more. Alternatively they might choose to only take a small sample of data, the common theme being any large volume of data will overwhelm a management station. Right! For that reason, I could see one exporting a subset of data upstream while keeping more detail, and perhaps at a finer time granularity, locally for debug purposes.

> That sounds like the way to proceed, indeed.
>
> > You know, given the discussion in this thread:
> > http://lists.lustre.org/pipermail/lustre-devel/2008-January/001475.html
> > now would be a good time for the community (that perhaps might
> > want to contribute) desiring SNMP access to get their foot in the
> > door. Ideally, you get SNMP into the generic interface and then SNMP
> > access to all current and future variables comes more or less free.
>
> Oh, thanks for pointing this out. It looks like major underlying changes
> are coming. I think I'll subscribe to the lustre-devel ML to try to
> follow them.
>
> > That all said, there are some /proc files which provide a copious
> > amount of information, like brw_stats for instance. I don't know how
> > well that sort of thing maps to SNMP, but having an SNMP manager
> > watching something as useful as brw_stats for trends over time could
> > be quite interesting.
>
> Add some RRD graphs to keep historical variations, and you've got the
> all-in-one Lustre monitoring tool we sysadmins are all waiting for. ;)

Be careful here. You can certainly stick some data into an rrd, but certainly not all of it, especially if you want to collect a lot of it at a reasonable frequency. If you want accurate detail plots, you've gotta go to the data stored on each separate system. I just don't see any way around this, at least not yet...

As a final note, I've put together a tutorial on using collectl in a lustre environment and have uploaded a preliminary copy at http://collectl.sourceforge.net/Tutorial-Lustre.html in case anyone wants to preview it before I link it into the documentation. If nothing else, look at my very last example where I show what you can see by monitoring lustre at the same time as your network interface. Did I also mention that collectl is probably one of the few tools that can monitor your Infiniband traffic as well?

Sorry for being so long winded...
-mark
On Thursday 20 March 2008 01:15:04 pm Mark Seger wrote:
> not sure if you're talking about collectl

No, I wasn't; I was referring to the Lustre Monitoring Tool (LMT) from LLNL.

> Be careful here. You can certainly stick some data into an rrd, but
> certainly not all of it, especially if you want to collect a lot of
> it at a reasonable frequency. If you want accurate detail plots,
> you've gotta go to the data stored on each separate system. I just
> don't see any way around this, at least not yet...

Yes, you're absolutely right. Given its intrinsic multi-scale nature, an RRD is well suited for keeping historical data on large time scales. This could allow a very convenient graphical overview of the different system metrics, but would be pointless for debugging purposes, where you do need fine-grained data. That's where collectl is the most useful for me.

But what about both? I don't see any reason why collectl couldn't provide high-frequency, accurate data to diagnose problems locally, and at the same time aggregate less precise values into RRDs for global visualization of multi-host systems.

> As a final note, I've put together a tutorial on using collectl in a
> lustre environment and have uploaded a preliminary copy at
> http://collectl.sourceforge.net/Tutorial-Lustre.html in case anyone
> wants to preview it before I link it into the documentation.
> If nothing else, look at my very last example where I show what you
> can see by monitoring lustre at the same time as your network
> interface.

Very good, thanks for this. The readahead experiment is insightful.

> Did I also mention that collectl is probably one of the few tools
> that can monitor your Infiniband traffic as well?

That's why it rocks. :)

Now the only thing which still makes me want to use other monitoring software is the ability to get a global view. Centralized data collection and easy graphing (RRD feeding) are still what I need most of the time.

Cheers,
--
Kilian
>> Be careful here. You can certainly stick some data into an rrd, but
>> certainly not all of it, especially if you want to collect a lot of
>> it at a reasonable frequency. If you want accurate detail plots,
>> you've gotta go to the data stored on each separate system. I just
>> don't see any way around this, at least not yet...
>
> Yes, you're absolutely right. Given its intrinsic multi-scale nature, an
> RRD is well suited for keeping historical data on large time scales.
> This could allow a very convenient graphical overview of the different
> system metrics, but would be pointless for debugging purposes, where
> you do need fine-grained data. That's where collectl is the most useful
> for me.
>
> But what about both? I don't see any reason why collectl couldn't
> provide high-frequency, accurate data to diagnose problems locally, and
> at the same time aggregate less precise values into RRDs for global
> visualization of multi-host systems.

I agree 1000%... The mental model I've been building in my head is to tell collectl to log its data locally and also write out an s-expression using --sexpr. Then a daemon can periodically pick out the data it's interested in, at whatever frequency it's interested in, and forward it on up the line.

>> As a final note, I've put together a tutorial on using collectl in a
>> lustre environment and have uploaded a preliminary copy at
>> http://collectl.sourceforge.net/Tutorial-Lustre.html in case anyone
>> wants to preview it before I link it into the documentation.
>> If nothing else, look at my very last example where I show what you
>> can see by monitoring lustre at the same time as your network
>> interface.
>
> Very good, thanks for this. The readahead experiment is insightful.

It was to me when I first encountered the problem.

>> Did I also mention that collectl is probably one of the few tools
>> that can monitor your Infiniband traffic as well?
>
> That's why it rocks. :)
>
> Now the only thing which still makes me want to use other monitoring
> software is the ability to get a global view. Centralized data
> collection and easy graphing (RRD feeding) are still what I need most
> of the time.

I hear you here too. That's the main reason I put in the ability to generate data in plottable format. That's as close as I'm willing to go to providing a graphing capability in collectl itself. I'm trying real hard to bound its scope as I figure it already has more than enough switches... 9-)
-mark
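P.S. To make that a little more concrete, the sort of plumbing I have in mind looks like the following. The s-expression file name and location are placeholders (check the docs for where --sexpr actually writes), and scp is just standing in for whatever transport you prefer:

# run collectl as usual, logging locally at fine granularity and also
# dumping the latest sample as an s-expression
collectl -sL -i1 -f /var/log/collectl --sexpr &

# a coarse-grained forwarder: every 60s, grab the latest s-expression and
# ship it to a central host for graphing/RRD feeding
while true; do
    scp /var/log/collectl/lustre.sexpr central-host:/var/spool/metrics/$(hostname).sexpr
    sleep 60
done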
On Mar 20, 2008  18:32 -0400, Mark Seger wrote:
> >> As a final note, I've put together a tutorial on using collectl in a
> >> lustre environment and have uploaded a preliminary copy at
> >> http://collectl.sourceforge.net/Tutorial-Lustre.html in case anyone
> >> wants to preview it before I link it into the documentation.
> >> If nothing else, look at my very last example where I show what you
> >> can see by monitoring lustre at the same time as your network
> >> interface.
> >
> > Very good, thanks for this. The readahead experiment is insightful.
>
> It was to me when I first encountered the problem.

This is a very interesting example, and I wish we had known about collectl a year ago before we invested time in writing data gathering scripts which aren't as useful as what you have here.

One question - is this "over readahead" still a problem? I know there was a bug like this (anything over 8kB was considered to be sequential and invoked readahead because it generated 3 consecutive pages of IO), but I thought it had been fixed some time ago. There is a sanity.sh test_101 that exercises random reads and checks that there are no discarded pages.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
> This is a very interesting example, and I wish we had known about
> collectl a year ago before we invested time in writing data gathering
> scripts which aren't as useful as what you have here.

I had mentioned collectl/lustre in a couple of places before but I guess I wasn't loud enough. 8-) The important thing is I got your attention.

> One question - is this "over readahead" still a problem? I know there
> was a bug like this (anything over 8kB was considered to be sequential
> and invoked readahead because it generated 3 consecutive pages of IO),
> but I thought it had been fixed some time ago. There is a sanity.sh
> test_101 that exercises random reads and checks that there are no
> discarded pages.

Actually, as we speak I'm getting ready to release a new version of collectl (stay tuned). While I don't have any specific readahead needs, I believe there is still something not right. I also think the operations manual is misleading, because it says readahead is triggered after the second sequential read, and one could interpret that to mean that when you do your second read you invoke readahead, but it's really not until your third read. Furthermore, 'read' sounds like a read call when in fact it really means - as you stated above - the 3rd page, not call. And finally, when you say this has been fixed, what exactly does that mean? Does readahead work differently now?

Anyhow, getting back to some of my experiments, and these are on 1.6.4.3. First of all, I discovered my perl script that was doing the random reads was using the perl 'read' function rather than 'sysread', so there's some extra stuff happening behind the scenes I'm not really sure about. However, it's causing a lot of readahead (or at least excess network traffic), and that puzzles me. Here's an example of doing 8K reads using perl's read function:

[root at cag-dl145-172 disktests]# collectl -snl -OR -oT
#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
16:33:41     141    148     26     138     69    276      0       0     0     61
16:33:42     296    307     52     261     70    280      0       0     2     64
16:33:43     311    323     54     275     78    312      0       0     0     64
16:33:44     310    321     54     276     73    292      0       0     0     63
16:33:45     306    316     53     266     63    252      0       0     0     61
16:33:46     301    311     53     267     76    304      0       0     0     68

and you can clearly see the traffic on the network matches what lustre is delivering to the client. I also saw in the rpc stats that all the requests were for single pages when they should have been for 2. But now look what happens when I go to 9K:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
16:34:42   13017   8887    349    4597     39    156      0       0     0     48
16:34:43   15310  10443    418    5544     65    260      0       0     0     69
16:34:44   18801  12839    501    6601     58    232      0       0     0     62
16:34:45   19436  13263    522    6926     24     96      0       0     0     32

This is clearly generating a lot of network traffic compared to the client's data rate. Perhaps someone who is more familiar with the subtleties of the perl 'read' function will know.

Anyhow, when I changed my 'read' to 'sysread' things seemed to get better, so perhaps readahead indeed works differently now? If so, does that mean the current definition is wrong? If so, what should it be? In any event, playing around a little I kind of stumbled on this one. I ran my perl script to do a single sysread, sleep a second and then do another. While I couldn't see it doing any unexpected network traffic for 12K requests, look what happens for 50K ones:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
16:41:32      55     41      2      31      1     50      0       0    12      1
16:41:33      56     46      4      38      1     50      0       0    12      1
16:41:34      55     41      2      31      1     50      0       0    12      1
16:41:35      55     40      2      31      1     50      0       0    12      1
16:41:36    1122    766     30     408      1     50      0       0    12      1
16:41:37      55     41      2      31      1     50      0       0    12      1
16:41:38      55     40      2      31      1     50      0       0     0      1
16:41:39    1130    774     30     412      0      0      0       0    12      0

If not readahead, lustre is certainly doing something funky over the wire... And finally, if I remove the sleep and just do a bunch of 50K reads, here's what I see:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
16:45:35    2952   2061     98    1121     49   2450      0       0   564     47
16:45:36    4744   3296    149    1745     40   2000      0       0   468     39
16:45:37    5158   3562    153    1884     46   2300      0       0   541     43
16:45:38    5816   4027    177    2129     47   2350      0       0   552     46
16:45:39    3601   2520    120    1356     52   2600      0       0   610     50
16:45:40    4897   3405    155    1808     51   2550      0       0   564     47
16:45:41    5862   4061    178    2134     49   2450      0       0   588     49
16:45:42    4799   3336    151    1763     52   2600      0       0   588     49
16:45:43    5864   4067    179    2139     52   2600      0       0   573     48
16:45:44    4836   3362    153    1799     38   1900      0       0   444     37
16:45:45    4199   2913    130    1550     55   2750      0       0   587     47
16:45:46    6938   4789    204    2498     53   2650      0       0   600     50
16:45:47    4854   3373    153    1789     46   2300      0       0   494     38

On average it looks like 2-3 times more data is being sent over the network than the client is delivering. Any thoughts on what's going on in these cases? In any event, feel free to download collectl and check things out for yourself. I'll notify this list when the new release happens.

Sorry for the long reply...
-mark
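P.S. For anyone who wants to poke at this without collectl in the middle, the raw counters behind those numbers are easy to get at on a client (names as I see them on my 1.6.4.3 systems; I believe writing to the files resets the counters, which is handy between runs, but verify on your version first):

# per-OSC RPC histogram -- shows whether reads go out as 1-page or multi-page RPCs
cat /proc/fs/lustre/osc/*/rpc_stats

# client readahead counters -- hits, misses and pages discarded, which is where
# collectl's -OR columns come from
cat /proc/fs/lustre/llite/*/read_ahead_stats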
It's spring, and even though UConn just lost in the NCAA tournament opening round for the first time since 1979, it felt like the time was right for a new release of collectl, especially since this one supports Lustre 1.6.4.3. If you're not familiar with collectl, check it out at http://collectl.sourceforge.net/. The main reason you should care is that it's the one tool that lets you monitor lustre stats alongside virtually any other system metrics you may care to look at. It's also low overhead, which means interactive one-second monitoring intervals are no big deal. If you want to run it as a daemon, it will collect samples every 10 seconds for a pretty accurate longer-term view of the world.

If you want to get an idea of what collectl can actually do, have a look at the short tutorial complete with screen shots at http://collectl.sourceforge.net/.

One key feature of collectl, at least as far as lustre is concerned, is that it dynamically looks for and responds to configuration changes! In other words, if OSTs or file systems are added or removed, collectl recognizes this and reacts.

Anyhow, don't just take my word for it, check it out...
-mark
On Mar 21, 2008  16:28 -0400, Mark Seger wrote:
> > One question - is this "over readahead" still a problem? I know there
> > was a bug like this (anything over 8kB was considered to be sequential
> > and invoked readahead because it generated 3 consecutive pages of IO),
> > but I thought it had been fixed some time ago. There is a sanity.sh
> > test_101 that exercises random reads and checks that there are no
> > discarded pages.
>
> While I don't have any specific readahead needs, I believe there is
> still something not right. I also think the operations manual is
> misleading, because it says readahead is triggered after the second
> sequential read, and one could interpret that to mean that when you
> do your second read you invoke readahead, but it's really not until your
> third read.

Well, the _sequential_ part of that statement is important. The first read is just a read. When the second read is done you can determine if it is sequential or not, and likewise the third read will be the second _sequential_ read. I suppose we could clarify that a bit in any case.

> Furthermore, 'read' sounds like a read call when in fact it
> really means - as you stated above - the 3rd page, not call.

Note that my mention above was in the past tense. The readahead code now makes decisions based on the sys_read sizes and not the individual pages.

> And finally, when you say this has been fixed, what exactly does that mean?
> Does readahead work differently now?

The readahead detection code was fixed in 1.4.7 or so to make decisions based on sequential sys_read() requests, and does not decide based on individual sequential pages being read. This means that the readahead is done with a multiple of the sys_read() size, and isn't confused by sequential pages within a single read.

In the very latest code (upcoming 1.6.5 only, I think) there is also strided readahead, so that if a client is reading, say, 5x1MB every 100MB (a common HPC load) then the readahead will detect this and start readahead of 5MB every 100MB instead of continuing linear readahead for 40MB, detecting a seek, and then resetting the readahead.

> Anyhow, getting back to some of my experiments, and these are on 1.6.4.3.
> First of all, I discovered my perl script that was doing the random
> reads was using the perl 'read' function rather than 'sysread', so
> there's some extra stuff happening behind the scenes I'm not
> really sure about. However, it's causing a lot of readahead (or at
> least excess network traffic), and that puzzles me. Here's an example of
> doing 8K reads using perl's read function:
>
> [collectl output snipped]
>
> and you can clearly see the traffic on the network matches what lustre
> is delivering to the client. I also saw in the rpc stats that all the
> requests were for single pages when they should have been for 2.

Yes, pretty clearly this is a problem, and will go back to confusing the readahead, but at this stage there isn't much the readahead can do about it. Well, I suppose the chance of a program going from purely random reads to straight linear reads is uncommon. We might hint to the readahead that a random reader needs to do 4 or 5 sequential reads before it gets reset to doing readahead, instead of just 2.

> Anyhow, when I changed my 'read' to 'sysread' things seemed to get better,
> so perhaps readahead indeed works differently now? If so, does that mean
> the current definition is wrong? If so, what should it be? In any
> event, playing around a little I kind of stumbled on this one. I ran my
> perl script to do a single sysread, sleep a second and then do another.
> While I couldn't see it doing any unexpected network traffic for 12K
> requests, look what happens for 50K ones:
>
> [collectl output snipped]
>
> If not readahead, lustre is certainly doing something funky over the
> wire...

How big is the file being read here? There is a new feature in the readahead code that if the file size is < 2MB it will fetch the whole file instead of just the small read, because the overhead of doing 3 read RPCs in order to detect sequential readahead is high compared to the overhead of doing a larger read the first time. I don't know if that is the case or not.

> And finally, if I remove the sleep and just do a bunch of 50K
> reads, here's what I see:
>
> [collectl output snipped]
>
> On average it looks like 2-3 times more data is being sent over the
> network than the client is delivering. Any thoughts on what's going on
> in these cases?

One possibility (depending on file size and random number generator) is that you are occasionally getting sequential random numbers and this is triggering readahead? This would be easily detectable inside your test program.

> In any event, feel free to download collectl and check
> things out for yourself. I'll notify this list when the new release happens.

Yes, I've been meaning to take a look for a while now. It looks like a very powerful, useful, and also usable tool.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
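P.S. If you want to rule readahead in or out of experiments like these, the client-side tunable can be flipped on the fly. A quick sketch using the 1.6-era /proc names (verify the names and the default value on your own client before relying on this):

# note the current per-file readahead limit so it can be restored later
cat /proc/fs/lustre/llite/*/max_read_ahead_mb

# disable readahead for the duration of a test run
for f in /proc/fs/lustre/llite/*/max_read_ahead_mb; do echo 0 > $f; done

# rerun the random-read test, then compare hits/misses/discards
cat /proc/fs/lustre/llite/*/read_ahead_stats

# restore whatever value you recorded above (40 is a common default)
for f in /proc/fs/lustre/llite/*/max_read_ahead_mb; do echo 40 > $f; done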