So I've been doing some configuration and testing with Nagios and have been having this nagging feeling that it is going to lead to some pretty major issues in the future. Monitoring is a "need-to-have" in the HPC world, and the leaders in this pack so far are Nagios and Ganglia. While we've been including Ganglia in the stack, we've never really put much effort into its configuration, focusing instead on other tasks. For the next release, it seemed like a good idea to pause and see if Ganglia is the right choice, or if perhaps Nagios can provide some more options.

My biggest issue right now is the scalability of Nagios. This shows up as soon as you look at how Nagios is configured: to define a cluster, you must create a host entry for each host within the cluster. While this is easy enough and scriptable, it really raises the question "are you thinking about 1000, or 10000, or even more nodes?" Yes, this is only the configuration file, but the concern carries over into the monitoring itself. Nagios uses a polling method to check every service on every node. In the case of 10K nodes, how long will it take for the same node to be checked twice, or three times? How long will it take before we find out that it's down?

Perhaps I just don't understand the configuration options available to me (which is why I'm writing, hoping someone tells me I'm stupid). Perhaps there are other ways to approach this with Nagios (e.g., use scalable units that each only monitor a subset of nodes). Any thoughts out there?

(This has been cross-posted on http://blogs.sun.com/giraffe)

--
"A simile is not a lie, unless it is a bad simile."
- Christopher John Francis Boone
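For a sense of what "easy enough and scriptable" means here, a minimal sketch of a per-host config generator follows. The node-naming scheme, the made-up addressing plan, and the "generic-host" template are invented for illustration and are not part of any existing stack configuration.

```python
#!/usr/bin/env python
# Minimal sketch: emit one Nagios host definition per compute node.
# Node names, addresses, and the "generic-host" template are assumptions.

NODES = 10000  # the scale the thread worries about

def host_block(name, address):
    # Each host gets its own "define host" stanza; a template ("use") keeps
    # the per-host block small, but the count still grows with the cluster.
    return (
        "define host {\n"
        "    use        generic-host\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n\n"
    )

with open("compute-hosts.cfg", "w") as cfg:
    for i in range(1, NODES + 1):
        name = f"node{i:05d}"
        addr = f"10.0.{i // 256}.{i % 256}"  # made-up addressing plan
        cfg.write(host_block(name, addr))
```

Generating 10,000 host blocks this way is trivial; the real question is that the Nagios server still has to actively poll every one of those definitions.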
Does it make sense to use both Nagios and Ganglia? Nagios to monitor the infrastructure systems providing specific services, Ganglia to monitor the mass of computational systems. Then the job scheduler, such as SGE, can be used to vet out the 'sick' computational nodes via its resource monitors.
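If the scheduler is the one vetting sick nodes, one way to surface a health value in SGE is a custom load sensor. Below is a minimal sketch based on my understanding of the load-sensor protocol (block on stdin, print begin/value/end per request, exit on "quit"); the health test and the "node_healthy" complex name are made up, and the complex would still need to be defined and attached to a queue threshold, which is not shown.

```python
#!/usr/bin/env python
# Sketch of an SGE load sensor reporting a custom node-health value.
# The "/scratch is mounted and writable" test and the "node_healthy"
# complex name are assumptions for illustration only.

import os
import socket
import sys

HOST = socket.gethostname()

def node_is_healthy():
    # Placeholder health test; a real sensor would check whatever
    # "sick" means on your cluster (filesystems, memory errors, ...).
    return os.path.ismount("/scratch") and os.access("/scratch", os.W_OK)

# Load-sensor loop: execd writes a line to stdin each collection interval;
# "quit" ends the sensor, anything else triggers one report.
while True:
    line = sys.stdin.readline()
    if not line or line.strip() == "quit":
        break
    print("begin")
    print(f"{HOST}:node_healthy:{1 if node_is_healthy() else 0}")
    print("end")
    sys.stdout.flush()
```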
Maybe this will be of interest:

http://dnx.sourceforge.net/about.html
I was thinking about that. For machines with service processors (ILOM) or other such out-of-band options, Nagios could work quite well, and then we could utilize Ganglia for the other nodes. The only issue with this approach is how to tie both monitoring tools together into a combined view of the cluster. This isn't insurmountable, just an added challenge.

--
"A simile is not a lie, unless it is a bad simile."
- Christopher John Francis Boone
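One way that combined view could be stitched together (purely a sketch, not something in the stack today): gmetad publishes its whole tree as XML over TCP, so a small script can translate Ganglia's view of the compute nodes into Nagios passive host-check results. The port, the command-file path, and the "down if no heartbeat for 4×TMAX" rule below are assumptions; Nagios would also need passive host checks enabled for this to do anything.

```python
#!/usr/bin/env python
# Sketch: feed Ganglia's host list into Nagios as passive host-check results.
# Assumed: gmetad on localhost:8651, the default Nagios command-file path,
# and "down if TN exceeds 4*TMAX" as the liveness rule.

import socket
import time
import xml.etree.ElementTree as ET

GMETAD = ("localhost", 8651)                      # gmetad XML port (assumed default)
CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"  # Nagios external command file (assumed path)

def fetch_ganglia_xml():
    # gmetad dumps its full XML tree and then closes the connection.
    chunks = []
    with socket.create_connection(GMETAD) as s:
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def main():
    tree = ET.fromstring(fetch_ganglia_xml())
    now = int(time.time())
    # The command file is a FIFO; it only exists while Nagios is running.
    with open(CMD_FILE, "a") as cmd:
        for host in tree.iter("HOST"):
            name = host.get("NAME")
            tn = int(host.get("TN", "0"))       # seconds since last heartbeat
            tmax = int(host.get("TMAX", "20"))  # expected heartbeat interval
            state = 0 if tn < 4 * tmax else 1   # 0 = UP, 1 = DOWN
            output = f"ganglia heartbeat {tn}s ago"
            cmd.write(f"[{now}] PROCESS_HOST_CHECK_RESULT;{name};{state};{output}\n")

if __name__ == "__main__":
    main()
```

Run from cron or a Nagios check on the head node, something like this would at least keep the Nagios view of the compute nodes in sync with what Ganglia already knows, without adding another poller to each node.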
Thanks very much for this link, I'll have to investigate it further. On first glance, this seems to be a tool to create a tree of Nagios servers (kind of like the scalable unit idea I mentioned before).

The interesting thing would be to really see the impact on each of these "DNX Worker" servers. Do you happen to know of any other installs of DNX (other than LDS)?

Thanks again.

--
"A simile is not a lie, unless it is a bad simile."
- Christopher John Francis Boone
Mike,

This is a good idea. I know that Rocks utilizes Ganglia for what you have described, across over 1150 registered clusters worldwide: http://www.rocksclusters.org/rocks-register/

A good number of these clusters also utilize Nagios, and I believe some work has been done on a Nagios roll for Rocks as well. So there is a sizable base of people already using these tools, particularly Ganglia, for this kind of monitoring.

Larry
Makia,

I just happened to stumble onto this tool. On closer look, it seems to be quite new and still in alpha, so I don't think it is widely used.
Fits nicely then ;-) It's good to get on the bandwagon early...
Mike Berg wrote:
> Nagios to monitor the infrastructure systems providing specific
> services, Ganglia to monitor the mass of computational systems. Then
> the job scheduler, such as SGE, can be used to vet out the 'sick'
> computational nodes via its resource monitors.

I think you nailed it here: the scheduler/resource manager already does its own monitoring (with the associated OS jitter); adding multiple additional monitoring systems just makes the situation worse. What you really need is to better leverage the monitoring that is already happening to reduce the duplication of effort (especially for the in-band monitoring on the node), yet make the information available where it is needed.

Makia Minich wrote:
> In the case of 10K nodes, how long will it take for the same node to be
> checked twice; or three times; how long will it take before we find out
> that it's down?

Nagios is routinely used to monitor 1000+ nodes, especially for simple up/down reporting. Architecturally, if you are going to build a system from scratch, I believe polling is the correct approach for monitoring. It makes it easier to dynamically adjust polling intervals and to minimize jitter: use IP multicast to drive the collection, which gives you "free" synchronization (you just need to handle the thundering-herd responses), and scale by breaking the system into multiple groups listening on different multicast addresses.

Kevin
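To make the multicast-driven collection concrete, here is a minimal sketch of the idea; it is an illustration only, not Kevin's design or an existing tool, and the group address, ports, 5-second collection window, and 0-2 second jitter are arbitrary choices. The collector sends one datagram to a multicast group, and each node waits a small random delay before replying, which spreads out the thundering herd.

```python
#!/usr/bin/env python
# Sketch of multicast-driven collection.  Group address, ports, timing,
# and the "load average" payload are all assumptions for illustration.

import random
import socket
import struct
import sys
import time

GROUP, POLL_PORT, REPLY_PORT = "239.1.1.1", 5000, 5001

def collector():
    # Send one "report now" datagram to the group, then gather replies.
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    out.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    out.sendto(b"report", (GROUP, POLL_PORT))

    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("", REPLY_PORT))
    rx.settimeout(5.0)  # collection window
    try:
        while True:
            data, addr = rx.recvfrom(4096)
            print(addr[0], data.decode())
    except socket.timeout:
        pass  # window closed; nodes that never answered are "down" candidates

def node(collector_addr):
    # Join the group and answer each poll after a small random delay,
    # so 10K nodes don't all reply in the same instant.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", POLL_PORT))
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        sock.recvfrom(1500)
        time.sleep(random.uniform(0, 2.0))  # spread out the herd
        load = open("/proc/loadavg").read().split()[0]
        status = f"{socket.gethostname()} load={load}"
        sock.sendto(status.encode(), (collector_addr, REPLY_PORT))

if __name__ == "__main__":
    if len(sys.argv) > 2 and sys.argv[1] == "node":
        node(sys.argv[2])
    else:
        collector()
```

Splitting the cluster across several multicast groups, as suggested above, would just mean running one collection pass per group.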