Kyle Mallory
2012-Jan-27 16:52 UTC
[Puppet Users] Puppet agents stop reporting after master runs out of disk space...
I am experiencing a curious event and wondering if others have seen this. I also have a question related to it.

Today I noticed in my Puppet summary report from Foreman that 60 of my 160 hosts all stopped reporting at nearly the same time this morning, and have not resumed since. Investigating, it appears that my puppetmaster temporarily ran out of disk space on the /var volume, probably in part due to logging. I have log rotation running, which eventually freed up some disk space, but the 60 hosts have not resumed reporting.

If I dig into the logs on one of the failing agents, there are no messages from puppet past 4am (here is a snippet of my logs):

Jan 27 02:44:25 kmallory3 puppet-agent[15340]: Using cached catalog
Jan 27 02:44:25 kmallory3 puppet-agent[15340]: Could not retrieve catalog; skipping run
Jan 27 03:14:30 kmallory3 puppet-agent[15340]: Could not retrieve catalog from remote server: Error 400 on SERVER: No space left on device - /var/lib/puppet/yaml/facts/kmallory3.xxx.xxx.xxx.yaml
Jan 27 03:14:30 kmallory3 puppet-agent[15340]: Using cached catalog
Jan 27 03:14:30 kmallory3 puppet-agent[15340]: Could not retrieve catalog; skipping run
Jan 27 03:47:30 kmallory3 puppet-agent[15340]: Could not retrieve plugin: execution expired
Jan 27 04:01:02 kmallory3 puppet-agent[15340]: Could not retrieve catalog from remote server: execution expired
Jan 27 04:01:02 kmallory3 puppet-agent[15340]: Using cached catalog
Jan 27 04:01:02 kmallory3 puppet-agent[15340]: Could not retrieve catalog; skipping run

Forcing a run of puppet, I get the following message:

kmallory3:/var/log# puppetd --onetime --test
notice: Ignoring --listen on onetime run
notice: Run of Puppet configuration client already in progress; skipping

After stopping and restarting the puppet service, the agent started running properly. It appears that the failure on the server caused the agent to hang, and it was not able to recover gracefully. Has anyone experienced this before? We are running 2.6.1 on the large majority of our hosts, including this one. Many failed, but two-thirds keep running properly.

Now, on to my question: does anyone have bright ideas for how I could force Puppet to restart itself on 60 machines when Puppet isn't running? I'm not really excited by the prospect of logging into 60 machines and running a sudo command... sigh.

--Kyle
Denmat
2012-Jan-27 21:53 UTC
Re: [Puppet Users] Puppet agents stop reporting after master runs out of disk space...
Hi,

Puppet's sister project, MCollective, would do it. An alternative would be something like Rundeck.

Den
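For concreteness, this is roughly what the MCollective route looks like once its daemon is running on the agents. It assumes the service agent plugin is installed, and the exact sub-command syntax varies between plugin versions, so treat it as a sketch rather than the canonical invocation:

# From a management node: see which nodes answer, then bounce the puppet
# service everywhere. Assumes the mcollective service agent plugin; syntax
# is from memory, so check the plugin's own help output.
mco ping
mco service puppet restart

The point is that mcollectived is a separate daemon from the puppet agent, so it stays reachable even when puppet itself is wedged.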
Christopher Wood
2012-Jan-27 22:14 UTC
Re: [Puppet Users] Puppet agents stop reporting after master runs out of disk space...
While you're logging into every host to install mcollective, there are some other things to think about (all easily puppetizable):

- remote syslogging, so that lots of logs don't cause application hosts to clot
- file system monitoring for your hosts, so you get an alert before things fill up
- trend analysis (graphs) on the hosts, so you get an alert when something's trending toward full (by inode as well, depending on the host)
- something monitoring critical processes, so that if they stop responding it'll restart them (here I plug monit for simplicity's sake, but snmp agents and similar tools can do this too)
- something monitoring the logs, which can alarm when something is absent or present when it shouldn't be

As to your immediate problem, try an ssh loop if you can run init scripts via sudo. Use -t so that sudo will have a tty. For security's sake you'll have to enter your password 60 times, but the experience will incentivize you to monitor for this problem.

cat <<XX >/tmp/h
host1
host2
XX

for h in `cat /tmp/h`; do ssh -t $h sudo /etc/init.d/puppet restart; done

Good luck.
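To save typing the 60 hostnames into /tmp/h by hand, the master already has a usable record of who has gone quiet: the facts cache files mentioned in the error above (/var/lib/puppet/yaml/facts/<host>.yaml) are rewritten each time an agent checks in. A rough sketch, assuming that default yamldir layout and GNU find; tune the age threshold to your run interval:

# On the puppetmaster: list nodes whose cached facts haven't been updated
# in the last 4 hours (i.e. the likely stuck agents), one hostname per line.
find /var/lib/puppet/yaml/facts -name '*.yaml' -mmin +240 \
  | xargs -n1 basename \
  | sed 's/\.yaml$//' > /tmp/h

Feed that file into the ssh loop above and you hit exactly the hosts that stopped checking in.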
Kyle Mallory
2012-Jan-30 18:05 UTC
Re: [Puppet Users] Puppet agents stop reporting after master runs out of disk space...
Thanks guys. I'll check out mcollective. Yeah, the root password 60 times is a bit painful, but the ssh loop would help. If I remember right, there is an API/REST call for Foreman that will give me a list of the hosts that are not responsive.

The problem here is that puppet was in memory and running. It just wasn't responsive, perhaps waiting for something to happen that never did. So checks for the process (monit/snmp/pgrep, etc.) would say that puppet is fine.

Are there any more bullet-proof ways of watch-dogging Puppet specifically? Could we kill the process if catalog locks are more than 30 minutes old? Or are locks on the catalog even a reality? Is this something Puppet could do on its own, in a separate thread, or does it need a new process? I'm just throwing out an idea or two. Unfortunately, I lack a deep enough understanding of Puppet internals to know whether I'm barking up the wrong tree.

--Kyle
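(Locks of that sort do exist: if memory serves, the 2.6 agent holds an on-disk run lock, by default /var/lib/puppet/state/puppetdlock, while a run is supposedly in progress, which is also what produces the "Run of Puppet configuration client already in progress" notice above. That makes a dumb external watchdog along these lines easy to sketch; the path and threshold here are assumptions, and this is a sketch rather than a tested implementation:

#!/bin/sh
# Hypothetical watchdog, run from cron every 10 minutes or so on each agent.
# Assumes the Puppet 2.6 default lockfile location; adjust to your statedir.
LOCK=/var/lib/puppet/state/puppetdlock
if [ -f "$LOCK" ] && [ -n "$(find "$LOCK" -mmin +30)" ]; then
    logger -t puppet-watchdog "agent run lock older than 30 minutes, restarting puppet"
    /etc/init.d/puppet restart
fi

Because it runs outside the agent process, it keeps working even when the Ruby VM itself is stuck, which a thread inside the agent would not.)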
R.I.Pienaar
2012-Jan-30 18:09 UTC
Re: [Puppet Users] Puppet agents stop reporting after master runs out of disk space...
Times I've seen this happen is when the network connection to the master dies at just the right (wrong) time, so the Ruby VM gets stuck on blocking IO that it can never recover from. So a supervisor thread won't do - it would also be blocked.

I've written a monitor script for Puppet that uses the new last_run_summary.yaml file to figure out whether puppet has run recently, and I monitor that with nagios and nrpe, so at least I know when this happens:

https://github.com/ripienaar/monitoring-scripts/blob/master/puppet/check_puppet.rb
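For anyone who just wants the gist of that check without pulling in the Ruby script, the core idea is a freshness test on the summary file. A minimal shell equivalent, assuming the default statedir path and a 30-minute run interval, with Nagios-style exit codes (a sketch of the idea, not the linked script itself):

#!/bin/sh
# Alert if the agent hasn't completed a run recently; wire up via nrpe.
SUMMARY=/var/lib/puppet/state/last_run_summary.yaml
MAX_AGE_MIN=120
if [ ! -f "$SUMMARY" ]; then
    echo "CRITICAL: $SUMMARY not found - has puppet ever run?"
    exit 2
fi
if [ -n "$(find "$SUMMARY" -mmin +$MAX_AGE_MIN)" ]; then
    echo "CRITICAL: puppet has not completed a run in over $MAX_AGE_MIN minutes"
    exit 2
fi
echo "OK: puppet completed a run within the last $MAX_AGE_MIN minutes"
exit 0

Note that an mtime check like this only catches agents that stop running entirely; it says nothing about runs that complete with failures.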