Hi All,

In my set-up, I've got a cron job that triggers a Puppet run every 20
minutes. I've found that on approximately 13 nodes (out of 166), puppetd
just hangs. I have to go in, kill the process, remove
/var/lib/puppet/state/puppetdlock, and run Puppet again, and then it's
fine.

After a while it just hangs again, so I have to go in, kill the process,
and so on.

Any ideas?

Thanks!
Gonzalo
If you are like me, the problem is that the Ruby for your platform
sucks. The webstack Ruby 1.8.7 for Solaris 10 has a nasty tendency to
hang (for the daemons) and core dump for individual runs. Individual
runs out of a crontab are the most reliable way I've found to make it
all work.

On Tue, Feb 7, 2012 at 7:11 PM, Gonzalo Servat <gservat@gmail.com> wrote:
> Hi All,
>
> In my set-up, I've got a cron job that triggers a Puppet run every 20
> minutes. I've found that on approximately 13 nodes (out of 166),
> puppetd just hangs. I have to go in, kill the process, remove
> /var/lib/puppet/state/puppetdlock, and run Puppet again, and then it's
> fine.
>
> After a while it just hangs again, so I have to go in, kill the
> process, and so on.
>
> Any ideas?
>
> Thanks!
> Gonzalo
On Wed, Feb 8, 2012 at 3:25 PM, Brian Gallew <geek@gallew.org> wrote:
> If you are like me, the problem is that the Ruby for your platform
> sucks. The webstack Ruby 1.8.7 for Solaris 10 has a nasty tendency to
> hang (for the daemons) and core dump for individual runs. Individual
> runs out of a crontab are the most reliable way I've found to make it
> all work.

This is ruby-1.8.7.299-7.el6_1.1 and I am running Puppet out of crontab,
but it's still hanging frequently. Right about now it has hung again on
several nodes.

Any ideas?

- Gonzalo
On Tue, Feb 7, 2012 at 23:56, Gonzalo Servat <gservat@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 3:25 PM, Brian Gallew <geek@gallew.org> wrote:
>> If you are like me, the problem is that the Ruby for your platform
>> sucks. The webstack Ruby 1.8.7 for Solaris 10 has a nasty tendency to
>> hang (for the daemons) and core dump for individual runs. Individual
>> runs out of a crontab are the most reliable way I've found to make it
>> all work.
>
> This is ruby-1.8.7.299-7.el6_1.1 and I am running Puppet out of
> crontab, but it's still hanging frequently. Right about now it has
> hung again on several nodes.
>
> Any ideas?

Red Hat released some updated kernels that reintroduced a bug from the
2.6.13 Linux kernel. You can run any of the code in this gist to check
whether your kernel suffers from it: https://gist.github.com/441278

The C code is obviously a pretty good choice, as it excludes Ruby
entirely from the problem space, and will confirm whether that is your
root cause.

(The bug is that select() on a file in /proc hangs for a long time,
possibly forever, and Ruby will use select() on a file if there are
enough handles open. This happens in some daemon configurations.)

--
Daniel Pittman ⎋ Puppet Labs Developer – http://puppetlabs.com
♲ Made with 100 percent post-consumer electrons
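In case the gist ever disappears: the following is a minimal sketch of
the kind of check it describes, reconstructed from the description above
rather than copied from the gist. The default path /proc/uptime is an
assumption; the gist's actual test apparently names a disk device, hence
the sda/vda substitution mentioned in the next message. On a healthy
kernel, select() on an ordinary file returns immediately (regular files
are always readable); on an affected kernel it can block for the full
timeout or longer.

/* Sketch of a select-on-/proc check in the spirit of the gist above.
 * A reconstruction under stated assumptions, not the gist's code. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/proc/uptime";
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return 1;
    }

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    struct timeval tv = { 5, 0 };   /* generous five-second timeout */

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    int rc = select(fd + 1, &rfds, NULL, NULL, &tv); /* should not block */
    gettimeofday(&t1, NULL);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("select() returned %d after %.3f seconds: %s\n",
           rc, elapsed, elapsed < 1.0 ? "good" : "bad");
    close(fd);
    return elapsed < 1.0 ? 0 : 2;
}

Compile with something like "cc -o select-check select-check.c" and run
it against the /proc path the gist names.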
On Thu, Feb 9, 2012 at 5:44 AM, Daniel Pittman <daniel@puppetlabs.com> wrote:
> Red Hat released some updated kernels that reintroduced a bug from the
> 2.6.13 Linux kernel. You can run any of the code in this gist to check
> whether your kernel suffers from it: https://gist.github.com/441278
>
> The C code is obviously a pretty good choice, as it excludes Ruby
> entirely from the problem space, and will confirm whether that is your
> root cause.
>
> (The bug is that select() on a file in /proc hangs for a long time,
> possibly forever, and Ruby will use select() on a file if there are
> enough handles open. This happens in some daemon configurations.)

Hi Daniel,

I tried the C code (with vda instead of sda, as this is a VM using
virtio) and the result matched the "good" section of the URL you pasted.

Stracing a hung puppetd run, I see an endless stream of these:

select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
gettimeofday({1328740663, 962461}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0

The process looks like this:

/usr/bin/ruby /usr/sbin/puppetd --pluginsync --ignorecache
--no-usecacheonfailure --onetime --no-daemonize --logdest syslog
--environment=production --server=puppet-server --report

Any other ideas?

- Gonzalo
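A note on reading that trace (an interpretation, not something stated in
the thread): select() called with an nfds of 0 and all three descriptor
sets NULL watches nothing at all; it is simply a portable timed sleep,
and it is the idiom MRI Ruby 1.8's green-thread scheduler uses to idle.
A loop of one-second select() timeouts therefore suggests the
interpreter is alive and polling for a runnable thread, rather than
blocked in a single kernel call. A trivial C illustration of the
select-as-sleep idiom:

/* select() with no descriptor sets is a pure timed sleep.  Each call
 * here sleeps one second and returns 0 (timeout), matching the
 * select(0, NULL, NULL, NULL, {1, 0}) = 0 lines in the strace above. */
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

int main(void)
{
    int i;
    for (i = 0; i < 3; i++) {
        struct timeval tv = { 1, 0 };              /* {1, 0}: one second */
        int rc = select(0, NULL, NULL, NULL, &tv); /* no fds watched */
        printf("select returned %d (timeout)\n", rc);
    }
    return 0;
}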
On Wed, Feb 8, 2012 at 14:40, Gonzalo Servat <gservat@gmail.com> wrote:
> I tried the C code (with vda instead of sda, as this is a VM using
> virtio) and the result matched the "good" section of the URL you
> pasted.
>
> Stracing a hung puppetd run, I see an endless stream of these:
>
> select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
> gettimeofday({1328740663, 962461}, NULL) = 0
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [..snip..]

Damn. Well, at least we eliminated one possible cause. Is there any
chance you can run with `--debug` enabled on one of the failing
machines, and see if that points to the right place? Otherwise we have
to start to get into some fairly heavy ways to figure out what is going
on. We can't trivially reproduce this in-house, though we will keep
trying.

--
Daniel Pittman ⎋ Puppet Labs Developer – http://puppetlabs.com
♲ Made with 100 percent post-consumer electrons
> Damn. Well, at least we eliminated one possible cause. Is there any
> chance you can run with `--debug` enabled on one of the failing
> machines, and see if that points to the right place? Otherwise we
> have to start to get into some fairly heavy ways to figure out what
> is going on.

OK, I'm now running it with --debug into separate log files, to compare
working and non-working runs. Unfortunately the hung Puppet doesn't seem
to reveal anything interesting in the logs. A working Puppet run looks
like this:

[..stuff..]
debug: Finishing transaction 70131030874760
debug: Loaded state in 0.01 seconds
info: Retrieving plugin
debug: file_metadata supports formats: b64_zlib_yaml marshal pson raw yaml; using pson
debug: Using cached certificate for ca
debug: Using cached certificate for mtsldrp118.sirca.org.au
debug: Using cached certificate_revocation_list for ca
debug: Finishing transaction 70131030519320
info: Loading facts in /var/lib/puppet/lib/facter/server_class.rb
[...more custom facts loading...]
debug: catalog supports formats: b64_zlib_yaml dot marshal pson raw yaml; using pson
debug: Puppet::Type::Package::ProviderRpm: Executing '/bin/rpm --version'
debug: Puppet::Type::Package::ProviderAptrpm: Executing '/bin/rpm -ql rpm'
debug: Puppet::Type::Package::ProviderYum: Executing '/bin/rpm --version'
[..etc..]

A broken Puppet run shows:

[..stuff..]
debug: /File[/var/lib/puppet/state]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/clientbucket]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/client_data]: Autorequiring File[/var/lib/puppet]
debug: Finishing transaction 69910666048880
debug: /File[/var/lib/puppet/lib]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/state]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/certs]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/ssl]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/private]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/facts]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/crl.pem]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/ssl/certs/ca.pem]: Autorequiring File[/var/lib/puppet/ssl/certs]
debug: Finishing transaction 69910666553940
debug: Using cached certificate for ca
debug: Using cached certificate for puppetclient.mydomain
debug: Finishing transaction 69910665891720
debug: Loaded state in 0.01 seconds
info: Retrieving plugin
debug: file_metadata supports formats: b64_zlib_yaml marshal pson raw yaml; using pson
debug: Using cached certificate for ca
debug: Using cached certificate for puppetclient.mydomain
debug: Using cached certificate_revocation_list for ca
debug: Finishing transaction 69910665535980

That's it. Nothing else in the output. Strace on the puppetd process
shows repetitions of what I pasted in an earlier email:

select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
gettimeofday({1328767567, 900875}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
gettimeofday({1328767567, 901663}, NULL) = 0

Would appreciate any suggestions you have on this.

Regards,
Gonzalo
On Thu, Feb 9, 2012 at 5:08 PM, Gonzalo Servat <gservat@gmail.com> wrote:
>> Damn. Well, at least we eliminated one possible cause. Is there any
>> chance you can run with `--debug` enabled on one of the failing
>> machines, and see if that points to the right place? Otherwise we
>> have to start to get into some fairly heavy ways to figure out what
>> is going on.
>
> OK, I'm now running it with --debug into separate log files, to
> compare working and non-working runs. Unfortunately the hung Puppet
> doesn't seem to reveal anything interesting in the logs. A working
> Puppet run looks like this:

[..snip..]

Hi Daniel,

I'm seeing an increasing number of nodes with Puppet hangs; puppetd now
just hangs on 26 nodes. Any ideas on what I can try?

I've tried removing all the Puppet configuration for the hanging nodes,
but it doesn't help, so it looks like a client-side problem rather than
something in the catalog that gets applied to them.

The kernel on the hanging nodes is 2.6.32-131.17.1.el6.x86_64. It would
be nice if all the nodes running that kernel had this problem, but
unfortunately there are other nodes using the same kernel that are not
hanging.

Thanks in advance.
Gonzalo
On Thu, Feb 9, 2012 at 17:19, Gonzalo Servat <gservat@gmail.com> wrote:
> I'm seeing an increasing number of nodes with Puppet hangs; puppetd
> now just hangs on 26 nodes. Any ideas on what I can try?
>
> I've tried removing all the Puppet configuration for the hanging
> nodes, but it doesn't help, so it looks like a client-side problem
> rather than something in the catalog that gets applied to them.

Sorry for not getting back to this sooner. If you are running 2.7.10,
can you try removing the file
`puppet/util/instrumentation/listeners/process_name.rb` and see if that
fixes the problem? We have some reports that it can cause hangs, and
eliminating it will make sure this doesn't descend from there.

Otherwise we really have to start getting into more aggressive
debugging. Would you be comfortable doing some hacking / patching of
the code to narrow this down, and/or installing some development tools
on one of the nodes that triggers the hang?

--
Daniel Pittman ⎋ Puppet Labs Developer – http://puppetlabs.com
♲ Made with 100 percent post-consumer electrons
On Wed, Feb 15, 2012 at 11:02 AM, Daniel Pittman <daniel@puppetlabs.com> wrote:
> Sorry for not getting back to this sooner. If you are running 2.7.10,
> can you try removing the file
> `puppet/util/instrumentation/listeners/process_name.rb` and see if
> that fixes the problem?

No worries, Daniel. Yes, it did fix the problem, and I did actually
raise a bug on this (to avoid doubling up work):
http://projects.puppetlabs.com/issues/12588

- Gonzalo
On Tue, Feb 14, 2012 at 16:28, Gonzalo Servat <gservat@gmail.com> wrote:
> On Wed, Feb 15, 2012 at 11:02 AM, Daniel Pittman <daniel@puppetlabs.com>
> wrote:
>> Sorry for not getting back to this sooner. If you are running 2.7.10,
>> can you try removing the file
>> `puppet/util/instrumentation/listeners/process_name.rb` and see if
>> that fixes the problem?
>
> No worries, Daniel. Yes, it did fix the problem, and I did actually
> raise a bug on this (to avoid doubling up work):
> http://projects.puppetlabs.com/issues/12588

Oh, awesome. This is why the bug system works better than just email -
someone else noticed and fixed it up. :)

That code will be gone in the next release, and won't return until it is
better behaved.

--
Daniel Pittman ⎋ Puppet Labs Developer – http://puppetlabs.com
♲ Made with 100 percent post-consumer electrons