Jo Rhett
2011-Dec-02 18:58 UTC
[Puppet Users] puppet master under passenger locks up completely
I came in this morning to find all the servers all locked up solid:

# passenger-status
----------- General information -----------
max      = 20
count    = 20
active   = 20
inactive = 0
Waiting on global queue: 236

----------- Domains -----------
/etc/puppet/rack:
  PID: 2720    Sessions: 1    Processed: 939    Uptime: 9h 22m 18s
  PID: 1615    Sessions: 1    Processed: 947    Uptime: 9h 23m 14s
  PID: 1596    Sessions: 1    Processed: 607    Uptime: 9h 23m 15s
  PID: 1722    Sessions: 1    Processed: 953    Uptime: 9h 23m 9s
  PID: 2218    Sessions: 1    Processed: 378    Uptime: 9h 22m 43s
  PID: 4286    Sessions: 1    Processed: 178    Uptime: 8h 50m 58s
  PID: 5749    Sessions: 1    Processed: 708    Uptime: 8h 20m 20s
  PID: 4253    Sessions: 1    Processed: 820    Uptime: 8h 51m 1s
  PID: 5624    Sessions: 1    Processed: 126    Uptime: 8h 20m 24s
  PID: 7328    Sessions: 1    Processed: 811    Uptime: 7h 49m 17s
  PID: 7274    Sessions: 1    Processed: 984    Uptime: 7h 49m 20s
  PID: 8761    Sessions: 1    Processed: 85     Uptime: 7h 18m 50s
  PID: 9135    Sessions: 1    Processed: 907    Uptime: 7h 16m 27s
  PID: 8777    Sessions: 1    Processed: 342    Uptime: 7h 18m 49s
  PID: 10508   Sessions: 1    Processed: 51     Uptime: 6h 47m 6s
  PID: 10853   Sessions: 1    Processed: 603    Uptime: 6h 43m 9s
  PID: 10620   Sessions: 1    Processed: 939    Uptime: 6h 45m 52s
  PID: 11438   Sessions: 1    Processed: 870    Uptime: 6h 30m 8s
  PID: 12582   Sessions: 1    Processed: 448    Uptime: 6h 9m 59s
  PID: 12670   Sessions: 1    Processed: 400    Uptime: 6h 8m 46s

For comparison, most of our server processes recycle within 20 minutes normally, as they hit 1000 really fast.

# you probably want to tune these settings
PassengerHighPerformance on
PassengerUseGlobalQueue on
PassengerMaxPoolSize 20
PassengerPoolIdleTime 1800
PassengerMaxRequests 1000
#PassengerStatThrottleRate 120
RackAutoDetect Off
RailsAutoDetect Off

There is nothing useful in the system logs. They just stopped:

Dec 2 12:06:34 axxats003 puppet-master[12670]: Compiled catalog for axxamx001.sjc.company.com in environment production in 1.76 seconds
Dec 2 12:06:37 axxats003 puppet-master[12670]: Compiled catalog for axxatn016.sjc.company.com in environment production in 1.64 seconds
Dec 2 12:06:40 axxats003 puppet-master[12670]: Compiled catalog for axaafc001.company.com in environment production in 1.70 seconds
Dec 2 14:10:02 axxats003 puppet-agent[16965]: Reopening log files
Dec 2 14:10:02 axxats003 puppet-agent[16965]: Starting Puppet client version 2.6.12
Dec 2 14:12:04 axxats003 puppet-agent[16965]: Could not retrieve catalog from remote server: execution expired
Dec 2 14:12:04 axxats003 puppet-agent[16965]: Using cached catalog

(every 30 minutes puppet agent says the same thing until I restart the puppet master)

Dec 2 18:06:09 axxats003 puppet-master[25783]: Starting Puppet master version 2.6.12
Dec 2 18:06:10 axxats003 puppet-master[25802]: Starting Puppet master version 2.6.12
Dec 2 18:06:11 axxats003 puppet-master[25831]: Starting Puppet master version 2.6.12
Dec 2 18:06:12 axxats003 puppet-master[25864]: Starting Puppet master version 2.6.12
Dec 2 18:06:13 axxats003 puppet-master[25897]: Starting Puppet master version 2.6.12
Dec 2 18:06:14 axxats003 puppet-master[25922]: Starting Puppet master version 2.6.12
Dec 2 18:06:15 axxats003 puppet-master[25947]: Starting Puppet master version 2.6.12
Dec 2 18:06:16 axxats003 puppet-master[25972]: Starting Puppet master version 2.6.12
Dec 2 18:06:17 axxats003 puppet-master[25997]: Starting Puppet master version 2.6.12
Dec 2 18:06:18 axxats003 puppet-master[26019]: Starting Puppet master version 2.6.12
Dec 2 18:06:19 axxats003 puppet-master[26056]: Starting Puppet master version 2.6.12
Dec 2 18:06:20 axxats003 puppet-master[26081]: Starting Puppet master version 2.6.12
Dec 2 18:06:21 axxats003 puppet-master[26115]: Starting Puppet master version 2.6.12
Dec 2 18:14:32 axxats003 puppet-master[26115]: Compiled catalog for axxatn018.sjc.company.com in environment production in 3.63 seconds
Dec 2 18:14:37 axxats003 puppet-master[26115]: Compiled catalog for axxamb002.sjc.company.com in environment production in 1.47 seconds
Dec 2 18:14:50 axxats003 puppet-master[26115]: Compiled catalog for axxasn001.sjc.company.com in environment production in 1.57 seconds

There are no other messages in /var/log/messages -- the system was otherwise not busy. Apache error log only observed max clients get hit:

[Fri Dec 02 08:42:43 2011] [notice] Apache/2.2.3 (CentOS) configured -- resuming normal operations
[Fri Dec 02 12:23:46 2011] [error] server reached MaxClients setting, consider raising the MaxClients setting
[Fri Dec 02 18:06:07 2011] [notice] caught SIGTERM, shutting down
[Fri Dec 02 18:06:08 2011] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Fri Dec 02 18:06:08 2011] [warn] RSA server certificate CommonName (CN) `puppetmaster.company.com' does NOT match server name!?
[Fri Dec 02 18:06:08 2011] [notice] Digest: generating secret for digest authentication ...
[Fri Dec 02 18:06:08 2011] [notice] Digest: done
[Fri Dec 02 18:06:08 2011] [warn] RSA server certificate CommonName (CN) `puppetmaster.company.com' does NOT match server name!?
[Fri Dec 02 18:06:08 2011] [notice] Apache/2.2.3 (CentOS) configured -- resuming normal operations

--
Jo Rhett
jrhett@company.com
(415) 999-1798

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness

--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To post to this group, send email to puppet-users@googlegroups.com.
To unsubscribe from this group, send email to puppet-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/puppet-users?hl=en.
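The saturation visible in the passenger-status output of the first message (active equal to max, with a non-empty global queue) can be checked mechanically. The following is a minimal sketch, assuming the text layout of passenger-status shown in this thread; other Passenger versions may print a different format. The helper names parse_general_info and is_saturated are hypothetical, not part of Passenger.

```python
import re

def parse_general_info(status_text):
    """Extract max/count/active/inactive and the global queue length."""
    info = {}
    for key in ("max", "count", "active", "inactive"):
        # Anchor at line start so "active" does not match inside "inactive".
        match = re.search(r"^\s*%s\s*=\s*(\d+)" % key, status_text, re.MULTILINE)
        if match:
            info[key] = int(match.group(1))
    match = re.search(r"Waiting on global queue:\s*(\d+)", status_text)
    if match:
        info["queue"] = int(match.group(1))
    return info

def is_saturated(info):
    """True when every pool slot is busy and requests are queueing."""
    if "max" not in info or "active" not in info:
        return False
    return info["active"] >= info["max"] and info.get("queue", 0) > 0

# General information block copied from the first message in this thread.
SAMPLE = """\
----------- General information -----------
max = 20
count = 20
active = 20
inactive = 0
Waiting on global queue: 236
"""

info = parse_general_info(SAMPLE)
print(info, is_saturated(info))
```

In practice the text would come from running passenger-status on the master (for example from cron), with an alert raised when is_saturated returns True.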
Jo Rhett
2011-Dec-02 21:03 UTC
[Puppet Users] Re: puppet master under passenger locks up completely
Okay, this has happened again. Puppet master stopped logging catalog compiles, every server stopped returning results and the global queue went quickly through the roof in like 9 minutes. It appears puppet master is stopping dead in its tracks without logging any errors.

# passenger-status
----------- General information -----------
max      = 20
count    = 20
active   = 20
inactive = 0
Waiting on global queue: 209

----------- Domains -----------
/etc/puppet/rack:
  PID: 25783   Sessions: 1    Processed: 329    Uptime: 2h 52m 7s
  PID: 25831   Sessions: 1    Processed: 4      Uptime: 2h 52m 5s
  PID: 28517   Sessions: 1    Processed: 6      Uptime: 2h 22m 0s
  PID: 25802   Sessions: 1    Processed: 714    Uptime: 2h 52m 6s
  PID: 30905   Sessions: 1    Processed: 13     Uptime: 1h 50m 27s
  PID: 25864   Sessions: 1    Processed: 709    Uptime: 2h 52m 4s
  PID: 31028   Sessions: 1    Processed: 347    Uptime: 1h 50m 21s
  PID: 28944   Sessions: 1    Processed: 377    Uptime: 2h 21m 50s
  PID: 31090   Sessions: 1    Processed: 266    Uptime: 1h 50m 18s
  PID: 577     Sessions: 1    Processed: 400    Uptime: 1h 27m 27s
  PID: 418     Sessions: 1    Processed: 647    Uptime: 1h 28m 2s
  PID: 1247    Sessions: 1    Processed: 133    Uptime: 1h 19m 3s
  PID: 1474    Sessions: 1    Processed: 52     Uptime: 1h 18m 9s
  PID: 594     Sessions: 1    Processed: 378    Uptime: 1h 27m 26s
  PID: 4706    Sessions: 1    Processed: 414    Uptime: 48m 5s
  PID: 4775    Sessions: 1    Processed: 218    Uptime: 47m 28s
  PID: 4854    Sessions: 1    Processed: 584    Uptime: 47m 23s
  PID: 7774    Sessions: 1    Processed: 165    Uptime: 14m 27s
  PID: 7902    Sessions: 1    Processed: 44     Uptime: 13m 44s
  PID: 8149    Sessions: 1    Processed: 541    Uptime: 11m 21s

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness
Nigel Kersten
2011-Dec-02 21:30 UTC
Re: [Puppet Users] Re: puppet master under passenger locks up completely
On Fri, Dec 2, 2011 at 1:03 PM, Jo Rhett <jrhett@netconsonance.com> wrote:
> Okay, this has happened again. Puppet master stopped logging catalog
> compiles, every server stopped returning results and the global queue went
> quickly through the roof in like 9 minutes. It appears puppet master is
> stopping dead in its tracks without logging any errors.

A really quick test would be to start a webrick puppetmaster on an alternate port with the same configuration file in debug mode and then puppet against it to see if there's a problem at that level,

(on master)
puppet master --no-daemonize --verbose --debug --masterport 9140 (for example)

(on an agent)
puppet agent --test --masterport 9140

If that doesn't show anything, let us know whether you're running Apache prefork or worker, and your relevant pool regulation settings like:

StartServers
MinSpareServers
MaxSpareServers
ServerLimit
MaxClients
MaxRequestsPerChild

--
Nigel Kersten
Product Manager, Puppet Labs
Jo Rhett
2011-Dec-02 22:22 UTC
Re: [Puppet Users] puppet master under passenger locks up completely
On Dec 2, 2011, at 1:30 PM, Nigel Kersten wrote:
> A really quick test would be to start a webrick puppetmaster on an alternate port with the same configuration file in debug mode and then puppet against it to see if there's a problem at that level,
>
> (on master)
> puppet master --no-daemonize --verbose --debug --masterport 9140 (for example)
>
> (on an agent)
> puppet agent --test --masterport 9140

This works perfectly fine.

> If that doesn't show anything, let us know whether you're running Apache prefork or worker, and your relevant pool regulation settings like:
>
> StartServers
> MinSpareServers
> MaxSpareServers
> ServerLimit
> MaxClients
> MaxRequestsPerChild

prefork with the following settings:

StartServers 8
MinSpareServers 5
MaxSpareServers 20
ServerLimit 256
MaxClients 256
MaxRequestsPerChild 4000

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness
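The prefork settings above line up with the first passenger-status reading in a way worth spelling out. This is an inference from the numbers in the thread, not something anyone in it confirmed: assuming each Apache child holds one request, and each in-flight request either occupies a Passenger worker or waits on the global queue, the queue cannot grow past MaxClients minus PassengerMaxPoolSize. The function name queue_ceiling is hypothetical.

```python
MAX_CLIENTS = 256          # Apache prefork MaxClients, from the settings above
PASSENGER_POOL_SIZE = 20   # PassengerMaxPoolSize, from the first message

def queue_ceiling(max_clients, pool_size):
    """Largest number of requests that can wait once every worker is busy.

    Assumption: one request per Apache child, and every request is either
    being served by a Passenger worker or waiting on the global queue.
    """
    return max(max_clients - pool_size, 0)

ceiling = queue_ceiling(MAX_CLIENTS, PASSENGER_POOL_SIZE)
print(ceiling)  # 236
```

Under that assumption, the 236 waiting requests in the first lockup would mean Apache itself was full, which fits the "server reached MaxClients setting" error logged at 12:23:46. The second lockup showed 209 waiting, so it was presumably caught before the ceiling was reached. Raising MaxClients would only lengthen the queue; it would not free the hung workers.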
Hm, you know, I don't think that it's a sudden lock of all 20 passenger clients. I think it's a slow lockup of various puppet sessions until all 20 are locked. Here's an example: every one of the "active" sessions below with an uptime longer than 30 minutes has had the same "Processed" number for more than 30 minutes at this time. So in theory, they've been processing the same session for more than 30 minutes. Somehow, I don't think so. I think those sessions are locked up. And what is happening is that eventually all 20 processes are hung and we are dead in the water.

Fri Dec 2 23:05:59 UTC 2011
----------- General information -----------
max = 20
count = 18
active = 12
inactive = 6
Waiting on global queue: 0

----------- Domains -----------
/etc/puppet/rack:
  PID: 21021 Sessions: 0 Processed: 362 Uptime: 5m 37s
  PID: 21005 Sessions: 0 Processed: 537 Uptime: 5m 38s
  PID: 21555 Sessions: 0 Processed: 69 Uptime: 30s
  PID: 21571 Sessions: 0 Processed: 62 Uptime: 29s
  PID: 20989 Sessions: 0 Processed: 209 Uptime: 5m 39s
  PID: 20968 Sessions: 0 Processed: 157 Uptime: 5m 41s
  PID: 9221 Sessions: 1 Processed: 903 Uptime: 2h 5m 55s
  PID: 9340 Sessions: 1 Processed: 764 Uptime: 2h 4m 58s
  PID: 10379 Sessions: 1 Processed: 568 Uptime: 1h 57m 37s
  PID: 11847 Sessions: 1 Processed: 712 Uptime: 1h 41m 13s
  PID: 11686 Sessions: 1 Processed: 314 Uptime: 1h 41m 19s
  PID: 10845 Sessions: 1 Processed: 511 Uptime: 1h 48m 52s
  PID: 11650 Sessions: 1 Processed: 747 Uptime: 1h 41m 21s
  PID: 14967 Sessions: 1 Processed: 84 Uptime: 1h 8m 28s
  PID: 17605 Sessions: 1 Processed: 497 Uptime: 44m 41s
  PID: 20342 Sessions: 1 Processed: 0 Uptime: 13m 14s
  PID: 20358 Sessions: 1 Processed: 54 Uptime: 13m 13s
  PID: 18098 Sessions: 1 Processed: 854 Uptime: 35m 46s

--
Jo Rhett
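The check described in the message above (a worker that stays busy while its "Processed" counter never moves) could be automated roughly as follows. This is only a sketch, assuming the passenger-status line format shown in this thread; the function names and the two sample snapshots (PIDs 101 to 104) are hypothetical and are not taken from the real server.

```python
import re

# Matches lines such as: "PID: 9221 Sessions: 1 Processed: 903 Uptime: 2h 5m 55s"
WORKER_RE = re.compile(r"PID:\s*(\d+)\s+Sessions:\s*(\d+)\s+Processed:\s*(\d+)")


def parse_workers(status_text):
    """Return {pid: (sessions, processed)} parsed from passenger-status output."""
    return {
        int(pid): (int(sessions), int(processed))
        for pid, sessions, processed in WORKER_RE.findall(status_text)
    }


def suspected_hung(earlier, later):
    """Return PIDs that are busy in both snapshots with an unchanged Processed count."""
    hung = []
    for pid, (sessions, processed) in later.items():
        if pid not in earlier:
            continue  # worker was spawned after the first snapshot
        old_sessions, old_processed = earlier[pid]
        if sessions > 0 and old_sessions > 0 and processed == old_processed:
            hung.append(pid)
    return sorted(hung)


# Hypothetical snapshots taken some time apart (illustrative values only).
before = parse_workers("""
  PID: 101 Sessions: 1 Processed: 903 Uptime: 1h 35m 55s
  PID: 102 Sessions: 1 Processed: 200 Uptime: 10m 2s
  PID: 103 Sessions: 0 Processed: 50 Uptime: 3m 1s
""")
after = parse_workers("""
  PID: 101 Sessions: 1 Processed: 903 Uptime: 2h 5m 55s
  PID: 102 Sessions: 1 Processed: 640 Uptime: 40m 2s
  PID: 103 Sessions: 0 Processed: 50 Uptime: 33m 1s
  PID: 104 Sessions: 1 Processed: 7 Uptime: 30s
""")

print(suspected_hung(before, after))
```

An idle worker (Sessions: 0) with an unchanged counter is deliberately not flagged, since it is simply waiting for work. How far apart the snapshots should be is a judgment call; the 30 minutes used in the message is far longer than the few seconds a catalog compile takes in the logs.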
I am also now pretty certain that this issue (ticket #11140) is tied directly to 3 systems (in ticket #11143) which can't get catalogs. I believe each of their attempts to get a catalog leaves a puppetmaster process hung. 3 servers every 30 minutes means that in just over 3 hours I have 20 hung puppetmasters, and the queue goes out of control.

I would deeply appreciate some information on how to diagnose the catalog failures and related puppetmaster hangs.

--
Jo Rhett
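The "just over 3 hours" estimate can be checked with a quick calculation. This sketch assumes that each failing agent hangs exactly one worker per run, that hung workers never recover or get recycled, and that hangs accumulate at a steady average rate; the function name is made up for illustration.

```python
def minutes_to_exhaust(pool_size, hung_per_run, run_interval_minutes):
    """Minutes until the whole pool is hung at a steady average hang rate."""
    return pool_size * run_interval_minutes / hung_per_run


# PassengerMaxPoolSize is 20, 3 agents fail, and agents run every 30 minutes.
minutes = minutes_to_exhaust(pool_size=20, hung_per_run=3, run_interval_minutes=30)
hours, mins = divmod(round(minutes), 60)
print(f"{hours}h {mins}m")
```

Under these assumptions the pool is gone after about 200 minutes, which agrees with the estimate in the message. In practice the hangs arrive in batches of three, so the exact moment depends on when the first failing run lands.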
I tried removing the cert from the puppet master for one of the three systems that can't get catalogs, and removing the entire /var/lib/puppet directory on the client, but got the exact same response. After getting a new cert and signing it, the client accepted the cert and then hung waiting for its catalog. The process on the puppetmaster hung.

Anything else I could test, check, clear…? Running the client with --debug --trace simply shows me a timeout, no extra detail.

--
Jo Rhett
awesome, thanks for the ticket #s Jo, I'll do some follow up there and hopefully we can report a happy summary back to the list.

On Fri, Dec 2, 2011 at 3:32 PM, Jo Rhett <jrhett@netconsonance.com> wrote:

> I am also now pretty certain that this issue (ticket #11140) is tied
> directly to 3 systems (in ticket #11143) which can't get catalogs. I
> believe their attempts to get a catalog produce a hung server. 3 servers
> every 30 minutes means that in just over 3 hours I have 20 hung
> puppetmasters, and the queue goes out of control.
>
> I would deeply appreciate some information on how to diagnose the catalog
> failures and related puppetmaster hangs.
>
> On Dec 2, 2011, at 3:09 PM, Jo Rhett wrote:
>
> Hm, you know I don't think that it's a sudden lock of all 20 passenger
> clients. I think it's a slow lockup of various puppet sessions until all
> 20 are locked. Here's an example: every one of the "active" sessions below
> with an uptime longer than 30 minutes has had the same "processed" number
> for more than 30 minutes at this time. So in theory, they've been
> processing the same session for more than 30 minutes. Somehow, I don't
> think so. I think those sessions are locked up. And what is happening is
> that eventually all 20 processes are hung and we are dead in the water.
>
> Fri Dec 2 23:05:59 UTC 2011
> ----------- General information -----------
> max = 20
> count = 18
> active = 12
> inactive = 6
> Waiting on global queue: 0
>
> ----------- Domains -----------
> /etc/puppet/rack:
> PID: 21021 Sessions: 0 Processed: 362 Uptime: 5m 37s
> PID: 21005 Sessions: 0 Processed: 537 Uptime: 5m 38s
> PID: 21555 Sessions: 0 Processed: 69 Uptime: 30s
> PID: 21571 Sessions: 0 Processed: 62 Uptime: 29s
> PID: 20989 Sessions: 0 Processed: 209 Uptime: 5m 39s
> PID: 20968 Sessions: 0 Processed: 157 Uptime: 5m 41s
> PID: 9221 Sessions: 1 Processed: 903 Uptime: 2h 5m 55s
> PID: 9340 Sessions: 1 Processed: 764 Uptime: 2h 4m 58s
> PID: 10379 Sessions: 1 Processed: 568 Uptime: 1h 57m 37s
> PID: 11847 Sessions: 1 Processed: 712 Uptime: 1h 41m 13s
> PID: 11686 Sessions: 1 Processed: 314 Uptime: 1h 41m 19s
> PID: 10845 Sessions: 1 Processed: 511 Uptime: 1h 48m 52s
> PID: 11650 Sessions: 1 Processed: 747 Uptime: 1h 41m 21s
> PID: 14967 Sessions: 1 Processed: 84 Uptime: 1h 8m 28s
> PID: 17605 Sessions: 1 Processed: 497 Uptime: 44m 41s
> PID: 20342 Sessions: 1 Processed: 0 Uptime: 13m 14s
> PID: 20358 Sessions: 1 Processed: 54 Uptime: 13m 13s
> PID: 18098 Sessions: 1 Processed: 854 Uptime: 35m 46s
>
> On Dec 2, 2011, at 2:22 PM, Jo Rhett wrote:
>
> On Dec 2, 2011, at 1:30 PM, Nigel Kersten wrote:
>
> On Fri, Dec 2, 2011 at 1:03 PM, Jo Rhett <jrhett@netconsonance.com> wrote:
>
>> Okay, this has happened again. Puppet master stopped logging catalog
>> compiles, every server stopped returning results and the global queue went
>> quickly through the roof in like 9 minutes. It appears puppet master is
>> stopping dead in its tracks without logging any errors.
>>
>
> A really quick test would be to start a webrick puppetmaster on an
> alternate port with the same configuration file in debug mode and then
> puppet against it to see if there's a problem at that level,
>
> (on master)
> puppet master --no-daemonize --verbose --debug --masterport 9140 (for example)
>
> (on an agent)
> puppet agent --test --masterport 9140
>
>
> This works perfectly fine.
>
> If that doesn't show anything, let us know whether you're running Apache
> prefork or worker, and your relevant pool regulation settings like:
>
> StartServers
> MinSpareServers
> MaxSpareServers
> ServerLimit
> MaxClients
> MaxRequestsPerChild
>
>
> pre fork with the following settings:
>
> StartServers 8
> MinSpareServers 5
> MaxSpareServers 20
> ServerLimit 256
> MaxClients 256
> MaxRequestsPerChild 4000
>
> # passenger-status
>> ----------- General information -----------
>> max = 20
>> count = 20
>> active = 20
>> inactive = 0
>> Waiting on global queue: 209
>>
>> ----------- Domains -----------
>> /etc/puppet/rack:
>> PID: 25783 Sessions: 1 Processed: 329 Uptime: 2h 52m 7s
>> PID: 25831 Sessions: 1 Processed: 4 Uptime: 2h 52m 5s
>> PID: 28517 Sessions: 1 Processed: 6 Uptime: 2h 22m 0s
>> PID: 25802 Sessions: 1 Processed: 714 Uptime: 2h 52m 6s
>> PID: 30905 Sessions: 1 Processed: 13 Uptime: 1h 50m 27s
>> PID: 25864 Sessions: 1 Processed: 709 Uptime: 2h 52m 4s
>> PID: 31028 Sessions: 1 Processed: 347 Uptime: 1h 50m 21s
>> PID: 28944 Sessions: 1 Processed: 377 Uptime: 2h 21m 50s
>> PID: 31090 Sessions: 1 Processed: 266 Uptime: 1h 50m 18s
>> PID: 577 Sessions: 1 Processed: 400 Uptime: 1h 27m 27s
>> PID: 418 Sessions: 1 Processed: 647 Uptime: 1h 28m 2s
>> PID: 1247 Sessions: 1 Processed: 133 Uptime: 1h 19m 3s
>> PID: 1474 Sessions: 1 Processed: 52 Uptime: 1h 18m 9s
>> PID: 594 Sessions: 1 Processed: 378 Uptime: 1h 27m 26s
>> PID: 4706 Sessions: 1 Processed: 414 Uptime: 48m 5s
>> PID: 4775 Sessions: 1 Processed: 218 Uptime: 47m 28s
>> PID: 4854 Sessions: 1 Processed: 584 Uptime: 47m 23s
>> PID: 7774 Sessions: 1 Processed: 165 Uptime: 14m 27s
>> PID: 7902 Sessions: 1 Processed: 44 Uptime: 13m 44s
>> PID: 8149 Sessions: 1 Processed: 541 Uptime: 11m 21s

--
Nigel Kersten
Product Manager, Puppet Labs
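The check Jo describes in the quoted 3:09 PM message (a worker that still holds a session and whose "Processed" count has not moved between two looks at passenger-status) can be done mechanically. Below is a minimal sketch in Python, assuming two snapshots of the passenger-status text have been saved some minutes apart; the function names are hypothetical and the regular expression only covers the row layout quoted in this thread.

```python
import re

# Matches worker rows as they appear in the passenger-status output quoted
# in this thread, e.g. "PID: 9221 Sessions: 1 Processed: 903 Uptime: 2h 5m 55s".
WORKER_RE = re.compile(r"PID:\s*(\d+)\s+Sessions:\s*(\d+)\s+Processed:\s*(\d+)")


def parse_workers(status_text):
    """Return {pid: (sessions, processed)} for every worker row found."""
    workers = {}
    for pid, sessions, processed in WORKER_RE.findall(status_text):
        workers[int(pid)] = (int(sessions), int(processed))
    return workers


def stuck_workers(earlier, later):
    """Return PIDs that look hung between two snapshots.

    A PID is reported when it appears in both snapshots, holds at least one
    session in the later one, and has identical Sessions and Processed
    values in both. This is a heuristic, not proof of a hang.
    """
    before = parse_workers(earlier)
    after = parse_workers(later)
    stuck = []
    for pid, (sessions, processed) in after.items():
        if pid in before and sessions >= 1 and before[pid] == (sessions, processed):
            stuck.append(pid)
    return sorted(stuck)
```

Feeding it two snapshots taken 30 minutes apart (the agent run interval mentioned in the thread) would list the workers that never finished their request. On the arithmetic in the quoted 3:32 PM message: if each of the 3 failing systems ties up one worker per 30-minute run, a pool of 20 is used up on the 7th run (3 x 7 = 21), about 3 hours after the first, which is roughly consistent with "just over 3 hours".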
On Fri, Dec 2, 2011 at 7:28 PM, Nigel Kersten <nigel@puppetlabs.com> wrote:
> awesome, thanks for the ticket #s Jo, I'll do some follow up there and
> hopefully we can report a happy summary back to the list.
>
>
> On Fri, Dec 2, 2011 at 3:32 PM, Jo Rhett <jrhett@netconsonance.com> wrote:
>>
>> I am also now pretty certain that this issue (ticket #11140) is tied
>> directly to 3 systems (in ticket #11143) which can't get catalogs. I believe
>> their attempts to get a catalog produce a hung server. 3 servers every 30
>> minutes means that in just over 3 hours I have 20 hung puppetmasters, and
>> the queue goes out of control.
>>
>> I would deeply appreciate some information on how to diagnose the catalog
>> failures and related puppetmaster hangs.
>>
>> On Dec 2, 2011, at 3:09 PM, Jo Rhett wrote:
>>
>> Hm, you know I don't think that it's a sudden lock of all 20 passenger
>> clients. I think it's a slow lockup of various puppet sessions until all 20
>> are locked. Here's an example: every one of the "active" sessions below
>> with an uptime longer than 30 minutes has had the same "processed" number
>> for more than 30 minutes at this time. So in theory, they've been
>> processing the same session for more than 30 minutes. Somehow, I don't
>> think so. I think those sessions are locked up. And what is happening is
>> that eventually all 20 processes are hung and we are dead in the water.

Not sure if this was mentioned, but in the config.ru file you can enable more debugging.

    # if you want debugging:
    # ARGV << "--debug"

On the master you can try to compile for these hanging clients via:

    puppet master --compile ${hostname} --debug

Hopefully these methods give you more useful outputs. Last do you happen to use an ENC or custom function that query data?

Thanks,

Nan
On Dec 2, 2011, at 6:27 PM, Nan Liu wrote:
> Not sure if this was mentioned, but in the config.ru file you can
> enable more debugging.
>
> # if you want debugging:
> # ARGV << "--debug"

Yep, it was mentioned. Not sure I'd want debugging from all 20 at once. I have been using the idea of running a puppet master on an alternate port and that reproduced the problem. Unfortunately no useful debugging.

> On the master you can try to compile for these hanging clients via:
> puppet master --compile ${hostname} --debug

Ah, I didn't know about that. Unfortunately it didn't yield much.

$ puppet master --compile us0101afc002.tangome.gbl --debug
info: Not using expired node for us0101afc002.tangome.gbl from cache; expired at Sat Dec 03 02:35:25 +0000 2011

As we have seen with the puppetmaster on the alternate port, this process is unkillable. It ignores CTRL-C and kill -15; it requires a kill -9 to stop it.

> Hopefully these methods give you more useful outputs. Last do you
> happen to use an ENC or custom function that query data?

Nope, simple straight up puppet with passenger. No query functions, no ENCs.

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness
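Since the test compile above hangs and ignores both CTRL-C and kill -15, it may be more convenient to run it under a deadline that escalates on its own. The following is a rough sketch using Python's standard subprocess module; the function name, the timeout, and the grace period are assumptions for illustration, not anything Puppet provides.

```python
import subprocess


def run_with_deadline(cmd, timeout_s, grace_s=5.0):
    """Run cmd; on timeout send SIGTERM, then SIGKILL if it is still alive.

    Returns (returncode, output, how), where how is "exited", "terminated"
    or "killed". stderr is merged into the captured output.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, "exited"
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM, the same signal as kill -15
        try:
            out, _ = proc.communicate(timeout=grace_s)
            return proc.returncode, out, "terminated"
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL, the same signal as kill -9
            out, _ = proc.communicate()
            return proc.returncode, out, "killed"
```

A call such as `run_with_deadline(["puppet", "master", "--compile", "us0101afc002.tangome.gbl", "--debug"], timeout_s=120)` would return whatever debug output was printed before the hang; a result of "killed" would match the behaviour described above. The 120-second limit is an arbitrary choice.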
It was corrupted yaml files in $vardir/yaml/[nodes,facts]. Removing them solved the problem completely.

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness
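For anyone applying the same fix, moving the cached files aside is a little safer than deleting them, since the damaged copies may still be wanted for a bug report. The sketch below is only an illustration: it assumes each host's cache is stored as `<certname>.yaml` in subdirectories of `$vardir/yaml`, and the function name and quarantine directory are hypothetical. It does not detect corruption; it only moves the files for a host that is already known to hang the master.

```python
import shutil
from pathlib import Path


def quarantine_node_cache(yaml_dir, certname, quarantine_dir):
    """Move every <subdir>/<certname>.yaml under yaml_dir into quarantine_dir.

    The subdirectory name is preserved under quarantine_dir. Returns the list
    of new paths. Files are moved, not deleted.
    """
    yaml_dir = Path(yaml_dir)
    quarantine_dir = Path(quarantine_dir)
    moved = []
    for sub in sorted(p for p in yaml_dir.iterdir() if p.is_dir()):
        src = sub / f"{certname}.yaml"
        if not src.is_file():
            continue
        dest_dir = quarantine_dir / sub.name
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / src.name
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

The facts cache would normally be rewritten the next time the agent on that host checks in; that is an expectation rather than something confirmed in this thread.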
Nigel Kersten
2011-Dec-03 04:30 UTC
Re: FIXED: [Puppet Users] the slow crawl towards death
On Fri, Dec 2, 2011 at 8:07 PM, Jo Rhett <jrhett@netconsonance.com> wrote:
> It was corrupted yaml files in $vardir/yaml/[nodes,facts]. Removing them
> solved the problem completely.

What version of Puppet Jo?

Jo sent me the corrupt yaml files off list (they may contain some sensitive info) and we'll get that sorted for the bug report.

--
Nigel Kersten
Product Manager, Puppet Labs
> On Fri, Dec 2, 2011 at 8:07 PM, Jo Rhett <jrhett@netconsonance.com> wrote:
> It was corrupted yaml files in $vardir/yaml/[nodes,facts]. Removing them solved the problem completely.

On Dec 2, 2011, at 8:30 PM, Nigel Kersten wrote:
> What version of Puppet Jo?

2.6.12 on centos 5.7

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness