I'm not 100% sure if the subject correctly describes the problem I've been having, but it's the closest I can get with my troubleshooting. My setup looks like this:

* 2 puppetmasters running 0.25.4 on Ubuntu, running under passenger
* backend content (etc and var) shared over NFS
* haproxy load balancing across the 2 puppetmasters
* mysql for stored configs

I just upgraded from 0.24.8 to 0.25.4 a couple of weeks ago. The setup we've been using above has worked fine since we implemented it months ago, so I don't believe that there is any problem with NFS or the load balancer. I have a handful of custom functions, and after updating to 0.25.4, puppetmaster started complaining about one of them, a simple function called nagios_name. This function takes an FQDN and turns it into a name we use in Nagios and mcollective (turning "support.arces.net" into "arces.support" for example). The function is basic ruby and is available for you to look at here: http://monachus.pastebin.com/yLF1syqU. The function works fine.

The error that puppetmaster reports is:

Unknown function nagios_name at /var/www/localhost/puppet/etc/manifests/outsidein_nodes.pp:16 on node some.node.com.

It doesn't report this all of the time - instead it reports it about 40% of the time, while other nodes before and after it do not report the error. It seems that a node with a problem will always have the problem, and a node where it works will always work. This reinforces the fact that the function is fine - it works and has worked for months.

My thought is that it's some sort of caching issue, and I even thought it might be a race condition with the backend storage being NFS - one puppetmaster loading a cached yaml file before the other was done writing it or something.
I've done all of the following, all with no success:

* turn off one puppetmaster so traffic isn't split across them
* move yaml files for node/facts to local storage instead of NFS
* enable IP-based persistence in haproxy so that traffic from a client always goes to the same puppetmaster
* --ignorecache in config.ru for puppetmaster

What I've discovered, however, is more interesting. It appears that if I go into the actual nagios_name.rb file and change it in any way (add a single character of whitespace) and restart Apache, the error goes away. The file is detected as different and loaded for delivery to the clients, and everything works fine after that. I discovered this by adding debug() statements to the function 2 weeks ago, only to find that it worked fine from then on. The problem resurfaced today when I turned the 2nd puppetmaster back on, and I decided to try it with whitespace - same thing. Clears it right up. This tells me that there is some sort of caching wonkiness happening somewhere, but I'm not able to figure out where.

Perhaps one of the variables the function is looking for (fqdn?) isn't available at the time it's requested, resulting in a compile error that isn't always visible?

I'm pleased to have a workaround, but to go from "Unknown function" to "everything is cool" by adding a space to the file and saving it isn't really much of a long-term solution.

I'm sending this to the list rather than filing a bug report to see if anyone has experienced anything like this or has any thoughts. If there's any further information I can give to help narrow down the source of the problem, I'm happy to do so.

Adrian

--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To post to this group, send email to puppet-users@googlegroups.com.
To unsubscribe from this group, send email to puppet-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/puppet-users?hl=en.
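
For readers without access to the pastebin, the transformation described in the post above ("support.arces.net" becoming "arces.support") can be sketched in plain Ruby. This is not Adrian's actual function: the rule used here (drop the last label and reverse the remaining ones) is an assumption inferred from that single example, and the method name `nagios_name_for` is hypothetical. On a real master the logic would be registered as a custom parser function rather than called directly.

```ruby
# Hypothetical sketch of the name mangling described in the post.
# Assumption: drop the last label (the TLD) and reverse the rest.
def nagios_name_for(fqdn)
  labels = fqdn.to_s.split('.')
  # A bare hostname has nothing to drop or reorder; return it unchanged.
  return labels.join('.') if labels.length < 2
  labels[0...-1].reverse.join('.')
end

puts nagios_name_for('support.arces.net')  # prints "arces.support"
```

Keeping the string handling in a plain method like this makes it easy to test outside Puppet, which helps separate "the Ruby is wrong" from "the master is not loading the file".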
Could it be that when you change the "nagios_name.rb" file on puppetmaster A, an event is triggered so that Apache reloads this file? But since this event isn't passed over to NFS in any way, this doesn't happen on puppetmaster B?

Have you tried to restart every component after you change a file, just to verify that it is read correctly by all components?

Have no idea if this is of any help, but it's better than nothing.

On 24 Jul, 14:37, Monachus <monac...@gmail.com> wrote:
> [...]
Monachus
2010-Jul-27 12:08 UTC
[Puppet Users] Re: 0.25.4 caching problem with custom function
On Jul 27, 8:40 am, Tore <tore.lo...@gmail.com> wrote:
> Could it be that when you change the "nagios_name.rb" file on
> puppetmaster A, an event is triggered so that Apache reloads
> this file? But since this event isn't passed over to NFS in any way,
> this doesn't happen on puppetmaster B?

NFS caching was one of the things that I looked at. NFS stats the file to determine if it needs to be reloaded, and I've adjusted this to be as aggressive as possible. I know that the puppetmasters reload other files immediately on change (manifests, modules, other file resources being pushed to the clients) - it's only the function that is having an issue, and only _this_ function, which is even weirder. That's why I posted the function on pastebin - maybe there's some ruby shortcut in there which makes Puppet barf when it's between 1 and 3 in the afternoon and the moon is between 36% and 42% full on any of the last 4 Tuesdays.

> Have you tried to restart every component after you change a file,
> just to verify that it is read correctly by all components?

I have. When I had the problem the other day on one puppetmaster and not the other, I went through a battery of tests including bouncing Apache and thus puppetmasterd (since it runs under Passenger). In an earlier iteration I even put the entire NFS datastore on local storage and removed NFS from the equation. It doesn't help. The only thing that helps is if I physically change the nagios_name.rb file somehow. It's like there's a cache somewhere that isn't obvious - some place where puppetmasterd is storing the functions in a serialized form for quick reload, maybe?

> Have no idea if this is of any help, but it's better than nothing.

Thanks for the thoughts - any and all help is appreciated. It's a weird weird bug to try and track down. I'm pleased that my workaround is holding, though I'd like to know that a long-term fix is possible.
Adrian
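
One way to check that both puppetmasters really see the same copy of the function file, before and after the whitespace edit, is to print a small fingerprint of it on each master and compare the output. The sketch below uses only the Ruby standard library. The helper name `file_fingerprint` and the script name in the usage comment are hypothetical, the thread does not say where nagios_name.rb lives, and nothing here reflects how Puppet itself decides to reload a function.

```ruby
require 'digest/md5'

# Hypothetical helper: summarise a file so two hosts can be compared.
def file_fingerprint(path)
  stat = File.stat(path)
  {
    :path  => path,
    :size  => stat.size,
    :mtime => stat.mtime.utc,
    :md5   => Digest::MD5.file(path).hexdigest
  }
end

if __FILE__ == $0
  # Assumed usage: ruby fingerprint.rb /path/to/nagios_name.rb
  ARGV.each do |path|
    fp = file_fingerprint(path)
    puts "#{fp[:path]} size=#{fp[:size]} mtime=#{fp[:mtime]} md5=#{fp[:md5]}"
  end
end
```

If the md5 and mtime match on both hosts while only one of them reports "Unknown function", the file as seen over NFS is probably not the difference, which would point back at whatever state the master keeps about already-loaded functions.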
Joe McDonagh
2010-Jul-28 14:35 UTC
Re: [Puppet Users] Re: 0.25.4 caching problem with custom function
On 07/27/2010 08:08 AM, Monachus wrote:
> [...]

I don't have much in the way of suggestions, but I ran into a lot of problems recently when I tried to have a docroot sitting on NFS. No matter what I did, I always ran into weird problems just like this. It sort of worked, but it was definitely not something I could use in production...

--
Joe McDonagh
Operations Engineer
AIM: YoosingYoonickz
IRC: joe-mac on freenode

"When the going gets weird, the weird turn pro."
Monachus
2010-Jul-30 07:44 UTC
[Puppet Users] Re: 0.25.4 caching problem with custom function
> I don't have much in the way of suggestions, but I ran into a lot of
> problems recently when I tried to have a docroot sitting on NFS. No
> matter what I did I always ran into weird problems just like this. It
> sort of worked but definitely not something I could use in production...

We've been using NFS for 2 years and it works great. I worked very hard to eliminate NFS write issues when setting it up - as long as it's used as a read datastore it seems to work fine. The specific problem to which I'm referring didn't appear until we moved to 0.25 a few weeks ago. At 0.24.8 this never happened.

I suppose it's possible that there's some sort of write issue with NFS under 0.25, but why would that manifest as an inability to read this one specific function and not in any other part of my massive module stack? Even when I've seen Puppet clients fail because of "Stale NFS filehandle" when the master is reloading something I've changed, they still do the run from cache and load that file fine the next time through. This function error is a critical error that kills the puppet run completely, and it never goes away until I change the .rb file for the function in some way (newline, space, whatever). That's not NFS - that's code.

I'm not seeing any response from those who could speak to caching, and now that 2.6 is out, we'll probably have to wait to troubleshoot it with the developers until after we upgrade again.

Thanks, everyone, for your response and for taking the time to think about what it might be.

Adrian