Monachus

2010-Jul-24 12:37 UTC

[Puppet Users] 0.25.4 caching problem with custom function

I'm not 100% sure if the subject correctly describes the problem I've
been having, but it's the closest I can get with my troubleshooting.
My setup looks like this:

* 2 puppetmasters running 0.25.4 on Ubuntu, running under passenger
* backend content (etc and var) shared over NFS
* haproxy load balancing across the 2 puppetmasters
* mysql for stored configs

I just upgraded from 0.24.8 to 0.25.4 a couple of weeks ago.  The
setup we've been using above has worked fine since we implemented it
months ago, so I don't believe that there is any problem with NFS or
the load balancer.  I have a handful of custom functions, and after
updating to 0.25.4, puppetmaster started complaining about one of
them, a simple function called nagios_name.  This function takes an
FQDN and turns it into a name we use in Nagios and mcollective
(turning "support.arces.net" into "arces.support", for example).  The
function is basic Ruby and is available for you to look at here:
http://monachus.pastebin.com/yLF1syqU.  The function works fine.
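
For readers who cannot reach the pastebin, the transformation described above can be sketched in plain Ruby. This is only an illustration, not the function from the pastebin: the helper name `nagios_name_for` and the rule it applies (drop the last label, reverse the rest) are assumptions inferred from the single example given.

```ruby
# Hypothetical stand-in for the transformation described in the post.
# Assumption: drop the final label (the TLD) and reverse the remaining labels.
def nagios_name_for(fqdn)
  labels = fqdn.to_s.split(".")
  # A bare hostname has no TLD to strip, so return it unchanged.
  return fqdn.to_s if labels.length < 2
  labels[0...-1].reverse.join(".")
end

puts nagios_name_for("support.arces.net")  # prints "arces.support"
```

Inside Puppet 0.25 a function like this would normally be registered with `Puppet::Parser::Functions.newfunction(:nagios_name, :type => :rvalue)`; how the real function handles names with more than three labels is not stated in the thread.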

The error that puppetmaster reports is:

Unknown function nagios_name at /var/www/localhost/puppet/etc/manifests/outsidein_nodes.pp:16 on node some.node.com.

It doesn't report this all of the time - instead it reports it about
40% of the time, while other nodes before and after it do not report
the error.  It seems that a node with a problem will always have the
problem, and a node where it works will always work.  This reinforces
the fact that the function is fine - it works and has worked for
months.

My thought is that it's some sort of caching issue, and I even thought
it might be a race condition with the backend storage being NFS - one
puppetmaster loading a cached yaml file before the other was done
writing it or something.  I've done all of the following, all with no
success:

* turn off one puppetmaster so traffic isn't split across them
* move yaml files for node/facts to local storage instead of NFS
* enable IP-based persistence in haproxy so that traffic from a client
always goes to the same puppetmaster
* --ignorecache in config.ru for puppetmaster

What I've discovered, however, is more interesting.  It appears that
if I go into the actual nagios_name.rb file and change it in any way
(add a single character of whitespace) and restart Apache, the error
goes away.  The file is detected as different and loaded for delivery
to the clients, and everything works fine after that.  I discovered
this by adding debug() statements to the function 2 weeks ago, only to
find that it worked fine from then on.  The problem resurfaced today
when I turned the 2nd puppetmaster back on, and I decided to try it
with whitespace - same thing.  Clears it right up.  This tells me that
there is some sort of caching wonkiness happening somewhere, but I'm
not able to figure out where.
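
The whitespace workaround described above could be scripted roughly as follows. This is a sketch under assumptions: `nudge_file` is a hypothetical helper name, the demonstration runs against a throwaway temporary file rather than the real nagios_name.rb, and restarting Apache afterwards is still a separate manual step.

```ruby
require "tempfile"

# Hypothetical helper: append a single newline so the file's content
# (and modification time) changes, mimicking the manual whitespace edit.
# Returns the file size after the append.
def nudge_file(path)
  File.open(path, "ab") { |f| f.write("\n") }
  File.size(path)
end

# Demonstration on a temporary file, not on a real function file.
Tempfile.create("nagios_name") do |tmp|
  tmp.write("# placeholder function body\n")
  tmp.flush
  before = File.size(tmp.path)
  after = nudge_file(tmp.path)
  puts "file grew by #{after - before} byte(s)"
end
```

This only automates the symptom relief; it says nothing about why the unchanged file is not picked up.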

Perhaps one of the variables the function is looking for (fqdn?) isn't
available at the time it's requested, resulting in a compile error
that isn't always visible?

I'm pleased to have a workaround, but to go from "Unknown function" to
"everything is cool" by adding a space to the file and saving it isn't
really much of a long-term solution.

I'm sending this to the list rather than filing a bug report to see if
anyone has experienced anything like this or has any thoughts.  If
there's any further information I can give to help narrow down the
source of the problem, I'm happy to do so.

Adrian

-- 
You received this message because you are subscribed to the Google Groups
"Puppet Users" group.
To post to this group, send email to puppet-users@googlegroups.com.
To unsubscribe from this group, send email to
puppet-users+unsubscribe@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/puppet-users?hl=en.

Tore

2010-Jul-27 06:40 UTC

[Puppet Users] Re: 0.25.4 caching problem with custom function

Could it be that when you change the "nagios_name.rb" file on
puppetmaster A, an event is triggered so that Apache reloads
the file? But since this event isn't passed over NFS in any way,
the same doesn't happen on puppetmaster B?

Have you tried restarting every component after you change a file,
just to verify that it is read correctly by all components?

I have no idea if this is of any help, but it's better than nothing.


Monachus

2010-Jul-27 12:08 UTC

[Puppet Users] Re: 0.25.4 caching problem with custom function

On Jul 27, 8:40 am, Tore <tore.lo...@gmail.com> wrote:
> Could it be that when you change the "nagios_name.rb" file on
> puppetmaster A, an event is triggered so that Apache reloads
> the file? But since this event isn't passed over NFS in any way,
> the same doesn't happen on puppetmaster B?

NFS caching was one of the things that I looked at.  NFS stats the
file to determine if it needs to be reloaded, and I've adjusted this
to be as aggressive as possible.  I know that the puppetmasters reload
other files immediately on change (manifests, modules, other file
resources being pushed to the clients) - it's only the function that
is having an issue, and only _this_ function, which is even weirder.
That's why I posted the function on pastebin - maybe there's some Ruby
shortcut in there which makes Puppet barf when it's between 1 and 3 in
the afternoon and the moon is between 36% and 42% full on any of the
last 4 Tuesdays.

> Have you tried restarting every component after you change a file,
> just to verify that it is read correctly by all components?

I have.  When I had the problem the other day on one puppetmaster and
not the other, I went through a battery of tests including bouncing
Apache and thus puppetmasterd (since it runs under Passenger).  In an
earlier iteration I even put the entire NFS datastore on local storage
and removed NFS from the equation.  It doesn't help.  The only thing
that helps is if I physically change the nagios_name.rb file somehow.
It's like there's a cache somewhere that isn't obvious - some place
where puppetmasterd is storing the functions in a serialized form for
quick reload, maybe?

> I have no idea if this is of any help, but it's better than nothing.

Thanks for the thoughts - any and all help is appreciated.  It's a
weird, weird bug to try and track down.  I'm pleased that my workaround
is holding, though I'd like to know that a long-term fix is possible.

Adrian


Joe McDonagh

2010-Jul-28 14:35 UTC

Re: [Puppet Users] Re: 0.25.4 caching problem with custom function

I don't have much in the way of suggestions, but I ran into a lot of
problems recently when I tried to have a docroot sitting on NFS. No
matter what I did I always ran into weird problems just like this. It
sort of worked but definitely not something I could use in production...

--
Joe McDonagh
Operations Engineer
AIM: YoosingYoonickz
IRC: joe-mac on freenode
"When the going gets weird, the weird turn pro."


Monachus

2010-Jul-30 07:44 UTC

[Puppet Users] Re: 0.25.4 caching problem with custom function

> I don't have much in the way of suggestions, but I ran into a lot of
> problems recently when I tried to have a docroot sitting on NFS. No
> matter what I did I always ran into weird problems just like this. It
> sort of worked but definitely not something I could use in production...

We've been using NFS for 2 years and it works great.  I worked very
hard to eliminate NFS write issues when setting it up - as long as
it's used as a read datastore it seems to work fine.  The specific
problem to which I'm referring didn't appear until we moved to 0.25 a
few weeks ago.  At 0.24.8 this never happened.  I suppose it's
possible that there's some sort of write issue with NFS under 0.25,
but why would that manifest as an inability to read this one specific
function and not in any other part of my massive module stack?  Even
when I've seen Puppet clients fail because of "Stale NFS filehandle"
when the master is reloading something I've changed, they still do the
run from cache and load that file fine the next time through.  This
function error is a critical error that kills the puppet run
completely, and it never goes away until I change the .rb file for the
function in some way (newline, space, whatever).  That's not NFS -
that's code.

I'm not seeing any response from those who could speak to caching, and
now that 2.6 is out, we'll probably have to wait to troubleshoot it
with the developers until after we upgrade again.

Thanks, everyone, for your response and for taking the time to think
about what it might be.

Adrian

