Jonathon Anderson
2010-Aug-02 11:07 UTC
[Puppet Users] wrong facts going into storeconfigs, 0.25+2.6
I'm re-posting this because I'm not sure that it got through the first time. If someone could at least echo back that this is reaching the list, I'd appreciate it. (I'm new to the list.)

Sometimes (with variable frequency) storeconfigs stores the wrong data in the fact_values table. This has the end result that exported resources, when collected, have invalid configuration.

The most recent example: the "hostname" fact for one of our nodes got, instead, the value that should have gone in the "processorcount" fact. This had the end result that the node's nagios configuration started trying to monitor a host "8" rather than "cn19", and ssh keys for cn19 were collected at other nodes as "8,8.example.com <keytext>" instead of "cn19,cn19.example.com <keytext>". The hostname fact is the only destination that I've noticed the corrupted data in, but the source has been swapfree/swapsize, processor[n], operatingsystem, operatingsystemrelease, kernelrelease, and others.

I realize that I don't have much of a "simple, repeatable, minimal" test case here, but I've been trying to figure it out for months to no avail. I had hoped that an upgrade to 2.6 would make this problem go away, but no: we've just now experienced it again. For the record, we've seen it since sometime in the 0.24.x branch (when we started using it).

It might have something to do with an appropriately high load on storeconfigs. I ran it for 2 days with nodes exporting data (but not collecting) to see if it would happen again, and I didn't notice any corruption. Then, today, I enabled collection (e.g., ssh_known_hosts) on all (~138) hosts, and soon after found a corrupt nagios configuration. (Then again, it might just be that it's more probable with more nodes doing the collection.)

I've never seen the actual facter command return one of these bits of misplaced data: the furthest back I've been able to trace it is to the fact_values table.
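[Editor's note: one way to audit the database for this kind of cross-contamination is to join the stored "hostname" facts back against the host records. The sketch below assumes the 0.25-era storeconfigs Rails schema (hosts, fact_names, fact_values tables with host_id/fact_name_id foreign keys) — verify the names against your actual database — and demonstrates the query on an in-memory SQLite mock rather than the real Postgres instance.]

```python
# Sketch: flag hosts whose stored "hostname" fact does not match the
# name Puppet registered for the host. Table/column names follow the
# assumed 0.25-era storeconfigs schema; demonstrated on a SQLite mock.
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_names (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_values (id INTEGER PRIMARY KEY, host_id INTEGER,
                          fact_name_id INTEGER, value TEXT);
INSERT INTO hosts VALUES (1, 'cn19.example.com'), (2, 'cn20.example.com');
INSERT INTO fact_names VALUES (1, 'hostname'), (2, 'processorcount');
-- cn19's hostname fact has been overwritten with a processorcount value
INSERT INTO fact_values VALUES (1, 1, 1, '8'),
                               (2, 2, 1, 'cn20'),
                               (3, 1, 2, '8');
""")

# A healthy host's registered name should start with its hostname fact.
suspects = c.execute("""
    SELECT h.name, fv.value
    FROM hosts h
    JOIN fact_values fv ON fv.host_id = h.id
    JOIN fact_names fn ON fn.id = fv.fact_name_id
    WHERE fn.name = 'hostname'
      AND h.name NOT LIKE fv.value || '%'
""").fetchall()
print(suspects)  # -> [('cn19.example.com', '8')]
```

The same SELECT should run unchanged against Postgres (both support `||` concatenation and LIKE); only the mock setup is SQLite-specific.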
We're using a single puppet master, with storeconfigs storing to a postgresql database on a different host from the puppet master host. Everything works in the majority of cases, but fails just often enough to make it really, really annoying.

Any help anyone can provide, including insight into where I might look to track down the cause even further, would be much appreciated. Thanks.

~jon

--
You received this message because you are subscribed to the Google Groups "Puppet Users" group. To post to this group, send email to puppet-users@googlegroups.com. To unsubscribe from this group, send email to puppet-users+unsubscribe@googlegroups.com. For more options, visit this group at http://groups.google.com/group/puppet-users?hl=en.
Brice Figureau
2010-Aug-02 12:18 UTC
Re: [Puppet Users] wrong facts going into storeconfigs, 0.25+2.6
Hi,

On Mon, 2010-08-02 at 14:07 +0300, Jonathon Anderson wrote:
> I'm re-posting this because I'm not sure that it got through the first
> time. If someone could at least echo back that this is reaching the
> list, I'd appreciate it. (I'm new to the list.)

I don't know if your first message went through, but I can confirm this one did.

> Sometimes (with variable frequency) storeconfigs stores the wrong data
> in the fact_values table. This has the end result that exported
> resources, when collected, have invalid configuration.
>
> The most recent example: the "hostname" fact for one of our nodes got,
> instead, the value that should have gone in the "processorcount" fact.
> This had the end result that the node's nagios configuration started
> trying to monitor a host "8" rather than "cn19", and ssh keys for cn19
> were collected at other nodes as "8,8.example.com <keytext>" instead
> of "cn19,cn19.example.com <keytext>". The hostname fact is the only
> destination that I've noticed the corrupted data in, but the source
> has been swapfree/swapsize, processor[n], operatingsystem,
> operatingsystemrelease, kernelrelease, and others.
>
> I realize that I don't have much of a "simple, repeatable, minimal"
> test case here, but I've been trying to figure it out for months to no
> avail. I had hoped that an upgrade to 2.6 would make this problem go
> away, but no: we've just now experienced it again. For the record,
> we've seen it since sometime in the 0.24.x branch (when we started
> using it).

So that's an "old" issue, not something introduced in the brand-new 2.6.

> It might have something to do with an appropriately high load on
> storeconfigs. I ran it for 2 days with nodes exporting data (but not
> collecting) to see if it would happen again, and I didn't notice any
> corruption. Then, today, I enabled collection (e.g., ssh_known_hosts)
> on all (~138) hosts, and soon after found a corrupt nagios
> configuration. (Then again, it might just be that it's more probable
> with more nodes doing the collection.)

Which seems logical.

> I've never seen the actual facter command return one of these bits of
> misplaced data: the furthest back I've been able to trace it is to the
> fact_values table.
>
> We're using a single puppet master, with storeconfigs storing to a
> postgresql database on a different host from the puppet master host.
> Everything works in the majority of cases, but fails just often enough
> to make it really, really annoying.
>
> Any help anyone can provide, including insight into where I might look
> to track down the cause even further, would be much appreciated.
> Thanks.

So, the real question is where the issue comes from. As I see it, the facts the node sends to the puppetmaster are correct, otherwise the received catalog wouldn't apply correctly. So the issue is, to my understanding, a pure storeconfigs issue.

The first thing you should check is the version of ActiveRecord or the postgres library you are using. Try to upgrade those; maybe the issue has already been fixed (assuming the issue is not on the Puppet side).

Next, you should try to analyse where the issue comes from by having a look at the SQL queries ActiveRecord generates:

1) clean up the mess so that you start with a good database
2) activate the ActiveRecord log on your master (set rails_loglevel=debug and railslog=/path/to/rails.log)
3) let it run until you notice the issue
4) read the rails log to find the culprit SQL request

Maybe that could give you more information; at least we'll know what it tries to save.

Then, I'd add debug statements to the puppetmaster (check lib/puppet/rails/host.rb, especially the merge_facts method). By correlating this debug information with the query log, you might be able to notice a pattern, or at least find out whether the problem comes from an issue in the data Puppet has, or whether it is created in the AR layer.
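[Editor's note: concretely, step 2 above amounts to a puppet.conf fragment along the following lines. The rails_loglevel and railslog settings are the ones named in the reply; the [main] section placement and the log path are only examples — adjust for your version and layout.]

```ini
# puppet.conf on the puppetmaster -- enable ActiveRecord query logging.
# The railslog path is a placeholder; point it anywhere writable by the
# master process.
[main]
    storeconfigs   = true
    rails_loglevel = debug
    railslog       = /var/log/puppet/rails.log
```

With ~138 nodes checking in, a debug-level query log grows quickly, so it is worth reverting rails_loglevel once the culprit request has been captured.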
You should also file a bug report with all the information you'll find.

Hope that helps,
--
Brice Figureau
Follow the latest Puppet Community evolutions on www.planetpuppet.org!