Hello,

We upgraded PuppetDB from 1.0.2 to 1.1.1 last week, and recently we noticed that the queue is not being processed. We don't know if this problem is directly linked to the upgrade. There are now 25,000 messages waiting in the queue. We also see errors in the Postgres log:

> ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
> STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *

Can this be somehow related to the KahaDB leak thread? (https://groups.google.com/d/topic/puppet-users/gWACi4a0BZs/discussion)

Any thoughts? Would flushing the queue solve the problem?
We flushed the queue and also restarted Postgres. Didn't help.
Or maybe it did. The queue size went up to around 900, and after a while it normalized and started slowly decreasing. Right now it's empty. There have been 2 insert errors in the Postgres log since its restart.

It was perhaps on our end, but it would be good to hear some theories, or suggestions on where else to look for a problem in a similar case.
The queue jumped back to ~1,600 and, according to the puppetdb logs, it hasn't processed a single entry for about 20 minutes, but the log shows a lot of slow queries (see example below). The Postgres logs show no errors.

And now it has started processing the queue again.

Slow queries: http://pastebin.com/2B4azRib
If you clear the queue and roll back to the original version, does the problem disappear? If you're having processing problems at the latest version that's what I would do, as I presume we're talking production here, right?

> Can this be somehow related to the KahaDB leak thread?

No - it doesn't sound like it.

> ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
> STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *

That is a bit of a concern - are you receiving a lot of these? Is this constant?

> Use of unversioned APIs is deprecated; please use /v1/resources

These are appearing in your slow query log - are you sure you're running puppetdb-terminus 1.1.1 as well? Did you restart your master after upgrading the terminus?

ken.
Hi, thanks for trying to help! :)

> If you clear the queue and roll back to the original version, does the problem disappear? If you're having processing problems at the latest version that's what I would do, as I presume we're talking production here, right?

Yes, this is production. We would like to avoid reverting, but maybe we will have to.

> > ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> That is a bit of a concern - are you receiving a lot of these? Is this constant?

Before we flushed the queue, quite a lot. Since we flushed it, only 4 times.

> These are appearing in your slow query log - are you sure you're running puppetdb-terminus 1.1.1 as well? Did you restart your master after upgrading the terminus?

The terminus version is correct; however, we were not entirely sure that the masters were properly restarted, so we did it again now. Let's see if this changes anything.

But the hanging is happening again. I checked what queries are running on the database server. Some queries have been running for ~90 minutes, and there are some idle threads. There are two UPDATE commands that are waiting. I assume that while those two don't get executed, queue processing is halted.
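(Side note for anyone digging into the same symptom: on PostgreSQL 8.4 a rough way to list the long-running and waiting statements is a query along the lines below. The column names procpid/current_query/waiting are the 8.4 ones; newer releases renamed them to pid/query and replaced waiting with wait events, so treat this as a sketch rather than a copy-paste recipe.)

    -- Long-running and blocked statements, oldest first (PostgreSQL 8.4 column names)
    SELECT procpid,
           usename,
           waiting,
           now() - query_start AS runtime,
           current_query
    FROM pg_stat_activity
    WHERE current_query <> '<IDLE>'   -- 8.4 marks idle backends with the literal string <IDLE>
    ORDER BY query_start;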
> Hi, thanks for trying to help! :)

I'm @ken_barber on IRC btw, if that is easier.

>> That is a bit of a concern - are you receiving a lot of these? Is this constant?
>
> Before we flushed the queue, quite a lot. Since we flushed it, only 4 times.

>> These are appearing in your slow query log - are you sure you're running puppetdb-terminus 1.1.1 as well? Did you restart your master after upgrading the terminus?
>
> The terminus version is correct; however, we were not entirely sure that the masters were properly restarted, so we did it again now. Let's see if this changes anything.

Okay. Did you clear the ActiveMQ queues after doing this? I usually just move the old KahaDB directory out of the way when I do this.

> But the hanging is happening again. I checked what queries are running on the database server. Some queries have been running for ~90 minutes, and there are some idle threads. There are two UPDATE commands that are waiting. I assume that while those two don't get executed, queue processing is halted.

Okay, so those queries that have been running for 90 minutes - what are they?

Have you tried shutting down PuppetDB, clearing the queue, running a full vacuum on the database, then starting it all up again?

What version of PostgreSQL are we talking here? What distro and version specifically? How big is the database itself?

How many nodes, and how frequently are they checking in? A screenshot of your puppetdb dashboard might give us a good snapshot of the size of your environment as well.

Has any tuning been performed on your postgresql server? A copy of your postgresql.conf would be interesting.

We can do this all privately if you are concerned about sharing information publicly as well - up to you. Like I said, I'm also on Freenode IRC as @ken_barber if this is better for you.

ken.
> I'm @ken_barber on IRC btw, if that is easier.

Can't use IRC here, sorry. :(

> Okay. Did you clear the ActiveMQ queues after doing this? I usually just move the old KahaDB directory out of the way when I do this.

I hadn't thought about that myself, but it makes sense, so I just flushed the queue again while the puppetdb service was stopped. Since this last restart it seems smooth, but it has only been up for a few minutes.

> Okay, so those queries that have been running for 90 minutes - what are they?

SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL
18 instances were running; the oldest was ~90 minutes old at the time.

SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL)
12 instances were running; the oldest was ~90 minutes old at the time.

> Have you tried shutting down PuppetDB, clearing the queue, running a full vacuum on the database, then starting it all up again?

Not yet; according to the PostgreSQL documentation, a full vacuum is not really recommended. Also, it might take a while, and PuppetDB would be unavailable for the duration.

> What version of PostgreSQL are we talking here? What distro and version specifically? How big is the database itself?

8.4.13 on a RHEL6-based distribution. The database size might be an indicator that something is not OK. It currently takes up 512 GB, and it grew by 100 GB over a week. We made a database dump before the PuppetDB upgrade and it was ~1 GB uncompressed. The full vacuum might solve this, but my guess is that it would run for quite a long time.

> How many nodes, and how frequently are they checking in? A screenshot of your puppetdb dashboard might give us a good snapshot of the size of your environment as well.

We have ~2500 nodes. Run intervals for the two major environments: 60 minutes for 1593 machines and 30 minutes for 496. The splay limit is 900 seconds for both.

Two screenshots from today: http://imgur.com/kG5Sth5,bLFqGAX#0 and http://imgur.com/kG5Sth5,bLFqGAX#1

> Has any tuning been performed on your postgresql server? A copy of your postgresql.conf would be interesting.

No tuning that I'm aware of. It's the same default postgresql.conf as deployed by the puppetlabs-postgres module. It's here: http://pastebin.com/njvGm4eu

> We can do this all privately if you are concerned about sharing information publicly as well - up to you. Like I said, I'm also on Freenode IRC as @ken_barber if this is better for you.

So far I don't feel like I've shared anything sensitive, so this might be useful for others as well.

Thanks for the help!
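(To put the 512 GB figure in context, it can help to see which relations are actually holding the space; a table that is orders of magnitude larger than its dump usually points at bloat. A minimal sketch - the database name 'puppetdb' is an assumption, adjust to whatever your PuppetDB database config actually uses:)

    -- Overall database size (database name 'puppetdb' is an assumption)
    SELECT pg_size_pretty(pg_database_size('puppetdb')) AS database_size;

    -- Ten largest tables, including their indexes and TOAST data
    SELECT c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public'
      AND c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 10;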
>> Okay. Did you clear the ActiveMQ queues after doing this? I usually just move the old KahaDB directory out of the way when I do this.
>
> I hadn't thought about that myself, but it makes sense, so I just flushed the queue again while the puppetdb service was stopped. Since this last restart it seems smooth, but it has only been up for a few minutes.

You mean, you've only been watching it for a few minutes, and so far so good - or it crashed? Sorry - just want to be clear :-).

>> Have you tried shutting down PuppetDB, clearing the queue, running a full vacuum on the database, then starting it all up again?
>
> Not yet; according to the PostgreSQL documentation, a full vacuum is not really recommended. Also, it might take a while, and PuppetDB would be unavailable for the duration.

It's not recommended as a recurring maintenance task, this is true, but if your DB has had major changes due to a schema upgrade, or if vacuum hasn't run in a while, it's generally okay and can provide some benefit. But yeah, judging by the size of your DB this will take a long time. I guess my point is, every recommendation has a caveat or back-story.

>> What version of PostgreSQL are we talking here? What distro and version specifically? How big is the database itself?
>
> 8.4.13 on a RHEL6-based distribution. The database size might be an indicator that something is not OK. It currently takes up 512 GB, and it grew by 100 GB over a week. We made a database dump before the PuppetDB upgrade and it was ~1 GB uncompressed. The full vacuum might solve this, but my guess is that it would run for quite a long time.

Wow - it should be more like 5 GB or less for your size.

It sounds like you've got some major fragmentation issues; your indexes may need rebuilding, and yeah - a vacuum will probably help, along with an index rebuild. But this will require an outage, I would say.

When was the last time it was vacuumed? Try the SQL:

select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables;

And give us the output. I think autovacuum should be on by default for PostgreSQL 8.4 on Red Hat, but I can't recall.

On another note ... to be honest, I can't be precise about why the upgrade failed. I'm pretty sure you were running old revisions of the terminus when you upgraded, but that shouldn't cause this kind of anguish. The errors you were receiving about constraints:

Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".

should not be occurring at all, which is all very suspect - alas, I have no clue yet as to why. Have they stopped now that we have cleared the queue and restarted?

>> How many nodes, and how frequently are they checking in? A screenshot of your puppetdb dashboard might give us a good snapshot of the size of your environment as well.
>
> We have ~2500 nodes. Run intervals for the two major environments: 60 minutes for 1593 machines and 30 minutes for 496. The splay limit is 900 seconds for both.
> Two screenshots from today: http://imgur.com/kG5Sth5,bLFqGAX#0 and http://imgur.com/kG5Sth5,bLFqGAX#1

Okay.

>> Has any tuning been performed on your postgresql server? A copy of your postgresql.conf would be interesting.
>
> No tuning that I'm aware of. It's the same default postgresql.conf as deployed by the puppetlabs-postgres module. It's here: http://pastebin.com/njvGm4eu

What is the size of your database? CPU/cores ... and RAM on the box?

Does your puppetdb service live on the same node as the database? I'm guessing this to be true, as your postgresql.conf is listening only on 'localhost' ... what is the memory consumption of your apps? The output of 'free' would probably be a good start.

ken.
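(For reference, the vacuum-plus-reindex being suggested boils down to something like the sketch below. This is only an outline: the table names are taken from the slow queries above and the database name 'puppetdb' is an assumption, both commands take exclusive locks, and on 8.4 a VACUUM FULL tends to bloat indexes, which is exactly why the REINDEX afterwards matters. Stop the puppetdb service first.)

    -- Per-table variant, starting with the tables seen in the slow queries above
    VACUUM FULL VERBOSE catalog_resources;
    REINDEX TABLE catalog_resources;
    VACUUM FULL VERBOSE resource_params;
    REINDEX TABLE resource_params;

    -- Or, for everything at once while PuppetDB is down
    -- (database name 'puppetdb' is an assumption):
    VACUUM FULL;
    REINDEX DATABASE puppetdb;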
FYI, I just upgraded only the PuppetDB part to 1.1.1; using the old 1.0.2 terminus I get no errors:

2013-02-28 17:06:27,711 WARN  [qtp1478462104-39] [http.server] Use of unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:28,284 INFO  [command-proc-44] [puppetdb.command] [7a4a1f70-1e85-4256-8f95-21008b6c1199] [replace facts] puppetdb2.vm
2013-02-28 17:06:30,724 WARN  [qtp1478462104-32] [http.server] Use of unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:32,693 INFO  [command-proc-44] [puppetdb.command] [a87001c5-f064-4a2d-9f9e-de903f62f824] [replace catalog] puppetdb2.vm

If at all possible - I wouldn't mind a full copy of your puppetdb.log ... to dig a bit deeper. And I know I told you to clear the KahaDB queue (I always make this mistake), but I don't suppose you kept an old copy of it?

ken.
> You mean, you've only been watching it for a few minutes, and so far so good - or it crashed? Sorry - just want to be clear :-).

I was watching it for a few minutes and it seemed good. However, the queue grew to 4000 items overnight. Also, we have more of the constraint violation errors now.

> It's not recommended as a recurring maintenance task, this is true, but if your DB has had major changes due to a schema upgrade, or if vacuum hasn't run in a while, it's generally okay and can provide some benefit.

We decided to schedule a full vacuum for this weekend.

> When was the last time it was vacuumed? Try the SQL:
>
> select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables;
>
> And give us the output. I think autovacuum should be on by default for PostgreSQL 8.4 on Red Hat, but I can't recall.

         relname         | last_vacuum |        last_autovacuum        | last_analyze |       last_autoanalyze
-------------------------+-------------+-------------------------------+--------------+-------------------------------
 edges                   |             | 2013-03-01 09:49:04.659005+01 |              | 2013-03-01 08:57:59.479092+01
 reports                 |             |                               |              |
 resource_events         |             |                               |              |
 schema_migrations       |             |                               |              |
 certnames               |             | 2013-03-01 09:20:50.378484+01 |              | 2013-03-01 09:19:49.22173+01
 certname_catalogs       |             | 2013-03-01 09:07:54.251874+01 |              | 2013-03-01 09:20:50.548025+01
 catalog_resources       |             | 2013-01-29 23:17:04.224172+01 |              | 2013-01-30 08:47:38.371796+01
 catalogs                |             | 2013-03-01 08:20:48.148931+01 |              | 2013-03-01 09:19:48.749645+01
 certname_facts_metadata |             | 2013-03-01 09:20:51.318913+01 |              | 2013-03-01 09:19:50.021701+01
 certname_facts          |             | 2013-03-01 09:19:47.655727+01 |              | 2013-03-01 09:10:53.688119+01
 resource_params         |             | 2013-02-28 15:21:02.192264+01 |              | 2013-02-28 13:13:59.806642+01
(11 rows)

We actually did manually vacuum the database before the upgrade, when we saw the difference between the dump and the database size. Strange that it doesn't show up in that query. But it was probably too little too late anyway.

> The errors you were receiving about constraints should not be occurring at all, which is all very suspect - alas, I have no clue yet as to why. Have they stopped now that we have cleared the queue and restarted?

As I said above, we have new errors of this kind.

> What is the size of your database? CPU/cores ... and RAM on the box?

4 cores @ 2.27 GHz, 12 GB RAM.

> Does your puppetdb service live on the same node as the database? I'm guessing this to be true, as your postgresql.conf is listening only on 'localhost' ... what is the memory consumption of your apps? The output of 'free' would probably be a good start.

Yes, it's the same machine. For now.

"free" output:

             total       used       free     shared    buffers     cached
Mem:      12319548   12150396     169152          0       7388    9504356
-/+ buffers/cache:    2638652    9680896
Swap:      6160376      85212    6075164
On Thursday, February 28, 2013 6:09:35 PM UTC+1, Ken Barber wrote:
> FYI, I just upgraded only the PuppetDB part to 1.1.1; using the old 1.0.2 terminus I get no errors:

As I was saying, we're not entirely sure that this behaviour was caused by the upgrade. We only noticed it one week after the upgrade (so unfortunately the Postgres logs from before the update were already rotated; that's the default setting), and it might be a coincidence. Then again, maybe not. :)

> If at all possible - I wouldn't mind a full copy of your puppetdb.log ... to dig a bit deeper. And I know I told you to clear the KahaDB queue (I always make this mistake), but I don't suppose you kept an old copy of it?

We kept the KahaDB backup both times we flushed the queue. The sizes in /var/lib/puppetdb/mq/localhost/ are as follows:

1.3G  ./KahaDB
637M  ./KahaDB.old2
1.1G  ./scheduler
6.2G  ./KahaDB.old

The big one is the 25,000-message queue.

I would happily send the logs over, but this is something I have to discuss with my colleagues.

Again, thank you for helping! :)

ak0ska
So I've been pondering this issue of yours, and I keep coming back to that error in my mind:

ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"

Regardless of the other issues, a 512 GB DB - yes it's big, but so what? That shouldn't cause an integrity violation. It could be a big hint to something else, but I'm not entirely sure it is the cause.

This error - can we try to track down the corresponding PuppetDB error message that goes with it, in puppetdb.log or perhaps in the dead letter queue for ActiveMQ? If PostgreSQL is throwing the error, then there must be a corresponding exception in puppetdb.log.

It's just that when I look at the code ... it doesn't make sense that the insert fails here - "Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs"." - but I can't yet see a fault in the logic inside the code.

Still, it's going to be hard to replicate without the data on my side.

As far as your issue goes ... you have a couple of avenues:

* Drop the existing database and allow it to be recreated from scratch - then run Puppet in noop mode several times across your entire infrastructure to repopulate it. This has the advantage of being an effective vacuum at the same time :-).
* Roll back the PuppetDB version and the database. This will at least get you back to a known working state, but will still leave you in a place where you need to recover.

In either case, you would want to take a backup of the existing database for diagnostic purposes. But I really feel like there is something structurally wrong here - perhaps the migration scripts failed during the database upgrade?

I'm guessing you can't stay broken for very long - so I would take a snapshot of the current broken state so we can take a look and try to replicate it, and make moves on getting back to a working state as suggested above.

I'll need:

* PuppetDB logs
* Backup of your current DB
* Backup of your old DB
* The broken KahaDB queues

ken.
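(For the first avenue - dropping the database and letting PuppetDB rebuild it - the core of it is just the two statements below. The database and role names are assumptions (both 'puppetdb' here); use whatever your PuppetDB database configuration actually points at, stop the puppetdb service first, and run this while connected to a different database such as 'postgres', since you cannot drop the database you are connected to. On the next puppetdb start the schema should be recreated, and noop runs across the fleet then repopulate catalogs and facts.)

    -- Names are assumptions; adjust to your environment
    DROP DATABASE puppetdb;
    CREATE DATABASE puppetdb OWNER puppetdb;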
Oh - and a copy of the current dead letter queue would be nice; it's normally stored in:

/var/lib/puppetdb/mq/discarded/*

This should also contain the full exceptions for the failed SQL as I mentioned earlier, so perhaps a glance into those now, letting me know what the prevalent failure is, would be handy.

I can organise a secure space on a Puppetlabs support storage area to upload this data if you are willing. Just contact me privately to organise it.

ken.
On Friday, March 1, 2013 3:36:20 PM UTC+1, Ken Barber wrote:
> Oh - and a copy of the current dead letter queue would be nice; it's normally stored in:
>
> /var/lib/puppetdb/mq/discarded/*

I will back it up.

> This should also contain the full exceptions for the failed SQL as I mentioned earlier, so perhaps a glance into those now, letting me know what the prevalent failure is, would be handy.

Here's one of them:

Attempt 1 @ 2013-02-18T17:04:30.512Z

java.lang.IllegalArgumentException: Edge '{:relationship :contains, :target {:title nil, :type nil}, :source {:title "Castor::Installer", :type "Class"}}' refers to resource '{:title nil, :type nil}', which doesn't exist in the catalog.
    com.puppetlabs.puppetdb.catalog$validate_edges.invoke(catalog.clj:205)
    clojure.core$comp$fn__4036.invoke(core.clj:2286)
    com.puppetlabs.puppetdb.catalog$eval1498$fn__1499.invoke(catalog.clj:311)
    clojure.lang.MultiFn.invoke(MultiFn.java:167)
    com.puppetlabs.puppetdb.command$replace_catalog_STAR_$fn__2696.invoke(command.clj:308)
    com.puppetlabs.puppetdb.command$replace_catalog_STAR_.invoke(command.clj:308)
    com.puppetlabs.puppetdb.command$eval2716$fn__2718.invoke(command.clj:329)
    clojure.lang.MultiFn.invoke(MultiFn.java:167)
    com.puppetlabs.puppetdb.command$produce_message_handler$fn__2838.invoke(command.clj:566)
    com.puppetlabs.puppetdb.command$wrap_with_discard$fn__2789$fn__2792.invoke(command.clj:474)
    com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
    com.yammer.metrics.core.Timer.time(Timer.java:91)
    com.puppetlabs.puppetdb.command$wrap_with_discard$fn__2789.invoke(command.clj:473)
    com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__2774$fn__2775.invoke(command.clj:427)
    com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
    com.yammer.metrics.core.Timer.time(Timer.java:91)
    com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__2774.invoke(command.clj:426)
    com.puppetlabs.puppetdb.command$wrap_with_command_parser$fn__2784.invoke(command.clj:449)
    com.puppetlabs.puppetdb.command$wrap_with_meter$fn__2765.invoke(command.clj:387)
    com.puppetlabs.puppetdb.command$wrap_with_thread_name$fn__2797.invoke(command.clj:489)
    clamq.jms$jms_consumer$fn__2452.invoke(jms.clj:38)
    clamq.jms.proxy$java.lang.Object$MessageListener$ce893c05.onMessage(Unknown Source)
    org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:560)
    org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:498)
    org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467)
    org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325)
    org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263)
    org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058)
    org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1050)
    org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:947)
    java.lang.Thread.run(Thread.java:679)

Attempt null @ 2013-02-18T17:04:30.512Z

java.lang.IllegalArgumentException: Edge '{:relationship :contains, :target {:title nil, :type nil}, :source {:title "Castor::Installer", :type "Class"}}' refers to resource '{:title nil, :type nil}', which doesn't exist in the catalog.
    (same stack trace as Attempt 1 above)

> I can organise a secure space on a Puppetlabs support storage area to upload this data if you are willing. Just contact me privately to organise it.

I am willing, but this is not up to me. :)
I will keep all the backups you asked for, for now.

> As far as your issue goes ... you have a couple of avenues:
>
> * Drop the existing database and allow it to be recreated from scratch - then run Puppet in noop mode several times across your entire infrastructure to repopulate it. This has the advantage of being an effective vacuum at the same time :-).

This could be the next step :) If the database maintenance doesn't help.

> * Roll back the PuppetDB version and the database. This will at least get you back to a known working state, but will still leave you in a place where you need to recover.
>
> In either case, you would want to take a backup of the existing database for diagnostic purposes. But I really feel like there is something structurally wrong here - perhaps the migration scripts failed during the database upgrade?

I made a backup today, to have a fresh one before we start the database maintenance.

"Structurally wrong" might not be so far-fetched, since we didn't upgrade from an official 1.0.2 release. My colleague got a patched version (I don't know the details, and can't ask now, as he's on holiday), because at that time the official release had an issue with binary files (http://projects.puppetlabs.com/issues/17216). One consequence was that value 8 was missing from the schema_migrations table, even though the matching modifications were already in place in the database (8 rename-fact-column in /src/com/puppetlabs/puppetdb/scf/migrate.clj). Because of that, upgrades failed initially, when PuppetDB tried to rename non-existing columns. The schema looked like it actually matched the migrations up to number 8, so just adding that value seemed to be OK. This was tested at a much smaller scale, and there the upgrade seemed fine; we never saw this error. Perhaps we overlooked something.

If this is the case, then your suggested solution, to recreate the database from scratch, should solve the problem. This will be our next step, if the database maintenance doesn't help.

> I'm guessing you can't stay broken for very long - so I would take a snapshot of the current broken state so we can take a look and try to replicate it, and make moves on getting back to a working state as suggested above.
>
> I'll need:
>
> * PuppetDB logs
> * Backup of your current DB
> * Backup of your old DB
> * The broken KahaDB queues

I have all these backups, but other people have to authorize sharing them.

Have a nice weekend! :)

ak0ska
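(Given that history, it may be worth comparing what the schema_migrations table claims has been applied against the list in migrate.clj before deciding on a path; if another migration number besides 8 is missing, or was recorded without its DDL actually running, that would explain the structural oddities. A trivial check - the exact column layout of the table is not shown in this thread, so SELECT * keeps it safe:)

    -- List the migrations the database believes it has applied
    SELECT * FROM schema_migrations;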
> If this is the case, then your suggested solution, to recreate the database from scratch, should solve the problem. This will be our next step, if the database maintenance doesn't help.

Well, I don't think a vacuum will help you - I imagine something is wrong with the schema right now, or some data migration failed during the upgrade, especially given the migration issues from your custom PuppetDB. Of course I can't prove this with so little knowledge - but it certainly does raise a red flag.

The only other option is to compare the schema with a known good one, and compare the database data to try and find the fault. This, however, is prone to error and might be time-consuming. Recreating the database sounds like a far more reliable option with a lot more guarantees around it.

ken.
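(If you do want to attempt the schema comparison before wiping anything, one low-tech way is to dump the column layout from both this database and a known-good PuppetDB 1.1.1 instance and diff the results; pg_dump --schema-only would do the same job in more detail. The query below only uses the standard information_schema, with no assumptions beyond the 'public' schema:)

    -- Column layout, suitable for diffing against a known-good instance
    SELECT table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position;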
Yes, maybe not. The next step will be to recreate it from scratch.

On Friday, March 1, 2013 5:47:06 PM UTC+1, Ken Barber wrote:
> Well, I don't think a vacuum will help you - I imagine something is
> wrong with the schema right now, or some data migration failed during
> the upgrade. [...]
Are puppetdb and postgres on the same server? How many nodes does your environment have?

I had a similar issue and it was linked to I/O. Can you look at that?

Mike { Thanks => always }

On Mar 4, 2013, at 4:55 PM, Ken Barber <ken@puppetlabs.com> wrote:
> Any progress today? [...]
A little update on our story. Vacuum full was running for the whole weekend, so we didn't yet have time to rebuild the indexes, because that would require more downtime, and we're not sure how long it would take. The size of the database didn't drop that much; it's now ~370 GB.

We already see some improvements. The queue doesn't get insanely large - in fact the highest peak we saw since Monday morning was about 20, but mostly it's between 0 and 3. That's good. :) One strange benefit is that we now see "Resources" and "Resource duplication" values on the Dashboard; those were previously just question marks, as you can see on my previous screenshots.

However, we still get the constraint violations, steadily 2 violations per hour. They appear as "Retried" commands on the dashboard. But if the queue reaches size 0, does that mean these commands get written to the database eventually? The violations seem to happen shortly after puppetdb starts garbage collection. You can see it here: http://pastebin.com/B6VR67LW

We are still thinking about wiping the database and bringing up a new one, as you suggested, but that cannot be done before the next weekend. Perhaps we should rebuild the indexes first, and see if that further improves the situation. It's only been 1 day since the restart, but so far it looks a lot better than before.

Cheers,

ak0ska

On Monday, March 4, 2013 10:55:37 PM UTC+1, Ken Barber wrote:
> Any progress today? [...]
Hey Mike,

Thanks for the suggestions, but we already checked the IO rates, and they seemed fine. And yes, PuppetDB and Postgres are on the same machine for now, but we plan to change that sometime in the future.

Cheers,

ak0ska

On Tuesday, March 5, 2013 12:51:16 AM UTC+1, Mike wrote:
> Are puppetdb and postgres on the same server? How many nodes does your
> environment have? [...]
What are the I/O stats? Can I just peek at them?

Mike { Thanks => always }
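P.S. Nothing fancy needed - a few samples of the extended device stats while the masters are busy would already tell a lot. Something like this (flags from sysstat's iostat; adjust the interval/count to taste):

  # 6 samples, 5 seconds apart, extended per-device stats in kB
  iostat -xk 5 6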
@Mike

iostat -nx

  Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
  md0        0.00    0.00  85.55  405.31  9226.53  3234.60     25.39      0.00   0.00   0.00   0.00

@Ken

> Wow. That's still way too large for the amount of nodes. I imagine
> re-indexing might help, we can check first. Can you display how big
> each of your relations are? A query like this might help:
>
> SELECT nspname || '.' || relname AS "relation",
>     pg_size_pretty(pg_relation_size(C.oid)) AS "size"
>   FROM pg_class C
>   LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
>   WHERE nspname NOT IN ('pg_catalog', 'information_schema')
>   ORDER BY pg_relation_size(C.oid) DESC
>   LIMIT 20;

Indexes seem bloated.

                 relation                 |  size
 -----------------------------------------+---------
  public.idx_catalog_resources_tags_gin   | 117 GB
  public.idx_catalog_resources_tags       | 96 GB
  public.idx_catalog_resources_resource   | 39 GB
  public.idx_catalog_resources_catalog    | 39 GB
  public.idx_catalog_resources_type_title | 34 GB
  public.catalog_resources_pkey           | 32 GB
  public.idx_catalog_resources_type       | 16 GB
  public.catalog_resources                | 9454 MB
  public.edges_pkey                       | 2460 MB
  public.edges                            | 875 MB
  public.idx_certname_facts_name          | 447 MB
  public.certname_facts_pkey              | 77 MB
  public.idx_certname_facts_certname      | 66 MB
  public.resource_params                  | 66 MB
  public.idx_resources_params_name        | 60 MB
  public.idx_resources_params_resource    | 50 MB
  public.resource_params_pkey             | 43 MB
  public.certname_facts                   | 41 MB
  pg_toast.pg_toast_16463                 | 34 MB
  pg_toast.pg_toast_16463_index           | 2360 kB

> Also - what are the row counts for each of your tables btw?
>
> If you run this for each table it will get you a row count:
>
> select count(*) from table_name;
>
> I'd be curious to see if any of these are higher than what they should
> be based on your node count, maybe some of the space is just large
> tables perhaps?

Row counts:

  certname_facts            336426
  catalogs                     2963
  resource_events                 0
  reports                         0
  certnames                    2825
  certname_catalogs            2810
  certname_facts_metadata      2764
  catalog_resources         1881888
  resource_params            348806
  edges                      3454907
  schema_migrations               9

> Perhaps the queries were just timing out on the dashboard?

Probably.

> This should be evident in your puppetdb.log if you trace the uuid of a
> command. If commands 'fail' completely, they end up in the DLQ located
> here:
>
> /var/lib/puppetdb/mq/discarded/

Just looked at that directory, no entry with a recent date, so I guess they go through eventually.
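By the way, the quick check we use to see whether any of this is actually reclaiming space is nothing clever, just the built-in size functions (adjust the database name and connection options if yours differ):

  # overall database size
  psql -d puppetdb -c "SELECT pg_size_pretty(pg_database_size('puppetdb'));"

  # one table including all of its indexes and toast data
  psql -d puppetdb -c "SELECT pg_size_pretty(pg_total_relation_size('catalog_resources'));"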
On Thursday, March 7, 2013 12:23:13 AM UTC+1, Ken Barber wrote:
> So the index 'idx_catalog_resources_tags' was removed in 1.1.0 I
> think, so that is no longer needed.
>
> This points back to making sure your schema matches exactly what a
> known good 1.1.1 has, as things have been missed obviously.
>
> Try this tool out:
>
> http://www.apgdiff.com/diff_online.php
>
> I've got a pg_dump of a good 1.1.1 here:
>
> https://gist.github.com/kbarber/5104169
>
> All you need to do is perform a pg_dump of your schema:
>
> pg_dump --schema-only puppetdb
>
> And use that online tool to compare your schema with the one I've
> provided. I would obviously double check its accuracy before doing
> anything drastic, but this should help to get your schema inline with
> what it should be. If you are paranoid about any part of the diff,
> paste the results in a gist and we can discuss it.

The only difference seems to be that one index, nothing else.

> Okay, well - let's at least get your schema back inline with what it
> should be, get your re-indexes done and we can come back round to this
> problem if it still exists.
>
> ken.

We will rebuild the indexes during the weekend, and see how it performs afterwards.
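Roughly what we have planned for that window - a sketch only, with the index and table names taken from the size listing earlier in the thread, to be run while PuppetDB is stopped:

  service puppetdb stop

  # drop the index that 1.1.x no longer needs, then rebuild the rest
  psql -d puppetdb -c 'DROP INDEX idx_catalog_resources_tags;'
  psql -d puppetdb -c 'REINDEX TABLE catalog_resources;'
  psql -d puppetdb -c 'REINDEX TABLE edges;'
  psql -d puppetdb -c 'REINDEX TABLE certname_facts;'
  psql -d puppetdb -c 'REINDEX TABLE resource_params;'

  service puppetdb start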
After dropping the obsolete index and rebuilding the others, the database is now ~30 GB. We still get the constraint violation errors when garbage collection starts. Also, the "Resources" and "Resource duplication" values on the dashboard are still question marks, so those queries probably time out.

Although these errors don't have a big impact on performance (at least for now), could they be a sign of something not being quite ok? Shall we still consider rebuilding the database?
> After dropping the obsolete index and rebuilding the others, the database
> is now ~30 GB. We still get the constraint violation errors when garbage
> collection starts.

Okay - can you please send me the puppetdb.log entry that shows the exception? Including surrounding messages?

> Also, the "Resources" and "Resource duplication" values on the dashboard
> are still question marks, so those queries probably time out.

Again, this should have a corresponding exception in puppetdb.log, or at least a slow query being logged. Also - can you pull up the javascript console and see what errors are produced in relation to that error?

> Although these errors don't have a big impact on performance (at least for
> now), could they be a sign of something not being quite ok? Shall we still
> consider rebuilding the database?

Not sure, I need more data.

ken.
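P.S. For the log entries, something as blunt as this is fine (the path is assumed to be the default for the RHEL packages - adjust if yours differs):

  grep -B 5 -A 30 'violates foreign key' /var/log/puppetdb/puppetdb.log

That should catch each exception plus enough surrounding context in one go.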
I think my previous comment just got lost.

So, I cut three occurrences of this error from the database log and the corresponding part from the puppetdb log. I removed the hostnames, I hope it's still sensible: http://pastebin.com/yvyBDWQE

The unversioned api warnings are not from the masters. They're from an older version of the puppetdbquery library, and the puppetdb module's puppetdb_conn_validator function.

Apart from that, there are 2 kinds of slow queries that appear in the logs:

> 2013-03-12 11:54:37,149 WARN [qtp1992135396-3384] [jdbc.internal] Query slower than 10s threshold: actual execution time: 452.2331 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 11:54:42,035 WARN [qtp1992135396-3391] [jdbc.internal] Query slower than 10s threshold: actual execution time: 452.0072 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 11:55:43,146 WARN [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold: actual execution time: 441.2548 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
> 2013-03-12 12:00:56,492 WARN [qtp1992135396-3402] [jdbc.internal] Query slower than 10s threshold: actual execution time: 433.4503 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
> 2013-03-12 12:03:30,933 WARN [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold: actual execution time: 467.7872 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 12:04:18,225 WARN [qtp1992135396-3396] [jdbc.internal] Query slower than 10s threshold: actual execution time: 472.9386 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 12:54:49,042 WARN [qtp1992135396-3463] [jdbc.internal] Query slower than 10s threshold: actual execution time: 464.9721 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 12:57:58,791 WARN [qtp1992135396-3482] [jdbc.internal] Query slower than 10s threshold: actual execution time: 460.9083 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 12:57:58,803 WARN [qtp1992135396-3490] [jdbc.internal] Query slower than 10s threshold: actual execution time: 460.7699 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:

I didn't see any javascript errors on the dashboard, the queries are just slow. I ran the resource duplication query manually on the puppetdb machine:

> time curl -X GET -H 'Accept: application/json' http://[puppetdbip]:8080/v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=pct-resource-dupes -v
> * About to connect() to [puppetdbip] port 8080 (#0)
> *   Trying [puppetdbip]... connected
> * Connected to [puppetdbip] ([puppetdbip]) port 8080 (#0)
> > GET /v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=pct-resource-dupes HTTP/1.1
> > User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.6.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> > Host: [puppetdbip]:8080
> > Accept: application/json
> >
> < HTTP/1.1 200 OK
> < Date: Tue, 12 Mar 2013 11:55:34 GMT
> < Content-Type: application/json;charset=ISO-8859-1
> < Content-Length: 34
> < Server: Jetty(7.x.y-SNAPSHOT)
> <
> {
>   "Value" : 0.9691941482891232
> * Connection #0 to host [puppetdbip] left intact
> * Closing connection #0
> }
>
> real    16m57.604s
> user    0m0.012s
> sys     0m0.025s

On Monday, March 11, 2013 7:04:05 PM UTC+1, Ken Barber wrote:
> Okay - can you please send me the puppetdb.log entry that shows the
> exception? Including surrounding messages? [...]
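P.S. If it helps, the other population metrics can be pulled the same way (same endpoint pattern as above; the mbean names are taken from the PuppetDB metrics documentation as far as I can tell, so worth double-checking against your version):

  curl -H 'Accept: application/json' 'http://[puppetdbip]:8080/v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=num-resources'
  curl -H 'Accept: application/json' 'http://[puppetdbip]:8080/v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=num-nodes'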
On Tue, Mar 12, 2013 at 6:38 AM, ak0ska <akos.hencz@gmail.com> wrote:
> So, I cut three occurrences of this error from the database log and the
> corresponding part from the puppetdb log. I removed the hostnames, I hope
> it's still sensible: http://pastebin.com/yvyBDWQE
> [...]
> Apart from that, there are 2 kinds of slow queries that appear in the logs:
> [...]

Can you fire up psql, point it at your puppetdb database, and run this?

EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL;

That will profile the query, and we can see where the problem is.

Also worth examining is the GC query (modified to be non-destructive):

EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);

And the resource duplication query:

EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;

And finally, can you give us a dump of your database settings? You can do this from within psql by doing "SHOW ALL;"
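An easy way to capture all of that in one sitting, if it helps (just plain psql; adjust the connection options for your environment):

  $ psql -d puppetdb
  puppetdb=# \timing on
  puppetdb=# \o /tmp/puppetdb-explain.txt
  puppetdb=# -- run the three EXPLAIN ANALYZE statements above, then:
  puppetdb=# SHOW ALL;
  puppetdb=# \o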
Hello Deepak,

Here are the queries you asked for:

> Can you fire up psql, point it at your puppetdb database, and run this?
>
> EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL;

 Aggregate  (cost=3132186.09..3132186.10 rows=1 width=0) (actual time=348426.375..348426.376 rows=1 loops=1)
   ->  Hash Join  (cost=247.37..3096532.80 rows=14261314 width=0) (actual time=24059.151..348255.933 rows=1604253 loops=1)
         Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
         ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=41) (actual time=24053.542..347250.967 rows=2107982 loops=1)
         ->  Hash  (cost=216.42..216.42 rows=2476 width=41) (actual time=5.555..5.555 rows=2476 loops=1)
               ->  Hash Join  (cost=94.84..216.42 rows=2476 width=41) (actual time=1.850..4.555 rows=2476 loops=1)
                     Hash Cond: (cc.certname = c.name)
                     ->  Seq Scan on certname_catalogs cc  (cost=0.00..82.55 rows=2855 width=62) (actual time=0.010..0.941 rows=2855 loops=1)
                     ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.818..1.818 rows=2490 loops=1)
                           ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.013..0.823 rows=2490 loops=1)
                                 Filter: (deactivated IS NULL)
 Total runtime: 348446.364 ms
(12 rows)

> Also worth examining is the GC query (modified to be non-destructive):
>
> EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);

 Nested Loop Anti Join  (cost=0.00..1710681.18 rows=360492 width=126) (actual time=1967.182..1967.182 rows=0 loops=1)
   ->  Seq Scan on resource_params  (cost=0.00..13249.23 rows=389623 width=126) (actual time=0.013..76.403 rows=389693 loops=1)
   ->  Index Scan using idx_catalog_resources_resource on catalog_resources cr  (cost=0.00..21429.14 rows=5291 width=41) (actual time=0.004..0.004 rows=1 loops=389693)
         Index Cond: ((cr.resource)::text = (resource_params.resource)::text)
 Total runtime: 1967.253 ms
(5 rows)

> And the resource duplication query:
>
> EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;

 Aggregate  (cost=3129997.71..3129997.72 rows=1 width=0) (actual time=474369.420..474369.420 rows=1 loops=1)
   ->  HashAggregate  (cost=3129927.78..3129958.86 rows=3108 width=41) (actual time=474351.730..474365.144 rows=49466 loops=1)
         ->  Hash Join  (cost=248.48..3094726.38 rows=14080561 width=41) (actual time=21536.467..473678.479 rows=1604237 loops=1)
               Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
               ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=82) (actual time=21531.355..472692.303 rows=1934212 loops=1)
               ->  Hash  (cost=217.36..217.36 rows=2490 width=41) (actual time=5.071..5.071 rows=2476 loops=1)
                     ->  Hash Join  (cost=94.84..217.36 rows=2490 width=41) (actual time=1.700..4.048 rows=2476 loops=1)
                           Hash Cond: (cc.certname = c.name)
                           ->  Seq Scan on certname_catalogs cc  (cost=0.00..83.08 rows=2908 width=62) (actual time=0.010..0.603 rows=2855 loops=1)
                           ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.670..1.670 rows=2490 loops=1)
                                 ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.014..0.785 rows=2490 loops=1)
                                       Filter: (deactivated IS NULL)
 Total runtime: 474370.129 ms
(13 rows)

> And finally, can you give us a dump of your database settings? You can do
> this from within psql by doing "SHOW ALL;"

http://pastebin.com/raw.php?i=HfhqfVa4

Thank you for your help!

Cheers,

ak0ska
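P.S. One thing that stands out to us in the first and third plans: the planner expects ~16.4 million rows from the seq scan on catalog_resources but only ~2 million come back, so the table statistics look stale after all the vacuuming. If you think it's worth it, refreshing them is cheap - something like (plain ANALYZE only, no vacuum):

  psql -d puppetdb -c 'ANALYZE VERBOSE catalog_resources;'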
Hi ak0ska,

So I've been spending the last 2 days trying all kinds of things to replicate your constraint violation problem and I'm still getting nowhere with it. I've been speaking to all kinds of smart people and we believe it's some sort of lock and/or transactional-mode problem, but none of my testing can replicate it.

My latest test was something like:

https://gist.github.com/kbarber/5157836

But with all that preparation and load simulation I still couldn't replicate it :-(.

I think I need more data from you ... in this case, a statement log. Now a statement log can be pretty heavy, so if we can get the data surrounding a GC only that should be enough - however I anticipate it might reduce performance somewhat, so if you're apprehensive about doing it in production I fully understand. I'm just running out of ideas, and without more data I may have trouble coming to a conclusion.

Anyway, if at all possible the instructions for enabling this are as follows:

* Go to your postgresql.conf file (on RHEL 6 this is usually located in /var/lib/pgsql/data)
* Add/modify the entries:
    log_statement = 'all'
    log_line_prefix = '%m PID[%p] SID[%c] VTID[%v] TID[%x] '
* Restart postgresql
* Logs should go to: /var/lib/pgsql/data/pg_log

The closer you can do this to being set _just before_ a GC run the better, as it will reduce the amount of time this setting needs to be on. Once you have a GC run and a subsequent constraint violation logged, you should be able to turn it off. Don't worry about PuppetDB during these cases - it retries the database, so if you can take the small hit with queries, the outage should be quick while you switch the setting on and off (and be mindful that catalog/fact/report submissions will get queued, and queries of course will fail during that short time during the DB restart).

Now the data itself is obviously going to contain private items, so if you prefer to send it to me privately that's fine (contact me off this list if so). Otherwise, it might need some scrubbing.

So the concern I have with your issue, even though it doesn't stop a catalog from being submitted, is that you're seeing a 'block' when those two catalogs get submitted, which makes me fear that data doesn't get populated until _after_ the GC is complete. This is a concern because we don't want data to be delayed like this. If this wasn't the case we could happily let it fail and retry - ultimately the data isn't lost - but we should be able to process data within reasonable time limits without you having to worry about database GC slowing things down.

Now in regards to your performance, we are going to keep looking into the results you gave us. Looking at your settings, Deepak and I think you probably need to increase at least work_mem to 16MB and maintenance_work_mem to 32MB (the latter is used during vacuum), the main item being work_mem here really. These are reasonable changes we've seen other users apply with success, but alone they probably won't solve your issue - it would just be good to get you running with higher values for these.

I'm also very curious to understand the disk IO for your systems. What is the nature of the disks where your database is located? Are they SCSI/SATA/FC or otherwise? What filesystem type are you using? Can you give me any other details you might find interesting?

What are the results of the command:

iostat -k 5

A few samples of the run would be useful to see, so we can understand the average disk IO going on here.
FYI, to give you an idea of what we _should_ be seeing for, say, just one of your queries, here is the same query on a CentOS 5.8 box (running just postgresql 8.1 actually):

puppetdb=# EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;

 Aggregate  (cost=91792.54..91792.55 rows=1 width=0) (actual time=2611.872..2611.872 rows=1 loops=1)
   ->  HashAggregate  (cost=91737.14..91761.76 rows=2462 width=41) (actual time=2606.507..2609.489 rows=5478 loops=1)
         ->  Hash Join  (cost=184.36..90322.32 rows=565926 width=41) (actual time=7.060..2072.000 rows=867847 loops=1)
               Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
               ->  Seq Scan on catalog_resources cr  (cost=0.00..80139.47 rows=867847 width=82) (actual time=0.021..1119.260 rows=867847 loops=1)
               ->  Hash  (cost=159.17..159.17 rows=2015 width=41) (actual time=7.032..7.032 rows=2195 loops=1)
                     ->  Hash Join  (cost=62.34..159.17 rows=2015 width=41) (actual time=2.438..5.830 rows=2195 loops=1)
                           Hash Cond: (cc.certname = c.name)
                           ->  Seq Scan on certname_catalogs cc  (cost=0.00..65.79 rows=2179 width=50) (actual time=0.005..0.908 rows=2195 loops=1)
                           ->  Hash  (cost=37.15..37.15 rows=2015 width=9) (actual time=2.427..2.427 rows=2195 loops=1)
                                 ->  Seq Scan on certnames c  (cost=0.00..37.15 rows=2015 width=9) (actual time=0.008..1.088 rows=2195 loops=1)
                                       Filter: (deactivated IS NULL)
 Total runtime: 2611.935 ms
(13 rows)

As you can see, 2 seconds on my box and 475 seconds on yours - the difference is just too large to be considered a normal discrepancy :-). Be mindful, this is on data that represents 2000 nodes ... so it's not like my tables are smaller than yours (at least not drastically). However, I do notice the number of resources is much higher in your case, so I'm going to try and replicate that.

For now, I'm going to ponder more on your performance, as I believe that affects you the most - so let us get back to you with more updates. If you can give me the details I've requested for now, that would be appreciated.

ken.

On Wed, Mar 13, 2013 at 2:00 AM, ak0ska <akos.hencz@gmail.com> wrote:
> Hello Deepak,
>
> Here are the queries you asked for: [...]
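P.S. For the settings change, this is all I mean - two lines in postgresql.conf followed by a reload (a sketch; both settings should be picked up without a full restart as far as I recall, but double-check on your version):

  # /var/lib/pgsql/data/postgresql.conf
  work_mem = 16MB
  maintenance_work_mem = 32MB

  # then, as the postgres user:
  psql -c 'SELECT pg_reload_conf();'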
Hello Ken, I really appreciate you guys looking into this problem, and I''m happy to provide you with the data you ask for. However, I feel like I should ask, whether you think this problem is worth your efforts, if rebuilding the database might solve the issue? Cheers, ak0ska On Thursday, March 14, 2013 8:05:59 AM UTC+1, Ken Barber wrote:> > Hi ak0ska, > > So I''ve been spending the last 2 days trying all kinds of things to > replicate your constraint violation problem and I still am getting > nowhere with it. I''ve been speaking to all kinds of smart people and > we believe its some sort of lock and/or transactional mode problem but > none of my testing can replicate it. > > My latest test was something like: > > https://gist.github.com/kbarber/5157836 > > But with all that preparation and load simulation I still couldn''t > replicate it :-(. > > I think I need more data from you ... in this case, a statement log. > Now a statement log can be pretty heavy, so if we can get the data > surrounding a GC only that should be enough - however I anticipate it > might reduce some performance so if you''re apprehensive about doing it > in production I fully understand. I''m just running out of ideas, and > without more data I may have trouble coming to a conclusion. > > Anyway, if at all possible the instructions for enabling this are as > follows: > > * Go to your postgresql.conf file (On RHEL 6 this is usually located > in /var/lib/pgsql/data) > * Add/modify the entries: > log_statement = ''all'' > log_line_prefix = ''%m PID[%p] SID[%c] VTID[%v] TID[%x] '' > * Restart postgresql > * Logs should go to: /var/lib/pgsql/data/pg_log > > The closer you can do this to being set _just before_ a GC run the > better, as it will reduce the amount of time this setting needs to be > on. Once you have a GC run and a subsequent constraint logged you > should be able to turn it off. Don''t worry about PuppetDB during these > cases, it retries the database so if you can take the small hit with > queries the outage should be quick while you switch off/on the setting > (and be mindful that catalogs/facts/reports submissions will get > queued, queries of course will fail during that short time during DB > restart). > > Now the data itself is going to obviously contain private items, so if > you prefer to send it to me privately thats fine (contact me off this > list if so). Otherwise, it might need some scrubbing. > > So the concern I have with your issue, even though it doesn''t stop a > catalog from being submitted - is that you''re seeing a ''block'' when > those two catalogs get submitted which makes me fear that data doesn''t > get populated until _after_ the GC is complete. This is a concern > because we don''t want data to be delayed like this. If this wasn''t the > case we could happily let it fail and retry, ultimately the data isn''t > lost, but we should be able to process data within reasonable time > limits without you have to worry about database GC slowing things > down. > > Now in regards to your performance, we are going to keep looking into > the results you gave us. Looking at your settings Deepak and I think > you probably need to increase at least work_mem to 16MB and > maintenance_work_mem to 32MB (this is used during vacuum) the main > item being work_mem here really. These are reasonable changes we''ve > seen other users apply with success, but alone they probably won''t > solve your issue - it would be good to just get you working using > higher values for these. 
> > I''m very curious to understand the disk IO for your systems also. What > are the nature of the disks where your database is located? Are they > SCSI/SATA/FC or otherwise? What filesystem type are you using? Can you > give me any other details you might find interesting? > > What are the results of the command: > > iostat -k 5 > > A few samples of the run would be useful to see, so we can understand > the average disk IO going on here. > > FYI To give you an idea of what we _should_ be seeing for say, just > one of your queries, here is a query on a Centos 5.8 box (running just > postgresql 8.1 actually): > > puppetdb=# EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT > resource FROM catalog_resources cr, certname_catalogs cc, certnames c > WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated > IS NULL) r; > QUERY > PLAN > > ------------------------------------------------------------------------------------------------- > > --------------------------------------------------- > Aggregate (cost=91792.54..91792.55 rows=1 width=0) (actual > time=2611.872..2611.872 rows=1 loops > =1) > -> HashAggregate (cost=91737.14..91761.76 rows=2462 width=41) > (actual time=2606.507..2609.48 > 9 rows=5478 loops=1) > -> Hash Join (cost=184.36..90322.32 rows=565926 width=41) > (actual time=7.060..2072.000 > rows=867847 loops=1) > Hash Cond: ((cr.catalog)::text = (cc.catalog)::text) > -> Seq Scan on catalog_resources cr > (cost=0.00..80139.47 rows=867847 width=82) ( > actual time=0.021..1119.260 rows=867847 loops=1) > -> Hash (cost=159.17..159.17 rows=2015 width=41) > (actual time=7.032..7.032 rows= > 2195 loops=1) > -> Hash Join (cost=62.34..159.17 rows=2015 > width=41) (actual time=2.438..5 > .830 rows=2195 loops=1) > Hash Cond: (cc.certname = c.name) > -> Seq Scan on certname_catalogs cc > (cost=0.00..65.79 rows=2179 widt > h=50) (actual time=0.005..0.908 rows=2195 loops=1) > -> Hash (cost=37.15..37.15 rows=2015 > width=9) (actual time=2.427..2. > 427 rows=2195 loops=1) > -> Seq Scan on certnames c > (cost=0.00..37.15 rows=2015 width=9 > ) (actual time=0.008..1.088 rows=2195 loops=1) > Filter: (deactivated IS NULL) > Total runtime: 2611.935 ms > (13 rows) > > As you can see, 2 seconds on my box and 475 seconds on yours - the > difference is just too large to be considered a normal discrepancy > :-). Be mindful, this is on data that represents 2000 nodes ... so its > not like my tables are smaller then yours (at least not drastically), > however I do notice the number of resources is much higher so I''m > going to try and replicate that. > > For now, I''m going to ponder more on your performance as I believe > that affects you the most - so let us get back to you with more > updates. If you can give me any details as I''ve requested for now that > would be appreciated. > > ken. > > On Wed, Mar 13, 2013 at 2:00 AM, ak0ska <akos....@gmail.com <javascript:>> > wrote: > > Hello Deepak, > > > > Here are the queries you asked for: > > > > > >> > >> Can you fire up psql, point it at your puppetdb database, and run > "EXPLAIN > >> ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, > catalog_resources > >> cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND > >> c.deactivated IS NULL;"? That will profile the query, and we can see > where > >> the problem is. 
>                                                                           QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3132186.09..3132186.10 rows=1 width=0) (actual time=348426.375..348426.376 rows=1 loops=1)
>    ->  Hash Join  (cost=247.37..3096532.80 rows=14261314 width=0) (actual time=24059.151..348255.933 rows=1604253 loops=1)
>          Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
>          ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=41) (actual time=24053.542..347250.967 rows=2107982 loops=1)
>          ->  Hash  (cost=216.42..216.42 rows=2476 width=41) (actual time=5.555..5.555 rows=2476 loops=1)
>                ->  Hash Join  (cost=94.84..216.42 rows=2476 width=41) (actual time=1.850..4.555 rows=2476 loops=1)
>                      Hash Cond: (cc.certname = c.name)
>                      ->  Seq Scan on certname_catalogs cc  (cost=0.00..82.55 rows=2855 width=62) (actual time=0.010..0.941 rows=2855 loops=1)
>                      ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.818..1.818 rows=2490 loops=1)
>                            ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.013..0.823 rows=2490 loops=1)
>                                  Filter: (deactivated IS NULL)
>  Total runtime: 348446.364 ms
> (12 rows)
>
>> Also worth examining is the GC query (modified to be non-destructive):
>>
>> EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);
>
>                                                                                 QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop Anti Join  (cost=0.00..1710681.18 rows=360492 width=126) (actual time=1967.182..1967.182 rows=0 loops=1)
>    ->  Seq Scan on resource_params  (cost=0.00..13249.23 rows=389623 width=126) (actual time=0.013..76.403 rows=389693 loops=1)
>    ->  Index Scan using idx_catalog_resources_resource on catalog_resources cr  (cost=0.00..21429.14 rows=5291 width=41) (actual time=0.004..0.004 rows=1 loops=389693)
>          Index Cond: ((cr.resource)::text = (resource_params.resource)::text)
>  Total runtime: 1967.253 ms
> (5 rows)
>
>> And the resource duplication query:
>>
>> EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;
>
>                                                                              QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3129997.71..3129997.72 rows=1 width=0) (actual time=474369.420..474369.420 rows=1 loops=1)
>    ->  HashAggregate  (cost=3129927.78..3129958.86 rows=3108 width=41) (actual time=474351.730..474365.144 rows=49466 loops=1)
>          ->  Hash Join  (cost=248.48..3094726.38 rows=14080561 width=41) (actual time=21536.467..473678.479 rows=1604237 loops=1)
>                Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
>                ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=82) (actual time=21531.355..472692.303 rows=1934212 loops=1)
>                ->  Hash  (cost=217.36..217.36 rows=2490 width=41) (actual time=5.071..5.071 rows=2476 loops=1)
>                      ->  Hash Join  (cost=94.84..217.36 rows=2490 width=41) (actual time=1.700..4.048 rows=2476 loops=1)
>                            Hash Cond: (cc.certname = c.name)
>                            ->  Seq Scan on certname_catalogs cc  (cost=0.00..83.08 rows=2908 width=62) (actual time=0.010..0.603 rows=2855 loops=1)
>                            ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.670..1.670 rows=2490 loops=1)
>                                  ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.014..0.785 rows=2490 loops=1)
>                                        Filter: (deactivated IS NULL)
>  Total runtime: 474370.129 ms
> (13 rows)
>
>> And finally, can you give us a dump of your database settings? You can do this from within psql by doing "SHOW ALL;"
>
> http://pastebin.com/raw.php?i=HfhqfVa4
>
> Thank you for your help!
>
> Cheers,
>
> ak0ska
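One detail worth noting in the plans above: the planner expects roughly 16.4 million rows in catalog_resources (rows=16444286) but the scans actually return around 2 million, and each sequential scan over that table takes several hundred seconds. That combination often points at table bloat or stale statistics. As a sketch (not from the thread itself, just standard PostgreSQL statistics views, with table names taken from the plans above), sizes and dead-tuple counts can be checked read-only like this:

  -- overall database size
  SELECT pg_size_pretty(pg_database_size(current_database()));

  -- on-disk size and dead-tuple counts for the tables appearing in the plans
  SELECT s.relname,
         pg_size_pretty(pg_total_relation_size(s.relid)) AS total_size,
         s.n_live_tup,
         s.n_dead_tup,
         s.last_autovacuum,
         s.last_autoanalyze
  FROM   pg_stat_user_tables s
  WHERE  s.relname IN ('catalog_resources', 'certname_catalogs', 'certnames', 'resource_params')
  ORDER  BY pg_total_relation_size(s.relid) DESC;

If n_dead_tup dwarfs n_live_tup, or the on-disk sizes are far out of proportion to the row counts, that would line up with the database-size concern raised in the next message.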
So I have a sinking feeling that all of your problems (including the constraint side-effect) are related to general performance issues on your database, for some reason we have yet to determine. This could be related to IO contention, it could be a bad index (although you've rebuilt them all, right?), or something else. What continues to bother me is the fact that you have only managed to reduce your DB size to 30 GB; this is far larger than I expect, and it screams 'problem' to me. It should be less than 5 GB, at least based on my rough calculations from other DBs I've seen in the wild.

It could very well be that something still isn't sufficiently organised (index or tablespace data) for optimal performance. A recreation of the database is by far the fastest way to rebuild everything, though it's a bit like taking an axe to it. Having said that, it is fast and would eliminate a lot of back and forth between us trying to solve this problem.

A dump, drop, recreate and restore is another way; not 100% the same thing, but close to it.

It's entirely up to you - if you want to try either option, go ahead. It would be one fast step in place of a lot of small ones. If it works we may lose some clarity on the underlying issue, so it would be good to keep the original DB backup. If it doesn't work, at least we have ruled out a lot of possibilities in one quick action. So yeah, I wouldn't be upset if you take this action - there is a reason I suggested it earlier, namely that it performs a lot of actions in one single step ... and I still think it is a viable action, but since you were keen on trying other things I was happy to go along :-). At the end of the day, your time and your risk assessment are up to you. If it was me, yeah, I would try it, especially considering the amount of time already spent with very little final result.

Thanks for being patient with me on this, btw ... :-).

ken.

On Thu, Mar 14, 2013 at 3:36 AM, ak0ska <akos.hencz@gmail.com> wrote:
> Hello Ken,
>
> I really appreciate you guys looking into this problem, and I'm happy to
> provide you with the data you ask for. However, I feel like I should ask
> whether you think this problem is worth your efforts, if rebuilding the
> database might solve the issue?
>
> Cheers,
>
> ak0ska
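For anyone wanting to try the dump, drop, recreate and restore route Ken mentions, a minimal sketch follows. The database name, role and dump path are assumptions (a stock setup with a database and role both called puppetdb), and the puppetdb service name is assumed too - adjust to your environment, and keep the dump as the backup Ken recommends:

  # stop PuppetDB so nothing writes to the database during the operation (service name assumed)
  service puppetdb stop

  # take a compressed, custom-format dump; this doubles as the backup of the original DB
  pg_dump -U puppetdb -Fc -f /var/tmp/puppetdb.dump puppetdb

  # drop and recreate the database, then restore the dump into it
  dropdb -U postgres puppetdb
  createdb -U postgres -O puppetdb puppetdb
  pg_restore -U puppetdb -d puppetdb /var/tmp/puppetdb.dump

  service puppetdb start

A full wipe-and-repopulate (letting PuppetDB rebuild the database from fresh catalog submissions) is the more drastic option Ken describes first; the dump/restore above is the "not 100% the same thing, but close to it" variant.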
Hi ak0ska,

FWIW - with the help of some of my colleagues, we've managed to replicate your constraint issue in a lab-style environment now:

https://gist.github.com/kbarber/5157836

Which is a start. It requires a unique precondition to replicate, however, and I've been unable to reproduce it in any way with PuppetDB directly, so the statement log will still be helpful if you can supply it.

ken.
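For reference, the statement-logging settings Ken walked through earlier in the thread, together with the work_mem / maintenance_work_mem values he and Deepak suggested, look like this in postgresql.conf (on RHEL 6 the file usually lives in /var/lib/pgsql/data, per Ken's instructions; the values are the thread's suggestions, not general-purpose tuning advice):

  # temporary statement logging, to capture the window around a GC run
  log_statement = 'all'
  log_line_prefix = '%m PID[%p] SID[%c] VTID[%v] TID[%x] '

  # memory settings suggested earlier in the thread
  work_mem = 16MB
  maintenance_work_mem = 32MB

Restart postgresql after the change (logs end up under /var/lib/pgsql/data/pg_log), and turn log_statement back off once a GC run and a subsequent constraint violation have been captured.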
Hi ak0ska,

How are things going? Anything to report?

ken.