Hello,

We upgraded PuppetDB from 1.0.2 to 1.1.1 last week, and recently we noticed that the queue is not being processed. We don't know if this problem is directly linked to the upgrade. There are now 25,000 messages waiting in the queue. We also see errors in the Postgres log:

> ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
> STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *

Can this be somehow related to the KahaDB leak thread? (https://groups.google.com/d/topic/puppet-users/gWACi4a0BZs/discussion)

Any thoughts? Would flushing the queue solve the problem?
We flushed the queue and also restarted Postgres. Didn't help.
Or maybe it did. The queue size went up to around 900, and after a while it normalized and started slowly decreasing. Right now it's empty. There have been 2 insert errors in the Postgres log since its restart.

It was perhaps on our end, but it would be good to hear some theories, or suggestions on where else to look for a problem in a similar case.
The queue jumped back to ~1,600 and, according to the puppetdb logs, it hasn't processed a single entry for about 20 minutes, but the log shows a lot of slow queries (see example below). The Postgres logs show no errors.

And now it has started processing the queue again.

Slow queries: http://pastebin.com/2B4azRib
If you clear the queue and roll back to the original version, does the problem disappear? If you're having processing problems at the latest version that's what I would do, as I presume we're talking production here, right?

> Can this be somehow related to the KahaDB leak thread?

No - it doesn't sound like it.

> ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
> STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *

That is a bit of a concern - are you receiving a lot of these? Is this constant?

> Use of unversioned APIs is deprecated; please use /v1/resources

These are appearing in your slow query log - are you sure you're running puppetdb-terminus 1.1.1 as well? Did you restart your master after upgrading the terminus?

ken.
Hi, thanks for trying to help! :)

> If you clear the queue and roll back to the original version, does the problem disappear? If you're having processing problems at the latest version that's what I would do, as I presume we're talking production here, right?

Yes, this is production. We would like to avoid reverting, but maybe we will have to.

> > ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> That is a bit of a concern - are you receiving a lot of these? Is this constant?

Before we flushed the queue, quite a lot. Since we flushed it, only 4 times.

> These are appearing in your slow query log - are you sure you're running puppetdb-terminus 1.1.1 as well? Did you restart your master after upgrading the terminus?

The terminus version is correct; however, we were not entirely sure that the masters were properly restarted, so we did it again now. Let's see if this changes anything.

But the hanging is happening again. I checked what queries are running on the database server. Some queries have been running for ~90 minutes, and there are some idle threads. There are two UPDATE commands that are waiting. I assume that while those two don't get executed, queue processing is halted.
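(Side note for anyone digging into the same symptom: on PostgreSQL 8.4 a rough way to list the long-running and waiting statements is a query along the lines below. The column names procpid/current_query/waiting are the 8.4 ones; newer releases renamed them to pid/query and replaced waiting with wait events, so treat this as a sketch rather than a copy-paste recipe.)

    -- Long-running and blocked statements, oldest first (PostgreSQL 8.4 column names)
    SELECT procpid,
           usename,
           waiting,
           now() - query_start AS runtime,
           current_query
    FROM pg_stat_activity
    WHERE current_query <> '<IDLE>'   -- 8.4 marks idle backends with the literal string <IDLE>
    ORDER BY query_start;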
> Hi, thanks for trying to help! :)

I'm @ken_barber on IRC btw, if that is easier.

>> That is a bit of a concern - are you receiving a lot of these? Is this constant?
>
> Before we flushed the queue, quite a lot. Since we flushed it, only 4 times.

>> These are appearing in your slow query log - are you sure you're running puppetdb-terminus 1.1.1 as well? Did you restart your master after upgrading the terminus?
>
> The terminus version is correct; however, we were not entirely sure that the masters were properly restarted, so we did it again now. Let's see if this changes anything.

Okay. Did you clear the ActiveMQ queues after doing this? I usually just move the old KahaDB directory out of the way when I do this.

> But the hanging is happening again. I checked what queries are running on the database server. Some queries have been running for ~90 minutes, and there are some idle threads. There are two UPDATE commands that are waiting. I assume that while those two don't get executed, queue processing is halted.

Okay, so those queries that have been running for 90 minutes - what are they?

Have you tried shutting down PuppetDB, clearing the queue, running a full vacuum on the database, then starting it all up again?

What version of PostgreSQL are we talking here? What distro and version specifically? How big is the database itself?

How many nodes, and how frequently are they checking in? A screenshot of your puppetdb dashboard might give us a good snapshot of the size of your environment as well.

Has any tuning been performed on your postgresql server? A copy of your postgresql.conf would be interesting.

We can do this all privately if you are concerned about sharing information publicly as well - up to you. Like I said, I'm also on Freenode IRC as @ken_barber if this is better for you.

ken.
> I'm @ken_barber on IRC btw, if that is easier.

Can't use IRC here, sorry. :(

> Okay. Did you clear the ActiveMQ queues after doing this? I usually just move the old KahaDB directory out of the way when I do this.

I hadn't thought about that myself, but it makes sense, so I just flushed the queue again while the puppetdb service was stopped. Since this last restart it seems smooth, but it has only been up for a few minutes.

> Okay, so those queries that have been running for 90 minutes - what are they?

SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL
18 instances were running; the oldest was ~90 minutes old at the time.

SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL)
12 instances were running; the oldest was ~90 minutes old at the time.

> Have you tried shutting down PuppetDB, clearing the queue, running a full vacuum on the database, then starting it all up again?

Not yet; according to the PostgreSQL documentation, a full vacuum is not really recommended. Also, it might take a while, and PuppetDB would be unavailable for the duration.

> What version of PostgreSQL are we talking here? What distro and version specifically? How big is the database itself?

8.4.13 on a RHEL6-based distribution. The database size might be an indicator that something is not OK. It currently takes up 512 GB, and it grew by 100 GB over a week. We made a database dump before the PuppetDB upgrade and it was ~1 GB uncompressed. The full vacuum might solve this, but my guess is that it would run for quite a long time.

> How many nodes, and how frequently are they checking in? A screenshot of your puppetdb dashboard might give us a good snapshot of the size of your environment as well.

We have ~2500 nodes. Run intervals for the two major environments: 60 minutes for 1593 machines and 30 minutes for 496. The splay limit is 900 seconds for both.

Two screenshots from today: http://imgur.com/kG5Sth5,bLFqGAX#0 and http://imgur.com/kG5Sth5,bLFqGAX#1

> Has any tuning been performed on your postgresql server? A copy of your postgresql.conf would be interesting.

No tuning that I'm aware of. It's the same default postgresql.conf as deployed by the puppetlabs-postgres module. It's here: http://pastebin.com/njvGm4eu

> We can do this all privately if you are concerned about sharing information publicly as well - up to you. Like I said, I'm also on Freenode IRC as @ken_barber if this is better for you.

So far I don't feel like I've shared anything sensitive, so this might be useful for others as well.

Thanks for the help!
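(To put the 512 GB figure in context, it can help to see which relations are actually holding the space; a table that is orders of magnitude larger than its dump usually points at bloat. A minimal sketch - the database name 'puppetdb' is an assumption, adjust to whatever your PuppetDB database config actually uses:)

    -- Overall database size (database name 'puppetdb' is an assumption)
    SELECT pg_size_pretty(pg_database_size('puppetdb')) AS database_size;

    -- Ten largest tables, including their indexes and TOAST data
    SELECT c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public'
      AND c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 10;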
>> Okay. Did you clear the ActiveMQ queues after doing this? I usually just move the old KahaDB directory out of the way when I do this.
>
> I hadn't thought about that myself, but it makes sense, so I just flushed the queue again while the puppetdb service was stopped. Since this last restart it seems smooth, but it has only been up for a few minutes.

You mean, you've only been watching it for a few minutes, and so far so good - or it crashed? Sorry - just want to be clear :-).

>> Have you tried shutting down PuppetDB, clearing the queue, running a full vacuum on the database, then starting it all up again?
>
> Not yet; according to the PostgreSQL documentation, a full vacuum is not really recommended. Also, it might take a while, and PuppetDB would be unavailable for the duration.

It's not recommended as a recurring maintenance task, this is true, but if your DB has had major changes due to a schema upgrade, or if vacuum hasn't run in a while, it's generally okay and can provide some benefit. But yeah, judging by the size of your DB this will take a long time. I guess my point is, every recommendation has a caveat or back-story.

>> What version of PostgreSQL are we talking here? What distro and version specifically? How big is the database itself?
>
> 8.4.13 on a RHEL6-based distribution. The database size might be an indicator that something is not OK. It currently takes up 512 GB, and it grew by 100 GB over a week. We made a database dump before the PuppetDB upgrade and it was ~1 GB uncompressed. The full vacuum might solve this, but my guess is that it would run for quite a long time.

Wow - it should be more like 5 GB or less for your size.

It sounds like you've got some major fragmentation issues; your indexes may need rebuilding, and yeah - a vacuum will probably help, along with an index rebuild. But this will require an outage, I would say.

When was the last time it was vacuumed? Try the SQL:

select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables;

And give us the output. I think autovacuum should be on by default for PostgreSQL 8.4 on Red Hat, but I can't recall.

On another note ... to be honest, I can't be precise about why the upgrade failed. I'm pretty sure you were running old revisions of the terminus when you upgraded, but that shouldn't cause this kind of anguish. The errors you were receiving about constraints:

Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".

should not be occurring at all, which is all very suspect - alas, I have no clue yet as to why. Have they stopped now that we have cleared the queue and restarted?

>> How many nodes, and how frequently are they checking in? A screenshot of your puppetdb dashboard might give us a good snapshot of the size of your environment as well.
>
> We have ~2500 nodes. Run intervals for the two major environments: 60 minutes for 1593 machines and 30 minutes for 496. The splay limit is 900 seconds for both.
> Two screenshots from today: http://imgur.com/kG5Sth5,bLFqGAX#0 and http://imgur.com/kG5Sth5,bLFqGAX#1

Okay.

>> Has any tuning been performed on your postgresql server? A copy of your postgresql.conf would be interesting.
>
> No tuning that I'm aware of. It's the same default postgresql.conf as deployed by the puppetlabs-postgres module. It's here: http://pastebin.com/njvGm4eu

What is the size of your database? CPU/cores ... and RAM on the box?

Does your puppetdb service live on the same node as the database? I'm guessing this to be true, as your postgresql.conf is listening only on 'localhost' ... what is the memory consumption of your apps? The output of 'free' would probably be a good start.

ken.
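(For reference, the vacuum-plus-reindex being suggested boils down to something like the sketch below. This is only an outline: the table names are taken from the slow queries above and the database name 'puppetdb' is an assumption, both commands take exclusive locks, and on 8.4 a VACUUM FULL tends to bloat indexes, which is exactly why the REINDEX afterwards matters. Stop the puppetdb service first.)

    -- Per-table variant, starting with the tables seen in the slow queries above
    VACUUM FULL VERBOSE catalog_resources;
    REINDEX TABLE catalog_resources;
    VACUUM FULL VERBOSE resource_params;
    REINDEX TABLE resource_params;

    -- Or, for everything at once while PuppetDB is down
    -- (database name 'puppetdb' is an assumption):
    VACUUM FULL;
    REINDEX DATABASE puppetdb;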
FYI, I just upgraded only the PuppetDB part to 1.1.1; using the old 1.0.2 terminus I get no errors:

2013-02-28 17:06:27,711 WARN  [qtp1478462104-39] [http.server] Use of unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:28,284 INFO  [command-proc-44] [puppetdb.command] [7a4a1f70-1e85-4256-8f95-21008b6c1199] [replace facts] puppetdb2.vm
2013-02-28 17:06:30,724 WARN  [qtp1478462104-32] [http.server] Use of unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:32,693 INFO  [command-proc-44] [puppetdb.command] [a87001c5-f064-4a2d-9f9e-de903f62f824] [replace catalog] puppetdb2.vm

If at all possible - I wouldn't mind a full copy of your puppetdb.log ... to dig a bit deeper. And I know I told you to clear the KahaDB queue (I always make this mistake), but I don't suppose you kept an old copy of it?

ken.
> You mean, you've only been watching it for a few minutes, and so far so good - or it crashed? Sorry - just want to be clear :-).

I was watching it for a few minutes and it seemed good. However, the queue grew to 4000 items overnight. Also, we have more of the constraint violation errors now.

> It's not recommended as a recurring maintenance task, this is true, but if your DB has had major changes due to a schema upgrade, or if vacuum hasn't run in a while, it's generally okay and can provide some benefit.

We decided to schedule a full vacuum for this weekend.

> When was the last time it was vacuumed? Try the SQL:
>
> select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables;
>
> And give us the output. I think autovacuum should be on by default for PostgreSQL 8.4 on Red Hat, but I can't recall.

         relname         | last_vacuum |        last_autovacuum        | last_analyze |       last_autoanalyze
-------------------------+-------------+-------------------------------+--------------+-------------------------------
 edges                   |             | 2013-03-01 09:49:04.659005+01 |              | 2013-03-01 08:57:59.479092+01
 reports                 |             |                               |              |
 resource_events         |             |                               |              |
 schema_migrations       |             |                               |              |
 certnames               |             | 2013-03-01 09:20:50.378484+01 |              | 2013-03-01 09:19:49.22173+01
 certname_catalogs       |             | 2013-03-01 09:07:54.251874+01 |              | 2013-03-01 09:20:50.548025+01
 catalog_resources       |             | 2013-01-29 23:17:04.224172+01 |              | 2013-01-30 08:47:38.371796+01
 catalogs                |             | 2013-03-01 08:20:48.148931+01 |              | 2013-03-01 09:19:48.749645+01
 certname_facts_metadata |             | 2013-03-01 09:20:51.318913+01 |              | 2013-03-01 09:19:50.021701+01
 certname_facts          |             | 2013-03-01 09:19:47.655727+01 |              | 2013-03-01 09:10:53.688119+01
 resource_params         |             | 2013-02-28 15:21:02.192264+01 |              | 2013-02-28 13:13:59.806642+01
(11 rows)

We actually did manually vacuum the database before the upgrade, when we saw the difference between the dump and the database size. Strange that it doesn't show up in that query. But it was probably too little too late anyway.

> The errors you were receiving about constraints should not be occurring at all, which is all very suspect - alas, I have no clue yet as to why. Have they stopped now that we have cleared the queue and restarted?

As I said above, we have new errors of this kind.

> What is the size of your database? CPU/cores ... and RAM on the box?

4 cores @ 2.27 GHz, 12 GB RAM.

> Does your puppetdb service live on the same node as the database? I'm guessing this to be true, as your postgresql.conf is listening only on 'localhost' ... what is the memory consumption of your apps? The output of 'free' would probably be a good start.

Yes, it's the same machine. For now.

"free" output:

             total       used       free     shared    buffers     cached
Mem:      12319548   12150396     169152          0       7388    9504356
-/+ buffers/cache:    2638652    9680896
Swap:      6160376      85212    6075164
On Thursday, February 28, 2013 6:09:35 PM UTC+1, Ken Barber wrote:
> FYI, I just upgraded only the PuppetDB part to 1.1.1; using the old 1.0.2 terminus I get no errors:

As I was saying, we're not entirely sure that this behaviour was caused by the upgrade. We only noticed it one week after the upgrade (so unfortunately the Postgres logs from before the update were already rotated; that's the default setting), and it might be a coincidence. Then again, maybe not. :)

> If at all possible - I wouldn't mind a full copy of your puppetdb.log ... to dig a bit deeper. And I know I told you to clear the KahaDB queue (I always make this mistake), but I don't suppose you kept an old copy of it?

We kept the KahaDB backup both times we flushed the queue. The sizes in /var/lib/puppetdb/mq/localhost/ are as follows:

1.3G  ./KahaDB
637M  ./KahaDB.old2
1.1G  ./scheduler
6.2G  ./KahaDB.old

The big one is the 25,000-message queue.

I would happily send the logs over, but this is something I have to discuss with my colleagues.

Again, thank you for helping! :)

ak0ska
So I've been pondering this issue of yours, and I keep coming back to that error in my mind:

ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"

Regardless of the other issues, a 512 GB DB - yes it's big, but so what? That shouldn't cause an integrity violation. It could be a big hint to something else, but I'm not entirely sure it is the cause.

This error - can we try to track down the corresponding PuppetDB error message that goes with it, in puppetdb.log or perhaps in the dead letter queue for ActiveMQ? If PostgreSQL is throwing the error, then there must be a corresponding exception in puppetdb.log.

It's just that when I look at the code ... it doesn't make sense that the insert fails here - "Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs"." - but I can't yet see a fault in the logic inside the code.

Still, it's going to be hard to replicate without the data on my side.

As far as your issue goes ... you have a couple of avenues:

* Drop the existing database and allow it to be recreated from scratch - then run Puppet in noop mode several times across your entire infrastructure to repopulate it. This has the advantage of being an effective vacuum at the same time :-).
* Roll back the PuppetDB version and the database. This will at least get you back to a known working state, but will still leave you in a place where you need to recover.

In either case, you would want to take a backup of the existing database for diagnostic purposes. But I really feel like there is something structurally wrong here - perhaps the migration scripts failed during the database upgrade?

I'm guessing you can't stay broken for very long - so I would take a snapshot of the current broken state so we can take a look and try to replicate it, and make moves on getting back to a working state as suggested above.

I'll need:

* PuppetDB logs
* Backup of your current DB
* Backup of your old DB
* The broken KahaDB queues

ken.
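(For the first avenue - dropping the database and letting PuppetDB rebuild it - the core of it is just the two statements below. The database and role names are assumptions (both 'puppetdb' here); use whatever your PuppetDB database configuration actually points at, stop the puppetdb service first, and run this while connected to a different database such as 'postgres', since you cannot drop the database you are connected to. On the next puppetdb start the schema should be recreated, and noop runs across the fleet then repopulate catalogs and facts.)

    -- Names are assumptions; adjust to your environment
    DROP DATABASE puppetdb;
    CREATE DATABASE puppetdb OWNER puppetdb;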
Oh - and a copy of the current dead letter queue would be nice; it's normally stored in:

/var/lib/puppetdb/mq/discarded/*

This should also contain the full exceptions for the failed SQL as I mentioned earlier, so perhaps a glance into those now, letting me know what the prevalent failure is, would be handy.

I can organise a secure space on a Puppetlabs support storage area to upload this data if you are willing. Just contact me privately to organise it.

ken.
On Friday, March 1, 2013 3:36:20 PM UTC+1, Ken Barber wrote:
> Oh - and a copy of the current dead letter queue would be nice; it's normally stored in:
>
> /var/lib/puppetdb/mq/discarded/*

I will back it up.

> This should also contain the full exceptions for the failed SQL as I mentioned earlier, so perhaps a glance into those now, letting me know what the prevalent failure is, would be handy.

Here's one of them:

Attempt 1 @ 2013-02-18T17:04:30.512Z

java.lang.IllegalArgumentException: Edge '{:relationship :contains, :target {:title nil, :type nil}, :source {:title "Castor::Installer", :type "Class"}}' refers to resource '{:title nil, :type nil}', which doesn't exist in the catalog.
    com.puppetlabs.puppetdb.catalog$validate_edges.invoke(catalog.clj:205)
    clojure.core$comp$fn__4036.invoke(core.clj:2286)
    com.puppetlabs.puppetdb.catalog$eval1498$fn__1499.invoke(catalog.clj:311)
    clojure.lang.MultiFn.invoke(MultiFn.java:167)
    com.puppetlabs.puppetdb.command$replace_catalog_STAR_$fn__2696.invoke(command.clj:308)
    com.puppetlabs.puppetdb.command$replace_catalog_STAR_.invoke(command.clj:308)
    com.puppetlabs.puppetdb.command$eval2716$fn__2718.invoke(command.clj:329)
    clojure.lang.MultiFn.invoke(MultiFn.java:167)
    com.puppetlabs.puppetdb.command$produce_message_handler$fn__2838.invoke(command.clj:566)
    com.puppetlabs.puppetdb.command$wrap_with_discard$fn__2789$fn__2792.invoke(command.clj:474)
    com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
    com.yammer.metrics.core.Timer.time(Timer.java:91)
    com.puppetlabs.puppetdb.command$wrap_with_discard$fn__2789.invoke(command.clj:473)
    com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__2774$fn__2775.invoke(command.clj:427)
    com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
    com.yammer.metrics.core.Timer.time(Timer.java:91)
    com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__2774.invoke(command.clj:426)
    com.puppetlabs.puppetdb.command$wrap_with_command_parser$fn__2784.invoke(command.clj:449)
    com.puppetlabs.puppetdb.command$wrap_with_meter$fn__2765.invoke(command.clj:387)
    com.puppetlabs.puppetdb.command$wrap_with_thread_name$fn__2797.invoke(command.clj:489)
    clamq.jms$jms_consumer$fn__2452.invoke(jms.clj:38)
    clamq.jms.proxy$java.lang.Object$MessageListener$ce893c05.onMessage(Unknown Source)
    org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:560)
    org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:498)
    org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467)
    org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325)
    org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263)
    org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058)
    org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1050)
    org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:947)
    java.lang.Thread.run(Thread.java:679)

Attempt null @ 2013-02-18T17:04:30.512Z

java.lang.IllegalArgumentException: Edge '{:relationship :contains, :target {:title nil, :type nil}, :source {:title "Castor::Installer", :type "Class"}}' refers to resource '{:title nil, :type nil}', which doesn't exist in the catalog.
    (same stack trace as Attempt 1 above)

> I can organise a secure space on a Puppetlabs support storage area to upload this data if you are willing. Just contact me privately to organise it.

I am willing, but this is not up to me. :)
I will keep all the backups you asked for, for now.

> As far as your issue goes ... you have a couple of avenues:
>
> * Drop the existing database and allow it to be recreated from scratch - then run Puppet in noop mode several times across your entire infrastructure to repopulate it. This has the advantage of being an effective vacuum at the same time :-).

This could be the next step :) If the database maintenance doesn't help.

> * Roll back the PuppetDB version and the database. This will at least get you back to a known working state, but will still leave you in a place where you need to recover.
>
> In either case, you would want to take a backup of the existing database for diagnostic purposes. But I really feel like there is something structurally wrong here - perhaps the migration scripts failed during the database upgrade?

I made a backup today, to have a fresh one before we start the database maintenance.

"Structurally wrong" might not be so far-fetched, since we didn't upgrade from an official 1.0.2 release. My colleague got a patched version (I don't know the details, and can't ask now, as he's on holiday), because at that time the official release had an issue with binary files (http://projects.puppetlabs.com/issues/17216). One consequence was that value 8 was missing from the schema_migrations table, even though the matching modifications were already in place in the database (8 rename-fact-column in /src/com/puppetlabs/puppetdb/scf/migrate.clj). Because of that, upgrades failed initially, when PuppetDB tried to rename non-existing columns. The schema looked like it actually matched the migrations up to number 8, so just adding that value seemed to be OK. This was tested at a much smaller scale, and there the upgrade seemed fine; we never saw this error. Perhaps we overlooked something.

If this is the case, then your suggested solution, to recreate the database from scratch, should solve the problem. This will be our next step, if the database maintenance doesn't help.

> I'm guessing you can't stay broken for very long - so I would take a snapshot of the current broken state so we can take a look and try to replicate it, and make moves on getting back to a working state as suggested above.
>
> I'll need:
>
> * PuppetDB logs
> * Backup of your current DB
> * Backup of your old DB
> * The broken KahaDB queues

I have all these backups, but other people have to authorize sharing them.

Have a nice weekend! :)

ak0ska
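(Given that history, it may be worth comparing what the schema_migrations table claims has been applied against the list in migrate.clj before deciding on a path; if another migration number besides 8 is missing, or was recorded without its DDL actually running, that would explain the structural oddities. A trivial check - the exact column layout of the table is not shown in this thread, so SELECT * keeps it safe:)

    -- List the migrations the database believes it has applied
    SELECT * FROM schema_migrations;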
> If this is the case, then your suggested solution, to recreate the database from scratch, should solve the problem. This will be our next step, if the database maintenance doesn't help.

Well, I don't think a vacuum will help you - I imagine something is wrong with the schema right now, or some data migration failed during the upgrade, especially given the migration issues from your custom PuppetDB. Of course I can't prove this with so little knowledge - but it certainly does raise a red flag.

The only other option is to compare the schema with a known good one, and compare the database data to try and find the fault. This, however, is prone to error and might be time-consuming. Recreating the database sounds like a far more reliable option with a lot more guarantees around it.

ken.
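(If you do want to attempt the schema comparison before wiping anything, one low-tech way is to dump the column layout from both this database and a known-good PuppetDB 1.1.1 instance and diff the results; pg_dump --schema-only would do the same job in more detail. The query below only uses the standard information_schema, with no assumptions beyond the 'public' schema:)

    -- Column layout, suitable for diffing against a known-good instance
    SELECT table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position;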
Yes, maybe not. The next step will be to recreate it from scratch.

On Friday, March 1, 2013 5:47:06 PM UTC+1, Ken Barber wrote:
> Well, I don't think a vacuum will help you - I imagine something is
> wrong with the schema right now, or some data migration failed during
> the upgrade. [...]
Are puppetdb and postgres on the same server? How many nodes does your environment have?

I had a similar issue and it was linked to I/O. Can you look at that?

Mike { Thanks => always }

On Mar 4, 2013, at 4:55 PM, Ken Barber <ken@puppetlabs.com> wrote:
> Any progress today? [...]
A little update on our story. Vacuum full was running for the whole weekend, so we didn't yet have time to rebuild the indexes, because that would require more downtime, and we're not sure how long it would take. The size of the database didn't drop that much; it's now ~370 GB.

We already see some improvements. The queue doesn't get insanely large - in fact the highest peak we saw since Monday morning was about 20, but mostly it's between 0 and 3. That's good. :) One strange benefit is that we now see "Resources" and "Resource duplication" values on the Dashboard; those were previously just question marks, as you can see on my previous screenshots.

However, we still get the constraint violations, steadily 2 violations per hour. They appear as "Retried" commands on the dashboard. But if the queue reaches size 0, does that mean these commands get written to the database eventually? The violations seem to happen shortly after puppetdb starts garbage collection. You can see it here: http://pastebin.com/B6VR67LW

We are still thinking about wiping the database and bringing up a new one, as you suggested, but that cannot be done before the next weekend. Perhaps we should rebuild the indexes first, and see if that further improves the situation. It's only been 1 day since the restart, but so far it looks a lot better than before.

Cheers,

ak0ska

On Monday, March 4, 2013 10:55:37 PM UTC+1, Ken Barber wrote:
> Any progress today? [...]
Hey Mike,

Thanks for the suggestions, but we already checked the IO rates, and they seemed fine. And yes, PuppetDB and Postgres are on the same machine for now, but we plan to change that sometime in the future.

Cheers,

ak0ska

On Tuesday, March 5, 2013 12:51:16 AM UTC+1, Mike wrote:
> Are puppetdb and postgres on the same server? How many nodes does your
> environment have? [...]
What are the I/O stats? Can I just peek at them?

Mike { Thanks => always }
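P.S. Nothing fancy needed - a few samples of the extended device stats while the masters are busy would already tell a lot. Something like this (flags from sysstat's iostat; adjust the interval/count to taste):

  # 6 samples, 5 seconds apart, extended per-device stats in kB
  iostat -xk 5 6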
@Mike

iostat -nx

  Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
  md0        0.00    0.00  85.55  405.31  9226.53  3234.60     25.39      0.00   0.00   0.00   0.00

@Ken

> Wow. That's still way too large for the amount of nodes. I imagine
> re-indexing might help, we can check first. Can you display how big
> each of your relations are? A query like this might help:
>
> SELECT nspname || '.' || relname AS "relation",
>     pg_size_pretty(pg_relation_size(C.oid)) AS "size"
>   FROM pg_class C
>   LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
>   WHERE nspname NOT IN ('pg_catalog', 'information_schema')
>   ORDER BY pg_relation_size(C.oid) DESC
>   LIMIT 20;

Indexes seem bloated.

                 relation                 |  size
 -----------------------------------------+---------
  public.idx_catalog_resources_tags_gin   | 117 GB
  public.idx_catalog_resources_tags       | 96 GB
  public.idx_catalog_resources_resource   | 39 GB
  public.idx_catalog_resources_catalog    | 39 GB
  public.idx_catalog_resources_type_title | 34 GB
  public.catalog_resources_pkey           | 32 GB
  public.idx_catalog_resources_type       | 16 GB
  public.catalog_resources                | 9454 MB
  public.edges_pkey                       | 2460 MB
  public.edges                            | 875 MB
  public.idx_certname_facts_name          | 447 MB
  public.certname_facts_pkey              | 77 MB
  public.idx_certname_facts_certname      | 66 MB
  public.resource_params                  | 66 MB
  public.idx_resources_params_name        | 60 MB
  public.idx_resources_params_resource    | 50 MB
  public.resource_params_pkey             | 43 MB
  public.certname_facts                   | 41 MB
  pg_toast.pg_toast_16463                 | 34 MB
  pg_toast.pg_toast_16463_index           | 2360 kB

> Also - what are the row counts for each of your tables btw?
>
> If you run this for each table it will get you a row count:
>
> select count(*) from table_name;
>
> I'd be curious to see if any of these are higher than what they should
> be based on your node count, maybe some of the space is just large
> tables perhaps?

Row counts:

  certname_facts            336426
  catalogs                     2963
  resource_events                 0
  reports                         0
  certnames                    2825
  certname_catalogs            2810
  certname_facts_metadata      2764
  catalog_resources         1881888
  resource_params            348806
  edges                      3454907
  schema_migrations               9

> Perhaps the queries were just timing out on the dashboard?

Probably.

> This should be evident in your puppetdb.log if you trace the uuid of a
> command. If commands 'fail' completely, they end up in the DLQ located
> here:
>
> /var/lib/puppetdb/mq/discarded/

Just looked at that directory, no entry with a recent date, so I guess they go through eventually.
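By the way, the quick check we use to see whether any of this is actually reclaiming space is nothing clever, just the built-in size functions (adjust the database name and connection options if yours differ):

  # overall database size
  psql -d puppetdb -c "SELECT pg_size_pretty(pg_database_size('puppetdb'));"

  # one table including all of its indexes and toast data
  psql -d puppetdb -c "SELECT pg_size_pretty(pg_total_relation_size('catalog_resources'));"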
On Thursday, March 7, 2013 12:23:13 AM UTC+1, Ken Barber wrote:
> So the index 'idx_catalog_resources_tags' was removed in 1.1.0 I
> think, so that is no longer needed.
>
> This points back to making sure your schema matches exactly what a
> known good 1.1.1 has, as things have been missed obviously.
>
> Try this tool out:
>
> http://www.apgdiff.com/diff_online.php
>
> I've got a pg_dump of a good 1.1.1 here:
>
> https://gist.github.com/kbarber/5104169
>
> All you need to do is perform a pg_dump of your schema:
>
> pg_dump --schema-only puppetdb
>
> And use that online tool to compare your schema with the one I've
> provided. I would obviously double check its accuracy before doing
> anything drastic, but this should help to get your schema inline with
> what it should be. If you are paranoid about any part of the diff,
> paste the results in a gist and we can discuss it.

The only difference seems to be that one index, nothing else.

> Okay, well - let's at least get your schema back inline with what it
> should be, get your re-indexes done and we can come back round to this
> problem if it still exists.
>
> ken.

We will rebuild the indexes during the weekend, and see how it performs afterwards.
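Roughly what we have planned for that window - a sketch only, with the index and table names taken from the size listing earlier in the thread, to be run while PuppetDB is stopped:

  service puppetdb stop

  # drop the index that 1.1.x no longer needs, then rebuild the rest
  psql -d puppetdb -c 'DROP INDEX idx_catalog_resources_tags;'
  psql -d puppetdb -c 'REINDEX TABLE catalog_resources;'
  psql -d puppetdb -c 'REINDEX TABLE edges;'
  psql -d puppetdb -c 'REINDEX TABLE certname_facts;'
  psql -d puppetdb -c 'REINDEX TABLE resource_params;'

  service puppetdb start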
After dropping the obsolete index and rebuilding the others, the database is now ~30 GB. We still get the constraint violation errors when garbage collection starts. Also, the "Resources" and "Resource duplication" values on the dashboard are still question marks, so those queries probably time out.

Although these errors don't have a big impact on performance (at least for now), could they be a sign of something not being quite ok? Shall we still consider rebuilding the database?
> After dropping the obsolete index and rebuilding the others, the database
> is now ~30 GB. We still get the constraint violation errors when garbage
> collection starts.

Okay - can you please send me the puppetdb.log entry that shows the exception? Including surrounding messages?

> Also, the "Resources" and "Resource duplication" values on the dashboard
> are still question marks, so those queries probably time out.

Again, this should have a corresponding exception in puppetdb.log, or at least a slow query being logged. Also - can you pull up the javascript console and see what errors are produced in relation to that error?

> Although these errors don't have a big impact on performance (at least for
> now), could they be a sign of something not being quite ok? Shall we still
> consider rebuilding the database?

Not sure, I need more data.

ken.
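P.S. For the log entries, something as blunt as this is fine (the path is assumed to be the default for the RHEL packages - adjust if yours differs):

  grep -B 5 -A 30 'violates foreign key' /var/log/puppetdb/puppetdb.log

That should catch each exception plus enough surrounding context in one go.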
I think my previous comment just got lost.

So, I cut three occurrences of this error from the database log and the corresponding part from the puppetdb log. I removed the hostnames, I hope it's still sensible: http://pastebin.com/yvyBDWQE

The unversioned api warnings are not from the masters. They're from an older version of the puppetdbquery library, and the puppetdb module's puppetdb_conn_validator function.

Apart from that, there are 2 kinds of slow queries that appear in the logs:

> 2013-03-12 11:54:37,149 WARN [qtp1992135396-3384] [jdbc.internal] Query slower than 10s threshold: actual execution time: 452.2331 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 11:54:42,035 WARN [qtp1992135396-3391] [jdbc.internal] Query slower than 10s threshold: actual execution time: 452.0072 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 11:55:43,146 WARN [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold: actual execution time: 441.2548 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
> 2013-03-12 12:00:56,492 WARN [qtp1992135396-3402] [jdbc.internal] Query slower than 10s threshold: actual execution time: 433.4503 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
> 2013-03-12 12:03:30,933 WARN [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold: actual execution time: 467.7872 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 12:04:18,225 WARN [qtp1992135396-3396] [jdbc.internal] Query slower than 10s threshold: actual execution time: 472.9386 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 12:54:49,042 WARN [qtp1992135396-3463] [jdbc.internal] Query slower than 10s threshold: actual execution time: 464.9721 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 12:57:58,791 WARN [qtp1992135396-3482] [jdbc.internal] Query slower than 10s threshold: actual execution time: 460.9083 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
> 2013-03-12 12:57:58,803 WARN [qtp1992135396-3490] [jdbc.internal] Query slower than 10s threshold: actual execution time: 460.7699 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:

I didn't see any javascript errors on the dashboard, the queries are just slow. I ran the resource duplication query manually on the puppetdb machine:

> time curl -X GET -H 'Accept: application/json' http://[puppetdbip]:8080/v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=pct-resource-dupes -v
> * About to connect() to [puppetdbip] port 8080 (#0)
> *   Trying [puppetdbip]... connected
> * Connected to [puppetdbip] ([puppetdbip]) port 8080 (#0)
> > GET /v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=pct-resource-dupes HTTP/1.1
> > User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.6.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> > Host: [puppetdbip]:8080
> > Accept: application/json
> >
> < HTTP/1.1 200 OK
> < Date: Tue, 12 Mar 2013 11:55:34 GMT
> < Content-Type: application/json;charset=ISO-8859-1
> < Content-Length: 34
> < Server: Jetty(7.x.y-SNAPSHOT)
> <
> {
>   "Value" : 0.9691941482891232
> * Connection #0 to host [puppetdbip] left intact
> * Closing connection #0
> }
>
> real    16m57.604s
> user    0m0.012s
> sys     0m0.025s

On Monday, March 11, 2013 7:04:05 PM UTC+1, Ken Barber wrote:
> Okay - can you please send me the puppetdb.log entry that shows the
> exception? Including surrounding messages? [...]
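P.S. If it helps, the other population metrics can be pulled the same way (same endpoint pattern as above; the mbean names are taken from the PuppetDB metrics documentation as far as I can tell, so worth double-checking against your version):

  curl -H 'Accept: application/json' 'http://[puppetdbip]:8080/v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=num-resources'
  curl -H 'Accept: application/json' 'http://[puppetdbip]:8080/v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=num-nodes'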
On Tue, Mar 12, 2013 at 6:38 AM, ak0ska <akos.hencz@gmail.com> wrote:
> So, I cut three occurrences of this error from the database log and the
> corresponding part from the puppetdb log. I removed the hostnames, I hope
> it's still sensible: http://pastebin.com/yvyBDWQE
> [...]
> Apart from that, there are 2 kinds of slow queries that appear in the logs:
> [...]

Can you fire up psql, point it at your puppetdb database, and run this?

EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL;

That will profile the query, and we can see where the problem is.

Also worth examining is the GC query (modified to be non-destructive):

EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);

And the resource duplication query:

EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;

And finally, can you give us a dump of your database settings? You can do this from within psql by doing "SHOW ALL;"
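An easy way to capture all of that in one sitting, if it helps (just plain psql; adjust the connection options for your environment):

  $ psql -d puppetdb
  puppetdb=# \timing on
  puppetdb=# \o /tmp/puppetdb-explain.txt
  puppetdb=# -- run the three EXPLAIN ANALYZE statements above, then:
  puppetdb=# SHOW ALL;
  puppetdb=# \o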
Hello Deepak,

Here are the queries you asked for:

> Can you fire up psql, point it at your puppetdb database, and run this?
>
> EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL;

 Aggregate  (cost=3132186.09..3132186.10 rows=1 width=0) (actual time=348426.375..348426.376 rows=1 loops=1)
   ->  Hash Join  (cost=247.37..3096532.80 rows=14261314 width=0) (actual time=24059.151..348255.933 rows=1604253 loops=1)
         Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
         ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=41) (actual time=24053.542..347250.967 rows=2107982 loops=1)
         ->  Hash  (cost=216.42..216.42 rows=2476 width=41) (actual time=5.555..5.555 rows=2476 loops=1)
               ->  Hash Join  (cost=94.84..216.42 rows=2476 width=41) (actual time=1.850..4.555 rows=2476 loops=1)
                     Hash Cond: (cc.certname = c.name)
                     ->  Seq Scan on certname_catalogs cc  (cost=0.00..82.55 rows=2855 width=62) (actual time=0.010..0.941 rows=2855 loops=1)
                     ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.818..1.818 rows=2490 loops=1)
                           ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.013..0.823 rows=2490 loops=1)
                                 Filter: (deactivated IS NULL)
 Total runtime: 348446.364 ms
(12 rows)

> Also worth examining is the GC query (modified to be non-destructive):
>
> EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);

 Nested Loop Anti Join  (cost=0.00..1710681.18 rows=360492 width=126) (actual time=1967.182..1967.182 rows=0 loops=1)
   ->  Seq Scan on resource_params  (cost=0.00..13249.23 rows=389623 width=126) (actual time=0.013..76.403 rows=389693 loops=1)
   ->  Index Scan using idx_catalog_resources_resource on catalog_resources cr  (cost=0.00..21429.14 rows=5291 width=41) (actual time=0.004..0.004 rows=1 loops=389693)
         Index Cond: ((cr.resource)::text = (resource_params.resource)::text)
 Total runtime: 1967.253 ms
(5 rows)

> And the resource duplication query:
>
> EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;

 Aggregate  (cost=3129997.71..3129997.72 rows=1 width=0) (actual time=474369.420..474369.420 rows=1 loops=1)
   ->  HashAggregate  (cost=3129927.78..3129958.86 rows=3108 width=41) (actual time=474351.730..474365.144 rows=49466 loops=1)
         ->  Hash Join  (cost=248.48..3094726.38 rows=14080561 width=41) (actual time=21536.467..473678.479 rows=1604237 loops=1)
               Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
               ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=82) (actual time=21531.355..472692.303 rows=1934212 loops=1)
               ->  Hash  (cost=217.36..217.36 rows=2490 width=41) (actual time=5.071..5.071 rows=2476 loops=1)
                     ->  Hash Join  (cost=94.84..217.36 rows=2490 width=41) (actual time=1.700..4.048 rows=2476 loops=1)
                           Hash Cond: (cc.certname = c.name)
                           ->  Seq Scan on certname_catalogs cc  (cost=0.00..83.08 rows=2908 width=62) (actual time=0.010..0.603 rows=2855 loops=1)
                           ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.670..1.670 rows=2490 loops=1)
                                 ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.014..0.785 rows=2490 loops=1)
                                       Filter: (deactivated IS NULL)
 Total runtime: 474370.129 ms
(13 rows)

> And finally, can you give us a dump of your database settings? You can do
> this from within psql by doing "SHOW ALL;"

http://pastebin.com/raw.php?i=HfhqfVa4

Thank you for your help!

Cheers,

ak0ska
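P.S. One thing that stands out to us in the first and third plans: the planner expects ~16.4 million rows from the seq scan on catalog_resources but only ~2 million come back, so the table statistics look stale after all the vacuuming. If you think it's worth it, refreshing them is cheap - something like (plain ANALYZE only, no vacuum):

  psql -d puppetdb -c 'ANALYZE VERBOSE catalog_resources;'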
Hi ak0ska,

So I've been spending the last 2 days trying all kinds of things to replicate your constraint violation problem and I'm still getting nowhere with it. I've been speaking to all kinds of smart people and we believe it's some sort of lock and/or transactional-mode problem, but none of my testing can replicate it.

My latest test was something like:

https://gist.github.com/kbarber/5157836

But with all that preparation and load simulation I still couldn't replicate it :-(.

I think I need more data from you ... in this case, a statement log. Now a statement log can be pretty heavy, so if we can get the data surrounding a GC only that should be enough - however I anticipate it might reduce performance somewhat, so if you're apprehensive about doing it in production I fully understand. I'm just running out of ideas, and without more data I may have trouble coming to a conclusion.

Anyway, if at all possible the instructions for enabling this are as follows:

* Go to your postgresql.conf file (on RHEL 6 this is usually located in /var/lib/pgsql/data)
* Add/modify the entries:
    log_statement = 'all'
    log_line_prefix = '%m PID[%p] SID[%c] VTID[%v] TID[%x] '
* Restart postgresql
* Logs should go to: /var/lib/pgsql/data/pg_log

The closer you can do this to being set _just before_ a GC run the better, as it will reduce the amount of time this setting needs to be on. Once you have a GC run and a subsequent constraint violation logged, you should be able to turn it off. Don't worry about PuppetDB during these cases - it retries the database, so if you can take the small hit with queries, the outage should be quick while you switch the setting on and off (and be mindful that catalog/fact/report submissions will get queued, and queries of course will fail during that short time during the DB restart).

Now the data itself is obviously going to contain private items, so if you prefer to send it to me privately that's fine (contact me off this list if so). Otherwise, it might need some scrubbing.

So the concern I have with your issue, even though it doesn't stop a catalog from being submitted, is that you're seeing a 'block' when those two catalogs get submitted, which makes me fear that data doesn't get populated until _after_ the GC is complete. This is a concern because we don't want data to be delayed like this. If this wasn't the case we could happily let it fail and retry - ultimately the data isn't lost - but we should be able to process data within reasonable time limits without you having to worry about database GC slowing things down.

Now in regards to your performance, we are going to keep looking into the results you gave us. Looking at your settings, Deepak and I think you probably need to increase at least work_mem to 16MB and maintenance_work_mem to 32MB (the latter is used during vacuum), the main item being work_mem here really. These are reasonable changes we've seen other users apply with success, but alone they probably won't solve your issue - it would just be good to get you running with higher values for these.

I'm also very curious to understand the disk IO for your systems. What is the nature of the disks where your database is located? Are they SCSI/SATA/FC or otherwise? What filesystem type are you using? Can you give me any other details you might find interesting?

What are the results of the command:

iostat -k 5

A few samples of the run would be useful to see, so we can understand the average disk IO going on here.
FYI, to give you an idea of what we _should_ be seeing for, say, just one of your queries, here is the same query on a CentOS 5.8 box (running just postgresql 8.1 actually):

puppetdb=# EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;

 Aggregate  (cost=91792.54..91792.55 rows=1 width=0) (actual time=2611.872..2611.872 rows=1 loops=1)
   ->  HashAggregate  (cost=91737.14..91761.76 rows=2462 width=41) (actual time=2606.507..2609.489 rows=5478 loops=1)
         ->  Hash Join  (cost=184.36..90322.32 rows=565926 width=41) (actual time=7.060..2072.000 rows=867847 loops=1)
               Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
               ->  Seq Scan on catalog_resources cr  (cost=0.00..80139.47 rows=867847 width=82) (actual time=0.021..1119.260 rows=867847 loops=1)
               ->  Hash  (cost=159.17..159.17 rows=2015 width=41) (actual time=7.032..7.032 rows=2195 loops=1)
                     ->  Hash Join  (cost=62.34..159.17 rows=2015 width=41) (actual time=2.438..5.830 rows=2195 loops=1)
                           Hash Cond: (cc.certname = c.name)
                           ->  Seq Scan on certname_catalogs cc  (cost=0.00..65.79 rows=2179 width=50) (actual time=0.005..0.908 rows=2195 loops=1)
                           ->  Hash  (cost=37.15..37.15 rows=2015 width=9) (actual time=2.427..2.427 rows=2195 loops=1)
                                 ->  Seq Scan on certnames c  (cost=0.00..37.15 rows=2015 width=9) (actual time=0.008..1.088 rows=2195 loops=1)
                                       Filter: (deactivated IS NULL)
 Total runtime: 2611.935 ms
(13 rows)

As you can see, 2 seconds on my box and 475 seconds on yours - the difference is just too large to be considered a normal discrepancy :-). Be mindful, this is on data that represents 2000 nodes ... so it's not like my tables are smaller than yours (at least not drastically). However, I do notice the number of resources is much higher in your case, so I'm going to try and replicate that.

For now, I'm going to ponder more on your performance, as I believe that affects you the most - so let us get back to you with more updates. If you can give me the details I've requested for now, that would be appreciated.

ken.

On Wed, Mar 13, 2013 at 2:00 AM, ak0ska <akos.hencz@gmail.com> wrote:
> Hello Deepak,
>
> Here are the queries you asked for: [...]
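P.S. For the settings change, this is all I mean - two lines in postgresql.conf followed by a reload (a sketch; both settings should be picked up without a full restart as far as I recall, but double-check on your version):

  # /var/lib/pgsql/data/postgresql.conf
  work_mem = 16MB
  maintenance_work_mem = 32MB

  # then, as the postgres user:
  psql -c 'SELECT pg_reload_conf();'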
Hello Ken, I really appreciate you guys looking into this problem, and I''m happy to provide you with the data you ask for. However, I feel like I should ask, whether you think this problem is worth your efforts, if rebuilding the database might solve the issue? Cheers, ak0ska On Thursday, March 14, 2013 8:05:59 AM UTC+1, Ken Barber wrote:> > Hi ak0ska, > > So I''ve been spending the last 2 days trying all kinds of things to > replicate your constraint violation problem and I still am getting > nowhere with it. I''ve been speaking to all kinds of smart people and > we believe its some sort of lock and/or transactional mode problem but > none of my testing can replicate it. > > My latest test was something like: > > https://gist.github.com/kbarber/5157836 > > But with all that preparation and load simulation I still couldn''t > replicate it :-(. > > I think I need more data from you ... in this case, a statement log. > Now a statement log can be pretty heavy, so if we can get the data > surrounding a GC only that should be enough - however I anticipate it > might reduce some performance so if you''re apprehensive about doing it > in production I fully understand. I''m just running out of ideas, and > without more data I may have trouble coming to a conclusion. > > Anyway, if at all possible the instructions for enabling this are as > follows: > > * Go to your postgresql.conf file (On RHEL 6 this is usually located > in /var/lib/pgsql/data) > * Add/modify the entries: > log_statement = ''all'' > log_line_prefix = ''%m PID[%p] SID[%c] VTID[%v] TID[%x] '' > * Restart postgresql > * Logs should go to: /var/lib/pgsql/data/pg_log > > The closer you can do this to being set _just before_ a GC run the > better, as it will reduce the amount of time this setting needs to be > on. Once you have a GC run and a subsequent constraint logged you > should be able to turn it off. Don''t worry about PuppetDB during these > cases, it retries the database so if you can take the small hit with > queries the outage should be quick while you switch off/on the setting > (and be mindful that catalogs/facts/reports submissions will get > queued, queries of course will fail during that short time during DB > restart). > > Now the data itself is going to obviously contain private items, so if > you prefer to send it to me privately thats fine (contact me off this > list if so). Otherwise, it might need some scrubbing. > > So the concern I have with your issue, even though it doesn''t stop a > catalog from being submitted - is that you''re seeing a ''block'' when > those two catalogs get submitted which makes me fear that data doesn''t > get populated until _after_ the GC is complete. This is a concern > because we don''t want data to be delayed like this. If this wasn''t the > case we could happily let it fail and retry, ultimately the data isn''t > lost, but we should be able to process data within reasonable time > limits without you have to worry about database GC slowing things > down. > > Now in regards to your performance, we are going to keep looking into > the results you gave us. Looking at your settings Deepak and I think > you probably need to increase at least work_mem to 16MB and > maintenance_work_mem to 32MB (this is used during vacuum) the main > item being work_mem here really. These are reasonable changes we''ve > seen other users apply with success, but alone they probably won''t > solve your issue - it would be good to just get you working using > higher values for these. 
> > I''m very curious to understand the disk IO for your systems also. What > are the nature of the disks where your database is located? Are they > SCSI/SATA/FC or otherwise? What filesystem type are you using? Can you > give me any other details you might find interesting? > > What are the results of the command: > > iostat -k 5 > > A few samples of the run would be useful to see, so we can understand > the average disk IO going on here. > > FYI To give you an idea of what we _should_ be seeing for say, just > one of your queries, here is a query on a Centos 5.8 box (running just > postgresql 8.1 actually): > > puppetdb=# EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT > resource FROM catalog_resources cr, certname_catalogs cc, certnames c > WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated > IS NULL) r; > QUERY > PLAN > > ------------------------------------------------------------------------------------------------- > > --------------------------------------------------- > Aggregate (cost=91792.54..91792.55 rows=1 width=0) (actual > time=2611.872..2611.872 rows=1 loops > =1) > -> HashAggregate (cost=91737.14..91761.76 rows=2462 width=41) > (actual time=2606.507..2609.48 > 9 rows=5478 loops=1) > -> Hash Join (cost=184.36..90322.32 rows=565926 width=41) > (actual time=7.060..2072.000 > rows=867847 loops=1) > Hash Cond: ((cr.catalog)::text = (cc.catalog)::text) > -> Seq Scan on catalog_resources cr > (cost=0.00..80139.47 rows=867847 width=82) ( > actual time=0.021..1119.260 rows=867847 loops=1) > -> Hash (cost=159.17..159.17 rows=2015 width=41) > (actual time=7.032..7.032 rows= > 2195 loops=1) > -> Hash Join (cost=62.34..159.17 rows=2015 > width=41) (actual time=2.438..5 > .830 rows=2195 loops=1) > Hash Cond: (cc.certname = c.name) > -> Seq Scan on certname_catalogs cc > (cost=0.00..65.79 rows=2179 widt > h=50) (actual time=0.005..0.908 rows=2195 loops=1) > -> Hash (cost=37.15..37.15 rows=2015 > width=9) (actual time=2.427..2. > 427 rows=2195 loops=1) > -> Seq Scan on certnames c > (cost=0.00..37.15 rows=2015 width=9 > ) (actual time=0.008..1.088 rows=2195 loops=1) > Filter: (deactivated IS NULL) > Total runtime: 2611.935 ms > (13 rows) > > As you can see, 2 seconds on my box and 475 seconds on yours - the > difference is just too large to be considered a normal discrepancy > :-). Be mindful, this is on data that represents 2000 nodes ... so its > not like my tables are smaller then yours (at least not drastically), > however I do notice the number of resources is much higher so I''m > going to try and replicate that. > > For now, I''m going to ponder more on your performance as I believe > that affects you the most - so let us get back to you with more > updates. If you can give me any details as I''ve requested for now that > would be appreciated. > > ken. > > On Wed, Mar 13, 2013 at 2:00 AM, ak0ska <akos....@gmail.com <javascript:>> > wrote: > > Hello Deepak, > > > > Here are the queries you asked for: > > > > > >> > >> Can you fire up psql, point it at your puppetdb database, and run > "EXPLAIN > >> ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, > catalog_resources > >> cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND > >> c.deactivated IS NULL;"? That will profile the query, and we can see > where > >> the problem is. 
>                                                                           QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3132186.09..3132186.10 rows=1 width=0) (actual time=348426.375..348426.376 rows=1 loops=1)
>    ->  Hash Join  (cost=247.37..3096532.80 rows=14261314 width=0) (actual time=24059.151..348255.933 rows=1604253 loops=1)
>          Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
>          ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=41) (actual time=24053.542..347250.967 rows=2107982 loops=1)
>          ->  Hash  (cost=216.42..216.42 rows=2476 width=41) (actual time=5.555..5.555 rows=2476 loops=1)
>                ->  Hash Join  (cost=94.84..216.42 rows=2476 width=41) (actual time=1.850..4.555 rows=2476 loops=1)
>                      Hash Cond: (cc.certname = c.name)
>                      ->  Seq Scan on certname_catalogs cc  (cost=0.00..82.55 rows=2855 width=62) (actual time=0.010..0.941 rows=2855 loops=1)
>                      ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.818..1.818 rows=2490 loops=1)
>                            ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.013..0.823 rows=2490 loops=1)
>                                  Filter: (deactivated IS NULL)
>  Total runtime: 348446.364 ms
> (12 rows)
>
>> Also worth examining is the GC query (modified to be non-destructive):
>>
>> EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);
>
>                                                                                 QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop Anti Join  (cost=0.00..1710681.18 rows=360492 width=126) (actual time=1967.182..1967.182 rows=0 loops=1)
>    ->  Seq Scan on resource_params  (cost=0.00..13249.23 rows=389623 width=126) (actual time=0.013..76.403 rows=389693 loops=1)
>    ->  Index Scan using idx_catalog_resources_resource on catalog_resources cr  (cost=0.00..21429.14 rows=5291 width=41) (actual time=0.004..0.004 rows=1 loops=389693)
>          Index Cond: ((cr.resource)::text = (resource_params.resource)::text)
>  Total runtime: 1967.253 ms
> (5 rows)
>
>> And the resource duplication query:
>>
>> EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;
>
>                                                                              QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3129997.71..3129997.72 rows=1 width=0) (actual time=474369.420..474369.420 rows=1 loops=1)
>    ->  HashAggregate  (cost=3129927.78..3129958.86 rows=3108 width=41) (actual time=474351.730..474365.144 rows=49466 loops=1)
>          ->  Hash Join  (cost=248.48..3094726.38 rows=14080561 width=41) (actual time=21536.467..473678.479 rows=1604237 loops=1)
>                Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
>                ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=82) (actual time=21531.355..472692.303 rows=1934212 loops=1)
>                ->  Hash  (cost=217.36..217.36 rows=2490 width=41) (actual time=5.071..5.071 rows=2476 loops=1)
>                      ->  Hash Join  (cost=94.84..217.36 rows=2490 width=41) (actual time=1.700..4.048 rows=2476 loops=1)
>                            Hash Cond: (cc.certname = c.name)
>                            ->  Seq Scan on certname_catalogs cc  (cost=0.00..83.08 rows=2908 width=62) (actual time=0.010..0.603 rows=2855 loops=1)
>                            ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.670..1.670 rows=2490 loops=1)
>                                  ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.014..0.785 rows=2490 loops=1)
>                                        Filter: (deactivated IS NULL)
>  Total runtime: 474370.129 ms
> (13 rows)
>
>> And finally, can you give us a dump of your database settings? You can do this from within psql by doing "SHOW ALL;"
>
> http://pastebin.com/raw.php?i=HfhqfVa4
>
> Thank you for your help!
>
> Cheers,
>
> ak0ska
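One detail worth noting in the plans above: the planner expects roughly 16.4 million rows in catalog_resources (rows=16444286) but the scans actually return around 2 million, and each sequential scan over that table takes several hundred seconds. That combination often points at table bloat or stale statistics. As a sketch (not from the thread itself, just standard PostgreSQL statistics views, with table names taken from the plans above), sizes and dead-tuple counts can be checked read-only like this:

  -- overall database size
  SELECT pg_size_pretty(pg_database_size(current_database()));

  -- on-disk size and dead-tuple counts for the tables appearing in the plans
  SELECT s.relname,
         pg_size_pretty(pg_total_relation_size(s.relid)) AS total_size,
         s.n_live_tup,
         s.n_dead_tup,
         s.last_autovacuum,
         s.last_autoanalyze
  FROM   pg_stat_user_tables s
  WHERE  s.relname IN ('catalog_resources', 'certname_catalogs', 'certnames', 'resource_params')
  ORDER  BY pg_total_relation_size(s.relid) DESC;

If n_dead_tup dwarfs n_live_tup, or the on-disk sizes are far out of proportion to the row counts, that would line up with the database-size concern raised in the next message.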
So I have a sinking feeling that all of your problems (including the constraint side-effect) are related to general performance issues on your database, for some reason we have yet to determine. This could be related to IO contention, it could be a bad index (although you've rebuilt them all, right?), or something else. What continues to bother me is the fact that you have only managed to reduce your DB size to 30 GB; this is far larger than I expect, and it screams 'problem' to me. It should be less than 5 GB, at least based on my rough calculations from other DBs I've seen in the wild.

It could very well be that something still isn't sufficiently organised (index or tablespace data) for optimal performance. A recreation of the database is by far the fastest way to rebuild everything, though it's a bit like taking an axe to it. Having said that, it is fast and would eliminate a lot of back and forth between us trying to solve this problem.

A dump, drop, recreate and restore is another way; not 100% the same thing, but close to it.

It's entirely up to you - if you want to try either option, go ahead. It would be one fast step in place of a lot of small ones. If it works we may lose some clarity on the underlying issue, so it would be good to keep the original DB backup. If it doesn't work, at least we have ruled out a lot of possibilities in one quick action. So yeah, I wouldn't be upset if you take this action - there is a reason I suggested it earlier, namely that it performs a lot of actions in one single step ... and I still think it is a viable action, but since you were keen on trying other things I was happy to go along :-). At the end of the day, your time and your risk assessment are up to you. If it was me, yeah, I would try it, especially considering the amount of time already spent with very little final result.

Thanks for being patient with me on this, btw ... :-).

ken.

On Thu, Mar 14, 2013 at 3:36 AM, ak0ska <akos.hencz@gmail.com> wrote:
> Hello Ken,
>
> I really appreciate you guys looking into this problem, and I'm happy to
> provide you with the data you ask for. However, I feel like I should ask
> whether you think this problem is worth your efforts, if rebuilding the
> database might solve the issue?
>
> Cheers,
>
> ak0ska
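For anyone wanting to try the dump, drop, recreate and restore route Ken mentions, a minimal sketch follows. The database name, role and dump path are assumptions (a stock setup with a database and role both called puppetdb), and the puppetdb service name is assumed too - adjust to your environment, and keep the dump as the backup Ken recommends:

  # stop PuppetDB so nothing writes to the database during the operation (service name assumed)
  service puppetdb stop

  # take a compressed, custom-format dump; this doubles as the backup of the original DB
  pg_dump -U puppetdb -Fc -f /var/tmp/puppetdb.dump puppetdb

  # drop and recreate the database, then restore the dump into it
  dropdb -U postgres puppetdb
  createdb -U postgres -O puppetdb puppetdb
  pg_restore -U puppetdb -d puppetdb /var/tmp/puppetdb.dump

  service puppetdb start

A full wipe-and-repopulate (letting PuppetDB rebuild the database from fresh catalog submissions) is the more drastic option Ken describes first; the dump/restore above is the "not 100% the same thing, but close to it" variant.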
Hi ak0ska,

FWIW - with the help of some of my colleagues, we've managed to replicate your constraint issue in a lab-style environment now:

https://gist.github.com/kbarber/5157836

Which is a start. It requires a unique precondition to replicate, however, and I've been unable to reproduce it in any way with PuppetDB directly, so the statement log will still be helpful if you can supply it.

ken.
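For reference, the statement-logging settings Ken walked through earlier in the thread, together with the work_mem / maintenance_work_mem values he and Deepak suggested, look like this in postgresql.conf (on RHEL 6 the file usually lives in /var/lib/pgsql/data, per Ken's instructions; the values are the thread's suggestions, not general-purpose tuning advice):

  # temporary statement logging, to capture the window around a GC run
  log_statement = 'all'
  log_line_prefix = '%m PID[%p] SID[%c] VTID[%v] TID[%x] '

  # memory settings suggested earlier in the thread
  work_mem = 16MB
  maintenance_work_mem = 32MB

Restart postgresql after the change (logs end up under /var/lib/pgsql/data/pg_log), and turn log_statement back off once a GC run and a subsequent constraint violation have been captured.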
Hi ak0ska,

How are things going? Anything to report?

ken.