While trying to use the USDT probes made available to us in the PostgreSQL database I have problems enabling them.

On my database server I run 1024+ PG processes (1024 clients). The server is an 8-core T2000 @ 1000MHz with 32GB of memory, 7 storage arrays, and Solaris 10 Update 4. PostgreSQL 8.3 beta 1 is home compiled (binary optimized for the T1 chip) with DTrace probes enabled.

When running without enabling the probes I have approx 25% idle time, so I expect a wait queue is forming (and time is being spent) on some locks. Fortunately there are USDT probes that allow closer inspection of this. Overenthusiastically I used the following script:

<code>
postgresql*:::lwlock-acquire { three-line action }
</code>

and the same for the corresponding release, waitstart and waitend probes (run with -x dynvarsize=8m).

Result: I had to reboot my T2000 by using the system console and powering it off and on again, after waiting for more than 12 hours to see if it came back. Sure, too many probes enabled. However, this result is, let's say, "surprising". It is indeed very dangerous, since the database can be used in production environments. In my view this is not good.

OK, so now I try to enable just these 4 probes, each probe in three clauses (12 clauses per process), for 25 processes. After using -x cleanrate=201 -x dynvarsize=64m (determined through trial and error) and having set dtrace_dof_maxsize=1m with mdb I see the following: after 15 minutes my test finishes and dtrace is still trying to find the probes of interest. There is lots of idle time throughout this period, although far from constant. DTrace then stops with the following message:

<code>
dtrace -x dynvarsize=64m -x cleanrate=203 -qs lw.d >LW/lw.out
dtrace: failed to compile script lw.d: line 374: failed to grab process 14510
</code>

BTW I specified my clauses like this:

<code>
postgresql15250:postgres:LWLockAcquire:lwlock-endwait,
postgresql14637:postgres:LWLockAcquire:lwlock-endwait,
postgresql15470:postgres:LWLockAcquire:lwlock-endwait,
postgresql15395:postgres:LWLockAcquire:lwlock-endwait,
postgresql15329:postgres:LWLockAcquire:lwlock-endwait,
postgresql14795:postgres:LWLockAcquire:lwlock-endwait,
postgresql14527:postgres:LWLockAcquire:lwlock-endwait,
postgresql14869:postgres:LWLockAcquire:lwlock-endwait,
postgresql14807:postgres:LWLockAcquire:lwlock-endwait,
postgresql14910:postgres:LWLockAcquire:lwlock-endwait,
postgresql15195:postgres:LWLockAcquire:lwlock-endwait,
postgresql14574:postgres:LWLockAcquire:lwlock-endwait,
postgresql14917:postgres:LWLockAcquire:lwlock-endwait,
postgresql14662:postgres:LWLockAcquire:lwlock-endwait,
postgresql14831:postgres:LWLockAcquire:lwlock-endwait,
postgresql15027:postgres:LWLockAcquire:lwlock-endwait,
postgresql14999:postgres:LWLockAcquire:lwlock-endwait,
postgresql15009:postgres:LWLockAcquire:lwlock-endwait,
postgresql14699:postgres:LWLockAcquire:lwlock-endwait,
postgresql15232:postgres:LWLockAcquire:lwlock-endwait,
postgresql15153:postgres:LWLockAcquire:lwlock-endwait,
postgresql14776:postgres:LWLockAcquire:lwlock-endwait,
postgresql15332:postgres:LWLockAcquire:lwlock-endwait,
postgresql15050:postgres:LWLockAcquire:lwlock-endwait,
postgresql14510:postgres:LWLockAcquire:lwlock-endwait
/arg1 == 1 && self->tsw1[arg0]/
{
        @cntw1[arg0] = count();
        @timw1[arg0] = sum(timestamp - self->tsw1[arg0]);
        self->tsw1[arg0] = 0;
}
</code>

This is repeated for all twelve clauses, where the actions and predicates differ but are all very similar to this one, basically using differently named aggregations.

Questions: how do I make this workable, meaning speedier in startup,
without reducing my attention to just one or two processes?

Question: why is my system so unresponsive (I waited more than 12 hours to see if it came back), forcing me to power off/on to get it alive and kicking again?

And please note that we are planning to add many, many more probes in the PG source. However, if the above is the result I fear great fears ...

Thanks
--Paul
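(For illustration only, not part of the original post: a minimal sketch of what an enable-everything lwlock script of the kind described above might look like. The three-line action from the original run is not shown, so the probe bodies below are assumptions; arg0 is taken to be the lock id and arg1 the lock mode, as the predicate in the per-process clause suggests, and the startwait/endwait spelling follows the lwlock-endwait name used there.)

<code>
/* Sketch only: one enabling of each of the four lwlock probes,
 * matching every postgresql<pid> provider on the system. */
postgresql*:::lwlock-acquire
{
        @acquires[arg0] = count();        /* acquisitions per lock id */
}

postgresql*:::lwlock-release
{
        @releases[arg0] = count();        /* releases per lock id */
}

postgresql*:::lwlock-startwait
{
        self->ws[arg0] = timestamp;       /* this backend started waiting on lock arg0 */
}

postgresql*:::lwlock-endwait
/self->ws[arg0]/
{
        @waitns[arg0] = sum(timestamp - self->ws[arg0]);   /* total wait time per lock id */
        self->ws[arg0] = 0;
}
</code>

The self-> entries are dynamic variables drawn from the -x dynvarsize pool, which is presumably why that setting shows up in the runs described above.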
On Fri, Nov 16, 2007 at 06:29:45AM -0800, Paul van den Bogaard wrote:

> Result: I had to reboot my T2000 by using the system console and powering it off and on again, after waiting for more than 12 hours to see if it came back.

That's quite surprising -- and distressing. Did you get a dump? Do you have any more data from which we might be able to debug the problem?

> Sure, too many probes enabled. However, this result is, let's say, "surprising". It is indeed very dangerous, since the database can be used in production environments. In my view this is not good.

That doesn't sound like too many probes enabled at all. We've enabled millions of probes without incident.

> dtrace -x dynvarsize=64m -x cleanrate=203 -qs lw.d >LW/lw.out
> dtrace: failed to compile script lw.d: line 374: failed to grab process 14510

This may be that you're hitting the file descriptor limit. Try taking a look with truss on the dtrace(1M) process.

> Questions: how do I make this workable, meaning speedier in startup, without reducing my attention to just one or two processes?

We'd need to figure out where the system is spending its time -- DTrace is a great place to start.

> Question: why is my system so unresponsive (I waited more than 12 hours to see if it came back), forcing me to power off/on to get it alive and kicking again?

Let's work offline to try to sort this out.

> And please note that we are planning to add many, many more probes in the PG source. However, if the above is the result I fear great fears ...

You're not enabling that many probes; I don't think that's the cause of either problem.

Adam

--
Adam Leventhal, FishWorks                              http://blogs.sun.com/ahl
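(Not part of the original exchange: as an illustration of the kind of first pass suggested here, a profile enabling along these lines can show where the machine is spending its time while dtrace(1M) is still compiling and attaching. The 997 Hz rate and the 60-second window are arbitrary choices.)

<code>
/* Sample all CPUs at ~997 Hz and aggregate the kernel stacks seen. */
profile-997
/arg0 != 0/          /* arg0 is the kernel PC; non-zero means we interrupted kernel code */
{
        @hot[stack()] = count();
}

tick-60s
{
        trunc(@hot, 20);     /* keep only the 20 hottest stacks */
        printa(@hot);
        exit(0);
}
</code>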
> On Fri, Nov 16, 2007 at 06:29:45AM -0800, Paul van den Bogaard wrote:
> > Result: I had to reboot my T2000 by using the system console and powering it off and on again, after waiting for more than 12 hours to see if it came back.
>
> That's quite surprising -- and distressing. Did you get a dump? Do you have any more data from which we might be able to debug the problem?
>
> > Sure, too many probes enabled. However, this result is, let's say, "surprising". It is indeed very dangerous, since the database can be used in production environments. In my view this is not good.
>
> That doesn't sound like too many probes enabled at all. We've enabled millions of probes without incident.
>
> > dtrace -x dynvarsize=64m -x cleanrate=203 -qs lw.d >LW/lw.out
> > dtrace: failed to compile script lw.d: line 374: failed to grab process 14510
>
> This may be that you're hitting the file descriptor limit. Try taking a look with truss on the dtrace(1M) process.

I think the processes that needed to be traced exited because they were done. It just took too long to compile (something that includes "attaching to probes").

> > Questions: how do I make this workable, meaning speedier in startup, without reducing my attention to just one or two processes?
>
> We'd need to figure out where the system is spending its time -- DTrace is a great place to start.
>
> > Question: why is my system so unresponsive (I waited more than 12 hours to see if it came back), forcing me to power off/on to get it alive and kicking again?
>
> Let's work offline to try to sort this out.
>
> > And please note that we are planning to add many, many more probes in the PG source. However, if the above is the result I fear great fears ...
>
> You're not enabling that many probes; I don't think that's the cause of either problem.
>
> Adam

Did some more tests, and it looks like there is a relation between the number of probes, the size of shared memory, and the time-to-compile.

Using a PostgreSQL 8.3 beta 1 database on a T2000 (32GB) I varied the size of the shared memory, the number of processes, and the number of probes. The shared memory in this test is sized by changing the number of so-called shared_buffers; each buffer is 8KB.

Timing is done in a simple way: when the BEGIN probe fires, its action printf's a string, and output is redirected to a file. With a

<code>
while ls -l dtrace.outputfile; do sleep 5; done
</code>

I watch the file size; when it changes from 0 to non-zero I know the BEGIN probe fired. Sure, there are some other things that influence this, but since the total times turn out to be on the order of minutes this is good enough to get an understanding of what is happening.

Results (all times in seconds).

50 probes enabled (5 probes for 10 different processes):

<code>
N users     100   400  1200
BUFFERS
----------------------------
 100000      25    30    30
 400000      45    50    50
1200000      95   105   105
</code>

Changing the number of users has very little effect. Changing the size of shared memory definitely has an effect.

300 probes enabled (30 probes for 10 different processes). Beware: the 400000-buffer and 400-user tests are not present; I am only looking at the "extremes", with a little extra added by increasing the number of buffers further.

<code>
N users     100  1200
BUFFERS
---------------------
 100000     125   155
1200000     545   595
1600000     695   730
2000000     memory problems, scan rate shows up in vmstat -p
</code>

1200 probes enabled (30 probes for 40 different processes):

<code>
N users     100  1200
BUFFERS
---------------------
 100000     475   600
1600000    2710  3055
</code>

These numbers speak for themselves: the long time to compile is related to the shared memory size and to the number of probes that are enabled. The remaining question is why?
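(A sketch of the timing method described above, not the actual test script: the BEGIN action fires once all enablings are in place, so the moment dtrace.outputfile becomes non-empty marks the end of the compile/attach phase. The single lwlock-endwait enabling is a placeholder; the real runs enabled 5 or 30 probes per process.)

<code>
/* Run as: dtrace -qs timing.d > dtrace.outputfile */
BEGIN
{
        printf("probes enabled at %Y\n", walltimestamp);
}

/* Placeholder enabling so that dtrace has user probes to grab. */
postgresql*:postgres:LWLockAcquire:lwlock-endwait
{
        @endwaits[pid] = count();
}
</code>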