While trying to use the USDT probes made available to us in the PostgreSQL database I have problems enabling them.

On my database server I run 1024+ PG processes (1024 clients). The server is an 8-core T2000 @ 1000MHz with 32GB of memory, 7 storage arrays, and Solaris 10 Update 4. PostgreSQL 8.3 beta 1 is home compiled (binary optimized for the T1 chip) with DTrace probes enabled.

When running without enabling the probes I have approx 25% idle time, so I expect a wait queue is forming (and time is being spent) on some locks. Fortunately there are USDT probes that allow closer inspection of this. Overenthusiastically I used the following script:

<code>
postgresql*:::lwlock-acquire { three-line action }
</code>

and the same for the corresponding release, waitstart and waitend probes (run with -x dynvarsize=8m).

Result: I had to reboot my T2000 by using the system console and powering it off and on again, after waiting for more than 12 hours to see if it came back. Sure, too many probes enabled. However, this result is, let's say, "surprising". It is indeed very dangerous, since the database can be used in production environments. In my view this is not good.

OK, so now I try to enable just these 4 probes, each probe in three clauses (12 clauses per process), for 25 processes. After using -x cleanrate=201 -x dynvarsize=64m (determined through trial and error) and having set dtrace_dof_maxsize=1m with mdb I see the following: after 15 minutes my test finishes and dtrace is still trying to find the probes of interest. There is lots of idle time throughout this period, although far from constant. DTrace then stops with the following message:

<code>
dtrace -x dynvarsize=64m -x cleanrate=203 -qs lw.d >LW/lw.out
dtrace: failed to compile script lw.d: line 374: failed to grab process 14510
</code>

BTW I specified my clauses like this:

<code>
postgresql15250:postgres:LWLockAcquire:lwlock-endwait,
postgresql14637:postgres:LWLockAcquire:lwlock-endwait,
postgresql15470:postgres:LWLockAcquire:lwlock-endwait,
postgresql15395:postgres:LWLockAcquire:lwlock-endwait,
postgresql15329:postgres:LWLockAcquire:lwlock-endwait,
postgresql14795:postgres:LWLockAcquire:lwlock-endwait,
postgresql14527:postgres:LWLockAcquire:lwlock-endwait,
postgresql14869:postgres:LWLockAcquire:lwlock-endwait,
postgresql14807:postgres:LWLockAcquire:lwlock-endwait,
postgresql14910:postgres:LWLockAcquire:lwlock-endwait,
postgresql15195:postgres:LWLockAcquire:lwlock-endwait,
postgresql14574:postgres:LWLockAcquire:lwlock-endwait,
postgresql14917:postgres:LWLockAcquire:lwlock-endwait,
postgresql14662:postgres:LWLockAcquire:lwlock-endwait,
postgresql14831:postgres:LWLockAcquire:lwlock-endwait,
postgresql15027:postgres:LWLockAcquire:lwlock-endwait,
postgresql14999:postgres:LWLockAcquire:lwlock-endwait,
postgresql15009:postgres:LWLockAcquire:lwlock-endwait,
postgresql14699:postgres:LWLockAcquire:lwlock-endwait,
postgresql15232:postgres:LWLockAcquire:lwlock-endwait,
postgresql15153:postgres:LWLockAcquire:lwlock-endwait,
postgresql14776:postgres:LWLockAcquire:lwlock-endwait,
postgresql15332:postgres:LWLockAcquire:lwlock-endwait,
postgresql15050:postgres:LWLockAcquire:lwlock-endwait,
postgresql14510:postgres:LWLockAcquire:lwlock-endwait
/arg1 == 1 && self->tsw1[arg0]/
{
        @cntw1[arg0] = count();
        @timw1[arg0] = sum(timestamp - self->tsw1[arg0]);
        self->tsw1[arg0] = 0;
}
</code>

This is repeated for all twelve clauses, where the actions and predicates differ but are all very similar to this one, basically using differently named aggregations.

Questions: how do I make this workable, meaning speedier in startup,
without reducing my attention to just one or two processes?

Question: why is my system so unresponsive (I waited more than 12 hours to see if it came back), forcing me to power off/on to get it alive and kicking again?

And please note that we are planning to add many, many more probes in the PG source. However, if the above is the result I fear great fears ...

Thanks
--Paul
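(For illustration only, not part of the original post: a minimal sketch of what an enable-everything lwlock script of the kind described above might look like. The three-line action from the original run is not shown, so the probe bodies below are assumptions; arg0 is taken to be the lock id and arg1 the lock mode, as the predicate in the per-process clause suggests, and the startwait/endwait spelling follows the lwlock-endwait name used there.)

<code>
/* Sketch only: one enabling of each of the four lwlock probes,
 * matching every postgresql<pid> provider on the system. */
postgresql*:::lwlock-acquire
{
        @acquires[arg0] = count();        /* acquisitions per lock id */
}

postgresql*:::lwlock-release
{
        @releases[arg0] = count();        /* releases per lock id */
}

postgresql*:::lwlock-startwait
{
        self->ws[arg0] = timestamp;       /* this backend started waiting on lock arg0 */
}

postgresql*:::lwlock-endwait
/self->ws[arg0]/
{
        @waitns[arg0] = sum(timestamp - self->ws[arg0]);   /* total wait time per lock id */
        self->ws[arg0] = 0;
}
</code>

The self-> entries are dynamic variables drawn from the -x dynvarsize pool, which is presumably why that setting shows up in the runs described above.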
On Fri, Nov 16, 2007 at 06:29:45AM -0800, Paul van den Bogaard wrote:

> Result: I had to reboot my T2000 by using the system console and powering it off and on again, after waiting for more than 12 hours to see if it came back.

That's quite surprising -- and distressing. Did you get a dump? Do you have any more data from which we might be able to debug the problem?

> Sure, too many probes enabled. However, this result is, let's say, "surprising". It is indeed very dangerous, since the database can be used in production environments. In my view this is not good.

That doesn't sound like too many probes enabled at all. We've enabled millions of probes without incident.

> dtrace -x dynvarsize=64m -x cleanrate=203 -qs lw.d >LW/lw.out
> dtrace: failed to compile script lw.d: line 374: failed to grab process 14510

This may be that you're hitting the file descriptor limit. Try taking a look with truss on the dtrace(1M) process.

> Questions: how do I make this workable, meaning speedier in startup, without reducing my attention to just one or two processes?

We'd need to figure out where the system is spending its time -- DTrace is a great place to start.

> Question: why is my system so unresponsive (I waited more than 12 hours to see if it came back), forcing me to power off/on to get it alive and kicking again?

Let's work offline to try to sort this out.

> And please note that we are planning to add many, many more probes in the PG source. However, if the above is the result I fear great fears ...

You're not enabling that many probes; I don't think that's the cause of either problem.

Adam

--
Adam Leventhal, FishWorks                              http://blogs.sun.com/ahl
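(Not part of the original exchange: as an illustration of the kind of first pass suggested here, a profile enabling along these lines can show where the machine is spending its time while dtrace(1M) is still compiling and attaching. The 997 Hz rate and the 60-second window are arbitrary choices.)

<code>
/* Sample all CPUs at ~997 Hz and aggregate the kernel stacks seen. */
profile-997
/arg0 != 0/          /* arg0 is the kernel PC; non-zero means we interrupted kernel code */
{
        @hot[stack()] = count();
}

tick-60s
{
        trunc(@hot, 20);     /* keep only the 20 hottest stacks */
        printa(@hot);
        exit(0);
}
</code>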
> On Fri, Nov 16, 2007 at 06:29:45AM -0800, Paul van den Bogaard wrote:
> > Result: I had to reboot my T2000 by using the system console and powering it off and on again, after waiting for more than 12 hours to see if it came back.
>
> That's quite surprising -- and distressing. Did you get a dump? Do you have any more data from which we might be able to debug the problem?
>
> > Sure, too many probes enabled. However, this result is, let's say, "surprising". It is indeed very dangerous, since the database can be used in production environments. In my view this is not good.
>
> That doesn't sound like too many probes enabled at all. We've enabled millions of probes without incident.
>
> > dtrace -x dynvarsize=64m -x cleanrate=203 -qs lw.d >LW/lw.out
> > dtrace: failed to compile script lw.d: line 374: failed to grab process 14510
>
> This may be that you're hitting the file descriptor limit. Try taking a look with truss on the dtrace(1M) process.

I think the processes that needed to be traced exited because they were done. It just took too long to compile (something that includes "attaching to probes").

> > Questions: how do I make this workable, meaning speedier in startup, without reducing my attention to just one or two processes?
>
> We'd need to figure out where the system is spending its time -- DTrace is a great place to start.
>
> > Question: why is my system so unresponsive (I waited more than 12 hours to see if it came back), forcing me to power off/on to get it alive and kicking again?
>
> Let's work offline to try to sort this out.
>
> > And please note that we are planning to add many, many more probes in the PG source. However, if the above is the result I fear great fears ...
>
> You're not enabling that many probes; I don't think that's the cause of either problem.
>
> Adam

Did some more tests, and it looks like there is a relation between the number of probes, the size of shared memory, and the time-to-compile.

Using a PostgreSQL 8.3 beta 1 database on a T2000 (32GB) I varied the size of the shared memory, the number of processes, and the number of probes. The shared memory in this test is sized by changing the number of so-called shared_buffers; each buffer is 8KB.

Timing is done in a simple way: when the BEGIN probe fires, its action printf's a string, and output is redirected to a file. With a

<code>
while ls -l dtrace.outputfile; do sleep 5; done
</code>

I watch the file size; when it changes from 0 to non-zero I know the BEGIN probe fired. Sure, there are some other things that influence this, but since the total times turn out to be on the order of minutes this is good enough to get an understanding of what is happening.

Results (all times in seconds).

50 probes enabled (5 probes for 10 different processes):

<code>
N users     100   400  1200
BUFFERS
----------------------------
 100000      25    30    30
 400000      45    50    50
1200000      95   105   105
</code>

Changing the number of users has very little effect. Changing the size of shared memory definitely has an effect.

300 probes enabled (30 probes for 10 different processes). Beware: the 400000-buffer and 400-user tests are not present; I am only looking at the "extremes", with a little extra added by increasing the number of buffers further.

<code>
N users     100  1200
BUFFERS
---------------------
 100000     125   155
1200000     545   595
1600000     695   730
2000000     memory problems, scan rate shows up in vmstat -p
</code>

1200 probes enabled (30 probes for 40 different processes):

<code>
N users     100  1200
BUFFERS
---------------------
 100000     475   600
1600000    2710  3055
</code>

These numbers speak for themselves: the long time to compile is related to the shared memory size and to the number of probes that are enabled. The remaining question is why?
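(A sketch of the timing method described above, not the actual test script: the BEGIN action fires once all enablings are in place, so the moment dtrace.outputfile becomes non-empty marks the end of the compile/attach phase. The single lwlock-endwait enabling is a placeholder; the real runs enabled 5 or 30 probes per process.)

<code>
/* Run as: dtrace -qs timing.d > dtrace.outputfile */
BEGIN
{
        printf("probes enabled at %Y\n", walltimestamp);
}

/* Placeholder enabling so that dtrace has user probes to grab. */
postgresql*:postgres:LWLockAcquire:lwlock-endwait
{
        @endwaits[pid] = count();
}
</code>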