Iz Rozenfeld
2006-Apr-27 20:16 UTC
[dtrace-discuss] An approach to instrumenting a Solaris shutdown ?
Hello all,

I have a situation whereby I am experiencing a problem shutting down (read: NOT rebooting, but shutting down to "init 0", or "init 5", or "init 6" for that matter). My workarounds thus far are to use "reboot" or "halt", depending on what I am trying to do, but I'd really like to understand what's happening with "init". I see some activity from svc.startd being logged to /dev/console while the system is coming down, but after a while the system just ends up hanging. I've narrowed this down to the point where it only occurs when the SMF NIS client service is in the 'enabled' state.

I would like to gather some more data points by instrumenting Solaris on its way "down".

I imagine this would be a catch-22, as some of the critical components of the kernel (providers, modules) would not be available to do this cleanly ... yet I am wondering whether anyone has thought about this sort of problem before, and could maybe point me to a blog or a doc where I could get some ideas for executing on this approach?

Thanks much,
Isaac
Jonathan Adams
2006-May-04 17:48 UTC
[dtrace-discuss] An approach to instrumenting a Solaris shutdown ?
On Thu, Apr 27, 2006 at 04:16:51PM -0400, Iz Rozenfeld wrote:
> Hello all,
> I have a situation whereby I am experiencing a problem shutting down
> (read: NOT rebooting, but shutting down to "init 0", or "init 5", or
> "init 6" for that matter). My workarounds thus far are to use "reboot"
> or "halt", depending on what I am trying to do, but I'd like to really
> understand what's happening with "init". I see some activity with
> svc.startd being logged back to /dev/console while the system is
> coming down but after a certain while the system just ends up hanging.
> I've narrowed this down to a point where it only occurs when the state
> of the SMF NIS client service is 'enabled'
>
> I would like to gather some more data points with respect to trying to
> instrument Solaris on its way "down".
>
> Imagine this would be a catch22 as some of the critical components
> of the kernel (providers, modules) would not be available to do this
> cleanly ... yet wondering whether anyone thought about this sort of
> problem before, and could maybe point me to a blog or a doc where I
> could get some ideas of executing on this approach ?

You can't use standard userland tracing, since the shutdown process will kill the dtrace(1M) process, shutting down the probes. You want to do anonymous tracing (using the -A flag). After you do that, you'll need to convince dtrace to unload and re-load. This should be as simple as:

    # update_drv dtrace

but this will (most likely) fail:

    Cannot unload module: dtrace
    Will be unloaded upon reboot.

The problem is that there are some userland SDT probes and helpers whose presence causes dtrace to not be unloadable. The main culprits are "java" processes and the "nfsmapid" process. You need to do:

    # svcadm disable -t svc:/network/nfs/mapid:default

and kill off any java processes. Then:

    # modunload -i 0
    # modunload -i 0
    # update_drv dtrace

should succeed.
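The preparation steps above can be collected into one small script. This is only a sketch under assumptions that are not in the original message: the D script name shutdown.d, the APPLY switch, and the helper names run and prepare_anon_tracing are all made up for illustration. By default it only prints the commands so they can be reviewed first; java processes still have to be stopped by hand.

```shell
#!/bin/sh
# Sketch of the preparation steps described above.
# Assumptions (hypothetical, not from the original post): the script name
# shutdown.d, the APPLY switch, and the run/prepare_anon_tracing helpers.

SCRIPT=${SCRIPT:-shutdown.d}   # hypothetical D script containing the probes
APPLY=${APPLY:-0}              # 0 = only print the commands, 1 = run them

# Run a command, or just print it when APPLY is not 1.
run() {
    if [ "$APPLY" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

prepare_anon_tracing() {
    # Record the enablings as anonymous so they do not depend on a
    # running dtrace(1M) process.
    run dtrace -A -s "$SCRIPT"
    # Temporarily disable nfsmapid, one of the processes named above as
    # keeping the dtrace module from unloading.
    run svcadm disable -t svc:/network/nfs/mapid:default
    # Ask the kernel to unload every unloadable module (twice, as in the post).
    run modunload -i 0
    run modunload -i 0
    # Reload the dtrace driver so it picks up the anonymous enabling.
    run update_drv dtrace
}

prepare_anon_tracing
```

Running it as-is prints the five commands in order; running it as root with APPLY=1 executes them. Since the script includes a panic() action, the dtrace invocation would probably also need destructive actions enabled.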
To get the data, you should include in your script a probe which causes the system to panic() at an appropriate point. Possibilities include fbt::kadmin:entry, which is the underlying call used to reboot the system, or a :::tick-1s probe which waits until a certain amount of time has passed. Be careful here, since the probes will re-appear on boot, and you don't want to get into a panic loop. A simple test, like:

--- cut here ---
/* panic() is a destructive action, so destructive mode must be enabled */
#pragma D option destructive

BEGIN
/timestamp <= 3 * 60 * 1000000000/	/* less than 3 mins after boot: bail out */
{
	printf("too close to booting, script canceled\n");
	exit(0);
}

BEGIN
{
	base = timestamp;
}

profile:::tick-1s
/(timestamp - base) > 20 * 60 * 1000000000/	/* twenty minutes after start */
{
	panic();
}
--- cut here ---

should do the trick. (The exit() will turn off all further processing of your script.)

To get the data after the reboot, you'll need to wait for savecore to complete, then go into /var/crash/machinename, run mdb on *.n (where n is the largest number in the directory), then do:

    > ::dtrace_state
    ADDR        MINOR PROC NAME        FILE
    300b2598700     2 -    <anonymous> -
    > 300b2598700::dtrace
    CPU ID FUNCTION:NAME
      1  1 :BEGIN foo! 0

Cheers,
- jonathan

-- 
Jonathan Adams, Solaris Kernel Development
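The "largest n in the directory" step from the reply above can be sketched as a small helper. Everything named here is an assumption for illustration: the function name latest_dump_index, the CRASHDIR variable, the use of `uname -n` for the machine name, and the unix.n/vmcore.n file naming for savecore output. It is written for a POSIX shell such as ksh, and it only prints the mdb command rather than running it.

```shell
# Sketch: find the newest crash dump index in a savecore directory.
# Assumptions (hypothetical): dumps are named unix.N / vmcore.N, and the
# directory is /var/crash/<output of uname -n> unless CRASHDIR is set.

# Print the largest N for which DIR/vmcore.N exists; fail if there is none.
latest_dump_index() {
    dir=$1
    max=
    for f in "$dir"/vmcore.*; do
        [ -e "$f" ] || continue          # unmatched glob or vanished file
        n=${f##*.}                       # text after the last dot
        case $n in
            ''|*[!0-9]*) continue ;;     # skip non-numeric suffixes
        esac
        if [ -z "$max" ] || [ "$n" -gt "$max" ]; then
            max=$n
        fi
    done
    [ -n "$max" ] && echo "$max"
}

crashdir=${CRASHDIR:-/var/crash/$(uname -n)}
if n=$(latest_dump_index "$crashdir"); then
    # Print the command to run; ::dtrace_state is then typed at the mdb prompt.
    echo "mdb $crashdir/unix.$n $crashdir/vmcore.$n"
else
    echo "no vmcore.N files found in $crashdir" >&2
fi
```

The comparison is numeric, so vmcore.10 is treated as newer than vmcore.2 even though it sorts earlier by name.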