thr3ads.net - dtrace discuss - [dtrace-discuss] An approach to instrumenting a Solaris shutdown ? [Apr 2006]


Iz Rozenfeld

2006-Apr-27 20:16 UTC

[dtrace-discuss] An approach to instrumenting a Solaris shutdown ?

Hello all,
I am having a problem shutting down (read: NOT rebooting, but shutting
down with "init 0", "init 5", or "init 6" for that matter). My
workarounds so far are to use "reboot" or "halt", depending on what I am
trying to do, but I'd really like to understand what's happening with
"init". I see some svc.startd activity logged to /dev/console while the
system is coming down, but after a while the system just hangs. I've
narrowed this down to the point where it only occurs when the SMF NIS
client service is 'enabled'.
I would like to gather more data points by instrumenting Solaris on its
way "down".
I imagine this is a catch-22, since some of the critical kernel
components (providers, modules) would not be available to do this
cleanly. Still, I wonder whether anyone has thought about this sort of
problem before and could point me to a blog or a doc with ideas for
carrying out this approach?
Thanks much,
Isaac

Jonathan Adams

2006-May-04 17:48 UTC

[dtrace-discuss] An approach to instrumenting a Solaris shutdown ?

On Thu, Apr 27, 2006 at 04:16:51PM -0400, Iz Rozenfeld wrote:
> Hello all,
> I have a situation whereby I am experiencing a problem shutting down
> (read: NOT rebooting, but shutting down to "init 0", or "init 5", or
> "init 6" for that matter). My workarounds thus far are to use "reboot"
> or "halt", depending on what I am trying to do, but I'd like to really
> understand what's happening with "init".  I see some activity with
> svc.startd being logged back to /dev/console while the system is
> coming down but after a certain while the system just ends up hanging.
> I've narrowed this down to a point where it only occurs when the state
> of the SMF NIS client service is 'enabled'
> I would like to gather some more data points with respect to trying to
> instrument Solaris on its way "down".
> Imagine this would be a catch22 as some of the critical components
> of the kernel (providers, modules) would not be available to do this
> cleanly ... yet wondering whether anyone thought about this sort of
> problem before, and could maybe point me to a blog or a doc where I
> could get some ideas of executing on this approach ?

You can't use standard userland tracing, since the shutdown process will
kill the dtrace(1M) process, which shuts down the probes.  You want to do
anonymous tracing (using the -A flag).  After you set that up, you'll
need to convince dtrace to unload and reload.  This should be as simple
as:

	# update_drv dtrace

but this will (most likely) fail:

	Cannot unload module: dtrace
	Will be unloaded upon reboot.

The problem is that there are some userland SDT probes and helpers whose
presence causes dtrace to not be unloadable.  The main culprits are
"java" processes and the "nfsmapid" process.  You need to do:

	# svcadm disable -t svc:/network/nfs/mapid:default

and kill off any java processes.  Then:

	# modunload -i 0
	# modunload -i 0
	# update_drv dtrace

should succeed.
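
The steps so far can be collected into one small shell sketch.  This is
only an illustration, not something from the original reply: the run()
wrapper, the APPLY variable, the function name and the use of pkill to
get rid of java processes are all my assumptions.  It prints the
commands by default and only executes them when APPLY=1 is set.

```shell
#!/bin/sh
# Sketch: arm an anonymous enabling, then get dtrace to unload and reload.
# Dry run by default: commands are printed, not executed.  Set APPLY=1
# (as root, on the Solaris host) to actually run them.

run() {
	if [ "${APPLY:-0}" = "1" ]; then
		"$@"
	else
		echo "+ $*"
	fi
}

arm_anonymous_tracing() {
	script=$1			# path to your D script (caller's choice)

	run dtrace -A -s "$script"	# set up the anonymous enabling
	run svcadm disable -t svc:/network/nfs/mapid:default
	run pkill java			# assumption: pkill as the way to kill java
	run modunload -i 0		# the reply above runs this twice
	run modunload -i 0
	run update_drv dtrace		# get dtrace to unload and reload
}
```

Calling "arm_anonymous_tracing /var/tmp/shutdown.d" (a hypothetical
path) without APPLY set just lists the six commands in order, which is a
cheap way to review them before running anything for real.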


To get the data, you should include in your script a probe which causes
the system to panic() at an appropriate point.  Possibilities include
fbt::kadmin:entry, which is the underlying call used to reboot the
system, or a :::tick-1s probe which waits until a certain amount of time
has passed.  Be careful with the latter, since the probes will reappear
on boot, and you don't want to get into a panic loop.  A simple test,
like:

--- cut here ---
BEGIN
/timestamp <= 3 * 60 * 1000000000/	/* make sure we're >3 mins after boot */
{
	printf("too close to booting, script canceled\n");
	exit(0);
}

BEGIN
{
	base = timestamp;
}

profile:::tick-1s
/(timestamp - base) > 20 * 60 * 1000000000/ /* twenty minutes after start */
{
	panic();
}
--- cut here ---

should do the trick.  (The exit() will turn off all further processing of
your script.)
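
For the other possibility mentioned above, fbt::kadmin:entry, a script
could look roughly like the one written out by this shell sketch.  The
function name, the file path and the printf() line are my own
illustration.  The destructive pragma is there because panic() is a
destructive action; I am assuming it carries over to an anonymous
enabling the same way it applies to a normal run.

```shell
#!/bin/sh
# Sketch: write a D script that panics the box when kadmin() is entered,
# so the anonymous trace buffer ends up in the crash dump.

write_kadmin_panic_script() {
	cat > "$1" <<'EOF'
#pragma D option destructive

fbt::kadmin:entry
{
	printf("kadmin entered by %s", execname);
	panic();
}
EOF
}
```

You would then run something like:

	# write_kadmin_panic_script /var/tmp/kadmin-panic.d
	# dtrace -A -s /var/tmp/kadmin-panic.d

Once you have your dump, running "dtrace -A" with no probes removes the
anonymous enabling again; otherwise later shutdowns would presumably
panic at the same point.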

To get the data after the reboot, you'll need to wait for savecore to
complete, then go into /var/crash/machinename, mdb *.n (where n is the
largest number in the directory), then do:

	> ::dtrace_state
	            ADDR MINOR             PROC NAME                         FILE
	     300b2598700     2                - <anonymous>                    -
	> 300b2598700::dtrace
	CPU     ID                    FUNCTION:NAME
	  1      1                           :BEGIN foo!
	        0
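
Finding "the largest number in the directory" can be scripted.  A
minimal sketch, assuming savecore has left its files as unix.n and
vmcore.n; the function name is made up for illustration:

```shell
#!/bin/sh
# Sketch: print the highest numeric suffix among the vmcore.N files in
# the given directory (prints nothing if there are none).

latest_dump_suffix() {
	ls "$1" 2>/dev/null |
		sed -n 's/^vmcore\.\([0-9][0-9]*\)$/\1/p' |
		sort -n |
		sed -n '$p'
}

# Example use on the crashed machine:
#   cd /var/crash/`uname -n`
#   n=`latest_dump_suffix .`
#   mdb unix.$n vmcore.$n
```

The sort is numeric so that, for example, vmcore.10 is picked ahead of
vmcore.9.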

Cheers,
- jonathan

-- 
Jonathan Adams, Solaris Kernel Development
