G''Day Fletcher,
On Thu, Jan 24, 2008 at 09:58:32AM -0800, Fletcher Cocquyt
wrote:>
> Fellow dtracers - I am diagnosing a busy spamassassin/mailman/sendmail
> system - the CPU and memory usage seems normal, but iotop shows I/O
> and network are very busy (and initiating TCP connections is quite
> slow sometimes (even after tuning the stack))
>
>
> I suspect DNS latency - how can I use Dtrace to quantify latency due
> to DNS lookups on this server?
There isn''t a DNS provider yet, or plans to write one. We do need
something to serve this role, as DNS latencies are a common woe. DNS
probes were added to the Mozilla DTrace provider as part of firefox:
http://blogs.sun.com/jmr/entry/dtrace_mozilla_dns_timing_mystery
Which demonstrates how useful this data can be.
You can hit up the individual processes with the pid provider, which
will be easier if requests funnel through one process, such as nscd
(Sun''s standard local caching deamon), or named (BIND, which you could
setup as a forwarder).
Digging using the pid provider in this case is easier than it sounds -
the DNS lookup latency, which we are interested in, will dwarf others.
For example, here I traced the a.out segment of nscd while I performed
a host lookup:
# dtrace -Fn ''BEGIN { last = timestamp; }
pid$target:a.out::entry,pid$target:a.out::return { trace(timestamp - last); last
= timestamp; }'' -p `pgrep nscd`
dtrace: description ''BEGIN '' matched 801 probes
CPU FUNCTION
0 -> switcher 6132453361
0 -> _nscd_restart_if_cfgfile_changed 5578
0 <- _nscd_restart_if_cfgfile_changed 32471
0 -> validate_pheader 2508
0 <- validate_pheader 2151
0 -> if_selfcred_return_per_user_door 1777
0 <- if_selfcred_return_per_user_door 1988
0 -> lookup 1939
0 -> nsc_lookup 3474
0 -> lookup_int 1505
0 -> get_cache_ctx 4978
0 -> get_cache_idx 1725
0 <- get_cache_idx 2608
0 <- get_cache_ctx 1695
0 -> nsc_get_db 1646
0 <- nsc_get_db 2172
0 -> check_config 1987
0 -> nsw_config_in_phdr 2304
0 <- nsw_config_in_phdr 1913
0 <- check_config 2049
0 -> check_db_file 1677
0 <- check_db_file 15843
0 -> lookup_cache 2036
0 -> hash_find 1917
0 -> nsc_db_cis_key_gethash 1639
0 -> cis_gethash 1730
0 <- cis_gethash 3025
0 <- nsc_db_cis_key_gethash 1612
0 <- hash_find 2064
0 -> nsc_db_cis_key_compar 1905
0 <- nsc_db_cis_key_compar 2292
0 -> nsc_db_cis_key_compar 831
0 <- nsc_db_cis_key_compar 1175
0 -> nsc_db_cis_key_compar 709
0 <- nsc_db_cis_key_compar 1065
0 -> nsc_db_cis_key_compar 717
0 <- nsc_db_cis_key_compar 1262
0 -> nsc_db_cis_key_compar 784
0 <- nsc_db_cis_key_compar 1035
0 -> nsc_db_cis_key_compar 724
0 <- nsc_db_cis_key_compar 1100
0 -> queue_adjust 4504
0 <- queue_adjust 1768
0 <- lookup_cache 2102
0 -> _nscd_get_clearance 2225
0 <- _nscd_get_clearance 4161
0 -> nss_psearch 2228
0 -> set_initf_key 3403
0 <- set_initf_key 2349
0 -> nss_search 1724
0 -> getparams 1782
0 -> nscd_initf 1617
0 -> get_initf_key 1430
0 <- get_initf_key 1616
0 <- nscd_initf 1553
0 <- getparams 2424
0 -> try_local 2038
0 <- try_local 1664
0 -> _nscd_get_nsw_state 2198
0 -> _get_nsw_state_int 1643
0 -> _nscd_mutex_lock 4128
0 -> _nscd_get 1747
0 -> _nscd_is_int_addr 2106
0 -> _nscd_get_db_entry 5274
0 -> calc_hash 1578
0 <- calc_hash 2011
0 -> scan_hash 1609
0 <- scan_hash 1403
0 <- _nscd_get_db_entry 1378
0 <- _nscd_is_int_addr 2266
0 <- _nscd_get 1809
0 <- _nscd_mutex_lock 1918
0 -> _nscd_mutex_unlock 1917
0 -> _nscd_release 1718
0 <- _nscd_release 2237
0 <- _nscd_mutex_unlock 1825
0 <- _get_nsw_state_int 1788
0 <- _nscd_get_nsw_state 1492
0 -> _nscd_get_smf_state 2513
0 <- _nscd_get_smf_state 2243
0 -> try_local2 1454
0 <- try_local2 1689
0 -> search_dns_withttl 2024
0 <- search_dns_withttl 1823
0 -> retry_test 69505
0 <- retry_test 1923
0 -> _nscd_get_smf_state 967
0 <- _nscd_get_smf_state 1116
0 -> try_local2 700
0 <- try_local2 845
0 -> search_dns_withttl 950
0 -> get_dns_funcs 1651
0 -> get_lib_func 1484
0 <- get_lib_func 1656
0 <- get_dns_funcs 1184
0 <- search_dns_withttl 164291454
0 -> retry_test 6615
0 <- retry_test 3033
0 -> _nscd_put_nsw_state 2427
0 -> _put_nsw_state_int 1765
0 -> _nscd_mutex_lock 3284
0 -> _nscd_get 1895
0 -> _nscd_is_int_addr 2480
0 -> _nscd_get_db_entry 6470
0 -> calc_hash 1624
0 <- calc_hash 2095
0 -> scan_hash 1683
0 <- scan_hash 1463
0 <- _nscd_get_db_entry 1642
0 <- _nscd_is_int_addr 2734
0 <- _nscd_get 2036
0 <- _nscd_mutex_lock 1903
0 -> _nscd_mutex_unlock 1770
0 -> _nscd_release 1742
0 <- _nscd_release 2148
0 <- _nscd_mutex_unlock 1970
0 <- _put_nsw_state_int 1796
0 <- _nscd_put_nsw_state 1368
0 <- nss_search 1887
0 -> clear_initf_key 2140
0 <- clear_initf_key 2251
0 <- nss_psearch 2240
0 -> _nscd_release_clearance 2921
0 <- _nscd_release_clearance 3870
0 -> nscd_signal 1775
0 <- nscd_signal 1744
0 -> dup_packed_buffer 2289
0 <- dup_packed_buffer 4124
0 -> start_threads 6257
0 <- start_threads 1899
0 <- lookup_int 2260
0 <- nsc_lookup 1454
^C
So search_dns_withttl() looks interesting.
A bit more digging showed that libresolv''s res_nquery() looked even
better,
since it has the lookup string as arg1.
Tracing res_nquery() can be performed like this:
------------------------------------------------------------
#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
printf("%-20s %-10s %10s %s\n", "TIME",
"FUNC", "DELTA(ms)", "ARG");
}
pid$target:libresolv:res_nquery:entry
{
self->start = timestamp;
self->arg1 = copyinstr(arg1);
}
pid$target:libresolv:res_nquery:return
/self->start/
{
this->delta = (timestamp - self->start) / 1000000;
printf("%-20Y %-10s %10d %s\n", walltimestamp, probefunc,
this->delta,
self->arg1);
self->start = 0;
self->arg1 = 0;
}
------------------------------------------------------------
An example of this running:
# ./nscd_dns_lookups.d -p `pgrep nscd`
TIME FUNC DELTA(ms) ARG
2008 Jan 24 11:23:11 res_nquery 21 wikis.sun.com
2008 Jan 24 11:23:17 res_nquery 6110 www.sun.co.uk
2008 Jan 24 11:23:49 res_nquery 0 grouper
"grouper" was a local lookup, and returned quickly.
"www.sun.co.uk"
really did take 6.1 seconds to return.
I spent about ten minutes writing nscd_dns_lookups.d as a demo - don''t
treat this as the best possible way to get this data.
We should have a provider for nscd, however this won''t trace
applications
that talk to DNS servers directly. There might be a better way to do DNS
tracing once the DTrace network providers exist.
Brendan
--
Brendan
[CA, USA]