Environment: sendmail 8.14.4 on FreeBSD 8.0-RELEASE-p2 Since upgrading a few local servers to FreeBSD 8.0-RELEASE (and subsequently 8.0-RELEASE-p2), I have been seeing VERY intermittent problems with sendmail persistent queue runners. One or more queue runners will fail to wake up (having been told to sleep for either 1 or 5 seconds) and mail accumulates in their queue group queues. I have only seen this about 4 times but at least once on each of the three 8.0 servers. I've been seeing something like one occurrence per fortnight overall. The first few times I re-started sendmail. On Saturday I spent longer looking at it. - attached to each of the stuck queue runner processes via gdb to try to see where they were stuck - backtraces from both process were identical and looked sane - attached to a happy queue runner process and got an identical backtrace - exited gdb and discovered that the stuck queue runners had woken up and flushed their queues! The stuck queue runner processes had been stuck for several hours (judging by the timestamps on the queued mail messages) but the gdb attach apparently woke them up! PROCESS STATES BEFORE DEBUG (stuck runners are in 'I' state) PID TT STAT TIME COMMAND 80298 ?? Ss 0:17.68 sendmail: accepting connections (sendmail) 80299 ?? I 0:46.62 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail) 80300 ?? I 0:08.83 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail) 80301 ?? S 0:31.58 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail) 80302 ?? S 0:30.71 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail) 80303 ?? S 0:33.29 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail) 80304 ?? S 0:30.55 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail) BACKTRACE OF STUCK PROCESS 80299 (gdb) bt #0 0x28346547 in sigsuspend () from /lib/libc.so.7 #1 0x28344e98 in sigpause () from /lib/libc.so.7 #2 0x2833be3e in pause () from /lib/libc.so.7 #3 0x080cc7c8 in sleep () #4 0x08099c51 in run_work_group () #5 0x08099ebf in runqueue () #6 0x0805538d in main () BACKTRACE OF HAPPY PROCESS 80301 (gdb) bt #0 0x28346547 in sigsuspend () from /lib/libc.so.7 #1 0x28344e98 in sigpause () from /lib/libc.so.7 #2 0x2833be3e in pause () from /lib/libc.so.7 #3 0x080cc7c8 in sleep () #4 0x08099c51 in run_work_group () #5 0x08099ebf in runqueue () #6 0x0805538d in main () PROCESS STATES AFTER DEBUG PID TT STAT TIME COMMAND 80298 ?? Ss 0:17.69 sendmail: accepting connections (sendmail) 80299 ?? S 0:46.66 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail) 80300 ?? S 0:08.85 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail) 80301 ?? S 0:31.60 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail) 80302 ?? S 0:30.73 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail) 80303 ?? S 0:33.32 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail) 80304 ?? S 0:30.58 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail) SENDMAIL DETAILS Version 8.14.4 Compiled with: DNSMAP LOG MAP_REGEX MATCHGECOS MILTER MIME7TO8 MIME8TO7 NAMED_BIND NETINET NETUNIX NEWDB NIS PIPELINING SASLv2 SCANF STARTTLS USERDB XDEBUG /usr/sbin/sendmail: libsasl2.so.2 => /usr/local/lib/libsasl2.so.2 (0x28154000) libssl.so.7 => /usr/local/lib/libssl.so.7 (0x2816a000) libcrypto.so.7 => /usr/local/lib/libcrypto.so.7 (0x281ad000) libutil.so.8 => /lib/libutil.so.8 (0x282f2000) libc.so.7 => /lib/libc.so.7 (0x28300000) libz.so.5 => /lib/libz.so.5 (0x2840c000) I posted about this in comp.mail.sendmail and was told...> sleep() should be one of these calls: > > if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME) > sleep(MIN_SLEEP_TIME); > else if (WorkGrp[wgrp].wg_lowqintvl <= 0) > sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME); > else > sleep(WorkGrp[wgrp].wg_lowqintvl); > > Unless you have a really large value for one of these, the process > should continue after a while.The above code snippet is from sendmail/queue.c which fixes MIN_SLEEP_TIME at 5. QueueIntvl defaults to 1. wg_lowqintvl defaults to 0. I have not set any configuration or runtime options to override these defaults, so my persistent queue runners should be sleeping for either 1s or 5s only (not hours!). -- John Marshall -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 196 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20100223/0abe5c60/attachment.pgp
On Tue, Feb 23, 2010 at 12:35:22PM +1100, John Marshall wrote:> Environment: sendmail 8.14.4 on FreeBSD 8.0-RELEASE-p2 > > Since upgrading a few local servers to FreeBSD 8.0-RELEASE (and > subsequently 8.0-RELEASE-p2), I have been seeing VERY intermittent > problems with sendmail persistent queue runners. One or more queue > runners will fail to wake up (having been told to sleep for either 1 or > 5 seconds) and mail accumulates in their queue group queues. > > I have only seen this about 4 times but at least once on each of the > three 8.0 servers. I've been seeing something like one occurrence per > fortnight overall. The first few times I re-started sendmail. On > Saturday I spent longer looking at it. > > - attached to each of the stuck queue runner processes via gdb to > try to see where they were stuck > - backtraces from both process were identical and looked sane > - attached to a happy queue runner process and got an identical > backtrace > - exited gdb and discovered that the stuck queue runners had woken > up and flushed their queues! > > The stuck queue runner processes had been stuck for several hours > (judging by the timestamps on the queued mail messages) but the gdb > attach apparently woke them up! > > PROCESS STATES BEFORE DEBUG (stuck runners are in 'I' state) > > PID TT STAT TIME COMMAND > 80298 ?? Ss 0:17.68 sendmail: accepting connections (sendmail) > 80299 ?? I 0:46.62 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail) > 80300 ?? I 0:08.83 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail) > 80301 ?? S 0:31.58 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail) > 80302 ?? S 0:30.71 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail) > 80303 ?? S 0:33.29 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail) > 80304 ?? S 0:30.55 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail) > > BACKTRACE OF STUCK PROCESS 80299 > > (gdb) bt > #0 0x28346547 in sigsuspend () from /lib/libc.so.7 > #1 0x28344e98 in sigpause () from /lib/libc.so.7 > #2 0x2833be3e in pause () from /lib/libc.so.7 > #3 0x080cc7c8 in sleep () > #4 0x08099c51 in run_work_group () > #5 0x08099ebf in runqueue () > #6 0x0805538d in main () > > BACKTRACE OF HAPPY PROCESS 80301 > > (gdb) bt > #0 0x28346547 in sigsuspend () from /lib/libc.so.7 > #1 0x28344e98 in sigpause () from /lib/libc.so.7 > #2 0x2833be3e in pause () from /lib/libc.so.7 > #3 0x080cc7c8 in sleep () > #4 0x08099c51 in run_work_group () > #5 0x08099ebf in runqueue () > #6 0x0805538d in main () > > PROCESS STATES AFTER DEBUG > > PID TT STAT TIME COMMAND > 80298 ?? Ss 0:17.69 sendmail: accepting connections (sendmail) > 80299 ?? S 0:46.66 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail) > 80300 ?? S 0:08.85 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail) > 80301 ?? S 0:31.60 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail) > 80302 ?? S 0:30.73 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail) > 80303 ?? S 0:33.32 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail) > 80304 ?? S 0:30.58 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail) > > SENDMAIL DETAILS > > Version 8.14.4 > Compiled with: DNSMAP LOG MAP_REGEX MATCHGECOS MILTER MIME7TO8 MIME8TO7 > NAMED_BIND NETINET NETUNIX NEWDB NIS PIPELINING SASLv2 SCANF > STARTTLS USERDB XDEBUG > > /usr/sbin/sendmail: > libsasl2.so.2 => /usr/local/lib/libsasl2.so.2 (0x28154000) > libssl.so.7 => /usr/local/lib/libssl.so.7 (0x2816a000) > libcrypto.so.7 => /usr/local/lib/libcrypto.so.7 (0x281ad000) > libutil.so.8 => /lib/libutil.so.8 (0x282f2000) > libc.so.7 => /lib/libc.so.7 (0x28300000) > libz.so.5 => /lib/libz.so.5 (0x2840c000) > > I posted about this in comp.mail.sendmail and was told... > > > sleep() should be one of these calls: > > > > if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME) > > sleep(MIN_SLEEP_TIME); > > else if (WorkGrp[wgrp].wg_lowqintvl <= 0) > > sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME); > > else > > sleep(WorkGrp[wgrp].wg_lowqintvl); > > > > Unless you have a really large value for one of these, the process > > should continue after a while. > > The above code snippet is from sendmail/queue.c which fixes > MIN_SLEEP_TIME at 5. QueueIntvl defaults to 1. wg_lowqintvl defaults > to 0. I have not set any configuration or runtime options to override > these defaults, so my persistent queue runners should be sleeping for > either 1s or 5s only (not hours!).I think the best way to collect the data would be ktrace the queue runners, preferrably starting the ktrace before they are stuck. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 196 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20100223/39c1f017/attachment.pgp
Updates following some off-line discussions and debugging with John on IRC. I've cc'd gshapiro@ because the problem appears to be sendmail, rather than the FreeBSD kernel. On 2010-Feb-23 12:35:22 +1100, John Marshall <john.marshall@riverwillow.com.au> wrote:>Environment: sendmail 8.14.4 on FreeBSD 8.0-RELEASE-p2Note that this is stock ISC sendmail, not the sendmail in either the base system or the port.>I posted about this in comp.mail.sendmail and was told... > >> sleep() should be one of these calls: >> >> if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME) >> sleep(MIN_SLEEP_TIME); >> else if (WorkGrp[wgrp].wg_lowqintvl <= 0) >> sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME); >> else >> sleep(WorkGrp[wgrp].wg_lowqintvl);Whilst it's true that the code calls sleep(), it's not calling sleep(3) in the FreeBSD libc. Instead it's calling a sleep() defined in libsm/clock.c - which is a horrible maze of #ifdefs. John has pre-processed that code and the result it at: http://www.riverwillow.net.au/~john/sm/clock.preprocessed At a quick look, the code is broken: sm_seteventm() generates a one-off timer using setitimer(2), which will send SIGALRM when it expires. sm_releasesignal() then unblocks SIGALRM. In theory, the SIGALRM could be delivered anywhere after the (!SmSleepDone) test and before pause() is called - in which case, the signal is lost and pause() will sleep forever. On 2010-Feb-24 08:13:06 +1100, John Marshall <john.marshall@riverwillow.com.au> wrote:>My ktrace file was created with 'ktrace -g 48501'. I have the result of >'kdump -R -p 48504' available at: > > <http://www.riverwillow.net.au/~john/8_0/rwsrv04_201002240725.kdump.gz>The syscall pattern near the end of this file is significantly different from that elsewhere in the file - with gettimeofday(), sigprocmask() and sigsuspend() looping fairly rapidly. Interestingly, sigsuspend() is returning EINTR but no signal is reported. I'm not sure what could cause this. This syscall pattern looks like the while() loop in sendmail's sleep(), though it does appear that the loop is exited on that occasion but not on the following occasion (though the reason for this behaviour is unclear). Overall, it appears that there is a race condition in sendmail and something in the 8.0 signal handling appears to make this race easier to lose. Going back to the original clock.c source code, the other thing that is obvious is the HAVE_NANOSLEEP block - if this was active, sleep() would call nanosleep(2) and the whole signal mess would be avoided. It's not clear when that code was added but clock.c has not been touched for many years. In the sendmail in FreeBSD-8.0, there is no other reference to HAVE_NANOSLEEP within sendmail. sendmail 8.14.4 (in 8-STABLE) has HAVE_NANOSLEEP enabled on Solaris 11 only. Is there any reason why HAVE_NANOSLEEP is not defined for FreeBSD? Looking back through the commit logs, nanosleep(2) was implemented in sys/kern/kern_time.c v1.23 on Thu May 8 14:16:25 1997 UTC - that's just before RELENG_2_2. -- Peter Jeremy -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 196 bytes Desc: not available Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20100224/f1b8324e/attachment.pgp