At work, amongst my stable of old computers running FreeBSD, I have a
Fujitsu M800 - a 4-processor Xeon SMP machine with 4 GB of memory. This
primarily runs Nagios and a small and lightly used MySQL database, along
with a few inbound FTP transfers per minute. It has a Mylex card based
disc subsystem, ruling out crash dumps.

At some point during 5.5-STABLE this machine started to occasionally hang
while performing its daily "application" housekeeping - closing and
restarting Apache and Nagios, and dumping the database. Upgrading to
6.2-STABLE appeared to solve the problem, with no failures seen while
running 1,000 cycles of the sequence which seemed to provoke it. The
kernel and userland for this version were cvsupped at 01:20 GMT on
06 March.

However, shortly after 15:15 last Sunday afternoon the machine hung
again "out of the blue". kdb diagnostics were taken some 12 hours later,
and look somewhat odd. Maybe it was left to fester for too long.

The ps etc. output is at http://www.stade.co.uk/crash/console which
contains boot to boot serial console output, including some output from
test cycles. I'd be grateful for any expert comments on the ps etc.
output.

Supporting stuff.
[root@beastie ~/crash]# df
Filesystem    1K-blocks     Used    Avail Capacity  Mounted on
/dev/mlxd0s1a    507630    70074   396946      15%  /
devfs                 1        1        0     100%  /dev
/dev/mlxd0s1f  63541498 44355014 14103166      76%  /home
/dev/mlxd0s1e  16244334  6784900  8159888      45%  /usr
/dev/mlxd0s1d   1012974   117456   814482      13%  /var
/dev/md0           1646       32     1484       2%  /home/topftp/instances
/dev/md1         253678      132   233252       0%  /tmp
[root@beastie ~]# find /var -inum 23 -ls
23 4 -rw-r--r-- 1 daemon daemon 60 Mar 12 20:22 /var/rwho/whod.xjamesfriis
The problem stopped HTTP and FTP logging soon after 15:14 on Sunday 11
March. Diagnostics were taken and the machine rebooted around 04:30 on
Monday 12 March.
172.19.112.92 - - [11/Mar/2007:15:14:53 +0000] "GET / HTTP/1.0" 200 688 "-" "check_http/1.89 (nagios-plugins 1.4.3)"
<time passes>
172.19.112.92 - - [12/Mar/2007:04:44:14 +0000] "GET / HTTP/1.0" 200 688 "-" "check_http/1.89 (nagios-plugins 1.4.3)"
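
For anyone wanting to pin down the hang window from logs like the above, here is a rough sketch in Python. It assumes Apache common/combined log format with a bracketed timestamp; the names parse_timestamps and largest_gap are made up for illustration and are not part of any tool mentioned here. It simply reports the longest silence between consecutive entries.

```python
import re
from datetime import datetime

# Bracketed timestamp as written by Apache, e.g. [11/Mar/2007:15:14:53 +0000].
# Assumption: month names are English abbreviations (C/English locale).
TS_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}) ([+-]\d{4})\]")


def parse_timestamps(lines):
    """Yield a timezone-aware datetime for each line carrying a timestamp."""
    for line in lines:
        match = TS_RE.search(line)
        if match:
            yield datetime.strptime(" ".join(match.groups()),
                                    "%d/%b/%Y:%H:%M:%S %z")


def largest_gap(lines):
    """Return (start, end, gap) for the longest silence between consecutive
    entries, or None if fewer than two timestamps were found."""
    stamps = list(parse_timestamps(lines))
    if len(stamps) < 2:
        return None
    start, end = max(zip(stamps, stamps[1:]), key=lambda p: p[1] - p[0])
    return start, end, end - start


if __name__ == "__main__":
    sample = [
        '172.19.112.92 - - [11/Mar/2007:15:14:53 +0000] "GET / HTTP/1.0" 200 688',
        '172.19.112.92 - - [12/Mar/2007:04:44:14 +0000] "GET / HTTP/1.0" 200 688',
    ]
    start, end, gap = largest_gap(sample)
    print(f"silent from {start} to {end} ({gap})")
```

Run against the full access log rather than the two-line sample, this should point at the same Sunday afternoon to Monday morning window, provided the hourly Nagios check_http entries are otherwise regular.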
Mar 11 15:15:35 beastie ftpd[91652]: connection from appsupcen (10.208.1.134)
Mar 11 15:15:35 beastie ftpd[91652]: FTP LOGIN FROM appsupcen as topftp
Mar 11 15:15:35 beastie ftpd[91652]: session root changed to /home/topftp/instances
Mar 11 15:15:35 beastie ftpd[91652]: put in.env_status.html.gz = 592 bytes (wd: /topftp/appsupcen; chrooted)
<time passes>
Mar 11 15:15:35 beastie ftpd[91652]: rename in.env_status.html.gz env_status.html.gz (wd: /topftp/appsupcen; chrooted)
Mar 12 04:44:31 beastie ftpd[1161]: connection from appsupcen (10.208.1.134)
Mar 12 04:44:31 beastie ftpd[1161]: FTP LOGIN FROM appsupcen as topftp
Mar 12 04:44:31 beastie ftpd[1161]: session root changed to /home/topftp/instances
Mar 12 04:44:31 beastie ftpd[1161]: mkdir topftp/appsupcen (wd: /; chrooted)
Support diary:
15:20
Beastie seems like it's crashed and down;
16:54
Beastie is no longer pingable by rjmon1;
04:30 - 04:43
(support person quoting from the documentation I'd provided about what
to do after a hang)
Type "return tilde hash" (CR~#) which will make cu send a break signal
to beastie, and should cause beastie to drop into the ddb kernel debugger.
In the following, you may see "more" prompts. Type space at each for
the next page.
Type these debugger commands
ps
show pcpu
show allpcpu
show locks
show alllocks
show lockedvnods
trace
alltrace
04:43 - beastie is now back up and working, after typing call cpu_reset()
following the above commands to reboot it.
AW: preserved and inspected the diagnostic output. It looks very unlike
that for previous crashes (without a serial console), where a noticeable
feature was many ftpd processes in a UFS state. Possibly "things
happened" in the 12 hour period between the onset of the problem on
Sunday afternoon and the diagnostics being taken on Monday morning.
--
Adrian Wontroba

On Tue, Mar 13, 2007 at 02:08:48PM +0000, Adrian Wontroba wrote:
> At work, amongst my stable of old computers running FreeBSD, I have a
> Fujitsu M800 - a 4-processor Xeon SMP machine with 4 GB of memory. This
> primarily runs Nagios and a small and lightly used MySQL database, along
> with a few inbound FTP transfers per minute. It has a Mylex card based
> disc subsystem, ruling out crash dumps.
>
> At some point during 5.5-STABLE this machine started to occasionally hang ...

Another 6-STABLE (cvsupped on 27/03/07) example, with diagnostics taken
rather sooner after the hang. Processes with wmesg=ufs feature often in
the ps output.

http://www.stade.co.uk/crash1/

--
Adrian Wontroba

On Mon, Apr 23, 2007 at 03:56:32AM +0100, Adrian Wontroba wrote:
> On Tue, Mar 13, 2007 at 02:08:48PM +0000, Adrian Wontroba wrote:
> > At work, amongst my stable of old computers running FreeBSD, I have a
> > Fujitsu M800 - a 4-processor Xeon SMP machine with 4 GB of memory. This
> > primarily runs Nagios and a small and lightly used MySQL database, along
> > with a few inbound FTP transfers per minute. It has a Mylex card based
> > disc subsystem, ruling out crash dumps.
> >
> > At some point during 5.5-STABLE this machine started to occasionally hang ...
>
> Another 6-STABLE (cvsupped on 27/03/07) example, with diagnostics taken
> rather sooner after the hang. Processes with wmesg=ufs feature often in
> the ps output.
>
> http://www.stade.co.uk/crash1/

I would suspect the mlx controller. There are several processes (for
instance, 988 and 50918) waiting for completion of a block read, and the
processes in the "ufs" state are the result of the lock cascade, IMHO.

Kostik Belousov wrote:
> I would suspect the mlx controller. There are several processes (for
> instance, 988 and 50918) waiting for completion of a block read, and the
> processes in the "ufs" state are the result of the lock cascade, IMHO.

It is possible that the controller is not at fault. You can easily
reproduce a lock in the "ufs" state with the commands from the
"How-To-Repeat" section of:

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=kern/107439

The PR is closed, but the problem still exists in recent 6.2-STABLE.
GENERIC has the problem too; GENERIC+INVARIANTS panics at once instead
of producing locked processes.

Eugene Grosbein