At work, amongst my stable of old computers running FreeBSD, I have a Fujitsu M800 - a 4-processor Xeon SMP machine with 4 GB of memory. This primarily runs Nagios and a small and lightly used MySQL database, along with a few inbound FTP transfers per minute. It has a Mylex card based disc subsystem, ruling out crash dumps.

At some point during 5.5-STABLE this machine started to occasionally hang while performing its daily "application" housekeeping - closing and restarting Apache and Nagios, and dumping the database.

Upgrading to 6.2-STABLE appeared to solve the problem: 1,000 cycles of the sequence which seemed to provoke it ran without a hang. cvsup for this version of the kernel and userland was run at 01:20 GMT on 06 March.

However, shortly after 15:15 last Sunday afternoon the machine hung again "out of the blue". kdb diagnostics were taken some 12 hours later, and look somewhat odd. Maybe it was left to fester for too long.

ps etc output is at http://www.stade.co.uk/crash/console which contains boot to boot serial console output, including some output from test cycles.

I'd be grateful for any expert comments on the ps etc output.

Supporting stuff.

[root@beastie ~/crash]# df
Filesystem     1K-blocks     Used    Avail Capacity  Mounted on
/dev/mlxd0s1a     507630    70074   396946    15%    /
devfs                  1        1        0   100%    /dev
/dev/mlxd0s1f   63541498 44355014 14103166    76%    /home
/dev/mlxd0s1e   16244334  6784900  8159888    45%    /usr
/dev/mlxd0s1d    1012974   117456   814482    13%    /var
/dev/md0            1646       32     1484     2%    /home/topftp/instances
/dev/md1          253678      132   233252     0%    /tmp

[root@beastie ~]# find /var -inum 23 -ls
23 4 -rw-r--r-- 1 daemon daemon 60 Mar 12 20:22 /var/rwho/whod.xjamesfriis

The problem stopped HTTP and FTP logging soon after 15:14 on Sunday 11; diagnostics were taken and the machine rebooted around 04:30 on Monday 12.
172.19.112.92 - - [11/Mar/2007:15:14:53 +0000] "GET / HTTP/1.0" 200 688 "-" "check_http/1.89 (nagios-plugins 1.4.3)"
<time passes>
172.19.112.92 - - [12/Mar/2007:04:44:14 +0000] "GET / HTTP/1.0" 200 688 "-" "check_http/1.89 (nagios-plugins 1.4.3)"

Mar 11 15:15:35 beastie ftpd[91652]: connection from appsupcen (10.208.1.134)
Mar 11 15:15:35 beastie ftpd[91652]: FTP LOGIN FROM appsupcen as topftp
Mar 11 15:15:35 beastie ftpd[91652]: session root changed to /home/topftp/instances
Mar 11 15:15:35 beastie ftpd[91652]: put in.env_status.html.gz = 592 bytes (wd: /topftp/appsupcen; chrooted)
<time passes>
Mar 11 15:15:35 beastie ftpd[91652]: rename in.env_status.html.gz env_status.html.gz (wd: /topftp/appsupcen; chrooted)
Mar 12 04:44:31 beastie ftpd[1161]: connection from appsupcen (10.208.1.134)
Mar 12 04:44:31 beastie ftpd[1161]: FTP LOGIN FROM appsupcen as topftp
Mar 12 04:44:31 beastie ftpd[1161]: session root changed to /home/topftp/instances
Mar 12 04:44:31 beastie ftpd[1161]: mkdir topftp/appsupcen (wd: /; chrooted)

Support diary:

15:20 Beastie seems like it has crashed and is down

16:54 Beastie is no longer pingable by rjmon1

04:30 - 04:43 (support person quoting from the documentation I'd provided about what to do after a hang)

  Type "return tilde hash" (CR~#) which will make cu send a break signal to beastie, and should cause beastie to drop into the ddb kernel debugger.

  In the following, you may see "more" prompts. Type space at each for the next page.

  Type these debugger commands

    ps
    show pcpu
    show allpcpu
    show locks
    show alllocks
    show lockedvnods
    trace
    alltrace

04:43 - beastie now back up and working now by typing call cpu_reset() after the above commands to reboot beastie.

AW: preserved and inspected diagnostic output. It looks very unlike that for previous crashes (without a serial console), where a noticeable feature was many ftpd processes in a UFS state.
Possibly "things happened" in the 12-hour period between the onset of the problem on Sunday afternoon and the diagnostics being taken on Monday morning.

--
Adrian Wontroba
Adrian's Birthday Celebration: Crewe Limelight, Saturday 17 March.
David Hughes and Tiny Tin Lady. Free but ticketed - email me your postal address if you want to come. No under 18s.
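The length of the logging outage can be read straight off the access log excerpt quoted in the message above. As a minimal sketch (not part of the original report), the following Python finds the longest silence between consecutive entries, assuming the log uses the bracketed timestamp format shown in the excerpt. The function name largest_gap and the SAMPLE list are illustrative only; the two sample lines are copied from the excerpt.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp used in the access log excerpt,
# e.g. [11/Mar/2007:15:14:53 +0000]
STAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def largest_gap(lines):
    """Return (gap, before, after) for the longest silence between entries.

    Lines without a recognisable timestamp are skipped. Returns None when
    fewer than two timestamps are found.
    """
    stamps = []
    for line in lines:
        match = STAMP.search(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            )
    if len(stamps) < 2:
        return None
    # Pair each timestamp with its successor and pick the widest pair.
    before, after = max(zip(stamps, stamps[1:]), key=lambda p: p[1] - p[0])
    return after - before, before, after


# The two check_http entries quoted in the report (assumed log order).
SAMPLE = [
    '172.19.112.92 - - [11/Mar/2007:15:14:53 +0000] "GET / HTTP/1.0" 200 688',
    '172.19.112.92 - - [12/Mar/2007:04:44:14 +0000] "GET / HTTP/1.0" 200 688',
]

if __name__ == "__main__":
    gap, before, after = largest_gap(SAMPLE)
    print(f"no entries between {before} and {after} ({gap})")
```

For the two quoted entries the gap is 13 hours 29 minutes 21 seconds, which brackets the hang and the reboot. Run against a full log, the same function would also show whether shorter stalls preceded the hang.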
On Tue, Mar 13, 2007 at 02:08:48PM +0000, Adrian Wontroba wrote:
> At work, amongst my stable of old computers running FreeBSD, I have a
> Fujitsu M800 - a 4 Xeon SMP processor with 4 GB of memory. This
> primarily runs Nagios and a small and lightly used MySQL database, along
> with a few inbound FTP transfers per minute. It has a Mylex card based
> disc subsystem, ruling out crash dumps.
>
> At some point during 5.5-STABLE this machine started to occasionally hang ...

Another 6-STABLE (cvsupped on 27/03/07) example, with diagnostics taken rather sooner after the hang. Processes with wmesg=ufs feature often in the ps output.

http://www.stade.co.uk/crash1/

--
Adrian Wontroba
On Mon, Apr 23, 2007 at 03:56:32AM +0100, Adrian Wontroba wrote:
> On Tue, Mar 13, 2007 at 02:08:48PM +0000, Adrian Wontroba wrote:
> > At work, amongst my stable of old computers running FreeBSD, I have a
> > Fujitsu M800 - a 4 Xeon SMP processor with 4 GB of memory. This
> > primarily runs Nagios and a small and lightly used MySQL database, along
> > with a few inbound FTP transfers per minute. It has a Mylex card based
> > disc subsystem, ruling out crash dumps.
> >
> > At some point during 5.5-STABLE this machine started to occasionally hang ...
>
> Another 6-STABLE (cvsupped on 27/03/07) example, with diagnostics taken
> rather sooner after the hang. Processes with wmesg=ufs feature often in
> the ps output.
>
> http://www.stade.co.uk/crash1/

I would suspect the mlx controller. There are several processes (for instance, 988, 50918) waiting for completion of a block read, and the processes in the "ufs" state are the result of the lock cascade, IMHO.
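To compare captures such as the two console logs linked above, it can help to count how many process lines carry a given wait message. The following is a rough Python sketch, not anything from the thread: it assumes the capture has been saved as plain text with one process per line, and that the wait message appears as its own whitespace-delimited field. The exact ddb ps column layout is not assumed, so a command that happens to share a name with a wait message would also be counted. The name tally_fields is hypothetical.

```python
from collections import Counter


def tally_fields(ps_text, wanted):
    """Count lines containing each wanted token as a whole field.

    ps_text is the saved text of a ps listing; wanted is an iterable of
    wait messages to look for, such as "ufs". A line is counted at most
    once per token.
    """
    counts = Counter()
    for line in ps_text.splitlines():
        fields = set(line.split())
        for token in wanted:
            if token in fields:
                counts[token] += 1
    return counts
```

A call such as tally_fields(text, ["ufs"]), where text holds the contents of a locally saved copy of the console output, returns a Counter keyed by wait message. A large "ufs" count alongside a few processes blocked on disc reads would be consistent with the lock cascade described above, but the count alone does not identify the cause.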
Kostik Belousov wrote:
> I would suspect the mlx controller. There are several processes (for instance,
> 988, 50918) waiting for completion of a block read, and the processes in the "ufs"
> state are the result of the lock cascade, IMHO.

It is possible that the controller is not at fault. You can easily reproduce a lock in the "ufs" state with the commands from the "How-To-Repeat" section of:

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=kern/107439

The PR is closed, but the problem still exists in recent 6.2-STABLE. GENERIC has the problem too; GENERIC+INVARIANTS panics at once instead of producing locked processes.

Eugene Grosbein.