thr3ads.net - freebsd stable - 6.2-STABLE deadlock? [Mar 2007]

Adrian Wontroba

2007-Mar-13 14:49 UTC

6.2-STABLE deadlock?

At work, amongst my stable of old computers running FreeBSD, I have a
Fujitsu M800 - a four-processor Xeon SMP machine with 4 GB of memory. This
primarily runs Nagios and a small and lightly used MySQL database, along
with a few inbound FTP transfers per minute. It has a Mylex card based
disc subsystem, ruling out crash dumps.

At some point during 5.5-STABLE this machine started to occasionally hang
while performing its daily "application" housekeeping - closing and
restarting Apache and Nagios, and dumping the database. Upgrading to
6.2-STABLE
appeared to solve the problem, with no problems visible while running
1,000 cycles of the sequence which seemed to provoke the problem.

cvsup for this version of the kernel and userland was run at 01:20 GMT
on 06 March.

However, shortly after 15:15 last Sunday afternoon the machine hung
again "out of the blue". kdb diagnostics were taken some 12 hours later,
and look somewhat odd. Maybe it was left to fester for too long.

ps etc output at http://www.stade.co.uk/crash/console which contains
boot to boot serial console output, including some output from test
cycles. I'd be grateful for any expert comments on the ps etc output.

Supporting stuff. 

[root@beastie ~/crash]# df
Filesystem    1K-blocks     Used    Avail Capacity  Mounted on
/dev/mlxd0s1a    507630    70074   396946    15%    /
devfs                 1        1        0   100%    /dev
/dev/mlxd0s1f  63541498 44355014 14103166    76%    /home
/dev/mlxd0s1e  16244334  6784900  8159888    45%    /usr
/dev/mlxd0s1d   1012974   117456   814482    13%    /var
/dev/md0           1646       32     1484     2%    /home/topftp/instances
/dev/md1         253678      132   233252     0%    /tmp

[root@beastie ~]# find /var -inum 23 -ls
    23        4 -rw-r--r--    1 daemon           daemon                 60 Mar 12 20:22 /var/rwho/whod.xjamesfriis

The problem stopped HTTP and FTP logging soon after 15:14 on Sunday 11.
Diagnostics were taken and the machine rebooted around 04:30 on Monday 12.

172.19.112.92 - - [11/Mar/2007:15:14:53 +0000] "GET / HTTP/1.0" 200 688 "-" "check_http/1.89 (nagios-plugins 1.4.3)"
<time passes>
172.19.112.92 - - [12/Mar/2007:04:44:14 +0000] "GET / HTTP/1.0" 200 688 "-" "check_http/1.89 (nagios-plugins 1.4.3)"
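
The silent period in an access log like the one above can be located mechanically. The following is only a minimal sketch, assuming Apache common/combined log timestamps of the form shown in the excerpt. The function name largest_gap and the shortened sample lines are illustrative and are not part of the original diagnostics.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp of an Apache common/combined log line,
# e.g. [11/Mar/2007:15:14:53 +0000]
STAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def largest_gap(lines):
    """Return (gap, start, end) for the longest silence between
    consecutive timestamped lines, or None if fewer than two are found."""
    stamps = []
    for line in lines:
        match = STAMP.search(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            )
    best = None
    for earlier, later in zip(stamps, stamps[1:]):
        gap = later - earlier
        if best is None or gap > best[0]:
            best = (gap, earlier, later)
    return best


# Sample lines shortened from the excerpt above (hypothetical input).
sample = [
    '172.19.112.92 - - [11/Mar/2007:15:14:53 +0000] "GET / HTTP/1.0" 200 688',
    '172.19.112.92 - - [12/Mar/2007:04:44:14 +0000] "GET / HTTP/1.0" 200 688',
]
gap, start, end = largest_gap(sample)
print(f"no log entries for {gap} (from {start} to {end})")
```

Run against a full access log, the start of the longest gap gives an approximate time for the onset of the hang, which can then be compared with the ftpd syslog entries.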

Mar 11 15:15:35 beastie ftpd[91652]: connection from appsupcen (10.208.1.134)
Mar 11 15:15:35 beastie ftpd[91652]: FTP LOGIN FROM appsupcen as topftp
Mar 11 15:15:35 beastie ftpd[91652]: session root changed to /home/topftp/instances
Mar 11 15:15:35 beastie ftpd[91652]: put in.env_status.html.gz = 592 bytes (wd: /topftp/appsupcen; chrooted)
<time passes>
Mar 11 15:15:35 beastie ftpd[91652]: rename in.env_status.html.gz env_status.html.gz (wd: /topftp/appsupcen; chrooted)
Mar 12 04:44:31 beastie ftpd[1161]: connection from appsupcen (10.208.1.134)
Mar 12 04:44:31 beastie ftpd[1161]: FTP LOGIN FROM appsupcen as topftp
Mar 12 04:44:31 beastie ftpd[1161]: session root changed to /home/topftp/instances
Mar 12 04:44:31 beastie ftpd[1161]: mkdir topftp/appsupcen (wd: /; chrooted)

Support diary:

15:20
Beastie seems to have crashed and is down;

16:54
Beastie is no longer pingable by rjmon1;

04:30 - 04:43
(support person quoting from the documentation I'd provided about what
to do after a hang)
Type "return tilde hash" (CR~#) which will make cu send a break signal
to beastie, and should cause beastie to drop into the ddb kernel debugger.
In the following, you may see "more" prompts. Type space at each for
the next page.
Type these debugger commands
ps
show pcpu
show allpcpu
show locks
show alllocks
show lockedvnods
trace
alltrace
04:43 - beastie is now back up and working, after typing call cpu_reset()
following the above commands to reboot it.

AW: preserved and inspected diagnostic output. It looks very unlike
that for previous crashes (without a serial console) where a noticeable
feature was many ftpd processes in a UFS state. Possibly "things
happened" in the 12 hour period between the onset of the problem on
Sunday afternoon and the diagnostics being taken on Monday morning.

-- 
Adrian Wontroba
Adrian's Birthday Celebration: Crewe Limelight, Saturday 17 March. David
Hughes and Tiny Tin Lady.  Free but ticketed - email me your postal
address if you want to come. No under 18s.

Adrian Wontroba

2007-Apr-23 03:36 UTC

On Tue, Mar 13, 2007 at 02:08:48PM +0000, Adrian Wontroba wrote:
> At work, amongst my stable of old computers running FreeBSD, I have a
> Fujitsu M800 - a four-processor Xeon SMP machine with 4 GB of memory. This
> primarily runs Nagios and a small and lightly used MySQL database, along
> with a few inbound FTP transfers per minute. It has a Mylex card based
> disc subsystem, ruling out crash dumps.
> 
> At some point during 5.5-STABLE this machine started to occasionally hang ...

Another 6-STABLE (cvsupped on 27/03/07) example, with diagnostics taken
rather sooner after the hang.  Processes with wmesg=ufs feature often in
the ps output.

http://www.stade.co.uk/crash1/

-- 
Adrian Wontroba

Kostik Belousov

2007-Apr-23 11:39 UTC

On Mon, Apr 23, 2007 at 03:56:32AM +0100, Adrian Wontroba wrote:
> On Tue, Mar 13, 2007 at 02:08:48PM +0000, Adrian Wontroba wrote:
> > At work, amongst my stable of old computers running FreeBSD, I have a
> > Fujitsu M800 - a four-processor Xeon SMP machine with 4 GB of memory. This
> > primarily runs Nagios and a small and lightly used MySQL database, along
> > with a few inbound FTP transfers per minute. It has a Mylex card based
> > disc subsystem, ruling out crash dumps.
> > 
> > At some point during 5.5-STABLE this machine started to occasionally hang ...
> 
> Another 6-STABLE (cvsupped on 27/03/07) example, with diagnostics taken
> rather sooner after the hang.  Processes with wmesg=ufs feature often in
> the ps output.
> 
> http://www.stade.co.uk/crash1/

I would suspect the mlx controller. There are several processes (for
instance, 988 and 50918) waiting for completion of a block read, and the
processes in the "ufs" state are the result of the lock cascade, IMHO.

Eugene Grosbein

2007-Apr-25 02:45 UTC

Kostik Belousov wrote:
> I would suspect the mlx controller. There are several processes (for
> instance, 988, 50918) waiting for completion of block read, and processes
> in the "ufs" states are the result of the lock cascade, IMHO.

It is possible that the controller is not at fault.

You can easily reproduce a lock in the "ufs" state with the commands from
the "How-To-Repeat" section of:
http://www.FreeBSD.org/cgi/query-pr.cgi?pr=kern/107439

The PR is closed but the problem still exists in recent 6.2-STABLE.
GENERIC has the problem too; GENERIC+INVARIANTS panics at once
instead of producing locked processes.

Eugene Grosbein.
