NAKAJI Hiroyuki
2009-Jun-06 14:33 UTC
Big problem still remains with 7.2-STABLE locking up
Hi,

I noticed, some months ago, frequent lockups on my RELENG_6 server with ECS PM800-M2, Celeron 2.6GHz (UP), 2GB RAM, ATA HDDs and 3Com NIC (xl0), and then I gave up on this old server.

Last month, I replaced this 'unstable' server with a new one running 7.2-RELEASE, which worked very well until I set it up as 'a server'. The problem began just after it started 'the services'.

My story is very similar to Pete's.
http://lists.freebsd.org/pipermail/freebsd-stable/2009-January/047487.html

I followed some instructions in the list thread. But unfortunately, the big problem still remains. The 7.2-STABLE server locks up frequently.

Help! :-(

The server is NEC Express5800 S70/SD.

o CPU: Intel(R) Celeron(R) CPU 440 @ 2.00GHz (2280.25-MHz K8-class CPU)
o 6GB RAM
o ACPI APIC Table: <NEC DT000020>
o 80GB and 250GB SATA HDDs
o http://www.heimat.gr.jp/~nakaji/localhost/dmesg.boot

The kernel configuration is:

include GENERIC
ident HEIMAT
options MSGBUF_SIZE=81920
makeoptions DEBUG=-g
options KDB
options DDB
options BREAK_TO_DEBUGGER
options QUOTA
options DEVICE_POLLING
options HZ=1000
options SW_WATCHDOG
options DEBUG_VFS_LOCKS
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN
options LOCK_PROFILING

This server runs as web server, NFS server, DHCP server, NTP server, mail server with spam checks, ML server, Usenet server and so on.

From /etc/rc.conf*, there are some "_enable" lines as shown below.

o ntpdate
o ntpd
o nfs_server
o sshd
o inetd
o named
o sendmail
o rtadvd
o watchdogd
o dhcpd
o snmpd
o apache22
o samba
o zope29
o zope210
o amavisd
o amavisd_milter
o cvsupd
o ntop
o compat6x
o munin_node
o spamd
o spamass_milter
o smartd
o mailman
o sshblock
o innd
o skkserv

From munin's graphs, the 'resets' value in netstat is increasing, while on other 'desktops' it remains zero. Though I did not find whether there is a threshold for 'resets', when it reaches 0.8 - 1.2 the server locks up.
No ping response, no messages on the console, no keyboard response, and, of course, Ctrl-Alt-Esc does not function when it locks up.

I wonder why netstat's 'resets' value is increasing.

I learned a workaround from other Japanese users: if the box has an ICH chipset, enabling ichwd and running watchdogd can reboot the box when it locks up. Indeed, after about 4 hours, the box rebooted while I was in bed last night. Watchdogd works very well.

Advice? Thanks.
-- 
NAKAJI Hiroyuki
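
The ichwd/watchdogd workaround described above could be set up roughly as follows. This is only a sketch: it assumes the ichwd driver is loaded as a module at boot (rather than compiled into the kernel with `device ichwd`) and that watchdogd runs with its default settings. The post does not say which variant was actually used.

```
# /boot/loader.conf
ichwd_load="YES"          # load the Intel ICH watchdog driver at boot

# /etc/rc.conf
watchdogd_enable="YES"    # run watchdogd so the hardware watchdog is patted regularly
```

With this in place, a hard lockup stops watchdogd from resetting the timer, and the ICH watchdog reboots the machine once the timeout expires.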
2009/6/6 NAKAJI Hiroyuki <nakaji@jp.freebsd.org>:
> Hi,
>
> I noticed, some months ago, frequent lockups on my RELENG_6 server with
> ECS PM800-M2, Celeron 2.6GHz (UP), 2GB ram, ATA HDDs and 3Com NIC(xl0),
> and then I gave up this old server.
>
> Last month, I replaced this 'unstable' server to the new one with
> 7.2-RELEASE which worked very well until I setup it as 'a server'. The
> problem began just after it started 'the services'.
>
> My story is very similar to Pete's.
> http://lists.freebsd.org/pipermail/freebsd-stable/2009-January/047487.html
>
> I followed some instructions in the list thread. But unfortunately, the
> big problem still remains. 7.2-STABLE server locks up frequently.
>
> Help! :-(
>
> The server is NEC Express5800 S70/SD.
>
> o CPU: Intel(R) Celeron(R) CPU 440 @ 2.00GHz (2280.25-MHz K8-class CPU)
> o 6GB RAM
> o ACPI APIC Table: <NEC DT000020>
> o 80GB and 250GB SATA HDDs
> o http://www.heimat.gr.jp/~nakaji/localhost/dmesg.boot
>
> The kernel configuration is:
>
> include GENERIC
> ident HEIMAT
> options MSGBUF_SIZE=81920
> makeoptions DEBUG=-g
> options KDB
> options DDB
> options BREAK_TO_DEBUGGER
> options QUOTA

Were you unmounting any of the QUOTA'ed filesystems? I'm aware of a possible deadlock between the quota and unmount paths, though it is very difficult to trigger.

Anyway, the only way we have to debug this is with some help from the user.

1) Drop the options WITNESS_SKIPSPIN (as we would like to debug spinlocks too) and LOCK_PROFILING (in order to create higher contention and kill some barriers)
2) Once you get the deadlock, break into the DDB debugger
3) Once you are in DDB, information which could be very useful is:
db> show allpcpu
db> show alllocks
db> show lockedvnods
db> ps
db> allthreads

Note that this is a lot of printout, so you won't be able to collect all of this information without a serial connection.
4) Dump the content so that we can further look at the state of the lock structures once we identify something useful (ideally, keeping the machine up in DDB for that would be very useful, but often not viable)

Let me know.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
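
Applied to the kernel configuration posted at the top of the thread, step 1 would amount to something like the sketch below. Only the two options named in step 1 are removed (shown commented out here); every other line is copied unchanged from the original post.

```
include GENERIC
ident HEIMAT
options MSGBUF_SIZE=81920
makeoptions DEBUG=-g
options KDB
options DDB
options BREAK_TO_DEBUGGER
options QUOTA
options DEVICE_POLLING
options HZ=1000
options SW_WATCHDOG
options DEBUG_VFS_LOCKS
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
#options WITNESS_SKIPSPIN    # dropped so that WITNESS also checks spin locks
#options LOCK_PROFILING      # dropped, per step 1
```

The kernel has to be rebuilt and installed from this configuration before the next lockup for the DDB output in steps 2 and 3 to reflect the change.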
> My story is very similar to Pete's.
> http://lists.freebsd.org/pipermail/freebsd-stable/2009-January/047487.html

My problem, which you link to there, turned out to be due to ICMP redirects, and is most definitely fixed in 7.2. So your problem is not the same as mine, but some of the tips given there may help you debug it.

> I followed some instructions in the list thread. But unfortunately, the
> big problem still remains. 7.2-STABLE server locks up frequently.

Are you using the latest STABLE? I am rolling out the one from a few days ago with the bce fixes, and that works fine.

> The kernel configuration is:
...
> options BREAK_TO_DEBUGGER

When the box locks up, can you actually break to the debugger? This is how we eventually tracked down my problem.

-pete.