thr3ads.net - freebsd stable - Big problems with 7.1 locking up :-( [Jan 2009]

Pete French

2009-Jan-09 01:58 UTC

Big problems with 7.1 locking up :-(

I have a number of HP 1U servers, all of which were running 7.0
perfectly happily. I have been testing 7.1 in its various incarnations
for the last couple of months on our test server and it has performed
perfectly.

So the last two days I have been round upgrading all our servers, knowing
that I had run the system stably on identical hardware for some time.

Since then I have started seeing machines lock up. This always happens under
heavy disc load. When I bring the machine back up then sometimes it fails
to fsck due to a partially truncated inode. The lockups appear to
be disc related - on my mysql master machine it will come back up with
files somewhat shorter than those which have already been transmitted to
the slave (i.e. some data was in memory, and claimed to have been written
to the drive, but never made it onto the disc).

The only time I have seen anything useful on the screen was during one lockup
where I got a message about a spin lock being held too long and some
comment in parentheses about it being a turnstile lock.

Help! :-(

I am now downgrading all the machines to 7.0 as fast as I can - though the
machine I am trying to compile it on has locked up once during the compile
so I haven't got anywhere so far.

The machines are HP Proliant DL360 G5s - they have an embedded P400i
RAID controller with a pair of mirrored drives connected. Each one has
both ethernets connected, bundled using lagg and LACP.

Advice ?

-pete.

Guy Helmer

2009-Jan-09 14:49 UTC

Big problems with 7.1 locking up :-(

Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0
> perfectly happily. I have been testing 7.1 in its various incarnations
> for the last couple of months on our test server and it has performed
> perfectly.
>
> So the last two days I have been round upgrading all our servers, knowing
> that I had run the system stably on identical hardware for some time.
>
> Since then I have started seeing machines lock up. This always happens under
> heavy disc load. When I bring the machine back up then sometimes it fails
> to fsck due to a partially truncated inode. The lockups appear to
> be disc related - on my mysql master machine it will come back up with
> files somewhat shorter than those which have already been transmitted to
> the slave (i.e. some data was in memory, and claimed to have been written
> to the drive, but never made it onto the disc).
>
> The only time I have seen anything useful on the screen was during one lockup
> where I got a message about a spin lock being held too long and some
> comment in parentheses about it being a turnstile lock.
>
> Help! :-(
>
> I am now downgrading all the machines to 7.0 as fast as I can - though the
> machine I am trying to compile it on has locked up once during the compile
> so I haven't got anywhere so far.
>
> The machines are HP Proliant DL360 G5s - they have an embedded P400i
> RAID controller with a pair of mirrored drives connected. Each one has
> both ethernets connected, bundled using lagg and LACP.
>

I can't tell whether my situation is related, but I am seeing lockups on
SMP Supermicro servers with both older (NetBurst-ish) and current Xeon 
CPUs.  I have been dropping into the kernel debugger and getting lock 
information and process backtraces, but so far nothing has been 
conclusively identified.  I think the issue I'm seeing was introduced 
sometime between October 2 and November 24 in the RELENG_7 branch, and I 
suppose the next step is to do a binary search for the offending change.
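
A minimal sketch of that binary search, in sh. Day 0 stands for the
last known-good date (October 2) and day 53 for the first known-bad one
(November 24). The is_bad function is a hypothetical stand-in for
"check out RELENG_7 as of that day, build, boot and wait for a lockup";
here it only compares against a made-up FIRST_BAD value so that the loop
can be run as-is.

```shell
#!/bin/sh
# Bisect over day offsets between a known-good and a known-bad date.
# is_bad is a HYPOTHETICAL placeholder: replace its body with a real
# "build the tree from that day and test it" step.
is_bad() {
    [ "$1" -ge "${FIRST_BAD:-30}" ]
}

bisect() {
    good=$1   # offset known to be good
    bad=$2    # offset known to be bad
    # Narrow the window until good and bad are adjacent days.
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if is_bad "$mid"; then
            bad=$mid
        else
            good=$mid
        fi
    done
    echo "$bad"   # first offset at which the problem appears
}

bisect 0 53
```

With 53 days between the two dates, this needs at most six build-and-test
rounds to land on a single day.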

Guy

-- 
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.

Robert Blayzor

2009-Jan-09 19:55 UTC

Big problems with 7.1 locking up :-(

On Jan 8, 2009, at 8:58 PM, Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0
> perfectly happily. I have been testing 7.1 in its various incarnations
> for the last couple of months on our test server and it has performed
> perfectly.

I noticed a problem with 7.0 on a couple of Dell servers.  Not sure if  
this is related but when our system "froze" the box was pingable, and
you could switch virtual consoles... however, you could not type  
anything on the screen or connect to any sockets.  Num-lock would  
still work so the box wasn't solidly frozen.  This used to happen a  
couple of times every week or two.  We've since then compiled the  
kernel under the BSD scheduler to rule that out, and so far so good.   
(our box was a Dell PE1750, 2GB of RAM, amr RAID controller, bge  
network driver)  The primary application was just ntpd and apache with  
mpm_worker & threads.

Since ULE is now default in 7.1 and not in 7.0, perhaps you can try  
that?
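
For anyone wanting to try the same thing, a rough sketch of a kernel
config that takes GENERIC and swaps the scheduler, wrapped in a small sh
function so it can be redirected into a file. The function name and the
ident BSD4 are made up for illustration; the option names SCHED_ULE and
SCHED_4BSD are the stock ones.

```shell
#!/bin/sh
# Print a kernel config that is GENERIC with the scheduler swapped.
# "BSD4" is an arbitrary, HYPOTHETICAL ident; pick any name you like.
sched_4bsd_conf() {
    cat <<'EOF'
include         GENERIC
ident           BSD4

nooptions       SCHED_ULE       # drop the 7.1 default scheduler
options         SCHED_4BSD      # use the traditional BSD scheduler
EOF
}

sched_4bsd_conf
```

Save the output as a file named BSD4 in the conf directory for your
architecture under /usr/src/sys, then build and install it with
KERNCONF=BSD4 in the usual way.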

-- 
Robert Blayzor, BOFH
INOC, LLC
rblayzor@inoc.net
http://www.inoc.net/~rblayzor/

Pete French

2009-Jan-09 21:43 UTC

Big problems with 7.1 locking up :-(

> Since ULE is now default in 7.1 and not in 7.0, perhaps you can try
> that?

Actually you might be on to something there.... one of the main differences
between our test DL360 and the live ones is that the test one has fewer
cores in it, and is under less load. So multiprocessing problems may well
show up on the live ones where they won't on the test box. I shall try
building a kernel with the BSD scheduler and see what happens there.
Probably not today, as I am loath to cause any more downtime right now.

thanks,

-pete.

Garance A Drosihn

2009-Jan-10 03:19 UTC

Big problems with 7.1 locking up :-(

At 1:58 AM +0000 1/9/09, Pete French wrote:
>I have a number of HP 1U servers, all of which were running 7.0
>perfectly happily. I have been testing 7.1 in its various incarnations
>for the last couple of months on our test server and it has performed
>perfectly.
>
>So the last two days I have been round upgrading all our servers, knowing
>that I had run the system stably on identical hardware for some time.
>
>Since then I have started seeing machines lock up. This always happens
>under heavy disc load. When I bring the machine back up then sometimes
>it fails to fsck due to a partially truncated inode. The lockups appear
>to be disc related  [...]

One of my friends is also having trouble with lockups on two machines
he had upgraded to 7.1.  Also seems to be related to heavy disk I/O,
although I'm not sure the symptoms are the same as what you report.
Both machines had been running 7.0-release without trouble.  On at
least one of the systems, he's also working with (what I consider)
very large file systems (over 2 TB).  Both machines are using a 3ware
controller with its RAID.

I realize that isn't much to go on, but it suggests that there is
some problem wider than just your (Pete's) usage.  I think his
situation is such that lockups like this are simply not acceptable,
and the last I heard he was reverting back to 7.0-release.

-- 
Garance Alistair Drosehn            =   gad@gilead.netel.rpi.edu
Senior Systems Programmer           or  gad@freebsd.org
Rensselaer Polytechnic Institute    or  drosih@rpi.edu

Pete French

2009-Jan-11 09:16 UTC

Lock order reversals using bce in 7.1

Here is a better set of images. This machine was compiled
with the following config file:

include         GENERIC
ident           DEBUG

options         KDB
options         DDB
options         SW_WATCHDOG
options         DEBUG_VFS_LOCKS
options         MUTEX_DEBUG
options         WITNESS
options         WITNESS_KDB
options         LOCK_PROFILING
options         INVARIANTS
options         INVARIANT_SUPPORT
options         DIAGNOSTIC

On booting it almost immediately does this:

        http://www.twisted.org.uk/~pete/71_lor.png

The output of trace, show pcpu, show locks, show allpcpu and show alllocks
are shown in the following images:

        http://www.twisted.org.uk/~pete/71_locks_trace.png
        http://www.twisted.org.uk/~pete/71_pcpu_alllocks.png
        http://www.twisted.org.uk/~pete/71_allpcpu1.png
        http://www.twisted.org.uk/~pete/71_allpcpu2.png

I am going to revert the machine back to a normal kernel now - is there
anything I might be able to do to stop this, or do I need to roll everything
back to 7.0 ?

cheers,

-pete.

Dylan Cochran

2009-Jan-11 11:01 UTC

Big problems with 7.1 locking up :-(

On Sun, Jan 11, 2009 at 11:27 AM, Pete French
<petefrench@ticketswitch.com> wrote:
>> My kernconf is below, try building the kernel, and send an email
>> containing the backtrace from any process that has blocked (in my
>
> Well, I haven't managed to get a backtrace, but immediately upon
> booting the system halts with the following:
>
>        http://www.twisted.org.uk/~pete/71_lor1.jpg

Not Found

Pete French

2009-Jan-12 11:00 UTC

Big problems with 7.1 locking up :-(

> I'm not sure if you've done this already, but the normal suggestions apply:
> have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do
> any results / panics / etc result?  Sometimes these debugging tools are able
> to convert hangs into panics, which gives us much more ability to debug them.

OK, I have now had a machine hang again, with the correct debug options in
the kernel. The screen looked like this when I went to restart it:

	http://toybox.twisted.org.uk/~pete/71_lor2.png

It had not, however, dropped into any kind of debugger. Also there appear
to be console messages after the lock order reversal - is that normal ?

The machine did stay up for a significant amount of time before doing this. I
notice that it is more or less identical to the one I posted when I
had WITNESS_KDB in the kernel too, so maybe those results aren't
entirely spurious after all ?

Given it hasn't dropped to a debugger, is there anything else I can try ?

-pete.

Pete French

2009-Jan-13 03:50 UTC

Big problems with 7.1 locking up :-(

> It was mentioned previous in this thread that CPUTYPE could be an
> issue. Did you change this if you customized your kernel?

Actually, I think that's been ruled out as a possible cause, along
with the scheduler. Certainly I have tried it both ways and
there is no difference, and I think I saw that the others had too.

-pete.

Pete French

2009-Jan-13 06:11 UTC

Big problems with 7.1 locking up :-(

> Silly question but do you have powerd enabled on that server? If so,
> does disabling it help? Also do you have any of these in /etc/rc.conf
> (i.e., they are not the same as the default values in
> /etc/defaults/rc.conf):
> performance_cx_lowest="HIGH"    # Online CPU idle state
> performance_cpu_freq="NONE"     # Online CPU frequency
> economy_cx_lowest="HIGH"        # Offline CPU idle state
> economy_cpu_freq="NONE"         # Offline CPU frequency

No, none of those. My rc.conf is below. The only slightly unusual thing I
am doing is using lagg rather than the interfaces directly I guess, but
that has worked fine for ages.

-pete.


hostname="florentine.rattatosk"
cloned_interfaces="lagg0"
network_interfaces="lo0 bce0 bce1 lagg0"
ifconfig_bce0="up"
ifconfig_bce1="up"
ifconfig_lagg0="laggproto lacp laggport bce0 laggport bce1"

ipv4_addrs_lagg0="10.48.19.0/16 10.48.19.229/16 10.48.19.223/16
10.48.19.243/16 10.48.19.226/16 10.48.19.224/16 10.48.19.227/16
10.48.19.239/16 10.48.19.225/16 10.48.19.230/16 10.48.19.232/16
10.48.19.228/16 10.48.19.235/16 10.48.19.244/16 10.48.19.245/16"

defaultrouter="10.48.0.9"

inetd_enable="YES"
sshd_enable="YES"

dhcpd_enable="YES"
dhcpd_ifaces="lagg0"
dhcpd_flags="-q"
dhcpd_conf="/usr/local/etc/dhcpd.conf"
dhcpd_withumask="022"

nfs_client_enable="YES"
nfs_server_enable="YES"
portmap_enable="YES"
rpcbind_enable="YES"

named_enable="YES"
pdns_enable="YES"
pdns_recursor_enable="NO"

mysql_enable="YES"

apache22_http_accept_enable="YES"
apache22_enable="YES"

ntpd_enable="YES"
ntpd_sync_on_start="YES"

exim_enable="YES"
exim_flags="-bd -q10m"
sendmail_enable="NONE"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"

Andriy Gapon

2009-Jan-14 07:29 UTC

Simple? Hardware upgrade.

on 14/01/2009 16:34 Jorge Biquez said the following:
> b) If is possible to "clone" the same installation to a new faster disk
> (like a sata 250GB). I know I can install a /.x version and for sure
> will work but here the idea is to have things running as usual without
> problems. This installation is very stable and secure and has been with
> us for years.... we would like to keep it working for more years.... :=)

Somewhat tangential - are you sure that a new faster disk would really
perform faster on that old PIII system? Even if you use an expansion
card (which itself might require updates to kernel, ata driver at
least), PCI bus speed will stay limited to the same old value.

-- 
Andriy Gapon

Erik Trulsson

2009-Jan-14 07:51 UTC

Simple? Hardware upgrade.

On Wed, Jan 14, 2009 at 05:28:45PM +0200, Andriy Gapon wrote:
> on 14/01/2009 16:34 Jorge Biquez said the following:
> > b) If is possible to "clone" the same installation to a new faster disk
> > (like a sata 250GB). I know I can install a /.x version and for sure
> > will work but here the idea is to have things running as usual without
> > problems. This installation is very stable and secure and has been with
> > us for years.... we would like to keep it working for more years.... :=)
>
> Somewhat tangential - are you sure that a new faster disk would really
> perform faster on that old PIII system? Even if you use an expansion
> card (which itself might require updates to kernel, ata driver at
> least), PCI bus speed will stay limited to the same old value.

The PCI bus should still be much faster than his old disk, so there is almost
certainly room for improvement.
(The latest generation of hard disks, on the other hand, is fast enough that
a standard 32-bit/33MHz PCI bus can actually be a bottleneck.)
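
As a rough check on that figure, the theoretical peak of a PCI bus is its
width in bytes times its clock. A small sketch in sh; the function name
pci_peak_mb_s is made up for illustration, and the clock is rounded down
to 33 MHz (the nominal 33.33 MHz gives about 133 MB/s).

```shell
#!/bin/sh
# Theoretical peak PCI bandwidth in MB/s: (bus width in bits / 8) * clock in MHz.
# pci_peak_mb_s is a HYPOTHETICAL helper name, not an existing tool.
pci_peak_mb_s() {
    width_bits=$1
    clock_mhz=$2
    echo $(( width_bits / 8 * clock_mhz ))
}

pci_peak_mb_s 32 33   # standard 32-bit/33MHz PCI
```

That ceiling is shared by every device on the bus, and real transfers
fall short of it, so treat the number as an upper bound only.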


-- 
<Insert your favourite quote here.>
Erik Trulsson
ertr1013@student.uu.se

Mike Tancsa

2009-Jan-14 09:00 UTC

Simple? Hardware upgrade.

At 09:34 AM 1/14/2009, Jorge Biquez wrote:
>Hello all.
>
>I have a 4.11 Stable version that has been working without problems 
>in the last years. We do not need nothing else for the moment but we 
>are looking to have more speed. It has been running under a double 
>Pentium III processor with 512MB of ram and it has a disk of 40GB.
>
>I was wondering of it is possible to do 2 things.
>
>a) Only put the disk in a new machine at least a double core with 
>2GB of RAM. My guess is that could boot with a few problems on 
>hardware.... what do you think?
>
>b) If is possible to "clone" the same installation to a new faster
>disk (like a sata 250GB). I know I can install a /.x version and for 
>sure will work but here the idea is to have things running as usual 
>without problems. This installation is very stable and secure and 
>has been with us for years.... we would like to keep it working for 
>more years.... :=)
>
>on b). Is there a simple way to do it?

Copying the disk is easy enough. However, 4.11 is VERY old and doesn't
necessarily support the latest in hardware or even recent
hardware, e.g. it might not recognize the SATA controller, or might
not work well with it.  Cloning the disk is easy.  dump | restore
will work well. Google for the terms "copy disk dump restore
freebsd" and you will find lots of HOWTO docs.

What I suggest is if you really can't start fresh with 7.1R, install a
fresh copy of 4.11 onto the new hardware and make sure it works. Then
try duplicating the disk via dump and restore.
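
A rough sketch of the dump | restore step for one filesystem, in sh. It
only prints the commands instead of running them, so it is safe to try.
The function name clone_plan, the target partition /dev/ad1s1f and the
mount point /mnt are assumptions for illustration; slicing and labelling
the new disk and installing boot blocks are not shown.

```shell
#!/bin/sh
# Print (do not run) the commands to copy one mounted filesystem onto a
# fresh partition with dump | restore. clone_plan is a HYPOTHETICAL helper;
# the device and mount point passed to it below are examples only.
clone_plan() {
    src=$1   # mounted source filesystem, e.g. /usr
    dev=$2   # target partition on the new disk
    mnt=$3   # temporary mount point for the target
    echo "newfs $dev"
    echo "mount $dev $mnt"
    # Level 0 dump written to stdout, restored into the current directory.
    echo "cd $mnt && dump -0af - $src | restore -rf -"
    echo "umount $mnt"
}

clone_plan /usr /dev/ad1s1f /mnt
```

Repeat once per filesystem (/, /var, /usr and so on), and check the
printed commands against your own device names before running any of them.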

         ---Mike

Pete French

2009-Jan-16 07:40 UTC

Big problems with 7.1 locking up :-(

> If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good.
> WITNESS does a number of things, including tracking (and being judgemental
> about) lock order.  One nice side effect of that tracking is that we keep
> track of a lot more lock state explicitly, so DDB's "show allocks", "show
> locks", etc, commands can build on that.  "show lockedvnods" works without
> WITNESS, though, so your results so far suggest this is likely not related
> to vnode locking.

Right, I've gone back to my DEBUG kernel which has a lot of options in it,
including all the above. It has locked almost immediately luckily, so
now I have it sitting at the debugger prompt. The output from 'show alllocks'
is here:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

Which of these are worth tracing ?

-pete.

Michel Talon

2009-Jan-18 04:51 UTC

Big problems with 7.1 locking up :-(

Tomas Randa wrote:
> Hello,
>
> I have similar problems. The last "good" kernel I have from stable
> branch, october the 8. Then in next upgrade, I saw big problems with
> performance.

I can add a "me too" here. This is on my desktop, very lightly loaded.
This computer never had a single problem under FreeBSD so I don't suspect
a hardware problem. My previous upgrade was FreeBSD 7.0-STABLE #0: Tue
Jul 22, and worked perfectly fine with exactly the same software
configuration.
Now I have FreeBSD 7.1-STABLE #0: Mon Jan  5, and the situation is
disastrous. Freshly after boot the machine seems to work normally, but
after a few days it becomes slower and slower, windows take seconds to
appear, firefox3 begins to have garbled output, etc. Then I had the
following problem: firefox got stuck in the kernel, impossible to kill with
kill -9. Needless to say I inspected everything, dmesg, xsession-errors,
top, etc. without seeing anything suspicious. So I rebooted, and bingo!
the machine panicked, mentioning firefox. But the panic itself got stuck
and I had to push the reset button, so no dump. After reboot, the machine
works OK for two or three days, then problems begin again. I am
convinced there is a big problem in the kernel. For reference, here is
top and dmesg:

CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 264M Active, 613M Inact, 485M Wired, 22M Cache, 112M Buf, 116M Free
Swap: 2023M Total, 4K Used, 2023M Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
62965 michel           1  44    0  3532K  1884K CPU1   1   0:00  0.29% top
 2327 root             1  44    0   161M 29228K select 1  30:39  0.00% Xorg
95937 root             1  44    0 24112K 16800K select 1   2:35  0.00% kdm-bin_gr
 3099 root             1   4    0  3304K  1028K select 0   1:30  0.00% moused
 2209 news             1   8    0  3464K  1052K wait   0   0:37  0.00% sh
  884 root             1  44    0  4712K  2028K select 1   0:12  0.00% ntpd
  453 _pflogd          1 -58    0  3380K  1352K bpf    0   0:11  0.00% pflogd
 1634 www              1   4    0  6268K  2656K kqread 0   0:10  0.00% lighttpd
  788 root             1  44    0  3164K  3184K select 0   0:04  0.00% amd
 2206 news             1  44    0 15208K 12160K select 0   0:03  0.00% innd
  879 root             9   4    0  5432K  2460K kqread 1   0:02  0.00% nscd
  955 root             1  44    0  2736K  1216K select 1   0:02  0.00% master
  758 root             1  44    0  3164K  1340K select 1   0:02  0.00% ypbind
...........

so no memory problem

Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.1-STABLE #0: Mon Jan  5 14:29:23 CET 2009
    michel@niobe.lpthe.jussieu.fr:/usr/obj/usr/src/sys/NIOBE
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 3.06GHz (3073.65-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf27  Stepping = 7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4400<CNXT-ID,xTPR>
  Logical CPUs per core: 2
real memory  = 1610530816 (1535 MB)
avail memory = 1568387072 (1495 MB)
ACPI APIC Table: <ASUS   P4PE    >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
This module (opensolaris) contains code covered by the
Common Development and Distribution License (CDDL)
see http://opensolaris.org/os/licensing/opensolaris_license/
ioapic0 <Version 2.0> irqs 0-23 on motherboard
acpi0: <ASUS P4PE> on motherboard
acpi0: Overriding SCI Interrupt from IRQ 9 to IRQ 22
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, 5ff00000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0xe408-0xe40b on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
agp0: <Intel 82845G host to AGP bridge> on hostb0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
vgapci0: <VGA-compatible display> port 0xd800-0xd8ff mem 0xe0000000-0xefffffff,0xdf000000-0xdf00ffff irq 16 at device 0.0 on pci1
uhci0: <Intel 82801DB (ICH4) USB controller USB-A> port 0xb800-0xb81f irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
uhci0: [ITHREAD]
usb0: <Intel 82801DB (ICH4) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb0
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801DB (ICH4) USB controller USB-B> port 0xb400-0xb41f irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
uhci1: [ITHREAD]
usb1: <Intel 82801DB (ICH4) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801DB (ICH4) USB controller USB-C> port 0xb000-0xb01f irq 18 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
uhci2: [ITHREAD]
usb2: <Intel 82801DB (ICH4) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb2
uhub2: 2 ports with 2 removable, self powered
ehci0: <Intel 82801DB/L/M (ICH4) USB 2.0 controller> mem 0xde800000-0xde8003ff irq 23 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
ehci0: [ITHREAD]
usb3: EHCI version 1.0
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: <Intel 82801DB/L/M (ICH4) USB 2.0 controller> on ehci0
usb3: USB revision 2.0
uhub3: <Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1> on usb3
uhub3: 6 ports with 6 removable, self powered
pcib2: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci2: <ACPI PCI bus> on pcib2
bfe0: <Broadcom BCM4401 Fast Ethernet> mem 0xde000000-0xde001fff irq 20 at device 5.0 on pci2
miibus0: <MII bus> on bfe0
bmtphy0: <BCM4401 10/100baseTX PHY> PHY 1 on miibus0
bmtphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
bfe0: Ethernet address: 00:0c:6e:04:5d:39
bfe0: [ITHREAD]
fxp0: <Intel 82559 Pro/100 Ethernet> port 0xa800-0xa83f mem 0xdd800000-0xdd800fff,0xdd000000-0xdd0fffff irq 23 at device 11.0 on pci2
miibus1: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> PHY 1 on miibus1
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:02:b3:1d:df:8e
fxp0: [ITHREAD]
sym0: <875> port 0xa400-0xa4ff mem 0xdc800000-0xdc8000ff,0xdc000000-0xdc000fff irq 20 at device 12.0 on pci2
sym0: Tekram NVRAM, ID 7, Fast-20, SE, parity checking
sym0: [ITHREAD]
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH4 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f irq 18 at device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci0
ata1: [ITHREAD]
pcm0: <Intel ICH4 (82801DB)> port 0x9800-0x98ff,0x9400-0x943f mem 0xdb800000-0xdb8001ff,0xdb000000-0xdb0000ff irq 17 at device 31.5 on pci0
pcm0: [ITHREAD]
pcm0: <Analog Devices AD1980 AC97 Codec>
fdc0: <floppy drive controller> port 0x3f2-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: [FILTER]
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio0: [FILTER]
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
sio1: [FILTER]
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
cpu0: <ACPI CPU> on acpi0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
cpu1: <ACPI CPU> on acpi0
p4tcc1: <CPU Frequency Thermal Control> on cpu1
pmtimer0 on isa0
orm0: <ISA Option ROMs> at iomem 0xc0000-0xccfff,0xd0000-0xd0fff pnpid ORM0000 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
ppc0: <Parallel port> at port 0x378-0x37f irq 7 on isa0
ppc0: Generic chipset (EPP/NIBBLE) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
ppbus0: [ITHREAD]
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
ppc0: [GIANT-LOCKED]
ppc0: [ITHREAD]
ums0: <vendor 0x04d9 product 0x048e, class 0/0, rev 1.10/8.00, addr 2> on uhub1
ums0: 5 buttons and Z dir.
WARNING: ZFS is considered to be an experimental feature in FreeBSD.
Timecounters tick every 1.000 msec
Waiting 5 seconds for SCSI devices to settle
ZFS filesystem version 6
ZFS storage pool version 6
ad0: 58644MB <Maxtor 6Y060L0 YAR41VW0> at ata0-master UDMA100
acd0: DVDR <TSSTcorpCD/DVDW SH-S182D/SB02> at ata1-master UDMA33
acd0: FAILURE - INQUIRY ILLEGAL REQUEST asc=0x24 ascq=0x00 
(probe5:sym0:0:5:0): phase change 6-7 6@01a0c7a8 resid=4.
(da0:sym0:0:5:0): phase change 6-7 6@01a0c7a8 resid=4.
da0 at sym0 bus 0 target 5 lun 0
da0: <IOMEGA ZIP 100 J.03> Removable Direct Access SCSI-2 device 
da0: 3.300MB/s transfers
da0: 96MB (196608 512 byte sectors: 64H 32S/T 96C)
cd0 at ata1 bus 0 target 0 lun 0
cd0: <TSSTcorp CD/DVDW SH-S182D SB02> Removable CD-ROM SCSI-0 device 
cd0: 33.000MB/s transfers
cd0: Attempt to query device size failed: NOT READY, Medium not present - tray closed
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad0s1a
WARNING: / was not properly dismounted
WARNING: /home was not properly dismounted
/home: mount pending error: blocks 128 files 10
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted
WARNING: TMPFS is considered to be a highly experimental feature in FreeBSD.
fxp0: Microcode loaded, int_delay: 1000 usec  bundle_max: 6
fxp0: Microcode loaded, int_delay: 1000 usec  bundle_max: 6
fxp0: Microcode loaded, int_delay: 1000 usec  bundle_max: 6
fxp0: Microcode loaded, int_delay: 1000 usec  bundle_max: 6

-- 

Michel TALON

Pete French

2009-Jan-19 03:39 UTC

Big problems with 7.1 locking up :-(

> yes, do ps - threads in state L or LL and RUN are especially interesting,
> trace of pids 28, 27, and threads which L on locked chan.

Here's the output of alllocks,

	http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

here are the pages of PS:

	http://toybox.twisted.org.uk/~pete/71_lock_ps2/

(next time I boot this I will disable http to avoid getting so many)

I can't see any which are in L, LL or RUN state there though. A few RL
and WL towards the end. Traces on 28 and 27 are here:

	http://toybox.twisted.org.uk/~pete/71_trace_28.png
	http://toybox.twisted.org.uk/~pete/71_trace_27a.png
	http://toybox.twisted.org.uk/~pete/71_trace_27b.png

I also did traces on 19 and 16 as (like 28 and 27) they are in a "CPU"
state, so may be of interest ?

	http://toybox.twisted.org.uk/~pete/71_trace_19.png
	http://toybox.twisted.org.uk/~pete/71_trace_16.png

-pete.

Chagin Dmitry

2009-Jan-19 05:25 UTC

Big problems with 7.1 locking up :-(

On Mon, Jan 19, 2009 at 11:39:08AM +0000, Pete French wrote:
> > yes, do ps - threads in state L or LL and RUN are especially interesting,
> > trace of pids 28, 27, and threads which L on locked chan.
> 
> heres the output of alllocks,
> 
> 	http://toybox.twisted.org.uk/~pete/71_show_alllocks.png
> 
> here are the pages of PS:
> 
> 	http://toybox.twisted.org.uk/~pete/71_lock_ps2/
> 
> (next time I boot this I will disable http to avoid getting so many)
> 
> I cant see any which are in L, LL or RUN state there though. A few RL
> and WL towards the end. Traces on 28 and 27 are here:
> 
> 	http://toybox.twisted.org.uk/~pete/71_trace_28.png
> 	http://toybox.twisted.org.uk/~pete/71_trace_27a.png
> 	http://toybox.twisted.org.uk/~pete/71_trace_27b.png
> 
> I also did traces on 19 and 16 as (like 28 and 27) they are in a "CPU"
> state, so may be of interest ?
> 
> 	http://toybox.twisted.org.uk/~pete/71_trace_19.png
> 	http://toybox.twisted.org.uk/~pete/71_trace_16.png
> 

This is probably your case; please try:

http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat
-- 
Have fun!
chd

Pete Carah

2009-Jan-19 05:40 UTC

Big problems with 7.1 locking up :-(

Kris writes:
> You and anyone else seeing performance problems should try to work
> through the advice given here:
>
> [1]http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf

Well, all the people in this thread have noticed that WITH NO CONFIG
CHANGES from configs that worked fine in the past, their systems are very
slow and/or locking up (mine are both) with the stable branch sometime (I
noticed it sometime in December, but it got worse with the release.)
Most were OK in October; mine (I think) were OK in late November - may
narrow things down?  Two of my systems that lock up have no internal
visibility when they do (Soekris 4801's routing; the only time-intensive
things running are routing (done in irq context) and pflog.  These run
with 60+ meg ram free.)  These are complete lockups, though I did manage
to get a ps out of my laptop last night by waiting 20 _minutes_ for it to
start (!).  This is not a generic performance problem.  The laptop had 55
minutes of cpu time in the softdepflush thread after being up about an
hour and 10 mins; this might give a hint.  I didn't spot LL/RL state
threads at the same time because I didn't know to.  Now I do.  BTW - the
same ps showed 8 or so user-space procs in R state with NO cpu time; the
kernel was hogging all of it for over an hour.
Firefox did indeed trigger this one as someone else noted.  A soekris
doing only routing+nat has no such excuse...  At least PHK was nice enough
to note the watchdog in another thread :-)

-- Pete

References

   1. http://people.freebsd.org/%7Ekris/scaling/Help_my_system_is_slow.pdf

Pete Carah

2009-Jan-19 14:20 UTC

Big problems with 7.1 locking up :-(

I have done some (lots of) kernel debugging in the past.  I have several 
points:

1. I shouldn't *have* to kernel debug for a normal usage of an
official release.

2. One of the soekris boxes is 2800 MILES away, in a remote location,
with no one present who is a skilled (or, indeed, any kind of) programmer.
I usually thought I could trust a release, especially when I had been
using the stable branch updated at about monthly intervals on 3 servers
with no problems.  (actually, I waited a while on 7.0 because .0 releases
are traditionally quirky; in this case 7.0-rel worked fine and 7.1 has
problems.)  (and my servers are still running the *same* compilation of
kernel/world with no problems; the hangs are unique to either the laptop
(which only started doing this badly with a Jan 9 csup) and the Soekris boxes
(which started hangs sometime in December; they clearly don't run X...)
[ I've backed my house source to -stable of 12/1/08 and hope this will help;
I don't have the time to fool around too much, and particularly to kernel
debug something that shouldn't need it.]

I can't even start X at all on this laptop now.  At least I can boot it,
but it isn't much use for work unless it can run X.

3. I can't afford the time to debug my tools (freebsd is a tool, not an
experiment, for lots of people, including me...)  I use this laptop at 
work in a place where I am *not* working on freebsd. (nor am I even allowed
to at work...)

-- Pete

Robert Watson

2009-Jan-29 14:39 UTC

Big problems with 7.1 locking up :-(

On Fri, 9 Jan 2009, Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0 perfectly
> happily. I have been testing 7.1 in it's various incarnations for the last
> couple of months on our test server and it has performed perfectly.
>
> So the last two days I have been round upgrading all our servers, knowing
> that I had run the system stably on identical hardware for some time.

For those following this other than Pete, with whom I've been in private
correspondence: it seems that he is running into two different deadlocks
in the routing code.  One of them (at least) is triggered by a lock order
problem relating to the processing of ICMP redirects -- these are uncommon
in most configurations, but there are quite a few on his network, so the
deadlock triggers quickly under load.  Kip Macy has corrected at least one
(both?) of the problems in head, and plans to MFC the fixes in the near
future.  We'll follow up further once the fixes are merged, and if any
further problems transpire.

Robert N M Watson
Computer Laboratory
University of Cambridge
>
> Since then I have starte seeing machines lock up. This always happens under
> heavy disc load. When I bring the machine back up then sometimes it fails
> to fsck due to a partialy truncated inode. The locksup appear to
> be disc related - on my mysql msater machine it will come back up with
> files somewhat shorted than  those which ahve aready been transmitted to
> the slave (i.e. some data was in memory, and claimed to have been written
> to the drive, but never made it onto the disc).
>
> The only time I have seen anything useful on the screen was during one
> lockup where I got a message about a spin lock being held too long and some
> comment in parentheses about it being a turnstile lock.
>
> Help! :-(
>
> I am now downgrading all the machine to 7.0 as fast as I can - though the
> machine I am trying to compile it on has locked up once during the compile
> so I havent got anywhere so far.
>
> The machines are HP Proliant DL360 G5s - they have an embedded P400i
> RAID controller with a pair of mirrored drives connected. Each one has
> both ethernets connected, bundled using lagg and LACP.
>
> Advice ?
>
> -pete.
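
[Editorial sketch, not part of the original message.] The "lock order problem" described above can be illustrated with a small checker that records the order in which locks are taken and flags a reversal, loosely in the spirit of what FreeBSD's WITNESS option reports. This is only an illustration: the lock names (`table_lock`, `route_lock`) and path names (`lookup`, `redirect`) are hypothetical and are not the actual locks or code paths in the FreeBSD routing code, and this is not the fix that went into head.

```python
class LockOrderChecker:
    """Records observed lock acquisition order and reports reversals."""

    def __init__(self):
        self.order = set()     # (first, second) pairs seen so far
        self.reversals = []    # (path, held_lock, wanted_lock) conflicts

    def acquire_path(self, path_name, lock_names):
        """Simulate one code path taking locks in the given order."""
        held = []
        for name in lock_names:
            for earlier in held:
                # If some path already took `name` before `earlier`,
                # taking them the other way round here is a reversal.
                if (name, earlier) in self.order:
                    self.reversals.append((path_name, earlier, name))
                self.order.add((earlier, name))
            held.append(name)


checker = LockOrderChecker()
# Hypothetical path 1: takes a table-wide lock, then a per-route lock.
checker.acquire_path("lookup", ["table_lock", "route_lock"])
# Hypothetical path 2: holds a per-route lock, then wants the table lock.
checker.acquire_path("redirect", ["route_lock", "table_lock"])
print(checker.reversals)
```

If two threads run these two paths at the same time, each can end up holding the lock the other one needs, and neither makes progress. The usual remedy is to make every path take the locks in one agreed order (or drop the first lock before taking the second), which is the kind of change a lock order fix typically involves.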

Pete French

2009-Feb-08 05:12 UTC

Big problems with 7.1 locking up :-(

> load.  Kip Macy has corrected at least one (both?) problems in head, and
> plans to MFC the fixes in the near future.  We'll follow up further once
> the fixes are merged, and if any further problems transpire.

Hi, just wondering if we are any closer to having the MFC for this yet, or
if there are any patches I could test ?

cheers,

-pete.

Mike Tancsa

2009-Feb-17 17:10 UTC

Big problems with 7.1 locking up :-(

At 05:38 PM 1/29/2009, Robert Watson wrote:
>On Fri, 9 Jan 2009, Pete French wrote:
>
>>I have a number of HP 1U servers, all of which were running 7.0 
>>perfectly happily. I have been testing 7.1 in it's various 
>>incarnations for the last couple of months on our test server and 
>>it has performed perfectly.
>>
>>So the last two days I have been round upgrading all our servers,
>>knowing that I had run the system stably on identical hardware for
>>some time.
>
>For those following this other than Pete, who I've been in private 
>correspondence with: it seems that he is running into two different 
>deadlocks in the routing code.  One of them (at least) is triggered 
>by a lock order problem relating to the processing of ICMP redirects 
>-- uncommon in most configurations, but quite a few on his network, 
>which triggers quickly under load.  Kip Macy has corrected at least 
>one (both?) problems in head, and plans to MFC the fixes in the near 
>future.  We'll follow up further once the fixes are merged, and if 
>any further problems transpire.

Hi Robert,
         Do you have any other details about these issues? Were the
fixes ever MFC'd?

         ---Mike
