I have a number of HP 1U servers, all of which were running 7.0 perfectly happily. I have been testing 7.1 in its various incarnations for the last couple of months on our test server and it has performed perfectly.

So the last two days I have been round upgrading all our servers, knowing that I had run the system stably on identical hardware for some time.

Since then I have started seeing machines lock up. This always happens under heavy disc load. When I bring the machine back up it sometimes fails to fsck due to a partially truncated inode. The lockups appear to be disc related - on my mysql master machine it will come back up with files somewhat shorter than those which have already been transmitted to the slave (i.e. some data was in memory, and claimed to have been written to the drive, but never made it onto the disc).

The only time I have seen anything useful on the screen was during one lockup where I got a message about a spin lock being held too long and some comment in parentheses about it being a turnstile lock.

Help! :-(

I am now downgrading all the machines to 7.0 as fast as I can - though the machine I am trying to compile it on has locked up once during the compile so I haven't got anywhere so far.

The machines are HP Proliant DL360 G5s - they have an embedded P400i RAID controller with a pair of mirrored drives connected. Each one has both ethernets connected, bundled using lagg and LACP.

Advice ?

-pete.
Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0
> perfectly happily. I have been testing 7.1 in its various incarnations
> for the last couple of months on our test server and it has performed
> perfectly.
>
> So the last two days I have been round upgrading all our servers, knowing
> that I had run the system stably on identical hardware for some time.
>
> Since then I have started seeing machines lock up. This always happens under
> heavy disc load. When I bring the machine back up it sometimes fails
> to fsck due to a partially truncated inode. The lockups appear to
> be disc related - on my mysql master machine it will come back up with
> files somewhat shorter than those which have already been transmitted to
> the slave (i.e. some data was in memory, and claimed to have been written
> to the drive, but never made it onto the disc).
>
> The only time I have seen anything useful on the screen was during one lockup
> where I got a message about a spin lock being held too long and some
> comment in parentheses about it being a turnstile lock.
>
> Help! :-(
>
> I am now downgrading all the machines to 7.0 as fast as I can - though the
> machine I am trying to compile it on has locked up once during the compile
> so I haven't got anywhere so far.
>
> The machines are HP Proliant DL360 G5s - they have an embedded P400i
> RAID controller with a pair of mirrored drives connected. Each one has
> both ethernets connected, bundled using lagg and LACP.

I can't tell whether my situation is related, but I am seeing lockups on SMP Supermicro servers with both older (NetBurst-ish) and current Xeon CPUs. I have been dropping into the kernel debugger and getting lock information and process backtraces, but so far nothing has been conclusively identified.
I think the issue I'm seeing was introduced sometime between October 2 and November 24 in the RELENG_7 branch, and I suppose the next step is to do a binary search for the offending change. Guy -- Guy Helmer, Ph.D. Chief System Architect Palisade Systems, Inc.
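
The binary search mentioned above can be sketched roughly as follows. This is only an illustration, assuming the regression is monotonic (every checkout after the offending change misbehaves). `first_bad_date` and `is_bad` are hypothetical names; in practice `is_bad` would mean checking out RELENG_7 as of that date, building a kernel and running the load that triggers the lockup. The year 2008 for the October 2 to November 24 window is inferred from the January 2009 dates in this thread, and the breakage date in the example is made up.

```python
from datetime import date, timedelta
from typing import Callable


def first_bad_date(good: date, bad: date, is_bad: Callable[[date], bool]) -> date:
    """Return the earliest date whose checkout misbehaves.

    Assumes `good` is known to work, `bad` is known to fail, and that every
    date after the first failing one also fails.
    """
    while (bad - good).days > 1:
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid   # failure already present at mid: look earlier
        else:
            good = mid  # mid still works: look later
    return bad


# Hypothetical stand-in for "build that day's RELENG_7 and stress-test it".
# The date below is invented purely so the sketch can run.
PRETEND_BREAKAGE = date(2008, 10, 20)
tested = []


def is_bad(day: date) -> bool:
    tested.append(day)
    return day >= PRETEND_BREAKAGE


culprit = first_bad_date(date(2008, 10, 2), date(2008, 11, 24), is_bad)
print(culprit, "found after", len(tested), "test builds")
```

For a window of 53 days this needs at most six build-and-test cycles, which is why bisecting is usually worth the effort even when each test takes hours.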
On Jan 8, 2009, at 8:58 PM, Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0
> perfectly happily. I have been testing 7.1 in its various
> incarnations
> for the last couple of months on our test server and it has performed
> perfectly.

I noticed a problem with 7.0 on a couple of Dell servers. Not sure if this is related, but when our system "froze" the box was pingable, and you could switch virtual consoles... however, you could not type anything on the screen or connect to any sockets. Num-lock would still work, so the box wasn't solidly frozen.

This used to happen a couple of times every week or two. We've since compiled the kernel with the BSD scheduler to rule that out, and so far so good.

(Our box was a Dell PE1750, 2GB of RAM, amr RAID controller, bge network driver.)

The primary application was just ntpd and apache with mpm_worker & threads.

Since ULE is now default in 7.1 and not in 7.0, perhaps you can try that?

--
Robert Blayzor, BOFH
INOC, LLC
rblayzor@inoc.net
http://www.inoc.net/~rblayzor/
> Since ULE is now default in 7.1 and not in 7.0, perhaps you can try
> that?

Actually you might be on to something there.... one of the main differences between our test DL360 and the live ones is that the test one has fewer cores in it, and is under less load. So multiprocessing problems may well show up on the live machines where they won't on the test box.

I shall try building a kernel with the BSD scheduler and see what happens there. Probably not today, as I am loath to cause any more downtime right now.

thanks,

-pete.
At 1:58 AM +0000 1/9/09, Pete French wrote:
>I have a number of HP 1U servers, all of which were running 7.0
>perfectly happily. I have been testing 7.1 in its various incarnations
>for the last couple of months on our test server and it has performed
>perfectly.
>
>So the last two days I have been round upgrading all our servers, knowing
>that I had run the system stably on identical hardware for some time.
>
>Since then I have started seeing machines lock up. This always happens
>under heavy disc load. When I bring the machine back up it sometimes
>fails to fsck due to a partially truncated inode. The lockups appear
>to be disc related [...]

One of my friends is also having trouble with lockups on two machines he had upgraded to 7.1. It also seems to be related to heavy disk I/O, although I'm not sure the symptoms are the same as what you report. Both machines had been running 7.0-release without trouble. On at least one of the systems, he's also working with (what I consider) very large file systems (over 2 TB). Both machines are using a 3ware controller with its RAID.

I realize that isn't much to go on, but it suggests that there is some problem wider than just your (Pete's) usage. I think his situation is such that lockups like this are simply not acceptable, and the last I heard he was reverting back to 7.0-release.

--
Garance Alistair Drosehn         =  gad@gilead.netel.rpi.edu
Senior Systems Programmer        or gad@freebsd.org
Rensselaer Polytechnic Institute or drosih@rpi.edu
Here is a better set of images. This machine was compiled with the following config file:

    include GENERIC
    ident DEBUG
    options KDB
    options DDB
    options SW_WATCHDOG
    options DEBUG_VFS_LOCKS
    options MUTEX_DEBUG
    options WITNESS
    options WITNESS_KDB
    options LOCK_PROFILING
    options INVARIANTS
    options INVARIANT_SUPPORT
    options DIAGNOSTIC

On booting it almost immediately does this:

http://www.twisted.org.uk/~pete/71_lor.png

The output of trace, show pcpu, show locks, show allpcpu and show alllocks is shown in the following images:

http://www.twisted.org.uk/~pete/71_locks_trace.png
http://www.twisted.org.uk/~pete/71_pcpu_alllocks.png
http://www.twisted.org.uk/~pete/71_allpcpu1.png
http://www.twisted.org.uk/~pete/71_allpcpu2.png

I am going to revert the machine back to a normal kernel now - is there anything I might be able to do to stop this, or do I need to roll everything back to 7.0 ?

cheers,

-pete.
On Sun, Jan 11, 2009 at 11:27 AM, Pete French <petefrench@ticketswitch.com> wrote:
>> My kernconf is below, try building the kernel, and send an email
>> containing the backtrace from any process that has blocked (in my
>
> Well, I haven't managed to get a backtrace, but immediately upon
> booting the system halts with the following:
>
> http://www.twisted.org.uk/~pete/71_lor1.jpg

Not Found
> I'm not sure if you've done this already, but the normal suggestions apply:
> have you compiled with INVARIANTS/WITNESS/DDB/KDB/BREAK_TO_DEBUGGER, and do
> any results / panics / etc result? Sometimes these debugging tools are able
> to convert hangs into panics, which gives us much more ability to debug them.

OK, I have now had a machine hang again, with the correct debug options in the kernel. The screen looked like this when I went to restart it:

http://toybox.twisted.org.uk/~pete/71_lor2.png

It had not, however, dropped into any kind of debugger. Also there appear to be console messages after the lock order reversal - is that normal ? The machine did stay up for a significant amount of time before doing this.

I notice that it is more or less identical to the one I posted when I had WITNESS_KDB in the kernel too, so maybe those results aren't entirely spurious after all ?

Given it hasn't dropped to a debugger, is there anything else I can try ?

-pete.
> It was mentioned previously in this thread that CPUTYPE could be an
> issue. Did you change this if you customized your kernel?

Actually, I think that's been ruled out as a possible cause, along with the scheduler. Certainly I have tried it both ways and there is no difference, and I think I saw that the others had too.

-pete.
> Silly question but do you have powerd enabled on that server? If so,
> does disabling it help? Also do you have any of these in /etc/rc.conf
> (i.e., they are not the same as the default values in
> /etc/defaults/rc.conf):
> performance_cx_lowest="HIGH"    # Online CPU idle state
> performance_cpu_freq="NONE"     # Online CPU frequency
> economy_cx_lowest="HIGH"        # Offline CPU idle state
> economy_cpu_freq="NONE"         # Offline CPU frequency

No, none of those. My rc.conf is below. The only slightly unusual thing I am doing is using lagg rather than the interfaces directly I guess, but that has worked fine for ages.

-pete.

hostname="florentine.rattatosk"
cloned_interfaces="lagg0"
network_interfaces="lo0 bce0 bce1 lagg0"
ifconfig_bce0="up"
ifconfig_bce1="up"
ifconfig_lagg0="laggproto lacp laggport bce0 laggport bce1"
ipv4_addrs_lagg0="10.48.19.0/16 10.48.19.229/16 10.48.19.223/16 10.48.19.243/16 10.48.19.226/16 10.48.19.224/16 10.48.19.227/16 10.48.19.239/16 10.48.19.225/16 10.48.19.230/16 10.48.19.232/16 10.48.19.228/16 10.48.19.235/16 10.48.19.244/16 10.48.19.245/16"
defaultrouter="10.48.0.9"
inetd_enable="YES"
sshd_enable="YES"
dhcpd_enable="YES"
dhcpd_ifaces="lagg0"
dhcpd_flags="-q"
dhcpd_conf="/usr/local/etc/dhcpd.conf"
dhcpd_withumask="022"
nfs_client_enable="YES"
nfs_server_enable="YES"
portmap_enable="YES"
rpcbind_enable="YES"
named_enable="YES"
pdns_enable="YES"
pdns_recursor_enable="NO"
mysql_enable="YES"
apache22_http_accept_enable="YES"
apache22_enable="YES"
ntpd_enable="YES"
ntpd_sync_on_start="YES"
exim_enable="YES"
exim_flags="-bd -q10m"
sendmail_enable="NONE"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"
on 14/01/2009 16:34 Jorge Biquez said the following:
> b) If is possible to "clone" the same installation to a new faster disk
> (like a sata 250GB). I know I can install a 7.x version and for sure
> will work but here the idea is to have things running as usual without
> problems. This installation is very stable and secure and has been with
> us for years.... we would like to keep it working for more years.... :=)

Somewhat tangential - are you sure that a new faster disk would really perform faster on that old PIII system? Even if you use an expansion card (which itself might require updates to the kernel, or at least to the ata driver), PCI bus speed will stay limited to the same old value.

--
Andriy Gapon
On Wed, Jan 14, 2009 at 05:28:45PM +0200, Andriy Gapon wrote:
> on 14/01/2009 16:34 Jorge Biquez said the following:
> > b) If is possible to "clone" the same installation to a new faster disk
> > (like a sata 250GB). I know I can install a 7.x version and for sure
> > will work but here the idea is to have things running as usual without
> > problems. This installation is very stable and secure and has been with
> > us for years.... we would like to keep it working for more years.... :=)
>
> Somewhat tangential - are you sure that a new faster disk would really
> perform faster on that old PIII system? Even if you use an expansion
> card (which itself might require updates to kernel, ata driver at
> least), PCI bus speed will stay limited to the same old value.

The PCI bus should still be much faster than his old disk, so there is almost certainly room for improvement. (The latest generation of hard disks, on the other hand, is fast enough that a standard 32-bit/33MHz PCI bus can actually be a bottleneck.)

--
<Insert your favourite quote here.>
Erik Trulsson
ertr1013@student.uu.se
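
As a rough check on the numbers behind this point: a 32-bit/33MHz PCI bus peaks at about 133 MB/s in theory. The sketch below only does that arithmetic. The disk throughput figures in it are illustrative assumptions, not measurements from this thread, and real PCI throughput is lower than the theoretical peak because the bus is shared and has protocol overhead.

```python
# Theoretical peak of conventional PCI: 32 bits per transfer at 33.33 MHz.
BUS_WIDTH_BITS = 32
BUS_CLOCK_HZ = 33_333_333

pci_peak_mb_s = BUS_WIDTH_BITS / 8 * BUS_CLOCK_HZ / 1_000_000

# Illustrative sustained-read figures (assumptions for the sake of the example).
disks_mb_s = {
    "old 40GB disk (assumed)": 30,
    "newer SATA disk (assumed)": 90,
}


def bus_share(disk_mb_s: float) -> float:
    """Fraction of the theoretical PCI peak that one disk would consume."""
    return disk_mb_s / pci_peak_mb_s


print(f"PCI 32-bit/33MHz peak: {pci_peak_mb_s:.0f} MB/s")
for name, speed in disks_mb_s.items():
    print(f"{name}: {speed} MB/s, {bus_share(speed):.0%} of the bus")
```

Under these assumed figures the old disk uses only a small part of the bus, while a newer disk gets close enough to the peak that the bus, shared with the network cards, could start to matter.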
At 09:34 AM 1/14/2009, Jorge Biquez wrote:
>Hello all.
>
>I have a 4.11 Stable version that has been working without problems
>in the last years. We do not need nothing else for the moment but we
>are looking to have more speed. It has been running under a double
>Pentium III processor with 512MB of ram and it has a disk of 40GB.
>
>I was wondering if it is possible to do 2 things.
>
>a) Only put the disk in a new machine at least a double core with
>2GB of RAM. My guess is that could boot with a few problems on
>hardware.... what do you think?
>
>b) If is possible to "clone" the same installation to a new faster
>disk (like a sata 250GB). I know I can install a 7.x version and for
>sure will work but here the idea is to have things running as usual
>without problems. This installation is very stable and secure and
>has been with us for years.... we would like to keep it working for
>more years.... :=)
>
>on b). Is there a simple way to do it?

Copying the disk is easy enough. However, 4.11 is VERY old and doesn't necessarily support the latest hardware, or even recent hardware. E.g. it might not recognize the SATA controller, or might not work well with it.

Cloning the disk is easy. dump | restore will work well. Google for the terms "copy disk dump restore freebsd" and you will find lots of HOWTO docs.

What I suggest is, if you really can't start fresh with 7.1R, install a fresh copy of 4.11 onto the new hardware and make sure it works. Then try duplicating the disk via dump and restore.

---Mike
> If you do INVARIANTS + WITNESS + WITNESS_SKIPSPIN, that should be good.
> WITNESS does a number of things, including tracking (and being judgemental
> about) lock order. One nice side effect of that tracking is that we keep
> track of a lot more lock state explicitly, so DDB's "show alllocks", "show
> locks", etc, commands can build on that. "show lockedvnods" works without
> WITNESS, though, so your results so far suggest this is likely not related to
> vnode locking.

Right, I've gone back to my DEBUG kernel which has a lot of options in it, including all the above. Luckily it has locked up almost immediately, so now I have it sitting at the debugger prompt. The output from 'show alllocks' is here:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

Which of these are worth tracing ?

-pete.
Tomas Randa wrote:
> Hello,
>
> I have similar problems. The last "good" kernel I have is from the stable
> branch, October the 8th. Then in the next upgrade, I saw big problems with
> performance.

I can add a "me too" here. This is on my desktop, very lightly loaded. This computer never had a single problem under FreeBSD so I don't suspect a hardware problem. My previous upgrade was FreeBSD 7.0-STABLE #0: Tue Jul 22, and worked perfectly fine with exactly the same software configuration. Now I have FreeBSD 7.1-STABLE #0: Mon Jan 5, and the situation is disastrous.

Freshly after boot the machine seems to work normally, but after a few days it becomes slower and slower, windows take seconds to appear, firefox3 begins to have garbled output, etc. Then I had the following problem: firefox got stuck in the kernel, impossible to kill even with kill -9. Needless to say I inspected everything, dmesg, xsession-errors, top, etc. without seeing anything suspicious. So I rebooted, and bingo! the machine panicked, mentioning firefox. But the panic itself got stuck and I had to push the reset button, so no dump.

After reboot, the machine works OK for two or three days, then the problems begin again. I am convinced there is a big problem in the kernel.
For reference, here is top and dmesg:

CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 264M Active, 613M Inact, 485M Wired, 22M Cache, 112M Buf, 116M Free
Swap: 2023M Total, 4K Used, 2023M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
62965 michel      1  44    0  3532K  1884K CPU1   1   0:00  0.29% top
 2327 root        1  44    0   161M 29228K select 1  30:39  0.00% Xorg
95937 root        1  44    0 24112K 16800K select 1   2:35  0.00% kdm-bin_gr
 3099 root        1   4    0  3304K  1028K select 0   1:30  0.00% moused
 2209 news        1   8    0  3464K  1052K wait   0   0:37  0.00% sh
  884 root        1  44    0  4712K  2028K select 1   0:12  0.00% ntpd
  453 _pflogd     1 -58    0  3380K  1352K bpf    0   0:11  0.00% pflogd
 1634 www         1   4    0  6268K  2656K kqread 0   0:10  0.00% lighttpd
  788 root        1  44    0  3164K  3184K select 0   0:04  0.00% amd
 2206 news        1  44    0 15208K 12160K select 0   0:03  0.00% innd
  879 root        9   4    0  5432K  2460K kqread 1   0:02  0.00% nscd
  955 root        1  44    0  2736K  1216K select 1   0:02  0.00% master
  758 root        1  44    0  3164K  1340K select 1   0:02  0.00% ypbind
...........

so no memory problem

Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.1-STABLE #0: Mon Jan 5 14:29:23 CET 2009
    michel@niobe.lpthe.jussieu.fr:/usr/obj/usr/src/sys/NIOBE
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 3.06GHz (3073.65-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf27  Stepping = 7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4400<CNXT-ID,xTPR>
  Logical CPUs per core: 2
real memory  = 1610530816 (1535 MB)
avail memory = 1568387072 (1495 MB)
ACPI APIC Table: <ASUS P4PE >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
This module (opensolaris) contains code covered by the
Common Development and Distribution License (CDDL)
see http://opensolaris.org/os/licensing/opensolaris_license/
ioapic0 <Version 2.0> irqs 0-23 on motherboard
acpi0: <ASUS P4PE> on motherboard
acpi0: Overriding SCI Interrupt from IRQ 9 to IRQ 22
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, 5ff00000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0xe408-0xe40b on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
agp0: <Intel 82845G host to AGP bridge> on hostb0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
vgapci0: <VGA-compatible display> port 0xd800-0xd8ff mem 0xe0000000-0xefffffff,0xdf000000-0xdf00ffff irq 16 at device 0.0 on pci1
uhci0: <Intel 82801DB (ICH4) USB controller USB-A> port 0xb800-0xb81f irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
uhci0: [ITHREAD]
usb0: <Intel 82801DB (ICH4) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb0
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801DB (ICH4) USB controller USB-B> port 0xb400-0xb41f irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
uhci1: [ITHREAD]
usb1: <Intel 82801DB (ICH4) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801DB (ICH4) USB controller USB-C> port 0xb000-0xb01f irq 18 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
uhci2: [ITHREAD]
usb2: <Intel 82801DB (ICH4) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb2
uhub2: 2 ports with 2 removable, self powered
ehci0: <Intel 82801DB/L/M (ICH4) USB 2.0 controller> mem 0xde800000-0xde8003ff irq 23 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
ehci0: [ITHREAD]
usb3: EHCI version 1.0
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: <Intel 82801DB/L/M (ICH4) USB 2.0 controller> on ehci0
usb3: USB revision 2.0
uhub3: <Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1> on usb3
uhub3: 6 ports with 6 removable, self powered
pcib2: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci2: <ACPI PCI bus> on pcib2
bfe0: <Broadcom BCM4401 Fast Ethernet> mem 0xde000000-0xde001fff irq 20 at device 5.0 on pci2
miibus0: <MII bus> on bfe0
bmtphy0: <BCM4401 10/100baseTX PHY> PHY 1 on miibus0
bmtphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
bfe0: Ethernet address: 00:0c:6e:04:5d:39
bfe0: [ITHREAD]
fxp0: <Intel 82559 Pro/100 Ethernet> port 0xa800-0xa83f mem 0xdd800000-0xdd800fff,0xdd000000-0xdd0fffff irq 23 at device 11.0 on pci2
miibus1: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> PHY 1 on miibus1
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:02:b3:1d:df:8e
fxp0: [ITHREAD]
sym0: <875> port 0xa400-0xa4ff mem 0xdc800000-0xdc8000ff,0xdc000000-0xdc000fff irq 20 at device 12.0 on pci2
sym0: Tekram NVRAM, ID 7, Fast-20, SE, parity checking
sym0: [ITHREAD]
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH4 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f irq 18 at device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci0
ata1: [ITHREAD]
pcm0: <Intel ICH4 (82801DB)> port 0x9800-0x98ff,0x9400-0x943f mem 0xdb800000-0xdb8001ff,0xdb000000-0xdb0000ff irq 17 at device 31.5 on pci0
pcm0: [ITHREAD]
pcm0: <Analog Devices AD1980 AC97 Codec>
fdc0: <floppy drive controller> port 0x3f2-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: [FILTER]
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio0: [FILTER]
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
sio1: [FILTER]
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
cpu0: <ACPI CPU> on acpi0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
cpu1: <ACPI CPU> on acpi0
p4tcc1: <CPU Frequency Thermal Control> on cpu1
pmtimer0 on isa0
orm0: <ISA Option ROMs> at iomem 0xc0000-0xccfff,0xd0000-0xd0fff pnpid ORM0000 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
ppc0: <Parallel port> at port 0x378-0x37f irq 7 on isa0
ppc0: Generic chipset (EPP/NIBBLE) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
ppbus0: [ITHREAD]
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
ppc0: [GIANT-LOCKED]
ppc0: [ITHREAD]
ums0: <vendor 0x04d9 product 0x048e, class 0/0, rev 1.10/8.00, addr 2> on uhub1
ums0: 5 buttons and Z dir.
WARNING: ZFS is considered to be an experimental feature in FreeBSD.
Timecounters tick every 1.000 msec
Waiting 5 seconds for SCSI devices to settle
ZFS filesystem version 6
ZFS storage pool version 6
ad0: 58644MB <Maxtor 6Y060L0 YAR41VW0> at ata0-master UDMA100
acd0: DVDR <TSSTcorpCD/DVDW SH-S182D/SB02> at ata1-master UDMA33
acd0: FAILURE - INQUIRY ILLEGAL REQUEST asc=0x24 ascq=0x00
(probe5:sym0:0:5:0): phase change 6-7 6@01a0c7a8 resid=4.
(da0:sym0:0:5:0): phase change 6-7 6@01a0c7a8 resid=4.
da0 at sym0 bus 0 target 5 lun 0
da0: <IOMEGA ZIP 100 J.03> Removable Direct Access SCSI-2 device
da0: 3.300MB/s transfers
da0: 96MB (196608 512 byte sectors: 64H 32S/T 96C)
cd0 at ata1 bus 0 target 0 lun 0
cd0: <TSSTcorp CD/DVDW SH-S182D SB02> Removable CD-ROM SCSI-0 device
cd0: 33.000MB/s transfers
cd0: Attempt to query device size failed: NOT READY, Medium not present - tray closed
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad0s1a
WARNING: / was not properly dismounted
WARNING: /home was not properly dismounted
/home: mount pending error: blocks 128 files 10
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted
WARNING: TMPFS is considered to be a highly experimental feature in FreeBSD.
fxp0: Microcode loaded, int_delay: 1000 usec bundle_max: 6
fxp0: Microcode loaded, int_delay: 1000 usec bundle_max: 6
fxp0: Microcode loaded, int_delay: 1000 usec bundle_max: 6
fxp0: Microcode loaded, int_delay: 1000 usec bundle_max: 6

--
Michel TALON
> yes, do ps - threads in state L or LL and RUN are especially interesting,
> trace of pids 28, 27, and threads wich L on locked chan.

Here's the output of alllocks:

http://toybox.twisted.org.uk/~pete/71_show_alllocks.png

Here are the pages of ps:

http://toybox.twisted.org.uk/~pete/71_lock_ps2/

(next time I boot this I will disable http to avoid getting so many)

I can't see any which are in L, LL or RUN state there though. A few RL and WL towards the end. Traces on 28 and 27 are here:

http://toybox.twisted.org.uk/~pete/71_trace_28.png
http://toybox.twisted.org.uk/~pete/71_trace_27a.png
http://toybox.twisted.org.uk/~pete/71_trace_27b.png

I also did traces on 19 and 16 as (like 28 and 27) they are in a "CPU" state, so may be of interest ?

http://toybox.twisted.org.uk/~pete/71_trace_19.png
http://toybox.twisted.org.uk/~pete/71_trace_16.png

-pete.
On Mon, Jan 19, 2009 at 11:39:08AM +0000, Pete French wrote:
> > yes, do ps - threads in state L or LL and RUN are especially interesting,
> > trace of pids 28, 27, and threads wich L on locked chan.
>
> Here's the output of alllocks:
>
> http://toybox.twisted.org.uk/~pete/71_show_alllocks.png
>
> Here are the pages of ps:
>
> http://toybox.twisted.org.uk/~pete/71_lock_ps2/
>
> (next time I boot this I will disable http to avoid getting so many)
>
> I can't see any which are in L, LL or RUN state there though. A few RL
> and WL towards the end. Traces on 28 and 27 are here:
>
> http://toybox.twisted.org.uk/~pete/71_trace_28.png
> http://toybox.twisted.org.uk/~pete/71_trace_27a.png
> http://toybox.twisted.org.uk/~pete/71_trace_27b.png
>
> I also did traces on 19 and 16 as (like 28 and 27) they are in a "CPU"
> state, so may be of interest ?
>
> http://toybox.twisted.org.uk/~pete/71_trace_19.png
> http://toybox.twisted.org.uk/~pete/71_trace_16.png

This is probably your case, please try it:

http://www.freebsd.org/cgi/query-pr.cgi?pr=130652&cat

--
Have fun!
chd
Kris writes:
> You and anyone else seeing performance problems should try to work
> through the advice given here:
> [1]http://people.freebsd.org/~kris/scaling/Help_my_system_is_slow.pdf

Well, all the people in this thread have noticed that WITH NO CONFIG CHANGES from configs that worked fine in the past, their systems are very slow and/or locking up (mine are both) with the stable branch sometime (I noticed it sometime in December, but it got worse with the release.)

Most were OK in October; mine (I think) were OK in late November - may narrow things down?

Two of my systems that lock up have no internal visibility when they do (Soekris 4801's routing; the only time-intensive things running are routing (done in irq context) and pflog. These run with 60+ meg ram free.)

These are complete lockups, though I did manage to get a ps out of my laptop last night by waiting 20 _minutes_ for it to start (!). This is not a generic performance problem.

The laptop had 55 minutes of cpu time in the softdepflush thread after being up about an hour and 10 mins; this might give a hint. I didn't spot LL/RL state threads at the same time because I didn't know to. Now I do. BTW - the same ps showed 8 or so user-space procs in R state with NO cpu time; the kernel was hogging all of it for over an hour.

Firefox did indeed trigger this one as someone else noted. A soekris doing only routing+nat has no such excuse...

At least PHK was nice enough to note the watchdog in another thread :-)

-- Pete

References
1. http://people.freebsd.org/%7Ekris/scaling/Help_my_system_is_slow.pdf
I have done some (lots of) kernel debugging in the past. I have several points:

1. I shouldn't *have* to kernel debug for a normal usage of an official release.

2. One of the soekris boxes is 2800 MILES away, in a remote location, with no one present that is a skilled (or, indeed, any kind of) programmer. I usually thought I could trust a release, especially when I had been using the stable branch updated at about monthly intervals on 3 servers with no problems. (actually, I waited a while on 7.0 because .0 releases are traditionally quirky; in this case 7.0-rel worked fine and 7.1 has problems.) (and my servers are still running the *same* compilation of kernel/world with no problems; the hangs are unique to the laptop (which only started doing this badly with a Jan 9 csup) and the Soekris boxes (which started hangs sometime in December; they clearly don't run X...))

[ I've backed my house source to -stable of 12/1/08 and hope this will help; I don't have the time to fool around too much, and particularly to kernel debug something that shouldn't need it.]

I can't even start X at all on this laptop now. At least I can boot it, but it isn't much use for work unless it can run X.

3. I can't afford the time to debug my tools (freebsd is a tool, not an experiment, for lots of people, including me...) I use this laptop at work in a place where I am *not* working on freebsd. (nor am I even allowed to at work...)

-- Pete
On Fri, 9 Jan 2009, Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0 perfectly
> happily. I have been testing 7.1 in its various incarnations for the last
> couple of months on our test server and it has performed perfectly.
>
> So the last two days I have been round upgrading all our servers, knowing
> that I had run the system stably on identical hardware for some time.

For those following this other than Pete, who I've been in private correspondence with: it seems that he is running into two different deadlocks in the routing code. One of them (at least) is triggered by a lock order problem relating to the processing of ICMP redirects -- uncommon in most configurations, but there are quite a few on his network, so it triggers quickly under load. Kip Macy has corrected at least one (both?) of the problems in head, and plans to MFC the fixes in the near future. We'll follow up further once the fixes are merged, and if any further problems transpire.

Robert N M Watson
Computer Laboratory
University of Cambridge

> Since then I have started seeing machines lock up. This always happens under
> heavy disc load. When I bring the machine back up it sometimes fails
> to fsck due to a partially truncated inode. The lockups appear to
> be disc related - on my mysql master machine it will come back up with
> files somewhat shorter than those which have already been transmitted to
> the slave (i.e. some data was in memory, and claimed to have been written
> to the drive, but never made it onto the disc).
>
> The only time I have seen anything useful on the screen was during one lockup
> where I got a message about a spin lock being held too long and some
> comment in parentheses about it being a turnstile lock.
>
> Help! :-(
>
> I am now downgrading all the machines to 7.0 as fast as I can - though the
> machine I am trying to compile it on has locked up once during the compile
> so I haven't got anywhere so far.
>
> The machines are HP Proliant DL360 G5s - they have an embedded P400i
> RAID controller with a pair of mirrored drives connected. Each one has
> both ethernets connected, bundled using lagg and LACP.
>
> Advice ?
>
> -pete.
> load. Kip Macy has corrected at least one (both?) problems in head, and
> plans to MFC the fixes in the near future. We'll follow up further once
> the fixes are merged, and if any further problems transpire.

Hi, just wondering if we are any closer to having the MFC for this yet, or if there are any patches I could test ?

cheers,

-pete.
At 05:38 PM 1/29/2009, Robert Watson wrote:
>On Fri, 9 Jan 2009, Pete French wrote:
>
>>I have a number of HP 1U servers, all of which were running 7.0
>>perfectly happily. I have been testing 7.1 in its various
>>incarnations for the last couple of months on our test server and
>>it has performed perfectly.
>>
>>So the last two days I have been round upgrading all our servers,
>>knowing that I had run the system stably on identical hardware for some time.
>
>For those following this other than Pete, who I've been in private
>correspondence with: it seems that he is running into two different
>deadlocks in the routing code. One of them (at least) is triggered
>by a lock order problem relating to the processing of ICMP redirects
>-- uncommon in most configurations, but quite a few on his network,
>which triggers quickly under load. Kip Macy has corrected at least
>one (both?) problems in head, and plans to MFC the fixes in the near
>future. We'll follow up further once the fixes are merged, and if
>any further problems transpire.

Hi Robert,

Do you have any other details about these issues ? Were the fixes ever MFC'd ?

---Mike