This has been happening since 5.3-R, I've been tuning different parameters
to no avail. I've taken the disks off of the onboard ICH5 controller and
put them on a Promise TX4 S150 controller, but still the same thing happens.

The system freezes, but isn't totally dead. It'll still respond to pings,
the screensaver still functions, but it won't respond to a CAD at the
console. But if I press 'Enter' at the console, it'll give me a 'login:'
prompt, but after entering the username, it never comes back with the
'password:' prompt.

After manually resetting the system it boots and says 'Automatic file system
check failed; help!' and drops into single user mode. Running fsck manually
corrects errors on all volumes. Then it'll boot from that point.

This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
each time. But it doesn't happen *every* morning.

I suspect a bug in FreeBSD because this mode of failure happens on 3
different machines, all configured similarly.

ASUS P4P800
2G RAM (though the other affected systems only have 1G)
80G Seagate Barracuda SATA drives (one system now on Promise TX4 S150
controller, others on onboard ICH5)

On my lightly loaded systems, it happens rarely. On my mailserver (fairly
heavy disk load), it happens quite frequently.

How can I troubleshoot this?

dmesg follows:

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RC2 #2: Wed Apr 13 17:35:20 MDT 2005
    root@postmaster.etv.net:/usr/obj/usr/src/sys/Postmaster
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 2.60GHz (2605.92-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0xf29 Stepping = 9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Hyperthreading: 2 logical CPUs
real memory = 2146631680 (2047 MB)
avail memory = 2095153152 (1998 MB)
ACPI APIC Table: <A M I OEMAPIC >
ioapic0 <Version 2.0> irqs 0-23 on motherboard
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <A M I OEMXSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
agp0: <Intel 82865 host to AGP bridge> mem 0xf8000000-0xfbffffff at device 0.0 on pci0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pci1: <display, VGA> at device 0.0 (no driver attached)
uhci0: <Intel 82801EB (ICH5) USB controller USB-A> port 0xef00-0xef1f irq 16 at device 29.0 on pci0
usb0: <Intel 82801EB (ICH5) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801EB (ICH5) USB controller USB-B> port 0xef20-0xef3f irq 19 at device 29.1 on pci0
usb1: <Intel 82801EB (ICH5) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801EB (ICH5) USB controller USB-C> port 0xef40-0xef5f irq 18 at device 29.2 on pci0
usb2: <Intel 82801EB (ICH5) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
uhci3: <Intel 82801EB (ICH5) USB controller USB-D> port 0xef80-0xef9f irq 16 at device 29.3 on pci0
usb3: <Intel 82801EB (ICH5) USB controller USB-D> on uhci3
usb3: USB revision 1.0
uhub3: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
pci0: <serial bus, USB> at device 29.7 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci2: <ACPI PCI bus> on pcib2
skc0: <3Com 3C940 Gigabit Ethernet> port 0xd800-0xd8ff mem 0xfeafc000-0xfeafffff irq 22 at device 5.0 on pci2
skc0: 3Com Gigabit LOM (3C940) rev. (0x1)
sk0: <Marvell Semiconductor, Inc. Yukon> on skc0
sk0: Ethernet address: 00:0c:6e:54:4b:19
miibus0: <MII bus> on sk0
e1000phy0: <Marvell 88E1000 Gigabit PHY> on miibus0
e1000phy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX-FDX, auto
atapci0: <Promise PDC20319 SATA150 controller> port 0xdc00-0xdc7f,0xdfa0-0xdfaf,0xdf00-0xdf3f mem 0xfeac0000-0xfeadffff,0xfeafb000-0xfeafbfff irq 21 at device 9.0 on pci2
atapci0: failed: rid 0x20 is memory, requested 4
ata2: channel #0 on atapci0
ata3: channel #1 on atapci0
ata4: channel #2 on atapci0
ata5: channel #3 on atapci0
xl0: <3Com 3c905C-TX Fast Etherlink XL> port 0xd480-0xd4ff mem 0xfeaf9c00-0xfeaf9c7f irq 20 at device 12.0 on pci2
miibus1: <MII bus> on xl0
ukphy0: <Generic IEEE 802.3u media interface> on miibus1
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl0: Ethernet address: 00:04:75:f1:1c:7e
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <Intel ICH5 UDMA100 controller> port 0xfc00-0xfc0f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on pci0
ata0: channel #0 on atapci1
ata1: channel #1 on atapci1
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
pci0: <multimedia, audio> at device 31.5 (no driver attached)
acpi_button0: <Power Button> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model IntelliMouse, device ID 3
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
fdc0: <floppy drive controller (FDE)> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
ppc0: <Standard parallel printer port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
orm0: <ISA Option ROMs> at iomem 0xcf000-0xcf7ff,0xc8000-0xcefff,0xc0000-0xc7fff on isa0
pmtimer0 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounter "TSC" frequency 2605917008 Hz quality 800
Timecounters tick every 10.000 msec
acd0: CDROM <FX54++M/Y01E> at ata1-master PIO4
ad4: 76319MB <ST380013AS/3.05> [155061/16/63] at ata2-master SATA150
ad6: 76319MB <ST380013AS/3.05> [155061/16/63] at ata3-master SATA150
ad8: 117246MB <Maxtor 6Y120M0/YAR511W0> [238216/16/63] at ata4-master SATA150
ad10: 76319MB <ST380013AS/3.05> [155061/16/63] at ata5-master SATA150
ar0: 76319MB <ATA RAID1 array> [9729/255/63] status: READY subdisks:
 disk0 READY on ad4 at ata2-master
 disk1 READY on ad6 at ata3-master
ar1: 76319MB <ATA RAID1 array> [9729/255/63] status: READY subdisks:
 disk0 READY on ad10 at ata5-master
 disk1 READY on ad8 at ata4-master
Mounting root from ufs:/dev/ar0s1a
WARNING: / was not properly dismounted

Elliot Finley wrote:
> This has been happening since 5.3-R, I've been tuning different parameters
> to no avail. I've taken the disks off of the onboard ICH5 controller and
> put them on a Promise TX4 S150 controller, but still the same thing happens.
>
> The system freezes, but isn't totally dead. It'll still respond to pings,
> the screensaver still functions, but it won't respond to a CAD at the
> console. But if I press 'Enter' at the console, it'll give me a 'login:'
> prompt, but after entering the username, it never comes back with the
> 'password:' prompt.
>
> After manually resetting the system it boots and says 'Automatic file system
> check failed; help!' and drops into single user mode. Running fsck manually
> corrects errors on all volumes. Then it'll boot from that point.
>
> This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
> each time. But it doesn't happen *every* morning.

Hmm, sounds like a deadlock somewhere. On the ATA part, try the ATA mkIII
patches on http://people.freebsd.org/~sos/ATA and see if that changes
anything.

-- 
-Søren

On Mon, May 16, 2005 at 06:40:01AM -0600, Elliot Finley wrote:
>The system freezes, but isn't totally dead. It'll still respond to pings,
>the screensaver still functions, but it won't respond to a CAD at the
>console. But if I press 'Enter' at the console, it'll give me a 'login:'
>prompt, but after entering the username, it never comes back with the
>'password:' prompt.
...
>On my lightly loaded systems, it happens rarely. On my mailserver (fairly
>heavy disk load), it happens quite frequently.

This could equally be a filesystem deadlock (race-to-root) rather than
something in the ATA controller. Do you know if it happens gradually
(starts with one or two non-responsive, unkillable processes and gets
worse until nothing happens)?

>How can I troubleshoot this?

Re-compile the kernel with:
        options KDB
        options DDB
        makeoptions DEBUG=-g
and ensure you have a "dumpdev" in /etc/rc.conf.

When you get a lockup, drop to DDB (Ctrl-Alt-ESC) and run
"show lockedvnods", "ps" and "call doadump()". If you post the output
(a serial console will help here) someone might be able to provide more
pointers. (The crashdump will help with later debugging).

Note: If you don't have another FreeBSD system handy, a hard copy of
ddb(4) will be very handy if you want to play around in DDB.

-- 
Peter

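For reference, the setup Peter describes might be laid out roughly as
follows. This is only a sketch under stated assumptions, not a tested
configuration: the kernel config path is inferred from the "Postmaster"
kernel name in the dmesg, and the dump device is a placeholder that has to
be replaced with the machine's real swap partition.

```
# Kernel config file, e.g. /usr/src/sys/i386/conf/Postmaster (path assumed)
options         KDB             # kernel debugger framework
options         DDB             # interactive in-kernel debugger
makeoptions     DEBUG=-g        # build the kernel with debug symbols
```

```
# /etc/rc.conf
dumpdev="/dev/ad0s1b"   # placeholder device: use your own swap partition
dumpdir="/var/crash"    # directory where savecore(8) stores the dump
```

On 5.x the dump is a full memory image, so the dump device most likely
needs to be at least as large as RAM (2G on the mailserver). After the
next boot the saved dump should turn up under the dumpdir directory.
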
On Tue, 17 May 2005 freebsd-stable-request@freebsd.org wrote:

> Date: Mon, 16 May 2005 06:40:01 -0600
> From: "Elliot Finley" <efinleywork@efinley.com>
> Subject: 5.4-RC2 freezing - ATA related?
> To: <freebsd-stable@freebsd.org>
> Cc: sos@freebsd.org
> Message-ID: <001801c55a14$609720d0$37cba1cd@emerytelcom.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> This has been happening since 5.3-R, I've been tuning different parameters
> to no avail. I've taken the disks off of the onboard ICH5 controller and
> put them on a Promise TX4 S150 controller, but still the same thing happens.
>
> The system freezes, but isn't totally dead. It'll still respond to pings,
> the screensaver still functions, but it won't respond to a CAD at the
> console. But if I press 'Enter' at the console, it'll give me a 'login:'
> prompt, but after entering the username, it never comes back with the
> 'password:' prompt.
>
> After manually resetting the system it boots and says 'Automatic file system
> check failed; help!' and drops into single user mode. Running fsck manually
> corrects errors on all volumes. Then it'll boot from that point.
>
> This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
> each time. But it doesn't happen *every* morning.
>
> I suspect a bug in FreeBSD because this mode of failure happens on 3
> different machines, all configured similarly.

You can add a fourth. Ever since 5.1 (my first 5.x install) I have
experienced the same problem, again with an Intel ICH5 ATA controller.
The symptoms are exactly the same: the hang is normally triggered
during the periodic runs just after 3AM. The hang does occur at other
times as well, but with nowhere near the same consistency.

The only solution I found at that time was reverting to 4.10, though
that is obviously suboptimal. I could be persuaded to reinstall 5.x
on the machine if I could be sure that someone would look into this.

Thanks,
Brent Casavant

-- 
Brent Casavant                          http://www.angeltread.org/
KD5EMB -.- -.. ..... . -- -...          44 54'24"N 93 03'21"W 907FASL EN34lv

On Mon, May 16, 2005 at 06:40:01AM -0600, Elliot Finley wrote:
> This has been happening since 5.3-R, I've been tuning different parameters
> to no avail. I've taken the disks off of the onboard ICH5 controller and
> put them on a Promise TX4 S150 controller, but still the same thing happens.
>
> The system freezes, but isn't totally dead. It'll still respond to pings,
> the screensaver still functions, but it won't respond to a CAD at the
> console. But if I press 'Enter' at the console, it'll give me a 'login:'
> prompt, but after entering the username, it never comes back with the
> 'password:' prompt.
>
> After manually resetting the system it boots and says 'Automatic file system
> check failed; help!' and drops into single user mode. Running fsck manually
> corrects errors on all volumes. Then it'll boot from that point.
>
> This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
> each time. But it doesn't happen *every* morning.
>
> I suspect a bug in FreeBSD because this mode of failure happens on 3
> different machines, all configured similarly.
>

We are having similar problems to this on one of our boxes. I won't go
into great detail at the moment, but will post results when we have
finished testing.

-- 
Jamie Heckford
Network Manager
Trident Microsystems Ltd.

t: +44(0)1737-780790
f: +44(0)1737-771908
w: http://www.tridentmicrosystems.co.uk/

On Mon, May 16, 2005 at 06:40:01AM -0600, Elliot Finley wrote:
> This has been happening since 5.3-R, I've been tuning different parameters
> to no avail. I've taken the disks off of the onboard ICH5 controller and
> put them on a Promise TX4 S150 controller, but still the same thing happens.
>
> The system freezes, but isn't totally dead. It'll still respond to pings,
> the screensaver still functions, but it won't respond to a CAD at the
> console. But if I press 'Enter' at the console, it'll give me a 'login:'
> prompt, but after entering the username, it never comes back with the
> 'password:' prompt.
>
> After manually resetting the system it boots and says 'Automatic file system
> check failed; help!' and drops into single user mode. Running fsck manually
> corrects errors on all volumes. Then it'll boot from that point.
>
> This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
> each time. But it doesn't happen *every* morning.
>
> I suspect a bug in FreeBSD because this mode of failure happens on 3
> different machines, all configured similarly.
>
> ASUS P4P800
> 2G RAM (though the other affected systems only have 1G)
> 80G Seagate Barracuda SATA drives (one system now on Promise TX4 S150
> controller, others on onboard ICH5)
>
> On my lightly loaded systems, it happens rarely. On my mailserver (fairly
> heavy disk load), it happens quite frequently.
>
> How can I troubleshoot this?

Managed to get a dump on our system for a similar problem we are getting:

[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
#0 doadump () at pcpu.h:160
160 __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) bt
#0 doadump () at pcpu.h:160
#1 0xc05131ae in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:410
#2 0xc0513474 in panic (fmt=0xc06c3da5 "%s")
    at /usr/src/sys/kern/kern_shutdown.c:566
#3 0xc0691e18 in trap_fatal (frame=0xecb4bb34, eva=532)
    at /usr/src/sys/i386/i386/trap.c:817
#4 0xc0691b73 in trap_pfault (frame=0xecb4bb34, usermode=0, eva=532)
    at /usr/src/sys/i386/i386/trap.c:735
#5 0xc0691771 in trap (frame
      {tf_fs = -1068433384, tf_es = -989790192, tf_ds = 16, tf_edi = -1066124736, tf_esi = -1066124736, tf_ebp = -323699844, tf_isp = -323699872, tf_ebx = -1007063716, tf_edx = 528, tf_ecx = -1013235680, tf_eax = 307472464, tf_trapno = 12, tf_err = 2, tf_eip = -1067870386, tf_cs = 8, tf_eflags = 66050, tf_esp = -989760240, tf_ss = -1007063716})
    at /usr/src/sys/i386/i386/trap.c:425
#6 0xc068168a in calltrap () at /usr/src/sys/i386/i386/exception.s:140
#7 0xc0510018 in crcopy () at /usr/src/sys/kern/kern_prot.c:1810
#8 0xc0598c77 in in_pcbdetach (inp=0xc0743a40)
    at /usr/src/sys/netinet/in_pcb.c:720
#9 0xc05b21a6 in tcp_close (tp=0x0) at /usr/src/sys/netinet/tcp_subr.c:783
#10 0xc05ae560 in tcp_input (m=0xc3a6a300, off0=20)
    at /usr/src/sys/netinet/tcp_input.c:2308
#11 0xc05a5aed in ip_input (m=0xc3a6a300)
    at /usr/src/sys/netinet/ip_input.c:776
#12 0xc0582f13 in netisr_processqueue (ni=0xc0742498)
    at /usr/src/sys/net/netisr.c:233
#13 0xc058310a in swi_net (dummy=0x0) at /usr/src/sys/net/netisr.c:346
#14 0xc04ffa79 in ithread_loop (arg=0xc3481600)
    at /usr/src/sys/kern/kern_intr.c:547
#15 0xc04fed0c in fork_exit (callout=0xc04ff928 <ithread_loop>,
    arg=0xc3481600, frame=0xecb4bd38) at /usr/src/sys/kern/kern_fork.c:791
#16 0xc06816ec in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:209
(kgdb)

Help? ;)

-- 
Jamie Heckford
Network Manager
Trident Microsystems Ltd.

t: +44(0)1737-780790
f: +44(0)1737-771908
w: http://www.tridentmicrosystems.co.uk/