This has been happening since 5.3-R.  I've been tuning different parameters
to no avail.  I've taken the disks off of the onboard ICH5 controller and
put them on a Promise TX4 S150 controller, but the same thing still happens.
The system freezes, but isn't totally dead.  It still responds to pings and
the screensaver still functions, but it won't respond to a CAD
(Ctrl-Alt-Del) at the console.  If I press 'Enter' at the console, it gives
me a 'login:' prompt, but after I enter the username, it never comes back
with the 'password:' prompt.
After manually resetting the system it boots and says 'Automatic file system
check failed; help!' and drops into single user mode.  Running fsck manually
corrects errors on all volumes.  Then it'll boot from that point.
This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
each time.  But it doesn't happen *every* morning.
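One way to narrow this down, assuming the stock /etc/crontab entry that starts "periodic daily" at 3:01 is what triggers the hang, is to run the daily scripts by hand, one at a time, and note which one was running when the box wedged. The following is a minimal sketch, not a tested procedure: the function name, the log path, and the default directory /etc/periodic/daily are illustrative assumptions. The trace log may itself be lost if the filesystem holding it is what deadlocks, so each marker is also echoed to the terminal.

```shell
#!/bin/sh
# Sketch: run each periodic script separately and leave a trace of which
# one was active. Function name and default paths are assumptions.

run_periodic_one_by_one() {
    dir=${1:-/etc/periodic/daily}
    log=${2:-/var/log/periodic-trace.log}
    for script in "$dir"/*; do
        # Skip anything that is not an executable regular file.
        [ -f "$script" ] && [ -x "$script" ] || continue
        # Marker goes to the terminal and the log before the script starts.
        echo "$(date '+%Y-%m-%d %H:%M:%S') START $script" | tee -a "$log"
        sync    # ask the kernel to flush the marker before a possible hang
        if "$script" > /dev/null 2>&1; then
            rc=0
        else
            rc=$?
        fi
        echo "$(date '+%Y-%m-%d %H:%M:%S') END   $script rc=$rc" | tee -a "$log"
    done
}
```

Run as root, for example "run_periodic_one_by_one /etc/periodic/daily /var/log/periodic-trace.log". After a hang and reset, the last START line without a matching END names the suspect script.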
I suspect a bug in FreeBSD because this mode of failure happens on 3
different machines, all configured similarly.
ASUS P4P800
2G RAM (though the other affected systems only have 1G)
80G Seagate Barracuda SATA drives (one system now on Promise TX4 S150
controller, others on onboard ICH5)
On my lightly loaded systems, it happens rarely.  On my mailserver (fairly
heavy disk load), it happens quite frequently.
How can I troubleshoot this?
dmesg follows:
Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RC2 #2: Wed Apr 13 17:35:20 MDT 2005
    root@postmaster.etv.net:/usr/obj/usr/src/sys/Postmaster
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 2.60GHz (2605.92-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf29  Stepping = 9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Hyperthreading: 2 logical CPUs
real memory  = 2146631680 (2047 MB)
avail memory = 2095153152 (1998 MB)
ACPI APIC Table: <A M I  OEMAPIC >
ioapic0 <Version 2.0> irqs 0-23 on motherboard
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <A M I OEMXSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
agp0: <Intel 82865 host to AGP bridge> mem 0xf8000000-0xfbffffff at device 0.0 on pci0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pci1: <display, VGA> at device 0.0 (no driver attached)
uhci0: <Intel 82801EB (ICH5) USB controller USB-A> port 0xef00-0xef1f irq 16 at device 29.0 on pci0
usb0: <Intel 82801EB (ICH5) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801EB (ICH5) USB controller USB-B> port 0xef20-0xef3f irq 19 at device 29.1 on pci0
usb1: <Intel 82801EB (ICH5) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801EB (ICH5) USB controller USB-C> port 0xef40-0xef5f irq 18 at device 29.2 on pci0
usb2: <Intel 82801EB (ICH5) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
uhci3: <Intel 82801EB (ICH5) USB controller USB-D> port 0xef80-0xef9f irq 16 at device 29.3 on pci0
usb3: <Intel 82801EB (ICH5) USB controller USB-D> on uhci3
usb3: USB revision 1.0
uhub3: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
pci0: <serial bus, USB> at device 29.7 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci2: <ACPI PCI bus> on pcib2
skc0: <3Com 3C940 Gigabit Ethernet> port 0xd800-0xd8ff mem 0xfeafc000-0xfeafffff irq 22 at device 5.0 on pci2
skc0: 3Com Gigabit LOM (3C940) rev. (0x1)
sk0: <Marvell Semiconductor, Inc. Yukon> on skc0
sk0: Ethernet address: 00:0c:6e:54:4b:19
miibus0: <MII bus> on sk0
e1000phy0: <Marvell 88E1000 Gigabit PHY> on miibus0
e1000phy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX-FDX, auto
atapci0: <Promise PDC20319 SATA150 controller> port 0xdc00-0xdc7f,0xdfa0-0xdfaf,0xdf00-0xdf3f mem 0xfeac0000-0xfeadffff,0xfeafb000-0xfeafbfff irq 21 at device 9.0 on pci2
atapci0: failed: rid 0x20 is memory, requested 4
ata2: channel #0 on atapci0
ata3: channel #1 on atapci0
ata4: channel #2 on atapci0
ata5: channel #3 on atapci0
xl0: <3Com 3c905C-TX Fast Etherlink XL> port 0xd480-0xd4ff mem 0xfeaf9c00-0xfeaf9c7f irq 20 at device 12.0 on pci2
miibus1: <MII bus> on xl0
ukphy0: <Generic IEEE 802.3u media interface> on miibus1
ukphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl0: Ethernet address: 00:04:75:f1:1c:7e
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <Intel ICH5 UDMA100 controller> port 0xfc00-0xfc0f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on pci0
ata0: channel #0 on atapci1
ata1: channel #1 on atapci1
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
pci0: <multimedia, audio> at device 31.5 (no driver attached)
acpi_button0: <Power Button> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model IntelliMouse, device ID 3
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
fdc0: <floppy drive controller (FDE)> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
ppc0: <Standard parallel printer port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
orm0: <ISA Option ROMs> at iomem 0xcf000-0xcf7ff,0xc8000-0xcefff,0xc0000-0xc7fff on isa0
pmtimer0 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounter "TSC" frequency 2605917008 Hz quality 800
Timecounters tick every 10.000 msec
acd0: CDROM <FX54++M/Y01E> at ata1-master PIO4
ad4: 76319MB <ST380013AS/3.05> [155061/16/63] at ata2-master SATA150
ad6: 76319MB <ST380013AS/3.05> [155061/16/63] at ata3-master SATA150
ad8: 117246MB <Maxtor 6Y120M0/YAR511W0> [238216/16/63] at ata4-master SATA150
ad10: 76319MB <ST380013AS/3.05> [155061/16/63] at ata5-master SATA150
ar0: 76319MB <ATA RAID1 array> [9729/255/63] status: READY subdisks:
 disk0 READY on ad4 at ata2-master
 disk1 READY on ad6 at ata3-master
ar1: 76319MB <ATA RAID1 array> [9729/255/63] status: READY subdisks:
 disk0 READY on ad10 at ata5-master
 disk1 READY on ad8 at ata4-master
Mounting root from ufs:/dev/ar0s1a
WARNING: / was not properly dismounted

Elliot Finley wrote:
> This has been happening since 5.3-R, I've been tuning different parameters
> to no avail. I've taken the disks off of the onboard ICH5 controller and
> put them a promise TX4 S150 controller, but still the same thing happens.
>
> The system freezes, but isn't totally dead. It'll still respond to pings,
> the screensaver still functions, but it won't respond to a CAD at the
> console. But if I press 'Enter' at the console, it'll give me a 'login:'
> prompt, but after entering the username, it never comes back with the
> 'password:' prompt.
>
> After manually resetting the system it boots and says 'Automatic file system
> check failed; help!' and drops into single user mode. Running fsck manually
> corrects errors on all volumes. Then it'll boot from that point.
>
> This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
> each time. But it doesn't happen *every* morning.

Hmm, sounds like a deadlock somewhere.

On the ATA part, try the ATA mkIII patches on
http://people.freebsd.org/~sos/ATA and see if that changes anything.

-Søren

On Mon, May 16, 2005 at 06:40:01AM -0600, Elliot Finley wrote:
>The system freezes, but isn't totally dead. It'll still respond to pings,
>the screensaver still functions, but it won't respond to a CAD at the
>console. But if I press 'Enter' at the console, it'll give me a 'login:'
>prompt, but after entering the username, it never comes back with the
>'password:' prompt.
...
>On my lightly loaded systems, it happens rarely. On my mailserver (fairly
>heavy disk load), it happens quite frequently.

This could equally be a filesystem deadlock (race-to-root) rather than
something in the ATA controller. Do you know if it happens gradually
(starts with one or two non-responsive, unkillable processes and gets
worse until nothing happens)?

>How can I troubleshoot this?

Re-compile the kernel with:
        options KDB
        options DDB
        makeoptions DEBUG=-g
and ensure you have a "dumpdev" in /etc/rc.conf. When you get a lockup,
drop to DDB (Ctrl-Alt-ESC) and run "show lockedvnods", "ps" and
"call doadump()". If you post the output (a serial console will help
here) someone might be able to provide more pointers. (The crashdump
will help with later debugging).

Note: If you don't have another FreeBSD system handy, a hard copy of
ddb(4) will be very handy if you want to play around in DDB.

--
Peter
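
The prerequisites listed above can be checked before the next hang. This is a minimal sketch, assuming the kernel config file lives at a path such as /usr/src/sys/i386/conf/Postmaster (inferred from the dmesg build path, not confirmed in the thread). The function name check_debug_setup is hypothetical, and it only reads the two files it is given.

```shell
#!/bin/sh
# Sketch: report which of the suggested debugging settings are absent.
# check_debug_setup is a hypothetical helper; it modifies nothing.

check_debug_setup() {
    kernconf=$1
    rcconf=${2:-/etc/rc.conf}
    missing=0
    for opt in KDB DDB; do
        # Match an uncommented "options <OPT>" line, not e.g. KDB_TRACE.
        if ! grep -Eq "^[[:space:]]*options[[:space:]]+$opt([[:space:]]|\$)" "$kernconf"; then
            echo "missing: options $opt"
            missing=1
        fi
    done
    if ! grep -Eq '^[[:space:]]*makeoptions[[:space:]]+DEBUG=-g' "$kernconf"; then
        echo "missing: makeoptions DEBUG=-g"
        missing=1
    fi
    # Any uncommented dumpdev= assignment counts; its value is not checked.
    if ! grep -Eq '^[[:space:]]*dumpdev=' "$rcconf"; then
        echo "missing: dumpdev in $rcconf"
        missing=1
    fi
    return $missing
}
```

For example, "check_debug_setup /usr/src/sys/i386/conf/Postmaster /etc/rc.conf" prints one line per missing setting and returns non-zero if anything is absent.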

On Tue, 17 May 2005 freebsd-stable-request@freebsd.org wrote:
> Date: Mon, 16 May 2005 06:40:01 -0600
> From: "Elliot Finley" <efinleywork@efinley.com>
> Subject: 5.4-RC2 freezing - ATA related?
> To: <freebsd-stable@freebsd.org>
> Cc: sos@freebsd.org
> Message-ID: <001801c55a14$609720d0$37cba1cd@emerytelcom.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> This has been happening since 5.3-R, I've been tuning different parameters
> to no avail. I've taken the disks off of the onboard ICH5 controller and
> put them a promise TX4 S150 controller, but still the same thing happens.
>
> The system freezes, but isn't totally dead. It'll still respond to pings,
> the screensaver still functions, but it won't respond to a CAD at the
> console. But if I press 'Enter' at the console, it'll give me a 'login:'
> prompt, but after entering the username, it never comes back with the
> 'password:' prompt.
>
> After manually resetting the system it boots and says 'Automatic file system
> check failed; help!' and drops into single user mode. Running fsck manually
> corrects errors on all volumes. Then it'll boot from that point.
>
> This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
> each time. But it doesn't happen *every* morning.
>
> I suspect a bug in FreeBSD because this mode of failure happens on 3
> different machines, all configured similarly.

You can add a fourth. Ever since 5.1 (my first 5.x install) I have
experienced the same problem, again with an Intel ICH5 ATA controller.
The symptoms are exactly the same -- the hang is normally triggered
during the periodic runs just after 3AM. The hang does occur at other
times as well, but with nowhere near the same consistency.

The only solution I found at that time was reverting to 4.10, though
that is obviously suboptimal. I could be persuaded to reinstall 5.x on
the machine if I'd be sure to get someone to look into this.

Thanks,
Brent Casavant

--
Brent Casavant
http://www.angeltread.org/
KD5EMB -.- -.. ..... . -- -...
44 54'24"N 93 03'21"W 907FASL EN34lv

On Mon, May 16, 2005 at 06:40:01AM -0600, Elliot Finley wrote:
> This has been happening since 5.3-R, I've been tuning different parameters
> to no avail. I've taken the disks off of the onboard ICH5 controller and
> put them a promise TX4 S150 controller, but still the same thing happens.
>
> The system freezes, but isn't totally dead. It'll still respond to pings,
> the screensaver still functions, but it won't respond to a CAD at the
> console. But if I press 'Enter' at the console, it'll give me a 'login:'
> prompt, but after entering the username, it never comes back with the
> 'password:' prompt.
>
> After manually resetting the system it boots and says 'Automatic file system
> check failed; help!' and drops into single user mode. Running fsck manually
> corrects errors on all volumes. Then it'll boot from that point.
>
> This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
> each time. But it doesn't happen *every* morning.
>
> I suspect a bug in FreeBSD because this mode of failure happens on 3
> different machines, all configured similarly.
>

We are having similar problems to this on a box, won't go into great
detail at the moment but will post results when we have finished testing.

--
Jamie Heckford
Network Manager
Trident Microsystems Ltd.

t: +44(0)1737-780790
f: +44(0)1737-771908
w: http://www.tridentmicrosystems.co.uk/

On Mon, May 16, 2005 at 06:40:01AM -0600, Elliot Finley wrote:
> This has been happening since 5.3-R, I've been tuning different parameters
> to no avail. I've taken the disks off of the onboard ICH5 controller and
> put them a promise TX4 S150 controller, but still the same thing happens.
>
> The system freezes, but isn't totally dead. It'll still respond to pings,
> the screensaver still functions, but it won't respond to a CAD at the
> console. But if I press 'Enter' at the console, it'll give me a 'login:'
> prompt, but after entering the username, it never comes back with the
> 'password:' prompt.
>
> After manually resetting the system it boots and says 'Automatic file system
> check failed; help!' and drops into single user mode. Running fsck manually
> corrects errors on all volumes. Then it'll boot from that point.
>
> This seems to be triggered by daily periodic as it happens at 3:02-3:03AM
> each time. But it doesn't happen *every* morning.
>
> I suspect a bug in FreeBSD because this mode of failure happens on 3
> different machines, all configured similarly.
>
> ASUS P4P800
> 2G RAM (though the other affected systems only have 1G)
> 80G Seagate Barracuda SATA drives (one system now on Promise TX4 S150
> controller, others on onboard ICH5)
>
> On my lightly loaded systems, it happens rarely. On my mailserver (fairly
> heavy disk load), it happens quite frequently.
>
> How can I troubleshoot this?

Managed to get a dump on our system for a similar prob we are getting:

[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
#0  doadump () at pcpu.h:160
160     __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) bt
#0  doadump () at pcpu.h:160
#1  0xc05131ae in boot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:410
#2  0xc0513474 in panic (fmt=0xc06c3da5 "%s")
    at /usr/src/sys/kern/kern_shutdown.c:566
#3  0xc0691e18 in trap_fatal (frame=0xecb4bb34, eva=532)
    at /usr/src/sys/i386/i386/trap.c:817
#4  0xc0691b73 in trap_pfault (frame=0xecb4bb34, usermode=0, eva=532)
    at /usr/src/sys/i386/i386/trap.c:735
#5  0xc0691771 in trap (frame {tf_fs = -1068433384, tf_es = -989790192,
    tf_ds = 16, tf_edi = -1066124736, tf_esi = -1066124736,
    tf_ebp = -323699844, tf_isp = -323699872, tf_ebx = -1007063716,
    tf_edx = 528, tf_ecx = -1013235680, tf_eax = 307472464, tf_trapno = 12,
    tf_err = 2, tf_eip = -1067870386, tf_cs = 8, tf_eflags = 66050,
    tf_esp = -989760240, tf_ss = -1007063716})
    at /usr/src/sys/i386/i386/trap.c:425
#6  0xc068168a in calltrap () at /usr/src/sys/i386/i386/exception.s:140
#7  0xc0510018 in crcopy () at /usr/src/sys/kern/kern_prot.c:1810
#8  0xc0598c77 in in_pcbdetach (inp=0xc0743a40)
    at /usr/src/sys/netinet/in_pcb.c:720
#9  0xc05b21a6 in tcp_close (tp=0x0) at /usr/src/sys/netinet/tcp_subr.c:783
#10 0xc05ae560 in tcp_input (m=0xc3a6a300, off0=20)
    at /usr/src/sys/netinet/tcp_input.c:2308
#11 0xc05a5aed in ip_input (m=0xc3a6a300)
    at /usr/src/sys/netinet/ip_input.c:776
#12 0xc0582f13 in netisr_processqueue (ni=0xc0742498)
    at /usr/src/sys/net/netisr.c:233
#13 0xc058310a in swi_net (dummy=0x0) at /usr/src/sys/net/netisr.c:346
#14 0xc04ffa79 in ithread_loop (arg=0xc3481600)
    at /usr/src/sys/kern/kern_intr.c:547
#15 0xc04fed0c in fork_exit (callout=0xc04ff928 <ithread_loop>,
    arg=0xc3481600, frame=0xecb4bd38) at /usr/src/sys/kern/kern_fork.c:791
#16 0xc06816ec in fork_trampoline ()
    at /usr/src/sys/i386/i386/exception.s:209
(kgdb)

Help? ;)

--
Jamie Heckford
Network Manager
Trident Microsystems Ltd.

t: +44(0)1737-780790
f: +44(0)1737-771908
w: http://www.tridentmicrosystems.co.uk/