Alson van der Meulen
2008-Aug-16 12:53 UTC
Panic in nfs/ffs after upgrade from 6.2 to 6.3-RELEASE
Hello, This file server was upgraded from 6.2-RELEASE-p$something to 6.3-RELEASE-p3 (current RELENG_6_3 sources via csup) on August, 11 and rebooted to the new kernel on August, 12. The school is closed due to holidays. The only load is mail delivery to Maildirs via NFS from another server (just 50 mailboxes or so, none of them very busy) and nightly backups. I went there on August, 13 to perform some maintenance on unrelated systems. Performed a Windows installation (+ assorted desktop software) that fetched the installation files from Samba running on this server. This procedure downloads a total of ~5G in bursts via a 100mbit link. After most of this was over (I think it wast just installing patches at that point), the file server crashed. The crash: [GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"] GNU gdb 6.1.1 [FreeBSD] [...] Unread portion of the kernel message buffer: Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x1c fault code = supervisor write, page not present instruction pointer = 0x20:0xc06fd937 stack pointer = 0x28:0xe78e78ec frame pointer = 0x28:0xe78e7914 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 847 (nfsd) trap number = 12 panic: page fault cpuid = 1 KDB: stack backtrace: kdb_backtrace(100,c532f900,28,e78e78ac,c,...) at kdb_backtrace+0x29 panic(c0976f0c,c09cfe95,0,fffff,c532e49b,...) at panic+0x114 trap_fatal(e78e78ac,1c,c532f900,0,c,...) at trap_fatal+0x2ce trap_pfault(e78e78ac,0,1c) at trap_pfault+0x1f7 trap(e78e0008,c06a0028,e78e0028,200012,d9156118,...) at trap+0x325 calltrap() at calltrap+0x5 --- trap 0xc, eip = 0xc06fd937, esp = 0xe78e78ec, ebp = 0xe78e7914 --- getnewbuf(0,0,4000,4000) at getnewbuf+0x1bb getblk(c5964cc0,0,0,4000,0,...) at getblk+0x360 cluster_read(c5964cc0,46e5,0,0,0,...) at cluster_read+0xde ffs_read(e78e7b08) at ffs_read+0x25f VOP_READ_APV(c0a65560,e78e7b08) at VOP_READ_APV+0x38 nfsrv_read(c69b3800,c520d900,c532f900,e78e7c98,0,...) at nfsrv_read+0xb16 nfssvc_nfsd(c532f900) at nfssvc_nfsd+0x435 nfssvc(c532f900,e78e7d04) at nfssvc+0x1c0 syscall(3b,3b,3b,1,0,...) at syscall+0x2bf Xint0x80_syscall() at Xint0x80_syscall+0x1f --- syscall (155, FreeBSD ELF32, nfssvc), eip = 0x280bdf17, esp = 0xbfbfeb0c, ebp = 0xbfbfeb28 --- Uptime: 12h24m16s Dumping 1023 MB (2 chunks) chunk 0: 1MB (159 pages) ... ok chunk 1: 1023MB (261840 pages) 1007 991 975 959 943 927 911 895 879 863 847 831 815 799 783 767 751 735 719 703 687 671 655 639 623 607 591 575 559 543 527 511 495 479 463 447 431 415 399 383 367 351 335 319 303 287 271 255 239 223 207 191 175 159 143 127 111 95 79 63 47 31 15 #0 doadump () at pcpu.h:165 165 __asm __volatile("movl %%fs:0,%0" : "=r" (td)); (kgdb) list *0xc06fd937 0xc06fd937 is in getnewbuf (atomic.h:149). 144 static __inline int 145 atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src) 146 { 147 int res = exp; 148 149 __asm __volatile ( 150 " " __XSTRING(MPLOCKED) " " 151 " cmpxchgl %2,%1 ; " 152 " setz %%al ; " 153 " movzbl %%al,%0 ; " (kgdb) bt #0 doadump () at pcpu.h:165 #1 0xc06b25fe in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409 #2 0xc06b2955 in panic (fmt=0xc0976f0c "%s") at /usr/src/sys/kern/kern_shutdown.c:565 #3 0xc0924c86 in trap_fatal (frame=0xe78e78ac, eva=28) at /usr/src/sys/i386/i386/trap.c:838 #4 0xc092498f in trap_pfault (frame=0xe78e78ac, usermode=0, eva=28) at /usr/src/sys/i386/i386/trap.c:745 #5 0xc0924585 in trap (frame {tf_fs = -410124280, tf_es = -1066794968, tf_ds = -410124248, tf_edi = 2097170, tf_esi = -652910312, tf_ebp = -410093292, tf_isp = -410093352, tf_ebx = 0, tf_edx = -1062624684, tf_ecx = -986515200, tf_eax = 4, tf_trapno = 12, tf_err = 2, tf_eip = -1066411721, tf_cs = 32, tf_eflags = 66182, tf_esp = 1, tf_ss = 4}) at /usr/src/sys/i386/i386/trap.c:435 #6 0xc090ec9a in calltrap () at /usr/src/sys/i386/i386/exception.s:139 #7 0xc06fd937 in getnewbuf (slpflag=0, slptimeo=0, size=16384, maxsize=16384) at atomic.h:149 #8 0xc06fefdc in getblk (vp=0xc5964cc0, blkno=0, size=16384, slpflag=0, slptimeo=0, flags=0) at /usr/src/sys/kern/vfs_bio.c:2516 #9 0xc0702a8a in cluster_read (vp=0xc5964cc0, filesize=18149, lblkno=0, size=16384, cred=0x0, totread=8192, seqcount=5, bpp=0x4) at /usr/src/sys/kern/vfs_cluster.c:118 #10 0xc081213b in ffs_read (ap=0x4) at /usr/src/sys/ufs/ffs/ffs_vnops.c:503 #11 0xc09368a0 in VOP_READ_APV (vop=0x4, a=0xc0a9a254) at vnode_if.c:643 #12 0xc07ad086 in nfsrv_read (nfsd=0xc69b3800, slp=0xc520d900, td=0xc532f900, mrq=0xe78e7c98) at vnode_if.h:343 #13 0xc07bd329 in nfssvc_nfsd (td=0x4) at /usr/src/sys/nfsserver/nfs_syscalls.c:474 #14 0xc07bcb08 in nfssvc (td=0xc532f900, uap=0xe78e7d04) at /usr/src/sys/nfsserver/nfs_syscalls.c:181 #15 0xc0924fcb in syscall (frame {tf_fs = 59, tf_es = 59, tf_ds = 59, tf_edi = 1, tf_esi = 0, tf_ebp = -1077941464, tf_isp = -410092188, tf_ebx = 4, tf_edx = 672460376, tf_ecx = 25, tf_eax = 155, tf_trapno = 12, tf_err = 2, tf_eip = 671866647, tf_cs = 51, tf_eflags = 662, tf_esp = -1077941492, tf_ss = 59}) at /usr/src/sys/i386/i386/trap.c:984 #16 0xc090ecef in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:200 #17 0x00000033 in ?? () Previous frame inner to this frame (corrupt stack?) I have the crash dump and kernel build files still available, so let me know if I can provide any other useful output from the dump. The crash is in the NFS server part of the kernel, which is funny because NFS is only used for mail delivery. Between 0am and 11.59pm, 40 mails very delivered from the mailserver to maildirs on NFS. There were also a few clients checking their mailboxes via IMAP. The server booted again at 15.25, these are the imapd logs from around that time: Aug 13 15:22:48 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x], port=[63802], protocol=IMAP Aug 13 15:22:48 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x], port=[63803], protocol=IMAP Aug 13 15:22:49 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x], headers=0, body=0, rcvd=96, sent=470, time=1, starttls=1 Aug 13 15:22:49 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x], headers=0, body=0, rcvd=252, sent=944, time=1, starttls=1 Aug 13 15:22:49 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x], port=[63806], protocol=IMAP Aug 13 15:22:49 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x], headers=0, body=0, rcvd=38, sent=273, time=0, starttls=1 Aug 13 15:22:52 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x], port=[63808], protocol=IMAP Aug 13 15:22:52 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x], headers=0, body=0, rcvd=38, sent=273, time=0, starttls=1 Aug 13 15:22:53 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x], port=[63811], protocol=IMAP Aug 13 15:22:53 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x], headers=0, body=0, rcvd=106, sent=392, time=0, starttls=1 Aug 13 15:22:54 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x], port=[63812], protocol=IMAP <panic / reboot> Aug 13 15:26:35 eraser imapd-ssl: LOGIN, user=YYY, ip=[::ffff:x.x.x.x], port=[64582], protocol=IMAP Aug 13 15:26:35 eraser imapd-ssl: LOGOUT, user=YYY, ip=[::ffff:x.x.x.x], headers=0, body=0, rcvd=106, sent=394, time=0, starttls=1 Aug 13 15:26:50 eraser imapd-ssl: DISCONNECTED, user=XXX, ip=[::ffff:x.x.x.x], headers=0, body=0, rcvd=135, sent=10825, time=236, starttls=1 Aug 13 15:27:55 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x], port=[64975], protocol=IMAP Aug 13 15:27:55 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x], headers=0, body=0, rcvd=106, sent=392, time=1, starttls=1 The six almost simultaneous logins from user XXX look suspicious and from the timestamp seem to be just prior to the crash, but shouldn't lead to the crash of the nfs server. The lack of logout after the 15:22:54 one suggests that this was the point that the file server went offline. The only directory that was mounted at that time was in /export (/dev/mirror/gm1a): alson@damaged:~$ gmirror status Name Status Components mirror/gm0 DEGRADED ad4s1 mirror/gm1 COMPLETE ad8s1 ad10s1 The server had been up for about 250 days with similar (and usually significantly heavier) load with 6.2-STABLE before the upgrade. The evening after the crash, I received the following message from smartmontools (not sure if it's related): Aug 13 21:19:30 damaged smartd[881]: Device: /dev/ad10, 4 Currently unreadable (pending) sectors Aug 13 21:19:30 damaged smartd[881]: Device: /dev/ad10, 4 Offline uncorrectable sectors ad10 is part of the mirror, but I don't think this should be a problem since (a) the disk should have corrected it and (b) gmirror should have dropped the disk and try to read from ad8 if it had returned a read error. The server has been stable since, but hasn't seen much usage either. I have no idea if I can reproduce it. My first thought was disk corruption, but this doesn't appear to be the case: # fsck -f [...] ** /dev/mirror/gm1a ** Last Mounted on /export ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups FREE BLK COUNT(S) WRONG IN SUPERBLK SALVAGE? [yn] y 953252 files, 47080232 used, 44658220 free (159804 frags, 5562302 blocks, 0.2% fragmentation) ***** FILE SYSTEM MARKED CLEAN ***** ***** FILE SYSTEM WAS MODIFIED ***** dmesg is attached below. regards, Alson Copyright (c) 1992-2008 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 6.3-RELEASE-p3 #10: Mon Aug 11 22:01:47 CEST 2008 root@damaged.example.com:/usr/obj/usr/src/sys/DAMAGED Timecounter "i8254" frequency 1193182 Hz quality 0 CPU: Intel(R) Pentium(R) 4 CPU 3.00GHz (3000.12-MHz 686-class CPU) Origin = "GenuineIntel" Id = 0xf43 Stepping = 3 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> Features2=0x649d<SSE3,RSVD2,MON,DS_CPL,EST,CNXT-ID,CX16,xTPR> AMD Features=0x20000000<LM> Logical CPUs per core: 2 real memory = 1073545216 (1023 MB) avail memory = 1037168640 (989 MB) ACPI APIC Table: <A M I OEMAPIC > FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs cpu0 (BSP): APIC ID: 0 cpu1 (AP): APIC ID: 1 ioapic0 <Version 2.0> irqs 0-23 on motherboard kbd1 at kbdmux0 ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413) hptrr: HPT RocketRAID controller driver v1.1 (Aug 11 2008 22:01:29) acpi0: <A M I OEMRSDT> on motherboard acpi0: Power Button (fixed) Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000 acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0 cpu0: <ACPI CPU> on acpi0 acpi_throttle0: <ACPI CPU Throttling> on cpu0 cpu1: <ACPI CPU> on acpi0 pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0 pci0: <ACPI PCI bus> on pcib0 pcib1: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0 pci4: <ACPI PCI bus> on pcib1 pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0 pci3: <ACPI PCI bus> on pcib2 bge0: <Broadcom BCM5750 C1, ASIC rev. 0x4201> mem 0xfeaf0000-0xfeafffff irq 16 at device 0.0 on pci3 miibus0: <MII bus> on bge0 brgphy0: <BCM5750 10/100/1000baseTX PHY> on miibus0 brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto bge0: Ethernet address: 00:18:f3:2a:e0:cf pcib3: <ACPI PCI-PCI bridge> irq 17 at device 28.5 on pci0 pci2: <ACPI PCI bus> on pcib3 bge1: <Broadcom BCM5750 C1, ASIC rev. 0x4201> mem 0xfe9f0000-0xfe9fffff irq 17 at device 0.0 on pci2 miibus1: <MII bus> on bge1 brgphy1: <BCM5750 10/100/1000baseTX PHY> on miibus1 brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto bge1: Ethernet address: 00:18:f3:2a:e0:58 uhci0: <UHCI (generic) USB controller> port 0xd880-0xd89f irq 16 at device 29.0 on pci0 uhci0: [GIANT-LOCKED] usb0: <UHCI (generic) USB controller> on uhci0 usb0: USB revision 1.0 uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 uhub0: 2 ports with 2 removable, self powered uhci1: <UHCI (generic) USB controller> port 0xdc00-0xdc1f irq 17 at device 29.1 on pci0 uhci1: [GIANT-LOCKED] usb1: <UHCI (generic) USB controller> on uhci1 usb1: USB revision 1.0 uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 uhub1: 2 ports with 2 removable, self powered uhci2: <UHCI (generic) USB controller> port 0xe000-0xe01f irq 18 at device 29.2 on pci0 uhci2: [GIANT-LOCKED] usb2: <UHCI (generic) USB controller> on uhci2 usb2: USB revision 1.0 uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 uhub2: 2 ports with 2 removable, self powered uhci3: <UHCI (generic) USB controller> port 0xe080-0xe09f irq 19 at device 29.3 on pci0 uhci3: [GIANT-LOCKED] usb3: <UHCI (generic) USB controller> on uhci3 usb3: USB revision 1.0 uhub3: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 uhub3: 2 ports with 2 removable, self powered ehci0: <Intel 82801GB/R (ICH7) USB 2.0 controller> mem 0xfebff800-0xfebffbff irq 16 at device 29.7 on pci0 ehci0: [GIANT-LOCKED] usb4: EHCI version 1.0 usb4: companion controllers, 2 ports each: usb0 usb1 usb2 usb3 usb4: <Intel 82801GB/R (ICH7) USB 2.0 controller> on ehci0 usb4: USB revision 2.0 uhub4: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1 uhub4: 8 ports with 8 removable, self powered pcib4: <ACPI PCI-PCI bridge> at device 30.0 on pci0 pci1: <ACPI PCI bus> on pcib4 puc0: <Dolphin Peripherals 4036> port 0xcc00-0xcc1f,0xc880-0xc887,0xc800-0xc807 irq 20 at device 1.0 on pci1 sio4: <Dolphin Peripherals 4036> on puc0 sio4: type 16550A sio4: unable to activate interrupt in fast mode - using normal mode sio5: <Dolphin Peripherals 4036> on puc0 sio5: type 16550A sio5: unable to activate interrupt in fast mode - using normal mode pci1: <display, VGA> at device 2.0 (no driver attached) isab0: <PCI-ISA bridge> at device 31.0 on pci0 isa0: <ISA bus> on isab0 atapci0: <Intel ICH7 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 31.1 on pci0 ata0: <ATA channel 0> on atapci0 ata1: <ATA channel 1> on atapci0 atapci1: <Intel AHCI controller> port 0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0xe400-0xe41f mem 0xfebffc00-0xfebfffff irq 19 at device 31.2 on pci0 atapci1: AHCI Version 01.10 controller with 4 ports detected ata2: <ATA channel 0> on atapci1 ata3: <ATA channel 1> on atapci1 ata4: <ATA channel 2> on atapci1 ata5: <ATA channel 3> on atapci1 ichsmb0: <Intel 82801GB (ICH7) SMBus controller> port 0x400-0x41f at device 31.3 on pci0 ichsmb0: [GIANT-LOCKED] smbus0: <System Management Bus> on ichsmb0 smb0: <SMBus generic I/O> on smbus0 acpi_button0: <Power Button> on acpi0 fdc0: <floppy drive controller (FDE)> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0 fdc0: [FAST] ppc0: <Standard parallel printer port> port 0x378-0x37f irq 7 on acpi0 ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode ppbus0: <Parallel port bus> on ppc0 plip0: <PLIP network interface> on ppbus0 lpt0: <Printer> on ppbus0 lpt0: Interrupt-driven port ppi0: <Parallel I/O> on ppbus0 sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 sio0: type 16550A, console pmtimer0 on isa0 orm0: <ISA Option ROM> at iomem 0xc0000-0xc7fff on isa0 atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0 atkbd0: <AT Keyboard> irq 1 on atkbdc0 kbd0 at atkbd0 atkbd0: [GIANT-LOCKED] sc0: <System console> at flags 0x100 on isa0 sc0: VGA <16 virtual consoles, flags=0x300> vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0 Timecounters tick every 1.000 msec hptrr: no controller detected. acd0: CDROM <CD-ROM 56X/AKH/A80> at ata0-master PIO4 ad4: 35304MB <WDC WD360GD-00FLA2 31.08F31> at ata2-master SATA150 ad8: 190782MB <Seagate ST3200826AS 3.03> at ata4-master SATA150 GEOM_MIRROR: Device gm0 created (id=1495539501). GEOM_MIRROR: Device gm0: provider ad4s1 detected. ad10: 190782MB <Seagate ST3200826AS 3.03> at ata5-master SATA150 GEOM_MIRROR: Device gm1 created (id=1615731132). GEOM_MIRROR: Device gm1: provider ad8s1 detected. SMP: AP CPU #1 Launched! GEOM_MIRROR: Device gm1: provider ad10s1 detected. GEOM_MIRROR: Device gm1: provider ad8s1 activated. GEOM_MIRROR: Device gm1: provider mirror/gm1 launched. GEOM_MIRROR: Device gm1: rebuilding provider ad10s1. Root mount waiting for: GMIRROR Root mount waiting for: GMIRROR Root mount waiting for: GMIRROR Root mount waiting for: GMIRROR GEOM_MIRROR: Force device gm0 start due to timeout. GEOM_MIRROR: Device gm0: provider ad4s1 activated. GEOM_MIRROR: Device gm0: provider mirror/gm0 launched. Trying to mount root from ufs:/dev/mirror/gm0a vlan2: link state changed to UP vlan1: link state changed to UP bge1: link state changed to UP GEOM_MIRROR: Device gm1: rebuilding provider ad10s1 finished. GEOM_MIRROR: Device gm1: provider ad10s1 activated.