thr3ads.net - freebsd stable - Panic in nfs/ffs after upgrade from 6.2 to 6.3-RELEASE [Aug 2008]

If this information is useful, please help other people find it:
Share via:
Alson van der Meulen
2008-Aug-16 12:53 UTC
Panic in nfs/ffs after upgrade from 6.2 to 6.3-RELEASE

Hello,

This file server was upgraded from 6.2-RELEASE-p$something to
6.3-RELEASE-p3 (current RELENG_6_3 sources via csup) on August, 11 and
rebooted to the new kernel on August, 12. The school is closed due to
holidays. The only load is mail delivery to Maildirs via NFS from
another server (just 50 mailboxes or so, none of them very busy) and
nightly backups. I went there on August, 13 to perform some maintenance
on unrelated systems. Performed a Windows installation (+ assorted
desktop software) that fetched the installation files from Samba running
on this server. This procedure downloads a total of ~5G in bursts via a
100mbit link. After most of this was over (I think it wast just
installing patches at that point), the file server crashed.

The crash:

[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so:
Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
[...]

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x1c
fault code		= supervisor write, page not present
instruction pointer	= 0x20:0xc06fd937
stack pointer	        = 0x28:0xe78e78ec
frame pointer	        = 0x28:0xe78e7914
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, def32 1, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 847 (nfsd)
trap number		= 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
kdb_backtrace(100,c532f900,28,e78e78ac,c,...) at kdb_backtrace+0x29
panic(c0976f0c,c09cfe95,0,fffff,c532e49b,...) at panic+0x114
trap_fatal(e78e78ac,1c,c532f900,0,c,...) at trap_fatal+0x2ce
trap_pfault(e78e78ac,0,1c) at trap_pfault+0x1f7
trap(e78e0008,c06a0028,e78e0028,200012,d9156118,...) at trap+0x325
calltrap() at calltrap+0x5
--- trap 0xc, eip = 0xc06fd937, esp = 0xe78e78ec, ebp = 0xe78e7914 ---
getnewbuf(0,0,4000,4000) at getnewbuf+0x1bb
getblk(c5964cc0,0,0,4000,0,...) at getblk+0x360
cluster_read(c5964cc0,46e5,0,0,0,...) at cluster_read+0xde
ffs_read(e78e7b08) at ffs_read+0x25f
VOP_READ_APV(c0a65560,e78e7b08) at VOP_READ_APV+0x38
nfsrv_read(c69b3800,c520d900,c532f900,e78e7c98,0,...) at nfsrv_read+0xb16
nfssvc_nfsd(c532f900) at nfssvc_nfsd+0x435
nfssvc(c532f900,e78e7d04) at nfssvc+0x1c0
syscall(3b,3b,3b,1,0,...) at syscall+0x2bf
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (155, FreeBSD ELF32, nfssvc), eip = 0x280bdf17, esp = 0xbfbfeb0c,
ebp = 0xbfbfeb28 ---
Uptime: 12h24m16s
Dumping 1023 MB (2 chunks)
  chunk 0: 1MB (159 pages) ... ok
  chunk 1: 1023MB (261840 pages) 1007 991 975 959 943 927 911 895 879 863 847
831 815 799 783 767 751 735 719 703 687 671 655 639 623 607 591 575 559 543 527
511 495 479 463 447 431 415 399 383 367 351 335 319 303 287 271 255 239 223 207
191 175 159 143 127 111 95 79 63 47 31 15

#0  doadump () at pcpu.h:165
165		__asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) list *0xc06fd937
0xc06fd937 is in getnewbuf (atomic.h:149).
144	static __inline int
145	atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src)
146	{
147		int res = exp;
148	
149		__asm __volatile (
150		"	" __XSTRING(MPLOCKED) "	"
151		"	cmpxchgl %2,%1 ;	"
152		"       setz	%%al ;		"
153		"	movzbl	%%al,%0 ;	"
(kgdb) bt
#0  doadump () at pcpu.h:165
#1  0xc06b25fe in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2  0xc06b2955 in panic (fmt=0xc0976f0c "%s") at
/usr/src/sys/kern/kern_shutdown.c:565
#3  0xc0924c86 in trap_fatal (frame=0xe78e78ac, eva=28) at
/usr/src/sys/i386/i386/trap.c:838
#4  0xc092498f in trap_pfault (frame=0xe78e78ac, usermode=0, eva=28) at
/usr/src/sys/i386/i386/trap.c:745
#5  0xc0924585 in trap (frame      {tf_fs = -410124280, tf_es = -1066794968,
tf_ds = -410124248, tf_edi = 2097170, tf_esi = -652910312, tf_ebp = -410093292,
tf_isp = -410093352, tf_ebx = 0, tf_edx = -1062624684, tf_ecx = -986515200,
tf_eax = 4, tf_trapno = 12, tf_err = 2, tf_eip = -1066411721, tf_cs = 32,
tf_eflags = 66182, tf_esp = 1, tf_ss = 4}) at /usr/src/sys/i386/i386/trap.c:435
#6  0xc090ec9a in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#7  0xc06fd937 in getnewbuf (slpflag=0, slptimeo=0, size=16384, maxsize=16384)
at atomic.h:149
#8  0xc06fefdc in getblk (vp=0xc5964cc0, blkno=0, size=16384, slpflag=0,
slptimeo=0, flags=0)
    at /usr/src/sys/kern/vfs_bio.c:2516
#9  0xc0702a8a in cluster_read (vp=0xc5964cc0, filesize=18149, lblkno=0,
size=16384, cred=0x0, totread=8192, seqcount=5,
    bpp=0x4) at /usr/src/sys/kern/vfs_cluster.c:118
#10 0xc081213b in ffs_read (ap=0x4) at /usr/src/sys/ufs/ffs/ffs_vnops.c:503
#11 0xc09368a0 in VOP_READ_APV (vop=0x4, a=0xc0a9a254) at vnode_if.c:643
#12 0xc07ad086 in nfsrv_read (nfsd=0xc69b3800, slp=0xc520d900, td=0xc532f900,
mrq=0xe78e7c98) at vnode_if.h:343
#13 0xc07bd329 in nfssvc_nfsd (td=0x4) at
/usr/src/sys/nfsserver/nfs_syscalls.c:474
#14 0xc07bcb08 in nfssvc (td=0xc532f900, uap=0xe78e7d04) at
/usr/src/sys/nfsserver/nfs_syscalls.c:181
#15 0xc0924fcb in syscall (frame      {tf_fs = 59, tf_es = 59, tf_ds = 59,
tf_edi = 1, tf_esi = 0, tf_ebp = -1077941464, tf_isp = -410092188, tf_ebx = 4,
tf_edx = 672460376, tf_ecx = 25, tf_eax = 155, tf_trapno = 12, tf_err = 2,
tf_eip = 671866647, tf_cs = 51, tf_eflags = 662, tf_esp = -1077941492, tf_ss =
59}) at /usr/src/sys/i386/i386/trap.c:984
#16 0xc090ecef in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:200
#17 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)

I have the crash dump and kernel build files still available, so let me
know if I can provide any other useful output from the dump.

The crash is in the NFS server part of the kernel, which is funny
because NFS is only used for mail delivery. Between 0am and 11.59pm, 40
mails very delivered from the mailserver to maildirs on NFS. There were
also a few clients checking their mailboxes via IMAP. The server
booted again at 15.25, these are the imapd logs from around that time:
Aug 13 15:22:48 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x],
port=[63802], protocol=IMAP
Aug 13 15:22:48 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x],
port=[63803], protocol=IMAP
Aug 13 15:22:49 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x],
headers=0, body=0, rcvd=96, sent=470, time=1, starttls=1
Aug 13 15:22:49 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x],
headers=0, body=0, rcvd=252, sent=944, time=1, starttls=1
Aug 13 15:22:49 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x],
port=[63806], protocol=IMAP
Aug 13 15:22:49 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x],
headers=0, body=0, rcvd=38, sent=273, time=0, starttls=1
Aug 13 15:22:52 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x],
port=[63808], protocol=IMAP
Aug 13 15:22:52 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x],
headers=0, body=0, rcvd=38, sent=273, time=0, starttls=1
Aug 13 15:22:53 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x],
port=[63811], protocol=IMAP
Aug 13 15:22:53 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x],
headers=0, body=0, rcvd=106, sent=392, time=0, starttls=1
Aug 13 15:22:54 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x],
port=[63812], protocol=IMAP
<panic / reboot>
Aug 13 15:26:35 eraser imapd-ssl: LOGIN, user=YYY, ip=[::ffff:x.x.x.x],
port=[64582], protocol=IMAP
Aug 13 15:26:35 eraser imapd-ssl: LOGOUT, user=YYY, ip=[::ffff:x.x.x.x],
headers=0, body=0, rcvd=106, sent=394, time=0, starttls=1
Aug 13 15:26:50 eraser imapd-ssl: DISCONNECTED, user=XXX, ip=[::ffff:x.x.x.x],
headers=0, body=0, rcvd=135, sent=10825, time=236, starttls=1
Aug 13 15:27:55 eraser imapd-ssl: LOGIN, user=XXX, ip=[::ffff:x.x.x.x],
port=[64975], protocol=IMAP
Aug 13 15:27:55 eraser imapd-ssl: LOGOUT, user=XXX, ip=[::ffff:x.x.x.x],
headers=0, body=0, rcvd=106, sent=392, time=1, starttls=1

The six almost simultaneous logins from user XXX look suspicious and
from the timestamp seem to be just prior to the crash, but shouldn't
lead to the crash of the nfs server. The lack of logout after the
15:22:54 one suggests that this was the point that the file server went
offline.

The only directory that was mounted at that time was in /export
(/dev/mirror/gm1a):
alson@damaged:~$ gmirror status
      Name    Status  Components
mirror/gm0  DEGRADED  ad4s1
mirror/gm1  COMPLETE  ad8s1
                      ad10s1

The server had been up for about 250 days with similar (and usually
significantly heavier) load with 6.2-STABLE before the upgrade. The
evening after the crash, I received the following message from
smartmontools (not sure if it's related):
Aug 13 21:19:30 damaged smartd[881]: Device: /dev/ad10, 4 Currently unreadable
(pending) sectors
Aug 13 21:19:30 damaged smartd[881]: Device: /dev/ad10, 4 Offline uncorrectable
sectors

ad10 is part of the mirror, but I don't think this should be a problem
since (a) the disk should have corrected it and (b) gmirror should have
dropped the disk and try to read from ad8 if it had returned a read
error.

The server has been stable since, but hasn't seen much usage either. I
have no idea if I can reproduce it.

My first thought was disk corruption, but this doesn't appear to be the
case:
# fsck -f 
[...]
** /dev/mirror/gm1a
** Last Mounted on /export
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y
953252 files, 47080232 used, 44658220 free (159804 frags, 5562302 blocks, 0.2%
fragmentation)

***** FILE SYSTEM MARKED CLEAN *****

***** FILE SYSTEM WAS MODIFIED *****

dmesg is attached below.

regards,
Alson


Copyright (c) 1992-2008 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.3-RELEASE-p3 #10: Mon Aug 11 22:01:47 CEST 2008
    root@damaged.example.com:/usr/obj/usr/src/sys/DAMAGED
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 3.00GHz (3000.12-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf43  Stepping = 3
 
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x649d<SSE3,RSVD2,MON,DS_CPL,EST,CNXT-ID,CX16,xTPR>
  AMD Features=0x20000000<LM>
  Logical CPUs per core: 2
real memory  = 1073545216 (1023 MB)
avail memory = 1037168640 (989 MB)
ACPI APIC Table: <A M I  OEMAPIC >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
ioapic0 <Version 2.0> irqs 0-23 on motherboard
kbd1 at kbdmux0
ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
hptrr: HPT RocketRAID controller driver v1.1 (Aug 11 2008 22:01:29)
acpi0: <A M I OEMRSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: <ACPI CPU> on acpi0
acpi_throttle0: <ACPI CPU Throttling> on cpu0
cpu1: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci4: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
pci3: <ACPI PCI bus> on pcib2
bge0: <Broadcom BCM5750 C1, ASIC rev. 0x4201> mem 0xfeaf0000-0xfeafffff
irq 16 at device 0.0 on pci3
miibus0: <MII bus> on bge0
brgphy0: <BCM5750 10/100/1000baseTX PHY> on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
bge0: Ethernet address: 00:18:f3:2a:e0:cf
pcib3: <ACPI PCI-PCI bridge> irq 17 at device 28.5 on pci0
pci2: <ACPI PCI bus> on pcib3
bge1: <Broadcom BCM5750 C1, ASIC rev. 0x4201> mem 0xfe9f0000-0xfe9fffff
irq 17 at device 0.0 on pci2
miibus1: <MII bus> on bge1
brgphy1: <BCM5750 10/100/1000baseTX PHY> on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
bge1: Ethernet address: 00:18:f3:2a:e0:58
uhci0: <UHCI (generic) USB controller> port 0xd880-0xd89f irq 16 at device
29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: <UHCI (generic) USB controller> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <UHCI (generic) USB controller> port 0xdc00-0xdc1f irq 17 at device
29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: <UHCI (generic) USB controller> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <UHCI (generic) USB controller> port 0xe000-0xe01f irq 18 at device
29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: <UHCI (generic) USB controller> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
uhci3: <UHCI (generic) USB controller> port 0xe080-0xe09f irq 19 at device
29.3 on pci0
uhci3: [GIANT-LOCKED]
usb3: <UHCI (generic) USB controller> on uhci3
usb3: USB revision 1.0
uhub3: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
ehci0: <Intel 82801GB/R (ICH7) USB 2.0 controller> mem
0xfebff800-0xfebffbff irq 16 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
usb4: EHCI version 1.0
usb4: companion controllers, 2 ports each: usb0 usb1 usb2 usb3
usb4: <Intel 82801GB/R (ICH7) USB 2.0 controller> on ehci0
usb4: USB revision 2.0
uhub4: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub4: 8 ports with 8 removable, self powered
pcib4: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci1: <ACPI PCI bus> on pcib4
puc0: <Dolphin Peripherals 4036> port
0xcc00-0xcc1f,0xc880-0xc887,0xc800-0xc807 irq 20 at device 1.0 on pci1
sio4: <Dolphin Peripherals 4036> on puc0
sio4: type 16550A
sio4: unable to activate interrupt in fast mode - using normal mode
sio5: <Dolphin Peripherals 4036> on puc0
sio5: type 16550A
sio5: unable to activate interrupt in fast mode - using normal mode
pci1: <display, VGA> at device 2.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH7 UDMA100 controller> port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
atapci1: <Intel AHCI controller> port
0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0xe400-0xe41f mem
0xfebffc00-0xfebfffff irq 19 at device 31.2 on pci0
atapci1: AHCI Version 01.10 controller with 4 ports detected
ata2: <ATA channel 0> on atapci1
ata3: <ATA channel 1> on atapci1
ata4: <ATA channel 2> on atapci1
ata5: <ATA channel 3> on atapci1
ichsmb0: <Intel 82801GB (ICH7) SMBus controller> port 0x400-0x41f at
device 31.3 on pci0
ichsmb0: [GIANT-LOCKED]
smbus0: <System Management Bus> on ichsmb0
smb0: <SMBus generic I/O> on smbus0
acpi_button0: <Power Button> on acpi0
fdc0: <floppy drive controller (FDE)> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2
on acpi0
fdc0: [FAST]
ppc0: <Standard parallel printer port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on
acpi0
sio0: type 16550A, console
pmtimer0 on isa0
orm0: <ISA Option ROM> at iomem 0xc0000-0xc7fff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
hptrr: no controller detected.
acd0: CDROM <CD-ROM 56X/AKH/A80> at ata0-master PIO4
ad4: 35304MB <WDC WD360GD-00FLA2 31.08F31> at ata2-master SATA150
ad8: 190782MB <Seagate ST3200826AS 3.03> at ata4-master SATA150
GEOM_MIRROR: Device gm0 created (id=1495539501).
GEOM_MIRROR: Device gm0: provider ad4s1 detected.
ad10: 190782MB <Seagate ST3200826AS 3.03> at ata5-master SATA150
GEOM_MIRROR: Device gm1 created (id=1615731132).
GEOM_MIRROR: Device gm1: provider ad8s1 detected.
SMP: AP CPU #1 Launched!
GEOM_MIRROR: Device gm1: provider ad10s1 detected.
GEOM_MIRROR: Device gm1: provider ad8s1 activated.
GEOM_MIRROR: Device gm1: provider mirror/gm1 launched.
GEOM_MIRROR: Device gm1: rebuilding provider ad10s1.
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
GEOM_MIRROR: Force device gm0 start due to timeout.
GEOM_MIRROR: Device gm0: provider ad4s1 activated.
GEOM_MIRROR: Device gm0: provider mirror/gm0 launched.
Trying to mount root from ufs:/dev/mirror/gm0a
vlan2: link state changed to UP
vlan1: link state changed to UP
bge1: link state changed to UP
GEOM_MIRROR: Device gm1: rebuilding provider ad10s1 finished.
GEOM_MIRROR: Device gm1: provider ad10s1 activated.
Possibly Parallel Threads

Search for more apparently analagous threads
freebsd stable - Aug 2008 - Panic in nfs/ffs after upgrade from 6.2 to 6.3-RELEASE

Panic in nfs/ffs after upgrade from 6.2 to 6.3-RELEASE

Possibly Parallel Threads

Wisdom of the Ancients