On several occasions, on different remote servers, I have had NFS cause the server to stop responding, so I stopped using it. The servers were all running 6.0-RELEASE, 6.1-RELEASE or 6-STABLE.

We recently discovered sshfs, which supports cross-platform mounting. The server is Linux, and I mounted it on a FreeBSD 6.1-RELEASE box that is up to date on the security branch.

It worked fine for around 5 to 6 days. sshfs sometimes failed to show files that had been updated, but that did not compromise the stability of the FreeBSD server; I just remounted to stay up to date. Then today the Linux server had network problems, so the sshfs mounts timed out. I mount two directories. The first mounted fine, a bit slowly, but when I ran the command to mount the second directory the server stopped responding.

My second ssh terminal was still alive, so I tried to run top to see whether sshfs was hanging, but when I hit enter top did not run and that terminal froze as well. Note that neither terminal timed out, and an ircd running on the server did not time out either, but the box was not accepting any new requests. It was responding to pings fine.

I have a remote reboot facility on the box, but no local access and no KVM or serial console; this is the case for all of my servers. I first tried a soft reboot, which uses ctrl-alt-delete, but the pings kept replying, so I could see the reboot was not initiated, which indicates some kind of console lockup as well. I then did a hard reboot, which brought the server back.

All logging stopped when the first lockup occurred, so no errors were recorded; bear in mind I have no local access to this machine. It does appear that 6.x has some kind of serious remote mounting bug, because I never had these NFS problems in FreeBSD 5.x.

I would be interested in any thoughts on what could help. I have now rebooted the server with the MPSAFE network stack disabled to see if that helps. It is running a GENERIC kernel with the following changes:
options directio, polling, noadaptive mutexes, adaptive giant, ipv6 and nfs disabled.

dmesg output is below. I left in the part of the reboot showing the vnodes because it also looks suspicious that it took so long to sync the disks. This was from a working reboot; the remote reboot was of course an improper shutdown. The hard disk is SATA2 but dmesg shows it as UDMA33.

Syncing disks, vnodes remaining...3 3 1 0 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 0 0 0 done
All buffers synced.
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
 The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE-p10 #1: Sat Nov 11 23:02:09 GMT 2006
 admin@heaven.chrysalisnet.org:/usr/obj/usr/src/sys/HEAVEN
WARNING: MPSAFE network stack disabled, expect reduced performance.
ACPI APIC Table: <A M I OEMAPIC >
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) 64 Processor 3800+ (2410.95-MHz 686-class CPU)
 Origin = "AuthenticAMD" Id = 0x40ff2 Stepping = 2
 Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
 Features2=0x2001<SSE3,CX16>
 AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow+,3DNow>
 AMD Features2=0x1d<LAHF,<b2>,<b3>,CR8>
real memory = 939261952 (895 MB)
avail memory = 909828096 (867 MB)
ioapic0 <Version 1.1> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <A M I OEMRSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x508-0x50b on acpi0
cpu0: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <memory, RAM> at device 0.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 1.0 on pci0
isa0: <ISA bus> on isab0
pci0: <serial bus, SMBus> at device 1.1 (no driver attached)
pci0: <memory, RAM> at device 1.2 (no driver attached)
pci0: <processor> at device 1.3 (no driver attached)
ohci0: <OHCI (generic) USB controller> mem 0xdfe7f000-0xdfe7ffff irq 21 at device 2.0 on pci0
ohci0: [GIANT-LOCKED]
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: <OHCI (generic) USB controller> on ohci0
usb0: USB revision 1.0
uhub0: nVidia OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 10 ports with 10 removable, self powered
ehci0: <EHCI (generic) USB 2.0 controller> mem 0xdfe7ec00-0xdfe7ecff irq 22 at device 2.1 on pci0
ehci0: [GIANT-LOCKED]
usb1: EHCI version 1.0
usb1: companion controller, 10 ports each: usb0
usb1: <EHCI (generic) USB 2.0 controller> on ehci0
usb1: USB revision 2.0
uhub1: nVidia EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub1: 10 ports with 10 removable, self powered
pcib1: <ACPI PCI-PCI bridge> at device 4.0 on pci0
pci1: <ACPI PCI bus> on pcib1
fxp0: <Intel 82550 Pro/100 Ethernet> port 0xec00-0xec3f mem 0xdffff000-0xdfffffff,0xdffc0000-0xdffdffff irq 16 at device 6.0 on pci1
miibus0: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:02:b3:bf:b5:c9
fxp0: [GIANT-LOCKED]
pci0: <multimedia> at device 5.0 (no driver attached)
atapci0: <GENERIC ATA controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 6.0 on pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
atapci1: <GENERIC ATA controller> port 0xd480-0xd487,0xd400-0xd403,0xd080-0xd087,0xd000-0xd003,0xcc00-0xcc0f mem 0xdfe7d000-0xdfe7dfff irq 20 at device 8.0 on pci0
ata2: <ATA channel 0> on atapci1
ata3: <ATA channel 1> on atapci1
atapci2: <GENERIC ATA controller> port 0xc880-0xc887,0xc800-0xc803,0xc480-0xc487,0xc400-0xc403,0xc080-0xc08f mem 0xdfe7c000-0xdfe7cfff irq 21 at device 8.1 on pci0
ata4: <ATA channel 0> on atapci2
ata5: <ATA channel 1> on atapci2
pcib2: <ACPI PCI-PCI bridge> at device 9.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> at device 11.0 on pci0
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> at device 12.0 on pci0
pci4: <ACPI PCI bus> on pcib4
pci0: <display, VGA> at device 13.0 (no driver attached)
acpi_button0: <Power Button> on acpi0
fdc0: <floppy drive controller (FDE)> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: [FAST]
ppc0: <Standard parallel printer port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
pmtimer0 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounter "TSC" frequency 2410945801 Hz quality 800
Timecounters tick every 1.000 msec
ad4: 238475MB <SAMSUNG SP2504C VT100-41> at ata2-master UDMA33
Trying to mount root from ufs:/dev/ad4s1a

Regards
Chris
On Wed, Nov 22, 2006 at 05:49:12AM +0000, Chris wrote:
> On a few occasions all different remote servers I have had nfs cause
> servers to stop responding so I stopped using it all the servers were
> either 6.0 release 6.1 release or 6-stable.
>
> We recently discovered sshfs which supports cross platform mounting
> server is linux and I mounted on a freebsd 6.1 release using security
> branch up to date.
>
> it was working fine for around 5 to 6 days with some problems with
> sshfs not updating files that are updated but wasn't compromising the
> stability of the freebsd server I just remounted to keep up to date.
> Then today the linux server had network problems so the sshfs timed
> out and there is 2 dirs I mount, the first mounted fine a bit slow but
> connected but when I ran the command to mount the 2nd dir the server
> stopped responding.
>
> My 2nd ssh terminal was alive I tried to run top to see if sshfs was
> hanging or something but when I hit enter top didn't run and the 2nd
> terminal was frozen, note both terminals didn't timeout and a ircd
> running on the server also did not timeout but the box wasn't listening
> to any new requests, it was responding to pings fine.
>
> I have a remote reboot facility on the box but no local access and no
> kvm/serial console facility available this is the case for all of my
> servers. I initially tried a soft reboot which uses ctrl-alt-delete
> but the pings kept replying so I could see the reboot wasn't initiated
> indicating some kind of console lockup as well, I then did a hard
> reboot which brought the server back.
>
> All logs stopped when the first lockup occurred so no errors etc.
> recorded bear in mind I have no local access to this machine. It does
> appear that 6.x has some kind of serious remote mounting bug because I
> never had these nfs problems in freebsd 5.x.
>
> I would be interested in any thoughts as to what could help me I have
> rebooted the server now with network mpsafe disabled to see if this
> will help it is using a generic kernel with the following changes.

Sounds like your "sshfs" is causing the kernel to deadlock in that error situation. You can confirm by enabling DEBUG_LOCKS and DEBUG_VFS_LOCKS, then breaking to DDB and running 'show lockedvnods' when the deadlock occurs.

If you're still having problems with NFS on 6.2, we'd much rather you reported those so that we can investigate and try to fix them.

Kris
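For readers following along, the debugging setup suggested in the reply above might look roughly like the kernel configuration fragment below. This is only a sketch under stated assumptions, not a tested configuration: DEBUG_LOCKS and DEBUG_VFS_LOCKS are the options named in the reply, while KDB and DDB are added here on the assumption that the custom kernel needs them in order to have a debugger to break into. Breaking to DDB also needs some form of console access, which the machine in this thread does not have.

```
# Additions to the custom kernel config (sketch; rebuild and reinstall the kernel afterwards)
options KDB              # assumed: kernel debugger framework, required by DDB
options DDB              # assumed: interactive in-kernel debugger
options DEBUG_LOCKS      # named in the reply: extra lock debugging information
options DEBUG_VFS_LOCKS  # named in the reply: vnode lock debugging

# Once the hang occurs, at the DDB prompt on the console:
#   db> show lockedvnods
```

The output of 'show lockedvnods' is what would be worth posting back to the list along with the report.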
Chris <chrcoluk@gmail.com> wrote:
> kris a development on this, someone else posted about a nfs problem
> and reading his post some startling point he made about network cards,
> he stated he only gets the bug on sis rl and fxp.

Sorry for the misunderstanding, but I think that the 'NFS via TCP' thread covers other bugs, i.e. the inability to mount NFS v3 over TCP. I've tested the cards above, and the person I replied to encountered the same bug with a bge card.

My solution was to remove the custom NFS settings in sysctl.conf. I don't know which one was the culprit because I don't have the time to look into it further.

My poking uncovered a set of crashing bugs and potentially a livelock. I would agree that NFS is very fragile in RELENG_6. So far <fingers crossed> I've not run into the NFS server deadlock you described.

scs
On 14/12/06, Kris Kennaway <kris@obsecurity.org> wrote:
> On Thu, Dec 14, 2006 at 01:28:48AM +0000, Chris wrote:
>
> > It does make sense if that's the problem since the entire server even
> > locally stops working properly, and it always follows an unexpected
> > nfs/sshfs disconnection, i.e. a network timeout.
> >
> > I am now running 6.2-RC that has the new file and currently at 1 day
> > 11hrs uptime.
>
> OK, thanks for following part of the advice I gave a month ago ;) Let
> us know if the problems persist.
>
> Kris

Will do. I also have the parts for a local machine now, so if it persists I will get that online to diagnose locally.

Chris
On Mon, Dec 18, 2006 at 12:39:13AM +0000, Chris wrote:
> On 14/12/06, Kris Kennaway <kris@obsecurity.org> wrote:
> > On Thu, Dec 14, 2006 at 01:28:48AM +0000, Chris wrote:
> >
> > > It does make sense if that's the problem since the entire server even
> > > locally stops working properly, and it always follows an unexpected
> > > nfs/sshfs disconnection, i.e. a network timeout.
> > >
> > > I am now running 6.2-RC that has the new file and currently at 1 day
> > > 11hrs uptime.
> >
> > OK, thanks for following part of the advice I gave a month ago ;) Let
> > us know if the problems persist.
> >
> > Kris
>
> Early today the nfs hub was rebooted so had an unexpected disconnection
> also noted by the sshfs timeout prompt waiting for me in the terminal,
> was able to remount fine and no server lockup or other problems.
>
> Current uptime is 5 days, 10:48

OK, good to know.

Thanks,
Kris
On 14/12/06, Kris Kennaway <kris@obsecurity.org> wrote:
> On Thu, Dec 14, 2006 at 01:28:48AM +0000, Chris wrote:
>
> > It does make sense if that's the problem since the entire server even
> > locally stops working properly, and it always follows an unexpected
> > nfs/sshfs disconnection, i.e. a network timeout.
> >
> > I am now running 6.2-RC that has the new file and currently at 1 day
> > 11hrs uptime.
>
> OK, thanks for following part of the advice I gave a month ago ;) Let
> us know if the problems persist.
>
> Kris

Early today the NFS hub was rebooted, so there was an unexpected disconnection, also shown by the sshfs timeout prompt waiting for me in the terminal. I was able to remount fine, with no server lockup or other problems.

Current uptime is 5 days, 10:48

Chris
On 18/12/06, Kris Kennaway <kris@obsecurity.org> wrote:
> On Mon, Dec 18, 2006 at 12:39:13AM +0000, Chris wrote:
> > On 14/12/06, Kris Kennaway <kris@obsecurity.org> wrote:
> > > On Thu, Dec 14, 2006 at 01:28:48AM +0000, Chris wrote:
> > >
> > > > It does make sense if that's the problem since the entire server even
> > > > locally stops working properly, and it always follows an unexpected
> > > > nfs/sshfs disconnection, i.e. a network timeout.
> > > >
> > > > I am now running 6.2-RC that has the new file and currently at 1 day
> > > > 11hrs uptime.
> > >
> > > OK, thanks for following part of the advice I gave a month ago ;) Let
> > > us know if the problems persist.
> > >
> > > Kris
> >
> > Early today the nfs hub was rebooted so had an unexpected disconnection
> > also noted by the sshfs timeout prompt waiting for me in the terminal,
> > was able to remount fine and no server lockup or other problems.
> >
> > Current uptime is 5 days, 10:48
>
> OK, good to know.
>
> Thanks,
> Kris

Some bad news. I was offline for a day, then logged in today, reattached to screen, and was greeted with a timeout message for the sshfs server. At that point the server was still functioning fine. When I ran the sshfs command again it locked up, with only pings responding, and I had to hard reboot it.

I will set up my local machine now so I can do proper debugging for you.

Chris