On a few occasions, on several different remote servers, I have had NFS
cause the server to stop responding, so I stopped using it. All of those
servers were running 6.0-RELEASE, 6.1-RELEASE or 6-STABLE.

We recently discovered sshfs, which supports cross-platform mounting.
The server is Linux, and I mounted it on a FreeBSD 6.1-RELEASE box that
is up to date on the security branch.

It worked fine for around 5 to 6 days. There were some problems with
sshfs not showing changes to files that had been updated, but this was
not compromising the stability of the FreeBSD server; I just remounted
to bring things up to date. Then today the Linux server had network
problems, so the sshfs mounts timed out. I mount 2 dirs: the first
remounted fine (a bit slow, but it connected), but when I ran the
command to mount the 2nd dir the server stopped responding.

My 2nd ssh terminal was still alive, so I tried to run top to see if
sshfs was hanging or something, but when I hit enter top did not run and
the 2nd terminal froze as well. Note that neither terminal timed out,
and an ircd running on the server also did not time out, but the box was
not answering any new requests. It was responding to pings fine.

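The symptom above (pings answered, established sessions kept open, but nothing new being served) can be told apart from a plain network outage by checking whether sshd still sends its identification banner on a fresh connection. A hung box may still complete the TCP handshake in the kernel without ever sending the banner. What follows is a minimal sketch, assuming Python 3 on some other monitoring host; the function names `ssh_banner` and `looks_alive` are hypothetical, and the 5 second timeout is an arbitrary choice.

```python
import socket


def ssh_banner(host, port=22, timeout=5.0):
    """Return the first bytes the server sends after connect, or None.

    None covers every failure: refused or timed-out connect, a connection
    that opens but stays silent, or a peer that closes without sending.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
            return data or None
    except OSError:  # socket.timeout is a subclass of OSError
        return None


def looks_alive(host, port=22, timeout=5.0):
    """True only if the peer greets us with an SSH identification string."""
    banner = ssh_banner(host, port, timeout)
    return banner is not None and banner.startswith(b"SSH-")
```

Run from cron on another machine, a `False` result while pings still succeed would match the lockup described here; it says nothing about the cause.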
I have a remote reboot facility on the box, but no local access and no
KVM/serial console facility; this is the case for all of my servers. I
initially tried a soft reboot, which uses ctrl-alt-delete, but the pings
kept replying, so I could see the reboot wasn't initiated, indicating
some kind of console lockup as well. I then did a hard reboot, which
brought the server back.

All logging stopped when the first lockup occurred, so no errors etc.
were recorded; bear in mind I have no local access to this machine. It
does appear to me that 6.x has some kind of serious remote mounting bug,
because I never had these NFS problems in FreeBSD 5.x.

I would be interested in any thoughts as to what could help me. I have
rebooted the server now with the MPSAFE network stack disabled to see if
this will help. It is using a GENERIC kernel with the following changes:
options directio, polling, noadaptive mutexes, adaptive giant, ipv6 and
nfs disabled.

dmesg output is below. I left in the reboot lines showing the vnodes
because they also look suspicious: it took a long time to sync the
disks. Those lines are from a working reboot; the remote hard reboot was
of course an improper shutdown.

The HD is SATA2 but dmesg shows it as UDMA33.

Syncing disks, vnodes remaining...3 3 1 0 2 1 1 1 1 1 1 1 1 1 1 1 2 1
1 1 1 1 1 0 0 0 done
All buffers synced.
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE-p10 #1: Sat Nov 11 23:02:09 GMT 2006
admin@heaven.chrysalisnet.org:/usr/obj/usr/src/sys/HEAVEN
WARNING: MPSAFE network stack disabled, expect reduced performance.
ACPI APIC Table: <A M I OEMAPIC >
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) 64 Processor 3800+ (2410.95-MHz 686-class CPU)
Origin = "AuthenticAMD" Id = 0x40ff2 Stepping = 2
Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
Features2=0x2001<SSE3,CX16>
AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow+,3DNow>
AMD Features2=0x1d<LAHF,<b2>,<b3>,CR8>
real memory = 939261952 (895 MB)
avail memory = 909828096 (867 MB)
ioapic0 <Version 1.1> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <A M I OEMRSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x508-0x50b on acpi0
cpu0: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <memory, RAM> at device 0.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 1.0 on pci0
isa0: <ISA bus> on isab0
pci0: <serial bus, SMBus> at device 1.1 (no driver attached)
pci0: <memory, RAM> at device 1.2 (no driver attached)
pci0: <processor> at device 1.3 (no driver attached)
ohci0: <OHCI (generic) USB controller> mem 0xdfe7f000-0xdfe7ffff irq
21 at device 2.0 on pci0
ohci0: [GIANT-LOCKED]
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: <OHCI (generic) USB controller> on ohci0
usb0: USB revision 1.0
uhub0: nVidia OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 10 ports with 10 removable, self powered
ehci0: <EHCI (generic) USB 2.0 controller> mem 0xdfe7ec00-0xdfe7ecff
irq 22 at device 2.1 on pci0
ehci0: [GIANT-LOCKED]
usb1: EHCI version 1.0
usb1: companion controller, 10 ports each: usb0
usb1: <EHCI (generic) USB 2.0 controller> on ehci0
usb1: USB revision 2.0
uhub1: nVidia EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub1: 10 ports with 10 removable, self powered
pcib1: <ACPI PCI-PCI bridge> at device 4.0 on pci0
pci1: <ACPI PCI bus> on pcib1
fxp0: <Intel 82550 Pro/100 Ethernet> port 0xec00-0xec3f mem 0xdffff000-0xdfffffff,0xdffc0000-0xdffdffff irq 16 at device 6.0 on pci1
miibus0: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:02:b3:bf:b5:c9
fxp0: [GIANT-LOCKED]
pci0: <multimedia> at device 5.0 (no driver attached)
atapci0: <GENERIC ATA controller> port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 6.0 on
pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
atapci1: <GENERIC ATA controller> port
0xd480-0xd487,0xd400-0xd403,0xd080-0xd087,0xd000-0xd003,0xcc00-0xcc0f
mem 0xdfe7d000-0xdfe7dfff irq 20 at device 8.0 on pci0
ata2: <ATA channel 0> on atapci1
ata3: <ATA channel 1> on atapci1
atapci2: <GENERIC ATA controller> port
0xc880-0xc887,0xc800-0xc803,0xc480-0xc487,0xc400-0xc403,0xc080-0xc08f
mem 0xdfe7c000-0xdfe7cfff irq 21 at device 8.1 on pci0
ata4: <ATA channel 0> on atapci2
ata5: <ATA channel 1> on atapci2
pcib2: <ACPI PCI-PCI bridge> at device 9.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> at device 11.0 on pci0
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> at device 12.0 on pci0
pci4: <ACPI PCI bus> on pcib4
pci0: <display, VGA> at device 13.0 (no driver attached)
acpi_button0: <Power Button> on acpi0
fdc0: <floppy drive controller (FDE)> port 0x3f0-0x3f5,0x3f7 irq 6 drq
2 on acpi0
fdc0: [FAST]
ppc0: <Standard parallel printer port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on
acpi0
sio0: type 16550A
pmtimer0 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounter "TSC" frequency 2410945801 Hz quality 800
Timecounters tick every 1.000 msec
ad4: 238475MB <SAMSUNG SP2504C VT100-41> at ata2-master UDMA33
Trying to mount root from ufs:/dev/ad4s1a
Regards
Chris

On Wed, Nov 22, 2006 at 05:49:12AM +0000, Chris wrote:
> [...]
> I would be interested in any thoughts as to what could help me I have
> rebooted the server now with network mpsafe disabled to see if this
> will help it is using a generic kernel with the following changes.

Sounds like your "sshfs" is causing the kernel to deadlock in that error
situation. You can confirm by enabling DEBUG_LOCKS and DEBUG_VFS_LOCKS,
then breaking to DDB and running 'show lockedvnods' when the deadlock
occurs.

If you're still having problems with NFS on 6.2, we'd much rather you
reported those so that we can investigate and try to fix them.

Kris

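The options Kris names could go into the custom kernel config roughly as follows. This is a sketch, not a tested config: DEBUG_LOCKS and DEBUG_VFS_LOCKS come from the message above, while KDB and DDB are an assumption about what is needed to be able to break into DDB at all, in case the kernel does not already have them.

```
# Debugging additions for a 6.x kernel config (illustrative fragment only)
options         KDB                 # kernel debugger framework (assumed needed)
options         DDB                 # interactive in-kernel debugger (assumed needed)
options         DEBUG_LOCKS         # named in the message above
options         DEBUG_VFS_LOCKS     # named in the message above
```

Once the hang occurs and the `db>` prompt is reached, `show lockedvnods` is the command from the message above. Getting to that prompt needs console access to the machine, which the remote-only setup described earlier does not provide.
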
Chris <chrcoluk@gmail.com> wrote:
> kris a development on this, someone else posted about a nfs problem
> and reading his post some startling point he made about network cards,
> he stated he only gets the bug on sis, rl and fxp.

Sorry for the misunderstanding, but I think that the 'NFS via TCP'
thread covers other bugs, i.e. the inability to mount NFS v3 over TCP.
I've tested the cards above, and the person I replied to encountered the
same bug with a bge card. My solution was to remove custom NFS settings
in sysctl.conf. I don't know which one was the culprit because I don't
have the time to look into it further.

My poking uncovered a set of crashing bugs and potentially a livelock. I
would agree that NFS is very fragile in RELENG_6. So far <fingers
crossed> I've not run into the NFS server deadlock you described.

scs

On 14/12/06, Kris Kennaway <kris@obsecurity.org> wrote:
> On Thu, Dec 14, 2006 at 01:28:48AM +0000, Chris wrote:
>
> > It does make sense if thats the problem since the entire server even
> > locally stops working properly, and it always follows a unexpected
> > nfs/sshfs disconnection ie. network timeout.
> >
> > I am now running 6.2-RC that has the new file and currently at 1 day
> > 11hrs uptime.
>
> OK, thanks for following part of the advice I gave a month ago ;) Let
> us know if the problems persist.
>
> Kris

Will do. Also, I have parts for a local machine now, so if it persists I
will get that online to diagnose locally.

Chris

On Mon, Dec 18, 2006 at 12:39:13AM +0000, Chris wrote:
> [...]
> Early today the nfs hub was rebooted so had a unexpected disconnection
> also noted by the sshfs timeout prompt waiting for me in the terminal
> , was able to remount fine and no server lockup or other problems.
>
> Current uptime is 5 days, 10:48

OK, good to know.

Thanks,
Kris

On 14/12/06, Kris Kennaway <kris@obsecurity.org> wrote:
> [...]
> OK, thanks for following part of the advice I gave a month ago ;) Let
> us know if the problems persist.
>
> Kris

Early today the NFS hub was rebooted, so there was an unexpected
disconnection, also noted by the sshfs timeout prompt waiting for me in
the terminal. I was able to remount fine, with no server lockup or other
problems.

Current uptime is 5 days, 10:48.

Chris

On 18/12/06, Kris Kennaway <kris@obsecurity.org> wrote:
> On Mon, Dec 18, 2006 at 12:39:13AM +0000, Chris wrote:
> > [...]
> > Current uptime is 5 days, 10:48
>
> OK, good to know.
>
> Thanks,
> Kris

Some bad news: I was offline for a day here, then I logged in today,
reattached to screen, and was greeted with a timeout message for the
sshfs server. At this point the server was still functioning fine. When
I ran the sshfs command again it locked up, with only pings responding,
and I had to hard reboot it.

I will set up my local machine now so I can do proper debugging for you.

Chris