Since the latest updates to sys/net/if_tap.c (I suspect) in 6.2-STABLE my openvpn tap server is using up all available CPU time (99%) effectively killing the box. I replicated this on a second machine which was about 3 weeks behind with a cvsup to the latest from RELENG_6 and after rebuilding the kernel it does the same thing. I have portupgraded openvpn to the latest release (openvpn-2.0.6_7), but this had not made any difference. Is anyone else experiencing similar problems? Regards Emile
On 3/14/07, Emile Coetzee <EmileC@clarotech.co.za> wrote:> Since the latest updates to sys/net/if_tap.c (I suspect) in 6.2-STABLE > my openvpn tap server is using up all available CPU time (99%) > effectively killing the box. > > I replicated this on a second machine which was about 3 weeks behind > with a cvsup to the latest from RELENG_6 and after rebuilding the kernel > it does the same thing. > > I have portupgraded openvpn to the latest release (openvpn-2.0.6_7), but > this had not made any difference. Is anyone else experiencing similar > problems? >It usually happens when OpenVPN is told to retry connecting to the remote endpoint and the remote endpoint isn't accessible due to network conditions. It happens the same even on Windows.> Regards > Emile > _______________________________________________ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org" >-- If it's there, and you can see it, it's real. If it's not there, and you can see it, it's virtual. If it's there, and you can't see it, it's transparent. If it's not there, and you can't see it, you erased it.
On Wed, Mar 14, 2007 at 03:11:58PM +0200, Emile Coetzee wrote:> Since the latest updates to sys/net/if_tap.c (I suspect) in 6.2-STABLE > my openvpn tap server is using up all available CPU time (99%) > effectively killing the box. > > I replicated this on a second machine which was about 3 weeks behind > with a cvsup to the latest from RELENG_6 and after rebuilding the kernel > it does the same thing. > > I have portupgraded openvpn to the latest release (openvpn-2.0.6_7), but > this had not made any difference. Is anyone else experiencing similar > problems?I haven't seen this on either end of our OpenVPN setup. Ours: endpoint A: FreeBSD eos.sc1.parodius.com 6.2-STABLE FreeBSD 6.2-STABLE #0: Thu Mar 8 10:41:09 PST 2007 openvpn-2.0.6_7 Secure IP/Ethernet tunnel daemon bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500 ether 2e:2e:ec:9c:76:02 id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15 maxage 20 holdcnt 6 proto stp maxaddr 100 timeout 1200 root id 00:00:00:00:00:00 priority 0 ifcost 0 port 0 member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP> member: bge1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP> tap0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500 ether 00:bd:22:14:00:00 Opened by PID 36044 endpoint B: FreeBSD medusa.parodius.com 6.1-STABLE FreeBSD 6.1-STABLE #0: Sat Aug 5 09:36:02 PDT 2006 openvpn-2.0.6_4 Secure IP/Ethernet tunnel daemon bridge0: flags=8043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 ether e6:7a:00:c2:48:e5 priority 32768 hellotime 2 fwddelay 15 maxage 20 member: tap0 flags=3<LEARNING,DISCOVER> member: em1 flags=3<LEARNING,DISCOVER> tap0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500 ether 00:bd:a4:27:00:00 Opened by PID 849 One thing I am noticing, however, is that on endpoint A the pid tap0 claims to be associated with is incorrect. My guess is that this is due to some fork() stuff that's going on. tap0 claims the wrong pid, but the pidfile has the correct pid. eos# ps -auxw | grep 36044 eos# eos# ps -auxw | grep openvpn root 36048 0.0 0.1 3248 2644 ?? Ss 7:03AM 0:00.02 /usr/local/sbin/openvpn --cd ... eos# cat /var/run/openvpn.pid 36048 While on endpoint B: medusa# ps -auxw | grep 849 root 849 0.0 0.2 3604 2268 ?? Ss 3Feb07 2:18.63 /usr/local/sbin/openvpn --cd ... medusa# cat /var/run/openvpn.pid 849 -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB |
I have done some more investigation. Rolling back to RELENG_6_2 solves the problem. I have now had this problem on 3 boxes in 2 days and have been able to reproduce it on a 4th in our lab. Starting openvpn with full debug it stops just before the point where the TAP devices gets initiated. And then the output from top shows openvpn on 99% cpu time (well almost). RELENG_6 [snip] Thu Mar 15 08:27:39 2007 us=333199 OpenVPN 2.0.6 i386-portbld-freebsd6.2 [SSL] [LZO] built on Mar 14 2007 Thu Mar 15 08:27:39 2007 us=344375 Diffie-Hellman initialized with 1024 bit key Thu Mar 15 08:27:39 2007 us=345355 MTU DYNAMIC mtu=0, flags=1, 0 -> 138 Thu Mar 15 08:27:39 2007 us=345376 TLS-Auth MTU parms [ L:1574 D:138 EF:38 EB:0 ET:0 EL:0 ] Thu Mar 15 08:27:39 2007 us=345390 MTU DYNAMIC mtu=1450, flags=2, 1574 -> 1450 RELENG_6_2 [snip] Thu Mar 15 07:18:23 2007 us=437582 OpenVPN 2.0.6 i386-portbld-freebsd6.2 [SSL] [LZO] built on Mar 14 2007 Thu Mar 15 07:18:23 2007 us=448672 Diffie-Hellman initialized with 1024 bit key Thu Mar 15 07:18:23 2007 us=449653 MTU DYNAMIC mtu=0, flags=1, 0 -> 138 Thu Mar 15 07:18:23 2007 us=449676 TLS-Auth MTU parms [ L:1574 D:138 EF:38 EB:0 ET:0 EL:0 ] Thu Mar 15 07:18:23 2007 us=449692 MTU DYNAMIC mtu=1450, flags=2, 1574 -> 1450 Thu Mar 15 07:18:23 2007 us=449759 TUN/TAP device /dev/tap0 opened Thu Mar 15 07:18:23 2007 us=449832 Data Channel MTU parms [ L:1574 D:1450 EF:42 EB:135 ET:32 EL:0 AF:3/1 ] Thu Mar 15 07:18:23 2007 us=450475 GID set to nobody Thu Mar 15 07:18:23 2007 us=450504 UID set to nobody Thu Mar 15 07:18:23 2007 us=450532 Socket Buffers: R=[42080->65536] S=[9216->65536] Thu Mar 15 07:18:23 2007 us=450552 UDPv4 link local (bound): [undef]:1195 Thu Mar 15 07:18:23 2007 us=450567 UDPv4 link remote: [undef] Thu Mar 15 07:18:23 2007 us=450586 MULTI: multi_init called, r=256 v=256 Thu Mar 15 07:18:23 2007 us=450623 IFCONFIG POOL: base=192.168.251.200 size=11 Thu Mar 15 07:18:23 2007 us=450649 IFCONFIG POOL LIST Thu Mar 15 07:18:23 2007 us=450670 PO_INIT maxevents=4 flags=0x00000002 Thu Mar 15 07:18:23 2007 us=450688 Initialization Sequence Completed Thu Mar 15 07:18:23 2007 us=450702 SCHEDULE: schedule_find_least NULL Thu Mar 15 07:18:23 2007 us=450718 PO_CTL rwflags=0x0001 ev=5 arg=0x080891c0 Thu Mar 15 07:18:23 2007 us=450733 PO_CTL rwflags=0x0001 ev=6 arg=0x080891c4 Thu Mar 15 07:18:23 2007 us=450751 I/O WAIT TR|Tw|SR|Sw [10/0] last pid: 1066; load averages: 0.49, 0.12, 0.04 up 0+00:48:00 08:28:16 31 processes: 2 running, 29 sleeping CPU states: 0.0% user, 0.0% nice, 98.5% system, 1.5% interrupt, 0.0% idle Mem: 11M Active, 7956K Inact, 17M Wired, 11M Buf, 448M Free Swap: 1024M Total, 1024M Free PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND 1066 root 1 122 0 2984K 2436K RUN 0:36 96.72% openvpn 786 root 1 96 0 3056K 1900K select 0:00 0.00% ntpd [snip] Vlad Galu: As this is a tap server there is no remote end point to connect to. It is waiting for a windows client to connect. As you can see from the openvpn logs the tap device is not even getting initiated. Luigi Rizzo: I'm no programmer, so suggestions on how to fix it are welcome ;) Regards Emile
Emile Coetzee wrote:> I have done some more investigation. Rolling back to RELENG_6_2 solves the > problem. I have now had this problem on 3 boxes in 2 days and have been able > to reproduce it on a 4th in our lab. > > Starting openvpn with full debug it stops just before the point where the > TAP devices gets initiated. And then the output from top shows openvpn on > 99% cpu time (well almost). >The recent changes to tap(4) were tested with openvpn and ppp before committing, and this is an isolated report at the moment. Other users on the list have suggested that the problem you are reporting does not affect FreeBSD alone. Please consider doing the following: 1. Use ktrace(8) to extract a kdump of openvpn as it runs into this problem, so that you may determine exactly which syscalls are invoked when you see this problem, and post it on this list. Consider using the ktrace pid mode to attach to openvpn as it blocks, as opposed to running openvpn under ktrace as a child process. 2. Implementing the usleep() fix as discussed by luigi, as the information in point 1 will give clues as to where the code is spinning. Regards, BMS
Hi Bruce I have run a ktrace as you suggested I put the following in a script /usr/local/sbin/openvpn --daemon --config /usr/local/etc/openvpn/tapserver.conf and called it tapserver.sh and then used ktrace -d -f openvpn.dump ./tapserver.sh This captured all the information for the parent PID 1082 I then watched top in a second session and identified the PID (1083) of the openvpn process that was using up the CPU time. I then ran ktrace -p 1083 -f openvpn2.dump This produced no data at all. The dump is fairly large so I have dropped it here: http://pastebin.com/899318 If required/allowed I can post it to the list. Regards Emile
> > I then ran > > ktrace -p 1083 -f openvpn2.dump > > This produced no data at all. >>Shouldn't. :-) You need to run ktrace then kdump to dump the trace.<I figured that part out on my own thanks ;) The second file is 0kb in size running a kdump on it produces no output.>Correlate the cpu spin with the wallclock time in the kdump output, andpaste an excerpt of what you get -- it sounds like it is spinning on a syscall.< I have no idea what exactly you want me to do here. Point me at some reading material if you like or show me some examples.>I can't seem to get the stuff from pastebin, it just times out, so ok topost an excerpt here.< No worries below is the last part of the first ktrace. I have dropped a full copy on our webserver: http://www.clarotech.co.za/dump/openvpn.txt 1082 sh RET read 4096/0x1000 1082 sh CALL mmap(0,0xea000,0x5,0x20002,0x3,0,0,0) 1082 sh RET mmap 672038912/0x280e8000 1082 sh CALL mprotect(0x281b5000,0x1000,0x7) 1082 sh RET mprotect 0 1082 sh CALL mprotect(0x281b5000,0x1000,0x5) 1082 sh RET mprotect 0 1082 sh CALL mmap(0x281b6000,0x6000,0x3,0x12,0x3,0,0xcd000,0) 1082 sh RET mmap 672882688/0x281b6000 1082 sh CALL mmap(0x281bc000,0x16000,0x3,0x1012,0xffffffff,0,0,0) 1082 sh RET mmap 672907264/0x281bc000 1082 sh CALL close(0x3) 1082 sh RET close 0 1082 sh CALL access(0x28092000,0) 1082 sh NAMI "/lib/libncurses.so.6" 1082 sh RET access 0 1082 sh CALL sysarch(0xa,0xbfbfec60) 1082 sh RET sysarch 0 1082 sh CALL mmap(0,0x440,0x3,0x1000,0xffffffff,0,0,0) 1082 sh RET mmap 672997376/0x281d2000 1082 sh CALL munmap(0x281d2000,0x440) 1082 sh RET munmap 0 1082 sh CALL mmap(0,0x3f8,0x3,0x1000,0xffffffff,0,0,0) 1082 sh RET mmap 672997376/0x281d2000 1082 sh CALL munmap(0x281d2000,0x3f8) 1082 sh RET munmap 0 1082 sh CALL mmap(0,0x1208,0x3,0x1000,0xffffffff,0,0,0) 1082 sh RET mmap 672997376/0x281d2000 1082 sh CALL munmap(0x281d2000,0x1208) 1082 sh RET munmap 0 1082 sh CALL mmap(0,0x58d8,0x3,0x1000,0xffffffff,0,0,0) 1082 sh RET mmap 672997376/0x281d2000 1082 sh CALL munmap(0x281d2000,0x58d8) 1082 sh RET munmap 0 1082 sh CALL sigprocmask(0x1,0x28087980,0xbfbfec30) 1082 sh RET sigprocmask 0 1082 sh CALL sigprocmask(0x3,0x28087990,0) 1082 sh RET sigprocmask 0 1082 sh CALL getpid 1082 sh RET getpid 1082/0x43a 1082 sh CALL geteuid 1082 sh RET geteuid 0 1082 sh CALL getppid 1082 sh RET getppid 1006/0x3ee 1082 sh CALL readlink(0x281ade4b,0xbfbfe810,0x3f) 1082 sh NAMI "/etc/malloc.conf" 1082 sh RET readlink -1 errno 2 No such file or directory 1082 sh CALL issetugid 1082 sh RET issetugid 0 1082 sh CALL mmap(0,0x1000,0x3,0x1002,0xffffffff,0,0,0) 1082 sh RET mmap 672997376/0x281d2000 1082 sh CALL break(0x8066000) 1082 sh RET break 0 1082 sh CALL break(0x8067000) 1082 sh RET break 0 1082 sh CALL break(0x8068000) 1082 sh RET break 0 1082 sh CALL stat(0x8067004,0xbfbfebb0) 1082 sh NAMI "/var/mail/support" 1082 sh RET stat 0 1082 sh CALL getuid 1082 sh RET getuid 0 1082 sh CALL geteuid 1082 sh RET geteuid 0 1082 sh CALL getgid 1082 sh RET getgid 0 1082 sh CALL getegid 1082 sh RET getegid 0 1082 sh CALL open(0xbfbfee07,0,0x28066b71) 1082 sh NAMI "./tapserver.sh" 1082 sh RET open 3 1082 sh CALL fcntl(0x3,0,0xa) 1082 sh RET fcntl 10/0xa 1082 sh CALL close(0x3) 1082 sh RET close 0 1082 sh CALL fcntl(0xa,0x2,0x1) 1082 sh RET fcntl 0 1082 sh CALL sigaction(0x2,0,0xbfbfebb0) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x2,0xbfbfebb0,0xbfbfeb90) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x2,0,0xbfbfebb0) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x2,0xbfbfebb0,0) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x3,0,0xbfbfebb0) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x3,0xbfbfebb0,0xbfbfeb90) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x3,0,0xbfbfebb0) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x3,0xbfbfebb0,0) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0xf,0,0xbfbfebb0) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0xf,0xbfbfebb0,0xbfbfeb90) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x1c,0,0xbfbfebb0) 1082 sh RET sigaction 0 1082 sh CALL sigaction(0x1c,0xbfbfebb0,0xbfbfeb90) 1082 sh RET sigaction 0 1082 sh CALL __getcwd(0xbfbfe860,0x400) 1082 sh RET __getcwd 0 1082 sh CALL read(0xa,0x8063d60,0x3ff) 1082 sh GIO fd 10 read 80 bytes "/usr/local/sbin/openvpn --daemon --config /usr/local/etc/openvpn/tapserver.conf " 1082 sh RET read 80/0x50 1082 sh CALL break(0x8069000) 1082 sh RET break 0 1082 sh CALL fork 1082 sh RET fork 1083/0x43b 1082 sh CALL getpgrp 1082 sh RET getpgrp 1082/0x43a 1082 sh CALL wait4(0xffffffff,0xbfbfeac8,0x2,0)