Hans van Kranenburg
2012-Jun-29 14:06 UTC
[Pkg-xen-devel] Bug#679533: Traffic forwarding issue between Xen domU/dom0 and ovs
Package: xen-hypervisor-4.0-amd64
Version: 4.0.1-5.2

Hi,

We're seeing weird behaviour regarding network traffic between a Xen
dom0 and domU. I reported this issue yesterday to the
openvswitch-discuss mailing list, but it seems this could also be a
(regression?) bug in the Xen hypervisor or the Linux kernel. (We
upgraded xen-hypervisor-4.0-amd64 4.0.1-4 -> 4.0.1-5.2 and
linux-image-2.6.32-5-xen-amd64 2.6.32-41squeeze2 -> 2.6.32-45.)

This setup runs Debian GNU/Linux with Xen 4.0, the Debian 2.6.32
kernel, and Open vSwitch 1.4.0-1 packages from Debian (30 Jan 2012),
which I repackaged for use with squeeze.
(http://packages.mendix.com/debian/pool/main/o/openvswitch/)

ii  libxenstore3.0                                   4.0.1-5.2         Xenstore communications library for Xen
ii  linux-image-2.6-xen-amd64                        2.6.32+29         Linux 2.6 for 64-bit PCs (meta-package), Xen dom0 support
ii  linux-image-2.6.32-5-xen-amd64                   2.6.32-45         Linux 2.6.32 for 64-bit PCs, Xen dom0 support
ii  openvswitch-datapath-module-2.6.32-5-xen-amd64   1.4.0-1~mxbp60+1  Open vSwitch Linux datapath kernel module
ii  xen-hypervisor-4.0-amd64                         4.0.1-5.2         The Xen Hypervisor on AMD64
ii  xen-linux-system-2.6-xen-amd64                   2.6.32+29         Xen system with Linux 2.6 for 64-bit PCs (meta-package)
ii  xen-linux-system-2.6.32-5-xen-amd64              2.6.32-45         Xen system with Linux 2.6.32 on 64-bit PCs (meta-package)
ii  xen-utils-4.0                                    4.0.1-5.2         XEN administrative tools
ii  xen-utils-common                                 4.0.0-1           XEN administrative tools - common files
ii  xenstore-utils                                   4.0.1-5.2         Xenstore utilities for Xen

ii  openvswitch-common                               1.4.0-1~mxbp60+1  Open vSwitch common components
ii  openvswitch-controller                           1.4.0-1~mxbp60+1  Open vSwitch controller implementation
ii  openvswitch-datapath-module-2.6.32-5-xen-amd64   1.4.0-1~mxbp60+1  Open vSwitch Linux datapath kernel module
ii  openvswitch-pki                                  1.4.0-1~mxbp60+1  Open vSwitch public key infrastructure dependency package
ii  openvswitch-switch                               1.4.0-1~mxbp60+1  Open vSwitch switch implementations

I'll try to describe the symptoms in detail, then mention a few
thoughts I have about it.

This issue started showing up on a single interface of a virtual
machine about 5 hours after we live-migrated a number of Xen domUs
(including this one) to identical server hardware (using LVM on iSCSI
shared storage), in order to install the security upgrade of the Xen
hypervisor package.

One of the domU interfaces attached to ovs stopped forwarding traffic
into the domU. Instead, it appears to buffer all incoming packets
until a data packet flows from the domU towards ovs; at that moment
the buffered traffic is flushed into the domU for a very short time,
after which no traffic is delivered into the domU at all, until the
next outgoing packet, and so on.

Interface inside the domU:

6: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:16:3e:00:24:09 brd ff:ff:ff:ff:ff:ff
    inet 10.140.36.9/24 brd 10.140.36.255 scope global eth4
    inet6 fe80::216:3eff:fe00:2409/64 scope link
       valid_lft forever preferred_lft forever

Excerpt from ovs-vsctl show on the dom0:

# ovs-vsctl show
5d701b49-1f5a-4a9d-8bfb-f064b3f4ed95
    Bridge "ovs0"
        Port "domU-24-09"
            tag: 36
            Interface "domU-24-09"
    [...]
    ovs_version: "1.4.0+build0"
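(A hedged aside, not part of the original diagnostics: the flow dumps
further down in this report show numeric in_port values, which are
datapath port numbers. They can be mapped back to interface names
roughly like this; the exact output format may differ per ovs
version.)

# Datapath view: lists lines like "port 61: domU-24-09".
ovs-dpctl show ovs0
# OpenFlow view of the ports on the bridge.
ovs-ofctl show ovs0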
When pinging from another host in the same vlan (10.140.36.2), no
traffic reaches host 10.140.36.9:

$ ping -c 3 10.140.36.9
PING 10.140.36.9 (10.140.36.9) 56(84) bytes of data.
From 10.140.36.2 icmp_seq=1 Destination Host Unreachable
From 10.140.36.2 icmp_seq=2 Destination Host Unreachable
From 10.140.36.2 icmp_seq=3 Destination Host Unreachable

--- 10.140.36.9 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2016ms

dump-flows on the dom0 shows no flow related to the MAC address of
the domU:

# ovs-dpctl dump-flows ovs0 | grep 00:16:3e:00:24:09

On the interface at the ovs (dom0) side, I can see ARP requests,
which are not being answered by the problematic host:

# tcpdump -n -i domU-24-09
tcpdump: WARNING: domU-24-09: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on domU-24-09, link-type EN10MB (Ethernet), capture size 65535 bytes
18:52:48.113540 ARP, Request who-has 10.140.36.9 tell 10.140.36.2, length 42
18:52:49.113556 ARP, Request who-has 10.140.36.9 tell 10.140.36.2, length 42
18:52:50.113552 ARP, Request who-has 10.140.36.9 tell 10.140.36.2, length 42

But... inside the domU, no traffic at all shows up when dumping the
eth4 interface:

# tcpdump -i eth4 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 65535 bytes

As soon as I generate traffic from inside the domU to the outside
world over this interface, traffic in both directions starts to flow:

$ ping 10.140.36.2   (as seen from the problematic domU to the outside)
PING 10.140.36.2 (10.140.36.2) 56(84) bytes of data.
64 bytes from 10.140.36.2: icmp_req=1 ttl=64 time=1008 ms
64 bytes from 10.140.36.2: icmp_req=2 ttl=64 time=0.274 ms
64 bytes from 10.140.36.2: icmp_req=3 ttl=64 time=0.205 ms
64 bytes from 10.140.36.2: icmp_req=4 ttl=64 time=0.218 ms
64 bytes from 10.140.36.2: icmp_req=5 ttl=64 time=0.182 ms
64 bytes from 10.140.36.2: icmp_req=6 ttl=64 time=0.212 ms
64 bytes from 10.140.36.2: icmp_req=7 ttl=64 time=0.230 ms
^C
--- 10.140.36.2 ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6006ms
rtt min/avg/max/mdev = 0.182/144.295/1008.745/352.910 ms, pipe 2
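(Another check one could do, not shown in the captures above — a
hedged suggestion, assuming the dom0-side interface carries the same
name as the ovs port: compare packet counters on both ends of the
path to see where the inbound packets get stuck.)

# In dom0: RX/TX counters for the ovs-facing interface.
ip -s link show domU-24-09
# Inside the domU: counters for the corresponding netfront device.
ip -s link show eth4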
At 10.140.36.2, where I was still pinging 10.140.36.9, replies
suddenly start to arrive, and the RTT settles at one constant value
between 0 and 1000 ms. The 568 ms in this example is the time between
the outside host sending its ping and the next packet flowing out of
the domU; directly after that outgoing packet, the buffered ping from
the outside finally enters the domU:

From 10.140.36.2 icmp_seq=544 Destination Host Unreachable
From 10.140.36.2 icmp_seq=545 Destination Host Unreachable
From 10.140.36.2 icmp_seq=546 Destination Host Unreachable
From 10.140.36.2 icmp_seq=547 Destination Host Unreachable
From 10.140.36.2 icmp_seq=548 Destination Host Unreachable
From 10.140.36.2 icmp_seq=549 Destination Host Unreachable
64 bytes from 10.140.36.9: icmp_req=469 ttl=64 time=83104 ms
64 bytes from 10.140.36.9: icmp_req=470 ttl=64 time=82097 ms
64 bytes from 10.140.36.9: icmp_req=471 ttl=64 time=81089 ms
64 bytes from 10.140.36.9: icmp_req=472 ttl=64 time=80081 ms
64 bytes from 10.140.36.9: icmp_req=473 ttl=64 time=79073 ms
64 bytes from 10.140.36.9: icmp_req=474 ttl=64 time=78065 ms
64 bytes from 10.140.36.9: icmp_req=475 ttl=64 time=77057 ms
64 bytes from 10.140.36.9: icmp_req=476 ttl=64 time=76049 ms
64 bytes from 10.140.36.9: icmp_req=477 ttl=64 time=75041 ms
64 bytes from 10.140.36.9: icmp_req=478 ttl=64 time=74033 ms
64 bytes from 10.140.36.9: icmp_req=479 ttl=64 time=73025 ms
64 bytes from 10.140.36.9: icmp_req=480 ttl=64 time=72017 ms
64 bytes from 10.140.36.9: icmp_req=481 ttl=64 time=71009 ms
64 bytes from 10.140.36.9: icmp_req=482 ttl=64 time=70001 ms
64 bytes from 10.140.36.9: icmp_req=483 ttl=64 time=68993 ms
64 bytes from 10.140.36.9: icmp_req=484 ttl=64 time=67985 ms
64 bytes from 10.140.36.9: icmp_req=485 ttl=64 time=66977 ms
64 bytes from 10.140.36.9: icmp_req=486 ttl=64 time=65969 ms
64 bytes from 10.140.36.9: icmp_req=487 ttl=64 time=64961 ms
64 bytes from 10.140.36.9: icmp_req=488 ttl=64 time=63953 ms
64 bytes from 10.140.36.9: icmp_req=489 ttl=64 time=62945 ms
64 bytes from 10.140.36.9: icmp_req=490 ttl=64 time=61937 ms
64 bytes from 10.140.36.9: icmp_req=491 ttl=64 time=60929 ms
64 bytes from 10.140.36.9: icmp_req=492 ttl=64 time=59921 ms
64 bytes from 10.140.36.9: icmp_req=493 ttl=64 time=58913 ms
64 bytes from 10.140.36.9: icmp_req=494 ttl=64 time=57905 ms
64 bytes from 10.140.36.9: icmp_req=495 ttl=64 time=56897 ms
64 bytes from 10.140.36.9: icmp_req=496 ttl=64 time=55889 ms
64 bytes from 10.140.36.9: icmp_req=497 ttl=64 time=54881 ms
64 bytes from 10.140.36.9: icmp_req=498 ttl=64 time=53873 ms
64 bytes from 10.140.36.9: icmp_req=499 ttl=64 time=52865 ms
64 bytes from 10.140.36.9: icmp_req=500 ttl=64 time=51857 ms
64 bytes from 10.140.36.9: icmp_req=501 ttl=64 time=50849 ms
64 bytes from 10.140.36.9: icmp_req=502 ttl=64 time=49841 ms
64 bytes from 10.140.36.9: icmp_req=503 ttl=64 time=48833 ms
64 bytes from 10.140.36.9: icmp_req=504 ttl=64 time=47825 ms
64 bytes from 10.140.36.9: icmp_req=550 ttl=64 time=2585 ms
64 bytes from 10.140.36.9: icmp_req=551 ttl=64 time=1576 ms
64 bytes from 10.140.36.9: icmp_req=552 ttl=64 time=575 ms
64 bytes from 10.140.36.9: icmp_req=553 ttl=64 time=573 ms
64 bytes from 10.140.36.9: icmp_req=554 ttl=64 time=571 ms
64 bytes from 10.140.36.9: icmp_req=555 ttl=64 time=570 ms
64 bytes from 10.140.36.9: icmp_req=556 ttl=64 time=569 ms
64 bytes from 10.140.36.9: icmp_req=557 ttl=64 time=568 ms

Note how the first flushed replies (icmp_req=469 onwards) come back
with huge RTTs that decrease by roughly one second per sequence
number: those requests had been sitting buffered for up to 83 seconds
and were all delivered at once.

As soon as I stop pinging from the inside of the domU to the outside
(to the dom0/ovs), traffic from the outside into the domU immediately
stops again:

64 bytes from 10.140.36.9: icmp_req=609 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=610 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=611 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=612 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=613 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=614 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=615 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=616 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=617 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=618 ttl=64 time=815 ms
64 bytes from 10.140.36.9: icmp_req=619 ttl=64 time=815 ms
From 10.140.36.2 icmp_seq=662 Destination Host Unreachable
From 10.140.36.2 icmp_seq=663 Destination Host Unreachable
From 10.140.36.2 icmp_seq=664 Destination Host Unreachable
From 10.140.36.2 icmp_seq=665 Destination Host Unreachable
From 10.140.36.2 icmp_seq=666 Destination Host Unreachable
From 10.140.36.2 icmp_seq=667 Destination Host Unreachable
From 10.140.36.2 icmp_seq=668 Destination Host Unreachable
From 10.140.36.2 icmp_seq=669 Destination Host Unreachable
From 10.140.36.2 icmp_seq=670 Destination Host Unreachable
From 10.140.36.2 icmp_seq=671 Destination Host Unreachable
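(Since an outgoing packet is what un-sticks the inbound direction,
and inbound traffic stalls again the moment the outgoing traffic
stops, a crude stopgap for an affected interface would be a permanent
background ping from inside the domU. Just a sketch, not something we
are actually running; inbound packets would still see up to a second
of extra delay, as the RTTs above show.)

#!/bin/sh
# Crude keep-alive, run inside the affected domU: one outbound packet
# per second keeps the inbound direction of the interface flowing.
# 10.140.36.2 is the neighbouring host on the same vlan.
nohup ping -i 1 10.140.36.2 >/dev/null 2>&1 &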
When pinging from the domU to the outside, a fair number of the
replies have an RTT of exactly 1 second:

64 bytes from 10.140.36.2: icmp_req=35 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=36 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=37 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=38 ttl=64 time=0.219 ms
64 bytes from 10.140.36.2: icmp_req=39 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=40 ttl=64 time=0.216 ms
64 bytes from 10.140.36.2: icmp_req=41 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=42 ttl=64 time=0.195 ms
64 bytes from 10.140.36.2: icmp_req=43 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=44 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=45 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=46 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=47 ttl=64 time=1000 ms
64 bytes from 10.140.36.2: icmp_req=48 ttl=64 time=0.208 ms
64 bytes from 10.140.36.2: icmp_req=49 ttl=64 time=0.251 ms
64 bytes from 10.140.36.2: icmp_req=50 ttl=64 time=0.244 ms

So... while pinging to the outside, this is an example of the output
of ovs-dpctl dump-flows ovs0 | grep 00:16:3e:00:24:09 on the dom0:

in_port(2),eth(src=00:16:3e:00:24:02,dst=00:16:3e:00:24:09),eth_type(0x8100),vlan(vid=36,pcp=0),encap(eth_type(0x0800),ipv4(src=10.140.36.2,dst=10.140.36.9,proto=1,tos=0,ttl=64,frag=no),icmp(type=0,code=0)), packets:98, bytes:9996, used:0.820s, actions:pop_vlan,61
in_port(61),eth(src=00:16:3e:00:24:09,dst=00:16:3e:00:24:02),eth_type(0x0806),arp(sip=10.140.36.9,tip=10.140.36.2,op=2,sha=00:16:3e:00:24:09,tha=00:16:3e:00:24:02), packets:0, bytes:0, used:never, actions:push_vlan(vid=36,pcp=0),1
in_port(2),eth(src=00:16:3e:00:24:02,dst=00:16:3e:00:24:09),eth_type(0x8100),vlan(vid=36,pcp=0),encap(eth_type(0x0806),arp(sip=10.140.36.2,tip=10.140.36.9,op=1,sha=00:16:3e:00:24:02,tha=00:00:00:00:00:00)), packets:1, bytes:60, used:2.820s, actions:pop_vlan,61
in_port(61),eth(src=00:16:3e:00:24:09,dst=00:16:3e:00:24:02),eth_type(0x0800),ipv4(src=10.140.36.9,dst=10.140.36.2,proto=1,tos=0,ttl=64,frag=no),icmp(type=8,code=0), packets:98, bytes:9604, used:0.819s, actions:push_vlan(vid=36,pcp=0),1

We've been testing and debugging for some hours:

- when I reboot this domU on the same dom0, the issue remains
- when I shutdown/create this domU on the same dom0, the issue remains
- when I live-migrate this domU to another dom0, the issue disappears
- when I live-migrate this domU back from the other dom0 to the
  original one where I had the issue, the issue re-appears
- when I live-migrate *any* domU on our infrastructure to *any* dom0,
  this issue might suddenly/immediately arise on one of the
  interfaces this domU has attached to ovs
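(The live-migration tests above boil down to "migrate, then check
whether the interface answers again". A minimal sketch of that cycle,
assuming the xm toolstack; the domU name, the target host, and the
idea of probing from a host on the same vlan are illustrative:)

#!/bin/sh
# Illustrative names only.
DOMU=backup-resolver
TARGET=dom0-b.example.com
ADDR=10.140.36.9

# Live-migrate the domU to the target dom0.
xm migrate --live "$DOMU" "$TARGET"
sleep 5

# Probe the domU's ovs-attached interface; this check has to run on a
# host in the same vlan, not on the source dom0.
if ping -c 3 -W 2 "$ADDR" >/dev/null; then
    echo "$DOMU: interface at $ADDR forwards traffic again"
else
    echo "$DOMU: still broken, migrate once more"
fi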
We had been using this version of Open vSwitch (v1.4.0 - 30 Jan 2012)
in production since March 2012, after a few weeks of testing, but
never did a lot of live migration until now.

Any thoughts on this issue would be much appreciated... Right now the
only broken domU interface is an interface of a backup DNS resolver
on one of our customer vlans (we've been moving around some
test/acceptance systems, of which 2 also broke today with exactly the
same symptoms), so I'm leaving the situation as-is for now, as I
don't want to risk breaking a customer production domU.

Luckily there's some sort of workaround when issues arise: move the
affected domU around to other servers a few extra times, until
hopefully all its network interfaces work again. :|

-- 
Hans van Kranenburg - System / Network Engineer
+31 (0)10 2760434 | hans.van.kranenburg at mendix.com | www.mendix.com
Ian Campbell
2012-Jul-10 18:16 UTC
[Pkg-xen-devel] Bug#679533: Bug#679533: Traffic forwarding issue between Xen domU/dom0 and ovs
On Fri, 2012-06-29 at 16:06 +0200, Hans van Kranenburg wrote:

> We're seeing weird behaviour regarding network traffic between a Xen
> dom0 and domU. I reported this issue yesterday to the
> openvswitch-discuss mailing list, but it seems this could also be a
> (regression?) bug in the xen hypervisor or the linux kernel...

Did you get any response from the vswitch folks?

Did this used to work? You say "(regression?)".

This seems more likely on first glance to be an issue with the
vswitch rather than with Xen or the kernel, unless you can reproduce
it with Linux bridging too. In fact it is very unlikely to be a
hypervisor issue (the hypervisor isn't really involved in network
traffic). However, it would still be useful to try the 4.1 hypervisor
and the 3.2 kernel from Wheezy if you can.

Thanks,
Ian.
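(For the Linux bridging comparison suggested here: one way to test
would be to attach the domU's vif to a plain Linux bridge instead of
ovs. A rough sketch only; the bridge and vif names are illustrative,
and normally the vif-bridge hotplug script plus the bridge= setting
in the domU config would do this rather than manual commands.)

#!/bin/sh
# On the dom0: create a plain bridge (bridge-utils) and attach the
# domU's backend vif to it instead of the ovs bridge.
brctl addbr xenbr36
ip link set xenbr36 up
brctl addif xenbr36 vif5.4   # hypothetical vif for the domU's eth4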
Debian Bug Tracking System
2014-Nov-29 23:24 UTC
[Pkg-xen-devel] Bug#679533: marked as done (Traffic forwarding issue between Xen domU/dom0)
Your message dated Sun, 30 Nov 2014 00:15:22 +0100
with message-id <547A538A.9050802 at mendix.com>
and subject line Closing #679533: Traffic forwarding issue between, Xen domU/dom0
has caused the Debian Bug report #679533,
regarding Traffic forwarding issue between Xen domU/dom0
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner at bugs.debian.org
immediately.)

-- 
679533: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=679533
Debian Bug Tracking System
Contact owner at bugs.debian.org with problems

-------------- next part --------------
An embedded message was scrubbed...
From: Hans van Kranenburg <hans.van.kranenburg at mendix.com>
Subject: Traffic forwarding issue between Xen domU/dom0 and ovs
Date: Fri, 29 Jun 2012 16:06:40 +0200
Size: 16427
URL: <http://lists.alioth.debian.org/pipermail/pkg-xen-devel/attachments/20141129/bae759e6/attachment.mht>
-------------- next part --------------
An embedded message was scrubbed...
From: Hans van Kranenburg <hans.van.kranenburg at mendix.com>
Subject: Closing #679533: Traffic forwarding issue between, Xen domU/dom0
Date: Sun, 30 Nov 2014 00:15:22 +0100
Size: 2109
URL: <http://lists.alioth.debian.org/pipermail/pkg-xen-devel/attachments/20141129/bae759e6/attachment-0001.mht>