thr3ads.net - Xen devel - The strange case of xen_netback not returning ARP replies [May 2012]

Joanna Rutkowska

2012-May-16 12:18 UTC

The strange case of xen_netback not returning ARP replies

Hello,

I'm facing a rather strange problem with the netback interface. My setup
involves a netvm, which has some physical network interfaces assigned,
and a client VM where a netfront is running (exposed as eth0) and which
is connected to that netvm (via the vif42.0 interface, as seen in the netvm
in the dumps below).

Now, the netvm has two physical network interfaces assigned:
1) A standard Intel AGN (iwlwifi module, interface wlan0) -- this is
just an assigned PCI device.

2) A USB 3G modem (cdc_ncm module, usb0 interface) -- this has been made
available to the netvm by assigning the whole USB controller to which the
3G modem is connected. This works fine.

We do NAT in the netvm for the traffic coming in on vif* and send it out
through the default outgoing interface, e.g. wlan0. Now, as long as I
use wlan0 for networking, all works great. I've been using this setup
for years, no problem here.

However, when I switch to usb0 as the default outgoing interface in the
netvm, something strange happens. Networking works fine via usb0 for
some time (a few minutes typically), yet suddenly, after enough packets
have been exchanged, it stops working.

When I run tcpdump on the vif* interface I can see that suddenly
nobody (in the netvm) replies to the ARP requests from the client
VM (the client VM has Xen ID = 42 in this dump, IP = .5, and gateway
= .1):

[root@netvm user]# tcpdump -ni vif42.0 arp
tcpdump: WARNING: vif42.0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:41:55.031819 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:56.031860 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:57.031794 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:59.287308 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:00.283853 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:01.283816 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:03.231324 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length

... and this now continues without end.

For comparison, this is how it looks when I use networking via wlan0:

[root@netvm user]# tcpdump -ni vif42.0 arp
tcpdump: WARNING: vif42.0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28
13:39:21.799844 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:39:21.799869 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28

We can see that every once in a while an ARP request for 10.137.1.1
appears (the gateway for the client VM, i.e. the netvm), and it is
immediately answered (by the netvm, as I understand).
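
To make the difference between the two captures easy to spot in longer dumps, here is a minimal Python sketch that scans `tcpdump -n ... arp` text output and lists ARP requests that never got a reply. It assumes the two line shapes shown above ("Request who-has X tell Y" and "Reply X is-at MAC"); the function name and the sample constants are made up for illustration and are not part of any tool mentioned in this thread.

```python
import re

# Line shapes as they appear in the tcpdump output quoted in this thread.
REQUEST = re.compile(r"^(\S+) ARP, Request who-has (\S+) tell (\S+),")
REPLY = re.compile(r"^(\S+) ARP, Reply (\S+) is-at (\S+),")

# Sample lines copied from the two captures (hypothetical constant names).
WORKING = [
    "13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
    "13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28",
    "13:39:21.799844 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
    "13:39:21.799869 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28",
]
BROKEN = [
    "13:41:55.031819 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
    "13:41:56.031860 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
    "13:41:57.031794 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
]


def unanswered_requests(lines):
    """Return (timestamp, target_ip) for every ARP request that saw no
    reply before the next request for the same IP or the end of input."""
    pending = {}      # target IP -> timestamp of the latest open request
    unanswered = []
    for line in lines:
        line = line.strip()
        m = REQUEST.match(line)
        if m:
            ts, target, _asker = m.groups()
            if target in pending:
                # The previous request for this IP was never answered.
                unanswered.append((pending[target], target))
            pending[target] = ts
            continue
        m = REPLY.match(line)
        if m:
            _ts, ip, _mac = m.groups()
            pending.pop(ip, None)  # the reply closes the open request
    # Whatever is still open at the end of the capture went unanswered.
    unanswered.extend((ts, target) for target, ts in pending.items())
    return unanswered
```

Calling `unanswered_requests(WORKING)` gives an empty list, while `unanswered_requests(BROKEN)` flags every request, which matches what the two dumps show by eye.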

Now, this behavior seems really strange, because:

1) AFAIU, the ARP replies are/should be generated by the netback
interface in the netvm (vif*).

2) It shouldn't matter to the netback code how the packets are later
routed (via wlan0 vs. usb0) in order to provide this (dummy) ARP response.

3) ...yet, for some reason, when packets are later routed
through usb0, netback is not willing to generate the ARP response???

Or am I misunderstanding this, and it is somebody else who is generating
the ARP responses? The final NIC?

Some additional notes:
1) We make sure to set /proc/sys/net/ipv4/conf/vif*/proxy_arp to 1

2) When this "arp hang" happens, the networking (via usb0) is still
working fine in the netvm (i.e. I can do ping google.com from the netvm)

3) This has been tested on various VM kernels (in the netvm): 3.0.4,
3.2.7, and 3.3.5 -- all exhibit the same behavior.

4) Nothing spectacular in the logs of the netvm, however, I can often
see this crash in the *client* VM:

[ 1257.228761] ------------[ cut here ]------------
[ 1257.228767] WARNING: at
/home/user/qubes-src/kernel/kernel-3.3.5/linux-3.3.5/fs/sysfs/file.c:498
sysfs_attr_ns+0x93/0xa0()
[ 1257.228776] sysfs: kobject eth0 without dirent
[ 1257.228780] Modules linked in: iptable_raw bnep bluetooth rfkill
ipt_MASQUERADE ipt_REJECT xt_state xt_tcpudp xen_netback iptable_filter
iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4
ip_tables x_tables xen_netfront microcode pcspkr u2mfn(O) xen_blkback
xen_evtchn autofs4 ext4 jbd2 crc16 dm_snapshot xen_blkfront [last
unloaded: scsi_wait_scan]
[ 1257.228819] Pid: 11, comm: xenwatch Tainted: G        W  O
3.3.5-1.pvops.qubes.x86_64 #1
[ 1257.228825] Call Trace:
[ 1257.228830]  [<ffffffff810495aa>] warn_slowpath_common+0x7a/0xb0
[ 1257.228836]  [<ffffffff81049681>] warn_slowpath_fmt+0x41/0x50
[ 1257.228842]  [<ffffffff81057ba7>] ? lock_timer_base+0x37/0x70
[ 1257.228850]  [<ffffffff811a7433>] sysfs_attr_ns+0x93/0xa0
[ 1257.228856]  [<ffffffff811a7aef>] sysfs_remove_file+0x1f/0x40
[ 1257.228862]  [<ffffffff812e5622>] device_remove_file+0x12/0x20
[ 1257.228870]  [<ffffffffa00faf5a>] xennet_remove+0x84/0xac
[xen_netfront]
[ 1257.228875]  [<ffffffff812b5c82>] xenbus_dev_remove+0x42/0xa0
[ 1257.228881]  [<ffffffff812e85a7>] __device_release_driver+0x77/0xd0
[ 1257.228887]  [<ffffffff812e86e8>] device_release_driver+0x28/0x40
[ 1257.228895]  [<ffffffff812e790f>] bus_remove_device+0x10f/0x180
[ 1257.228901]  [<ffffffff812e5808>] device_del+0x118/0x1c0
[ 1257.228906]  [<ffffffff812e58cd>] device_unregister+0x1d/0x60
[ 1257.228914]  [<ffffffff812b5a46>] xenbus_dev_changed+0x96/0x1b0
[ 1257.228920]  [<ffffffff812b74b4>] frontend_changed+0x24/0x50
[ 1257.228926]  [<ffffffff812b4221>] xenwatch_thread+0xb1/0x170
[ 1257.228933]  [<ffffffff8106aea0>] ? wake_up_bit+0x40/0x40
[ 1257.228939]  [<ffffffff812b4170>] ? xenbus_thread+0x40/0x40
[ 1257.228944]  [<ffffffff8106a9a6>] kthread+0x96/0xa0
[ 1257.228951]  [<ffffffff81465724>] kernel_thread_helper+0x4/0x10
[ 1257.228959]  [<ffffffff8145c7fc>] ? retint_restore_args+0x5/0x6
[ 1257.228964]  [<ffffffff81465720>] ? gs_change+0x13/0x13
[ 1257.228968] ---[ end trace 75286ef58ce0391f ]---

But this seems rather irrelevant, as it seems like it is the netvm that
is failing here, i.e. it doesn't generate ARP responses?
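
For anyone wanting to double-check note 1 above (proxy_arp set to 1 on every vif*), a small Python sketch along these lines could be run in the netvm. It only reads the `/proc/sys/net/ipv4/conf/<iface>/proxy_arp` files mentioned in that note; the function names and the `conf_dir`/`pattern` parameters are illustrative assumptions, not an existing tool.

```python
import glob
import os


def proxy_arp_settings(conf_dir="/proc/sys/net/ipv4/conf", pattern="vif*"):
    """Map each interface matching `pattern` to its proxy_arp value."""
    settings = {}
    for path in sorted(glob.glob(os.path.join(conf_dir, pattern, "proxy_arp"))):
        # The interface name is the directory that holds the proxy_arp file.
        iface = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            settings[iface] = int(f.read().strip())
    return settings


def interfaces_without_proxy_arp(settings):
    """Return the interfaces whose proxy_arp value is not 1."""
    return [iface for iface, value in settings.items() if value != 1]


if __name__ == "__main__":
    current = proxy_arp_settings()
    for iface, value in current.items():
        print(f"{iface}: proxy_arp={value}")
    missing = interfaces_without_proxy_arp(current)
    if missing:
        print("proxy_arp not enabled on:", ", ".join(missing))
```

The directory is a parameter only so the check can be exercised against a scratch directory; on a real netvm the default path is the one from note 1.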

I would appreciate any help with this issue!

Thanks,
joanna.



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Konrad Rzeszutek Wilk

2012-May-22 19:53 UTC

Re: The strange case of xen_netback not returning ARP replies

On Wed, May 16, 2012 at 02:18:27PM +0200, Joanna Rutkowska wrote:
> Hello,
> 
> I'm facing a rather strange problem with the netback interface. My setup
> involves a netvm, which has some physical network interfaces assigned,
> and a client VM where a net front is running (exposed as eth0) and which
> is connected to that netvm (via vif42.0 interface, as seen in the netvm
> on the dumps below).
> 
> Now, the netvm has two physical network interfaces assigned:
> 1) A standard Intel AGN (iwlwifi module, interface wlan0) -- this is
> just a PCI devices assigned
> 
> 2) A USB 3G modem (cdc_ncm module, usb0 interface) -- this has been made
> available to the netvm by assigning a whole USB controller, where the 3G
> modem is connected to. This works fine.

There are some patches posted about netback and SKB slots that might
apply to the problem you guys are seeing.


Joanna Rutkowska

2012-May-26 11:04 UTC

Re: The strange case of xen_netback not returning ARP replies

On 05/22/12 21:53, Konrad Rzeszutek Wilk wrote:
> On Wed, May 16, 2012 at 02:18:27PM +0200, Joanna Rutkowska wrote:
>> Hello,
>>
>> I'm facing a rather strange problem with the netback interface. My setup
>> involves a netvm, which has some physical network interfaces assigned,
>> and a client VM where a net front is running (exposed as eth0) and which
>> is connected to that netvm (via vif42.0 interface, as seen in the netvm
>> on the dumps below).
>>
>> Now, the netvm has two physical network interfaces assigned:
>> 1) A standard Intel AGN (iwlwifi module, interface wlan0) -- this is
>> just a PCI devices assigned
>>
>> 2) A USB 3G modem (cdc_ncm module, usb0 interface) -- this has been made
>> available to the netvm by assigning a whole USB controller, where the 3G
>> modem is connected to. This works fine.
> 
> There are some patches posted about netback and SKB slots that might
> apply to the problem you guys are seeing.

Yes, indeed the SKB patch solved this problem, thanks!

joanna.



