Hi all-
I am seeing an intermittent lockup of my machine's networking as soon
as I apply a network load. On a pool of 80 machines, the first one
generally locks up within 15-20 minutes of starting the workload. The
symptom is a long list of the following in /var/log/messages:
Aug 16 18:32:49 localhost kernel: netback[1]: TXP193 is DMA mapped
Aug 16 18:32:49 localhost kernel: netback[1]: TXP211 is DMA mapped
Aug 16 18:32:49 localhost kernel: netback[1]: TXP232 is DMA mapped
Aug 16 18:32:49 localhost kernel: netback[1]: TXP157 is DMA mapped
Aug 16 18:32:49 localhost kernel: netback[0]: TXP44 is DMA mapped
This seems to clog up the networking pipeline, which leads to a stall
in my NIC driver:
Aug 16 18:32:58 localhost kernel: ------------[ cut here ]------------
Aug 16 18:32:58 localhost kernel: WARNING: at
net/sched/sch_generic.c:261 dev_watchdog+0x241/0x250()
Aug 16 18:32:58 localhost kernel: Hardware name: C51G,MCP51
Aug 16 18:32:58 localhost kernel: NETDEV WATCHDOG: eth0 (tg3):
transmit queue 0 timed out
Aug 16 18:32:58 localhost kernel: Modules linked in: nfs nfs_acl
auth_rpcgss sch_htb lockd sunrpc 8021q openvswitch ipt_REJECT
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp
iptable_filter ip_tables x_tables binfmt_misc nls_utf8 isofs video
output sbs sbshc fan container battery ac parport_pc lp parport nvram
thermal rtc_cmos processor evdev sg tg3 button thermal_sys rtc_core
sata_sil24 rtc_lib serio_raw tpm_tis tpm tpm_bios i2c_nforce2 pcspkr
i2c_core ide_generic dm_snapshot dm_zero dm_mirror dm_region_hash
dm_log dm_mod sata_nv pata_acpi ata_generic libata sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd usbcore fbcon font tileblit
bitblit softcursor
Aug 16 18:32:58 localhost kernel: Pid: 0, comm: swapper Not tainted
2.6.32.12-0.7.1.xs6.0.2.553.170674xen #1
Aug 16 18:32:58 localhost kernel: Call Trace:
Aug 16 18:32:58 localhost kernel: [<c031a1a1>] ? dev_watchdog+0x241/0x250
Aug 16 18:32:58 localhost kernel: [<c031a1a1>] ? dev_watchdog+0x241/0x250
Aug 16 18:32:58 localhost kernel: [<c012e0bc>]
warn_slowpath_common+0x7c/0xa0
Aug 16 18:32:58 localhost kernel: [<c031a1a1>] ? dev_watchdog+0x241/0x250
Aug 16 18:32:58 localhost kernel: [<c012e126>]
warn_slowpath_fmt+0x26/0x30
Aug 16 18:32:58 localhost kernel: [<c031a1a1>] dev_watchdog+0x241/0x250
Aug 16 18:32:58 localhost kernel: [<c02188f6>] ?
blk_rq_timed_out_timer+0xe6/0x110
Aug 16 18:32:58 localhost kernel: [<c0137fe1>]
run_timer_softirq+0x151/0x200
Aug 16 18:32:58 localhost kernel: [<c0319f60>] ? dev_watchdog+0x0/0x250
Aug 16 18:32:58 localhost kernel: [<c013359a>] __do_softirq+0xba/0x180
Aug 16 18:32:58 localhost kernel: [<c015b657>] ?
handle_IRQ_event+0x37/0x100
Aug 16 18:32:58 localhost kernel: [<c015e774>] ?
move_native_irq+0x14/0x50
Aug 16 18:32:58 localhost kernel: [<c01336d5>] do_softirq+0x75/0x80
Aug 16 18:32:58 localhost kernel: [<c01339bb>] irq_exit+0x2b/0x40
Aug 16 18:32:58 localhost kernel: [<c029c7b7>]
evtchn_do_upcall+0x1e7/0x330
Aug 16 18:32:58 localhost kernel: [<c010470f>]
hypervisor_callback+0x43/0x4b
Aug 16 18:32:58 localhost kernel: [<c0107095>] ? xen_safe_halt+0xb5/0x150
Aug 16 18:32:58 localhost kernel: [<c010adae>] xen_idle+0x1e/0x50
Aug 16 18:32:58 localhost kernel: [<c0102a7b>] cpu_idle+0x3b/0x60
Aug 16 18:32:58 localhost kernel: [<c0373c43>] rest_init+0x53/0x60
Aug 16 18:32:58 localhost kernel: [<c04f5cea>] start_kernel+0x29a/0x340
Aug 16 18:32:58 localhost kernel: [<c04f55f0>] ?
unknown_bootoption+0x0/0x1f0
Aug 16 18:32:58 localhost kernel: [<c04f507c>]
i386_start_kernel+0x7c/0x90
Aug 16 18:32:58 localhost kernel: ---[ end trace 76ea5a31a8fc2f33 ]---
After the NIC driver fails, netback un-stalls itself:
Aug 16 18:33:00 localhost kernel: tg3 0000:01:00.0: tg3_stop_block
timed out, ofs=1400 enable_bit=2
Aug 16 18:33:00 localhost kernel: pci 0000:00:02.0: eth0: Link is down
Aug 16 18:33:00 localhost kernel: netback[1]: DMA mapped TXP 203 released
Aug 16 18:33:00 localhost kernel: netback[1]: DMA mapped TXP 212 released
Aug 16 18:33:00 localhost kernel: netback[2]: DMA mapped TXP 94 released
Aug 16 18:33:00 localhost kernel: netback[1]: DMA mapped TXP 159 released
To get packets moving again I have to connect to the host over a
serial console, rmmod the tg3 driver, modprobe it again, bring the
interface up with ifconfig, and restart OVS.
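
For anyone who wants to script that recovery, here is a minimal sketch
of the sequence in Python. It is not something I have run in this exact
form. The interface name eth0 and the driver name tg3 come from the log
above; the OVS restart command ("service openvswitch restart") is an
assumption and will likely differ between installs, so it is a
parameter. It defaults to a dry run that only prints the commands, and
it should be run from the serial console since unloading tg3 drops the
link.

```python
import subprocess


def recovery_commands(iface="eth0", driver="tg3",
                      ovs_restart=("service", "openvswitch", "restart")):
    """Return the recovery steps as argv lists, in the order described.

    ovs_restart is an assumed command; substitute whatever restarts
    Open vSwitch on your host.
    """
    return [
        ["rmmod", driver],         # unload the stalled NIC driver
        ["modprobe", driver],      # load it again
        ["ifconfig", iface, "up"], # bring the interface back up
        list(ovs_restart),         # assumption: OVS service name
    ]


def run_recovery(dry_run=True, **kwargs):
    """Print each step; execute it only when dry_run is False."""
    for argv in recovery_commands(**kwargs):
        print(" ".join(argv))
        if not dry_run:
            subprocess.check_call(argv)


if __name__ == "__main__":
    # Dry run by default: prints the commands without touching the host.
    run_recovery(dry_run=True)
```

Calling run_recovery(dry_run=False) as root would actually execute the
steps and stop at the first command that fails.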
I've tried a variety of things to debug the problem:
-Turning off all hardware acceleration on the NIC with ethtool
-Trying different OVS versions
-Using a single dom0 vcpu
-Turning off irqbalance and MSI
-Trying the latest stable kernel in my VMs (3.5.3)
-Trying a newer tg3 driver from the Citrix crew
(http://forums.citrix.com/thread.jspa?threadID=311744)
But to no avail. I never see the "is DMA mapped" messages under
normal operation, so it seems that the problem is whatever causes dom0
to believe that the memory in the netback/netfront rings is DMA mapped.
If anyone has suggestions on how to approach or solve this problem, I
am open to ideas. I've spent a couple of weeks on and off on it with no
resolution. I'm attaching a tar with all the log messages from the
system in case they help.
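
In case it helps with going through the logs, below is a rough Python
sketch that tallies the "is DMA mapped" and "DMA mapped TXP ... released"
messages per netback instance. The two regular expressions are written
against the message formats quoted above; the function name tally and
the embedded sample lines are only for illustration.

```python
import re
from collections import Counter

# Patterns follow the two message formats quoted earlier in this mail.
MAPPED = re.compile(r"netback\[(\d+)\]: TXP\s*(\d+) is DMA mapped")
RELEASED = re.compile(r"netback\[(\d+)\]: DMA mapped TXP\s*(\d+) released")


def tally(lines):
    """Count mapped and released messages, keyed by netback instance."""
    mapped, released = Counter(), Counter()
    for line in lines:
        m = MAPPED.search(line)
        if m:
            mapped[int(m.group(1))] += 1
            continue
        m = RELEASED.search(line)
        if m:
            released[int(m.group(1))] += 1
    return mapped, released


# A few lines copied from the log excerpts above.
SAMPLE = [
    "Aug 16 18:32:49 localhost kernel: netback[1]: TXP193 is DMA mapped",
    "Aug 16 18:32:49 localhost kernel: netback[0]: TXP44 is DMA mapped",
    "Aug 16 18:33:00 localhost kernel: netback[2]: DMA mapped TXP 94 released",
]

mapped, released = tally(SAMPLE)
print("mapped per instance:", dict(mapped))
print("released per instance:", dict(released))
```

To run it over a real log, pass an open file instead of SAMPLE, for
example tally(open("/var/log/messages")).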
Thanks in advance,
David