Philipp Hahn
2011-Feb-25  13:40 UTC
[Xen-users] RFH: Windows2003+GPLPV packet-receive breaks after some time (Xen 3.4.3 amd64)
Hello,
one of our domU Windows systems with the GPL-PV driver regularly has problems
with its network connection: after some time the VM no longer receives any
packets. It seems to be a problem with receiving only, since sending
ARP packets still works:
	tcpdump -i vif147.0 -n arp | grep -FA1 --color XXX.X.71.77
If I try to ping the domU from the dom0, I only see the request going to the
domU, but no answer:
	13:49:17.106405 arp who-has XXX.X.71.77 tell XXX.X.12.47
If I try to ping some host from the domU, I see the request leaving the domU
and the answer arriving for the domU, but no ICMP messages follow:
	13:48:37.569618 arp who-has XXX.X.22.12 tell XXX.X.71.77
	13:48:37.570002 arp reply XXX.X.22.12 is-at 00:16:3e:aa:ed:fa
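
The two captures above can be checked mechanically by pairing each ARP request with its reply. The following is a minimal sketch, assuming the classic tcpdump text format shown above; the function name `unanswered_arp` and the sample list are illustrative, not part of any tool.

```python
import re

# Patterns for the tcpdump ARP text format used in the captures above.
REQUEST = re.compile(r"arp who-has (\S+) tell (\S+)")
REPLY = re.compile(r"arp reply (\S+) is-at (\S+)")

def unanswered_arp(lines):
    """Return the target addresses that were requested but never answered."""
    asked, answered = [], set()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            if match.group(1) not in asked:
                asked.append(match.group(1))
            continue
        match = REPLY.search(line)
        if match:
            answered.add(match.group(1))
    return [address for address in asked if address not in answered]

# The three lines quoted in this message (addresses are masked as in the post).
capture = [
    "13:49:17.106405 arp who-has XXX.X.71.77 tell XXX.X.12.47",
    "13:48:37.569618 arp who-has XXX.X.22.12 tell XXX.X.71.77",
    "13:48:37.570002 arp reply XXX.X.22.12 is-at 00:16:3e:aa:ed:fa",
]
print(unanswered_arp(capture))  # ['XXX.X.71.77']
```

Only the domU's own address stays unanswered, which fits a domU that can still send but no longer sees incoming frames.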
We have saved the state of the VM to a file, which when restored puts the domU
back in the broken state.
We collected some information, but are now stuck on how best to proceed, since
we don't know enough about the internal workings of Xen and GPLPV.
Can we (or someone else) diagnose why received packets are not handled
properly?
Should we install the debug driver, and what should we do when the problem next
occurs? (I'm not afraid of debuggers and assembler, but my experience is on
Linux, not so much on Windows.)
Arch: amd64
Xen: 3.4.3
dom0: 2.6.32-17 (Debian)
domU: Windows 2003 Service Pack 2
GPLPV: 0.11.0.238
	Xen network device settings:
		Check checksum on RX packets: Enabled
		Checksum Offload: Enabled
		Large Send Offload: 61440
		Locally Administrated Address: Not set
		MTU: 1500
		Rx Interrupt Moderation (beta): Disabled
		Scatter/Gather: Enabled
# xm network-list 147
Idx BE     MAC Addr.     handle state evt-ch tx-/rx-ring-ref BE-path
0   0  00:16:3e:af:fa:a5    0     4      9     15732/15741   /local/domain/0/backend/vif/147/0
# xenstore-ls /local/domain/0/backend/vif/147
0 = ""
 bridge = "XXXXXX0"
 domain = "XXXX010"
 handle = "0"
 uuid = "c550619d-3a4f-edfd-a22c-4b11a84b5728"
 script = "/etc/xen/scripts/vif-bridge"
 state = "4"
 frontend = "/local/domain/147/device/vif/0"
 mac = "00:16:3e:af:fa:a5"
 online = "1"
 frontend-id = "147"
 feature-sg = "1"
 feature-gso-tcpv4 = "1"
 feature-rx-copy = "1"
 feature-rx-flip = "0"
 feature-smart-poll = "1"
 hotplug-status = "connected"
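
The backend entries above can be read programmatically to confirm that the control path looks healthy. This is a sketch only: `parse_xenstore_ls` is a hypothetical helper, and the state names follow the XenbusState enumeration from Xen's public headers (4 = Connected).

```python
import re

# One entry per line: optional indentation, key, ' = ', quoted value.
ENTRY = re.compile(r'^\s*(\S+) = "(.*)"\s*$')

# XenbusState values, as enumerated in Xen's public io/xenbus.h.
XENBUS_STATES = {
    "1": "Initialising", "2": "InitWait", "3": "Initialised",
    "4": "Connected", "5": "Closing", "6": "Closed",
}

def parse_xenstore_ls(text):
    """Flatten xenstore-ls output into a dict (nesting is ignored)."""
    entries = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            entries[match.group(1)] = match.group(2)
    return entries

# Excerpt of the listing above.
listing = '''0 = ""
 bridge = "XXXXXX0"
 state = "4"
 feature-rx-copy = "1"
 feature-rx-flip = "0"
 hotplug-status = "connected"
'''

backend = parse_xenstore_ls(listing)
print(XENBUS_STATES.get(backend["state"], "Unknown"))
print(backend["hotplug-status"])
```

Both values report a connected backend, so the xenstore handshake itself gives no hint of the failure; the problem has to be looked for in the data path.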
# netstat -s
Ip:
    2982289885 total packets received
    2708867 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    2931918645 incoming packets delivered
    1504949163 requests sent out
    1 outgoing packets dropped
    1 dropped because of missing route
    2683 reassemblies required
    1137 packets reassembled ok
    2630 fragments received ok
    5669 fragments created
Icmp:
    811639 ICMP messages received
    400 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 4632
        redirects: 619
        echo requests: 806046
        echo replies: 185
        timestamp request: 44
        address mask request: 67
    876674 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 70142
        echo request: 452
        echo replies: 806036
        timestamp replies: 44
IcmpMsg:
        InType0: 185
        InType3: 4632
        InType5: 619
        InType8: 806046
        InType10: 2
        InType13: 44
        InType17: 67
        InType37: 44
        OutType0: 806036
        OutType3: 70142
        OutType8: 452
        OutType14: 44
Tcp:
    537972 active connections openings
    141489 passive connection openings
    3940 failed connection attempts
    90207 connection resets received
    9 connections established
    2877624333 segments received
    1502888618 segments send out
    378040 segments retransmited
    0 bad segments received.
    703715 resets sent
Udp:
    132980 packets received
    69241 packets to unknown port received.
    0 packet receive errors
    804134 packets sent
UdpLite:
TcpExt:
    56 resets received for embryonic SYN_RECV sockets
    514936 TCP sockets finished time wait in fast timer
    35 time wait sockets recycled by time stamp
    8541829 delayed acks sent
    7168 delayed acks further delayed because of locked socket
    Quick ack mode was activated 52 times
    1019468 packets directly queued to recvmsg prequeue.
    2562618 bytes directly in process context from backlog
    623450772 bytes directly received in process context from prequeue
    1470140290 packet headers predicted
    431554 packets header predicted and directly queued to user
    6735803 acknowledgments not containing data payload received
    1939166757 predicted acknowledgments
    61400 times recovered from packet loss by selective acknowledgements
    Detected reordering 1 times using FACK
    1 congestion windows fully recovered without slow start   
    5 congestion windows partially recovered using Hoe heuristic
    691 congestion windows recovered without slow start after partial ack
    302881 TCP data loss events
    TCPLostRetransmit: 6025
    719 timeouts after SACK recovery
    2 timeouts in loss state
    347805 fast retransmits
    19845 forward retransmits
    5795 retransmits in slow start
    3332 other TCP timeouts
    94 SACK retransmits failed
    77 DSACKs sent for old packets
    42 DSACKs received
    65852 connections reset due to unexpected data
    87707 connections reset due to early user close
    3 connections aborted due to timeout
    TCP ran low on memory 1 times
    TCPDSACKIgnoredOld: 40
    TCPDSACKIgnoredNoUndo: 2
    TCPSpuriousRTOs: 1
    TCPSackShifted: 1746475
    TCPSackMerged: 379798
    TCPSackShiftFallback: 115430
IpExt:
    InMcastPkts: 880
    InBcastPkts: 53279804
    InOctets: -1233700460
    OutOctets: 753104249
    InMcastOctets: 26566
    InBcastOctets: 1189547847
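
The negative InOctets value above is most likely a counter that was printed through a signed 32-bit integer. A small sketch of the reinterpretation, assuming that this is what happened (the real counter may be wider, in which case only its low 32 bits are recovered):

```python
def as_unsigned32(value):
    """Reinterpret a signed 32-bit integer as its unsigned counterpart."""
    return value & 0xFFFFFFFF

# Value taken from the IpExt section above.
in_octets = as_unsigned32(-1233700460)
print(in_octets)  # 3061266836
```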
Sincerely
Philipp Hahn
-- 
Philipp Hahn           Open Source Software Engineer      hahn@univention.de
Univention GmbH        Linux for Your Business        fon: +49 421 22 232- 0
Mary-Somerville-Str.1  28359 Bremen                   fax: +49 421 22 232-99
                                                   http://www.univention.de/
_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
James Harper
2011-Feb-26  00:47 UTC
RE: [Xen-users] RFH: Windows2003+GPLPV packet-receive breaks after some time (Xen 3.4.3 amd64)
> Hello,
>
> one of our domU Windows systems with the GPL-PV driver regularly has problems
> with its network connection: after some time the VM no longer receives any
> packets. It seems to be a problem with receiving only, since sending
> ARP packets still works:
>
> 	tcpdump -i vif147.0 -n arp | grep -FA1 --color XXX.X.71.77
>
> If I try to ping the domU from the dom0, I only see the request going to the
> domU, but no answer:
> 	13:49:17.106405 arp who-has XXX.X.71.77 tell XXX.X.12.47
>
> If I try to ping some host from the domU, I see the request leaving the domU
> and the answer arriving for the domU, but no ICMP messages follow:
> 	13:48:37.569618 arp who-has XXX.X.22.12 tell XXX.X.71.77
> 	13:48:37.570002 arp reply XXX.X.22.12 is-at 00:16:3e:aa:ed:fa
>
> We have saved the state of the VM to a file, which when restored puts the domU
> back in the broken state.
>
> We collected some information, but are now stuck on how best to proceed, since
> we don't know enough about the internal workings of Xen and GPLPV.
> Can we (or someone else) diagnose why received packets are not handled
> properly?
> Should we install the debug driver, and what should we do when the problem next
> occurs? (I'm not afraid of debuggers and assembler, but my experience is on
> Linux, not so much on Windows.)
>

Do you have any Linux PV domains?

If you install the debug version of the driver then you'll get info
written to /var/log/xen/qemu-dm-<domUname>.log which might show
something useful.

Also, try turning off all the offload functions in the advanced
properties of the network adapter under Linux.

Does your Dom0 have any GRE tunnels? I have seen problems when these are
used before, but that's a Dom0 routing interaction with checksum
offloading.

James
Philipp Hahn
2011-Feb-28  08:04 UTC
Re: [Xen-users] RFH: Windows2003+GPLPV packet-receive breaks after some time (Xen 3.4.3 amd64)
Hello James,

thanks for your fast answer.

On Saturday, 26 February 2011 01:47:23, James Harper wrote:
> Do you have any Linux PV domains?

I don't understand where a Linux PV domain fits in here, since the
problematic domU is a Windows domain. Is this for cross-testing PV problems?

> If you install the debug version of the driver then you'll get info
> written to /var/log/xen/qemu-dm-<domUname>.log which might show
> something useful
>
> Also, try turning off all the offload functions in the advanced
> properties of the network adapter under Linux.

Will try.

> Does your Dom0 have any GRE tunnels? I have seen problems when these are
> used before, but that's a Dom0 routing interaction with checksum
> offloading.

Not that I know of. Is it possible to detect these errors?
What I find strange is that the error occurs only after some time, after
everything worked fine. The occurrence of the error might be correlated with
high network traffic load, when the network backup starts.

Sincerely
Philipp Hahn
James Harper
2011-Feb-28  11:10 UTC
RE: [Xen-users] RFH: Windows2003+GPLPV packet-receive breaks after some time (Xen 3.4.3 amd64)
> Hello James,
>
> thanks for your fast answer.
>
> On Saturday, 26 February 2011 01:47:23, James Harper wrote:
> > Do you have any Linux PV domains?
>
> I don't understand where a Linux PV domain fits in here, since the
> problematic domU is a Windows domain. Is this for cross-testing PV problems?

Yes. If the problem occurs in a Linux PV domain (or even a Linux HVM
domain with PV drivers) then it rules GPLPV out as a problem.

> > If you install the debug version of the driver then you'll get info
> > written to /var/log/xen/qemu-dm-<domUname>.log which might show
> > something useful
> >
> > Also, try turning off all the offload functions in the advanced
> > properties of the network adapter under Linux.
>
> Will try.
>
> > Does your Dom0 have any GRE tunnels? I have seen problems when these are
> > used before, but that's a Dom0 routing interaction with checksum
> > offloading.
>
> Not that I know of. Is it possible to detect these errors?
> What I find strange is that the error occurs only after some time, after
> everything worked fine. The occurrence of the error might be correlated with
> high network traffic load, when the network backup starts.
>

With offload functions enabled I have seen these problems in conjunction
with GRE tunnels, but not on LAN traffic and not with offload functions
disabled.

James
Philipp Hahn
2011-Mar-07  10:16 UTC
Re: [Xen-users] RFH: Windows2003+GPLPV packet-receive breaks after some time (Xen 3.4.3 amd64)
Hello James, hello list,

thanks so far for your support.

On Monday, 28 February 2011 12:10:29, James Harper wrote:
> Yes. If the problem occurs in a Linux PV domain (or even a Linux HVM
> domain with PV drivers) then it rules GPLPV out as a problem

The problem has only been observed on Windows VMs with GPLPV, never on any
Linux VM or on Windows VMs without GPLPV (as far as I know). Not all Windows
VMs show the described behavior, and it takes some time to occur, normally
correlated with the nightly network backup.
The problem seems to have existed for a long time: we have reports of problems
going back as far as versions 0.9x of the GPLPV driver.

> > > If you install the debug version of the driver then you'll get info
> > > written to /var/log/xen/qemu-dm-<domUname>.log which might show
> > > something useful
> > >
> > > Also, try turning off all the offload functions in the advanced
> > > properties of the network adapter under Linux.

Okay, the debug version (GPLPV 0.10.0.238) is now installed and it shows the
following messages:

# grep XenNet qemu-dm-xnts010.log
XenNet --> DriverEntry
XenNet DriverObject = 8A787778, RegistryPath = 8A822000
XenNet NdisGetVersion = 50002
XenNet ndis_wrapper_handle = 00000000
XenNet ndis_wrapper_handle = 8A814C00
XenNet NdisMInitializeWrapper succeeded
XenNet MajorNdisVersion = 5, MinorNdisVersion = 1
XenNet about to call NdisMRegisterMiniport
XenNet called NdisMRegisterMiniport
XenNet <-- DriverEntry
XenNet --> XenNet_Init
XenNet IRQL = 0
XenNet nrl_length = 40
XenNet irq_vector = 01c, irq_level = 01c, irq_mode = NdisInterruptLevelSensitive
XenNet XEN_INIT_TYPE_13
XenNet XEN_INIT_TYPE_VECTORS
XenNet XEN_INIT_TYPE_DEVICE_STATE - 8A9F8FB4
XenNet --> XenNet_D0Entry
XenNet --> XenNet_ConnectBackend
XenNet XEN_INIT_TYPE_13
XenNet XEN_INIT_TYPE_VECTORS
XenNet XEN_INIT_TYPE_DEVICE_STATE - 8A9F8FB4
XenNet XEN_INIT_TYPE_RING - tx-ring-ref = 8A6CD000
XenNet XEN_INIT_TYPE_RING - rx-ring-ref = 8A6CC000
XenNet XEN_INIT_TYPE_EVENT_CHANNEL - event-channel = 9
XenNet XEN_INIT_TYPE_READ_STRING - mac = 00:16:3e:af:fa:a5
XenNet XEN_INIT_TYPE_READ_STRING - feature-sg = 1
XenNet XEN_INIT_TYPE_READ_STRING - feature-gso-tcpv4 = 1
XenNet XEN_INIT_TYPE_17
XenNet <-- XenNet_ConnectBackend
XenNet --> XenNet_RxInit
XenNet <-- XenNet_RxInit
XenNet <-- XenNet_D0Entry
XenNet --> XenNet_PnPEventNotify
XenNet NdisDevicePnPEventPowerProfileChanged
XenNet <-- XenNet_PnPEventNotify
XenNet (BUFFER_TOO_SHORT 100 > 28)
XenNet (BUFFER_TOO_SHORT 152 > 0)
XenNet (BUFFER_TOO_SHORT 152 > 0)
XenNet cannot allocate packet
XenNet No free packets
XenNet Ran out of packets

The last three messages are repeated multiple times.

(I can send you the full log by private email if you want to take a look.)

Since it might be related: /sys/class/net/vif205.0/ shows the following
statistics, where I find the number of tx_dropped unsettling:

./statistics/rx_packets:242028431
./statistics/tx_packets:170064873
./statistics/rx_bytes:340462359805
./statistics/tx_bytes:19457838604
./statistics/rx_errors:0
./statistics/tx_errors:0
./statistics/rx_dropped:0
./statistics/tx_dropped:1349522
./statistics/multicast:0
./statistics/collisions:0
./statistics/rx_length_errors:0
./statistics/rx_over_errors:0
./statistics/rx_crc_errors:0
./statistics/rx_frame_errors:0
./statistics/rx_fifo_errors:0
./statistics/rx_missed_errors:0
./statistics/tx_aborted_errors:0
./statistics/tx_carrier_errors:0
./statistics/tx_fifo_errors:0
./statistics/tx_heartbeat_errors:0
./statistics/tx_window_errors:0
./statistics/rx_compressed:0
./statistics/tx_compressed:0

I also noticed the following message, which I can't put into any context:

# tail -f /var/log/xen/xend-debug.log
xc_map_foreign_range: ioctl failed: Bad address

Sincerely
Philipp Hahn
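
To put the tx_dropped figure from the vif statistics in proportion, the counters can be parsed and compared. A minimal sketch, assuming the `./statistics/<name>:<value>` layout quoted in this message; `parse_statistics` is a hypothetical helper and the sample holds only a few of the quoted lines.

```python
def parse_statistics(text):
    """Parse './statistics/<name>:<value>' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        path, sep, value = line.strip().rpartition(":")
        if sep and value.isdigit():
            stats[path.rsplit("/", 1)[-1]] = int(value)
    return stats

# Excerpt of the counters reported for vif205.0.
sample = """./statistics/rx_packets:242028431
./statistics/tx_packets:170064873
./statistics/tx_errors:0
./statistics/tx_dropped:1349522
"""

stats = parse_statistics(sample)
drop_ratio = stats["tx_dropped"] / stats["tx_packets"]
print(f"{drop_ratio:.2%}")  # 0.79%
```

On the Dom0 side the vif's tx direction is the traffic headed into the domU, so under one percent of the packets sent toward the guest were dropped over the lifetime of the interface. The counter does not show whether the drops are spread evenly or clustered around the failures.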
James Harper
2011-Mar-07  10:58 UTC
RE: [Xen-users] RFH: Windows2003+GPLPV packet-receive breaks after some time (Xen 3.4.3 amd64)
> XenNet <-- XenNet_PnPEventNotify
> XenNet (BUFFER_TOO_SHORT 100 > 28)
> XenNet (BUFFER_TOO_SHORT 152 > 0)
> XenNet (BUFFER_TOO_SHORT 152 > 0)
> XenNet cannot allocate packet
> XenNet No free packets
> XenNet Ran out of packets
>
> The last three messages are repeated multiple times.
>
> (I can send you the full log per private Email, if you want to take a look.)
>

Probably not useful to send the full log, I think you've definitely
identified a leak. Strange that I've never seen it before... I have
several DomUs with several different versions of GPLPV with several
different combinations of checksum and large send offload enabled and
disabled, and some of them have been up for months.

Did you try with the offload features disabled?

> ./statistics/tx_dropped:1349522

The messages you are seeing above are in the rx path in DomU, which means
the tx path in Dom0. Do your DomUs receive a large amount of traffic?
Most of my traffic would be in the other direction, and ->DomU traffic
would be mostly at WAN speeds, not LAN speeds... I'll have a look at the
code and see if I've missed something.

James
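
A leak in a fixed-size packet pool would explain why reception works for a long time and then stops under load. The following toy model is purely illustrative and is not the GPLPV code: the pool size, the leak rate and the function name `packets_until_exhaustion` are all assumptions chosen for the example.

```python
def packets_until_exhaustion(pool_size, leak_every):
    """Count the receives handled before a fixed descriptor pool runs dry.

    Each receive takes one descriptor and returns it afterwards, except
    every `leak_every`-th receive, whose descriptor is never returned.
    """
    free = pool_size
    handled = 0
    while free > 0:
        free -= 1                      # allocate a descriptor for the packet
        handled += 1
        if handled % leak_every != 0:
            free += 1                  # normal path: descriptor is returned
    return handled

# Hypothetical numbers: 256 descriptors, one leaked per 1000 packets.
print(packets_until_exhaustion(256, 1000))  # 256000
```

In such a model the time to failure depends on the number of received packets rather than on uptime, so a burst of traffic such as a nightly backup would bring the "Ran out of packets" state forward.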
Philipp Hahn
2011-Mar-07  13:04 UTC
Re: [Xen-users] RFH: Windows2003+GPLPV packet-receive breaks after some time (Xen 3.4.3 amd64)
Hello James,

thanks again for the prompt answer.

On Monday, 07 March 2011 11:58:50, James Harper wrote:
> > XenNet <-- XenNet_PnPEventNotify
> > XenNet (BUFFER_TOO_SHORT 100 > 28)
> > XenNet (BUFFER_TOO_SHORT 152 > 0)
> > XenNet (BUFFER_TOO_SHORT 152 > 0)
> > XenNet cannot allocate packet
> > XenNet No free packets
> > XenNet Ran out of packets
> >
> > The last three messages are repeated multiple times.
> >
> > (I can send you the full log per private Email, if you want to take a
> > look.)

There were some more messages related to networking, which my grep missed:

XenNet XEN_INIT_TYPE_DEVICE_STATE - 8A9F8FB4
ScatterGather = 1
LargeSendOffload = 61440
ChecksumOffload = 1
ChecksumOffloadRxCheck = 1
MTU = 1500
RxInterruptModeration = 0
Could not read NetworkAddress value (c0000001) or value is invalid
XenNet --> XenNet_D0Entry
...
XenNet <-- XenNet_D0Entry
Get Unknown OID 0x10202
Get Unknown OID 0x10203
XenNet --> XenNet_PnPEventNotify
XenNet NdisDevicePnPEventPowerProfileChanged
XenNet <-- XenNet_PnPEventNotify
Get Unknown OID 0x10201
Get Unknown OID 0xfc010210
Get OID_TCP_TASK_OFFLOAD
XenNet (BUFFER_TOO_SHORT 100 > 28)
Get OID_TCP_TASK_OFFLOAD
config_csum enabled
nto = 8A4141A4
nto->Size = 24
nto->TaskBufferLength = 16
config_gso enabled
nto = 8A4141C8
nto->Size = 24
nto->TaskBufferLength = 16
&(nttls->IpOptions) = 8A4141E9
Set OID_TCP_TASK_OFFLOAD
TcpIpChecksumNdisTask
V4Transmit.IpOptionsSupported = 0
V4Transmit.TcpOptionsSupported = 1
V4Transmit.TcpChecksum = 1
V4Transmit.UdpChecksum = 0
V4Transmit.IpChecksum = 0
V4Receive.IpOptionsSupported = 0
V4Receive.TcpOptionsSupported = 0
V4Receive.TcpChecksum = 1
V4Receive.UdpChecksum = 0
V4Receive.IpChecksum = 0
V6Transmit.IpOptionsSupported = 0
V6Transmit.TcpOptionsSupported = 0
V6Transmit.TcpChecksum = 0
V6Transmit.UdpChecksum = 0
V6Receive.IpOptionsSupported = 0
V6Receive.TcpOptionsSupported = 0
V6Receive.TcpChecksum = 0
V6Receive.UdpChecksum = 0
TcpLargeSendNdisTask
MaxOffLoadSize = 61440
MinSegmentCount = 4
TcpOptions = 0
IpOptions = 0
Get OID_PNP_CAPABILITIES
Set Unknown OID 0x10119
Set OID_GEN_CURRENT_LOOKAHEAD 128 (8A6CE000)
Set OID_GEN_CURRENT_PACKET_FILTER (xi = 8A6CE000)
NDIS_PACKET_TYPE_DIRECTED
NDIS_PACKET_TYPE_MULTICAST
NDIS_PACKET_TYPE_BROADCAST
Get Unknown OID 0x10203
XenNet (BUFFER_TOO_SHORT 152 > 0)
Get Unknown OID 0x10117
XenVbd SCSIOP_MODE_SENSE llbaa = 0, dbd = 0, page_code = 63, allocation_length = 12
XenPCI --> XenPci_EvtDeviceUsageNotification

> Did you try with the offload features disabled?

Oops: because of the switch to the debugging drivers, those features were
re-enabled. We just disabled them again, which also unblocked the domain for
now. We'll monitor those domains for some time and see if the problem
reappears.

> > ./statistics/tx_dropped:1349522
>
> The messages you are seeing above are in the rx path in DomU which means
> the tx path in Dom0. Do your DomU's receive a large amount of traffic?

Both systems currently showing the problem send 10 times more data than
they receive, but that might still be above your average test case:

# ifconfig vif205.0 | tail -n 2 # Linux's point of view
RX bytes:361147609996 (336.3 GiB)  TX bytes:20727355262 (19.3 GiB)
RX bytes:1292788783876 (1.1 TiB)  TX bytes:170447442659 (158.7 GiB)

> Most of my traffic would be in the other direction, and ->DomU traffic
> would be mostly at WAN speeds, not LAN speeds... I'll have a look at the
> code and see if I've missed something.

Thanks again.

Sincerely
Philipp Hahn
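
The send/receive imbalance quoted from ifconfig can be worked out from the byte counters. A short sketch using only the four numbers in this message; the labels "system A" and "system B" are placeholders, since the post does not name the second interface.

```python
GIB = 1024 ** 3

# (RX bytes, TX bytes) as seen by Dom0; RX on the vif is what the domU sends.
counters = {
    "system A": (361147609996, 20727355262),
    "system B": (1292788783876, 170447442659),
}

ratios = {}
for name, (rx_bytes, tx_bytes) in counters.items():
    ratios[name] = rx_bytes / tx_bytes
    print(f"{name}: domU sent {rx_bytes / GIB:.1f} GiB, "
          f"received {tx_bytes / GIB:.1f} GiB, ratio {ratios[name]:.1f}")
```

The ratios come out at roughly 17 and 8, which brackets the "10 times" estimate and confirms that both guests mostly send. Even so, the receive side still amounts to about 19 GiB and 159 GiB respectively.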