Moritz Mühlenhoff
2008-Oct-27 16:03 UTC
[Xen-users] GPLPV drivers block when copying huge files
Hi,

I hope this is the correct place to report problems with the GPLPV drivers.

The Xen setup is the following:
- Dom0: A 2.6.18-based Xen kernel (from Debian Etch) and Xen 3.2.0 (3.2.0-3~bpo4+2 to be precise)
- DomU: GPLPV drivers 0.9.11-pre18 and Windows 2003 (32 bits)

The performance with GPLPV is quite good (network throughput is almost ten times higher), but there are problems when copying huge files over the network using Windows Explorer: after about 40% of the copy, packets are lost for several seconds (as observed with a continuous ping to the virtualised host) and Windows aborts the copy. /var/log/xen/xend.log contains an error message that coincides with the connection teardown:

ERROR (__init__:1072) Internal error handling system.methodSignature: Invalid result signatures not supported

Operation with smaller files is reliable.

The initial setup was based on a DRBD file system, but the error could be reproduced on plain file storage as well. Likewise, the problem is not SMP-specific, since it could be reproduced on a single-processor system as well.

When booting without GPLPV driver support, the problem disappears. Is this a known problem? What further information could be collected to pinpoint the problem?

Cheers,
Moritz

-- 
Moritz Mühlenhoff muehlenhoff@univention.de fon: +49 421 22 232- 0
Development Linux for Your Business fax: +49 421 22 232-99
Univention GmbH http://www.univention.de/ mobil: +49 175 22 999 23
_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
James Harper
2008-Oct-28 01:08 UTC
RE: [Xen-users] GPLPV drivers block when copying huge files
> Is this a known problem? What further information could be collected to
> pinpoint the problem?

Not a known problem, but it should be reproducible. I have done some large file copying, but your 'huge' and my 'large' may be vastly different; how big are the files you are talking about? How willing are you to run some tests for me?

The xennet drivers have no limits on how much memory they could allocate internally for packet buffers. I've never seen that as a problem, but it's possible that they could be running out of resources and not handling that situation properly. Alternatively, they could be running Windows out of resources and Windows isn't handling that properly.
Can you please test and tell me if the 40% figure changes if you give your Windows DomU much more or much less memory? I would hope that with much less memory it would fail much earlier...

It could also be related to gso or csum offload... do things change if you disable either of those?

Thanks for the feedback

James
Moritz Mühlenhoff
2008-Oct-28 13:07 UTC
Re: [Xen-users] GPLPV drivers block when copying huge files
Hi,

James Harper wrote:
> Not a known problem, but it should be reproducible. I have done some large
> file copying, but your 'huge' and my 'large' may be vastly different - how
> big are the files you are talking about?

The file in question is ca. 9.5 GB. (All testing so far has been done with the same file, since the driver should copy data independently of the file's binary content. If there's reason to believe the problem could be triggered by file content, we can of course rerun the tests with a new file of similar size.)

> How willing are you to run some tests for me?

Very much; GPLPV is highly useful for free environments of virtualised Windows hosts. Thanks for developing it!

> The xennet drivers have no
> limits on how much memory they could allocate internally for packet
> buffers. I've never seen that as a problem but it's possible that they
> could be running out of resources and not handling that situation properly.
> Alternatively they could be running windows out of resources and windows
> isn't handling that properly.
>
> Can you please test and tell me if the 40% figure changes if you give your
> windows DomU much more or much less memory? I would hope that with much
> less memory it would fail much earlier...

The memory of the target DomU has been raised from 256 MB to 768 MB and the memory of the source DomU from 512 MB to 2 GB. This doesn't change anything: the copy aborts at the same position and 5-10 pings are lost.

Some further data points: it doesn't make a difference which DomU initiates the file transfer; the behaviour is identical. If the other endpoint is a physical Windows machine, the error is triggered less frequently, but can still be reproduced occasionally. The connection is Gigabit-based. The Windows event log doesn't show any exceptional entries.

> It could also be related to gso or csum offload... do things change if you
> disable either of those?

"Checksum Offload" is currently enabled; we'll test disabling it.
Is GSO the same as "Large Send Offload"? If so, we can test it as well.

Cheers,
Moritz
James Harper
2008-Oct-28 23:35 UTC
RE: [Xen-users] GPLPV drivers block when copying huge files
> > It could also be related to gso or csum offload... do things change if
> > you disable either of those?
>
> "Checksum Offload" is currently enabled, we'll test disabling it.
> Is GSO the same as "Large Send Offload"? If so, we can test it as well.

Yes, please disable Large Send Offload and let me know if the problem persists.

Thanks

James
Moritz Mühlenhoff
2008-Oct-29 09:32 UTC
Re: [Xen-users] GPLPV drivers block when copying huge files
James Harper wrote:
> > > It could also be related to gso or csum offload... do things change if
> > > you disable either of those?
> >
> > "Checksum Offload" is currently enabled, we'll test disabling it.
> > Is GSO the same as "Large Send Offload"? If so, we can test it as well.
>
> Yes, please disable Large Send Offload and let me know if the problem
> persists.

Good news: disabling "Large Send Offload" fixes the problem.

How does LSO enabled in the GPLPV driver interact with the LSO implemented by the NIC driver in Linux? The problem could be reproduced on three machines with Broadcom NICs (BCM5708 and BCM715), but wasn't triggered on a system with an onboard Nvidia NIC (MCP51).

Cheers,
Moritz
James Harper
2008-Oct-29 09:54 UTC
RE: [Xen-users] GPLPV drivers block when copying huge files
> Good news: Disabling "Large Send Offload" fixes the problem.
>
> How does enabled LSO in the GPLPV driver interact with the LSO implemented by
> the NIC driver in Linux? The mentioned problem could be reproduced on three
> machines with Broadcom NICs (BCM5708 and BCM715), but wasn't triggered on a
> system with an onboard Nvidia (MCP51).

It's all a bit of a mystery... but I think it goes something like this:

LSO allows the operating system to send a large (>MTU) TCP packet to the hardware. The hardware then breaks the packet up into header+MSS sized chunks, recalculates the sequence numbers and checksums, and sends them. Efficiency gains are to be had in doing this.

In a DomU, if a packet is sent to the virtual adapter, flows onto the Dom0 bridge interface, and then onto a real hardware interface, it is also more efficient to keep the packet 'large' all the way to the physical NIC, where the hardware finally breaks it up and sends it.

If, instead of going to a real hardware interface, the packet goes to another DomU, then it is also more efficient to keep the packet 'large' all the way to that DomU and never bother breaking it up. Also, because the originating DomU set the 'checksum correct' flag, the destination DomU never bothers checking the checksum, so you save both the original checksum calculation and all the in-between checksum validation.

There's a lot that could go wrong there, isn't there? Particularly if you start doing NAT or something funny on the bridge!

Windows supports Large Send Offload in a mostly compatible way, at least under Windows 2003 Server. XP has a known problem with interaction with the firewall service, but still seems to work okay once that is disabled. Unfortunately, Windows likes to break the packet up into more pieces than the Linux backend can handle, so in the GPLPV drivers I allocate some memory pages and copy the packet data into them. This reduces the number of scatter-gather segments to a maximum of 16, which also reduces the amount of space each packet takes up on the ring.

What Windows doesn't support, though, is Large Receive Offload, which the Xen network backend driver assumes is supported if LSO is advertised as supported. So I have to fudge this: if Dom0 sends me a 'large' packet, I break it up before giving it to Windows.

Also, although Windows says that it supports RX checksum offload, where the network adapter (or xennet in this case) calculates the checksum and reports on its correctness, it turns out that it doesn't really. If the checksum was never calculated because the packet originated on a Linux DomU, and I just tell Windows 'the checksum is correct, just trust me', it goes and checks anyway and then drops the packet when the checksum turns out not to be correct.

The above two paragraphs are why TX throughput from the GPLPV drivers is much higher (~2x) than RX.

So there is plenty of room for me to have made errors there...

Is anything useful reported if you run DebugView from Sysinternals?

James
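To make the LSO description above concrete, here is a minimal Python sketch of the segmentation step: one oversized TCP payload is split into MSS-sized pieces whose sequence numbers advance by the number of bytes that precede them. It is a simplified model, not GPLPV, netback or NIC code; `Segment` and `segment_large_send` are hypothetical names, and header copying and checksum recomputation are deliberately left out. The same split is what a receiver has to do when it is handed a 'large' packet that it cannot pass up as-is.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    seq: int        # TCP sequence number of the segment's first payload byte
    payload: bytes  # at most `mss` bytes


def segment_large_send(seq: int, payload: bytes, mss: int) -> List[Segment]:
    """Split one >MTU TCP send into MSS-sized segments (hypothetical model).

    Real LSO hardware also copies and patches the IP/TCP headers and
    recomputes checksums for each segment; that is omitted here.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        # Sequence numbers are 32-bit and wrap around.
        segments.append(Segment((seq + offset) & 0xFFFFFFFF, payload[offset:offset + mss]))
    return segments


if __name__ == "__main__":
    for s in segment_large_send(seq=1000, payload=bytes(4000), mss=1460):
        print(s.seq, len(s.payload))
```

With `seq=1000`, a 4000-byte payload and an MSS of 1460, this yields three segments starting at sequence numbers 1000, 2460 and 3920, carrying 1460, 1460 and 1080 bytes respectively.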
Fábio M. Catunda
2008-Oct-29 13:37 UTC
Re: [Xen-users] GPLPV drivers block when copying huge files
Hi all,

I have the same problem here. I tried to find some help with Google with no luck, and finally I received this message on the list with the solution! :-D

Well, what I could trace is that the problem occurs only when communicating between two DomUs and when using Samba/CIFS. I made a test transfer with HTTP and, if memory serves, it worked. I needed a solution, and the only one I could find was to put Samba on Dom0 (OK, this is not good); with that, everything worked for me.

Some tcpdump captures showed me a lot of checksum errors when communicating from DomU to DomU, but I could not find a reason; James's explanation makes sense to me, though.

An additional piece of information: the problem also happens when transferring lots of files in a row, not just with big files.

What is still missing for me is why things work just fine between Dom0 and DomU.

Oh, one more thing: what is the best way to disable Large Send Offload? Is it possible to disable it just for a DomU with some Xen config parameter, or do I have to do something like ethtool -k <bridge or hardware interface> gso off? I'm a little confused here.

Thanks in advance.

Catunda.
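Fábio's tcpdump observation is consistent with James's explanation: if the sending DomU relies on offload and never computes the final TCP checksum, any receiver that verifies it anyway sees a mismatch and drops the packet. The toy Python sketch below uses the RFC 1071 Internet checksum (the ones'-complement sum used by TCP) to show this. As simplifying assumptions, the 'packet' is just a 2-byte checksum field followed by payload, with no TCP pseudo-header, and the never-computed checksum is modeled as zero; `internet_checksum` and `receiver_accepts` are illustrative names.

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: ones'-complement of the ones'-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


def receiver_accepts(packet: bytes) -> bool:
    """A receiver that re-checks: with a correct checksum the packet sums to 0xFFFF."""
    return internet_checksum(packet) == 0


payload = b"hello, xen!"
# Sender that computes the checksum (checksum field zeroed while summing).
checksum = internet_checksum(b"\x00\x00" + payload)
good_packet = checksum.to_bytes(2, "big") + payload
# Sender that trusted offload and left the field unfilled (zero in this model).
unchecked_packet = b"\x00\x00" + payload

print(receiver_accepts(good_packet))       # True
print(receiver_accepts(unchecked_packet))  # False
```

A receiver that honoured a 'checksum correct' flag would skip `receiver_accepts` entirely, which is why the same packet can be fine for one guest and dropped by another.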
James Harper
2008-Oct-30 04:24 UTC
RE: [Xen-users] GPLPV drivers block when copying huge files
> Oh, one more thing, what is the best way to disable Large Send Offload?
> Is it possible to disable it just for a domU with some xen config
> parameter or do I have to do something like ethtool -k <bridge or
> hwinterface> gso off, I'm a little confused here.

If you disable it in a DomU (Windows+GPLPV or other), then the DomU should advertise to Dom0 that the interface does not support it.

James
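James's point is that the offload setting is negotiated per guest: a DomU that does not advertise large-packet support should not be handed 'large' packets by Dom0. A hypothetical Python model of that decision is below. The key name is modeled on the `feature-gso-tcpv4` xenstore entry from Xen's network frontend/backend protocol, but the dictionary, `frontend_features` and `backend_deliver` are invented for illustration and are not netback code; in this model Dom0 falls back to splitting the packet into MSS-sized pieces itself.

```python
from typing import Dict, List


def frontend_features(lso_enabled: bool) -> Dict[str, str]:
    """Hypothetical: the feature entries a guest's network frontend publishes."""
    return {"feature-gso-tcpv4": "1" if lso_enabled else "0"}


def backend_deliver(features: Dict[str, str], payload_len: int, mss: int) -> List[int]:
    """Return the payload sizes Dom0 hands to the guest in this model."""
    # A missing key is treated as "not supported".
    if features.get("feature-gso-tcpv4", "0") == "1":
        return [payload_len]  # guest claims it can take one large packet
    sizes = []
    remaining = payload_len
    while remaining > 0:  # otherwise split before delivery
        sizes.append(min(mss, remaining))
        remaining -= mss
    return sizes


print(backend_deliver(frontend_features(True), 4000, 1460))   # [4000]
print(backend_deliver(frontend_features(False), 4000, 1460))  # [1460, 1460, 1080]
```

The first call mirrors the situation James describes on the receive side: once large-packet support is advertised, the guest driver itself has to break the packet up before Windows sees it.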
Fábio M. Catunda
2008-Oct-30 15:38 UTC
Re: [Xen-users] GPLPV drivers block when copying huge files
James Harper wrote:
>> Oh, one more thing, what is the best way to disable Large Send Offload?
>> Is it possible to disable it just for a domU with some xen config
>> parameter or do I have to do something like ethtool -k <bridge or
>> hwinterface> gso off, I'm a little confused here.
>
> If you disable it in a DomU (Windows+GPLPV or other) then the DomU should advertise to Dom0 that the interface does not support it.
>
> James

James,

It worked, thanks. I will leave the exact command here in case it helps somebody who is as confused as I was. I'm using Xen on Debian, so the physical interface is named pethN, and the command is:

ethtool -K peth0 tx off gso on

James, I'm still curious to know why things work when the server is in Dom0. Is there a reason for that?

Thanks.

Fábio Catunda.