Some versions of Windows appear to give the network adapter driver a packet broken up into fairly small pieces, e.g.

Page 0: 14 bytes of Ethernet Header
Page 1: 20 bytes of IP Header
Page 2: 20 bytes of TCP Header
Page 3: 1460 bytes of TCP Data

When this happens, Linux appears to not pass the packets beyond the vifX.Y interface - a tcpdump on (say) vif455.0 shows packets but a tcpdump on eth0 does not show all the packets - packets with a bad checksum don't make it that far.

Our best guess is that the Linux checksum offload code can't cope with the way Windows is fragmenting the packets, but maybe Xen is somehow involved in this...

Can someone please confirm that this is a limitation of Linux and/or Xen?

Thanks

James

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
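The page layout described above can be modelled in a few lines of user-space Python. This is only a sketch for reasoning about the layout, not driver code: the name `headers_in_first_fragment` is hypothetical, the fragment contents are placeholder zero bytes, and the 20-byte IP and TCP header sizes assume no options, as in the example above.

```python
# Toy model of a scatter-gather packet: each element is one "page"
# (NDIS fragment) holding part of the frame. Contents are placeholders.
ETH_HLEN = 14   # Ethernet header
IP_HLEN = 20    # IPv4 header, assuming no options
TCP_HLEN = 20   # TCP header, assuming no options
HEADER_LEN = ETH_HLEN + IP_HLEN + TCP_HLEN


def headers_in_first_fragment(fragments, header_len=HEADER_LEN):
    """Return True if the first fragment alone holds all the headers."""
    return bool(fragments) and len(fragments[0]) >= header_len


# Layout reported for the affected Windows versions: one header per page.
windows_layout = [bytes(14), bytes(20), bytes(20), bytes(1460)]

# For comparison: all headers contiguous in the first page.
merged_layout = [bytes(54), bytes(1460)]

print(headers_in_first_fragment(windows_layout))  # False
print(headers_in_first_fragment(merged_layout))   # True
```

Both layouts carry the same 1514 bytes; the only difference is where the page boundaries fall relative to the headers.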
Pasi Kärkkäinen
2008-Apr-05 10:17 UTC
Re: [Xen-devel] Linux TCP Checksum offload limitations
On Fri, Apr 04, 2008 at 10:04:53PM +1100, James Harper wrote:
> Some versions of Windows appear to give the network adapter driver a
> packet broken up into fairly small pieces, e.g.
> Page 0: 14 bytes of Ethernet Header
> Page 1: 20 bytes of IP Header
> Page 2: 20 bytes of TCP Header
> Page 3: 1460 bytes of TCP Data
>
> When this happens, Linux appears to not pass the packets beyond the
> vifX.Y interface - a tcpdump on (say) vif455.0 shows packets but a
> tcpdump on eth0 does not show all the packets - packets with a bad
> checksum don't make it that far.
>
> Our best guess is that the Linux checksum offload code can't cope with
> the way Windows is fragmenting the packets, but maybe Xen is somehow
> involved in this...
>
> Can someone please confirm that this is a limitation of Linux and/or
> Xen?
>

What version of Windows has this problem?

Did you find out anything about it yet?

-- Pasi
> > Can someone please confirm that this is a limitation of Linux and/or
> > Xen?
>
> What version of Windows has this problem?

I am not seeing it on Windows 2003 sp1 or sp2, but am seeing it on XP sp2. Andy is seeing it on XP sp2 and I think Windows 2003.

> Did you find out anything about it yet?

We're working on it. The solution I'm testing right now involves copying the header fragments to a single buffer. Not sure what it will do for performance...

James
Pasi Kärkkäinen
2008-Apr-05 10:41 UTC
Re: [Xen-devel] Linux TCP Checksum offload limitations
On Sat, Apr 05, 2008 at 09:23:25PM +1100, James Harper wrote:
> > > Can someone please confirm that this is a limitation of Linux and/or
> > > Xen?
> >
> > What version of Windows has this problem?
>
> I am not seeing it on Windows 2003 sp1 or sp2, but am seeing it on XP sp2.
>
> Andy is seeing it on XP sp2 and I think Windows 2003.
>

OK.

> > Did you find out anything about it yet?
>
> We're working on it. The solution I'm testing right now involves copying
> the header fragments to a single buffer. Not sure what it will do for performance...
>

Yeah, I was just going to ask what the working packet/page layout looks like..

-- Pasi
On Fri, 4 Apr 2008 22:04:53 +1100 "James Harper" <james.harper@bendigoit.com.au> wrote:
> Some versions of Windows appear to give the network adapter driver a
> packet broken up into fairly small pieces, e.g.
> Page 0: 14 bytes of Ethernet Header
> Page 1: 20 bytes of IP Header
> Page 2: 20 bytes of TCP Header
> Page 3: 1460 bytes of TCP Data

NDIS fragments are nothing to do with the wire side interface.

> Our best guess is that the Linux checksum offload code can't cope with
> the way Windows is fragmenting the packets, but maybe Xen is somehow
> involved in this...

Unconnected with Linux, Xen bug. Xen is responsible for handling NDIS s/g lists on the Windows side and turning them into a single virtual network packet.
> On Fri, 4 Apr 2008 22:04:53 +1100
> "James Harper" <james.harper@bendigoit.com.au> wrote:
>
> > Some versions of Windows appear to give the network adapter driver a
> > packet broken up into fairly small pieces, e.g.
> > Page 0: 14 bytes of Ethernet Header
> > Page 1: 20 bytes of IP Header
> > Page 2: 20 bytes of TCP Header
> > Page 3: 1460 bytes of TCP Data
>
> NDIS fragments are nothing to do with the wire side interface

An NDIS fragment is just a page of data...

> > Our best guess is that the Linux checksum offload code can't cope with
> > the way Windows is fragmenting the packets, but maybe Xen is somehow
> > involved in this...
>
> Unconnected with Linux, Xen bug. Xen is responsible for handling NDIS s/g
> lists on the windows side and turning them into a single virtual network
> packet

A single virtual network packet as passed from Windows to Xen consists of one or more pages of data. When the first page contains at least the Ethernet+IP+TCP header, everything works great. When the first page contains the Ethernet header, the second page the IP header, the third the TCP header, and subsequent pages contain the data, Linux refuses to accept that 'csum_blank' is valid and drops the packet _after_ it leaves the vif interface.

Just to elaborate on that, Xen successfully builds a packet out of the pages, and I can definitely see the packet via a tcpdump on (say) vif537.0, but it is dropped by Linux before it gets passed on to the bridge. So Linux initially accepts the packet as valid.

Now, from looking at the code I can see that an skb can definitely handle a packet with the data split across multiple pages, but my theory is that the Linux checksum offload stuff can't handle having the packet _header_ (Ethernet+IP+TCP) split across multiple pages.

This gives me four possible truths...

1. Linux definitely requires that the first page in an skb consist of a complete packet header, and this is a documented requirement but I couldn't find it (e.g. it's a bug for my Windows PV drivers to give a packet like this to Xen)
2. As above but it is not documented anywhere (e.g. it's a bug in the documentation).
3. Linux should handle the complete packet header being split across multiple pages, but for some reason it doesn't, and it's never come up before (e.g. it's a bug in the Linux csum offload code)
4. Something else I haven't thought of.

I guess I'm just looking for someone who knows about these things to say that "yes, Linux should handle such a header split" or "no, Linux doesn't handle this, fix your NDIS driver." I have actually done the latter for now - the Windows PV drivers now merge enough data together to guarantee that the entire Ethernet+IP+TCP header is on a single page, but there are overheads in doing that.

Thanks

James
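One thing that can be checked without touching the kernel: the Internet checksum (RFC 1071) is a plain ones' complement sum over the bytes, so the arithmetic itself does not depend on where page boundaries fall. The sketch below is a user-space model, not the Linux csum code; `internet_checksum` is a hypothetical name, and the test bytes are the worked example from RFC 1071. That the result is independent of fragmentation would be consistent with the theory that any difficulty lies in locating the headers rather than in the summation, though this model says nothing about what the kernel actually does.

```python
def internet_checksum(fragments):
    """RFC 1071 checksum over a list of byte fragments."""
    total = 0
    offset = 0  # byte position within the whole packet
    for fragment in fragments:
        for byte in fragment:
            # Even positions are the high byte of a 16-bit word,
            # regardless of which fragment the byte lives in.
            total += byte << 8 if offset % 2 == 0 else byte
            offset += 1
    # Fold the carries back into 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


data = bytes.fromhex("0001f203f4f5f6f7")  # worked example from RFC 1071
whole = internet_checksum([data])
split = internet_checksum([data[:3], data[3:4], data[4:]])
print(hex(whole), hex(split))  # 0x220d 0x220d
```

The split case deliberately uses odd-sized fragments, which is why the model tracks the byte offset across the whole packet instead of summing each fragment separately.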
> "James Harper" <james.harper@bendigoit.com.au> writes:
> >
> > 1. Linux definitely requires that the first page in an skb consist of a
> > complete packet header, and this is a documented requirement but I
> > couldn't find it (e.g. it's a bug for my Windows PV drivers to give a
> > packet like this to Xen)
>
> It is true for TX. RX makes some attempts to fix up packets
> in this case (but it's slow), but not TX because it is assumed
> no TX code is stupid enough to do anything like this.

I'm not sure I agree with your declaration of 'stupid'. Windows presumably gets its share of performance testing so I assume that there are reasons for doing it this way. I'm guessing that they just reuse the same page over and over where the Ethernet src and dst address don't change (always true for a single connection), and use a pool of pre-setup pages for sending so all they need to do is update the ip length in the IP header and the seq and ack fields in the TCP header.

Windows has the disadvantage that it has made the assumption that it is going to be talking to real hardware that can always handle this, not an emulated hardware device (from Windows PoV) that is a bit more limited in what it can handle.

> You'll just have to fix it up somewhere in your driver.

Yep. That's what I've done. I'll just have to live with the performance hit I guess. The fact that I can't tell Windows to please put the whole packet header on one page is pretty stupid, although the end result in that case may just be that NDIS does the assembly instead of me.

> Should be enough to do a pskb_may_pull()

My driver is on the Windows side of things, so I'm pretty much stuck with only giving Linux what it can cope with.

> > 2. As above but it is not documented anywhere (e.g. it's a bug in the
> > documentation).
>
> Well like in most complex and fast evolving software documentation
> is not always complete and uptodate.

I would have been happy with a comment in the linux src. That counts as documentation to me :) But as you said, it's probably a reasonable assumption that the packet header is completely on one page and just that this situation has never come up before.

Thanks for clarifying

James
> > I'm not sure I agree with your declaration of 'stupid'. Windows
>
> stupid in the context of linux networking. Other systems have other
> tradeoffs.

Understood.

> > presumably gets its share of performance testing so I assume that there
> > are reasons for doing it this way. I'm guessing that they just reuse the
> > same page over and over where the Ethernet src and dst address don't
> > change (always true for a single connection), and use a pool of
> > pre-setup pages for sending so all they need to do is update the ip
> > length in the IP header and the seq and ack fields in the TCP header.
>
> That would sound pretty dumb if true because ethernet headers are very
> very
> cheap to set up. Even if you consider reusing IOMMU mappings (which Windows
> doesn't support anyways afaik) it probably wouldn't be a good idea.

If you are sending hundreds of thousands of packets per second then reusing the same page over and over might give you some benefit. No setup is better than very cheap setup. But...

> > Windows has the disadvantage that it has made the assumption that it is
> > going to be talking to real hardware that can always handle this, not an
> > emulated hardware device (from Windows PoV) that is a bit more limited
> > in what it can handle.
>
> Even real hardware goes slower when it has to do more scatter-gather

Yes. I hadn't thought of that. I guess it depends on whether "allocating a page, looking up and copying the src and dst mac address" is cheaper than "getting the same page we used last time and putting it on the sg list". With the latter, we are still sending actual data so there is still a page to allocate for that anyway, and for the rest of the headers...

As I said, Microsoft as a whole appears to act pretty stupid at times, but I'm sure their coders know how to do performance analysis on driver behaviour and figure out which gives better performance.

> > > You'll just have to fix it up somewhere in your driver.
> >
> > Yep. That's what I've done. I'll just have to live with the performance
> > hit I guess. The fact that I can't tell Windows to please put the whole
>
> If you code it right the performance hit should be rather small. You
> just have to copy the header, not everything.

Yep. I allocate a new page (well... just get it off my freelist) then just loop around and keep appending data until I have a full header.

> > > Should be enough to do a pskb_may_pull()
> >
> > My driver is on the Windows side of things, so I'm pretty much stuck
> > with only giving Linux what it can cope with.
>
> I meant on the Xen frontend side.

My Windows driver is the Xen frontend. It is to Windows what netfront is to Linux. Linux obviously doesn't have this problem on its frontend :)

James
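The "loop around and keep appending data until I have a full header" approach can be sketched as a user-space model. This is not the Windows PV driver code: `coalesce_headers` is a hypothetical name, a `bytearray` stands in for the page taken off the freelist, and the fixed 54-byte header length assumes Ethernet plus IP and TCP headers without options. A real driver would presumably have to read the IP header length and TCP data offset fields to know how much to gather.

```python
def coalesce_headers(fragments, header_len=54):
    """Gather the first header_len bytes into a single leading fragment.

    Only header bytes are copied into the new buffer; payload bytes
    stay in their own fragments, in the original order.
    """
    head = bytearray()  # stands in for the page taken off the freelist
    rest = []
    for fragment in fragments:
        need = header_len - len(head)
        if need > 0:
            # Append as much of this fragment as the header still needs.
            head += fragment[:need]
            fragment = fragment[need:]
        if fragment:
            rest.append(fragment)
    if len(head) < header_len:
        raise ValueError("packet shorter than header_len")
    return [bytes(head)] + rest


# One header per page, as in the layout reported at the top of the thread.
frags = [b"E" * 14, b"I" * 20, b"T" * 20, b"D" * 1460]
merged = coalesce_headers(frags)
print([len(f) for f in merged])  # [54, 1460]
```

In this model the cost is one 54-byte copy per packet; the 1460 bytes of payload are passed through untouched, which matches the point made above that only the header needs copying.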