thr3ads.net - Fedora xen - [Fedora-xen] TCP checksum corruption [May 2007]

Mike McGrath

2007-May-08 16:39 UTC

[Fedora-xen] TCP checksum corruption

We're using xen heavily in Fedora's Infrastructure and presently a
number of the xen domU hosts are experiencing terrible checksum issues.
I've tried the ethtool -K eth0 tx off fix and it didn't work.

dom0:  2.6.19-1.2911.fc6xen x86_64
domU: 2.6.20-1.2933.fc6xen x86_64

Anyone have any ideas?

    -Mike

Daniel P. Berrange

2007-May-08 17:34 UTC

Re: [Fedora-xen] TCP checksum corruption

On Tue, May 08, 2007 at 11:39:14AM -0500, Mike McGrath wrote:
> We're using xen heavily in Fedora's Infrastructure and presently a
> number of the xen domU hosts are experiencing terrible checksum issues.
> I've tried the ethtool -K eth0 tx off fix and it didn't work.

What sort of network config have you got with these? Bridging straight
to physical device, or NAT'd?

> dom0:  2.6.19-1.2911.fc6xen x86_64
> domU: 2.6.20-1.2933.fc6xen x86_64
>
> Anyone have any ideas?

There are a couple of issues at play:

 - There is a general bug in 2.6.20 that breaks checksum offload
   when used with NAT.
 - In 2.6.19 or later, Dom0 will transmit to guests using checksum
   offload, so the DHCP client in the guest will mistakenly think the
   packet has a corrupt checksum.

Addressing the first bug requires disabling checksum offload on eth0 in
the guest. ethtool -K eth0 tx off    in the guest should do it.

Addressing the 2nd is really difficult since the FC6 install images
themselves have a broken DHCP client, for example, so we need to work
around it in the kernel. This can be done by disabling checksums on the
device in Dom0 - any of vifN.0, xenbr0, peth0 should have
ethtool -K <dev> tx off done.

NB, ignore eth0 in Dom0, that's a fake device so turning off tx on that
does not fix things.

So in summary, to get it working in general case requires:

   ethtool -K eth0 tx off    in guest

And

   ethtool -K <dev> tx off   on whatever bridge device the guest is
                             attached to

We have just fixed both these issues in the latest  rawhide kernel available
in Koji http://koji.fedoraproject.org/packages/kernel-xen-2.6/2.6.20/2925.7.fc7/
and intend to also apply them to FC5 & 6 asap.

Regards,
Dan.
-- 
|=- Red Hat, Engineering, Emerging Technologies, Boston.  +1 978 392 2496 -=|
|=-           Perl modules: http://search.cpan.org/~danberr/              -=|
|=-               Projects: http://freshmeat.net/~danielpb/               -=|
|=-  GnuPG: 7D3B9505   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505  -=|
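
A minimal sketch of the workaround summarised above, assuming a POSIX
shell and ethtool on the PATH. The tx_off helper and the DRY_RUN variable
are hypothetical names introduced here, not part of ethtool or Xen, and
vif1.0 in the usage note stands in for whichever vifN.0 the guest
actually has.

```shell
#!/bin/sh
# Sketch: turn off TX checksum offload on each device given as an argument.
# DRY_RUN=1 (the default, an assumption of this sketch) only prints the
# commands; set DRY_RUN=0 to actually run them (needs root).
DRY_RUN="${DRY_RUN:-1}"

tx_off() {
    for dev in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            # Show what would be run without touching the device.
            echo "ethtool -K $dev tx off"
        else
            ethtool -K "$dev" tx off
        fi
    done
}
```

In the guest this would be called as tx_off eth0, and in Dom0 as, for
example, tx_off vif1.0 xenbr0 peth0. Printing the commands first makes it
easy to review the device list before re-running with DRY_RUN=0.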

Mike McGrath

2007-May-08 17:41 UTC

Re: [Fedora-xen] TCP checksum corruption

Daniel P. Berrange wrote:
> On Tue, May 08, 2007 at 11:39:14AM -0500, Mike McGrath wrote:
>
>> We're using xen heavily in Fedora's Infrastructure and presently a
>> number of the xen domU hosts are experiencing terrible checksum issues.
>> I've tried the ethtool -K eth0 tx off fix and it didn't work.
>
> What sort of network config have you got with these? Bridging straight
> to physical device, or NAT'd?

Bridge

> There are a couple of issues at play:
>
>  - There is a general bug in 2.6.20 that breaks checksum offload
>    when used with NAT.
>  - In 2.6.19 or later, Dom0 will transmit to guests using checksum
>    offload, so the DHCP client in the guest will mistakenly think the
>    packet has a corrupt checksum.
>
> Addressing the first bug requires disabling checksum offload on eth0 in
> the guest. ethtool -K eth0 tx off    in the guest should do it.
>
> Addressing the 2nd is really difficult since the FC6 install images
> themselves have a broken DHCP client, for example, so we need to work
> around it in the kernel. This can be done by disabling checksums on the
> device in Dom0 - any of vifN.0, xenbr0, peth0 should have
> ethtool -K <dev> tx off done.
>
> NB, ignore eth0 in Dom0, that's a fake device so turning off tx on that
> does not fix things.
>
> So in summary, to get it working in general case requires:
>
>    ethtool -K eth0 tx off    in guest
>
> And
>
>    ethtool -K <dev> tx off   on whatever bridge device the guest is
>                              attached to

I've actually run that on every interface on every dom[0,U] on the box
:).  I've also tried it on two other hosts.  One was a RHEL5 dom0 and the
other had different hardware but was also an FC6 dom0.  I can arrange
access to the box if you're interested.

    -Mike

Daniel P. Berrange

2007-May-08 17:54 UTC

Re: [Fedora-xen] TCP checksum corruption

On Tue, May 08, 2007 at 12:41:02PM -0500, Mike McGrath wrote:
> Daniel P. Berrange wrote:
> >On Tue, May 08, 2007 at 11:39:14AM -0500, Mike McGrath wrote:
> >
> >>We're using xen heavily in Fedora's Infrastructure and presently a
> >>number of the xen domU hosts are experiencing terrible checksum issues.
> >>I've tried the ethtool -K eth0 tx off fix and it didn't work.
> >
> >What sort of network config have you got with these? Bridging straight
> >to physical device, or NAT'd?
>
> Bridge

That's good - that should avoid the NAT-related bugs then.

> >There are a couple of issues at play:
> >
> > - There is a general bug in 2.6.20 that breaks checksum offload
> >   when used with NAT.
> > - In 2.6.19 or later, Dom0 will transmit to guests using checksum
> >   offload, so the DHCP client in the guest will mistakenly think the
> >   packet has a corrupt checksum.
> >
> >Addressing the first bug requires disabling checksum offload on eth0
> >in the guest. ethtool -K eth0 tx off    in the guest should do it.
> >
> >Addressing the 2nd is really difficult since the FC6 install images
> >themselves have a broken DHCP client, for example, so we need to work
> >around it in the kernel. This can be done by disabling checksums on
> >the device in Dom0 - any of vifN.0, xenbr0, peth0 should have
> >ethtool -K <dev> tx off done.
> >
> >NB, ignore eth0 in Dom0, that's a fake device so turning off tx on
> >that does not fix things.
> >
> >So in summary, to get it working in general case requires:
> >
> >   ethtool -K eth0 tx off    in guest
> >
> >And
> >
> >   ethtool -K <dev> tx off   on whatever bridge device the guest is
> >                             attached to
> >
> I've actually run that on every interface on every dom[0,U] on the box
> :).  I've also tried it on two other hosts.  One was a RHEL5 dom0 and
> the other had different hardware but was also an FC6 dom0.  I can
> arrange access to the box if you're interested.

Ok that makes absolutely no sense to me now :-)  Every time I hit it I was
able to solve it eventually by setting 'tx off' on some combo of devices.
The RHEL-5 Dom0 kernel also already has the necessary fixes in, which
makes it even odder that it doesn't work for you.

Dan.
-- 
|=- Red Hat, Engineering, Emerging Technologies, Boston.  +1 978 392 2496 -=|
|=-           Perl modules: http://search.cpan.org/~danberr/              -=|
|=-               Projects: http://freshmeat.net/~danielpb/               -=|
|=-  GnuPG: 7D3B9505   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505  -=|

Mike McGrath

2007-May-08 18:00 UTC

Re: [Fedora-xen] TCP checksum corruption

Daniel P. Berrange wrote:
> Ok that makes absolutely no sense to me now :-)  Every time I hit it I was
> able to solve it eventually by setting 'tx off' on some combo of devices.
> The RHEL-5 Dom0 kernel also already has the necessary fixes in, which
> makes it even odder that it doesn't work for you.

So the question is where to proceed from here?  I can tcpdump on the
bridge, peth and the domU host to see the errors.

    -Mike

Daniel P. Berrange

2007-May-08 18:05 UTC

Re: [Fedora-xen] TCP checksum corruption

On Tue, May 08, 2007 at 01:00:32PM -0500, Mike McGrath wrote:
> Daniel P. Berrange wrote:
> >Ok that makes absolutely no sense to me now :-)  Every time I hit it I
> >was able to solve it eventually by setting 'tx off' on some combo of
> >devices. The RHEL-5 Dom0 kernel also already has the necessary fixes
> >in, which makes it even odder that it doesn't work for you.
>
> So the question is where to proceed from here?  I can tcpdump on the
> bridge, peth and the domU host to see the errors.

Actually tcpdump showing errors does not necessarily mean that there are
errors! If checksum offload is enabled, then you expect tcpdump to show
errors, because the checksum is not filled in by the OS - it's left up to
the physical NIC. If you've turned 'tx off' on absolutely every device,
then tcpdump ought to show correct checksums though.

Dan.
-- 
|=- Red Hat, Engineering, Emerging Technologies, Boston.  +1 978 392 2496 -=|
|=-           Perl modules: http://search.cpan.org/~danberr/              -=|
|=-               Projects: http://freshmeat.net/~danielpb/               -=|
|=-  GnuPG: 7D3B9505   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505  -=|
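
One way to act on the point above is to check the offload state that
ethtool -k reports before trusting a capture, and then count the packets
tcpdump flags. A rough sketch, assuming ethtool -k prints a
"tx-checksumming: on|off" line and that verbose tcpdump output marks bad
checksums with either "incorrect" or "bad ... cksum" (the wording differs
between tcpdump versions). Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: helpers for deciding whether tcpdump checksum errors are real.

# Reads `ethtool -k <dev>` output on stdin and prints the
# tx-checksumming state ("on" or "off").
tx_checksum_state() {
    awk -F': *' '/^[ \t]*tx-checksumming:/ { split($2, w, " "); print w[1]; exit }'
}

# Reads verbose tcpdump text output on stdin and prints how many lines
# carry a bad-checksum marker (assumed wording, see the note above).
count_bad_checksums() {
    awk '/cksum.*incorrect|bad [a-z]+ cksum/ { n++ } END { print n + 0 }'
}
```

Typical use would be ethtool -k eth0 | tx_checksum_state, followed by
something like tcpdump -i eth0 -nn -vv -c 100 tcp | count_bad_checksums.
If the state is "on", a non-zero count on locally generated packets is
expected and does not by itself indicate corruption on the wire.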

Mike McGrath

2007-May-08 18:08 UTC

Re: [Fedora-xen] TCP checksum corruption

Daniel P. Berrange wrote:
> On Tue, May 08, 2007 at 01:00:32PM -0500, Mike McGrath wrote:
>
>> Daniel P. Berrange wrote:
>>
>>> Ok that makes absolutely no sense to me now :-)  Every time I hit it
>>> I was able to solve it eventually by setting 'tx off' on some combo
>>> of devices. The RHEL-5 Dom0 kernel also already has the necessary
>>> fixes in, which makes it even odder that it doesn't work for you.
>>
>> So the question is where to proceed from here?  I can tcpdump on the
>> bridge, peth and the domU host to see the errors.
>
> Actually tcpdump showing errors does not necessarily mean that there are
> errors! If checksum offload is enabled, then you expect tcpdump to show
> errors, because the checksum is not filled in by the OS - it's left up to
> the physical NIC. If you've turned 'tx off' on absolutely every device,
> then tcpdump ought to show correct checksums though.

So here's the core of the problem.  This is actually our koji builder.
When running:

koji list-tagged --quiet --latest --inherit f7-final

on the LAN, it takes about 10 seconds; when running it remotely from
another machine, it takes 12+ minutes.

    -Mike
