I wrote to xen-users a few weeks ago about difficulty I was having
with the TCP checksum offload feature[1]. The xen-users list seems to
have a fair few people who are having difficulty with this
optimisation. In order to debug my problem, I ended up modifying the
Xen 3.0.1 network backend driver[2]. Since I'm suggesting a change to
the behaviour and default configuration, it seems appropriate to post
here:
Background:
Hardcoded in the Xen 3.0.1 network backend driver (in the supplied
patch to Linux 2.6.12) is the notion that packets `outbound' through
the network backend (destined for a frontend in another guest) do not
ever need to be checksummed.
I can't find any design documentation which explains this decision, but
I presume that this is the result of the following chain of reasoning
about virtual network interfaces:
1. The backend is in dom0 and the frontend is in some domU.
2. domU does not have and use any physical network hardware.
3. The domU does not act as a router-encapsulator. (eg,
run a VPN client, tunnel endpoint, etc. etc.)
4. The domU will always know correctly whether the packet
originated from dom0 (checksum not needed, not calculated) or from
some other machine and just came via domU (checksum calculated and
needed).
5. Therefore all packets leaving dom0 for domU will terminate
on that domU and do not need to be checksummed.
(It is possible that there's something fancy happening in the
frontend; I briefly looked at that code but didn't take the time to
understand it fully.)
All of the assumptions 1-4 can be false. 1-3 can be false in many
network topologies and the system should not assume that the network
topology is as set up by the provided default configuration scripts.
4 is apparently false in my case and caused the symptoms I saw.
While Xen allows the frontend interface's `transmit checksum offload'
(ie, for packets leaving that guest) to be enabled and disabled from
userland, so that checksum calculation can be suppressed, it does not
allow the `receive checksum offload' (for packets entering the guest)
to be controlled, and it does not allow the backend's checksum
processing to be enabled and disabled (in 3.0.1, at least).
Some observations:
In the general case, it is not possible to determine whether any
particular packet needs checksum processing (generation, outbound, or
checking, inbound) without knowledge of the network topology and
configuration. This network topology and configuration could be very
complex, as many of the guests supported by Xen have very
sophisticated (not to say dangerous!) mixed-layer packet routing and
mangling capabilities; additionally, Xen guests (including dom0 and
domU) may well contain instances of routers or encapsulators which
will further complicate the topology.
Therefore, it is not possible to encode rules for correct behaviour in
the code for Xen's virtual network devices. The correct behaviour can
only be determined by the network configuration scripts which are also
responsible for establishing the desired network topology.
Ie, the behaviour must be configurable from userland.
In many (most?) scenarios, checksums cannot safely be suppressed for
any significant proportion of the traffic. If the guests are strongly
isolated with their own filesystems and the purpose is providing
multiple largely-independent hardware platforms, guest-guest
communication will be relatively rare, and of course communications
from one guest to the internet at large must be checksummed. The
suppression is only useful when a large amount of network traffic has
the different guests as endpoints; the most likely scenario is one
where the guests share `network' filesystems from dom0 - but this is
not the default configuration with the supplied scripts, and doing it
safely involves significant effort to ensure that the fs traffic is
protected from interference.
Ie, the checksum offload should be disabled by default.
It's probably too hard to write sensible rules, or provide a sensible
mechanism, to allow different packets traversing the same interface to
be treated differently. The administrator will probably want to
control the checksumming via iptables rules, routing tables, or other
normal host-side mechanisms and Linux's packet-handling system is not
ideally suited for this AFAIAA.
So, I conclude that:
* Checksum suppression for virtual network backends should not be done
with NETIF_F_NO_CSUM but with NETIF_F_IP_CSUM or the like, as for
the frontends.
* Any code in the frontend that attempts to decide whether the
peer for a packet is the backend guest itself or some other machine
further away should be removed.
* Checksum suppression control with ethtool -K should be supported
both for outbound and inbound packets on both frontend and backend
devices.
* The default should have checksum suppression enabled.
* Ideally, there would be example scripts which provide guest domains
with a set of eth1's on a private entirely-virtual network, all of
whose interfaces have checksums suppressed, and which does not
exchange packets with the wider Internet. This could be used for
intra-system NFS, etc.
Thanks,
Ian.
[1] http://lists.xensource.com/archives/html/xen-users/2006-03/msg00135.html
and the subsequent thread.
[2] http://lists.xensource.com/archives/html/xen-users/2006-03/msg00159.html
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Ian Jackson wrote:
> Hardcoded in the Xen 3.0.1 network backend driver (in the supplied
> patch to Linux 2.6.12) is the notion that packets `outbound' through
> the network backend (destined for a frontend in another guest) do not
> ever need to be checksummed.
Yes. Excellent and timely summary. I just started looking into
the offload problem for VLANs. Jon Mason and Jim Dykman
generated a patch for the IPSec environment issue, but
due to concerns about whether it would be acceptable
upstream, this hasn't yet been blessed. I'd really like
to look at that bug in a wider context with many of the
issues you just specified addressed, but this was going
to be post 3.0.2 and distro release happening.
> I can't find any design documentation which explains this decision, but
> I presume that this is the result of the following chain of reasoning
> about virtual network interfaces:
> 1. The backend is in dom0 and the frontend is in some domU.
> 2. domU does not have and use any physical network hardware.
> 3. The domU does not act as a router-encapsulator. (eg,
> run a VPN client, tunnel endpoint, etc. etc.)
> 4. The domU will always know correctly whether the packet
> originated from dom0 (checksum not needed, not calculated) or from
> some other machine and just came via domU (checksum calculated and
> needed).
> 5. Therefore all packets leaving dom0 for domU will terminate
> on that domU and do not need to be checksummed.
> (It is possible that there's something fancy happening in the
> frontend; I briefly looked at that code but didn't take the time to
> understand it fully.)
At the point this was done, there was not support for
a different model (backend in dom0, frontend in domU).
It was assumed to be the traffic model.
> All of the assumptions 1-4 can be false. 1-3 can be false in many
> network topologies and the system should not assume that the network
> topology is as set up by the provided default configuration scripts.
> 4 is apparently false in my case and caused the symptoms I saw.
>
> While Xen allows the frontend interface's `transmit checksum offload'
> (ie, for packets leaving that guest) to be enabled and disabled from
> userland, so that checksum calculation can be suppressed, it does not
> allow the `receive checksum offload' (for packets entering the guest)
> to be controlled, and it does not allow the backend's checksum
> processing to be enabled and disabled (in 3.0.1, at least).
Since I believe we only initiate for outgoing, suppressing
the offload on the transmit on DomU should be enough to
bypass this behaviour(?).
> Therefore, it is not possible to encode rules for correct behaviour in
> the code for Xen's virtual network devices. The correct behaviour can
> only be determined by the network configuration scripts which are also
> responsible for establishing the desired network topology.
>
> Ie, the behaviour must be configurable from userland.
I agree this should be configurable.
> In many (most?) scenarios, checksums cannot safely be suppressed for
> any significant proportion of the traffic. If the guests are strongly
Majority of the workloads probably expect guest <-> remote
communication. I'd be interested in which workloads (if any)
expect heavy dom0 <-> guest or guest <-> guest communication.
> isolated with their own filesystems and the purpose is providing
> multiple largely-independent hardware platforms, guest-guest
> communication will be relatively rare, and of course communications
> from one guest to the internet at large must be checksummed. The
Deferring the checksum to dom0 [Assumption = dom0 is where
it reaches the physical hw] where it can be offloaded
to the real hardware is not a bad idea - expected to be a
non-trivial performance boost.
> suppression is only useful when a large amount of network traffic has
> the different guests as endpoints; the most likely scenario is one
> where the guests share `network' filesystems from dom0 - but this is
> not the default configuration with the supplied scripts, and doing it
> safely involves significant effort to ensure that the fs traffic is
> protected from interference.
>
> Ie, the checksum offload should be disabled by default.
> * Checksum suppression for virtual network backends should not be done
> with NETIF_F_NO_CSUM but with NETIF_F_IP_CSUM or the like, as for
> the frontends.
Exactly what I was going to look into (changing the way we
do the implementation right now) for post-3.0.2.
> * Any code in the frontend that attempts to decide whether the
> peer for a packet is the backend guest itself or some other machine
> further away should be removed.
Perhaps.
> * Checksum suppression control with ethtool -K should be supported
> both for outbound and inbound packets on both frontend and backend
> devices.
Definitely.
> * The default should have checksum suppression enabled.
Agreed.
> * Ideally, there would be example scripts which provide guest domains
> with a set of eth1's on a private entirely-virtual network, all of
> whose interfaces have checksums suppressed, and which does not
> exchange packets with the wider Internet. This could be used for
> intra-system NFS, etc.
Exactly. Yes. :)
thanks,
Nivedita
Nivedita Singhvi writes ("Re: [Xen-devel] checksum `offload'"):
> Excellent and timely summary. I just started looking into
> the offload problem for VLANs. Jon Mason and Jim Dykman
> generated a patch for the IPSec environment issue, but
> due to concerns about whether it would be acceptable
> upstream, this hasn't yet been blessed. I'd really like
> to look at that bug in a wider context with many of the
> issues you just specified addressed, but this was going
> to be post 3.0.2 and distro release happening.
Would it be better to disable this feature in 3.0.2 in the meantime ?
Just a suggestion. When I first encountered this problem I naturally
searched the xen-users archives and it seems to be causing trouble for
a fair few people and the ethtool -K rune is being handed around as
folklore amongst the poor unwashed there (although of course it
doesn't always work).
> > [various assumptions, including:]
> > 3. The domU does not act as a router-encapsulator. (eg,
> > run a VPN client, tunnel endpoint, etc. etc.)
>
> At the point this was done, there was not support for
> a different model (backend in dom0, frontend in domU).
> It was assumed to be the traffic model.
My assumption no.3 could still have been violated easily, surely, by
running a VPN client in a domU which the dom0 uses for some traffic ?
VPN packets would leave dom0 for domU via the virtual interface, be
encapsulated and encrypted there (bad checksum and all), and be routed
back out via dom0. The eventual receiving system would decrypt it and
find the checksum was wrong.
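To make that failure concrete, here is a minimal Python sketch. It
assumes a toy segment layout (a 2-byte checksum field followed by the
payload) rather than real TCP or VPN framing, and the function names
are made up for illustration. It only shows that an encapsulator which
treats the inner packet as opaque bytes carries the uncomputed
checksum all the way to the far end.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """Ones'-complement Internet checksum (RFC 1071) over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def make_segment(payload: bytes, fill_checksum: bool) -> bytes:
    """Toy segment (hypothetical layout): 2-byte checksum, then payload."""
    csum = internet_checksum(b"\x00\x00" + payload) if fill_checksum else 0
    return struct.pack("!H", csum) + payload

def verify(segment: bytes) -> bool:
    """A segment carrying a correct checksum sums to zero."""
    return internet_checksum(segment) == 0

# dom0 sends with the checksum left uncomputed (the `offload' case).
inner = make_segment(b"hello from dom0", fill_checksum=False)
# The VPN client in domU wraps the inner packet as opaque bytes and
# computes a correct checksum for the outer packet only.
outer = make_segment(inner, fill_checksum=True)
# The remote tunnel endpoint accepts the outer packet, then unwraps it.
outer_ok = verify(outer)
inner_ok = verify(outer[2:])
print("outer valid:", outer_ok, "inner valid:", inner_ok)
```

The outer packet verifies and the inner one does not, which is the
symptom the eventual receiving system would see.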
I assume no-one has done that, or if they did they've noticed it
doesn't work and have tried something else. Given the still rather
high prevalence of weird and strange VPN endpoint programs, wanting to
encapsulate one in a domU isn't that silly an idea. Likewise with
IPv6-over-IPv4 tunnelling and other such things.
> > While Xen allows the frontend interface's `transmit checksum offload'
> > (ie, for packets leaving that guest) to be enabled and disabled from
> > userland, so that checksum calculation can be suppressed, it does not
> > allow the `receive checksum offload' (for packets entering the guest)
> > to be controlled, and it does not allow the backend's checksum
> > processing to be enabled and disabled (in 3.0.1, at least).
>
> Since I believe we only initiate for outgoing, suppressing
> the offload on the transmit on DomU should be enough to
> bypass this behaviour(?).
I don't understand the word `initiate' in this context. Do you mean
to refer to which endpoint initiates the traffic flow ? That doesn't
seem relevant and is in any case not even necessarily a meaningful
context in IP (the modern prevalence of NAT and stateful firewalling
notwithstanding).
Suppressing the offload on the transmit in domU is not sufficient. I
found that it was necessary to suppress the offload (ie, suppress the
`optimisation' away of the checksum calculation, ie actually calculate
the checksum) on the transmit in dom0, which can only be done with a
source code patch.
This must have been because the machinery for suppressing the checksum
_checking_ on the _receive_ in domU wasn't working. I haven't read
the frontend driver code but if the backend code ever works at all
there _must_ be some such suppression arrangements. It seems very
likely to me that these arrangements for suppressing receive checksum
checking will sometimes suppress the checksum inappropriately. After
all, the information needed to make a correct decision is not
available. In my case the checking was mistakenly not suppressed, so
the packets were rejected by the domU; but in another case the
checking might be mistakenly suppressed so that corrupted packets from
outside the physical host might be accepted unquestioned by a domU.
It seems quite possible to me that this bug does in fact exist in my
own setup and I can only hope that it doesn't bite me somehow with
corrupted data. (If I were more worried I'd patch the frontend driver
too to remove the offload feature.)
> Deferring the checksum to dom0 [Assumption = dom0 is where
> it reaches the physical hw] where it can be offloaded
> to the real hardware is not a bad idea - expected to be a
> non-trivial performance boost.
Yes, I can see that that might be useful. But it's very complicated:
If you want to do this I think you have to add a flag to the packet as
it crosses the domU<->dom0 interface which indicates whether the
checksum has been suppressed. This is because otherwise the kernel
with the actual hardware will not know to instruct the hardware to
compute and insert the checksum, since it will think that the checksum
is already correct.
There are three possibilities:
1. `Transmitter' has not calculated the checksum; the `receiver'
must do so if the packet is to leave via another interface
(or arrange that the onward interface offload does so).
2. Packet was received from another physical host by the virtual
interface `transmitter' and the `transmitter' (or the incoming
other interface offload) has already checked the checksum, so the
`receiver' need not do so; the `receiver' may assume that the
packet checksum is correct so that nothing special needs to be
done if the packet will leave via another interface.
3. Packet checksum is supposed to be valid but must be checked
by the `receiver'.
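The three cases can be written down as a per-packet flag plus a
decision function. This is only a sketch in Python; the names
CsumState and receiver_action are hypothetical and are not taken from
the Xen or Linux sources.

```python
from enum import Enum

class CsumState(Enum):
    NOT_COMPUTED = 1  # case 1: transmitter skipped the calculation
    VERIFIED = 2      # case 2: already checked on the way in
    UNVERIFIED = 3    # case 3: supposed to be valid, not yet checked

def receiver_action(state: CsumState, forwarded: bool) -> str:
    """What the `receiver' has to do with the packet's checksum.

    `forwarded' is True when the packet will leave via another
    interface rather than terminate on the receiving guest.
    """
    if state is CsumState.NOT_COMPUTED:
        # The checksum must be filled in before the packet goes
        # anywhere else (or the onward interface offload must do it).
        return "compute" if forwarded else "accept"
    if state is CsumState.VERIFIED:
        # Nothing special to do, whether terminating or forwarding.
        return "accept"
    # UNVERIFIED: the receiver has to check the checksum itself.
    return "verify"

for state in CsumState:
    print(state.name, receiver_action(state, forwarded=True))
```

The point of the sketch is that the right action depends on a value
which only the transmitting side knows, so it has to travel with the
packet.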
This information needs to be correctly propagated through the
in-kernel routing system - and arrangements need to be made for the
checksum to be checked/computed/recomputed if (eg) iptables rules need
values of checksum-covered fields, or modify them.
Note that in principle these considerations apply separately to each
checksum in the header: a UDP packet inside IPv4 inside an ethernet
frame has several checksums, some of which are transparently passed
through by (say) dom0 and some of which are checked and recomputed -
and the behaviour depends on whether the relay kernel (dom0, probably)
is acting as a switch, router, NAPT, or something even more horrid.
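As an illustration of the per-checksum point, here is a small Python
sketch of the plain-router case only. It assumes a 20-byte IPv4 header
without options, and the UDP datagram is stand-in bytes with a
placeholder checksum value; the helper names are invented for the
example. The router decrements the TTL and recomputes the IPv4 header
checksum, while the UDP checksum passes through untouched.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """Ones'-complement Internet checksum (RFC 1071) over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_ipv4_header(ttl: int, payload_len: int,
                      src: bytes, dst: bytes) -> bytes:
    """20-byte IPv4 header, no options, protocol 17 (UDP)."""
    hdr = struct.pack("!BBHHHBBH4s4s",
                      0x45, 0, 20 + payload_len, 0, 0, ttl, 17, 0, src, dst)
    csum = internet_checksum(hdr)  # computed with the field zeroed
    return hdr[:10] + struct.pack("!H", csum) + hdr[12:]

def header_valid(packet: bytes) -> bool:
    return internet_checksum(packet[:20]) == 0

def route(packet: bytes) -> bytes:
    """Plain router: decrement TTL, redo the header checksum only."""
    hdr, payload = bytearray(packet[:20]), packet[20:]
    hdr[8] -= 1                # TTL lives at byte offset 8
    hdr[10:12] = b"\x00\x00"   # zero the header checksum field
    hdr[10:12] = struct.pack("!H", internet_checksum(bytes(hdr)))
    return bytes(hdr) + payload  # transport checksum not examined

# Stand-in UDP datagram; 0xBEEF is a placeholder, not a real checksum.
udp = struct.pack("!HHHH", 5000, 53, 8 + 4, 0xBEEF) + b"data"
packet = build_ipv4_header(64, len(udp), bytes([10, 0, 0, 1]),
                           bytes([10, 0, 0, 2])) + udp
routed = route(packet)
print(header_valid(packet), header_valid(routed), routed[20:] == udp)
```

A NAPT would additionally have to touch the UDP checksum, because it
rewrites addresses and ports which that checksum covers, so the set of
checksums to be recomputed really does depend on what the relay is
doing.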
With knowledge of the topology, it might be possible to arrange that
these kinds of decisions don't need a flag to accompany the
dom0<->domU packet transmission, but that's not the hard part: the
hard part is threading the `must still calculate checksum on this'
note through the kernel's routing/bridging system so that it knows to
overwrite the correct subset of the checksums.
It is not safe to always overwrite the checksums unless they were
checked earlier, because that risks fixing up the checksum(s) on
already damaged packets.
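A short Python sketch of that hazard, again assuming a toy segment
layout (2-byte checksum field, then payload) and invented helper
names: a relay which blindly recomputes the checksum turns a
detectably damaged packet into one that verifies.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """Ones'-complement Internet checksum (RFC 1071) over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def make_segment(payload: bytes) -> bytes:
    """Toy segment (hypothetical layout): 2-byte checksum, then payload."""
    csum = internet_checksum(b"\x00\x00" + payload)
    return struct.pack("!H", csum) + payload

def verify(segment: bytes) -> bool:
    return internet_checksum(segment) == 0

original = make_segment(b"important data")

# The packet is damaged in transit: one payload bit flips.
damaged = bytearray(original)
damaged[2] ^= 0x01
damaged = bytes(damaged)
damage_detected = not verify(damaged)

# A relay overwrites the checksum without having checked it first.
laundered = make_segment(damaged[2:])
laundered_ok = verify(laundered)

print(damage_detected, laundered_ok, laundered[2:] == original[2:])
```

After the overwrite the final receiver has no way left to notice that
the payload differs from what was sent.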
> > * The default should have checksum suppression enabled.
>
> Agreed.
Oh dear, I meant `disabled'. That is, the checksums should be
calculated and checked `normally'.
Ian.