When I was preparing Xen's netback driver for upstream one of the things I removed was the zero-copy guest transmit (i.e. netback receive) support. In this mode guest data pages ("foreign pages") were mapped into the backend domain (using Xen grant-table functionality) and placed into the skb's paged frag list (skb_shinfo(skb)->frags, I hope I am using the right term). Once the page is finished with, netback unmaps it in order to return it to the guest (we really want to avoid returning such pages to the general allocation pool!). Unfortunately "page is finished with" is an event which there is no way for the driver to see[0], and therefore I replaced the grant-mapping with a grant-copy for upstreaming, which has performance and scalability implications (since the copy happens in, and therefore is accounted to, the backend domain instead of the frontend domain).

The root of the problem here is that the network stack manipulates the paged frags using bare get/put_page and therefore has no visibility into when a page reference count drops to zero, so there is no way to provide an interface for netback to know when it has to tear down the grant map.

I think this has implications for users other than Xen as well. For instance I have previously observed an issue where NFS can transmit bogus data onto the wire due to ACKs which arrive late and cross over with the queuing of a retransmit on the client side (see http://marc.info/?l=linux-nfs&m=122424132729720&w=2 which mainly discusses RPC protocol level retransmit, but I subsequently saw similar issues due to TCP retransmission too). The issue here is that an ACK from the server which is delayed in the network (but not dropped) can arrive after a retransmission has been queued. The arrival of this ACK causes the NFS client to complete the write back to userspace, but the same page is still referenced from the retransmitted skb. Therefore if userspace reuses the write buffer quickly enough then incorrect data can go out in the retransmission.

Ideally NFS (and I suspect any network filesystem or block device, e.g. iSCSI, could suffer from this sort of issue) would be able to wait to complete the write until the buffer was actually completely finished with. Someone also suggested that Infiniband might also have an interest in this sort of thing, although I must admit I don't know enough about IB to imagine why (perhaps it's just the same as the NFS/iSCSI cases).

We've previously looked into solutions using the skb destructor callback but that falls over if the skb is cloned since you also need to know when the clone is destroyed. Jeremy Fitzhardinge and I subsequently looked at the possibility of a no-clone skb flag (i.e. always forcing a copy instead of a clone) but IIRC honouring it universally turned into a very twisty maze with a number of nasty corner cases etc. It also seemed that the proportion of SKBs which get cloned at least once appeared as if it could be quite high, which would presumably make the performance impact unacceptable when using the flag. Another issue with using the skb destructor is that functions such as __pskb_pull_tail will eat (and free) pages from the start of the frag array such that by the time the skb destructor is called they are no longer there.

AIUI Rusty Russell had previously looked into a per-page destructor in the shinfo but found that it couldn't be made to work (I don't remember why, or if I even knew at the time). Could that be an approach worth reinvestigating?
I can't really think of any other solution which doesn't involve some sort of driver callback at the time a page is free()d.

I expect that wrapping the uses of get/put_page in a network specific wrapper (e.g. skb_{get,frag}_frag(skb, nr)) would be a useful first step in any solution. That's a pretty big task/patch in itself but could be done. Might it be worthwhile for its own sake?

Does anyone have any ideas or advice for other approaches I could try (either on the driver or stack side)?

FWIW I proposed a session on the subject for LPC this year. The proposal was for the virtualisation track, although as I say I think the class of problem reaches a bit wider than that. Whether the session will be a discussion around ways of solving the issue or a presentation on the solution remains to be seen ;-)

Ian.

[0] at least with a mainline kernel; in the older out-of-tree Xen stuff we had a PageForeign page-flag and a destructor function in a spare struct page field which was called from the mm free routines (free_pages_prepare and free_hot_cold_page). I'm under no illusions about the upstreamability of this approach...
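For illustration, a wrapper of the kind Ian suggests might look roughly like the following. The names and placement are invented for this sketch, and it assumes the layout skb_frag_t had at the time (a bare struct page * in its page member):

#include <linux/skbuff.h>
#include <linux/mm.h>

/*
 * Illustrative only: hypothetical helpers funnelling the bare
 * get_page()/put_page() calls on skb paged fragments through one place,
 * so that any future page-lifetime tracking only has to touch these.
 */
static inline struct page *skb_frag_page_ptr(struct sk_buff *skb, int i)
{
	return skb_shinfo(skb)->frags[i].page;
}

static inline void skb_frag_get(struct sk_buff *skb, int i)
{
	get_page(skb_frag_page_ptr(skb, i));
}

static inline void skb_frag_put(struct sk_buff *skb, int i)
{
	/* A driver-visible "page released" hook could later be added here. */
	put_page(skb_frag_page_ptr(skb, i));
}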
Jeremy Fitzhardinge
2011-Jun-24 17:29 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On 06/24/2011 08:43 AM, Ian Campbell wrote:
> We've previously looked into solutions using the skb destructor callback
> but that falls over if the skb is cloned since you also need to know
> when the clone is destroyed. [...]
>
> I can't really think of any other solution which doesn't involve some
> sort of driver callback at the time a page is free()d.

One simple approach would be to simply make sure that we retain a page reference on any granted pages so that the network stack's put pages will never result in them being released back to the kernel. We can also install an skb destructor. If it sees a page being released with a refcount of 1, then we know it's our own reference and can free the page immediately. If the refcount is > 1 then we can add it to a queue of pending pages, which can be periodically polled to free pages whose other references have been dropped.

However, the question is how large will this queue get? If it remains small then this scheme could be entirely practical. But if almost every page ends up having transient stray references, it could become very awkward.

So it comes down to "how useful is an skb destructor callback as a heuristic for page free"?

That said, I think an event-based rather than polling based mechanism would be much more preferable.

> I expect that wrapping the uses of get/put_page in a network specific
> wrapper (e.g. skb_{get,frag}_frag(skb, nr)) would be a useful first step
> in any solution. That's a pretty big task/patch in itself but could be
> done. Might it be worthwhile for its own sake?

Is there some way to do it so that you'd get compiler warnings/errors in missed cases? I guess wrapping "struct page" in some other type would go some way to helping.

> [0] at least with a mainline kernel; in the older out-of-tree Xen stuff
> we had a PageForeign page-flag and a destructor function in a spare
> struct page field which was called from the mm free routines
> (free_pages_prepare and free_hot_cold_page). I'm under no illusions
> about the upstreamability of this approach...

When I last asked AKPM about this - a long time ago - the problem was that we'd simply run out of page flags (at least on 32-bit x86), so it simply wasn't implementable. But since then the page flags have been rearranged and I think there's less pressure on them - but they're still a valuable resource, so the justification would need to be strong (ie, multiple convincing users).

    J
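As a concrete illustration of the scheme Jeremy sketches, something along these lines could sit in a netback-style driver. All names here are invented for this sketch, locking is omitted, and the actual grant unmap/return-to-guest step is elided:

#include <linux/mm.h>
#include <linux/list.h>

/* Hypothetical per-granted-page tracking (illustration only). */
struct granted_page {
	struct page *page;
	struct list_head pending;	/* on the pending list while busy */
};

static LIST_HEAD(pending_pages);	/* locking omitted for brevity */

/*
 * Called from the backend's skb destructor.  Because we took an extra
 * reference when granting, a count of 1 means the stack holds no other
 * references and the page can be unmapped and returned to the guest now;
 * otherwise park it for the poller.
 */
static void netback_maybe_release(struct granted_page *gp)
{
	if (page_count(gp->page) == 1) {
		/* grant unmap + return to guest would go here */
		put_page(gp->page);
	} else {
		list_add_tail(&gp->pending, &pending_pages);
	}
}

/* Periodic poll (e.g. from a timer): reap pages whose stray refs are gone. */
static void netback_poll_pending(void)
{
	struct granted_page *gp, *tmp;

	list_for_each_entry_safe(gp, tmp, &pending_pages, pending) {
		if (page_count(gp->page) == 1) {
			list_del(&gp->pending);
			put_page(gp->page);
		}
	}
}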
Eric Dumazet
2011-Jun-24 17:56 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Friday 24 June 2011 at 10:29 -0700, Jeremy Fitzhardinge wrote:
> On 06/24/2011 08:43 AM, Ian Campbell wrote:
>> We've previously looked into solutions using the skb destructor callback
>> but that falls over if the skb is cloned since you also need to know
>> when the clone is destroyed. [...]
>>
>> I can't really think of any other solution which doesn't involve some
>> sort of driver callback at the time a page is free()d.

This reminds me of the packet mmap (tx path) games we play with pages.

net/packet/af_packet.c : tpacket_destruct_skb(), poking TP_STATUS_AVAILABLE back to user to tell him he can reuse space...

> One simple approach would be to simply make sure that we retain a page
> reference on any granted pages so that the network stack's put pages
> will never result in them being released back to the kernel. [...]
>
> So it comes down to "how useful is an skb destructor callback as a
> heuristic for page free"?

Dangerous I would say. You could have a skb1 page transferred to another skb2, and call the skb1 destructor way before the page is released.

The TCP stack could do that in tcp_collapse() [ it currently doesn't play with pages ]
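For readers unfamiliar with the af_packet TX ring, the shape of the trick Eric mentions is roughly the following. This is not the real net/packet/af_packet.c code: the slot structure is invented, the association between skb and ring slot is assumed to travel via skb_shinfo(skb)->destructor_arg, and the real code additionally uses memory barriers and cache flushes when publishing the status:

#include <linux/skbuff.h>
#include <linux/if_packet.h>

/* Hypothetical TX ring slot header, assumed to live in memory mmap'ed by
 * userspace (illustration of the pattern only).
 */
struct tx_ring_slot {
	unsigned long status;	/* TP_STATUS_SEND_REQUEST, _AVAILABLE, ... */
};

static void ring_tx_destruct(struct sk_buff *skb)
{
	struct tx_ring_slot *slot = skb_shinfo(skb)->destructor_arg;

	/* Tell userspace it may reuse this slot's buffer.  Note this fires
	 * when *this* skb is freed, which is exactly the property being
	 * debated in this thread.
	 */
	slot->status = TP_STATUS_AVAILABLE;
}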
Jeremy Fitzhardinge
2011-Jun-24 18:21 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On 06/24/2011 10:56 AM, Eric Dumazet wrote:
> This reminds me of the packet mmap (tx path) games we play with pages.
>
> net/packet/af_packet.c : tpacket_destruct_skb(), poking
> TP_STATUS_AVAILABLE back to user to tell him he can reuse space...

Yes. It's similar in the sense that it's a tx from a page which isn't being handed over entirely to the network stack, but has some other longer-term lifetime.

> Dangerous I would say. You could have a skb1 page transferred to another
> skb2, and call the skb1 destructor way before the page is released.

Under what circumstances would that happen?

> The TCP stack could do that in tcp_collapse() [ it currently doesn't play
> with pages ]

Do you mean "dangerous" in the sense that many pages could end up being tied up in the pending-release queue? We'd always check the page refcount, so it should never release pages prematurely.

Thanks,
    J
David Miller
2011-Jun-24 19:46 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
From: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Fri, 24 Jun 2011 11:21:15 -0700

>> Dangerous I would say. You could have a skb1 page transferred to another
>> skb2, and call the skb1 destructor way before the page is released.
>
> Under what circumstances would that happen?

Pages get transferred between different SKBs all the time.

For example, GRO makes extensive use of this technique. See net/core/skbuff.c:skb_gro_receive().

It is just one example.
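A simplified sketch of the kind of frag transfer David is pointing at -- not the actual skb_gro_receive() body; bounds checks, frag_list handling and truesize accounting are all omitted -- is that the fragment descriptors, and with them the page references, are simply copied into the other skb's shinfo:

#include <linux/skbuff.h>
#include <linux/string.h>

/* Move all paged frags from @skb onto the end of @head's frag array.
 * No get_page()/put_page() happens: the references travel with the
 * descriptors, so nothing tied to the old skb will see these pages again.
 */
static void merge_frags(struct sk_buff *head, struct sk_buff *skb)
{
	struct skb_shared_info *to = skb_shinfo(head);
	struct skb_shared_info *from = skb_shinfo(skb);

	memcpy(to->frags + to->nr_frags, from->frags,
	       from->nr_frags * sizeof(skb_frag_t));
	to->nr_frags += from->nr_frags;
	from->nr_frags = 0;
}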
Jeremy Fitzhardinge
2011-Jun-24 20:11 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On 06/24/2011 12:46 PM, David Miller wrote:
> Pages get transferred between different SKBs all the time.
>
> For example, GRO makes extensive use of this technique.
> See net/core/skbuff.c:skb_gro_receive().
>
> It is just one example.

I see, and the new skb doesn't get a destructor copied from the original, so there'd be no second callback.

Are the pages still attached to the first skb, or are they transferred completely?

Thanks,
    J
David Miller
2011-Jun-24 20:27 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
From: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Fri, 24 Jun 2011 13:11:59 -0700

> Are the pages still attached to the first skb, or are they transferred
> completely?

In this case they are transferred completely, the refcount isn't modified at all.

There are cases where partial transfers occur, and in such cases the refcount is modified.

Just swim around in net/core/skbuff.c, you'll find them :-)
Ian Campbell
2011-Jun-24 22:44 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Fri, 2011-06-24 at 18:29 +0100, Jeremy Fitzhardinge wrote:
> One simple approach would be to simply make sure that we retain a page
> reference on any granted pages so that the network stack's put pages
> will never result in them being released back to the kernel. We can
> also install an skb destructor. If it sees a page being released with a
> refcount of 1, then we know it's our own reference and can free the page
> immediately. If the refcount is > 1 then we can add it to a queue of
> pending pages, which can be periodically polled to free pages whose
> other references have been dropped.

One problem with this is that some functions (__pskb_pull_tail) drop the ref count and then remove the page from the skb's fraglist. So by the time the destructor is called you have lost the page and cannot do the refcount checking.

I suppose we could keep a queue of _all_ pages we ever put in an SKB which we poll. We could still check for pages with count==1 in the destructor. Apart from the other issues with the destructor not being copied over clone etc, which would cause us to fall back to polling the queue more often than not, I reckon.

> That said, I think an event-based rather than polling based mechanism
> would be much more preferable.

Absolutely.

>> I expect that wrapping the uses of get/put_page in a network specific
>> wrapper (e.g. skb_{get,frag}_frag(skb, nr)) would be a useful first step
>> in any solution. [...]
>
> Is there some way to do it so that you'd get compiler warnings/errors in
> missed cases? I guess wrapping "struct page" in some other type would go
> some way to helping.

I was thinking it could be done by changing the field name (e.g. even just to _frag), add the wrapper and fixup everything grep could find, and then run an allBLAHconfig build, fix the compile errors, repeat.

Once the transition is complete we would have the option of putting the name back -- since it would only mean changing the wrapper. Although I don't know if we would necessarily want that, since otherwise new open-coded users will likely creep in.
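One way to get the compile-time coverage discussed above -- purely illustrative, not something either of them committed to -- is to hide the frag's bare struct page pointer behind a one-member wrapper type, so every open-coded get_page()/put_page() or direct dereference stops compiling until it is converted to the accessors:

#include <linux/types.h>
#include <linux/mm.h>

/* Sketch only: field layout and helper names invented for illustration. */
struct skb_frag_page {
	struct page *p;
};

typedef struct skb_frag_struct {
	struct skb_frag_page page;	/* was: struct page *page; */
	__u32 page_offset;
	__u32 size;
} skb_frag_t;

static inline struct page *skb_frag_page(const skb_frag_t *frag)
{
	return frag->page.p;
}

static inline void skb_frag_ref(skb_frag_t *frag)
{
	get_page(skb_frag_page(frag));
}

static inline void skb_frag_unref(skb_frag_t *frag)
{
	put_page(skb_frag_page(frag));
}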
Jeremy Fitzhardinge
2011-Jun-24 22:48 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On 06/24/2011 03:44 PM, Ian Campbell wrote:
> One problem with this is that some functions (__pskb_pull_tail) drop the
> ref count and then remove the page from the skb's fraglist. So by the
> time the destructor is called you have lost the page and cannot do the
> refcount checking.
>
> I suppose we could keep a queue of _all_ pages we ever put in an SKB
> which we poll.

Right, that seems like the only way to make sure we don't lose anything.

> We could still check for pages with count==1 in the
> destructor. Apart from the other issues with the destructor not being
> copied over clone etc, which would cause us to fall back to polling the
> queue more often than not, I reckon.

Yeah, sounds like it.

>> That said, I think an event-based rather than polling based mechanism
>> would be much more preferable.
> Absolutely.

From what Eric and David were saying, it seems we don't really have any other workable options.

    J
Ian Campbell
2011-Jun-25 11:58 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Fri, 2011-06-24 at 13:11 -0700, Jeremy Fitzhardinge wrote:
> On 06/24/2011 12:46 PM, David Miller wrote:
>> Pages get transferred between different SKBs all the time.
>>
>> For example, GRO makes extensive use of this technique.
>> See net/core/skbuff.c:skb_gro_receive().
>>
>> It is just one example.
>
> I see, and the new skb doesn't get a destructor copied from the
> original, so there'd be no second callback.

What about if we were to have a per-shinfo destructor (called once for each page as its refcount goes 1->0, from whichever skb ends up with the last ref) as well as the skb destructors?

This already handles the cloning case, but when pages are moved between shinfos then would it make sense for that to be propagated between skbs under these circumstances and/or require them to be the same? Since in the case of something like skb_gro_receive the skbs (and hence the frag array pages) are all from the same 'owner' (even if the skb is actually created by the stack on their behalf) I suspect this could work?

But I bet this assumption isn't valid in all cases.

In which case I end up wondering about a destructor per page in the frag array. At which point we might as well consider it as a part of the core mm stuff rather than something net specific?

Ian.
Michael S. Tsirkin
2011-Jun-26 10:25 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Fri, Jun 24, 2011 at 04:43:22PM +0100, Ian Campbell wrote:
> In this mode guest data pages ("foreign pages") were mapped into the
> backend domain (using Xen grant-table functionality) and placed into the
> skb's paged frag list (skb_shinfo(skb)->frags, I hope I am using the
> right term). Once the page is finished with, netback unmaps it in order
> to return it to the guest (we really want to avoid returning such pages
> to the general allocation pool!).

Are the pages writeable by the source guest while netback processes them? If yes, firewalling becomes unreliable as the packet can be modified after it's checked, right?

Also, for guest to guest communication, do you wait for the destination to stop looking at the packet in order to return it to the source? If yes, can source guest networking be disrupted by a slow destination?

> Jeremy Fitzhardinge and I subsequently
> looked at the possibility of a no-clone skb flag (i.e. always forcing a
> copy instead of a clone)

I think this is the approach that the patchset 'macvtap/vhost TX zero-copy support' takes.

> but IIRC honouring it universally turned into a
> very twisty maze with a number of nasty corner cases etc.

Any examples? Are they covered by the patchset above?

> FWIW I proposed a session on the subject for LPC this year.

We also plan to discuss this at KVM Forum 2011 (colocated with LinuxCon 2011). http://www.linux-kvm.org/page/KVM_Forum_2011

-- 
MST
Ian Campbell
2011-Jun-27 09:41 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Sun, 2011-06-26 at 11:25 +0100, Michael S. Tsirkin wrote:
> Are the pages writeable by the source guest while netback processes
> them? If yes, firewalling becomes unreliable as the packet can be
> modified after it's checked, right?

We only map the paged frags; the linear area is always copied (enough to cover maximally sized TCP/IP, including options), for this reason.

> Also, for guest to guest communication, do you wait for
> the destination to stop looking at the packet in order
> to return it to the source? If yes, can source guest
> networking be disrupted by a slow destination?

There is a timeout which ultimately does a copy into dom0 memory and frees up the domain grant for return to the sending guest.

>> Jeremy Fitzhardinge and I subsequently
>> looked at the possibility of a no-clone skb flag (i.e. always forcing a
>> copy instead of a clone)
>
> I think this is the approach that the patchset
> 'macvtap/vhost TX zero-copy support' takes.

That's TX from the guest's PoV, the same as I am looking at here, correct?

I should definitely check this work out, thanks for the pointer. Is V7 (http://marc.info/?l=linux-kernel&m=130661128431312&w=2) the most recent posting?

I suppose one difference with this is that it deals with data from "dom0" userspace buffers rather than (what looks like) kernel memory, although I don't know if that matters yet. Also it hangs off of struct sock, which netback doesn't have. Anyway I'll check it out.

>> but IIRC honouring it universally turned into a
>> very twisty maze with a number of nasty corner cases etc.
>
> Any examples? Are they covered by the patchset above?

It was quite a while ago so I don't remember many of the specifics. Jeremy might remember better, but for example any broadcast traffic hitting a bridge (a very interesting case for Xen) seems like a likely case? pcap was another one which I do remember, but that's obviously less critical.

I presume with the TX zero-copy support the "copying due to attempted clone" rate is low?

>> FWIW I proposed a session on the subject for LPC this year.
>
> We also plan to discuss this at KVM Forum 2011 (colocated with LinuxCon
> 2011). http://www.linux-kvm.org/page/KVM_Forum_2011

I had already considered coming to LinuxCon for other reasons but unfortunately I have family commitments around then :-(

Ian.
Michael S. Tsirkin
2011-Jun-27 10:21 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Mon, Jun 27, 2011 at 10:41:35AM +0100, Ian Campbell wrote:
>> Are the pages writeable by the source guest while netback processes
>> them? [...]
>
> We only map the paged frags; the linear area is always copied (enough to
> cover maximally sized TCP/IP, including options), for this reason.

Hmm. That'll cover the most common scenarios (such as port filtering) but not deep inspection. Not sure how important that is.

>> Also, for guest to guest communication, do you wait for the destination
>> to stop looking at the packet in order to return it to the source? [...]
>
> There is a timeout which ultimately does a copy into dom0 memory and
> frees up the domain grant for return to the sending guest.

Interesting. How long's the timeout?

>> I think this is the approach that the patchset
>> 'macvtap/vhost TX zero-copy support' takes.
>
> That's TX from the guest's PoV, the same as I am looking at here,
> correct?

Right.

> I should definitely check this work out, thanks for the pointer. Is V7
> (http://marc.info/?l=linux-kernel&m=130661128431312&w=2) the most recent
> posting?

I think so.

> I suppose one difference with this is that it deals with data from
> "dom0" userspace buffers rather than (what looks like) kernel memory,
> although I don't know if that matters yet. Also it hangs off of struct
> sock, which netback doesn't have. Anyway I'll check it out.

I think the most important detail is the copy on clone approach. We can make it controlled by an skb flag if necessary.

> It was quite a while ago so I don't remember many of the specifics.
> Jeremy might remember better, but for example any broadcast traffic
> hitting a bridge (a very interesting case for Xen) seems like a likely
> case? pcap was another one which I do remember, but that's obviously
> less critical.

Last I looked I thought these clone the skb, so if a copy happens on clone things will work correctly?

> I presume with the TX zero-copy support the "copying due to attempted
> clone" rate is low?

Yes. My understanding is that this version targets a non-bridged setup (guest connected to a macvlan on a physical dev) as the first step.

> I had already considered coming to LinuxCon for other reasons but
> unfortunately I have family commitments around then :-(

And I'm not coming to LPC this year :(

-- 
MST
Ian Campbell
2011-Jun-27 10:54 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Mon, 2011-06-27 at 11:21 +0100, Michael S. Tsirkin wrote:
> Hmm. That'll cover the most common scenarios (such as port filtering)
> but not deep inspection.

Right.

> Not sure how important that is.
>
>> There is a timeout which ultimately does a copy into dom0 memory and
>> frees up the domain grant for return to the sending guest.
>
> Interesting. How long's the timeout?

1 second IIRC.

> I think the most important detail is the copy on clone approach.
> We can make it controlled by an skb flag if necessary.
>
>> It was quite a while ago so I don't remember many of the specifics. [...]
>
> Last I looked I thought these clone the skb, so if a copy happens on
> clone things will work correctly?

Things should be correct, but won't necessarily perform well.

In particular if the clones (which become copies with this flag) are frequent enough then there is no advantage to doing mapping instead of just copying upfront; in fact it probably hurts overall.

Taking a quick look at the callers of skb_clone I also see skb_segment in there. Since Xen tries to pass around large skbs (using LRO/GSO over the PV interface) in order to amortise costs, it is quite common for things to undergo GSO as they hit the physical device. I'm not sure if these commonly hit the specific code path which causes a clone though.

> Yes. My understanding is that this version targets a non-bridged setup
> (guest connected to a macvlan on a physical dev) as the first step.

OK.

> And I'm not coming to LPC this year :(

That's a shame.

Ian.
Michael S. Tsirkin
2011-Jun-27 11:19 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Mon, Jun 27, 2011 at 11:54:02AM +0100, Ian Campbell wrote:
>> Interesting. How long's the timeout?
>
> 1 second IIRC.

I think that's unlikely to prevent networking disruption, only complete loss of networking.

>> Last I looked I thought these clone the skb, so if a copy happens on
>> clone things will work correctly?
>
> Things should be correct, but won't necessarily perform well.
>
> In particular if the clones (which become copies with this flag) are
> frequent enough then there is no advantage to doing mapping instead of
> just copying upfront; in fact it probably hurts overall.

True. Further, the CPU used up by the copy isn't accounted for in the appropriate cgroup.

> Taking a quick look at the callers of skb_clone I also see skb_segment
> in there. Since Xen tries to pass around large skbs (using LRO/GSO over
> the PV interface) in order to amortise costs, it is quite common for
> things to undergo GSO as they hit the physical device. I'm not sure if
> these commonly hit the specific code path which causes a clone though.

Probably not, I think this patchset was tested with GSO as well.

-- 
MST
Ian Campbell
2011-Jun-27 14:42 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Fri, 2011-06-24 at 18:56 +0100, Eric Dumazet wrote:
> This reminds me of the packet mmap (tx path) games we play with pages.
>
> net/packet/af_packet.c : tpacket_destruct_skb(), poking
> TP_STATUS_AVAILABLE back to user to tell him he can reuse space...

This is OK because af_packet involves no kernel-side protocol and hence there can be no retransmits etc? Otherwise you would suffer from the same sorts of issues as I described with NFS at [0].

However it seems like this might still have a problem if your SKBs are ever cloned. What happens in this case, e.g. if a user of AF_PACKET sends a broadcast via a device associated with a bridge[1] (where it would be flooded)? Wouldn't you get into the situation where the destructor of the initial skb is called before one or more of the clones going to a different destination were sent? So you would send TP_STATUS_AVAILABLE to userspace before the stack was really finished with the page, and run the risk of userspace reusing the buffer, leading to incorrect bytes going to some destinations.

It looks to me that anything which does any zero-copy type thing into the network stack will have problems with one or both of protocol retransmit or SKB clone. There's simply no mechanism to know when the stack is really finished with a page.

Ian.

[0] http://marc.info/?l=linux-nfs&m=122424132729720&w=2
[1] Since dhcp clients typically use AF_PACKET and you typically put the IP address on the bridge itself in these configurations, this won't be that unusual.
Jeremy Fitzhardinge
2011-Jun-27 20:51 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On 06/25/2011 12:58 PM, Ian Campbell wrote:
> What about if we were to have a per-shinfo destructor (called once for
> each page as its refcount goes 1->0, from whichever skb ends up with the
> last ref) as well as the skb destructors?

We never want the refcount for granted pages to go from 1 -> 0. The safest thing is to make sure we always elevate the refcount to make sure that nothing else can ever drop the last ref.

If we can trust the network stack to always do the last release (and not hand it off to something else to do it), then we could have a destructor which gets called before the last ref drop (or leaves the ref drop to the destructor to do), and do everything required that way. But it seems pretty fragile. At the very least it would need a thorough code audit to make sure that everything handles page lifetimes in the expected way - but then I'd still worry about out-of-tree patches breaking something in subtle ways.

> This already handles the cloning case, but when pages are moved between
> shinfos then would it make sense for that to be propagated between skbs
> under these circumstances and/or require them to be the same? Since in
> the case of something like skb_gro_receive the skbs (and hence the frag
> array pages) are all from the same 'owner' (even if the skb is actually
> created by the stack on their behalf) I suspect this could work?
>
> But I bet this assumption isn't valid in all cases.

Hm.

> In which case I end up wondering about a destructor per page in the frag
> array. At which point we might as well consider it as a part of the core
> mm stuff rather than something net specific?

Doing it generically still needs some kind of marker that the page has a special-case destructor (and the destructor pointer itself).

    J
David Miller
2011-Jun-27 22:49 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
From: Ian Campbell <Ian.Campbell@eu.citrix.com>
Date: Mon, 27 Jun 2011 15:42:04 +0100

> However it seems like this might still have a problem if your SKBs are
> ever cloned. What happens in this case, e.g. if a user of AF_PACKET
> sends a broadcast via a device associated with a bridge[1] (where it
> would be flooded)?

You don't need a bridge to get a clone on transmit, the packet scheduler can do clones. Just grep for skb_clone in the packet action handlers net/sched/act_*.c
Ian Campbell
2011-Jun-28 10:24 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Mon, 2011-06-27 at 23:49 +0100, David Miller wrote:
> You don't need a bridge to get a clone on transmit, the packet
> scheduler can do clones. Just grep for skb_clone in the packet
> action handlers net/sched/act_*.c

Are you sure? I only see skb_cloned() and skb_clone_writeable() under there (3.0-rc4) and not any actual skb_clone()s. The only actual clone I see under there is in net/sched/sch_netem.c.

However, it sounds like it is expected that a clone can happen on pretty much any skb, which makes the frag lifecycle issue seem like one which could affect anything which sends a page to the network without relinquishing complete control of it (common in any kind of zero-copy scenario).

Ian.
Ian Campbell
2011-Jun-28 10:25 UTC
[Xen-devel] Re: SKB paged fragment lifecycle on receive
On Mon, 2011-06-27 at 21:51 +0100, Jeremy Fitzhardinge wrote:
> We never want the refcount for granted pages to go from 1 -> 0. The
> safest thing is to make sure we always elevate the refcount to make sure
> that nothing else can ever drop the last ref.

I guess I meant called just after the refcount goes 2->1 or something.

_But_... thinking about it some more, I don't think this scheme works in the general case because the entity injecting into the network stack may not be the only reference count holder. Consider the NFS case -- in that case there is already a reference because the page belongs to userspace. You certainly don't want to wait for the process to exit before considering the page done with. You may also have multiple reference counts due to multiple threads doing I/O on (overlapping or not) portions of the same page etc.

So now we're into the realms of a per-frag reference count and destructor in the shinfo, which reintroduces the whole question of what happens if a page is copied/moved to another shinfo. Could we add a per-frag pointer in the shinfo pointing to:

	struct {
		atomic_t ref;
		void (*destructor)(struct skb, int fragnr, /*TBD*/);
		void *data; /* user data, maybe */
	};

This data structure is owned by whoever injected the page into the stack, and the pointer to it is propagated everywhere that page goes, including across skb splits and pages moving or copied between skbs etc etc.

The split between ref counting on this and struct page needs a bit of careful thought, but perhaps it's: IFF this struct exists, the stack will hold a single struct page refcount and use this new refcount for everything else, dropping the struct page refcount when the frag refcount hits 0. Otherwise it simply uses the struct page refcount as normal.

> If we can trust the network stack to always do the last release (and not
> hand it off to something else to do it), then we could have a destructor
> which gets called before the last ref drop (or leaves the ref drop to
> the destructor to do), and do everything required that way. But it seems
> pretty fragile. [...]
>
>> This already handles the cloning case [...] Since in the case of
>> something like skb_gro_receive the skbs (and hence the frag array
>> pages) are all from the same 'owner' (even if the skb is actually
>> created by the stack on their behalf) I suspect this could work?
>>
>> But I bet this assumption isn't valid in all cases.
>
> Hm.

e.g. if you have some sort of tunnel protocol which is encapsulating multiple skb streams into a single one, might you get an skb with pages from multiple sources?

>> In which case I end up wondering about a destructor per page in the frag
>> array. At which point we might as well consider it as a part of the core
>> mm stuff rather than something net specific?
>
> Doing it generically still needs some kind of marker that the page has a
> special-case destructor (and the destructor pointer itself).

True. You only really need the destructor pointer (which can be NULL), but yes, it would add stuff to struct page which may not be useful elsewhere.

Ian.
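A sketch of how the structure Ian describes might be declared and consumed. The destructor signature is simplified relative to his /*TBD*/ version, the helper names are invented, and the key part -- having the stack call the put at every place it currently does a bare put_page() on a frag -- is exactly the missing interface this thread is about, so nothing here is existing API:

#include <linux/atomic.h>

/* Per-frag lifetime tracking, owned by whoever injected the page. */
struct skb_frag_info {
	atomic_t ref;				/* stack-side references to this frag */
	void (*destructor)(struct skb_frag_info *fi);
	void *data;				/* owner-private, e.g. a grant handle */
};

static inline void skb_frag_info_get(struct skb_frag_info *fi)
{
	atomic_inc(&fi->ref);
}

static inline void skb_frag_info_put(struct skb_frag_info *fi)
{
	/* Last reference gone: e.g. netback would grant-unmap here and
	 * return the page to the guest.
	 */
	if (atomic_dec_and_test(&fi->ref))
		fi->destructor(fi);
}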