I think this is v4, but I've sort of lost count; sorry that it's taken me so long to get back to this stuff.

The following series makes use of the skb fragment API (which is in 3.2+) to add a per-paged-fragment destructor callback. This can be used by creators of skbs who are interested in the lifecycle of the pages included in that skb after they have handed it off to the network stack.

The mail at [0] contains some more background and rationale, but basically the completed series will allow entities which inject pages into the networking stack to receive a notification when the stack has really finished with those pages (i.e. including retransmissions, clones, pull-ups etc.) and not just when the original skb is finished with. This is beneficial to the many subsystems which wish to inject pages into the network stack without giving up full ownership of those pages' lifecycle. It implements something broadly along the lines of what was described in [1].

I have also included a patch to the RPC subsystem which uses this API to fix the bug which I describe at [2].

I've also had some interest from David VemLehn and Bart Van Assche regarding using this functionality in the context of vmsplice and iSCSI targets respectively (I think).

Changes since last time:

 * Added an skb_orphan_frags API for the use of recipients of SKBs who
   may hold onto the SKB for a long time (this is analogous to
   skb_orphan). This was pointed out by Michael. The TUN driver is
   currently the only user.
 * I can't for the life of me get anything to actually hit this code
   path. I've been trying with an NFS server running in a Xen HVM
   domain with emulated (e.g. tap) networking and a client in domain 0,
   using the NFS fix in this series which generates SKBs with
   destructors set; so far -- nothing. I suspect that lack of TSO/GSO
   etc. on the TAP interface is causing the frags to be copied to
   normal pages during skb_segment().
 * Various fixups related to the change of alignment/padding in shinfo,
   in particular to build_skb as pointed out by Eric.
 * Tweaked the ordering of shinfo members to ensure that all hotpath
   variables up to and including the first frag fit within (and are
   aligned to) a single 64-byte cache line. (Eric again)

I ran a monothread UDP benchmark (similar to that described by Eric in e52fcb2462ac) and don't see any difference in pps throughput; it was ~810,000 pps both before and after.

Cheers,
Ian.

[0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
[1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
[2] http://marc.info/?l=linux-nfs&m=122424132729720&w=2
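[Editor's note: the shape of the proposed API can be sketched as below. This is an illustrative kernel-style sketch based on the series' description, not the exact patches -- the struct layout and the `skb_frag_set_destructor()` / `my_frag_done()` names are assumptions, and the snippet is not runnable standalone.]

```c
/* Sketch only: a page injector attaches a refcounted destructor to a
 * paged fragment; the stack drops the reference only once every clone,
 * retransmission, pull-up etc. is done with the page.
 */
struct skb_frag_destructor {
	atomic_t ref;
	int (*destroy)(struct skb_frag_destructor *d);
};

static int my_frag_done(struct skb_frag_destructor *d)
{
	/* The network stack has truly finished with the page(s):
	 * now it is safe to recycle the page or complete the
	 * originating I/O. (Hypothetical completion context.) */
	return 0;
}

/* When building the skb: */
skb_fill_page_desc(skb, 0, page, offset, size);
/* Hypothetical setter added by this series: */
skb_frag_set_destructor(skb, 0, &ctx->destructor);
```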
Michael S. Tsirkin
2012-Apr-10 14:58 UTC
Re: [PATCH v4 0/10] skb paged fragment destructors
On Tue, Apr 10, 2012 at 03:26:05PM +0100, Ian Campbell wrote:
> * I can't for the life of me get anything to actually hit
>   this code path. I've been trying with an NFS server
>   running in a Xen HVM domain with emulated (e.g. tap)
>   networking and a client in domain 0, using the NFS fix
>   in this series which generates SKBs with destructors
>   set, so far -- nothing. I suspect that lack of TSO/GSO
>   etc on the TAP interface is causing the frags to be
>   copied to normal pages during skb_segment().

To enable gso you need to call TUNSETOFFLOAD.

-- 
MST
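[Editor's note: for reference, enabling offloads on a tap device from userspace looks roughly like the sketch below. It needs CAP_NET_ADMIN, and TUNSETOFFLOAD is only accepted on devices created with IFF_VNET_HDR; the device name "tap0" is illustrative.]

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(void)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);
	if (fd < 0) { perror("open /dev/net/tun"); return 1; }

	memset(&ifr, 0, sizeof(ifr));
	/* Offloads require the virtio-net header on packets. */
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
	strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);
	if (ioctl(fd, TUNSETIFF, &ifr) < 0) { perror("TUNSETIFF"); return 1; }

	/* Declare that the consumer accepts checksum-offloaded and
	 * GSO packets, so the stack need not call skb_segment(). */
	if (ioctl(fd, TUNSETOFFLOAD,
		  TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6) < 0) {
		perror("TUNSETOFFLOAD");
		return 1;
	}
	return 0;
}
```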
Michael S. Tsirkin
2012-Apr-10 15:00 UTC
Re: [PATCH v4 0/10] skb paged fragment destructors
On Tue, Apr 10, 2012 at 03:26:05PM +0100, Ian Campbell wrote:
> I think this is v4, but I've sort of lost count, sorry that it's taken
> me so long to get back to this stuff.
[...]
> * I can't for the life of me get anything to actually hit
>   this code path. I've been trying with an NFS server
>   running in a Xen HVM domain with emulated (e.g. tap)
>   networking and a client in domain 0, using the NFS fix
>   in this series which generates SKBs with destructors
>   set, so far -- nothing. I suspect that lack of TSO/GSO
>   etc on the TAP interface is causing the frags to be
>   copied to normal pages during skb_segment().

Will take a look tomorrow, thanks!
On 04/10/12 14:26, Ian Campbell wrote:
> I think this is v4, but I've sort of lost count, sorry that it's taken
> me so long to get back to this stuff.
>
> The following series makes use of the skb fragment API (which is in
> 3.2+) to add a per-paged-fragment destructor callback. This can be
> used by creators of skbs who are interested in the lifecycle of the
> pages included in that skb after they have handed it off to the
> network stack.

Hello Ian,

Great to see v4 of this patch series. But which kernel version has this
patch series been based on? I've tried to apply this series on 3.4-rc2
but apparently applying patch 09/10 failed:

patching file net/ceph/messenger.c
Hunk #1 FAILED at 851.
1 out of 1 hunk FAILED -- saving rejects to file net/ceph/messenger.c.rej

Regards,
Bart.
On Tue, 2012-04-10 at 16:46 +0100, Bart Van Assche wrote:
> On 04/10/12 14:26, Ian Campbell wrote:
> > I think this is v4, but I've sort of lost count, sorry that it's
> > taken me so long to get back to this stuff.
[...]
> Great to see v4 of this patch series. But which kernel version has this
> patch series been based on? I've tried to apply this series on 3.4-rc2

It's based on net-next/master. Specifically commit de8856d2c11f.

Ian.
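[Editor's note: for anyone following along, checking out the stated base before applying the series would look something like the sketch below. The net-next remote URL reflects the usual location at the time and the mbox filename is illustrative.]

```shell
# Fetch net-next and branch at the commit the series is based on.
git remote add net-next \
    git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
git fetch net-next
git checkout -b skb-frag-destructors de8856d2c11f

# Apply the posted series, saved as an mbox from the list.
git am skb-paged-fragment-destructors-v4.mbox
```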
On 04/10/12 15:50, Ian Campbell wrote:
> On Tue, 2012-04-10 at 16:46 +0100, Bart Van Assche wrote:
> > Great to see v4 of this patch series. But which kernel version has
> > this patch series been based on? I've tried to apply this series on
> > 3.4-rc2
>
> It's based on net-next/master. Specifically commit de8856d2c11f.

Thanks, that information allowed me to apply the patch series and to
test it with kernel 3.4-rc2 and iSCSI target code. The test ran fine.

The failure to apply this patch series on 3.4-rc2 I had reported turned
out to be an easy-to-resolve merge conflict:

+ static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
+                              int offset, size_t size, int more)
+ {
+     int flags = MSG_DONTWAIT | MSG_NOSIGNAL | (more ? MSG_MORE : MSG_EOR);
+     int ret;
+
-     ret = kernel_sendpage(sock, page, offset, size, flags);
++    ret = kernel_sendpage(sock, page, NULL, offset, size, flags);
+     if (ret == -EAGAIN)
+         ret = 0;
+
+     return ret;
+ }
+

Bart.
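[Editor's note: the conflict arises because the series threads a per-fragment destructor argument through the sendpage path. The prototype below is inferred from the resolved hunk, not quoted from the patches; callers that do not track page lifetimes simply pass NULL.]

```c
/* Inferred signature: the series inserts a destructor parameter into
 * kernel_sendpage(). Passing NULL preserves the old behaviour where
 * the page is only reference-counted, with no lifetime notification.
 */
int kernel_sendpage(struct socket *sock, struct page *page,
		    struct skb_frag_destructor *destructor,
		    int offset, size_t size, int flags);

/* Existing caller, semantics unchanged: */
ret = kernel_sendpage(sock, page, NULL, offset, size, flags);
```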