Various drivers do skb_orphan() because they do not free transmitted skbs in a timely manner (causing apps which hit their socket limits to stall, socket close to hang, etc.). DaveM points out that there are advantages to doing it generally (it's more likely to be on same CPU than after xmit), and I couldn't find any new starvation issues in simple benchmarking here. This patch adds skb_orphan to the start of dev_hard_start_xmit(): it can be premature in the NETDEV_TX_BUSY case, but that's uncommon. I removed the drivers' skb_orphan(), though it's harmless. Signed-off-by: Rusty Russell <rusty at rustcorp.com.au> Cc: Divy Le Ray <divy at chelsio.com> Cc: Roland Dreier <rolandd at cisco.com> Cc: Pavel Emelianov <xemul at openvz.org> Cc: Dan Williams <dcbw at redhat.com> Cc: libertas-dev at lists.infradead.org --- drivers/net/cxgb3/sge.c | 27 --------------------------- drivers/net/loopback.c | 2 -- drivers/net/mlx4/en_tx.c | 4 ---- drivers/net/niu.c | 3 +-- drivers/net/veth.c | 2 -- drivers/net/wireless/libertas/tx.c | 3 --- net/core/dev.c | 21 +++++---------------- 7 files changed, 6 insertions(+), 56 deletions(-) diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c --- a/drivers/net/cxgb3/sge.c +++ b/drivers/net/cxgb3/sge.c @@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str dev->trans_start = jiffies; spin_unlock(&q->lock); - /* - * We do not use Tx completion interrupts to free DMAd Tx packets. - * This is good for performamce but means that we rely on new Tx - * packets arriving to run the destructors of completed packets, - * which open up space in their sockets' send queues. Sometimes - * we do not get such new packets causing Tx to stall. A single - * UDP transmitter is a good example of this situation. We have - * a clean up timer that periodically reclaims completed packets - * but it doesn't run often enough (nor do we want it to) to prevent - * lengthy stalls. A solution to this problem is to run the - * destructor early, after the packet is queued but before it's DMAd. - * A cons is that we lie to socket memory accounting, but the amount - * of extra memory is reasonable (limited by the number of Tx - * descriptors), the packets do actually get freed quickly by new - * packets almost always, and for protocols like TCP that wait for - * acks to really free up the data the extra memory is even less. - * On the positive side we run the destructors on the sending CPU - * rather than on a potentially different completing CPU, usually a - * good thing. We also run them without holding our Tx queue lock, - * unlike what reclaim_completed_tx() would otherwise do. - * - * Run the destructor before telling the DMA engine about the packet - * to make sure it doesn't complete and get freed prematurely. - */ - if (likely(!skb_shared(skb))) - skb_orphan(skb); - write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl); check_ring_tx_db(adap, q); return NETDEV_TX_OK; diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff { struct pcpu_lstats *pcpu_lstats, *lb_stats; - skb_orphan(skb); - skb->protocol = eth_type_trans(skb,dev); /* it's OK to use per_cpu_ptr() because BHs are off */ diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c --- a/drivers/net/mlx4/en_tx.c +++ b/drivers/net/mlx4/en_tx.c @@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf) tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size); - /* Run destructor before passing skb to HW */ - if (likely(!skb_shared(skb))) - skb_orphan(skb); - /* Ensure new descirptor hits memory * before setting ownership of this descriptor to HW */ wmb(); diff --git a/drivers/net/niu.c b/drivers/net/niu.c --- a/drivers/net/niu.c +++ b/drivers/net/niu.c @@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff } kfree_skb(skb); skb = skb_new; - } else - skb_orphan(skb); + } align = ((unsigned long) skb->data & (16 - 1)); headroom = align + sizeof(struct tx_pkt_hdr); diff --git a/drivers/net/veth.c b/drivers/net/veth.c --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb struct veth_net_stats *stats, *rcv_stats; int length, cpu; - skb_orphan(skb); - priv = netdev_priv(dev); rcv = priv->peer; rcv_priv = netdev_priv(rcv); diff --git a/drivers/net/wireless/libertas/tx.c b/drivers/net/wireless/libertas/tx.c --- a/drivers/net/wireless/libertas/tx.c +++ b/drivers/net/wireless/libertas/tx.c @@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff * if (priv->monitormode) { /* Keep the skb to echo it back once Tx feedback is received from FW */ - skb_orphan(skb); - - /* Keep the skb around for when we get feedback */ priv->currenttxskb = skb; } else { free: diff --git a/net/core/dev.c b/net/core/dev.c --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff * const struct net_device_ops *ops = dev->netdev_ops; int rc; + /* Call destructor here. It's SMP-cache-friendly and avoids issues + * with drivers which can hold transmitted skbs for long times */ + skb_orphan(skb); + if (likely(!skb->next)) { if (!list_empty(&ptype_all)) dev_queue_xmit_nit(skb, dev); @@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff * goto gso; } - rc = ops->ndo_start_xmit(skb, dev); - /* - * TODO: if skb_orphan() was called by - * dev->hard_start_xmit() (for example, the unmodified - * igb driver does that; bnx2 doesn't), then - * skb_tx_software_timestamp() will be unable to send - * back the time stamp. - * - * How can this be prevented? Always create another - * reference to the socket before calling - * dev->hard_start_xmit()? Prevent that skb_orphan() - * does anything in dev->hard_start_xmit() by clearing - * the skb destructor before the call and restoring it - * afterwards, then doing the skb_orphan() ourselves? - */ - return rc; + return ops->ndo_start_xmit(skb, dev); } gso:
Rusty Russell a ?crit :> Various drivers do skb_orphan() because they do not free transmitted > skbs in a timely manner (causing apps which hit their socket limits to > stall, socket close to hang, etc.). > > DaveM points out that there are advantages to doing it generally (it's > more likely to be on same CPU than after xmit), and I couldn't find > any new starvation issues in simple benchmarking here. > >If really no starvations are possible at all, I really wonder why some guys added memory accounting to UDP flows. Maybe they dont run "simple benchmarks" but real apps ? :) For TCP, I agree your patch is a huge benefit, since its paced by remote ACKS and window control, but an UDP sender will likely be able to saturate a link. Maybe we can try to selectively call skb_orphan() only for tcp packets ? I understand your motivations are the driver simplifications, so you want all packets being orphaned... hmm... This patch adds skb_orphan to the start of dev_hard_start_xmit(): it> can be premature in the NETDEV_TX_BUSY case, but that's uncommon. > > I removed the drivers' skb_orphan(), though it's harmless. > > Signed-off-by: Rusty Russell <rusty at rustcorp.com.au> > Cc: Divy Le Ray <divy at chelsio.com> > Cc: Roland Dreier <rolandd at cisco.com> > Cc: Pavel Emelianov <xemul at openvz.org> > Cc: Dan Williams <dcbw at redhat.com> > Cc: libertas-dev at lists.infradead.org > --- > drivers/net/cxgb3/sge.c | 27 --------------------------- > drivers/net/loopback.c | 2 -- > drivers/net/mlx4/en_tx.c | 4 ---- > drivers/net/niu.c | 3 +-- > drivers/net/veth.c | 2 -- > drivers/net/wireless/libertas/tx.c | 3 --- > net/core/dev.c | 21 +++++---------------- > 7 files changed, 6 insertions(+), 56 deletions(-) > > diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c > --- a/drivers/net/cxgb3/sge.c > +++ b/drivers/net/cxgb3/sge.c > @@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str > dev->trans_start = jiffies; > spin_unlock(&q->lock); > > - /* > - * We do not use Tx completion interrupts to free DMAd Tx packets. > - * This is good for performamce but means that we rely on new Tx > - * packets arriving to run the destructors of completed packets, > - * which open up space in their sockets' send queues. Sometimes > - * we do not get such new packets causing Tx to stall. A single > - * UDP transmitter is a good example of this situation. We have > - * a clean up timer that periodically reclaims completed packets > - * but it doesn't run often enough (nor do we want it to) to prevent > - * lengthy stalls. A solution to this problem is to run the > - * destructor early, after the packet is queued but before it's DMAd. > - * A cons is that we lie to socket memory accounting, but the amount > - * of extra memory is reasonable (limited by the number of Tx > - * descriptors), the packets do actually get freed quickly by new > - * packets almost always, and for protocols like TCP that wait for > - * acks to really free up the data the extra memory is even less. > - * On the positive side we run the destructors on the sending CPU > - * rather than on a potentially different completing CPU, usually a > - * good thing. We also run them without holding our Tx queue lock, > - * unlike what reclaim_completed_tx() would otherwise do. > - * > - * Run the destructor before telling the DMA engine about the packet > - * to make sure it doesn't complete and get freed prematurely. > - */ > - if (likely(!skb_shared(skb))) > - skb_orphan(skb); > - > write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl); > check_ring_tx_db(adap, q); > return NETDEV_TX_OK; > diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c > --- a/drivers/net/loopback.c > +++ b/drivers/net/loopback.c > @@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff > { > struct pcpu_lstats *pcpu_lstats, *lb_stats; > > - skb_orphan(skb); > - > skb->protocol = eth_type_trans(skb,dev); > > /* it's OK to use per_cpu_ptr() because BHs are off */ > diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c > --- a/drivers/net/mlx4/en_tx.c > +++ b/drivers/net/mlx4/en_tx.c > @@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st > if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf) > tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size); > > - /* Run destructor before passing skb to HW */ > - if (likely(!skb_shared(skb))) > - skb_orphan(skb); > - > /* Ensure new descirptor hits memory > * before setting ownership of this descriptor to HW */ > wmb(); > diff --git a/drivers/net/niu.c b/drivers/net/niu.c > --- a/drivers/net/niu.c > +++ b/drivers/net/niu.c > @@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff > } > kfree_skb(skb); > skb = skb_new; > - } else > - skb_orphan(skb); > + } > > align = ((unsigned long) skb->data & (16 - 1)); > headroom = align + sizeof(struct tx_pkt_hdr); > diff --git a/drivers/net/veth.c b/drivers/net/veth.c > --- a/drivers/net/veth.c > +++ b/drivers/net/veth.c > @@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb > struct veth_net_stats *stats, *rcv_stats; > int length, cpu; > > - skb_orphan(skb); > - > priv = netdev_priv(dev); > rcv = priv->peer; > rcv_priv = netdev_priv(rcv); > diff --git a/drivers/net/wireless/libertas/tx.c > b/drivers/net/wireless/libertas/tx.c > --- a/drivers/net/wireless/libertas/tx.c > +++ b/drivers/net/wireless/libertas/tx.c > @@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff * > if (priv->monitormode) { > /* Keep the skb to echo it back once Tx feedback is > received from FW */ > - skb_orphan(skb); > - > - /* Keep the skb around for when we get feedback */ > priv->currenttxskb = skb; > } else { > free: > diff --git a/net/core/dev.c b/net/core/dev.c > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff * > const struct net_device_ops *ops = dev->netdev_ops; > int rc; > > + /* Call destructor here. It's SMP-cache-friendly and avoids issues > + * with drivers which can hold transmitted skbs for long times */ > + skb_orphan(skb); > + > if (likely(!skb->next)) { > if (!list_empty(&ptype_all)) > dev_queue_xmit_nit(skb, dev); > @@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff * > goto gso; > } > > - rc = ops->ndo_start_xmit(skb, dev); > - /* > - * TODO: if skb_orphan() was called by > - * dev->hard_start_xmit() (for example, the unmodified > - * igb driver does that; bnx2 doesn't), then > - * skb_tx_software_timestamp() will be unable to send > - * back the time stamp. > - * > - * How can this be prevented? Always create another > - * reference to the socket before calling > - * dev->hard_start_xmit()? Prevent that skb_orphan() > - * does anything in dev->hard_start_xmit() by clearing > - * the skb destructor before the call and restoring it > - * afterwards, then doing the skb_orphan() ourselves? > - */ > - return rc; > + return ops->ndo_start_xmit(skb, dev); > } > > gso: > >
On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it > can be premature in the NETDEV_TX_BUSY case, but that's uncommon.Would it be possible to make the new skb_orphan() at the start of dev_hard_start_xmit() conditionally so that it is not executed for packets that are to be time stamped? As discussed before (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk socket pointer is required for sending back the send time stamp from inside the device driver. Calling skb_orphan() unconditionally as in this patch would break the hardware time stamping of outgoing packets. This should work without much overhead (skb_tx() expands to a lookup of the skb_shared_info): if (!skb_tx(skb)->flags) skb_orphan(skb); Instead of looking at the individual skb_shared_tx "hardware" and "software" bits I'm just checking the whole byte here because it is shorter and perhaps faster. Semantically the result is the same. -- Best Regards, Patrick Ohly The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter.
From: Patrick Ohly <patrick.ohly at intel.com> Date: Mon, 01 Jun 2009 21:47:22 +0200> On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote: >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon. > > Would it be possible to make the new skb_orphan() at the start of > dev_hard_start_xmit() conditionally so that it is not executed for > packets that are to be time stamped? > > As discussed before > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk > socket pointer is required for sending back the send time stamp from > inside the device driver. Calling skb_orphan() unconditionally as in > this patch would break the hardware time stamping of outgoing packets.Indeed, we need to check that case, at a minimum. And there are other potentially other problems. For example, I wonder how this interacts with the new TX MMAP af_packet support in net-next-2.6 :-/
Possibly Parallel Threads
- [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
- [Bridge] [BUG/PATCH/RFC] bridge: locally generated broadcast traffic may block sender
- Need help configuring wireless NIC
- CEEA-2014:1269 CentOS 7 linux-firmware Enhancement Update
- [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt