David Lamparter
2017-Aug-17 11:03 UTC
[Bridge] [PATCH 1/6] bridge: learn dst metadata in FDB
Hi Nikolay, On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:> On 16/08/17 20:01, David Lamparter wrote: > > This implements holding dst metadata information in the bridge layer, > > but only for unicast entries in the MAC table. Multicast is still left > > to design and implement. > > Sorry but I do not agree with this change, adding a special case for > VPLS in the bridge codeI don't think this is specific to VPLS at all, though you're right that VPLS is the only user currently.> and hitting the fast path for everyone in a few different places for a > feature that the majority will not use does not sound acceptable to > me. We've been trying hard to optimize it, trying to avoid additional > cache lines, removing tests and keeping special cases to a minimum.skb->dst is on the same cacheline as skb->len. fdb->md_dst is on the same cacheline as fdb->dst. Both will be 0 in a lot of cases, so this should be two null checks on data that is hot in the cache. Are you sure this is an actual problem?> I understand that you want to use the fdb tables and avoid > duplication, but this is not worth it. There're other similar use > cases and they have their own private fdb tables, that way the user > can opt out and is much cleaner and separated.Sure, this can be done. I think it's a noticeable performance penalty to have the entire fdb copied (multiple times for H-VPLS even), but I understand that it's preferable to have the normal cases faster in exchange. As the previous paragraph notes, I still wonder if that hit to the normal case exists though. I will leave this to Amine, he's paid to work on VPLS while I'm doing this for fun ;) There is however another concern I have here. As I noted in my introductory mail, I'm working on the bridge MDB making similar changes. And there's actually strong use cases for this in both VPLS and the 802.11 code (though I'm not sure I can code the latter one up, it's related to rate control and this is seriously complicated - the goal is to select a multicast rate based on the now-known receiving STAs). I really hope you're not suggesting the entire MDB with IPv4 & IPv6 snooping be duplicated into both VPLS and mac80211?> As you've noted this is only an RFC so I will not point out every issue, but there seems > to be a major problem with br_fdb_update(), note that it runs without any locks except RCU.Right, Thanks! ... I only thought about concurrent access, forgetting about concurrent modification... I'll replace it with an xchg I think. (No need for a lock that way) That said, now that I think about it, the skb_dst_set_noref() in the following chunk is probably not safe if the VPLS device has a qdisc on it. I was relying on the fact that br_dev_xmit is holding RCU there, but if the SKB is queued, md_dst might go away in the meantime... ... probably need to change this to dst_hold() + skb_dst_set()... -David> > diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c > > index 861ae2a165f4..534cacf02f8d 100644 > > --- a/net/bridge/br_device.c > > +++ b/net/bridge/br_device.c > > @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev) > > else > > br_flood(br, skb, BR_PKT_MULTICAST, false, true); > > } else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) { > > + struct dst_entry *md_dst = rcu_dereference(dst->md_dst); > > + if (md_dst) > > + skb_dst_set_noref(skb, md_dst); > > br_forward(dst->dst, skb, false, true); > > } else { > > br_flood(br, skb, BR_PKT_UNICAST, false, true);> > diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c > > index a79b648aac88..0751fcb89699 100644 > > @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source, > > source->dev->name, addr, vid); > > } else { > > unsigned long now = jiffies; > > + struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst); > > > > /* fastpath: update of existing entry */ > > - if (unlikely(source != fdb->dst)) { > > + if (unlikely(source != fdb->dst || > > + dst_metadata_cmp(md_dst, ref_md))) { > > fdb->dst = source; > > + dst_release(ref_md); > > + rcu_assign_pointer(fdb->md_dst, > > + dst_clone(md_dst)); > > fdb_modified = true; > > /* Take over HW learned entry */ > > if (unlikely(fdb->added_by_external_learn))
Nikolay Aleksandrov
2017-Aug-17 11:39 UTC
[Bridge] [PATCH 1/6] bridge: learn dst metadata in FDB
On 17/08/17 14:03, David Lamparter wrote:> Hi Nikolay, > > > On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote: >> On 16/08/17 20:01, David Lamparter wrote: >>> This implements holding dst metadata information in the bridge layer, >>> but only for unicast entries in the MAC table. Multicast is still left >>> to design and implement. >> >> Sorry but I do not agree with this change, adding a special case for >> VPLS in the bridge code > > I don't think this is specific to VPLS at all, though you're right that > VPLS is the only user currently. > >> and hitting the fast path for everyone in a few different places for a >> feature that the majority will not use does not sound acceptable to >> me. We've been trying hard to optimize it, trying to avoid additional >> cache lines, removing tests and keeping special cases to a minimum. > > skb->dst is on the same cacheline as skb->len. > fdb->md_dst is on the same cacheline as fdb->dst. > Both will be 0 in a lot of cases, so this should be two null checks on > data that is hot in the cache. Are you sure this is an actual problem? >Sure - no, I haven't benchmarked it, but I don't see skb->len being on the same cache line as _skb_refdst assuming 64 byte cache lines. But again any special cases, in my opinion, should be handled on their own, it is both about the fast path and the code complexity that they bring in.>> I understand that you want to use the fdb tables and avoid >> duplication, but this is not worth it. There're other similar use >> cases and they have their own private fdb tables, that way the user >> can opt out and is much cleaner and separated. > > Sure, this can be done. I think it's a noticeable performance penalty > to have the entire fdb copied (multiple times for H-VPLS even), but I > understand that it's preferable to have the normal cases faster in > exchange. As the previous paragraph notes, I still wonder if that hit > to the normal case exists though. > > I will leave this to Amine, he's paid to work on VPLS while I'm doing > this for fun ;) > > There is however another concern I have here. As I noted in my > introductory mail, I'm working on the bridge MDB making similar changes. > And there's actually strong use cases for this in both VPLS and the > 802.11 code (though I'm not sure I can code the latter one up, it's > related to rate control and this is seriously complicated - the goal is > to select a multicast rate based on the now-known receiving STAs). > > I really hope you're not suggesting the entire MDB with IPv4 & IPv6 > snooping be duplicated into both VPLS and mac80211? >Code can always be shared if there are more users, no need to stuff everything in the bridge, but I'm not that familiar with this case, once patches are out I can comment further.>> As you've noted this is only an RFC so I will not point out every issue, but there seems >> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU. > > Right, Thanks! ... I only thought about concurrent access, forgetting > about concurrent modification... I'll replace it with an xchg I think. > (No need for a lock that way)I think you can still lose references to a dst that way, what if someone changes the dst you read before the xchg and you xchg it ?> > That said, now that I think about it, the skb_dst_set_noref() in the > following chunk is probably not safe if the VPLS device has a qdisc on > it. I was relying on the fact that br_dev_xmit is holding RCU there, > but if the SKB is queued, md_dst might go away in the meantime... > ... probably need to change this to dst_hold() + skb_dst_set()... > > > -David > >>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c >>> index 861ae2a165f4..534cacf02f8d 100644 >>> --- a/net/bridge/br_device.c >>> +++ b/net/bridge/br_device.c >>> @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev) >>> else >>> br_flood(br, skb, BR_PKT_MULTICAST, false, true); >>> } else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) { >>> + struct dst_entry *md_dst = rcu_dereference(dst->md_dst); >>> + if (md_dst) >>> + skb_dst_set_noref(skb, md_dst); >>> br_forward(dst->dst, skb, false, true); >>> } else { >>> br_flood(br, skb, BR_PKT_UNICAST, false, true); > > >>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c >>> index a79b648aac88..0751fcb89699 100644 >>> @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source, >>> source->dev->name, addr, vid); >>> } else { >>> unsigned long now = jiffies; >>> + struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst); >>> >>> /* fastpath: update of existing entry */ >>> - if (unlikely(source != fdb->dst)) { >>> + if (unlikely(source != fdb->dst || >>> + dst_metadata_cmp(md_dst, ref_md))) { >>> fdb->dst = source; >>> + dst_release(ref_md); >>> + rcu_assign_pointer(fdb->md_dst, >>> + dst_clone(md_dst)); >>> fdb_modified = true; >>> /* Take over HW learned entry */ >>> if (unlikely(fdb->added_by_external_learn))