Nikolay Aleksandrov
2017-Aug-17 11:39 UTC
[Bridge] [PATCH 1/6] bridge: learn dst metadata in FDB
On 17/08/17 14:03, David Lamparter wrote:> Hi Nikolay, > > > On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote: >> On 16/08/17 20:01, David Lamparter wrote: >>> This implements holding dst metadata information in the bridge layer, >>> but only for unicast entries in the MAC table. Multicast is still left >>> to design and implement. >> >> Sorry but I do not agree with this change, adding a special case for >> VPLS in the bridge code > > I don't think this is specific to VPLS at all, though you're right that > VPLS is the only user currently. > >> and hitting the fast path for everyone in a few different places for a >> feature that the majority will not use does not sound acceptable to >> me. We've been trying hard to optimize it, trying to avoid additional >> cache lines, removing tests and keeping special cases to a minimum. > > skb->dst is on the same cacheline as skb->len. > fdb->md_dst is on the same cacheline as fdb->dst. > Both will be 0 in a lot of cases, so this should be two null checks on > data that is hot in the cache. Are you sure this is an actual problem? >Sure - no, I haven't benchmarked it, but I don't see skb->len being on the same cache line as _skb_refdst assuming 64 byte cache lines. But again any special cases, in my opinion, should be handled on their own, it is both about the fast path and the code complexity that they bring in.>> I understand that you want to use the fdb tables and avoid >> duplication, but this is not worth it. There're other similar use >> cases and they have their own private fdb tables, that way the user >> can opt out and is much cleaner and separated. > > Sure, this can be done. I think it's a noticeable performance penalty > to have the entire fdb copied (multiple times for H-VPLS even), but I > understand that it's preferable to have the normal cases faster in > exchange. As the previous paragraph notes, I still wonder if that hit > to the normal case exists though. > > I will leave this to Amine, he's paid to work on VPLS while I'm doing > this for fun ;) > > There is however another concern I have here. As I noted in my > introductory mail, I'm working on the bridge MDB making similar changes. > And there's actually strong use cases for this in both VPLS and the > 802.11 code (though I'm not sure I can code the latter one up, it's > related to rate control and this is seriously complicated - the goal is > to select a multicast rate based on the now-known receiving STAs). > > I really hope you're not suggesting the entire MDB with IPv4 & IPv6 > snooping be duplicated into both VPLS and mac80211? >Code can always be shared if there are more users, no need to stuff everything in the bridge, but I'm not that familiar with this case, once patches are out I can comment further.>> As you've noted this is only an RFC so I will not point out every issue, but there seems >> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU. > > Right, Thanks! ... I only thought about concurrent access, forgetting > about concurrent modification... I'll replace it with an xchg I think. > (No need for a lock that way)I think you can still lose references to a dst that way, what if someone changes the dst you read before the xchg and you xchg it ?> > That said, now that I think about it, the skb_dst_set_noref() in the > following chunk is probably not safe if the VPLS device has a qdisc on > it. I was relying on the fact that br_dev_xmit is holding RCU there, > but if the SKB is queued, md_dst might go away in the meantime... > ... probably need to change this to dst_hold() + skb_dst_set()... > > > -David > >>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c >>> index 861ae2a165f4..534cacf02f8d 100644 >>> --- a/net/bridge/br_device.c >>> +++ b/net/bridge/br_device.c >>> @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev) >>> else >>> br_flood(br, skb, BR_PKT_MULTICAST, false, true); >>> } else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) { >>> + struct dst_entry *md_dst = rcu_dereference(dst->md_dst); >>> + if (md_dst) >>> + skb_dst_set_noref(skb, md_dst); >>> br_forward(dst->dst, skb, false, true); >>> } else { >>> br_flood(br, skb, BR_PKT_UNICAST, false, true); > > >>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c >>> index a79b648aac88..0751fcb89699 100644 >>> @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source, >>> source->dev->name, addr, vid); >>> } else { >>> unsigned long now = jiffies; >>> + struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst); >>> >>> /* fastpath: update of existing entry */ >>> - if (unlikely(source != fdb->dst)) { >>> + if (unlikely(source != fdb->dst || >>> + dst_metadata_cmp(md_dst, ref_md))) { >>> fdb->dst = source; >>> + dst_release(ref_md); >>> + rcu_assign_pointer(fdb->md_dst, >>> + dst_clone(md_dst)); >>> fdb_modified = true; >>> /* Take over HW learned entry */ >>> if (unlikely(fdb->added_by_external_learn))
Nikolay Aleksandrov
2017-Aug-17 11:51 UTC
[Bridge] [PATCH 1/6] bridge: learn dst metadata in FDB
On 17/08/17 14:39, Nikolay Aleksandrov wrote:> On 17/08/17 14:03, David Lamparter wrote: >> Hi Nikolay, >> >> >> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote: >>> On 16/08/17 20:01, David Lamparter wrote: >>>> This implements holding dst metadata information in the bridge layer, >>>> but only for unicast entries in the MAC table. Multicast is still left >>>> to design and implement. >>> >>> Sorry but I do not agree with this change, adding a special case for >>> VPLS in the bridge code >> >> I don't think this is specific to VPLS at all, though you're right that >> VPLS is the only user currently. >> >>> and hitting the fast path for everyone in a few different places for a >>> feature that the majority will not use does not sound acceptable to >>> me. We've been trying hard to optimize it, trying to avoid additional >>> cache lines, removing tests and keeping special cases to a minimum. >> >> skb->dst is on the same cacheline as skb->len. >> fdb->md_dst is on the same cacheline as fdb->dst. >> Both will be 0 in a lot of cases, so this should be two null checks on >> data that is hot in the cache. Are you sure this is an actual problem? >> > > Sure - no, I haven't benchmarked it, but I don't see skb->len being on > the same cache line as _skb_refdst assuming 64 byte cache lines.I should've been clearer - that obviously depends on the kernel config, but in order for them to be in the same line you need to disable either one of conntrack, bridge_netfilter or xfrm, these are almost always enabled (at least in all major distributions).> But again any special cases, in my opinion, should be handled on their own, > it is both about the fast path and the code complexity that they bring in. > >>> I understand that you want to use the fdb tables and avoid >>> duplication, but this is not worth it. There're other similar use >>> cases and they have their own private fdb tables, that way the user >>> can opt out and is much cleaner and separated. >> >> Sure, this can be done. I think it's a noticeable performance penalty >> to have the entire fdb copied (multiple times for H-VPLS even), but I >> understand that it's preferable to have the normal cases faster in >> exchange. As the previous paragraph notes, I still wonder if that hit >> to the normal case exists though. >> >> I will leave this to Amine, he's paid to work on VPLS while I'm doing >> this for fun ;) >> >> There is however another concern I have here. As I noted in my >> introductory mail, I'm working on the bridge MDB making similar changes. >> And there's actually strong use cases for this in both VPLS and the >> 802.11 code (though I'm not sure I can code the latter one up, it's >> related to rate control and this is seriously complicated - the goal is >> to select a multicast rate based on the now-known receiving STAs). >> >> I really hope you're not suggesting the entire MDB with IPv4 & IPv6 >> snooping be duplicated into both VPLS and mac80211? >> > > Code can always be shared if there are more users, no need to stuff everything in > the bridge, but I'm not that familiar with this case, once patches are out I can > comment further. > >>> As you've noted this is only an RFC so I will not point out every issue, but there seems >>> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU. >> >> Right, Thanks! ... I only thought about concurrent access, forgetting >> about concurrent modification... I'll replace it with an xchg I think. >> (No need for a lock that way) > > I think you can still lose references to a dst that way, what if someone changes the > dst you read before the xchg and you xchg it ? > >> >> That said, now that I think about it, the skb_dst_set_noref() in the >> following chunk is probably not safe if the VPLS device has a qdisc on >> it. I was relying on the fact that br_dev_xmit is holding RCU there, >> but if the SKB is queued, md_dst might go away in the meantime... >> ... probably need to change this to dst_hold() + skb_dst_set()... >> >> >> -David >> >>>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c >>>> index 861ae2a165f4..534cacf02f8d 100644 >>>> --- a/net/bridge/br_device.c >>>> +++ b/net/bridge/br_device.c >>>> @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev) >>>> else >>>> br_flood(br, skb, BR_PKT_MULTICAST, false, true); >>>> } else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) { >>>> + struct dst_entry *md_dst = rcu_dereference(dst->md_dst); >>>> + if (md_dst) >>>> + skb_dst_set_noref(skb, md_dst); >>>> br_forward(dst->dst, skb, false, true); >>>> } else { >>>> br_flood(br, skb, BR_PKT_UNICAST, false, true); >> >> >>>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c >>>> index a79b648aac88..0751fcb89699 100644 >>>> @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source, >>>> source->dev->name, addr, vid); >>>> } else { >>>> unsigned long now = jiffies; >>>> + struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst); >>>> >>>> /* fastpath: update of existing entry */ >>>> - if (unlikely(source != fdb->dst)) { >>>> + if (unlikely(source != fdb->dst || >>>> + dst_metadata_cmp(md_dst, ref_md))) { >>>> fdb->dst = source; >>>> + dst_release(ref_md); >>>> + rcu_assign_pointer(fdb->md_dst, >>>> + dst_clone(md_dst)); >>>> fdb_modified = true; >>>> /* Take over HW learned entry */ >>>> if (unlikely(fdb->added_by_external_learn)) >
David Lamparter
2017-Aug-17 12:45 UTC
[Bridge] [PATCH 1/6] bridge: learn dst metadata in FDB
On Thu, Aug 17, 2017 at 02:39:43PM +0300, Nikolay Aleksandrov wrote:> On 17/08/17 14:03, David Lamparter wrote: > > On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote: > >> On 16/08/17 20:01, David Lamparter wrote: > >> and hitting the fast path for everyone in a few different places for a > >> feature that the majority will not use does not sound acceptable to > >> me. We've been trying hard to optimize it, trying to avoid additional > >> cache lines, removing tests and keeping special cases to a minimum. > > > > skb->dst is on the same cacheline as skb->len. > > fdb->md_dst is on the same cacheline as fdb->dst. > > Both will be 0 in a lot of cases, so this should be two null checks on > > data that is hot in the cache. Are you sure this is an actual problem? > > > > Sure - no, I haven't benchmarked it, but I don't see skb->len being on > the same cache line as _skb_refdst assuming 64 byte cache lines. > But again any special cases, in my opinion, should be handled on their own, > it is both about the fast path and the code complexity that they bring in.(separate thread) [cut]> > I really hope you're not suggesting the entire MDB with IPv4 & IPv6 > > snooping be duplicated into both VPLS and mac80211? > > Code can always be shared if there are more users, no need to stuff > everything in the bridge,The MDB code is far from trivial, has several configuration knobs, and even sends its own queries if configured to do so. It can also use quite a bit of memory of there's a nontrivial number of multicast groups. I *really* think it shouldn't be duplicated.> but I'm not that familiar with this case, once patches are out I can > comment further.I've pushed my hacks to: https://github.com/eqvinox/vpls-linux-kernel/commits/mdb-hack (top two commits) THIS IS ABSOLUTELY A PROOF OF CONCEPT. It doesn't un-learn dst metadata, it probably leaks buckets, and it may kill your cat. (I haven't pushed my attempts at mac80211, because I haven't gotten anywhere useful there just yet.)> >> As you've noted this is only an RFC so I will not point out every issue, but there seems > >> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU. > > > > Right, Thanks! ... I only thought about concurrent access, forgetting > > about concurrent modification... I'll replace it with an xchg I think. > > (No need for a lock that way) > > I think you can still lose references to a dst that way, what if someone changes the > dst you read before the xchg and you xchg it ?The dst to be released is the return from (atomic) xchg, not the value read earlier for comparison. This can happen in parallel, but apart from a little extra churn in the update case it has no ill effects. If someone changes it in the meantime, they have new dst information for the fdb entry, and so do we. With xchg'ing it, either one of the updates will stick and the other will be properly released. Considering that there is no correct ordering here (either packet could be processed a nanosecond later or earlier), this is perfectly fine as an outcome. -David