Vladimir Oltean
2021-Jul-03 11:56 UTC
[Bridge] [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
For this series I have taken Tobias' work from here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias at waldekranz.com/
and made the following changes:

- I collected and integrated (hopefully all of) Nikolay's, Ido's and my
  feedback on the bridge driver changes. Otherwise, the structure of the
  bridge changes is pretty much the same as Tobias left it.
- I basically rewrote the DSA infrastructure for the data plane
  forwarding offload, based on the commonalities with another switch
  driver for which I implemented this feature (not submitted here).
- I adapted mv88e6xxx to use the new infrastructure; hopefully it still
  works, but I didn't test that.

The data plane of the software bridge can be partially offloaded to
switchdev, in the sense that we can trust the accelerator to:
(a) look up its FDB (which is more or less in sync with the software
    bridge FDB) for selecting the destination ports for a packet
(b) replicate the frame in hardware in case it's a multicast/broadcast,
    instead of the software bridge having to clone it and send the
    clones to each net device one at a time. This reduces the bandwidth
    needed between the CPU and the accelerator, as well as the CPU time
    spent.

The data path forwarding offload is managed per "hardware domain" - a
generalization of the "offload_fwd_mark" concept which is being
introduced in this series. Every packet is delivered only once to each
hardware domain.

In addition, Tobias said in the original cover letter:

===================

## Overview

     vlan1    vlan2
         \    /
       .-----------.
       |    br0    |
       '-----------'
       /  /    \   \
    swp0 swp1 swp2 eth0
      :    :    :
      (hwdom 1)

Up to this point, switchdevs have been trusted with offloading
forwarding between bridge ports, e.g. forwarding a unicast from swp0
to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
series extends forward offloading to include some new classes of
traffic:

- Locally originating flows, i.e. packets that ingress on br0 that are
  to be forwarded to one or several of the ports swp{0,1,2}. Notably
  this also includes routed flows, e.g. a packet ingressing swp0 on
  VLAN 1 which is then routed over to VLAN 2 by the CPU and then
  forwarded to swp1 is "locally originating" from br0's point of view.

- Flows originating from "foreign" interfaces, i.e. an interface that
  is not offloaded by a particular switchdev instance. This includes
  ports belonging to other switchdev instances. A typical example
  would be flows from eth0 towards swp{0,1,2}.

The bridge still looks up its FDB/MDB as usual and then notifies the
switchdev driver that a particular skb should be offloaded if it
matches one of the classes above. It does so by using the _accel
version of dev_queue_xmit, supplying its own netdev as the
"subordinate" device. The driver can react to the presence of the
subordinate in its .ndo_select_queue in whatever way it needs to make
sure to forward the skb in much the same way that it would for packets
ingressing on regular ports.

Hardware domains to which a particular skb has been forwarded are
recorded so that duplicates are avoided.

The main performance benefit is thus seen on multicast flows. Imagine
for example that:

- An IP camera is connected to swp0 (VLAN 1)
- The CPU is acting as a multicast router, routing the group from
  VLAN 1 to VLAN 2.
- There are subscribers for the group in question behind both swp1 and
  swp2 (VLAN 2).

With this offloading in place, the bridge need only send a single skb
to the driver, which will send it to the hardware marked in such a way
that the switch will perform the multicast replication according to
the MDB configuration. Naturally, the number of saved skb_clones
increases linearly with the number of subscribed ports.
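To make the "delivered only once per hardware domain" idea concrete, here is a minimal user-space sketch (not kernel code) of the flooding logic it implies: every port carries a hwdom ID, and a plain unsigned long bitmap on the packet records which hwdoms have already been handed a copy. The `struct port` and `flood()` names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of per-hwdom deduplication during flooding.
 * hwdom 0 means "not behind any switchdev", so such ports always
 * receive their own copy of the skb.
 */
struct port {
	int hwdom;
};

/* Returns the number of skb copies actually transmitted; marks each
 * visited hwdom in *fwd_hwdoms so it is only served once.
 */
static int flood(const struct port *ports, size_t num_ports,
		 unsigned long *fwd_hwdoms)
{
	int tx = 0;
	size_t i;

	for (i = 0; i < num_ports; i++) {
		int hwdom = ports[i].hwdom;

		if (hwdom && (*fwd_hwdoms & (1UL << hwdom)))
			continue;	/* this domain already got a copy */
		if (hwdom)
			*fwd_hwdoms |= 1UL << hwdom;
		tx++;	/* transmit one copy; hardware replicates within the hwdom */
	}
	return tx;
}
```

With the topology from the diagram (swp0..swp2 in hwdom 1, eth0 foreign), a broadcast costs two transmissions instead of four: one FORWARD frame for the whole switchdev, plus one copy for eth0.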
As an extra benefit, on mv88e6xxx, this also allows the switch to
perform source address learning on these flows, which avoids having to
sync dynamic FDB entries over slow configuration interfaces like MDIO
to avoid flows directed towards the CPU being flooded as unknown
unicast by the switch.

## RFC

- In general, what do you think about this idea?

- hwdom. What do you think about this terminology? Personally I feel
  that we had too many things called offload_fwd_mark, and that as the
  use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
  might be useful to have a separate term for it.

- .dfwd_{add,del}_station. Am I stretching this abstraction too far,
  and if so do you have any suggestion/preference on how to signal the
  offloading from the bridge down to the switchdev driver?

- The way that flooding is implemented in br_forward.c (lazily cloning
  skbs) means that you have to mark the forwarding as completed very
  early (right after should_deliver in maybe_deliver) in order to
  avoid duplicates. Is there some way to move this decision point to a
  later stage that I am missing?

- BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
  compatible with multicast-to-unicast being used on a port. Then
  again, I think that this would also be broken for regular switchdev
  bridge offloading, as this flag is not offloaded to the switchdev
  port, so there is no way for the driver to refuse it. Any ideas on
  how to handle this?

## mv88e6xxx Specifics

Since we are now only receiving a single skb for both unicast and
multicast flows, we can tag the packets with the FORWARD command
instead of FROM_CPU. The switch(es) will then forward the packet in
accordance with its ATU, VTU, STU, and PVT configuration - just like
for packets ingressing on user ports.

Crucially, FROM_CPU is still used for:

- Ports in standalone mode.

- Flows that are trapped to the CPU and software-forwarded by a
  bridge. Note that these flows match neither of the classes discussed
  in the overview.
- Packets that are sent directly to a port netdev without going
  through the bridge, e.g. lldpd sending out PDUs via an AF_PACKET
  socket.

We thus have a pretty clean separation where the data plane uses
FORWARDs and the control plane uses TO_/FROM_CPU.

The barrier between different bridges is enforced by port based VLANs
on mv88e6xxx, which in essence is a mapping from a source device/port
pair to an allowed set of egress ports. In order to have a FORWARD
frame (which carries a _source_ device/port) correctly mapped by the
PVT, we must use a unique pair for each bridge.

Fortunately, there is typically lots of unused address space in most
switch trees. When was the last time you saw an mv88e6xxx product
using more than 4 chips? Even if you found one with 16 (!) devices,
you would still have room to allocate 16*16 virtual ports to software
bridges. Therefore, the mv88e6xxx driver will allocate a virtual
device/port pair to each bridge that it offloads. All members of the
same bridge are then configured to allow packets from this virtual
port in their PVTs.
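A user-space sketch of the virtual device/port allocation just described, under the assumption that real chips occupy device IDs 0..num_real_devs-1 and that the device ID space holds 32 devices of 16 ports each (the exact limits of the real hardware may differ; `struct virt_alloc` and `alloc_virtual_port()` are hypothetical names for illustration only):

```c
#include <assert.h>

/* Hypothetical allocator: software bridges get unique (device, port)
 * source pairs carved out of the unused device ID space above the
 * real chips, so the PVT can map FORWARD frames from each bridge.
 */
#define NUM_PORTS_PER_DEV 16
#define NUM_DEVS 32	/* assumed size of the device ID space */

struct virt_alloc {
	int num_real_devs;	/* real chips occupy IDs below this */
	int next;		/* next free virtual slot */
};

/* Returns 0 on success and fills in *dev/*port; returns -1 when the
 * virtual address space is exhausted.
 */
static int alloc_virtual_port(struct virt_alloc *a, int *dev, int *port)
{
	int slot = a->num_real_devs * NUM_PORTS_PER_DEV + a->next;

	if (slot >= NUM_DEVS * NUM_PORTS_PER_DEV)
		return -1;
	a->next++;
	*dev = slot / NUM_PORTS_PER_DEV;
	*port = slot % NUM_PORTS_PER_DEV;
	return 0;
}
```

So with 4 real chips, the first offloaded bridge would be assigned virtual source pair (4, 0), the second (4, 1), and so on; each member port then allows that pair in its PVT.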
===================

Tobias Waldekranz (5):
  net: dfwd: constrain existing users to macvlan subordinates
  net: bridge: disambiguate offload_fwd_mark
  net: bridge: switchdev: recycle unused hwdoms
  net: bridge: switchdev: allow the data plane forwarding to be
    offloaded
  net: dsa: tag_dsa: offload the bridge forwarding process

Vladimir Oltean (5):
  net: extract helpers for binding a subordinate device to TX queues
  net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
  net: dsa: track the number of switches in a tree
  net: dsa: add support for bridge forwarding offload
  net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
    the PVT

 drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
 .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
 include/linux/if_bridge.h                     |   1 +
 include/linux/netdevice.h                     |  13 +-
 include/net/dsa.h                             |  37 ++++
 net/bridge/br_forward.c                       |  18 +-
 net/bridge/br_if.c                            |   4 +-
 net/bridge/br_private.h                       |  49 +++++-
 net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
 net/bridge/br_vlan.c                          |  10 +-
 net/core/dev.c                                |  31 +++-
 net/dsa/dsa2.c                                |   3 +
 net/dsa/dsa_priv.h                            |  28 +++
 net/dsa/port.c                                |  35 ++++
 net/dsa/slave.c                               | 134 +++++++++++++-
 net/dsa/switch.c                              |  58 +++++++
 net/dsa/tag_dsa.c                             |  60 ++++++-
 19 files changed, 700 insertions(+), 59 deletions(-)

-- 
2.25.1
Vladimir Oltean
2021-Jul-03 11:56 UTC
[Bridge] [RFC PATCH v2 net-next 01/10] net: dfwd: constrain existing users to macvlan subordinates
From: Tobias Waldekranz <tobias at waldekranz.com>

The dfwd_add/del_station NDOs are currently only used by the macvlan
subsystem to request L2 forwarding offload from lower devices. In
order to add support for other types of devices (like bridges), we
constrain the current users to make sure that the subordinate
requesting the offload is in fact a macvlan.

Signed-off-by: Tobias Waldekranz <tobias at waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 3 +++
 drivers/net/ethernet/intel/i40e/i40e_main.c     | 3 +++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 2fb52bd6fc0e..4dba6e6a282d 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1352,6 +1352,9 @@ static void *fm10k_dfwd_add_station(struct net_device *dev,
 	int size, i;
 	u16 vid, glort;
 
+	if (!netif_is_macvlan(sdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	/* The hardware supported by fm10k only filters on the destination MAC
 	 * address. In order to avoid issues we only support offloading modes
 	 * where the hardware can actually provide the functionality.
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 861e59a350bd..812ad241a049 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7629,6 +7629,9 @@ static void *i40e_fwd_add(struct net_device *netdev, struct net_device *vdev)
 	struct i40e_fwd_adapter *fwd;
 	int avail_macvlan, ret;
 
+	if (!netif_is_macvlan(vdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if ((pf->flags & I40E_FLAG_DCB_ENABLED)) {
 		netdev_info(netdev, "Macvlans are not supported when DCB is enabled\n");
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ffff69efd78a..1ecdb7dc9534 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9938,6 +9938,9 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	int tcs = adapter->hw_tcs ? : 1;
 	int pool, err;
 
+	if (!netif_is_macvlan(vdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if (adapter->xdp_prog) {
 		e_warn(probe, "L2FW offload is not supported with XDP\n");
 		return ERR_PTR(-EINVAL);
-- 
2.25.1
Vladimir Oltean
2021-Jul-03 11:56 UTC
[Bridge] [RFC PATCH v2 net-next 02/10] net: bridge: disambiguate offload_fwd_mark
From: Tobias Waldekranz <tobias at waldekranz.com>

Before this change, four related - but distinct - concepts were named
offload_fwd_mark:

- skb->offload_fwd_mark: Set by the switchdev driver if the underlying
  hardware has already forwarded this frame to the other ports in the
  same hardware domain.

- nbp->offload_fwd_mark: An identifier used to group ports that share
  the same hardware forwarding domain.

- br->offload_fwd_mark: Counter used to make sure that unique IDs are
  used in cases where a bridge contains ports from multiple hardware
  domains.

- skb->cb->offload_fwd_mark: The hardware domain on which the frame
  ingressed and was forwarded.

Introduce the term "hardware forwarding domain" ("hwdom") in the
bridge to denote a set of ports with the following property: If an skb
with skb->offload_fwd_mark set is received on a port belonging to
hwdom N, that frame has already been forwarded to all other ports in
hwdom N.

By decoupling the name from "offload_fwd_mark", we can extend the
term's definition in the future - e.g. to add constraints that
describe expected egress behavior - without overloading the meaning of
"offload_fwd_mark".

- nbp->offload_fwd_mark thus becomes nbp->hwdom.
- br->offload_fwd_mark becomes br->last_hwdom.
- skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom.

The slight change in naming here mandates a slight change in behavior
of the nbp_switchdev_frame_mark() function. Previously, it only set
this value in skb->cb for packets with skb->offload_fwd_mark true
(ones which were forwarded in hardware). Whereas now we always track
the incoming hwdom for all packets coming from a switchdev (even for
the packets which weren't forwarded in hardware, such as STP BPDUs,
IGMP reports etc).

As all uses of skb->cb->offload_fwd_mark were already gated behind
checks of skb->offload_fwd_mark, this will not introduce any
functional change, but it paves the way for future changes where the
ingressing hwdom must be known for frames coming from a switchdev
regardless of whether they were forwarded in hardware or not
(basically, if the skb comes from a switchdev, skb->cb->src_hwdom now
always tracks which one).

A typical example where this is relevant: the switchdev has a fixed
configuration to trap STP BPDUs, but STP is not running on the bridge
and the group_fwd_mask allows them to be forwarded. Say we have this
setup:

        br0
       / | \
      /  |  \
  swp0 swp1 swp2

A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
set skb->offload_fwd_mark. The bridge determines that the frame should
be forwarded to swp{1,2}. It is imperative that forward offloading is
_not_ allowed in this case, as the source hwdom is already "poisoned".

Recording the source hwdom allows this case to be handled properly.
Signed-off-by: Tobias Waldekranz <tobias at waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com>
---
 net/bridge/br_if.c        |  2 +-
 net/bridge/br_private.h   | 10 +++++-----
 net/bridge/br_switchdev.c | 16 ++++++++--------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f7d2f472ae24..73fa703f8df5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -643,7 +643,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	if (err)
 		goto err5;
 
-	err = nbp_switchdev_mark_set(p);
+	err = nbp_switchdev_hwdom_set(p);
 	if (err)
 		goto err6;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 2b48b204205e..e16879caaaf3 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -329,7 +329,7 @@ struct net_bridge_port {
 	struct netpoll			*np;
 #endif
 #ifdef CONFIG_NET_SWITCHDEV
-	int				offload_fwd_mark;
+	int				hwdom;
 #endif
 	u16				group_fwd_mask;
 	u16				backup_redirected_cnt;
@@ -476,7 +476,7 @@ struct net_bridge {
 	u32				auto_cnt;
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int				offload_fwd_mark;
+	int				last_hwdom;
 #endif
 
 	struct hlist_head		fdb_list;
@@ -506,7 +506,7 @@ struct br_input_skb_cb {
 #endif
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int offload_fwd_mark;
+	int src_hwdom;
 #endif
 };
 
@@ -1645,7 +1645,7 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
-int nbp_switchdev_mark_set(struct net_bridge_port *p);
+int nbp_switchdev_hwdom_set(struct net_bridge_port *p);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1665,7 +1665,7 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 	skb->offload_fwd_mark = 0;
 }
 #else
-static inline int nbp_switchdev_mark_set(struct net_bridge_port *p)
+static inline int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
 {
 	return 0;
 }
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index d3adee0f91f9..833fd30482c2 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,20 +8,20 @@
 
 #include "br_private.h"
 
-static int br_switchdev_mark_get(struct net_bridge *br, struct net_device *dev)
+static int br_switchdev_hwdom_get(struct net_bridge *br, struct net_device *dev)
 {
 	struct net_bridge_port *p;
 
 	/* dev is yet to be added to the port list. */
 	list_for_each_entry(p, &br->port_list, list) {
 		if (netdev_port_same_parent_id(dev, p->dev))
-			return p->offload_fwd_mark;
+			return p->hwdom;
 	}
 
-	return ++br->offload_fwd_mark;
+	return ++br->last_hwdom;
 }
 
-int nbp_switchdev_mark_set(struct net_bridge_port *p)
+int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
 {
 	struct netdev_phys_item_id ppid = { };
 	int err;
@@ -35,7 +35,7 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
 		return err;
 	}
 
-	p->offload_fwd_mark = br_switchdev_mark_get(p->br, p->dev);
+	p->hwdom = br_switchdev_hwdom_get(p->br, p->dev);
 
 	return 0;
 }
@@ -43,15 +43,15 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
-	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
-		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
+	if (p->hwdom)
+		BR_INPUT_SKB_CB(skb)->src_hwdom = p->hwdom;
 }
 
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 				  const struct sk_buff *skb)
 {
 	return !skb->offload_fwd_mark ||
-	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
+	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
 }
 
 /* Flags that can be offloaded to hardware */
-- 
2.25.1
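The behavior change in this patch can be modeled in a few lines of user-space C (illustrative only; `struct skb_cb`, `frame_mark()` and `allowed_egress()` are hypothetical stand-ins for the kernel structures and functions): the ingress hwdom is now recorded for every frame coming from a switchdev, while the software egress filter still only supp. acts on hardware-forwarded frames, i.e. those with offload_fwd_mark set. A trapped BPDU thus keeps a valid src_hwdom that later logic can consult.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the src_hwdom bookkeeping after this patch. */
struct skb_cb {
	bool offload_fwd_mark;	/* set by the driver if hw already forwarded */
	int src_hwdom;		/* ingress hardware domain, 0 = none */
};

/* src_hwdom is now recorded unconditionally for switchdev ports,
 * not only when offload_fwd_mark is set.
 */
static void frame_mark(struct skb_cb *cb, int ingress_hwdom)
{
	if (ingress_hwdom)
		cb->src_hwdom = ingress_hwdom;
}

/* Software egress is refused only for frames the hardware has already
 * forwarded within the port's own domain.
 */
static bool allowed_egress(const struct skb_cb *cb, int port_hwdom)
{
	return !cb->offload_fwd_mark || cb->src_hwdom != port_hwdom;
}
```

In the BPDU example from the commit message, the trapped frame has offload_fwd_mark unset, so software forwarding towards swp{1,2} (same hwdom) is still allowed, yet its recorded src_hwdom lets a later patch refuse *offloaded* forwarding back into the "poisoned" domain.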
Vladimir Oltean
2021-Jul-03 11:56 UTC
[Bridge] [RFC PATCH v2 net-next 03/10] net: bridge: switchdev: recycle unused hwdoms
From: Tobias Waldekranz <tobias at waldekranz.com>

Since hwdoms have only been used thus far for equality comparisons,
the bridge has used the simplest possible assignment policy, using a
counter to keep track of the last value handed out.

With the upcoming transmit offloading, we need to perform set
operations efficiently based on hwdoms, e.g. we want to answer
questions like "has this skb been forwarded to any port within this
hwdom?"

Move to a bitmap-based allocation scheme that recycles hwdoms once all
members have left the bridge. This means that we can use a single
unsigned long to keep track of the hwdoms that have received an skb.

v1->v2: convert the typedef DECLARE_BITMAP(br_hwdom_map_t,
BR_HWDOM_MAX) into a plain unsigned long.

Signed-off-by: Tobias Waldekranz <tobias at waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com>
---
 net/bridge/br_if.c        |  4 +-
 net/bridge/br_private.h   | 27 ++++++++---
 net/bridge/br_switchdev.c | 94 ++++++++++++++++++++++++++-------------
 3 files changed, 85 insertions(+), 40 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 73fa703f8df5..adaf78e45c23 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -349,6 +349,7 @@ static void del_nbp(struct net_bridge_port *p)
 	nbp_backup_clear(p);
 
 	nbp_update_port_count(br);
+	nbp_switchdev_del(p);
 
 	netdev_upper_dev_unlink(dev, br->dev);
 
@@ -643,7 +644,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	if (err)
 		goto err5;
 
-	err = nbp_switchdev_hwdom_set(p);
+	err = nbp_switchdev_add(p);
 	if (err)
 		goto err6;
 
@@ -704,6 +705,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	list_del_rcu(&p->list);
 	br_fdb_delete_by_port(br, p, 0, 1);
 	nbp_update_port_count(br);
+	nbp_switchdev_del(p);
 err6:
 	netdev_upper_dev_unlink(dev, br->dev);
 err5:
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index e16879caaaf3..9ff09a32e3f8 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -29,6 +29,8 @@
 
 #define BR_MULTICAST_DEFAULT_HASH_MAX 4096
 
+#define BR_HWDOM_MAX BITS_PER_LONG
+
 #define BR_VERSION	"2.3"
 
 /* Control of forwarding link local multicast */
@@ -476,7 +478,7 @@ struct net_bridge {
 	u32				auto_cnt;
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int				last_hwdom;
+	unsigned long			busy_hwdoms;
 #endif
 
 	struct hlist_head		fdb_list;
@@ -1645,7 +1647,6 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
-int nbp_switchdev_hwdom_set(struct net_bridge_port *p);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1659,17 +1660,15 @@ void br_switchdev_fdb_notify(struct net_bridge *br,
 int br_switchdev_port_vlan_add(struct net_device *dev, u16 vid, u16 flags,
 			       struct netlink_ext_ack *extack);
 int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid);
+int nbp_switchdev_add(struct net_bridge_port *p);
+void nbp_switchdev_del(struct net_bridge_port *p);
+void br_switchdev_init(struct net_bridge *br);
 
 static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 {
 	skb->offload_fwd_mark = 0;
 }
 #else
-static inline int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
-{
-	return 0;
-}
-
 static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 					    struct sk_buff *skb)
 {
@@ -1710,6 +1709,20 @@ br_switchdev_fdb_notify(struct net_bridge *br,
 static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 {
 }
+
+static inline int nbp_switchdev_add(struct net_bridge_port *p)
+{
+	return 0;
+}
+
+static inline void nbp_switchdev_del(struct net_bridge_port *p)
+{
+}
+
+static inline void br_switchdev_init(struct net_bridge *br)
+{
+}
+
 #endif /* CONFIG_NET_SWITCHDEV */
 
 /* br_arp_nd_proxy.c */
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 833fd30482c2..f3120f13c293 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,38 +8,6 @@
 
 #include "br_private.h"
 
-static int br_switchdev_hwdom_get(struct net_bridge *br, struct net_device *dev)
-{
-	struct net_bridge_port *p;
-
-	/* dev is yet to be added to the port list. */
-	list_for_each_entry(p, &br->port_list, list) {
-		if (netdev_port_same_parent_id(dev, p->dev))
-			return p->hwdom;
-	}
-
-	return ++br->last_hwdom;
-}
-
-int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
-{
-	struct netdev_phys_item_id ppid = { };
-	int err;
-
-	ASSERT_RTNL();
-
-	err = dev_get_port_parent_id(p->dev, &ppid, true);
-	if (err) {
-		if (err == -EOPNOTSUPP)
-			return 0;
-		return err;
-	}
-
-	p->hwdom = br_switchdev_hwdom_get(p->br, p->dev);
-
-	return 0;
-}
-
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
@@ -156,3 +124,65 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
 
 	return switchdev_port_obj_del(dev, &v.obj);
 }
+
+static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
+{
+	struct net_bridge *br = joining->br;
+	struct net_bridge_port *p;
+	int hwdom;
+
+	/* joining is yet to be added to the port list. */
+	list_for_each_entry(p, &br->port_list, list) {
+		if (netdev_port_same_parent_id(joining->dev, p->dev)) {
+			joining->hwdom = p->hwdom;
+			return 0;
+		}
+	}
+
+	hwdom = find_next_zero_bit(&br->busy_hwdoms, BR_HWDOM_MAX, 1);
+	if (hwdom >= BR_HWDOM_MAX)
+		return -EBUSY;
+
+	set_bit(hwdom, &br->busy_hwdoms);
+	joining->hwdom = hwdom;
+	return 0;
+}
+
+static void nbp_switchdev_hwdom_put(struct net_bridge_port *leaving)
+{
+	struct net_bridge *br = leaving->br;
+	struct net_bridge_port *p;
+
+	/* leaving is no longer in the port list. */
+	list_for_each_entry(p, &br->port_list, list) {
+		if (p->hwdom == leaving->hwdom)
+			return;
+	}
+
+	clear_bit(leaving->hwdom, &br->busy_hwdoms);
+}
+
+int nbp_switchdev_add(struct net_bridge_port *p)
+{
+	struct netdev_phys_item_id ppid = { };
+	int err;
+
+	ASSERT_RTNL();
+
+	err = dev_get_port_parent_id(p->dev, &ppid, true);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+
+	return nbp_switchdev_hwdom_set(p);
+}
+
+void nbp_switchdev_del(struct net_bridge_port *p)
+{
+	ASSERT_RTNL();
+
+	if (p->hwdom)
+		nbp_switchdev_hwdom_put(p);
+}
-- 
2.25.1
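The allocate-and-recycle scheme of this patch reduces to a small user-space model (illustrative only; the real code uses the kernel's find_next_zero_bit/set_bit/clear_bit on br->busy_hwdoms): IDs are single bits of an unsigned long, bit 0 is reserved for "no switchdev", and an ID is freed once the last port using it leaves.

```c
#include <assert.h>

/* Sketch of the bitmap-based hwdom allocator; HWDOM_MAX mirrors
 * BR_HWDOM_MAX, i.e. BITS_PER_LONG distinct domains per bridge.
 */
#define HWDOM_MAX ((int)(8 * sizeof(unsigned long)))

/* Allocate the lowest free hwdom >= 1, or -1 if all are busy
 * (the kernel code returns -EBUSY instead).
 */
static int hwdom_get(unsigned long *busy)
{
	int i;

	for (i = 1; i < HWDOM_MAX; i++) {	/* bit 0 is reserved */
		if (!(*busy & (1UL << i))) {
			*busy |= 1UL << i;
			return i;
		}
	}
	return -1;
}

/* Called when the last member of a hwdom leaves the bridge. */
static void hwdom_put(unsigned long *busy, int hwdom)
{
	*busy &= ~(1UL << hwdom);
}
```

Because IDs are recycled, a bridge whose switchdev ports churn never exhausts the BITS_PER_LONG space as the old counter-based scheme eventually would, and "which hwdoms saw this skb" becomes a single unsigned long per packet.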
Vladimir Oltean
2021-Jul-03 11:56 UTC
[Bridge] [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded
From: Tobias Waldekranz <tobias at waldekranz.com>

Allow switchdevs to forward frames from the CPU in accordance with the
bridge configuration in the same way as is done between bridge
ports. This means that the bridge will only send a single skb towards
one of the ports under the switchdev's control, and expects the driver
to deliver the packet to all eligible ports in its domain.

Primarily this improves the performance of multicast flows with
multiple subscribers, as it allows the hardware to perform the frame
replication.

The basic flow between the driver and the bridge is as follows:

- The switchdev accepts the offload by returning a non-null pointer
  from .ndo_dfwd_add_station when the port is added to the bridge.

- The bridge sends offloadable skbs to one of the ports under the
  switchdev's control using dev_queue_xmit_accel.

- The switchdev notices the offload by checking for a non-NULL
  "sb_dev" in the core's call to .ndo_select_queue.

v1->v2:
- convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
- introduce a static key "br_switchdev_fwd_offload_used" to minimize
  the impact of the newly introduced feature on all the setups which
  don't have hardware that can make use of it
- introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
  line access
- reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
  __br_forward()
- do not strip VLAN on egress if forwarding offload on VLAN-aware
  bridge is being used
- propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP

Signed-off-by: Tobias Waldekranz <tobias at waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com>
---
 include/linux/if_bridge.h |  1 +
 net/bridge/br_forward.c   | 18 +++++++-
 net/bridge/br_private.h   | 24 +++++++++++
 net/bridge/br_switchdev.c | 87 +++++++++++++++++++++++++++++++++++++--
 net/bridge/br_vlan.c      | 10 ++++-
 5 files changed, 135 insertions(+), 5 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b651c5e32a28..a47b86ab7f96 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -57,6 +57,7 @@ struct br_ip_list {
 #define BR_MRP_AWARE		BIT(17)
 #define BR_MRP_LOST_CONT	BIT(18)
 #define BR_MRP_LOST_IN_CONT	BIT(19)
+#define BR_FWD_OFFLOAD		BIT(20)
 
 #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 07856362538f..919246a2c7eb 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
 
 int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	struct net_device *sb_dev = NULL;
+
 	skb_push(skb, ETH_HLEN);
 	if (!is_skb_forwardable(skb->dev, skb))
 		goto drop;
@@ -48,7 +50,14 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
 			skb_set_network_header(skb, depth);
 	}
 
-	dev_queue_xmit(skb);
+	if (br_switchdev_accels_skb(skb)) {
+		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
+
+		WARN_ON_ONCE(br_vlan_enabled(sb_dev) &&
+			     !skb_vlan_tag_present(skb));
+	}
+
+	dev_queue_xmit_accel(skb, sb_dev);
 
 	return 0;
 
@@ -76,6 +85,11 @@ static void __br_forward(const struct net_bridge_port *to,
 	struct net *net;
 	int br_hook;
 
+	/* Mark the skb for forwarding offload early so that br_handle_vlan()
+	 * can know whether to pop the VLAN header on egress or keep it.
+	 */
+	nbp_switchdev_frame_mark_accel(to, skb);
+
 	vg = nbp_vlan_group_rcu(to);
 	skb = br_handle_vlan(to->br, to, vg, skb);
 	if (!skb)
@@ -174,6 +188,8 @@ static struct net_bridge_port *maybe_deliver(
 	if (!should_deliver(p, skb))
 		return prev;
 
+	nbp_switchdev_frame_mark_fwd(p, skb);
+
 	if (!prev)
 		goto out;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 9ff09a32e3f8..655212df57f7 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -332,6 +332,7 @@ struct net_bridge_port {
 #endif
 #ifdef CONFIG_NET_SWITCHDEV
 	int				hwdom;
+	void				*accel_priv;
 #endif
 	u16				group_fwd_mask;
 	u16				backup_redirected_cnt;
@@ -508,7 +509,9 @@ struct br_input_skb_cb {
 #endif
 
 #ifdef CONFIG_NET_SWITCHDEV
+	u8 fwd_accel:1;
 	int src_hwdom;
+	unsigned long fwd_hwdoms;
 #endif
 };
 
@@ -1647,6 +1650,12 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
+bool br_switchdev_accels_skb(struct sk_buff *skb);
+
+void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+				    struct sk_buff *skb);
+void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+				  struct sk_buff *skb);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1669,6 +1678,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 	skb->offload_fwd_mark = 0;
 }
 #else
+static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
+{
+	return false;
+}
+
+static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+						  struct sk_buff *skb)
+{
+}
+
+static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+						struct sk_buff *skb)
+{
+}
+
 static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 					    struct sk_buff *skb)
 {
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index f3120f13c293..8653d9a540a1 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,6 +8,40 @@
 
 #include "br_private.h"
 
+static struct static_key_false br_switchdev_fwd_offload_used;
+
+static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,
+				    const struct sk_buff *skb)
+{
+	if (!static_branch_unlikely(&br_switchdev_fwd_offload_used))
+		return false;
+
+	return (p->flags & BR_FWD_OFFLOAD) &&
+	       (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);
+}
+
+bool br_switchdev_accels_skb(struct sk_buff *skb)
+{
+	if (!static_branch_unlikely(&br_switchdev_fwd_offload_used))
+		return false;
+
+	return BR_INPUT_SKB_CB(skb)->fwd_accel;
+}
+
+void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+				    struct sk_buff *skb)
+{
+	if (nbp_switchdev_can_accel(p, skb))
+		BR_INPUT_SKB_CB(skb)->fwd_accel = true;
+}
+
+void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+				  struct sk_buff *skb)
+{
+	if (nbp_switchdev_can_accel(p, skb))
+		set_bit(p->hwdom, &BR_INPUT_SKB_CB(skb)->fwd_hwdoms);
+}
+
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
@@ -18,8 +52,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 				  const struct sk_buff *skb)
 {
-	return !skb->offload_fwd_mark ||
-	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
+	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);
+
+	return !test_bit(p->hwdom, &cb->fwd_hwdoms) &&
+	       (!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
 }
 
 /* Flags that can be offloaded to hardware */
@@ -125,6 +161,39 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
 
 	return switchdev_port_obj_del(dev, &v.obj);
 }
+
+static int nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
+{
+	void *priv;
+
+	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
+		return 0;
+
+	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
+	if (IS_ERR(priv)) {
+		int err = PTR_ERR(priv);
+
+		return err == -EOPNOTSUPP ? 0 : err;
+	}
+
+	p->accel_priv = priv;
+	p->flags |= BR_FWD_OFFLOAD;
+	static_branch_inc(&br_switchdev_fwd_offload_used);
+
+	return 0;
+}
+
+static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
+{
+	if (!p->accel_priv)
+		return;
+
+	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
+
+	p->accel_priv = NULL;
+	p->flags &= ~BR_FWD_OFFLOAD;
+	static_branch_dec(&br_switchdev_fwd_offload_used);
+}
+
 static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
 {
 	struct net_bridge *br = joining->br;
@@ -176,13 +245,25 @@ int nbp_switchdev_add(struct net_bridge_port *p)
 		return err;
 	}
 
-	return nbp_switchdev_hwdom_set(p);
+	err = nbp_switchdev_hwdom_set(p);
+	if (err)
+		return err;
+
+	if (p->hwdom) {
+		err = nbp_switchdev_fwd_offload_add(p);
+		if (err)
+			return err;
+	}
+
+	return 0;
 }
 
 void nbp_switchdev_del(struct net_bridge_port *p)
 {
 	ASSERT_RTNL();
 
+	nbp_switchdev_fwd_offload_del(p);
+
 	if (p->hwdom)
 		nbp_switchdev_hwdom_put(p);
 }
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index a08e9f193009..bf014efa5851 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -457,7 +457,15 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
 		u64_stats_update_end(&stats->syncp);
 	}
 
-	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
+	/* If the skb will be sent using forwarding offload, the assumption is
+	 * that the switchdev will inject the packet into hardware together
+	 * with the bridge VLAN, so that it can be forwarded according to that
+	 * VLAN. The switchdev should deal with popping the VLAN header in
+	 * hardware on each egress port as appropriate. So only strip the VLAN
+	 * header if forwarding offload is not being used.
+	 */
+	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED &&
+	    !br_switchdev_accels_skb(skb))
 		__vlan_hwaccel_clear_tag(skb);
 
 	if (p && (p->flags & BR_VLAN_TUNNEL) &&
-- 
2.25.1
Vladimir Oltean
2021-Jul-03 11:57 UTC
[Bridge] [RFC PATCH v2 net-next 05/10] net: extract helpers for binding a subordinate device to TX queues
Currently, the acceleration scheme for offloading the data plane of upper devices to hardware is geared towards a single topology: that of macvlan interfaces, where there is a lower interface with many uppers. We would like to use the same acceleration framework for the bridge data plane, but there we have a single upper interface with many lowers. This matters because commit ffcfe25bb50f ("net: Add support for subordinate device traffic classes") has pulled some logic out of ixgbe_select_queue() and moved it into net/core/dev.c as if it was generic enough to do so. In particular, it created a scheme where: - ixgbe calls netdev_set_sb_channel() on the macvlan interface, which changes the macvlan's dev->num_tc to a negative value (-channel). The value itself is not used anywhere in any relevant manner, it only matters that it's negative, because: - when ixgbe calls netdev_bind_sb_channel_queue(), the macvlan is checked for being configured as a subordinate channel (its num_tc must be smaller than zero) and its tc_to_txq guts are being scavenged to hold what ixgbe puts in it (for each traffic class, a mapping is recorded towards an ixgbe TX ring dedicated to that macvlan). This is safe because "we can pretty much guarantee that the tc_to_txq mappings and XPS maps for the upper device are unused". - when a packet is to be transmitted on the ixgbe interface on behalf of a macvlan upper and a TX queue is to be selected, netdev_pick_tx() -> skb_tx_hash() looks at the tc_to_txq array of the macvlan sb_dev, which was populated by ixgbe. The packet reaches the dedicated TX ring. Fun, but netdev hierarchies with one upper and many lowers cannot do this, because if multiple lowers tried to lay their eggs into the same tc_to_txq array of the same upper, they would have to coordinate somehow. So it doesn't quite work. 
Nonetheless, to make use of the subordinate device concept, we need access to the sb_dev in the ndo_start_xmit() method, and the only place we can retrieve it from is: netdev_get_tx_queue(dev, skb_get_queue_mapping(skb))->sb_dev So we need that pointer populated and not much else. Refactor the code which assigns the subordinate device pointer per lower interface TX queue into a dedicated set of helpers and export it. Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com> --- include/linux/netdevice.h | 7 +++++++ net/core/dev.c | 31 +++++++++++++++++++++++-------- 2 files changed, 30 insertions(+), 8 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index eaf5bb008aa9..16c88e416693 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2301,6 +2301,13 @@ static inline void net_prefetchw(void *p) #endif } +void netdev_bind_tx_queues_to_sb_dev(struct net_device *dev, + struct net_device *sb_dev, + u16 count, u16 offset); + +void netdev_unbind_tx_queues_from_sb_dev(struct net_device *dev, + struct net_device *sb_dev); + void netdev_unbind_sb_channel(struct net_device *dev, struct net_device *sb_dev); int netdev_bind_sb_channel_queue(struct net_device *dev, diff --git a/net/core/dev.c b/net/core/dev.c index c253c2aafe97..02e3a6941381 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2957,21 +2957,37 @@ int netdev_set_num_tc(struct net_device *dev, u8 num_tc) } EXPORT_SYMBOL(netdev_set_num_tc); -void netdev_unbind_sb_channel(struct net_device *dev, - struct net_device *sb_dev) +void netdev_bind_tx_queues_to_sb_dev(struct net_device *dev, + struct net_device *sb_dev, + u16 count, u16 offset) +{ + while (count--) + netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev; +} +EXPORT_SYMBOL_GPL(netdev_bind_tx_queues_to_sb_dev); + +void netdev_unbind_tx_queues_from_sb_dev(struct net_device *dev, + struct net_device *sb_dev) { struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues]; + while (txq-- != 
&dev->_tx[0]) { + if (txq->sb_dev == sb_dev) + txq->sb_dev = NULL; + } +} +EXPORT_SYMBOL_GPL(netdev_unbind_tx_queues_from_sb_dev); + +void netdev_unbind_sb_channel(struct net_device *dev, + struct net_device *sb_dev) +{ #ifdef CONFIG_XPS netif_reset_xps_queues_gt(sb_dev, 0); #endif memset(sb_dev->tc_to_txq, 0, sizeof(sb_dev->tc_to_txq)); memset(sb_dev->prio_tc_map, 0, sizeof(sb_dev->prio_tc_map)); - while (txq-- != &dev->_tx[0]) { - if (txq->sb_dev == sb_dev) - txq->sb_dev = NULL; - } + netdev_unbind_tx_queues_from_sb_dev(dev, sb_dev); } EXPORT_SYMBOL(netdev_unbind_sb_channel); @@ -2994,8 +3010,7 @@ int netdev_bind_sb_channel_queue(struct net_device *dev, /* Provide a way for Tx queue to find the tc_to_txq map or * XPS map for itself. */ - while (count--) - netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev; + netdev_bind_tx_queues_to_sb_dev(dev, sb_dev, count, offset); return 0; } -- 2.25.1
Vladimir Oltean
2021-Jul-03 11:57 UTC
[Bridge] [RFC PATCH v2 net-next 06/10] net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
When using a bridge upper as a subordinate device, switchdev interfaces must allocate a TX queue for it, in order to know in .ndo_start_xmit() whether the skb comes from the bridge or not. The dedicated TX queue has its ->sb_dev pointer pointing to the bridge device, and the only assumption that can be made is that any skb on that queue must be coming from the bridge, so no other skbs can be sent on that queue. The default netdev_pick_tx() -> skb_tx_hash() policy hashes among TX queues of the same priority. To make the scheme work, switchdev drivers offloading a bridge need to implement their own .ndo_select_queue() which selects the dedicated TX queue for packets coming from the sb_dev, and lets netdev_pick_tx() choose among the rest of the TX queues for everything else. The implication is that the dedicated TX queue for the sb_dev must be outside of the dev->num_real_tx_queues range, because otherwise, netdev_pick_tx() might choose that TX queue for packets which aren't actually coming from our sb_dev, and the assumption made in the driver's .ndo_start_xmit() would be wrong. This patch lifts the restriction in netdev_core_pick_tx() which says that the dedicated TX queue for the sb_dev cannot be larger than num_real_tx_queues. With this, netdev_pick_tx() can safely pick among the non-dedicated TX queues. 
Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com> --- include/linux/netdevice.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 16c88e416693..d43f6ddd12a1 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3697,10 +3697,10 @@ static inline void netdev_reset_queue(struct net_device *dev_queue) */ static inline u16 netdev_cap_txqueue(struct net_device *dev, u16 queue_index) { - if (unlikely(queue_index >= dev->real_num_tx_queues)) { - net_warn_ratelimited("%s selects TX queue %d, but real number of TX queues is %d\n", + if (unlikely(queue_index >= dev->num_tx_queues)) { + net_warn_ratelimited("%s selects TX queue %d, but number of TX queues is %d\n", dev->name, queue_index, - dev->real_num_tx_queues); + dev->num_tx_queues); return 0; } -- 2.25.1
Vladimir Oltean
2021-Jul-03 11:57 UTC
[Bridge] [RFC PATCH v2 net-next 07/10] net: dsa: track the number of switches in a tree
In preparation of supporting data plane forwarding on behalf of a software bridge, some drivers might need to view bridges as virtual switches behind the CPU port in a cross-chip topology. Give them some help and let them know how many physical switches there are in the tree, so that they can count the virtual switches starting from that number on. Note that the first dsa_switch_ops method where this information is reliably available is .setup(). This is because of how DSA works: in a tree with 3 switches, each calling dsa_register_switch(), the first 2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table() and exit with error code 0 because the topology is not complete. Since probing is parallel at this point, one switch does not know about the existence of the other. Then the third switch comes, and for it, dsa_tree_setup_routing_table() returns complete = true. This switch goes ahead and calls dsa_tree_setup_switches() for everybody else, calling their .setup() methods too. This acts as the synchronization point. Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com> --- include/net/dsa.h | 3 +++ net/dsa/dsa2.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/include/net/dsa.h b/include/net/dsa.h index 33f40c1ec379..89626eab92b9 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -159,6 +159,9 @@ struct dsa_switch_tree { */ struct net_device **lags; unsigned int lags_len; + + /* Track the largest switch index within a tree */ + unsigned int last_switch; }; #define dsa_lags_foreach_id(_id, _dst) \ diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c index 185629f27f80..de5e93ba2a9d 100644 --- a/net/dsa/dsa2.c +++ b/net/dsa/dsa2.c @@ -1265,6 +1265,9 @@ static int dsa_switch_parse_member_of(struct dsa_switch *ds, return -EEXIST; } + if (ds->dst->last_switch < ds->index) + ds->dst->last_switch = ds->index; + return 0; } -- 2.25.1
Vladimir Oltean
2021-Jul-03 11:57 UTC
[Bridge] [RFC PATCH v2 net-next 08/10] net: dsa: add support for bridge forwarding offload
For a DSA switch, to offload the forwarding process of a bridge device means to send the packets coming from the software bridge as data plane packets. This is contrary to everything that DSA has done so far, because the current taggers only know how to send control packets (ones that target a specific destination port), whereas data plane packets are supposed to be forwarded according to the FDB lookup, much like packets ingressing on any regular ingress port. If the FDB lookup process returns multiple destination ports (flooding, multicast), then replication is also handled by the switch hardware - the bridge only sends a single packet and avoids the skb_clone(). DSA plays a substantial role in backing the forwarding offload, and leaves relatively few things up to the switch driver. In particular, DSA creates an accel_priv structure per port associated with each possible bridge upper, and for each bridge it keeps a zero-based index (the number of the bridge). Multiple ports enslaved to the same bridge have a pointer to the same accel_priv structure. The way this offloading scheme (borrowed from macvlan offloading on Intel hardware) works is that lower interfaces are supposed to reserve a netdev TX queue corresponding to each offloadable upper ("subordinate") interface. DSA reserves a single TX queue per port, a queue outside the num_real_tx_queues range. That special TX queue has a ->sb_dev pointer, which is the reason why we use it in the first place (to have access to the sb_dev from .ndo_start_xmit). DSA then implements a custom .ndo_select_queue to direct packets on behalf of the bridge to that special queue, and leaves netdev_pick_tx to pick among the num_real_tx_queues (excluding the sb_dev queue) using the default policies. 
It is assumed that both the tagger must support forwarding offload (it must search for the subordinate device - the bridge), and must therefore set the ".bridge_fwd_offload = true" capability, as well as the switch driver (this must set in ds->num_fwd_offloading_bridges the maximum number of bridges for which it can offload forwarding). The tagger can check if the TX queue that the skb is being transmitted on has a subordinate device (sb_dev) associated with it or not. If it does, it can be sure that the subordinate device is a bridge, and it can use the dp->accel_priv to get further information about that bridge, such as the bridge number. It can then compose a DSA tag for injecting a data plane packet into that bridge number. For the switch driver side, we offer two new pair of dsa_switch_ops methods which are modeled after .port_bridge_{join,leave} and .crosschip_bridge_{join,leave}. These are .port_bridge_fwd_offload_{add,del} and the cross-chip equivalents. These methods are provided in case the driver needs to configure the hardware to treat packets coming from that bridge software interface as data plane packets. The bridge calls our .ndo_dfwd_add_station immediately after netdev_master_upper_dev_link(), so to switch drivers, the effect is that the .port_bridge_fwd_offload_add() method is called immediately after .port_bridge_join(). Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com> --- include/net/dsa.h | 34 ++++++++++++ net/dsa/dsa_priv.h | 17 ++++++ net/dsa/port.c | 35 ++++++++++++ net/dsa/slave.c | 134 ++++++++++++++++++++++++++++++++++++++++++++- net/dsa/switch.c | 58 ++++++++++++++++++++ 5 files changed, 277 insertions(+), 1 deletion(-) diff --git a/include/net/dsa.h b/include/net/dsa.h index 89626eab92b9..5d111cc2e403 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -103,6 +103,7 @@ struct dsa_device_ops { * its RX filter. 
*/ bool promisc_on_master; + bool bridge_fwd_offload; }; /* This structure defines the control interfaces that are overlayed by the @@ -162,6 +163,9 @@ struct dsa_switch_tree { /* Track the largest switch index within a tree */ unsigned int last_switch; + + /* Track the bridges with forwarding offload enabled */ + unsigned long fwd_offloading_bridges; }; #define dsa_lags_foreach_id(_id, _dst) \ @@ -224,6 +228,10 @@ struct dsa_mall_tc_entry { }; }; +struct dsa_bridge_fwd_accel_priv { + struct net_device *sb_dev; + int bridge_num; +}; struct dsa_port { /* A CPU port is physically connected to a master device. @@ -294,6 +302,8 @@ struct dsa_port { struct list_head fdbs; struct list_head mdbs; + struct dsa_bridge_fwd_accel_priv *accel_priv; + bool setup; }; @@ -410,6 +420,12 @@ struct dsa_switch { */ unsigned int num_lag_ids; + /* Drivers that support bridge forwarding offload should set this to + * the maximum number of bridges spanning the same switch tree that can + * be offloaded. + */ + unsigned int num_fwd_offloading_bridges; + size_t num_ports; }; @@ -693,6 +709,14 @@ struct dsa_switch_ops { struct net_device *bridge); void (*port_bridge_leave)(struct dsa_switch *ds, int port, struct net_device *bridge); + /* Called right after .port_bridge_join() */ + int (*port_bridge_fwd_offload_add)(struct dsa_switch *ds, int port, + struct net_device *bridge, + int bridge_num); + /* Called right before .port_bridge_leave() */ + void (*port_bridge_fwd_offload_del)(struct dsa_switch *ds, int port, + struct net_device *bridge, + int bridge_num); void (*port_stp_state_set)(struct dsa_switch *ds, int port, u8 state); void (*port_fast_age)(struct dsa_switch *ds, int port); @@ -777,6 +801,16 @@ struct dsa_switch_ops { struct netdev_lag_upper_info *info); int (*crosschip_lag_leave)(struct dsa_switch *ds, int sw_index, int port, struct net_device *lag); + int (*crosschip_bridge_fwd_offload_add)(struct dsa_switch *ds, + int tree_index, + int sw_index, int port, + struct net_device 
*br, + int bridge_num); + void (*crosschip_bridge_fwd_offload_del)(struct dsa_switch *ds, + int tree_index, + int sw_index, int port, + struct net_device *br, + int bridge_num); /* * PTP functionality diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h index f201c33980bf..c577338b5bb7 100644 --- a/net/dsa/dsa_priv.h +++ b/net/dsa/dsa_priv.h @@ -14,10 +14,14 @@ #include <net/dsa.h> #include <net/gro_cells.h> +#define DSA_MAX_NUM_OFFLOADING_BRIDGES BITS_PER_LONG + enum { DSA_NOTIFIER_AGEING_TIME, DSA_NOTIFIER_BRIDGE_JOIN, DSA_NOTIFIER_BRIDGE_LEAVE, + DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD, + DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL, DSA_NOTIFIER_FDB_ADD, DSA_NOTIFIER_FDB_DEL, DSA_NOTIFIER_HOST_FDB_ADD, @@ -54,6 +58,15 @@ struct dsa_notifier_bridge_info { int port; }; +/* DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_* */ +struct dsa_notifier_bridge_fwd_offload_info { + struct net_device *br; + int tree_index; + int sw_index; + int port; + int bridge_num; +}; + /* DSA_NOTIFIER_FDB_* */ struct dsa_notifier_fdb_info { int sw_index; @@ -197,6 +210,10 @@ int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br, int dsa_port_pre_bridge_leave(struct dsa_port *dp, struct net_device *br, struct netlink_ext_ack *extack); void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br); +int dsa_port_bridge_fwd_offload_add(struct dsa_port *dp, + struct net_device *br, int bridge_num); +void dsa_port_bridge_fwd_offload_del(struct dsa_port *dp, + struct net_device *br, int bridge_num); int dsa_port_lag_change(struct dsa_port *dp, struct netdev_lag_lower_state_info *linfo); int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev, diff --git a/net/dsa/port.c b/net/dsa/port.c index 28b45b7e66df..3c268d00908c 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -344,6 +344,41 @@ void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br) dsa_port_switchdev_unsync_attrs(dp); } +int dsa_port_bridge_fwd_offload_add(struct dsa_port *dp, + struct net_device *br, int 
bridge_num) +{ + struct dsa_notifier_bridge_fwd_offload_info info = { + .tree_index = dp->ds->dst->index, + .sw_index = dp->ds->index, + .port = dp->index, + .br = br, + .bridge_num = bridge_num, + }; + + return dsa_port_notify(dp, DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD, + &info); +} + +void dsa_port_bridge_fwd_offload_del(struct dsa_port *dp, + struct net_device *br, int bridge_num) +{ + struct dsa_notifier_bridge_fwd_offload_info info = { + .tree_index = dp->ds->dst->index, + .sw_index = dp->ds->index, + .port = dp->index, + .br = br, + .bridge_num = bridge_num, + }; + struct net_device *dev = dp->slave; + int err; + + err = dsa_port_notify(dp, DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL, + &info); + if (err) + netdev_err(dev, "failed to notify fwd offload del: %pe\n", + ERR_PTR(err)); +} + int dsa_port_lag_change(struct dsa_port *dp, struct netdev_lag_lower_state_info *linfo) { diff --git a/net/dsa/slave.c b/net/dsa/slave.c index ffbba1e71551..003f3bb9c51a 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -1679,6 +1679,119 @@ static int dsa_slave_fill_forward_path(struct net_device_path_ctx *ctx, return 0; } +/* Direct packets coming from the data plane of the bridge to a dedicated TX + * queue, and let the generic netdev_pick_tx() handle the rest via hashing + * among TX queues of the same priority. 
+ */ +static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb, + struct net_device *sb_dev) +{ + struct dsa_port *dp = dsa_slave_to_port(dev); + struct dsa_switch *ds = dp->ds; + + if (unlikely(sb_dev)) + return ds->num_tx_queues; + + return netdev_pick_tx(dev, skb, sb_dev); +} + +static struct dsa_bridge_fwd_accel_priv * +dsa_find_accel_priv_by_sb_dev(struct dsa_switch_tree *dst, + struct net_device *sb_dev) +{ + struct dsa_port *dp; + + list_for_each_entry(dp, &dst->ports, list) + if (dp->accel_priv && dp->accel_priv->sb_dev == sb_dev) + return dp->accel_priv; + + return NULL; +} + +static void dsa_slave_fwd_offload_del(struct net_device *dev, void *sb_dev) +{ + struct dsa_bridge_fwd_accel_priv *accel_priv; + struct dsa_port *dp = dsa_slave_to_port(dev); + struct dsa_switch *ds = dp->ds; + struct dsa_switch_tree *dst; + int bridge_num; + + if (!netif_is_bridge_master(sb_dev)) + return; + + dst = ds->dst; + + accel_priv = dp->accel_priv; + bridge_num = accel_priv->bridge_num; + + dp->accel_priv = NULL; + + /* accel_priv no longer in use, time to clean it up */ + if (!dsa_find_accel_priv_by_sb_dev(dst, sb_dev)) { + clear_bit(accel_priv->bridge_num, &dst->fwd_offloading_bridges); + kfree(accel_priv); + } + + netdev_unbind_tx_queues_from_sb_dev(dev, sb_dev); + + /* Notify the chips only once the offload has been deactivated, so + * that they can update their configuration accordingly. 
+ */ + dsa_port_bridge_fwd_offload_del(dp, sb_dev, bridge_num); +} + +static void *dsa_slave_fwd_offload_add(struct net_device *dev, + struct net_device *sb_dev) +{ + struct dsa_bridge_fwd_accel_priv *accel_priv; + struct dsa_port *dp = dsa_slave_to_port(dev); + struct dsa_switch *ds = dp->ds; + struct dsa_switch_tree *dst; + int err; + + if (!netif_is_bridge_master(sb_dev)) + return ERR_PTR(-EOPNOTSUPP); + + dst = ds->dst; + + accel_priv = dsa_find_accel_priv_by_sb_dev(dst, sb_dev); + if (!accel_priv) { + /* First port that offloads forwarding for this bridge */ + int bridge_num; + + bridge_num = find_first_zero_bit(&dst->fwd_offloading_bridges, + DSA_MAX_NUM_OFFLOADING_BRIDGES); + if (bridge_num >= ds->num_fwd_offloading_bridges) + return ERR_PTR(-EOPNOTSUPP); + + accel_priv = kzalloc(sizeof(*accel_priv), GFP_KERNEL); + if (!accel_priv) + return ERR_PTR(-ENOMEM); + + accel_priv->sb_dev = sb_dev; + accel_priv->bridge_num = bridge_num; + + set_bit(bridge_num, &dst->fwd_offloading_bridges); + } + + dp->accel_priv = accel_priv; + + /* There can be only one master upper interface for each port in the + * case of bridge forwarding offload, so just bind a single TX queue to + * that subordinate device, the last one. 
+ */ + netdev_bind_tx_queues_to_sb_dev(dev, sb_dev, 1, ds->num_tx_queues); + + err = dsa_port_bridge_fwd_offload_add(dp, sb_dev, + accel_priv->bridge_num); + if (err) { + dsa_slave_fwd_offload_del(dev, sb_dev); + return ERR_PTR(err); + } + + return accel_priv; +} + static const struct net_device_ops dsa_slave_netdev_ops = { .ndo_open = dsa_slave_open, .ndo_stop = dsa_slave_close, @@ -1703,6 +1816,9 @@ static const struct net_device_ops dsa_slave_netdev_ops = { .ndo_get_devlink_port = dsa_slave_get_devlink_port, .ndo_change_mtu = dsa_slave_change_mtu, .ndo_fill_forward_path = dsa_slave_fill_forward_path, + .ndo_dfwd_add_station = dsa_slave_fwd_offload_add, + .ndo_dfwd_del_station = dsa_slave_fwd_offload_del, + .ndo_select_queue = dsa_slave_select_queue, }; static struct device_type dsa_type = { @@ -1819,6 +1935,11 @@ void dsa_slave_setup_tagger(struct net_device *slave) slave->needed_tailroom += master->needed_tailroom; p->xmit = cpu_dp->tag_ops->xmit; + + if (cpu_dp->tag_ops->bridge_fwd_offload) + slave->features |= NETIF_F_HW_L2FW_DOFFLOAD; + else + slave->features &= ~NETIF_F_HW_L2FW_DOFFLOAD; } static struct lock_class_key dsa_slave_netdev_xmit_lock_key; @@ -1877,10 +1998,21 @@ int dsa_slave_create(struct dsa_port *port) slave_dev = alloc_netdev_mqs(sizeof(struct dsa_slave_priv), name, NET_NAME_UNKNOWN, ether_setup, - ds->num_tx_queues, 1); + ds->num_tx_queues + 1, 1); if (slave_dev == NULL) return -ENOMEM; + /* To avoid changing the number of TX queues at runtime depending on + * whether the tagging protocol in use supports bridge forwarding + * offload or not, just assume that all tagging protocols do, and + * unconditionally register one extra TX queue to back that offload. + * Then set num_real_tx_queues such that it will never be selected by + * netdev_pick_tx(), just by ourselves. 
+ */ + ret = netif_set_real_num_tx_queues(slave_dev, ds->num_tx_queues); + if (ret) + goto out_free; + slave_dev->features = master->vlan_features | NETIF_F_HW_TC; if (ds->ops->port_vlan_add && ds->ops->port_vlan_del) slave_dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER; diff --git a/net/dsa/switch.c b/net/dsa/switch.c index 248455145982..f0033906f36b 100644 --- a/net/dsa/switch.c +++ b/net/dsa/switch.c @@ -154,6 +154,58 @@ static int dsa_switch_bridge_leave(struct dsa_switch *ds, return 0; } +static int +dsa_switch_bridge_fwd_offload_add(struct dsa_switch *ds, + struct dsa_notifier_bridge_fwd_offload_info *info) +{ + struct dsa_switch_tree *dst = ds->dst; + int tree_index = info->tree_index; + int bridge_num = info->bridge_num; + struct net_device *br = info->br; + int sw_index = info->sw_index; + int port = info->port; + + if (dst->index == tree_index && ds->index == sw_index && + ds->ops->port_bridge_fwd_offload_add) + return ds->ops->port_bridge_fwd_offload_add(ds, port, br, + bridge_num); + + if ((dst->index != tree_index || ds->index != sw_index) && + ds->ops->crosschip_bridge_fwd_offload_add) + return ds->ops->crosschip_bridge_fwd_offload_add(ds, + tree_index, + sw_index, + port, br, + bridge_num); + + return -EOPNOTSUPP; +} + +static int +dsa_switch_bridge_fwd_offload_del(struct dsa_switch *ds, + struct dsa_notifier_bridge_fwd_offload_info *info) +{ + struct dsa_switch_tree *dst = ds->dst; + int tree_index = info->tree_index; + int bridge_num = info->bridge_num; + struct net_device *br = info->br; + int sw_index = info->sw_index; + int port = info->port; + + if (dst->index == tree_index && ds->index == sw_index && + ds->ops->port_bridge_fwd_offload_del) + ds->ops->port_bridge_fwd_offload_del(ds, port, br, + bridge_num); + + if ((dst->index != info->tree_index || ds->index != info->sw_index) && + ds->ops->crosschip_bridge_fwd_offload_del) + ds->ops->crosschip_bridge_fwd_offload_del(ds, tree_index, + sw_index, port, br, + bridge_num); + + return 0; +} + /* 
Matches for all upstream-facing ports (the CPU port and all upstream-facing * DSA links) that sit between the targeted port on which the notifier was * emitted and its dedicated CPU port. @@ -663,6 +715,12 @@ static int dsa_switch_event(struct notifier_block *nb, case DSA_NOTIFIER_BRIDGE_LEAVE: err = dsa_switch_bridge_leave(ds, info); break; + case DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD: + err = dsa_switch_bridge_fwd_offload_add(ds, info); + break; + case DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL: + err = dsa_switch_bridge_fwd_offload_del(ds, info); + break; case DSA_NOTIFIER_FDB_ADD: err = dsa_switch_fdb_add(ds, info); break; -- 2.25.1
Vladimir Oltean
2021-Jul-03 11:57 UTC
[Bridge] [RFC PATCH v2 net-next 09/10] net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in the PVT
The mv88e6xxx switches have the ability to receive FORWARD (data plane) frames from the CPU port and route them according to the FDB. We can use this to offload the forwarding process of packets sent by the software bridge. Because DSA supports bridge domain isolation between user ports, just sending FORWARD frames is not enough, as they might leak outside the intended broadcast domain of the bridge on behalf of which the packets are sent. It should be noted that FORWARD frames are also (and typically) used to forward data plane packets on DSA links in cross-chip topologies. The FORWARD frame header contains the source port and switch ID, and switches receiving this frame header forward the packet according to their cross-chip port-based VLAN table (PVT). To address the bridging domain isolation in the context of offloading the forwarding on TX, the idea is that we can reuse the parts of the PVT that don't have any physical switch mapped to them, one entry for each software bridge. The switches will therefore think that behind their upstream port lie many switches, all in fact backed by software bridges through tag_dsa.c, which constructs FORWARD packets with the right switch ID corresponding to each bridge. The mapping we use is absolutely trivial: DSA gives us a unique bridge number, and we add the number of physical switches in the DSA switch tree to that, to obtain a unique virtual bridge device number to use in the PVT. 
Co-developed-by: Tobias Waldekranz <tobias at waldekranz.com> Signed-off-by: Tobias Waldekranz <tobias at waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com> --- drivers/net/dsa/mv88e6xxx/chip.c | 106 +++++++++++++++++++++++++++++-- 1 file changed, 102 insertions(+), 4 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index beb41572d04e..6b9c1a77d874 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -1221,14 +1221,38 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port) bool found = false; u16 pvlan; - list_for_each_entry(dp, &dst->ports, list) { - if (dp->ds->index == dev && dp->index == port) { + /* dev is a physical switch */ + if (dev <= dst->last_switch) { + list_for_each_entry(dp, &dst->ports, list) { + if (dp->ds->index == dev && dp->index == port) { + /* dp might be a DSA link or a user port, so it + * might or might not have a bridge_dev + * pointer. Use the "found" variable for both + * cases. 
+ */ + br = dp->bridge_dev; + found = true; + break; + } + } + /* dev is a virtual bridge */ + } else { + list_for_each_entry(dp, &dst->ports, list) { + struct dsa_bridge_fwd_accel_priv *accel_priv = dp->accel_priv; + + if (!accel_priv) + continue; + + if (accel_priv->bridge_num + 1 + dst->last_switch != dev) + continue; + + br = accel_priv->sb_dev; found = true; break; } } - /* Prevent frames from unknown switch or port */ + /* Prevent frames from unknown switch or virtual bridge */ if (!found) return 0; @@ -1236,7 +1260,6 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port) if (dp->type == DSA_PORT_TYPE_CPU || dp->type == DSA_PORT_TYPE_DSA) return mv88e6xxx_port_mask(chip); - br = dp->bridge_dev; pvlan = 0; /* Frames from user ports can egress any local DSA links and CPU ports, @@ -2422,6 +2445,68 @@ static void mv88e6xxx_crosschip_bridge_leave(struct dsa_switch *ds, mv88e6xxx_reg_unlock(chip); } +/* Treat the software bridge as a virtual single-port switch behind the + * CPU and map in the PVT. First dst->last_switch elements are taken by + * physical switches, so start from beyond that range. 
+ */ +static int mv88e6xxx_map_virtual_bridge_to_pvt(struct dsa_switch *ds, + int bridge_num) +{ + u8 dev = bridge_num + ds->dst->last_switch + 1; + struct mv88e6xxx_chip *chip = ds->priv; + int err; + + mv88e6xxx_reg_lock(chip); + err = mv88e6xxx_pvt_map(chip, dev, 0); + mv88e6xxx_reg_unlock(chip); + + return err; +} + +static int mv88e6xxx_bridge_fwd_offload_add(struct dsa_switch *ds, int port, + struct net_device *br, + int bridge_num) +{ + return mv88e6xxx_map_virtual_bridge_to_pvt(ds, bridge_num); +} + +static void mv88e6xxx_bridge_fwd_offload_del(struct dsa_switch *ds, int port, + struct net_device *br, + int bridge_num) +{ + int err; + + err = mv88e6xxx_map_virtual_bridge_to_pvt(ds, bridge_num); + if (err) { + dev_err(ds->dev, "failed to remap cross-chip Port VLAN: %pe\n", + ERR_PTR(err)); + } +} + +static int +mv88e6xxx_crosschip_bridge_fwd_offload_add(struct dsa_switch *ds, + int tree_index, int sw_index, + int port, struct net_device *br, + int bridge_num) +{ + return mv88e6xxx_map_virtual_bridge_to_pvt(ds, bridge_num); +} + +static void +mv88e6xxx_crosschip_bridge_fwd_offload_del(struct dsa_switch *ds, + int tree_index, int sw_index, + int port, struct net_device *br, + int bridge_num) +{ + int err; + + err = mv88e6xxx_map_virtual_bridge_to_pvt(ds, bridge_num); + if (err) { + dev_err(ds->dev, "failed to remap cross-chip Port VLAN: %pe\n", + ERR_PTR(err)); + } +} + static int mv88e6xxx_software_reset(struct mv88e6xxx_chip *chip) { if (chip->info->ops->reset) @@ -3025,6 +3110,15 @@ static int mv88e6xxx_setup(struct dsa_switch *ds) chip->ds = ds; ds->slave_mii_bus = mv88e6xxx_default_mdio_bus(chip); + /* Since virtual bridges are mapped in the PVT, the number we support + * depends on the physical switch topology. We need to let DSA figure + * that out and therefore we cannot set this at dsa_register_switch() + * time. 
+ */ + if (mv88e6xxx_has_pvt(chip)) + ds->num_fwd_offloading_bridges = MV88E6XXX_MAX_PVT_SWITCHES - + ds->dst->last_switch - 1; + mv88e6xxx_reg_lock(chip); if (chip->info->ops->setup_errata) { @@ -6128,6 +6222,10 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = { .crosschip_lag_change = mv88e6xxx_crosschip_lag_change, .crosschip_lag_join = mv88e6xxx_crosschip_lag_join, .crosschip_lag_leave = mv88e6xxx_crosschip_lag_leave, + .port_bridge_fwd_offload_add = mv88e6xxx_bridge_fwd_offload_add, + .port_bridge_fwd_offload_del = mv88e6xxx_bridge_fwd_offload_del, + .crosschip_bridge_fwd_offload_add = mv88e6xxx_crosschip_bridge_fwd_offload_add, + .crosschip_bridge_fwd_offload_del = mv88e6xxx_crosschip_bridge_fwd_offload_del, }; static int mv88e6xxx_register_switch(struct mv88e6xxx_chip *chip) -- 2.25.1
Vladimir Oltean
2021-Jul-03 11:57 UTC
[Bridge] [RFC PATCH v2 net-next 10/10] net: dsa: tag_dsa: offload the bridge forwarding process
From: Tobias Waldekranz <tobias at waldekranz.com> Allow the DSA tagger to generate FORWARD frames for offloaded skbs sent from a bridge that we offload, allowing the switch to handle any frame replication that may be required. This also means that source address learning takes place on packets sent from the CPU, meaning that return traffic no longer needs to be flooded as unknown unicast. Signed-off-by: Tobias Waldekranz <tobias at waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean at nxp.com> --- net/dsa/dsa_priv.h | 11 +++++++++ net/dsa/tag_dsa.c | 60 +++++++++++++++++++++++++++++++++++++++------- 2 files changed, 63 insertions(+), 8 deletions(-) diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h index c577338b5bb7..c070157cd967 100644 --- a/net/dsa/dsa_priv.h +++ b/net/dsa/dsa_priv.h @@ -389,6 +389,17 @@ static inline struct sk_buff *dsa_untag_bridge_pvid(struct sk_buff *skb) return skb; } +static inline struct net_device * +dsa_slave_get_sb_dev(const struct net_device *dev, struct sk_buff *skb) +{ + u16 queue_mapping = skb_get_queue_mapping(skb); + struct netdev_queue *txq; + + txq = netdev_get_tx_queue(dev, queue_mapping); + + return txq->sb_dev; +} + /* switch.c */ int dsa_switch_register_notifier(struct dsa_switch *ds); void dsa_switch_unregister_notifier(struct dsa_switch *ds); diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c index a822355afc90..9151ed141b3e 100644 --- a/net/dsa/tag_dsa.c +++ b/net/dsa/tag_dsa.c @@ -125,8 +125,49 @@ enum dsa_code { static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev, u8 extra) { + struct net_device *sb_dev = dsa_slave_get_sb_dev(dev, skb); struct dsa_port *dp = dsa_slave_to_port(dev); + u8 tag_dev, tag_port; + enum dsa_cmd cmd; u8 *dsa_header; + u16 pvid = 0; + int err; + + if (sb_dev) { + /* Don't bother finding the accel_priv corresponding with this + * subordinate device, we know it's the bridge because we can't + * offload anything else, so just search for it under the port, + 
* we know it's the same. + */ + struct dsa_bridge_fwd_accel_priv *accel_priv = dp->accel_priv; + struct dsa_switch_tree *dst = dp->ds->dst; + + cmd = DSA_CMD_FORWARD; + + /* When offloading forwarding for a bridge, inject FORWARD + * packets on behalf of a virtual switch device with an index + * past the physical switches. + */ + tag_dev = dst->last_switch + 1 + accel_priv->bridge_num; + tag_port = 0; + + /* If we are offloading forwarding for a VLAN-unaware bridge, + * inject packets to hardware using the bridge's pvid, since + * that's where the packets ingressed from. + */ + if (!br_vlan_enabled(sb_dev)) { + /* Safe because __dev_queue_xmit() runs under + * rcu_read_lock_bh() + */ + err = br_vlan_get_pvid_rcu(sb_dev, &pvid); + if (err) + return NULL; + } + } else { + cmd = DSA_CMD_FROM_CPU; + tag_dev = dp->ds->index; + tag_port = dp->index; + } if (skb->protocol == htons(ETH_P_8021Q)) { if (extra) { @@ -134,10 +175,10 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev, memmove(skb->data, skb->data + extra, 2 * ETH_ALEN); } - /* Construct tagged FROM_CPU DSA tag from 802.1Q tag. */ + /* Construct tagged DSA tag from 802.1Q tag. */ dsa_header = skb->data + 2 * ETH_ALEN + extra; - dsa_header[0] = (DSA_CMD_FROM_CPU << 6) | 0x20 | dp->ds->index; - dsa_header[1] = dp->index << 3; + dsa_header[0] = (cmd << 6) | 0x20 | tag_dev; + dsa_header[1] = tag_port << 3; /* Move CFI field from byte 2 to byte 1. */ if (dsa_header[2] & 0x10) { @@ -148,12 +189,13 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev, skb_push(skb, DSA_HLEN + extra); memmove(skb->data, skb->data + DSA_HLEN + extra, 2 * ETH_ALEN); - /* Construct untagged FROM_CPU DSA tag. */ + /* Construct untagged DSA tag. 
*/ dsa_header = skb->data + 2 * ETH_ALEN + extra; - dsa_header[0] = (DSA_CMD_FROM_CPU << 6) | dp->ds->index; - dsa_header[1] = dp->index << 3; - dsa_header[2] = 0x00; - dsa_header[3] = 0x00; + + dsa_header[0] = (cmd << 6) | tag_dev; + dsa_header[1] = tag_port << 3; + dsa_header[2] = pvid >> 8; + dsa_header[3] = pvid & 0xff; } return skb; @@ -304,6 +346,7 @@ static const struct dsa_device_ops dsa_netdev_ops = { .xmit = dsa_xmit, .rcv = dsa_rcv, .needed_headroom = DSA_HLEN, + .bridge_fwd_offload = true, }; DSA_TAG_DRIVER(dsa_netdev_ops); @@ -347,6 +390,7 @@ static const struct dsa_device_ops edsa_netdev_ops = { .xmit = edsa_xmit, .rcv = edsa_rcv, .needed_headroom = EDSA_HLEN, + .bridge_fwd_offload = true, }; DSA_TAG_DRIVER(edsa_netdev_ops); -- 2.25.1
Tobias Waldekranz
2021-Jul-03 22:04 UTC
[Bridge] [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean at nxp.com> wrote:> For this series I have taken Tobias' work from here: > https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias at waldekranz.com/ > and made the following changes: > - I collected and integrated (hopefully all of) Nikolay's, Ido's and my > feedback on the bridge driver changes. Otherwise, the structure of the > bridge changes is pretty much the same as Tobias left it. > - I basically rewrote the DSA infrastructure for the data plane > forwarding offload, based on the commonalities with another switch > driver for which I implemented this feature (not submitted here) > - I adapted mv88e6xxx to use the new infrastructure, hopefully it still > works but I didn't test thatHi Vladimir, Sorry that I have dropped the ball on this series. I have actually had a v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx specific problems. (See below)> The data plane of the software bridge can be partially offloaded to > switchdev, in the sense that we can trust the accelerator to: > (a) look up its FDB (which is more or less in sync with the software > bridge FDB) for selecting the destination ports for a packet > (b) replicate the frame in hardware in case it's a multicast/broadcast, > instead of the software bridge having to clone it and send the > clones to each net device one at a time. This reduces the bandwidth > needed between the CPU and the accelerator, as well as the CPU time > spent. > > The data path forwarding offload is managed per "hardware domain" - a > generalization of the "offload_fwd_mark" concept which is being > introduced in this series. Every packet is delivered only once to each > hardware domain. > > In addition, Tobias said in the original cover letter: > > ===================> ## Overview > > vlan1 vlan2 > \ / > .-----------. 
> | br0 | > '-----------' > / / \ \ > swp0 swp1 swp2 eth0 > : : : > (hwdom 1) > > Up to this point, switchdevs have been trusted with offloading > forwarding between bridge ports, e.g. forwarding a unicast from swp0 > to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This > series extends forward offloading to include some new classes of > traffic: > > - Locally originating flows, i.e. packets that ingress on br0 that are > to be forwarded to one or several of the ports swp{0,1,2}. Notably > this also includes routed flows, e.g. a packet ingressing swp0 on > VLAN 1 which is then routed over to VLAN 2 by the CPU and then > forwarded to swp1 is "locally originating" from br0's point of view. > > - Flows originating from "foreign" interfaces, i.e. an interface that > is not offloaded by a particular switchdev instance. This includes > ports belonging to other switchdev instances. A typical example > would be flows from eth0 towards swp{0,1,2}. > > The bridge still looks up its FDB/MDB as usual and then notifies the > switchdev driver that a particular skb should be offloaded if it > matches one of the classes above. It does so by using the _accel > version of dev_queue_xmit, supplying its own netdev as the > "subordinate" device. The driver can react to the presence of the > subordinate in its .ndo_select_queue in what ever way it needs to make > sure to forward the skb in much the same way that it would for packets > ingressing on regular ports. > > Hardware domains to which a particular skb has been forwarded are > recorded so that duplicates are avoided. > > The main performance benefit is thus seen on multicast flows. Imagine > for example that: > > - An IP camera is connected to swp0 (VLAN 1) > > - The CPU is acting as a multicast router, routing the group from VLAN > 1 to VLAN 2. > > - There are subscribers for the group in question behind both swp1 and > swp2 (VLAN 2). 
> > With this offloading in place, the bridge need only send a single skb > to the driver, which will send it to the hardware marked in such a way > that the switch will perform the multicast replication according to > the MDB configuration. Naturally, the number of saved skb_clones > increase linearly with the number of subscribed ports. > > As an extra benefit, on mv88e6xxx, this also allows the switch to > perform source address learning on these flows, which avoids having to > sync dynamic FDB entries over slow configuration interfaces like MDIO > to avoid flows directed towards the CPU being flooded as unknown > unicast by the switch. > > > ## RFC > > - In general, what do you think about this idea? > > - hwdom. What do you think about this terminology? Personally I feel > that we had too many things called offload_fwd_mark, and that as the > use of the bridge internal ID (nbp->offload_fwd_mark) expands, it > might be useful to have a separate term for it. > > - .dfwd_{add,del}_station. Am I stretching this abstraction too far, > and if so do you have any suggestion/preference on how to signal the > offloading from the bridge down to the switchdev driver? > > - The way that flooding is implemented in br_forward.c (lazily cloning > skbs) means that you have to mark the forwarding as completed very > early (right after should_deliver in maybe_deliver) in order to > avoid duplicates. Is there some way to move this decision point to a > later stage that I am missing? > > - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not > compatible with unicast-to-multicast being used on a port. Then > again, I think that this would also be broken for regular switchdev > bridge offloading as this flag is not offloaded to the switchdev > port, so there is no way for the driver to refuse it. Any ideas on > how to handle this? 
> > > ## mv88e6xxx Specifics > > Since we are now only receiving a single skb for both unicast and > multicast flows, we can tag the packets with the FORWARD command > instead of FROM_CPU. The swich(es) will then forward the packet in > accordance with its ATU, VTU, STU, and PVT configuration - just like > for packets ingressing on user ports. > > Crucially, FROM_CPU is still used for: > > - Ports in standalone mode. > > - Flows that are trapped to the CPU and software-forwarded by a > bridge. Note that these flows match neither of the classes discussed > in the overview. > > - Packets that are sent directly to a port netdev without going > through the bridge, e.g. lldpd sending out PDU via an AF_PACKET > socket. > > We thus have a pretty clean separation where the data plane uses > FORWARDs and the control plane uses TO_/FROM_CPU. > > The barrier between different bridges is enforced by port based VLANs > on mv88e6xxx, which in essence is a mapping from a source device/port > pair to an allowed set of egress ports.Unless I am missing something, it turns out that the PVT is not enough to support multiple (non-VLAN filtering) bridges in multi-chip setups. While the isolation barrier works, there is no way of correctly managing automatic learning.> In order to have a FORWARD > frame (which carries a _source_ device/port) correctly mapped by the > PVT, we must use a unique pair for each bridge. > > Fortunately, there is typically lots of unused address space in most > switch trees. When was the last time you saw an mv88e6xxx product > using more than 4 chips? Even if you found one with 16 (!) devices, > you would still have room to allocate 16*16 virtual ports to software > bridges. > > Therefore, the mv88e6xxx driver will allocate a virtual device/port > pair to each bridge that it offloads. 
> All members of the same bridge
> are then configured to allow packets from this virtual port in their
> PVTs.

So while this solution is cute, it does not work in this example:

     CPU
      | .-----.
    .-0-1-. .-0-1-.
    | sw0 | | sw1 |
    '-2-3-' '-2-3-'

- [sw0p2, sw1p2] are attached to one bridge
- [sw0p3, sw1p3] are attached to another bridge
- Neither bridge uses VLAN filtering

Since no VLAN information is available in the frames, the source
addresses of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot
possibly be separated into different FIDs. They will all be placed in
the respective port's default FID. Thus, the two bridges are not
isolated with respect to their FDBs.

My current plan is therefore to start by reworking how bridges are
isolated on mv88e6xxx, roughly by allocating a reserved VID/FID pair
for each non-filtering bridge. Two of these can be easily managed
since both VID 0 and 4095 are illegal on the wire but allowed in the
VTU - after that it gets tricky. The best scheme I have come up with
is to just grab an unused VID when adding any subsequent non-filtering
bridge; in the event that that VID is requested by a filtering bridge
or a VLAN upper, you move the non-filtering bridge to another
currently unused VID. 
Does that sound reasonable?> ===================> > Tobias Waldekranz (5): > net: dfwd: constrain existing users to macvlan subordinates > net: bridge: disambiguate offload_fwd_mark > net: bridge: switchdev: recycle unused hwdoms > net: bridge: switchdev: allow the data plane forwarding to be > offloaded > net: dsa: tag_dsa: offload the bridge forwarding process > > Vladimir Oltean (5): > net: extract helpers for binding a subordinate device to TX queues > net: allow ndo_select_queue to go beyond dev->num_real_tx_queues > net: dsa: track the number of switches in a tree > net: dsa: add support for bridge forwarding offload > net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in > the PVT > > drivers/net/dsa/mv88e6xxx/chip.c | 106 +++++++++++- > .../net/ethernet/intel/fm10k/fm10k_netdev.c | 3 + > drivers/net/ethernet/intel/i40e/i40e_main.c | 3 + > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 + > include/linux/if_bridge.h | 1 + > include/linux/netdevice.h | 13 +- > include/net/dsa.h | 37 ++++ > net/bridge/br_forward.c | 18 +- > net/bridge/br_if.c | 4 +- > net/bridge/br_private.h | 49 +++++- > net/bridge/br_switchdev.c | 163 +++++++++++++++--- > net/bridge/br_vlan.c | 10 +- > net/core/dev.c | 31 +++- > net/dsa/dsa2.c | 3 + > net/dsa/dsa_priv.h | 28 +++ > net/dsa/port.c | 35 ++++ > net/dsa/slave.c | 134 +++++++++++++- > net/dsa/switch.c | 58 +++++++ > net/dsa/tag_dsa.c | 60 ++++++- > 19 files changed, 700 insertions(+), 59 deletions(-) > > -- > 2.25.1
DENG Qingfang
2021-Jul-05 04:20 UTC
[Bridge] [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
Hi Vladimir,

On Sat, Jul 03, 2021 at 02:56:55PM +0300, Vladimir Oltean wrote:
> For this series I have taken Tobias' work from here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias at waldekranz.com/
> and made the following changes:
> - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>   feedback on the bridge driver changes. Otherwise, the structure of the
>   bridge changes is pretty much the same as Tobias left it.
> - I basically rewrote the DSA infrastructure for the data plane
>   forwarding offload, based on the commonalities with another switch
>   driver for which I implemented this feature (not submitted here)
> - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>   works but I didn't test that
>
> The data plane of the software bridge can be partially offloaded to
> switchdev, in the sense that we can trust the accelerator to:
> (a) look up its FDB (which is more or less in sync with the software
>     bridge FDB) for selecting the destination ports for a packet
> (b) replicate the frame in hardware in case it's a multicast/broadcast,
>     instead of the software bridge having to clone it and send the
>     clones to each net device one at a time. This reduces the bandwidth
>     needed between the CPU and the accelerator, as well as the CPU time
>     spent.

Many DSA taggers use a port bit field in their TX tags, which allows
replication in hardware (multiple bits set = send to multiple ports).
I wonder if the tagger API can be updated to support this.

[... remainder of the cover letter trimmed; quoted in full above ...]