Nikolay Aleksandrov
2018-Jul-22 08:17 UTC
[Bridge] [PATCH net-next v2 0/2] net: bridge: add support for backup port
Hi, This set introduces a new bridge port option that allows any port to have any other port (in the same bridge of course) as its backup and traffic will be forwarded to the backup port when the primary goes down. This is mainly used in MLAG and EVPN setups where we have peerlink path which is a backup of many (or even all) ports and is a participating bridge port itself. There's more detailed information in patch 02. Patch 01 just prepares the port sysfs code for options that take raw value. The main issues that this set solves are scalability and fallback latency. We have used similar code for over 6 months now to bring the fallback latency of the backup peerlink down and avoid fdb notification storms. Also due to the nature of master devices such setup is currently not possible, and last but not least having tens of thousands of fdbs require thousands of calls to switch. I've also CCed our MLAG experts that have been using similar option. v2: In patch 01 use kstrdup/kfree to avoid casting the const buf. In order to avoid using GFP_ATOMIC or always allocating I kept the spinlock inside each branch. Thanks, Nik Nikolay Aleksandrov (2): net: bridge: add support for raw sysfs port options net: bridge: add support for backup port include/uapi/linux/if_link.h | 1 + net/bridge/br_forward.c | 16 +++++++- net/bridge/br_if.c | 53 ++++++++++++++++++++++++++ net/bridge/br_netlink.c | 30 ++++++++++++++- net/bridge/br_private.h | 3 ++ net/bridge/br_sysfs_if.c | 89 +++++++++++++++++++++++++++++++++++++------- 6 files changed, 176 insertions(+), 16 deletions(-) -- 2.11.0
Nikolay Aleksandrov
2018-Jul-22 08:17 UTC
[Bridge] [PATCH net-next v2 1/2] net: bridge: add support for raw sysfs port options
This patch adds a new alternative store callback for port sysfs options which takes a raw value (buf) and can use it directly. It is needed for the backup port sysfs support since we have to pass the device by its name. Signed-off-by: Nikolay Aleksandrov <nikolay at cumulusnetworks.com> --- There are a few checkpatch warnings here because of the old code, I've noted them in my todo and will send separate patch for simple_strtoul and the function prototypes. v2: Use kstrndup/kfree instead of casting the const buf. In order to avoid using GFP_ATOMIC or always allocating I kept the spinlock inside each branch. net/bridge/br_sysfs_if.c | 56 ++++++++++++++++++++++++++++++++++++------------ 1 file changed, 42 insertions(+), 14 deletions(-) diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c index f99c5bf5c906..c4f5c83c92d2 100644 --- a/net/bridge/br_sysfs_if.c +++ b/net/bridge/br_sysfs_if.c @@ -25,6 +25,15 @@ struct brport_attribute { struct attribute attr; ssize_t (*show)(struct net_bridge_port *, char *); int (*store)(struct net_bridge_port *, unsigned long); + int (*store_raw)(struct net_bridge_port *, char *); +}; + +#define BRPORT_ATTR_RAW(_name, _mode, _show, _store) \ +const struct brport_attribute brport_attr_##_name = { \ + .attr = {.name = __stringify(_name), \ + .mode = _mode }, \ + .show = _show, \ + .store_raw = _store, \ }; #define BRPORT_ATTR(_name, _mode, _show, _store) \ @@ -270,27 +279,46 @@ static ssize_t brport_store(struct kobject *kobj, struct brport_attribute *brport_attr = to_brport_attr(attr); struct net_bridge_port *p = to_brport(kobj); ssize_t ret = -EINVAL; - char *endp; unsigned long val; + char *endp; if (!ns_capable(dev_net(p->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; - val = simple_strtoul(buf, &endp, 0); - if (endp != buf) { - if (!rtnl_trylock()) - return restart_syscall(); - if (p->dev && p->br && brport_attr->store) { - spin_lock_bh(&p->br->lock); - ret = brport_attr->store(p, val); - spin_unlock_bh(&p->br->lock); - if (!ret) { - br_ifinfo_notify(RTM_NEWLINK, NULL, p); - ret = count; - } + if (!rtnl_trylock()) + return restart_syscall(); + + if (!p->dev || !p->br) + goto out_unlock; + + if (brport_attr->store_raw) { + char *buf_copy; + + buf_copy = kstrndup(buf, count, GFP_KERNEL); + if (!buf_copy) { + ret = -ENOMEM; + goto out_unlock; } - rtnl_unlock(); + spin_lock_bh(&p->br->lock); + ret = brport_attr->store_raw(p, buf_copy); + spin_unlock_bh(&p->br->lock); + kfree(buf_copy); + } else if (brport_attr->store) { + val = simple_strtoul(buf, &endp, 0); + if (endp == buf) + goto out_unlock; + spin_lock_bh(&p->br->lock); + ret = brport_attr->store(p, val); + spin_unlock_bh(&p->br->lock); } + + if (!ret) { + br_ifinfo_notify(RTM_NEWLINK, NULL, p); + ret = count; + } +out_unlock: + rtnl_unlock(); + return ret; } -- 2.11.0
Nikolay Aleksandrov
2018-Jul-22 08:17 UTC
[Bridge] [PATCH net-next v2 2/2] net: bridge: add support for backup port
This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT, which allows to set a backup port to be used for known unicast traffic if the port has gone carrier down. The backup pointer is rcu protected and set only under RTNL, a counter is maintained so when deleting a port we know how many other ports reference it as a backup and we remove it from all. Also the pointer is in the first cache line which is hot at the time of the check and thus in the common case we only add one more test. The backup port will be used only for the non-flooding case since it's a part of the bridge and the flooded packets will be forwarded to it anyway. To remove the forwarding just send a 0/non-existing backup port. This is used to avoid numerous scalability problems when using MLAG most notably if we have thousands of fdbs one would need to change all of them on port carrier going down which takes too long and causes a storm of fdb notifications (and again when the port comes back up). In a Multi-chassis Link Aggregation setup usually hosts are connected to two different switches which act as a single logical switch. Those switches usually have a control and backup link between them called peerlink which might be used for communication in case a host loses connectivity to one of them. We need a fast way to failover in case a host port goes down and currently none of the solutions (like bond) cannot fulfill the requirements because the participating ports are actually the "master" devices and must have the same peerlink as their backup interface and at the same time all of them must participate in the bridge device. As Roopa noted it's normal practice in routing called fast re-route where a precalculated backup path is used when the main one is down. Another use case of this is with EVPN, having a single vxlan device which is backup of every port. Due to the nature of master devices it's not currently possible to use one device as a backup for many and still have all of them participate in the bridge (which is master itself). More detailed information about MLAG is available at the link below. https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG Signed-off-by: Nikolay Aleksandrov <nikolay at cumulusnetworks.com> --- include/uapi/linux/if_link.h | 1 + net/bridge/br_forward.c | 16 ++++++++++++- net/bridge/br_if.c | 53 ++++++++++++++++++++++++++++++++++++++++++++ net/bridge/br_netlink.c | 30 ++++++++++++++++++++++++- net/bridge/br_private.h | 3 +++ net/bridge/br_sysfs_if.c | 33 +++++++++++++++++++++++++++ 6 files changed, 134 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 8759cfb8aa2e..01b5069a73a5 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -334,6 +334,7 @@ enum { IFLA_BRPORT_GROUP_FWD_MASK, IFLA_BRPORT_NEIGH_SUPPRESS, IFLA_BRPORT_ISOLATED, + IFLA_BRPORT_BACKUP_PORT, __IFLA_BRPORT_MAX }; #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1) diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index 9019f326fe81..5372e2042adf 100644 --- a/net/bridge/br_forward.c +++ b/net/bridge/br_forward.c @@ -142,7 +142,20 @@ static int deliver_clone(const struct net_bridge_port *prev, void br_forward(const struct net_bridge_port *to, struct sk_buff *skb, bool local_rcv, bool local_orig) { - if (to && should_deliver(to, skb)) { + if (unlikely(!to)) + goto out; + + /* redirect to backup link if the destination port is down */ + if (rcu_access_pointer(to->backup_port) && !netif_carrier_ok(to->dev)) { + struct net_bridge_port *backup_port; + + backup_port = rcu_dereference(to->backup_port); + if (unlikely(!backup_port)) + goto out; + to = backup_port; + } + + if (should_deliver(to, skb)) { if (local_rcv) deliver_clone(to, skb, local_orig); else @@ -150,6 +163,7 @@ void br_forward(const struct net_bridge_port *to, return; } +out: if (!local_rcv) kfree_skb(skb); } diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index 05e42d86882d..365759ac7c73 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -169,6 +169,58 @@ void br_manage_promisc(struct net_bridge *br) } } +int nbp_backup_change(struct net_bridge_port *p, + struct net_device *backup_dev) +{ + struct net_bridge_port *old_backup = rtnl_dereference(p->backup_port); + struct net_bridge_port *backup_p = NULL; + + ASSERT_RTNL(); + + if (backup_dev) { + if (!br_port_exists(backup_dev)) + return -ENOENT; + + backup_p = br_port_get_rtnl(backup_dev); + if (backup_p->br != p->br) + return -EINVAL; + } + + if (p == backup_p) + return -EINVAL; + + if (old_backup == backup_p) + return 0; + + /* if the backup link is already set, clear it */ + if (old_backup) + old_backup->backup_redirected_cnt--; + + if (backup_p) + backup_p->backup_redirected_cnt++; + rcu_assign_pointer(p->backup_port, backup_p); + + return 0; +} + +static void nbp_backup_clear(struct net_bridge_port *p) +{ + nbp_backup_change(p, NULL); + if (p->backup_redirected_cnt) { + struct net_bridge_port *cur_p; + + list_for_each_entry(cur_p, &p->br->port_list, list) { + struct net_bridge_port *backup_p; + + backup_p = rtnl_dereference(cur_p->backup_port); + if (backup_p == p) + nbp_backup_change(cur_p, NULL); + } + } + + WARN_ON(rcu_access_pointer(p->backup_port) || p->backup_redirected_cnt); +} + static void nbp_update_port_count(struct net_bridge *br) { struct net_bridge_port *p; @@ -286,6 +338,7 @@ static void del_nbp(struct net_bridge_port *p) nbp_vlan_flush(p); br_fdb_delete_by_port(br, p, 0, 1); switchdev_deferred_process(); + nbp_backup_clear(p); nbp_update_port_count(br); diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index 9f5eb05b0373..ec2b58a09f76 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -169,13 +169,15 @@ static inline size_t br_nlmsg_size(struct net_device *dev, u32 filter_mask) + nla_total_size(1) /* IFLA_OPERSTATE */ + nla_total_size(br_port_info_size()) /* IFLA_PROTINFO */ + nla_total_size(br_get_link_af_size_filtered(dev, - filter_mask)); /* IFLA_AF_SPEC */ + filter_mask)) /* IFLA_AF_SPEC */ + + nla_total_size(4); /* IFLA_BRPORT_BACKUP_PORT */ } static int br_port_fill_attrs(struct sk_buff *skb, const struct net_bridge_port *p) { u8 mode = !!(p->flags & BR_HAIRPIN_MODE); + struct net_bridge_port *backup_p; u64 timerval; if (nla_put_u8(skb, IFLA_BRPORT_STATE, p->state) || @@ -237,6 +239,14 @@ static int br_port_fill_attrs(struct sk_buff *skb, return -EMSGSIZE; #endif + /* we might be called only with br->lock */ + rcu_read_lock(); + backup_p = rcu_dereference(p->backup_port); + if (backup_p) + nla_put_u32(skb, IFLA_BRPORT_BACKUP_PORT, + backup_p->dev->ifindex); + rcu_read_unlock(); + return 0; } @@ -663,6 +673,7 @@ static const struct nla_policy br_port_policy[IFLA_BRPORT_MAX + 1] = { [IFLA_BRPORT_GROUP_FWD_MASK] = { .type = NLA_U16 }, [IFLA_BRPORT_NEIGH_SUPPRESS] = { .type = NLA_U8 }, [IFLA_BRPORT_ISOLATED] = { .type = NLA_U8 }, + [IFLA_BRPORT_BACKUP_PORT] = { .type = NLA_U32 }, }; /* Change the state of the port and notify spanning tree */ @@ -817,6 +828,23 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[]) if (err) return err; + if (tb[IFLA_BRPORT_BACKUP_PORT]) { + struct net_device *backup_dev = NULL; + u32 backup_ifindex; + + backup_ifindex = nla_get_u32(tb[IFLA_BRPORT_BACKUP_PORT]); + if (backup_ifindex) { + backup_dev = __dev_get_by_index(dev_net(p->dev), + backup_ifindex); + if (!backup_dev) + return -ENOENT; + } + + err = nbp_backup_change(p, backup_dev); + if (err) + return err; + } + br_port_flags_change(p, old_flags ^ p->flags); return 0; } diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 5216a524b537..ab7db2466c43 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -237,6 +237,7 @@ struct net_bridge_port { #ifdef CONFIG_BRIDGE_VLAN_FILTERING struct net_bridge_vlan_group __rcu *vlgrp; #endif + struct net_bridge_port __rcu *backup_port; /* STP */ u8 priority; @@ -281,6 +282,7 @@ struct net_bridge_port { int offload_fwd_mark; #endif u16 group_fwd_mask; + u16 backup_redirected_cnt; }; #define br_auto_port(p) ((p)->flags & BR_AUTO_MASK) @@ -595,6 +597,7 @@ netdev_features_t br_features_recompute(struct net_bridge *br, netdev_features_t features); void br_port_flags_change(struct net_bridge_port *port, unsigned long mask); void br_manage_promisc(struct net_bridge *br); +int nbp_backup_change(struct net_bridge_port *p, struct net_device *backup_dev); /* br_input.c */ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb); diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c index c4f5c83c92d2..91b88df20019 100644 --- a/net/bridge/br_sysfs_if.c +++ b/net/bridge/br_sysfs_if.c @@ -191,6 +191,38 @@ static int store_group_fwd_mask(struct net_bridge_port *p, static BRPORT_ATTR(group_fwd_mask, 0644, show_group_fwd_mask, store_group_fwd_mask); +static ssize_t show_backup_port(struct net_bridge_port *p, char *buf) +{ + struct net_bridge_port *backup_p; + int ret = 0; + + rcu_read_lock(); + backup_p = rcu_dereference(p->backup_port); + if (backup_p) + ret = sprintf(buf, "%s\n", backup_p->dev->name); + rcu_read_unlock(); + + return ret; +} + +static int store_backup_port(struct net_bridge_port *p, char *buf) +{ + struct net_device *backup_dev = NULL; + char *nl = strchr(buf, '\n'); + + if (nl) + *nl = '\0'; + + if (strlen(buf) > 0) { + backup_dev = __dev_get_by_name(dev_net(p->dev), buf); + if (!backup_dev) + return -ENOENT; + } + + return nbp_backup_change(p, backup_dev); +} +static BRPORT_ATTR_RAW(backup_port, 0644, show_backup_port, store_backup_port); + BRPORT_ATTR_FLAG(hairpin_mode, BR_HAIRPIN_MODE); BRPORT_ATTR_FLAG(bpdu_guard, BR_BPDU_GUARD); BRPORT_ATTR_FLAG(root_block, BR_ROOT_BLOCK); @@ -254,6 +286,7 @@ static const struct brport_attribute *brport_attrs[] = { &brport_attr_group_fwd_mask, &brport_attr_neigh_suppress, &brport_attr_isolated, + &brport_attr_backup_port, NULL }; -- 2.11.0