Nikolay Aleksandrov
2018-Jul-23 08:16 UTC
[Bridge] [PATCH net-next v3 0/2] net: bridge: add support for backup port
Hi,
This set introduces a new bridge port option that allows any port to have
any other port (in the same bridge of course) as its backup and traffic
will be forwarded to the backup port when the primary goes down. This is
mainly used in MLAG and EVPN setups where we have peerlink path which is
a backup of many (or even all) ports and is a participating bridge port
itself. There's more detailed information in patch 02. Patch 01 just
prepares the port sysfs code for options that take raw value. The main
issues that this set solves are scalability and fallback latency.
We have used similar code for over 6 months now to bring the fallback
latency of the backup peerlink down and avoid fdb notification storms.
Also due to the nature of master devices such setup is currently not
possible, and last but not least having tens of thousands of fdbs require
thousands of calls to switch.
I've also CCed our MLAG experts that have been using similar option.
Roopa also adds:
"Two switches acting in a MLAG pair are connected by the peerlink
interface which is a bridge port.
the config on one of the switches looks like the below. The other
switch also has a similar config.
eth0 is connected to one port on the server. And the server is
connected to both switches.
br0 -- team0---eth0
|
-- switch-peerlink
switch-peerlink becomes the failover/backport port when say team0 to
the server goes down.
Today, when team0 goes down, control plane has to withdraw all the fdb
entries pointing to team0
and re-install the fdb entries pointing to switch-peerlink...and
restore the fdb entries when team0 comes back up again.
and this is the problem we are trying to solve.
This also becomes necessary when multihoming is implemented by a
standard like E-VPN https://tools.ietf.org/html/rfc8365#section-8
where the 'switch-peerlink' is an overlay vxlan port (like nikolay
mentions in his patch commit). In these implementations, the fdb scale
can be much larger.
On why bond failover cannot be used here ?: the point that nikolay was
alluding to is, switch-peerlink in the above example is a bridge port
and is a failover/backport port for more than one or all ports in the
bridge br0. And you cannot enslave switch-peerlink into a second level
team
with other bridge ports. Hence a multi layered team device is not an
option (FWIW, switch-peerlink is also a teamed interface to the peer
switch)."
v3: Added Roopa's explanation and diagram
v2: In patch 01 use kstrdup/kfree to avoid casting the const buf. In order
to avoid using GFP_ATOMIC or always allocating I kept the spinlock inside
each branch.
Thanks,
Nik
Nikolay Aleksandrov (2):
net: bridge: add support for raw sysfs port options
net: bridge: add support for backup port
include/uapi/linux/if_link.h | 1 +
net/bridge/br_forward.c | 16 +++++++-
net/bridge/br_if.c | 53 ++++++++++++++++++++++++++
net/bridge/br_netlink.c | 30 ++++++++++++++-
net/bridge/br_private.h | 3 ++
net/bridge/br_sysfs_if.c | 89 +++++++++++++++++++++++++++++++++++++-------
6 files changed, 176 insertions(+), 16 deletions(-)
--
2.11.0
Nikolay Aleksandrov
2018-Jul-23 08:16 UTC
[Bridge] [PATCH 1/2 net-next v3] net: bridge: add support for raw sysfs port options
This patch adds a new alternative store callback for port sysfs options
which takes a raw value (buf) and can use it directly. It is needed for the
backup port sysfs support since we have to pass the device by its name.
Signed-off-by: Nikolay Aleksandrov <nikolay at cumulusnetworks.com>
---
There are a few checkpatch warnings here because of the old code, I've
noted them in my todo and will send separate patch for simple_strtoul
and the function prototypes.
v3: no change
v2: Use kstrndup/kfree instead of casting the const buf. In order
to avoid using GFP_ATOMIC or always allocating I kept the spinlock inside
each branch.
net/bridge/br_sysfs_if.c | 56 ++++++++++++++++++++++++++++++++++++------------
1 file changed, 42 insertions(+), 14 deletions(-)
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index ab4c7f8adf68..4ac940067754 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -25,6 +25,15 @@ struct brport_attribute {
struct attribute attr;
ssize_t (*show)(struct net_bridge_port *, char *);
int (*store)(struct net_bridge_port *, unsigned long);
+ int (*store_raw)(struct net_bridge_port *, char *);
+};
+
+#define BRPORT_ATTR_RAW(_name, _mode, _show, _store) \
+const struct brport_attribute brport_attr_##_name = { \
+ .attr = {.name = __stringify(_name), \
+ .mode = _mode }, \
+ .show = _show, \
+ .store_raw = _store, \
};
#define BRPORT_ATTR(_name, _mode, _show, _store) \
@@ -269,27 +278,46 @@ static ssize_t brport_store(struct kobject *kobj,
struct brport_attribute *brport_attr = to_brport_attr(attr);
struct net_bridge_port *p = kobj_to_brport(kobj);
ssize_t ret = -EINVAL;
- char *endp;
unsigned long val;
+ char *endp;
if (!ns_capable(dev_net(p->dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
- val = simple_strtoul(buf, &endp, 0);
- if (endp != buf) {
- if (!rtnl_trylock())
- return restart_syscall();
- if (p->dev && p->br && brport_attr->store) {
- spin_lock_bh(&p->br->lock);
- ret = brport_attr->store(p, val);
- spin_unlock_bh(&p->br->lock);
- if (!ret) {
- br_ifinfo_notify(RTM_NEWLINK, NULL, p);
- ret = count;
- }
+ if (!rtnl_trylock())
+ return restart_syscall();
+
+ if (!p->dev || !p->br)
+ goto out_unlock;
+
+ if (brport_attr->store_raw) {
+ char *buf_copy;
+
+ buf_copy = kstrndup(buf, count, GFP_KERNEL);
+ if (!buf_copy) {
+ ret = -ENOMEM;
+ goto out_unlock;
}
- rtnl_unlock();
+ spin_lock_bh(&p->br->lock);
+ ret = brport_attr->store_raw(p, buf_copy);
+ spin_unlock_bh(&p->br->lock);
+ kfree(buf_copy);
+ } else if (brport_attr->store) {
+ val = simple_strtoul(buf, &endp, 0);
+ if (endp == buf)
+ goto out_unlock;
+ spin_lock_bh(&p->br->lock);
+ ret = brport_attr->store(p, val);
+ spin_unlock_bh(&p->br->lock);
}
+
+ if (!ret) {
+ br_ifinfo_notify(RTM_NEWLINK, NULL, p);
+ ret = count;
+ }
+out_unlock:
+ rtnl_unlock();
+
return ret;
}
--
2.11.0
Nikolay Aleksandrov
2018-Jul-23 08:16 UTC
[Bridge] [PATCH 2/2 net-next v3] net: bridge: add support for backup port
This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT, which
allows to set a backup port to be used for known unicast traffic if the
port has gone carrier down. The backup pointer is rcu protected and set
only under RTNL, a counter is maintained so when deleting a port we know
how many other ports reference it as a backup and we remove it from all.
Also the pointer is in the first cache line which is hot at the time of
the check and thus in the common case we only add one more test.
The backup port will be used only for the non-flooding case since
it's a part of the bridge and the flooded packets will be forwarded to it
anyway. To remove the forwarding just send a 0/non-existing backup port.
This is used to avoid numerous scalability problems when using MLAG most
notably if we have thousands of fdbs one would need to change all of them
on port carrier going down which takes too long and causes a storm of fdb
notifications (and again when the port comes back up). In a Multi-chassis
Link Aggregation setup usually hosts are connected to two different
switches which act as a single logical switch. Those switches usually have
a control and backup link between them called peerlink which might be used
for communication in case a host loses connectivity to one of them.
We need a fast way to failover in case a host port goes down and currently
none of the solutions (like bond) cannot fulfill the requirements because
the participating ports are actually the "master" devices and must
have the
same peerlink as their backup interface and at the same time all of them
must participate in the bridge device. As Roopa noted it's normal practice
in routing called fast re-route where a precalculated backup path is used
when the main one is down.
Another use case of this is with EVPN, having a single vxlan device which
is backup of every port. Due to the nature of master devices it's not
currently possible to use one device as a backup for many and still have
all of them participate in the bridge (which is master itself).
More detailed information about MLAG is available at the link below.
https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG
Further explanation and a diagram by Roopa:
Two switches acting in a MLAG pair are connected by the peerlink
interface which is a bridge port.
the config on one of the switches looks like the below. The other
switch also has a similar config.
eth0 is connected to one port on the server. And the server is
connected to both switches.
br0 -- team0---eth0
|
-- switch-peerlink
Signed-off-by: Nikolay Aleksandrov <nikolay at cumulusnetworks.com>
---
v3: No code changes, added Roopa's explanation and diagram
include/uapi/linux/if_link.h | 1 +
net/bridge/br_forward.c | 16 ++++++++++++-
net/bridge/br_if.c | 53 ++++++++++++++++++++++++++++++++++++++++++++
net/bridge/br_netlink.c | 30 ++++++++++++++++++++++++-
net/bridge/br_private.h | 3 +++
net/bridge/br_sysfs_if.c | 33 +++++++++++++++++++++++++++
6 files changed, 134 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8759cfb8aa2e..01b5069a73a5 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -334,6 +334,7 @@ enum {
IFLA_BRPORT_GROUP_FWD_MASK,
IFLA_BRPORT_NEIGH_SUPPRESS,
IFLA_BRPORT_ISOLATED,
+ IFLA_BRPORT_BACKUP_PORT,
__IFLA_BRPORT_MAX
};
#define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 9019f326fe81..5372e2042adf 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -142,7 +142,20 @@ static int deliver_clone(const struct net_bridge_port
*prev,
void br_forward(const struct net_bridge_port *to,
struct sk_buff *skb, bool local_rcv, bool local_orig)
{
- if (to && should_deliver(to, skb)) {
+ if (unlikely(!to))
+ goto out;
+
+ /* redirect to backup link if the destination port is down */
+ if (rcu_access_pointer(to->backup_port) &&
!netif_carrier_ok(to->dev)) {
+ struct net_bridge_port *backup_port;
+
+ backup_port = rcu_dereference(to->backup_port);
+ if (unlikely(!backup_port))
+ goto out;
+ to = backup_port;
+ }
+
+ if (should_deliver(to, skb)) {
if (local_rcv)
deliver_clone(to, skb, local_orig);
else
@@ -150,6 +163,7 @@ void br_forward(const struct net_bridge_port *to,
return;
}
+out:
if (!local_rcv)
kfree_skb(skb);
}
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index e7c8d55212aa..0363f1bdc401 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -170,6 +170,58 @@ void br_manage_promisc(struct net_bridge *br)
}
}
+int nbp_backup_change(struct net_bridge_port *p,
+ struct net_device *backup_dev)
+{
+ struct net_bridge_port *old_backup = rtnl_dereference(p->backup_port);
+ struct net_bridge_port *backup_p = NULL;
+
+ ASSERT_RTNL();
+
+ if (backup_dev) {
+ if (!br_port_exists(backup_dev))
+ return -ENOENT;
+
+ backup_p = br_port_get_rtnl(backup_dev);
+ if (backup_p->br != p->br)
+ return -EINVAL;
+ }
+
+ if (p == backup_p)
+ return -EINVAL;
+
+ if (old_backup == backup_p)
+ return 0;
+
+ /* if the backup link is already set, clear it */
+ if (old_backup)
+ old_backup->backup_redirected_cnt--;
+
+ if (backup_p)
+ backup_p->backup_redirected_cnt++;
+ rcu_assign_pointer(p->backup_port, backup_p);
+
+ return 0;
+}
+
+static void nbp_backup_clear(struct net_bridge_port *p)
+{
+ nbp_backup_change(p, NULL);
+ if (p->backup_redirected_cnt) {
+ struct net_bridge_port *cur_p;
+
+ list_for_each_entry(cur_p, &p->br->port_list, list) {
+ struct net_bridge_port *backup_p;
+
+ backup_p = rtnl_dereference(cur_p->backup_port);
+ if (backup_p == p)
+ nbp_backup_change(cur_p, NULL);
+ }
+ }
+
+ WARN_ON(rcu_access_pointer(p->backup_port) || p->backup_redirected_cnt);
+}
+
static void nbp_update_port_count(struct net_bridge *br)
{
struct net_bridge_port *p;
@@ -295,6 +347,7 @@ static void del_nbp(struct net_bridge_port *p)
nbp_vlan_flush(p);
br_fdb_delete_by_port(br, p, 0, 1);
switchdev_deferred_process();
+ nbp_backup_clear(p);
nbp_update_port_count(br);
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 9f5eb05b0373..ec2b58a09f76 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -169,13 +169,15 @@ static inline size_t br_nlmsg_size(struct net_device *dev,
u32 filter_mask)
+ nla_total_size(1) /* IFLA_OPERSTATE */
+ nla_total_size(br_port_info_size()) /* IFLA_PROTINFO */
+ nla_total_size(br_get_link_af_size_filtered(dev,
- filter_mask)); /* IFLA_AF_SPEC */
+ filter_mask)) /* IFLA_AF_SPEC */
+ + nla_total_size(4); /* IFLA_BRPORT_BACKUP_PORT */
}
static int br_port_fill_attrs(struct sk_buff *skb,
const struct net_bridge_port *p)
{
u8 mode = !!(p->flags & BR_HAIRPIN_MODE);
+ struct net_bridge_port *backup_p;
u64 timerval;
if (nla_put_u8(skb, IFLA_BRPORT_STATE, p->state) ||
@@ -237,6 +239,14 @@ static int br_port_fill_attrs(struct sk_buff *skb,
return -EMSGSIZE;
#endif
+ /* we might be called only with br->lock */
+ rcu_read_lock();
+ backup_p = rcu_dereference(p->backup_port);
+ if (backup_p)
+ nla_put_u32(skb, IFLA_BRPORT_BACKUP_PORT,
+ backup_p->dev->ifindex);
+ rcu_read_unlock();
+
return 0;
}
@@ -663,6 +673,7 @@ static const struct nla_policy
br_port_policy[IFLA_BRPORT_MAX + 1] = {
[IFLA_BRPORT_GROUP_FWD_MASK] = { .type = NLA_U16 },
[IFLA_BRPORT_NEIGH_SUPPRESS] = { .type = NLA_U8 },
[IFLA_BRPORT_ISOLATED] = { .type = NLA_U8 },
+ [IFLA_BRPORT_BACKUP_PORT] = { .type = NLA_U32 },
};
/* Change the state of the port and notify spanning tree */
@@ -817,6 +828,23 @@ static int br_setport(struct net_bridge_port *p, struct
nlattr *tb[])
if (err)
return err;
+ if (tb[IFLA_BRPORT_BACKUP_PORT]) {
+ struct net_device *backup_dev = NULL;
+ u32 backup_ifindex;
+
+ backup_ifindex = nla_get_u32(tb[IFLA_BRPORT_BACKUP_PORT]);
+ if (backup_ifindex) {
+ backup_dev = __dev_get_by_index(dev_net(p->dev),
+ backup_ifindex);
+ if (!backup_dev)
+ return -ENOENT;
+ }
+
+ err = nbp_backup_change(p, backup_dev);
+ if (err)
+ return err;
+ }
+
br_port_flags_change(p, old_flags ^ p->flags);
return 0;
}
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index cf0005d2a4d0..11ed2029985f 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -237,6 +237,7 @@ struct net_bridge_port {
#ifdef CONFIG_BRIDGE_VLAN_FILTERING
struct net_bridge_vlan_group __rcu *vlgrp;
#endif
+ struct net_bridge_port __rcu *backup_port;
/* STP */
u8 priority;
@@ -281,6 +282,7 @@ struct net_bridge_port {
int offload_fwd_mark;
#endif
u16 group_fwd_mask;
+ u16 backup_redirected_cnt;
};
#define kobj_to_brport(obj) container_of(obj, struct net_bridge_port, kobj)
@@ -597,6 +599,7 @@ netdev_features_t br_features_recompute(struct net_bridge
*br,
netdev_features_t features);
void br_port_flags_change(struct net_bridge_port *port, unsigned long mask);
void br_manage_promisc(struct net_bridge *br);
+int nbp_backup_change(struct net_bridge_port *p, struct net_device
*backup_dev);
/* br_input.c */
int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff
*skb);
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 4ac940067754..7c87a2fe5248 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -191,6 +191,38 @@ static int store_group_fwd_mask(struct net_bridge_port *p,
static BRPORT_ATTR(group_fwd_mask, 0644, show_group_fwd_mask,
store_group_fwd_mask);
+static ssize_t show_backup_port(struct net_bridge_port *p, char *buf)
+{
+ struct net_bridge_port *backup_p;
+ int ret = 0;
+
+ rcu_read_lock();
+ backup_p = rcu_dereference(p->backup_port);
+ if (backup_p)
+ ret = sprintf(buf, "%s\n", backup_p->dev->name);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static int store_backup_port(struct net_bridge_port *p, char *buf)
+{
+ struct net_device *backup_dev = NULL;
+ char *nl = strchr(buf, '\n');
+
+ if (nl)
+ *nl = '\0';
+
+ if (strlen(buf) > 0) {
+ backup_dev = __dev_get_by_name(dev_net(p->dev), buf);
+ if (!backup_dev)
+ return -ENOENT;
+ }
+
+ return nbp_backup_change(p, backup_dev);
+}
+static BRPORT_ATTR_RAW(backup_port, 0644, show_backup_port, store_backup_port);
+
BRPORT_ATTR_FLAG(hairpin_mode, BR_HAIRPIN_MODE);
BRPORT_ATTR_FLAG(bpdu_guard, BR_BPDU_GUARD);
BRPORT_ATTR_FLAG(root_block, BR_ROOT_BLOCK);
@@ -254,6 +286,7 @@ static const struct brport_attribute *brport_attrs[] = {
&brport_attr_group_fwd_mask,
&brport_attr_neigh_suppress,
&brport_attr_isolated,
+ &brport_attr_backup_port,
NULL
};
--
2.11.0
David Miller
2018-Jul-23 16:32 UTC
[Bridge] [PATCH net-next v3 0/2] net: bridge: add support for backup port
From: Nikolay Aleksandrov <nikolay at cumulusnetworks.com> Date: Mon, 23 Jul 2018 11:16:57 +0300> This set introduces a new bridge port option that allows any port to have > any other port (in the same bridge of course) as its backup and traffic > will be forwarded to the backup port when the primary goes down. This is > mainly used in MLAG and EVPN setups where we have peerlink path which is > a backup of many (or even all) ports and is a participating bridge port > itself. There's more detailed information in patch 02. Patch 01 just > prepares the port sysfs code for options that take raw value. The main > issues that this set solves are scalability and fallback latency.... Series applied, thanks Nikolay.