Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 00/19] bridge: mcast: Extensions for EVPN
Posting as RFC to show the whole picture. Will split into smaller submissions for v1 and add selftests. tl;dr ==== This patchset creates feature parity between user space and the kernel and allows the former to install and replace MDB port group entries with a source list and associated filter mode. This is required for EVPN use cases where multicast state is not derived from snooped IGMP/MLD packets, but instead derived from EVPN routes exchanged by the control plane in user space. Background ========= IGMPv3 [1] and MLDv2 [2] differ from earlier versions of the protocols in that they add support for source-specific multicast. That is, hosts can advertise interest in listening to a particular multicast address only from specific source addresses or from all sources except for specific source addresses. In kernel 5.10 [3][4], the bridge driver gained the ability to snoop IGMPv3/MLDv2 packets and install corresponding MDB port group entries. For example, a snooped IGMPv3 Membership Report that contains a single MODE_IS_EXCLUDE record for group 239.10.10.10 with sources 192.0.2.1, 192.0.2.2, 192.0.2.20 and 192.0.2.21 would trigger the creation of these entries: # bridge -d mdb show dev br0 port veth1 grp 239.10.10.10 src 192.0.2.21 temp filter_mode include proto kernel blocked dev br0 port veth1 grp 239.10.10.10 src 192.0.2.20 temp filter_mode include proto kernel blocked dev br0 port veth1 grp 239.10.10.10 src 192.0.2.2 temp filter_mode include proto kernel blocked dev br0 port veth1 grp 239.10.10.10 src 192.0.2.1 temp filter_mode include proto kernel blocked dev br0 port veth1 grp 239.10.10.10 temp filter_mode exclude source_list 192.0.2.21/0.00,192.0.2.20/0.00,192.0.2.2/0.00,192.0.2.1/0.00 proto kernel While the kernel can install and replace entries with a filter mode and source list, user space cannot. It can only add EXCLUDE entries with an empty source list, which is sufficient for IGMPv2/MLDv1, but not for IGMPv3/MLDv2. Use cases where the multicast state is not derived from snooped packets, but instead derived from routes exchanged by the user space control plane require feature parity between user space and the kernel in terms of MDB configuration. Such a use case is detailed in the next section. Motivation ========= RFC 7432 [5] defines a "MAC/IP Advertisement route" (type 2) [6] that allows NVE switches in the EVPN network to advertise and learn reachability information for unicast MAC addresses. Traffic destined to a unicast MAC address can therefore be selectively forwarded to a single NVE switch behind which the MAC is located. The same is not true for IP multicast traffic. Such traffic is simply flooded as BUM to all NVE switches in the broadcast domain (BD), regardless if a switch has interested receivers for the multicast stream or not. This is especially problematic for overlay networks that make heavy use of multicast. The issue is addressed by RFC 9251 [7] that defines a "Selective Multicast Ethernet Tag Route" (type 6) [8] which allows NVE switches in the EVPN network to advertise multicast streams that they are interested in. This is done by having each switch suppress IGMP/MLD packets from being transmitted to the NVE network and instead communicate the information over BGP to other switches. As far as the bridge driver is concerned, the above means that the multicast state (i.e., {multicast address, group timer, filter-mode, (source records)}) for the VXLAN bridge port is not populated by the kernel from snooped IGMP/MLD packets (they are suppressed), but instead by user space. Specifically, by the routing daemon that is exchanging EVPN routes with other NVE switches. Changes are obviously also required in the VXLAN driver, but they are the subject of future patchsets. See the "Future work" section. Implementation ============= The user interface is extended to allow user space to specify the filter mode of the MDB port group entry and its source list. Replace support is also added so that user space would not need to remove an entry and re-add it only to edit its source list or filter mode, as that would result in packet loss. Example usage: # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent \ source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra # bridge -d -s mdb show dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra blocked 0.00 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra blocked 0.00 dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra 0.00 The netlink interface is extended with a few new attributes in the RTM_NEWMDB request message: [ struct nlmsghdr ] [ struct br_port_msg ] [ MDBA_SET_ENTRY ] struct br_mdb_entry [ MDBA_SET_ENTRY_ATTRS ] [ MDBE_ATTR_SOURCE ] struct in_addr / struct in6_addr [ MDBE_ATTR_SRC_LIST ] // new [ MDBE_SRC_LIST_ENTRY ] [ MDBE_SRCATTR_ADDRESS ] struct in_addr / struct in6_addr [ ...] [ MDBE_ATTR_GROUP_MODE ] // new u8 [ MDBE_ATTR_RTPORT ] // new u8 No changes are required in RTM_NEWMDB responses and notifications, as all the information can already be dumped by the kernel today. Testing ====== Tested with existing bridge multicast selftests: bridge_igmp.sh, bridge_mdb_port_down.sh, bridge_mdb.sh, bridge_mld.sh, bridge_vlan_mcast.sh. Will add dedicated selftests in v1. Patchset overview ================ Patches #1-#8 are non-functional preparations aimed at making it easier to add additional netlink attributes later on. They create an MDB configuration structure into which netlink messages are parsed into. The structure is then passed in the entry creation / deletion call chain instead of passing the netlink attributes themselves. The same pattern is used by other rtnetlink objects such as routes and nexthops. I initially tried to extend the current code, but it proved to be too difficult, which is why I decided to refactor it to the extensible and familiar pattern used by other rtnetlink objects. The plan is to submit these patches separately for v1. Patches #9-#15 are an additional set of non-functional preparations for the core changes in the last patches. Patches #16-#17 allow user space to install (*, G) entries with a source list and associated filter mode. Specifically, patch #16 adds the necessary kernel plumbing and patch #17 exposes the new functionality to user space via a few new attributes. Patch #18 allows user space to specify the routing protocol of new MDB port group entries so that a routing daemon could differentiate between entries installed by it and those installed by an administrator. Patch #19 allows user space to replace MDB port group entries. This is useful, for example, when user space wants to add a new source to a source list. Instead of deleting a (*, G) entry and re-adding it with an extended source list (which would result in packet loss), user space can simply replace the current entry. Future work ========== The VXLAN driver will need to be extended with an MDB so that it could selectively forward IP multicast traffic to NVE switches with interested receivers instead of simply flooding it to all switches as BUM. The idea is to reuse the existing MDB interface for the VXLAN driver in a similar way to how the FDB interface is shared between the bridge and VXLAN drivers.>From command line perspective, configuration will look as follows:# bridge mdb add dev br0 port vxlan0 grp 239.1.1.1 permanent \ filter_mode exclude source_list 198.50.100.1,198.50.100.2 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \ filter_mode include source_list 198.50.100.3,198.50.100.4 \ dst 192.0.2.1 dst_port 4789 src_vni 2 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \ filter_mode exclude source_list 198.50.100.1,198.50.100.2 \ dst 192.0.2.2 dst_port 4789 src_vni 2 Where the first command is enabled by this set, but the next two will be the subject of future work.>From netlink perspective, the existing PF_BRIDGE/RTM_*MDB messages willbe extended to the VXLAN driver. This means that a few new attributes will be added (e.g., 'MDBE_ATTR_SRC_VNI') and that the handlers for these messages will need to move to net/core/rtnetlink.c. The rtnetlink code will call into the appropriate driver based on the ifindex specified in the ancillary header. iproute2 patches can be found here [9]. [1] https://datatracker.ietf.org/doc/html/rfc3376 [2] https://www.rfc-editor.org/rfc/rfc3810 [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6af52ae2ed14a6bc756d5606b29097dfd76740b8 [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68d4fd30c83b1b208e08c954cd45e6474b148c87 [5] https://datatracker.ietf.org/doc/html/rfc7432 [6] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2 [7] https://datatracker.ietf.org/doc/html/rfc9251 [8] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1 [9] https://github.com/idosch/iproute2/commits/submit/mdb_rfc_v1 Ido Schimmel (19): bridge: mcast: Centralize netlink attribute parsing bridge: mcast: Remove redundant checks bridge: mcast: Use MDB configuration structure where possible bridge: mcast: Propagate MDB configuration structure further bridge: mcast: Use MDB group key from configuration structure bridge: mcast: Remove br_mdb_parse() bridge: mcast: Move checks out of critical section bridge: mcast: Remove redundant function arguments bridge: mcast: Do not derive entry type from its filter mode bridge: mcast: Split (*, G) and (S, G) addition into different functions bridge: mcast: Place netlink policy before validation functions bridge: mcast: Add a centralized error path bridge: mcast: Expose br_multicast_new_group_src() bridge: mcast: Add a flag for user installed source entries bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source bridge: mcast: Add support for (*, G) with a source list and filter mode bridge: mcast: Allow user space to add (*, G) with a source list and filter mode bridge: mcast: Allow user space to specify MDB entry routing protocol bridge: mcast: Support replacement of MDB port group entries include/uapi/linux/if_bridge.h | 21 + net/bridge/br_mdb.c | 821 ++++++++++++++++++++++++--------- net/bridge/br_multicast.c | 5 +- net/bridge/br_private.h | 21 + 4 files changed, 649 insertions(+), 219 deletions(-) -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 01/19] bridge: mcast: Centralize netlink attribute parsing
Netlink attributes are currently passed deep in the MDB creation call chain, making it difficult to add new attributes. In addition, some validity checks are performed under the multicast lock although they can be performed before it is ever acquired. As a first step towards solving these issues, parse the RTM_{NEW,DEL}MDB messages into a configuration structure, relieving other functions from the need to handle raw netlink attributes. Subsequent patches will convert the MDB code to use this configuration structure. This is consistent with how other rtnetlink objects are handled, such as routes and nexthops. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 120 ++++++++++++++++++++++++++++++++++++++++ net/bridge/br_private.h | 7 +++ 2 files changed, 127 insertions(+) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 321be94c445a..c53050e47a0f 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -974,6 +974,116 @@ static int __br_mdb_add(struct net *net, struct net_bridge *br, return ret; } +static int br_mdb_config_attrs_init(struct nlattr *set_attrs, + struct br_mdb_config *cfg, + struct netlink_ext_ack *extack) +{ + struct nlattr *mdb_attrs[MDBE_ATTR_MAX + 1]; + int err; + + err = nla_parse_nested(mdb_attrs, MDBE_ATTR_MAX, set_attrs, + br_mdbe_attrs_pol, extack); + if (err) + return err; + + if (mdb_attrs[MDBE_ATTR_SOURCE] && + !is_valid_mdb_source(mdb_attrs[MDBE_ATTR_SOURCE], + cfg->entry->addr.proto, extack)) + return -EINVAL; + + __mdb_entry_to_br_ip(cfg->entry, &cfg->group, mdb_attrs); + + return 0; +} + +static int br_mdb_config_init(struct net *net, struct sk_buff *skb, + struct nlmsghdr *nlh, struct br_mdb_config *cfg, + struct netlink_ext_ack *extack) +{ + struct nlattr *tb[MDBA_SET_ENTRY_MAX + 1]; + struct br_port_msg *bpm; + struct net_device *dev; + int err; + + err = nlmsg_parse_deprecated(nlh, sizeof(*bpm), tb, + MDBA_SET_ENTRY_MAX, NULL, extack); + if (err) + return err; + + memset(cfg, 0, sizeof(*cfg)); + + bpm = nlmsg_data(nlh); + if (!bpm->ifindex) { + NL_SET_ERR_MSG_MOD(extack, "Invalid bridge ifindex"); + return -EINVAL; + } + + dev = __dev_get_by_index(net, bpm->ifindex); + if (!dev) { + NL_SET_ERR_MSG_MOD(extack, "Bridge device doesn't exist"); + return -ENODEV; + } + + if (!netif_is_bridge_master(dev)) { + NL_SET_ERR_MSG_MOD(extack, "Device is not a bridge"); + return -EOPNOTSUPP; + } + + cfg->br = netdev_priv(dev); + + if (!netif_running(cfg->br->dev)) { + NL_SET_ERR_MSG_MOD(extack, "Bridge device is not running"); + return -EINVAL; + } + + if (!br_opt_get(cfg->br, BROPT_MULTICAST_ENABLED)) { + NL_SET_ERR_MSG_MOD(extack, "Bridge's multicast processing is disabled"); + return -EINVAL; + } + + if (NL_REQ_ATTR_CHECK(extack, NULL, tb, MDBA_SET_ENTRY)) { + NL_SET_ERR_MSG_MOD(extack, "Missing MDBA_SET_ENTRY attribute"); + return -EINVAL; + } + if (nla_len(tb[MDBA_SET_ENTRY]) != sizeof(struct br_mdb_entry)) { + NL_SET_ERR_MSG_MOD(extack, "Invalid MDBA_SET_ENTRY attribute length"); + return -EINVAL; + } + + cfg->entry = nla_data(tb[MDBA_SET_ENTRY]); + if (!is_valid_mdb_entry(cfg->entry, extack)) + return -EINVAL; + + if (cfg->entry->ifindex != cfg->br->dev->ifindex) { + struct net_device *pdev; + + pdev = __dev_get_by_index(net, cfg->entry->ifindex); + if (!pdev) { + NL_SET_ERR_MSG_MOD(extack, "Port net device doesn't exist"); + return -ENODEV; + } + + cfg->p = br_port_get_rtnl(pdev); + if (!cfg->p) { + NL_SET_ERR_MSG_MOD(extack, "Net device is not a bridge port"); + return -EINVAL; + } + + if (cfg->p->br != cfg->br) { + NL_SET_ERR_MSG_MOD(extack, "Port belongs to a different bridge device"); + return -EINVAL; + } + } + + if (tb[MDBA_SET_ENTRY_ATTRS]) + return br_mdb_config_attrs_init(tb[MDBA_SET_ENTRY_ATTRS], cfg, + extack); + else + __mdb_entry_to_br_ip(cfg->entry, &cfg->group, NULL); + + return 0; +} + static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack) { @@ -984,9 +1094,14 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, struct net_device *dev, *pdev; struct br_mdb_entry *entry; struct net_bridge_vlan *v; + struct br_mdb_config cfg; struct net_bridge *br; int err; + err = br_mdb_config_init(net, skb, nlh, &cfg, extack); + if (err) + return err; + err = br_mdb_parse(skb, nlh, &dev, &entry, mdb_attrs, extack); if (err < 0) return err; @@ -1101,9 +1216,14 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, struct net_device *dev, *pdev; struct br_mdb_entry *entry; struct net_bridge_vlan *v; + struct br_mdb_config cfg; struct net_bridge *br; int err; + err = br_mdb_config_init(net, skb, nlh, &cfg, extack); + if (err) + return err; + err = br_mdb_parse(skb, nlh, &dev, &entry, mdb_attrs, extack); if (err < 0) return err; diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 06e5f6faa431..278a18e88e42 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -92,6 +92,13 @@ struct bridge_mcast_stats { struct br_mcast_stats mstats; struct u64_stats_sync syncp; }; + +struct br_mdb_config { + struct net_bridge *br; + struct net_bridge_port *p; + struct br_mdb_entry *entry; + struct br_ip group; +}; #endif /* net_bridge_mcast_port must be always defined due to forwarding stubs */ -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 02/19] bridge: mcast: Remove redundant checks
These checks are now redundant as they are performed by br_mdb_config_init() while parsing the RTM_{NEW,DEL}MDB messages. Remove them. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 63 +++++++-------------------------------------- 1 file changed, 9 insertions(+), 54 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index c53050e47a0f..68fd34161a40 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -1090,11 +1090,10 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, struct nlattr *mdb_attrs[MDBE_ATTR_MAX + 1]; struct net *net = sock_net(skb->sk); struct net_bridge_vlan_group *vg; - struct net_bridge_port *p = NULL; - struct net_device *dev, *pdev; struct br_mdb_entry *entry; struct net_bridge_vlan *v; struct br_mdb_config cfg; + struct net_device *dev; struct net_bridge *br; int err; @@ -1108,38 +1107,12 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, br = netdev_priv(dev); - if (!netif_running(br->dev)) { - NL_SET_ERR_MSG_MOD(extack, "Bridge device is not running"); - return -EINVAL; - } - - if (!br_opt_get(br, BROPT_MULTICAST_ENABLED)) { - NL_SET_ERR_MSG_MOD(extack, "Bridge's multicast processing is disabled"); - return -EINVAL; - } - if (entry->ifindex != br->dev->ifindex) { - pdev = __dev_get_by_index(net, entry->ifindex); - if (!pdev) { - NL_SET_ERR_MSG_MOD(extack, "Port net device doesn't exist"); - return -ENODEV; - } - - p = br_port_get_rtnl(pdev); - if (!p) { - NL_SET_ERR_MSG_MOD(extack, "Net device is not a bridge port"); - return -EINVAL; - } - - if (p->br != br) { - NL_SET_ERR_MSG_MOD(extack, "Port belongs to a different bridge device"); - return -EINVAL; - } - if (p->state == BR_STATE_DISABLED && entry->state != MDB_PERMANENT) { + if (cfg.p->state == BR_STATE_DISABLED && entry->state != MDB_PERMANENT) { NL_SET_ERR_MSG_MOD(extack, "Port is in disabled state and entry is not permanent"); return -EINVAL; } - vg = nbp_vlan_group(p); + vg = nbp_vlan_group(cfg.p); } else { vg = br_vlan_group(br); } @@ -1150,12 +1123,12 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, if (br_vlan_enabled(br->dev) && vg && entry->vid == 0) { list_for_each_entry(v, &vg->vlan_list, vlist) { entry->vid = v->vid; - err = __br_mdb_add(net, br, p, entry, mdb_attrs, extack); + err = __br_mdb_add(net, br, cfg.p, entry, mdb_attrs, extack); if (err) break; } } else { - err = __br_mdb_add(net, br, p, entry, mdb_attrs, extack); + err = __br_mdb_add(net, br, cfg.p, entry, mdb_attrs, extack); } return err; @@ -1170,9 +1143,6 @@ static int __br_mdb_del(struct net_bridge *br, struct br_mdb_entry *entry, struct br_ip ip; int err = -EINVAL; - if (!netif_running(br->dev) || !br_opt_get(br, BROPT_MULTICAST_ENABLED)) - return -EINVAL; - __mdb_entry_to_br_ip(entry, &ip, mdb_attrs); spin_lock_bh(&br->multicast_lock); @@ -1212,11 +1182,10 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, struct nlattr *mdb_attrs[MDBE_ATTR_MAX + 1]; struct net *net = sock_net(skb->sk); struct net_bridge_vlan_group *vg; - struct net_bridge_port *p = NULL; - struct net_device *dev, *pdev; struct br_mdb_entry *entry; struct net_bridge_vlan *v; struct br_mdb_config cfg; + struct net_device *dev; struct net_bridge *br; int err; @@ -1230,24 +1199,10 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, br = netdev_priv(dev); - if (entry->ifindex != br->dev->ifindex) { - pdev = __dev_get_by_index(net, entry->ifindex); - if (!pdev) - return -ENODEV; - - p = br_port_get_rtnl(pdev); - if (!p) { - NL_SET_ERR_MSG_MOD(extack, "Net device is not a bridge port"); - return -EINVAL; - } - if (p->br != br) { - NL_SET_ERR_MSG_MOD(extack, "Port belongs to a different bridge device"); - return -EINVAL; - } - vg = nbp_vlan_group(p); - } else { + if (entry->ifindex != br->dev->ifindex) + vg = nbp_vlan_group(cfg.p); + else vg = br_vlan_group(br); - } /* If vlan filtering is enabled and VLAN is not specified * delete mdb entry on all vlans configured on the port. -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 03/19] bridge: mcast: Use MDB configuration structure where possible
The MDB configuration structure (i.e., struct br_mdb_config) now includes all the necessary information from the parsed RTM_{NEW,DEL}MDB netlink messages, so use it. This will later allow us to delete the calls to br_mdb_parse() from br_mdb_add() and br_mdb_del(). No functional changes intended. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 34 +++++++++++++++------------------- 1 file changed, 15 insertions(+), 19 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 68fd34161a40..cdc71516a51b 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -1094,7 +1094,6 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, struct net_bridge_vlan *v; struct br_mdb_config cfg; struct net_device *dev; - struct net_bridge *br; int err; err = br_mdb_config_init(net, skb, nlh, &cfg, extack); @@ -1105,30 +1104,30 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, if (err < 0) return err; - br = netdev_priv(dev); - - if (entry->ifindex != br->dev->ifindex) { - if (cfg.p->state == BR_STATE_DISABLED && entry->state != MDB_PERMANENT) { + if (cfg.p) { + if (cfg.p->state == BR_STATE_DISABLED && cfg.entry->state != MDB_PERMANENT) { NL_SET_ERR_MSG_MOD(extack, "Port is in disabled state and entry is not permanent"); return -EINVAL; } vg = nbp_vlan_group(cfg.p); } else { - vg = br_vlan_group(br); + vg = br_vlan_group(cfg.br); } /* If vlan filtering is enabled and VLAN is not specified * install mdb entry on all vlans configured on the port. */ - if (br_vlan_enabled(br->dev) && vg && entry->vid == 0) { + if (br_vlan_enabled(cfg.br->dev) && vg && cfg.entry->vid == 0) { list_for_each_entry(v, &vg->vlan_list, vlist) { - entry->vid = v->vid; - err = __br_mdb_add(net, br, cfg.p, entry, mdb_attrs, extack); + cfg.entry->vid = v->vid; + err = __br_mdb_add(net, cfg.br, cfg.p, cfg.entry, + mdb_attrs, extack); if (err) break; } } else { - err = __br_mdb_add(net, br, cfg.p, entry, mdb_attrs, extack); + err = __br_mdb_add(net, cfg.br, cfg.p, cfg.entry, mdb_attrs, + extack); } return err; @@ -1186,7 +1185,6 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, struct net_bridge_vlan *v; struct br_mdb_config cfg; struct net_device *dev; - struct net_bridge *br; int err; err = br_mdb_config_init(net, skb, nlh, &cfg, extack); @@ -1197,23 +1195,21 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, if (err < 0) return err; - br = netdev_priv(dev); - - if (entry->ifindex != br->dev->ifindex) + if (cfg.p) vg = nbp_vlan_group(cfg.p); else - vg = br_vlan_group(br); + vg = br_vlan_group(cfg.br); /* If vlan filtering is enabled and VLAN is not specified * delete mdb entry on all vlans configured on the port. */ - if (br_vlan_enabled(br->dev) && vg && entry->vid == 0) { + if (br_vlan_enabled(cfg.br->dev) && vg && cfg.entry->vid == 0) { list_for_each_entry(v, &vg->vlan_list, vlist) { - entry->vid = v->vid; - err = __br_mdb_del(br, entry, mdb_attrs); + cfg.entry->vid = v->vid; + err = __br_mdb_del(cfg.br, cfg.entry, mdb_attrs); } } else { - err = __br_mdb_del(br, entry, mdb_attrs); + err = __br_mdb_del(cfg.br, cfg.entry, mdb_attrs); } return err; -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 04/19] bridge: mcast: Propagate MDB configuration structure further
As an intermediate step towards only using the new MDB configuration structure, pass it further in the control path instead of passing individual attributes. No functional changes intended. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index cdc71516a51b..2f9b192500a3 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -959,17 +959,15 @@ static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port, return 0; } -static int __br_mdb_add(struct net *net, struct net_bridge *br, - struct net_bridge_port *p, - struct br_mdb_entry *entry, +static int __br_mdb_add(struct br_mdb_config *cfg, struct nlattr **mdb_attrs, struct netlink_ext_ack *extack) { int ret; - spin_lock_bh(&br->multicast_lock); - ret = br_mdb_add_group(br, p, entry, mdb_attrs, extack); - spin_unlock_bh(&br->multicast_lock); + spin_lock_bh(&cfg->br->multicast_lock); + ret = br_mdb_add_group(cfg->br, cfg->p, cfg->entry, mdb_attrs, extack); + spin_unlock_bh(&cfg->br->multicast_lock); return ret; } @@ -1120,22 +1118,22 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, if (br_vlan_enabled(cfg.br->dev) && vg && cfg.entry->vid == 0) { list_for_each_entry(v, &vg->vlan_list, vlist) { cfg.entry->vid = v->vid; - err = __br_mdb_add(net, cfg.br, cfg.p, cfg.entry, - mdb_attrs, extack); + err = __br_mdb_add(&cfg, mdb_attrs, extack); if (err) break; } } else { - err = __br_mdb_add(net, cfg.br, cfg.p, cfg.entry, mdb_attrs, - extack); + err = __br_mdb_add(&cfg, mdb_attrs, extack); } return err; } -static int __br_mdb_del(struct net_bridge *br, struct br_mdb_entry *entry, +static int __br_mdb_del(struct br_mdb_config *cfg, struct nlattr **mdb_attrs) { + struct br_mdb_entry *entry = cfg->entry; + struct net_bridge *br = cfg->br; struct net_bridge_mdb_entry *mp; struct net_bridge_port_group *p; struct net_bridge_port_group __rcu **pp; @@ -1206,10 +1204,10 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, if (br_vlan_enabled(cfg.br->dev) && vg && cfg.entry->vid == 0) { list_for_each_entry(v, &vg->vlan_list, vlist) { cfg.entry->vid = v->vid; - err = __br_mdb_del(cfg.br, cfg.entry, mdb_attrs); + err = __br_mdb_del(&cfg, mdb_attrs); } } else { - err = __br_mdb_del(cfg.br, cfg.entry, mdb_attrs); + err = __br_mdb_del(&cfg, mdb_attrs); } return err; -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 05/19] bridge: mcast: Use MDB group key from configuration structure
The MDB group key (i.e., {source, destination, protocol, VID}) is currently determined under the multicast lock from the netlink attributes. Instead, use the group key from the MDB configuration structure that was prepared before acquiring the lock. No functional changes intended. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 2f9b192500a3..cb4fd27f118f 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -855,20 +855,19 @@ __br_mdb_choose_context(struct net_bridge *br, static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port, struct br_mdb_entry *entry, - struct nlattr **mdb_attrs, + struct br_mdb_config *cfg, struct netlink_ext_ack *extack) { struct net_bridge_mdb_entry *mp, *star_mp; struct net_bridge_port_group __rcu **pp; struct net_bridge_port_group *p; struct net_bridge_mcast *brmctx; - struct br_ip group, star_group; + struct br_ip group = cfg->group; unsigned long now = jiffies; unsigned char flags = 0; + struct br_ip star_group; u8 filter_mode; - __mdb_entry_to_br_ip(entry, &group, mdb_attrs); - brmctx = __br_mdb_choose_context(br, entry, extack); if (!brmctx) return -EINVAL; @@ -966,7 +965,7 @@ static int __br_mdb_add(struct br_mdb_config *cfg, int ret; spin_lock_bh(&cfg->br->multicast_lock); - ret = br_mdb_add_group(cfg->br, cfg->p, cfg->entry, mdb_attrs, extack); + ret = br_mdb_add_group(cfg->br, cfg->p, cfg->entry, cfg, extack); spin_unlock_bh(&cfg->br->multicast_lock); return ret; @@ -1118,6 +1117,7 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, if (br_vlan_enabled(cfg.br->dev) && vg && cfg.entry->vid == 0) { list_for_each_entry(v, &vg->vlan_list, vlist) { cfg.entry->vid = v->vid; + cfg.group.vid = v->vid; err = __br_mdb_add(&cfg, mdb_attrs, extack); if (err) break; @@ -1137,11 +1137,9 @@ static int __br_mdb_del(struct br_mdb_config *cfg, struct net_bridge_mdb_entry *mp; struct net_bridge_port_group *p; struct net_bridge_port_group __rcu **pp; - struct br_ip ip; + struct br_ip ip = cfg->group; int err = -EINVAL; - __mdb_entry_to_br_ip(entry, &ip, mdb_attrs); - spin_lock_bh(&br->multicast_lock); mp = br_mdb_ip_get(br, &ip); if (!mp) @@ -1204,6 +1202,7 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, if (br_vlan_enabled(cfg.br->dev) && vg && cfg.entry->vid == 0) { list_for_each_entry(v, &vg->vlan_list, vlist) { cfg.entry->vid = v->vid; + cfg.group.vid = v->vid; err = __br_mdb_del(&cfg, mdb_attrs); } } else { -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 06/19] bridge: mcast: Remove br_mdb_parse()
The parsing of the netlink messages and the validity checks are now performed in br_mdb_config_init() so we can remove br_mdb_parse(). This finally allows us to stop passing netlink attributes deep in the MDB control path and only use the MDB configuration structure. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 93 +++------------------------------------------ 1 file changed, 5 insertions(+), 88 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index cb4fd27f118f..67b6bc7272d3 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -754,73 +754,6 @@ static const struct nla_policy br_mdbe_attrs_pol[MDBE_ATTR_MAX + 1] = { sizeof(struct in6_addr)), }; -static int br_mdb_parse(struct sk_buff *skb, struct nlmsghdr *nlh, - struct net_device **pdev, struct br_mdb_entry **pentry, - struct nlattr **mdb_attrs, struct netlink_ext_ack *extack) -{ - struct net *net = sock_net(skb->sk); - struct br_mdb_entry *entry; - struct br_port_msg *bpm; - struct nlattr *tb[MDBA_SET_ENTRY_MAX+1]; - struct net_device *dev; - int err; - - err = nlmsg_parse_deprecated(nlh, sizeof(*bpm), tb, - MDBA_SET_ENTRY_MAX, NULL, NULL); - if (err < 0) - return err; - - bpm = nlmsg_data(nlh); - if (bpm->ifindex == 0) { - NL_SET_ERR_MSG_MOD(extack, "Invalid bridge ifindex"); - return -EINVAL; - } - - dev = __dev_get_by_index(net, bpm->ifindex); - if (dev == NULL) { - NL_SET_ERR_MSG_MOD(extack, "Bridge device doesn't exist"); - return -ENODEV; - } - - if (!netif_is_bridge_master(dev)) { - NL_SET_ERR_MSG_MOD(extack, "Device is not a bridge"); - return -EOPNOTSUPP; - } - - *pdev = dev; - - if (!tb[MDBA_SET_ENTRY]) { - NL_SET_ERR_MSG_MOD(extack, "Missing MDBA_SET_ENTRY attribute"); - return -EINVAL; - } - if (nla_len(tb[MDBA_SET_ENTRY]) != sizeof(struct br_mdb_entry)) { - NL_SET_ERR_MSG_MOD(extack, "Invalid MDBA_SET_ENTRY attribute length"); - return -EINVAL; - } - - entry = nla_data(tb[MDBA_SET_ENTRY]); - if (!is_valid_mdb_entry(entry, extack)) - return -EINVAL; - *pentry = entry; - - if (tb[MDBA_SET_ENTRY_ATTRS]) { - err = nla_parse_nested(mdb_attrs, MDBE_ATTR_MAX, - tb[MDBA_SET_ENTRY_ATTRS], - br_mdbe_attrs_pol, extack); - if (err) - return err; - if (mdb_attrs[MDBE_ATTR_SOURCE] && - !is_valid_mdb_source(mdb_attrs[MDBE_ATTR_SOURCE], - entry->addr.proto, extack)) - return -EINVAL; - } else { - memset(mdb_attrs, 0, - sizeof(struct nlattr *) * (MDBE_ATTR_MAX + 1)); - } - - return 0; -} - static struct net_bridge_mcast * __br_mdb_choose_context(struct net_bridge *br, const struct br_mdb_entry *entry, @@ -959,7 +892,6 @@ static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port, } static int __br_mdb_add(struct br_mdb_config *cfg, - struct nlattr **mdb_attrs, struct netlink_ext_ack *extack) { int ret; @@ -1084,23 +1016,16 @@ static int br_mdb_config_init(struct net *net, struct sk_buff *skb, static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack) { - struct nlattr *mdb_attrs[MDBE_ATTR_MAX + 1]; struct net *net = sock_net(skb->sk); struct net_bridge_vlan_group *vg; - struct br_mdb_entry *entry; struct net_bridge_vlan *v; struct br_mdb_config cfg; - struct net_device *dev; int err; err = br_mdb_config_init(net, skb, nlh, &cfg, extack); if (err) return err; - err = br_mdb_parse(skb, nlh, &dev, &entry, mdb_attrs, extack); - if (err < 0) - return err; - if (cfg.p) { if (cfg.p->state == BR_STATE_DISABLED && cfg.entry->state != MDB_PERMANENT) { NL_SET_ERR_MSG_MOD(extack, "Port is in disabled state and entry is not permanent"); @@ -1118,19 +1043,18 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, list_for_each_entry(v, &vg->vlan_list, vlist) { cfg.entry->vid = v->vid; cfg.group.vid = v->vid; - err = __br_mdb_add(&cfg, mdb_attrs, extack); + err = __br_mdb_add(&cfg, extack); if (err) break; } } else { - err = __br_mdb_add(&cfg, mdb_attrs, extack); + err = __br_mdb_add(&cfg, extack); } return err; } -static int __br_mdb_del(struct br_mdb_config *cfg, - struct nlattr **mdb_attrs) +static int __br_mdb_del(struct br_mdb_config *cfg) { struct br_mdb_entry *entry = cfg->entry; struct net_bridge *br = cfg->br; @@ -1174,23 +1098,16 @@ static int __br_mdb_del(struct br_mdb_config *cfg, static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack) { - struct nlattr *mdb_attrs[MDBE_ATTR_MAX + 1]; struct net *net = sock_net(skb->sk); struct net_bridge_vlan_group *vg; - struct br_mdb_entry *entry; struct net_bridge_vlan *v; struct br_mdb_config cfg; - struct net_device *dev; int err; err = br_mdb_config_init(net, skb, nlh, &cfg, extack); if (err) return err; - err = br_mdb_parse(skb, nlh, &dev, &entry, mdb_attrs, extack); - if (err < 0) - return err; - if (cfg.p) vg = nbp_vlan_group(cfg.p); else @@ -1203,10 +1120,10 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, list_for_each_entry(v, &vg->vlan_list, vlist) { cfg.entry->vid = v->vid; cfg.group.vid = v->vid; - err = __br_mdb_del(&cfg, mdb_attrs); + err = __br_mdb_del(&cfg); } } else { - err = __br_mdb_del(&cfg, mdb_attrs); + err = __br_mdb_del(&cfg); } return err; -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 07/19] bridge: mcast: Move checks out of critical section
The checks only require information parsed from the RTM_NEWMDB netlink message and do not rely on any state stored in the bridge driver. Therefore, there is no need to perform the checks in the critical section under the multicast lock. Move the checks out of the critical section. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 67b6bc7272d3..aa5faccf09f8 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -805,24 +805,6 @@ static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port, if (!brmctx) return -EINVAL; - /* host join errors which can happen before creating the group */ - if (!port && !br_group_is_l2(&group)) { - /* don't allow any flags for host-joined IP groups */ - if (entry->state) { - NL_SET_ERR_MSG_MOD(extack, "Flags are not allowed for host groups"); - return -EINVAL; - } - if (!br_multicast_is_star_g(&group)) { - NL_SET_ERR_MSG_MOD(extack, "Groups with sources cannot be manually host joined"); - return -EINVAL; - } - } - - if (br_group_is_l2(&group) && entry->state != MDB_PERMANENT) { - NL_SET_ERR_MSG_MOD(extack, "Only permanent L2 entries allowed"); - return -EINVAL; - } - mp = br_multicast_new_group(br, &group); if (IS_ERR(mp)) return PTR_ERR(mp); @@ -1026,6 +1008,24 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, if (err) return err; + /* host join errors which can happen before creating the group */ + if (!cfg.p && !br_group_is_l2(&cfg.group)) { + /* don't allow any flags for host-joined IP groups */ + if (cfg.entry->state) { + NL_SET_ERR_MSG_MOD(extack, "Flags are not allowed for host groups"); + return -EINVAL; + } + if (!br_multicast_is_star_g(&cfg.group)) { + NL_SET_ERR_MSG_MOD(extack, "Groups with sources cannot be manually host joined"); + return -EINVAL; + } + } + + if (br_group_is_l2(&cfg.group) && cfg.entry->state != MDB_PERMANENT) { + NL_SET_ERR_MSG_MOD(extack, "Only permanent L2 entries allowed"); + return -EINVAL; + } + if (cfg.p) { if (cfg.p->state == BR_STATE_DISABLED && cfg.entry->state != MDB_PERMANENT) { NL_SET_ERR_MSG_MOD(extack, "Port is in disabled state and entry is not permanent"); -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 08/19] bridge: mcast: Remove redundant function arguments
Drop the first three arguments and instead extract them from the MDB configuration structure. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index aa5faccf09f8..850a04177c91 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -786,13 +786,14 @@ __br_mdb_choose_context(struct net_bridge *br, return brmctx; } -static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port, - struct br_mdb_entry *entry, - struct br_mdb_config *cfg, +static int br_mdb_add_group(struct br_mdb_config *cfg, struct netlink_ext_ack *extack) { struct net_bridge_mdb_entry *mp, *star_mp; struct net_bridge_port_group __rcu **pp; + struct br_mdb_entry *entry = cfg->entry; + struct net_bridge_port *port = cfg->p; + struct net_bridge *br = cfg->br; struct net_bridge_port_group *p; struct net_bridge_mcast *brmctx; struct br_ip group = cfg->group; @@ -879,7 +880,7 @@ static int __br_mdb_add(struct br_mdb_config *cfg, int ret; spin_lock_bh(&cfg->br->multicast_lock); - ret = br_mdb_add_group(cfg->br, cfg->p, cfg->entry, cfg, extack); + ret = br_mdb_add_group(cfg, extack); spin_unlock_bh(&cfg->br->multicast_lock); return ret; -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 09/19] bridge: mcast: Do not derive entry type from its filter mode
Currently, the filter mode (i.e., INCLUDE / EXCLUDE) of MDB entries cannot be set from user space. Instead, it is set by the kernel according to the entry type: (*, G) entries are treated as EXCLUDE and (S, G) entries are treated as INCLUDE. This allows the kernel to derive the entry type from its filter mode. Subsequent patches will allow user space to set the filter mode of (*, G) entries, making the current assumption incorrect. As a preparation, remove the current assumption and instead determine the entry type from its key, which is a more direct way. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 850a04177c91..dd56063430ed 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -857,17 +857,14 @@ static int br_mdb_add_group(struct br_mdb_config *cfg, * added to it for proper replication */ if (br_multicast_should_handle_mode(brmctx, group.proto)) { - switch (filter_mode) { - case MCAST_EXCLUDE: - br_multicast_star_g_handle_mode(p, MCAST_EXCLUDE); - break; - case MCAST_INCLUDE: + if (br_multicast_is_star_g(&group)) { + br_multicast_star_g_handle_mode(p, filter_mode); + } else { star_group = p->key.addr; memset(&star_group.src, 0, sizeof(star_group.src)); star_mp = br_mdb_ip_get(br, &star_group); if (star_mp) br_multicast_sg_add_exclude_ports(star_mp, p); - break; } } -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 10/19] bridge: mcast: Split (*, G) and (S, G) addition into different functions
When the bridge is using IGMP version 3 or MLD version 2, it handles the addition of (*, G) and (S, G) entries differently. When a new (S, G) port group entry is added, all the (*, G) EXCLUDE ports need to be added to the port group of the new entry. Similarly, when a new (*, G) EXCLUDE port group entry is added, the port needs to be added to the port group of all the matching (S, G) entries. Subsequent patches will create more differences between both entry types. Namely, filter mode and source list can only be specified for (*, G) entries. Given the current and future differences between both entry types, handle the addition of each entry type in a different function, thereby avoiding the creation of one complex function. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 145 +++++++++++++++++++++++++++++--------------- 1 file changed, 96 insertions(+), 49 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index dd56063430ed..a48eef866974 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -786,21 +786,107 @@ __br_mdb_choose_context(struct net_bridge *br, return brmctx; } +static int br_mdb_add_group_sg(struct br_mdb_config *cfg, + struct net_bridge_mdb_entry *mp, + struct net_bridge_mcast *brmctx, + unsigned char flags, + struct netlink_ext_ack *extack) +{ + struct net_bridge_port_group __rcu **pp; + struct net_bridge_port_group *p; + unsigned long now = jiffies; + + for (pp = &mp->ports; + (p = mlock_dereference(*pp, cfg->br)) != NULL; + pp = &p->next) { + if (p->key.port == cfg->p) { + NL_SET_ERR_MSG_MOD(extack, "(S, G) group is already joined by port"); + return -EEXIST; + } + if ((unsigned long)p->key.port < (unsigned long)cfg->p) + break; + } + + p = br_multicast_new_port_group(cfg->p, &cfg->group, *pp, flags, NULL, + MCAST_INCLUDE, RTPROT_STATIC); + if (unlikely(!p)) { + NL_SET_ERR_MSG_MOD(extack, "Couldn't allocate new (S, G) port group"); + return -ENOMEM; + } + rcu_assign_pointer(*pp, p); + if (!(flags & MDB_PG_FLAGS_PERMANENT)) + mod_timer(&p->timer, + now + brmctx->multicast_membership_interval); + br_mdb_notify(cfg->br->dev, mp, p, RTM_NEWMDB); + + /* All of (*, G) EXCLUDE ports need to be added to the new (S, G) for + * proper replication. + */ + if (br_multicast_should_handle_mode(brmctx, cfg->group.proto)) { + struct net_bridge_mdb_entry *star_mp; + struct br_ip star_group; + + star_group = p->key.addr; + memset(&star_group.src, 0, sizeof(star_group.src)); + star_mp = br_mdb_ip_get(cfg->br, &star_group); + if (star_mp) + br_multicast_sg_add_exclude_ports(star_mp, p); + } + + return 0; +} + +static int br_mdb_add_group_star_g(struct br_mdb_config *cfg, + struct net_bridge_mdb_entry *mp, + struct net_bridge_mcast *brmctx, + unsigned char flags, + struct netlink_ext_ack *extack) +{ + struct net_bridge_port_group __rcu **pp; + struct net_bridge_port_group *p; + unsigned long now = jiffies; + + for (pp = &mp->ports; + (p = mlock_dereference(*pp, cfg->br)) != NULL; + pp = &p->next) { + if (p->key.port == cfg->p) { + NL_SET_ERR_MSG_MOD(extack, "(*, G) group is already joined by port"); + return -EEXIST; + } + if ((unsigned long)p->key.port < (unsigned long)cfg->p) + break; + } + + p = br_multicast_new_port_group(cfg->p, &cfg->group, *pp, flags, NULL, + MCAST_EXCLUDE, RTPROT_STATIC); + if (unlikely(!p)) { + NL_SET_ERR_MSG_MOD(extack, "Couldn't allocate new (*, G) port group"); + return -ENOMEM; + } + rcu_assign_pointer(*pp, p); + if (!(flags & MDB_PG_FLAGS_PERMANENT)) + mod_timer(&p->timer, + now + brmctx->multicast_membership_interval); + br_mdb_notify(cfg->br->dev, mp, p, RTM_NEWMDB); + /* If we are adding a new EXCLUDE port group (*, G), it needs to be + * also added to all (S, G) entries for proper replication. + */ + if (br_multicast_should_handle_mode(brmctx, cfg->group.proto)) + br_multicast_star_g_handle_mode(p, MCAST_EXCLUDE); + + return 0; +} + static int br_mdb_add_group(struct br_mdb_config *cfg, struct netlink_ext_ack *extack) { - struct net_bridge_mdb_entry *mp, *star_mp; - struct net_bridge_port_group __rcu **pp; struct br_mdb_entry *entry = cfg->entry; struct net_bridge_port *port = cfg->p; + struct net_bridge_mdb_entry *mp; struct net_bridge *br = cfg->br; - struct net_bridge_port_group *p; struct net_bridge_mcast *brmctx; struct br_ip group = cfg->group; - unsigned long now = jiffies; unsigned char flags = 0; - struct br_ip star_group; - u8 filter_mode; brmctx = __br_mdb_choose_context(br, entry, extack); if (!brmctx) @@ -823,52 +909,13 @@ static int br_mdb_add_group(struct br_mdb_config *cfg, return 0; } - for (pp = &mp->ports; - (p = mlock_dereference(*pp, br)) != NULL; - pp = &p->next) { - if (p->key.port == port) { - NL_SET_ERR_MSG_MOD(extack, "Group is already joined by port"); - return -EEXIST; - } - if ((unsigned long)p->key.port < (unsigned long)port) - break; - } - - filter_mode = br_multicast_is_star_g(&group) ? MCAST_EXCLUDE : - MCAST_INCLUDE; - if (entry->state == MDB_PERMANENT) flags |= MDB_PG_FLAGS_PERMANENT; - p = br_multicast_new_port_group(port, &group, *pp, flags, NULL, - filter_mode, RTPROT_STATIC); - if (unlikely(!p)) { - NL_SET_ERR_MSG_MOD(extack, "Couldn't allocate new port group"); - return -ENOMEM; - } - rcu_assign_pointer(*pp, p); - if (entry->state == MDB_TEMPORARY) - mod_timer(&p->timer, - now + brmctx->multicast_membership_interval); - br_mdb_notify(br->dev, mp, p, RTM_NEWMDB); - /* if we are adding a new EXCLUDE port group (*,G) it needs to be also - * added to all S,G entries for proper replication, if we are adding - * a new INCLUDE port (S,G) then all of *,G EXCLUDE ports need to be - * added to it for proper replication - */ - if (br_multicast_should_handle_mode(brmctx, group.proto)) { - if (br_multicast_is_star_g(&group)) { - br_multicast_star_g_handle_mode(p, filter_mode); - } else { - star_group = p->key.addr; - memset(&star_group.src, 0, sizeof(star_group.src)); - star_mp = br_mdb_ip_get(br, &star_group); - if (star_mp) - br_multicast_sg_add_exclude_ports(star_mp, p); - } - } - - return 0; + if (br_multicast_is_star_g(&group)) + return br_mdb_add_group_star_g(cfg, mp, brmctx, flags, extack); + else + return br_mdb_add_group_sg(cfg, mp, brmctx, flags, extack); } static int __br_mdb_add(struct br_mdb_config *cfg, -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 11/19] bridge: mcast: Place netlink policy before validation functions
Subsequent patches are going to add additional validation functions and netlink policies. Some of these functions will need to perform parsing using nla_parse_nested() and the new policies. In order to keep all the policies next to each other, move the current policy to before the validation functions. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index a48eef866974..26740df62fd6 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -663,6 +663,12 @@ void br_rtr_notify(struct net_device *dev, struct net_bridge_mcast_port *pmctx, rtnl_set_sk_err(net, RTNLGRP_MDB, err); } +static const struct nla_policy br_mdbe_attrs_pol[MDBE_ATTR_MAX + 1] = { + [MDBE_ATTR_SOURCE] = NLA_POLICY_RANGE(NLA_BINARY, + sizeof(struct in_addr), + sizeof(struct in6_addr)), +}; + static bool is_valid_mdb_entry(struct br_mdb_entry *entry, struct netlink_ext_ack *extack) { @@ -748,12 +754,6 @@ static bool is_valid_mdb_source(struct nlattr *attr, __be16 proto, return true; } -static const struct nla_policy br_mdbe_attrs_pol[MDBE_ATTR_MAX + 1] = { - [MDBE_ATTR_SOURCE] = NLA_POLICY_RANGE(NLA_BINARY, - sizeof(struct in_addr), - sizeof(struct in6_addr)), -}; - static struct net_bridge_mcast * __br_mdb_choose_context(struct net_bridge *br, const struct br_mdb_entry *entry, -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 12/19] bridge: mcast: Add a centralized error path
Subsequent patches will add memory allocations in br_mdb_config_init() as the MDB configuration structure will include a linked list of source entries. This memory will need to be freed regardless if br_mdb_add() succeeded or failed. As a preparation for this change, add a centralized error path where the memory will be freed. Note that br_mdb_del() already has one error path and therefore does not require any changes. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 26740df62fd6..157ba4e765c1 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -1053,28 +1053,29 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, if (err) return err; + err = -EINVAL; /* host join errors which can happen before creating the group */ if (!cfg.p && !br_group_is_l2(&cfg.group)) { /* don't allow any flags for host-joined IP groups */ if (cfg.entry->state) { NL_SET_ERR_MSG_MOD(extack, "Flags are not allowed for host groups"); - return -EINVAL; + goto out; } if (!br_multicast_is_star_g(&cfg.group)) { NL_SET_ERR_MSG_MOD(extack, "Groups with sources cannot be manually host joined"); - return -EINVAL; + goto out; } } if (br_group_is_l2(&cfg.group) && cfg.entry->state != MDB_PERMANENT) { NL_SET_ERR_MSG_MOD(extack, "Only permanent L2 entries allowed"); - return -EINVAL; + goto out; } if (cfg.p) { if (cfg.p->state == BR_STATE_DISABLED && cfg.entry->state != MDB_PERMANENT) { NL_SET_ERR_MSG_MOD(extack, "Port is in disabled state and entry is not permanent"); - return -EINVAL; + goto out; } vg = nbp_vlan_group(cfg.p); } else { @@ -1096,6 +1097,7 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, err = __br_mdb_add(&cfg, extack); } +out: return err; } -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 13/19] bridge: mcast: Expose br_multicast_new_group_src()
Currently, new group source entries are only created in response to received Membership Reports. Subsequent patches are going to allow user space to install (*, G) entries with a source list. As a preparatory step, expose br_multicast_new_group_src() so that it could later be invoked from the MDB code (i.e., br_mdb.c) that handles RTM_NEWMDB messages. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_multicast.c | 2 +- net/bridge/br_private.h | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c index 09140bc8c15e..14f72d11f4a2 100644 --- a/net/bridge/br_multicast.c +++ b/net/bridge/br_multicast.c @@ -1232,7 +1232,7 @@ br_multicast_find_group_src(struct net_bridge_port_group *pg, struct br_ip *ip) return NULL; } -static struct net_bridge_group_src * +struct net_bridge_group_src * br_multicast_new_group_src(struct net_bridge_port_group *pg, struct br_ip *src_ip) { struct net_bridge_group_src *grp_src; diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 278a18e88e42..2aa453ea04f9 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -972,6 +972,9 @@ void br_multicast_sg_add_exclude_ports(struct net_bridge_mdb_entry *star_mp, struct net_bridge_port_group *sg); struct net_bridge_group_src * br_multicast_find_group_src(struct net_bridge_port_group *pg, struct br_ip *ip); +struct net_bridge_group_src * +br_multicast_new_group_src(struct net_bridge_port_group *pg, + struct br_ip *src_ip); void br_multicast_del_group_src(struct net_bridge_group_src *src, bool fastleave); void br_multicast_ctx_init(struct net_bridge *br, -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 14/19] bridge: mcast: Add a flag for user installed source entries
There are a few places where the bridge driver differentiates between (S, G) entries installed by the kernel (in response to Membership Reports) and those installed by user space. One of them is when deleting an (S, G) entry corresponding to a source entry that is being deleted. While user space cannot currently add a source entry to a (*, G), it can add an (S, G) entry that later corresponds to a source entry created by the reception of a Membership Report. If this source entry is later deleted because its source timer expired or because the (*, G) entry is being deleted, the bridge driver will not delete the corresponding (S, G) entry if it was added by user space as permanent. This is going to be a problem when the ability to install a (*, G) with a source list is exposed to user space. In this case, when user space installs the (*, G) as permanent, then all the (S, G) entries corresponding to its source list will also be installed as permanent. When user space deletes the (*, G), all the source entries will be deleted and the expectation is that the corresponding (S, G) entries will be deleted as well. Solve this by introducing a new source entry flag denoting that the entry was installed by user space. When the entry is deleted, delete the corresponding (S, G) entry even if it was installed by user space as permanent, as the flag tells us that it was installed in response to the source entry being created. The flag will be set in a subsequent patch where source entries are created in response to user requests. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_multicast.c | 3 ++- net/bridge/br_private.h | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c index 14f72d11f4a2..5d2dd114c54c 100644 --- a/net/bridge/br_multicast.c +++ b/net/bridge/br_multicast.c @@ -552,7 +552,8 @@ static void br_multicast_fwd_src_remove(struct net_bridge_group_src *src, continue; if (p->rt_protocol != RTPROT_KERNEL && - (p->flags & MDB_PG_FLAGS_PERMANENT)) + (p->flags & MDB_PG_FLAGS_PERMANENT) && + !(src->flags & BR_SGRP_F_USER_ADDED)) break; if (fastleave) diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 2aa453ea04f9..6879d2e1128f 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -299,6 +299,7 @@ struct net_bridge_fdb_flush_desc { #define BR_SGRP_F_DELETE BIT(0) #define BR_SGRP_F_SEND BIT(1) #define BR_SGRP_F_INSTALLED BIT(2) +#define BR_SGRP_F_USER_ADDED BIT(3) struct net_bridge_mcast_gc { struct hlist_node gc_node; -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 15/19] bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
User space will soon be able to install a (*, G) with a source list, prompting the creation of a (S, G) entry for each source. In this case, the group timer of the (S, G) entry should never be set. Solve this by adding a new field to the MDB configuration structure that denotes whether the (S, G) corresponds to a source or not. The field will be set in a subsequent patch where br_mdb_add_group_sg() is called in order to create a (S, G) entry for each user provided source. Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 2 +- net/bridge/br_private.h | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 157ba4e765c1..2804da7b0aa1 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -814,7 +814,7 @@ static int br_mdb_add_group_sg(struct br_mdb_config *cfg, return -ENOMEM; } rcu_assign_pointer(*pp, p); - if (!(flags & MDB_PG_FLAGS_PERMANENT)) + if (!(flags & MDB_PG_FLAGS_PERMANENT) && !cfg->src_entry) mod_timer(&p->timer, now + brmctx->multicast_membership_interval); br_mdb_notify(cfg->br->dev, mp, p, RTM_NEWMDB); diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 6879d2e1128f..1bd6eebad002 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -98,6 +98,7 @@ struct br_mdb_config { struct net_bridge_port *p; struct br_mdb_entry *entry; struct br_ip group; + bool src_entry; }; #endif -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 16/19] bridge: mcast: Add support for (*, G) with a source list and filter mode
In preparation for allowing user space to add (*, G) entries with a source list and associated filter mode, add the necessary plumbing to handle such requests. Extend the MDB configuration structure with a currently empty source list and filter mode that is currently hard coded to EXCLUDE. Add the source entries and the corresponding (S, G) entries before making the new (*, G) port group entry visible to the data path. Handle the creation of each source entry in a similar fashion to how it is created from the data path in response to received Membership Reports: Create the source entry, arm the source timer (if needed), add a corresponding (S, G) forwarding entry and finally mark the source entry as installed (by user space). Add the (S, G) entry by populating an MDB configuration structure and calling br_mdb_add_group_sg() as if a new entry is created by user space, with the sole difference that the 'src_entry' field is set to make sure that the group timer of such entries is never armed. Note that it is not currently possible to add more than 32 source entries to a port group entry. If this proves to be a problem we can either increase 'PG_SRC_ENT_LIMIT' or avoid forcing a limit on entries created by user space. For example, by adding a new argument to br_multicast_new_group_src(). Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 130 +++++++++++++++++++++++++++++++++++++++- net/bridge/br_private.h | 7 +++ 2 files changed, 134 insertions(+), 3 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 2804da7b0aa1..8fc8816a76bf 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -836,6 +836,115 @@ static int br_mdb_add_group_sg(struct br_mdb_config *cfg, return 0; } +static int br_mdb_add_group_src_fwd(struct br_mdb_config *cfg, + struct br_ip *src_ip, + struct net_bridge_mcast *brmctx, + struct netlink_ext_ack *extack) +{ + struct net_bridge_mdb_entry *sgmp; + struct br_mdb_config sg_cfg; + struct br_ip sg_ip; + u8 flags = 0; + + sg_ip = cfg->group; + sg_ip.src = src_ip->src; + sgmp = br_multicast_new_group(cfg->br, &sg_ip); + if (IS_ERR(sgmp)) { + NL_SET_ERR_MSG_MOD(extack, "Failed to add (S, G) MDB entry"); + return PTR_ERR(sgmp); + } + + if (cfg->entry->state == MDB_PERMANENT) + flags |= MDB_PG_FLAGS_PERMANENT; + if (cfg->filter_mode == MCAST_EXCLUDE) + flags |= MDB_PG_FLAGS_BLOCKED; + + memset(&sg_cfg, 0, sizeof(sg_cfg)); + INIT_LIST_HEAD(&sg_cfg.src_list); + sg_cfg.br = cfg->br; + sg_cfg.p = cfg->p; + sg_cfg.entry = cfg->entry; + sg_cfg.group = sg_ip; + sg_cfg.src_entry = true; + sg_cfg.filter_mode = MCAST_INCLUDE; + return br_mdb_add_group_sg(&sg_cfg, sgmp, brmctx, flags, extack); +} + +static int br_mdb_add_group_src(struct br_mdb_config *cfg, + struct net_bridge_port_group *pg, + struct net_bridge_mcast *brmctx, + struct br_mdb_src_entry *src, + struct netlink_ext_ack *extack) +{ + struct net_bridge_group_src *ent; + unsigned long now = jiffies; + int err; + + ent = br_multicast_find_group_src(pg, &src->addr); + if (!ent) { + ent = br_multicast_new_group_src(pg, &src->addr); + if (!ent) { + NL_SET_ERR_MSG_MOD(extack, "Failed to add new source entry"); + return -ENOSPC; + } + } else { + NL_SET_ERR_MSG_MOD(extack, "Source entry already exists"); + return -EEXIST; + } + + if (cfg->filter_mode == MCAST_INCLUDE && + cfg->entry->state == MDB_TEMPORARY) + mod_timer(&ent->timer, now + br_multicast_gmi(brmctx)); + else + del_timer(&ent->timer); + + /* Install a (S, G) forwarding entry for the source. */ + err = br_mdb_add_group_src_fwd(cfg, &src->addr, brmctx, extack); + if (err) + goto err_del_sg; + + ent->flags = BR_SGRP_F_INSTALLED | BR_SGRP_F_USER_ADDED; + + return 0; + +err_del_sg: + br_multicast_del_group_src(ent, false); + return err; +} + +static void br_mdb_del_group_src(struct net_bridge_port_group *pg, + struct br_mdb_src_entry *src) +{ + struct net_bridge_group_src *ent; + + ent = br_multicast_find_group_src(pg, &src->addr); + if (WARN_ON_ONCE(!ent)) + return; + br_multicast_del_group_src(ent, false); +} + +static int br_mdb_add_group_srcs(struct br_mdb_config *cfg, + struct net_bridge_port_group *pg, + struct net_bridge_mcast *brmctx, + struct netlink_ext_ack *extack) +{ + struct br_mdb_src_entry *src; + int err; + + list_for_each_entry(src, &cfg->src_list, list) { + err = br_mdb_add_group_src(cfg, pg, brmctx, src, extack); + if (err) + goto err_del_group_srcs; + } + + return 0; + +err_del_group_srcs: + list_for_each_entry_continue_reverse(src, &cfg->src_list, list) + br_mdb_del_group_src(pg, src); + return err; +} + static int br_mdb_add_group_star_g(struct br_mdb_config *cfg, struct net_bridge_mdb_entry *mp, struct net_bridge_mcast *brmctx, @@ -845,6 +954,7 @@ static int br_mdb_add_group_star_g(struct br_mdb_config *cfg, struct net_bridge_port_group __rcu **pp; struct net_bridge_port_group *p; unsigned long now = jiffies; + int err; for (pp = &mp->ports; (p = mlock_dereference(*pp, cfg->br)) != NULL; @@ -858,23 +968,35 @@ static int br_mdb_add_group_star_g(struct br_mdb_config *cfg, } p = br_multicast_new_port_group(cfg->p, &cfg->group, *pp, flags, NULL, - MCAST_EXCLUDE, RTPROT_STATIC); + cfg->filter_mode, RTPROT_STATIC); if (unlikely(!p)) { NL_SET_ERR_MSG_MOD(extack, "Couldn't allocate new (*, G) port group"); return -ENOMEM; } + + err = br_mdb_add_group_srcs(cfg, p, brmctx, extack); + if (err) + goto err_del_port_group; + rcu_assign_pointer(*pp, p); - if (!(flags & MDB_PG_FLAGS_PERMANENT)) + if (!(flags & MDB_PG_FLAGS_PERMANENT) && + cfg->filter_mode == MCAST_EXCLUDE) mod_timer(&p->timer, now + brmctx->multicast_membership_interval); br_mdb_notify(cfg->br->dev, mp, p, RTM_NEWMDB); /* If we are adding a new EXCLUDE port group (*, G), it needs to be * also added to all (S, G) entries for proper replication. */ - if (br_multicast_should_handle_mode(brmctx, cfg->group.proto)) + if (br_multicast_should_handle_mode(brmctx, cfg->group.proto) && + cfg->filter_mode == MCAST_EXCLUDE) br_multicast_star_g_handle_mode(p, MCAST_EXCLUDE); return 0; + +err_del_port_group: + hlist_del_init(&p->mglist); + kfree(p); + return err; } static int br_mdb_add_group(struct br_mdb_config *cfg, @@ -967,6 +1089,8 @@ static int br_mdb_config_init(struct net *net, struct sk_buff *skb, return err; memset(cfg, 0, sizeof(*cfg)); + cfg->filter_mode = MCAST_EXCLUDE; + INIT_LIST_HEAD(&cfg->src_list); bpm = nlmsg_data(nlh); if (!bpm->ifindex) { diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 1bd6eebad002..0189fce6f3b7 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -93,12 +93,19 @@ struct bridge_mcast_stats { struct u64_stats_sync syncp; }; +struct br_mdb_src_entry { + struct list_head list; + struct br_ip addr; +}; + struct br_mdb_config { struct net_bridge *br; struct net_bridge_port *p; struct br_mdb_entry *entry; struct br_ip group; bool src_entry; + u8 filter_mode; + struct list_head src_list; }; #endif -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 17/19] bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
Add new netlink attributes to the RTM_NEWMDB request that allow user space to add (*, G) with a source list and filter mode. The RTM_NEWMDB message can already dump such entries (created by the kernel) so there is no need to add dump support. However, the message contains a different set of attributes depending if it is a request or a response. The naming and structure of the new attributes try to follow the existing ones used in the response. Request: [ struct nlmsghdr ] [ struct br_port_msg ] [ MDBA_SET_ENTRY ] struct br_mdb_entry [ MDBA_SET_ENTRY_ATTRS ] [ MDBE_ATTR_SOURCE ] struct in_addr / struct in6_addr [ MDBE_ATTR_SRC_LIST ] // new [ MDBE_SRC_LIST_ENTRY ] [ MDBE_SRCATTR_ADDRESS ] struct in_addr / struct in6_addr [ ...] [ MDBE_ATTR_GROUP_MODE ] // new u8 Response: [ struct nlmsghdr ] [ struct br_port_msg ] [ MDBA_MDB ] [ MDBA_MDB_ENTRY ] [ MDBA_MDB_ENTRY_INFO ] struct br_mdb_entry [ MDBA_MDB_EATTR_TIMER ] u32 [ MDBA_MDB_EATTR_SOURCE ] struct in_addr / struct in6_addr [ MDBA_MDB_EATTR_RTPROT ] u8 [ MDBA_MDB_EATTR_SRC_LIST ] [ MDBA_MDB_SRCLIST_ENTRY ] [ MDBA_MDB_SRCATTR_ADDRESS ] struct in_addr / struct in6_addr [ MDBA_MDB_SRCATTR_TIMER ] u8 [...] [ MDBA_MDB_EATTR_GROUP_MODE ] u8 Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- include/uapi/linux/if_bridge.h | 20 +++++ net/bridge/br_mdb.c | 132 +++++++++++++++++++++++++++++++++ 2 files changed, 152 insertions(+) diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h index a86a7e7b811f..0d9fe73fc48c 100644 --- a/include/uapi/linux/if_bridge.h +++ b/include/uapi/linux/if_bridge.h @@ -723,10 +723,30 @@ enum { enum { MDBE_ATTR_UNSPEC, MDBE_ATTR_SOURCE, + MDBE_ATTR_SRC_LIST, + MDBE_ATTR_GROUP_MODE, __MDBE_ATTR_MAX, }; #define MDBE_ATTR_MAX (__MDBE_ATTR_MAX - 1) +/* per mdb entry source */ +enum { + MDBE_SRC_LIST_UNSPEC, + MDBE_SRC_LIST_ENTRY, + __MDBE_SRC_LIST_MAX, +}; +#define MDBE_SRC_LIST_MAX (__MDBE_SRC_LIST_MAX - 1) + +/* per mdb entry per source attributes + * these are embedded in MDBE_SRC_LIST_ENTRY + */ +enum { + MDBE_SRCATTR_UNSPEC, + MDBE_SRCATTR_ADDRESS, + __MDBE_SRCATTR_MAX, +}; +#define MDBE_SRCATTR_MAX (__MDBE_SRCATTR_MAX - 1) + /* Embedded inside LINK_XSTATS_TYPE_BRIDGE */ enum { BRIDGE_XSTATS_UNSPEC, diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 8fc8816a76bf..909b0fb49a0c 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -663,10 +663,25 @@ void br_rtr_notify(struct net_device *dev, struct net_bridge_mcast_port *pmctx, rtnl_set_sk_err(net, RTNLGRP_MDB, err); } +static const struct nla_policy +br_mdbe_src_list_entry_pol[MDBE_SRCATTR_MAX + 1] = { + [MDBE_SRCATTR_ADDRESS] = NLA_POLICY_RANGE(NLA_BINARY, + sizeof(struct in_addr), + sizeof(struct in6_addr)), +}; + +static const struct nla_policy +br_mdbe_src_list_pol[MDBE_SRC_LIST_MAX + 1] = { + [MDBE_SRC_LIST_ENTRY] = NLA_POLICY_NESTED(br_mdbe_src_list_entry_pol), +}; + static const struct nla_policy br_mdbe_attrs_pol[MDBE_ATTR_MAX + 1] = { [MDBE_ATTR_SOURCE] = NLA_POLICY_RANGE(NLA_BINARY, sizeof(struct in_addr), sizeof(struct in6_addr)), + [MDBE_ATTR_GROUP_MODE] = NLA_POLICY_RANGE(NLA_U8, MCAST_EXCLUDE, + MCAST_INCLUDE), + [MDBE_ATTR_SRC_LIST] = NLA_POLICY_NESTED(br_mdbe_src_list_pol), }; static bool is_valid_mdb_entry(struct br_mdb_entry *entry, @@ -1052,6 +1067,73 @@ static int __br_mdb_add(struct br_mdb_config *cfg, return ret; } +static int br_mdb_config_src_entry_init(struct nlattr *src_entry, + struct br_mdb_config *cfg, + struct netlink_ext_ack *extack) +{ + struct nlattr *tb[MDBE_SRCATTR_MAX + 1]; + struct br_mdb_src_entry *src; + int err; + + err = nla_parse_nested(tb, MDBE_SRCATTR_MAX, src_entry, + br_mdbe_src_list_entry_pol, extack); + if (err) + return err; + + if (NL_REQ_ATTR_CHECK(extack, src_entry, tb, MDBE_SRCATTR_ADDRESS)) + return -EINVAL; + + if (!is_valid_mdb_source(tb[MDBE_SRCATTR_ADDRESS], + cfg->entry->addr.proto, extack)) + return -EINVAL; + + src = kzalloc(sizeof(*src), GFP_KERNEL); + if (!src) + return -ENOMEM; + src->addr.proto = cfg->entry->addr.proto; + nla_memcpy(&src->addr.src, tb[MDBE_SRCATTR_ADDRESS], + nla_len(tb[MDBE_SRCATTR_ADDRESS])); + list_add_tail(&src->list, &cfg->src_list); + + return 0; +} + +static void br_mdb_config_src_entry_fini(struct br_mdb_src_entry *src) +{ + list_del(&src->list); + kfree(src); +} + +static int br_mdb_config_src_list_init(struct nlattr *src_list, + struct br_mdb_config *cfg, + struct netlink_ext_ack *extack) +{ + struct br_mdb_src_entry *src, *tmp; + struct nlattr *src_entry; + int rem, err; + + nla_for_each_nested(src_entry, src_list, rem) { + err = br_mdb_config_src_entry_init(src_entry, cfg, extack); + if (err) + goto err_src_entry_init; + } + + return 0; + +err_src_entry_init: + list_for_each_entry_safe(src, tmp, &cfg->src_list, list) + br_mdb_config_src_entry_fini(src); + return err; +} + +static void br_mdb_config_src_list_fini(struct br_mdb_config *cfg) +{ + struct br_mdb_src_entry *src, *tmp; + + list_for_each_entry_safe(src, tmp, &cfg->src_list, list) + br_mdb_config_src_entry_fini(src); +} + static int br_mdb_config_attrs_init(struct nlattr *set_attrs, struct br_mdb_config *cfg, struct netlink_ext_ack *extack) @@ -1071,9 +1153,52 @@ static int br_mdb_config_attrs_init(struct nlattr *set_attrs, __mdb_entry_to_br_ip(cfg->entry, &cfg->group, mdb_attrs); + if (mdb_attrs[MDBE_ATTR_GROUP_MODE]) { + if (!cfg->p) { + NL_SET_ERR_MSG_MOD(extack, "Filter mode cannot be set for host groups"); + return -EINVAL; + } + if (!br_multicast_is_star_g(&cfg->group)) { + NL_SET_ERR_MSG_MOD(extack, "Filter mode can only be set for (*, G) entries"); + return -EINVAL; + } + cfg->filter_mode = nla_get_u8(mdb_attrs[MDBE_ATTR_GROUP_MODE]); + } else { + cfg->filter_mode = MCAST_EXCLUDE; + } + + if (mdb_attrs[MDBE_ATTR_SRC_LIST]) { + if (!cfg->p) { + NL_SET_ERR_MSG_MOD(extack, "Source list cannot be set for host groups"); + return -EINVAL; + } + if (!br_multicast_is_star_g(&cfg->group)) { + NL_SET_ERR_MSG_MOD(extack, "Source list can only be set for (*, G) entries"); + return -EINVAL; + } + if (!mdb_attrs[MDBE_ATTR_GROUP_MODE]) { + NL_SET_ERR_MSG_MOD(extack, "Source list cannot be set without filter mode"); + return -EINVAL; + } + err = br_mdb_config_src_list_init(mdb_attrs[MDBE_ATTR_SRC_LIST], + cfg, extack); + if (err) + return err; + } + + if (list_empty(&cfg->src_list) && cfg->filter_mode == MCAST_INCLUDE) { + NL_SET_ERR_MSG_MOD(extack, "Cannot add (*, G) INCLUDE with an empty source list"); + return -EINVAL; + } + return 0; } +static void br_mdb_config_attrs_fini(struct br_mdb_config *cfg) +{ + br_mdb_config_src_list_fini(cfg); +} + static int br_mdb_config_init(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh, struct br_mdb_config *cfg, struct netlink_ext_ack *extack) @@ -1164,6 +1289,11 @@ static int br_mdb_config_init(struct net *net, struct sk_buff *skb, return 0; } +static void br_mdb_config_fini(struct br_mdb_config *cfg) +{ + br_mdb_config_attrs_fini(cfg); +} + static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack) { @@ -1222,6 +1352,7 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, } out: + br_mdb_config_fini(&cfg); return err; } @@ -1297,6 +1428,7 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, err = __br_mdb_del(&cfg); } + br_mdb_config_fini(&cfg); return err; } -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 18/19] bridge: mcast: Allow user space to specify MDB entry routing protocol
Add the 'MDBE_ATTR_RTPORT' attribute to allow user space to specify the routing protocol of the MDB port group entry. Enforce a minimum value of 'RTPROT_STATIC' to prevent user space from using protocol values that should only be set by the kernel (e.g., 'RTPROT_KERNEL'). Maintain backward compatibility by defaulting to 'RTPROT_STATIC'. The protocol is already visible to user space in RTM_NEWMDB responses and notifications via the 'MDBA_MDB_EATTR_RTPROT' attribute. The routing protocol allows a routing daemon to distinguish between entries configured by it and those configured by the administrator. Once MDB flush is supported, the protocol can be used as a criterion according to which the flush is performed. Examples: # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto kernel Error: integer out of range. # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto static # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent proto zebra # bridge mdb add dev br0 port dummy10 grp 239.1.1.2 permanent source_list 198.51.100.1,198.51.100.2 filter_mode include proto 250 # bridge -d mdb show dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.2 permanent filter_mode include proto 250 dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.1 permanent filter_mode include proto 250 dev br0 port dummy10 grp 239.1.1.2 permanent filter_mode include source_list 198.51.100.2/0.00,198.51.100.1/0.00 proto 250 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude proto static Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- include/uapi/linux/if_bridge.h | 1 + net/bridge/br_mdb.c | 10 ++++++++-- net/bridge/br_private.h | 1 + 3 files changed, 10 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h index 0d9fe73fc48c..d9de241d90f9 100644 --- a/include/uapi/linux/if_bridge.h +++ b/include/uapi/linux/if_bridge.h @@ -725,6 +725,7 @@ enum { MDBE_ATTR_SOURCE, MDBE_ATTR_SRC_LIST, MDBE_ATTR_GROUP_MODE, + MDBE_ATTR_RTPROT, __MDBE_ATTR_MAX, }; #define MDBE_ATTR_MAX (__MDBE_ATTR_MAX - 1) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 909b0fb49a0c..7ee6d383ad07 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -682,6 +682,7 @@ static const struct nla_policy br_mdbe_attrs_pol[MDBE_ATTR_MAX + 1] = { [MDBE_ATTR_GROUP_MODE] = NLA_POLICY_RANGE(NLA_U8, MCAST_EXCLUDE, MCAST_INCLUDE), [MDBE_ATTR_SRC_LIST] = NLA_POLICY_NESTED(br_mdbe_src_list_pol), + [MDBE_ATTR_RTPROT] = NLA_POLICY_MIN(NLA_U8, RTPROT_STATIC), }; static bool is_valid_mdb_entry(struct br_mdb_entry *entry, @@ -823,7 +824,7 @@ static int br_mdb_add_group_sg(struct br_mdb_config *cfg, } p = br_multicast_new_port_group(cfg->p, &cfg->group, *pp, flags, NULL, - MCAST_INCLUDE, RTPROT_STATIC); + MCAST_INCLUDE, cfg->rt_protocol); if (unlikely(!p)) { NL_SET_ERR_MSG_MOD(extack, "Couldn't allocate new (S, G) port group"); return -ENOMEM; @@ -882,6 +883,7 @@ static int br_mdb_add_group_src_fwd(struct br_mdb_config *cfg, sg_cfg.group = sg_ip; sg_cfg.src_entry = true; sg_cfg.filter_mode = MCAST_INCLUDE; + sg_cfg.rt_protocol = cfg->rt_protocol; return br_mdb_add_group_sg(&sg_cfg, sgmp, brmctx, flags, extack); } @@ -983,7 +985,7 @@ static int br_mdb_add_group_star_g(struct br_mdb_config *cfg, } p = br_multicast_new_port_group(cfg->p, &cfg->group, *pp, flags, NULL, - cfg->filter_mode, RTPROT_STATIC); + cfg->filter_mode, cfg->rt_protocol); if (unlikely(!p)) { NL_SET_ERR_MSG_MOD(extack, "Couldn't allocate new (*, G) port group"); return -ENOMEM; @@ -1191,6 +1193,9 @@ static int br_mdb_config_attrs_init(struct nlattr *set_attrs, return -EINVAL; } + if (mdb_attrs[MDBE_ATTR_RTPROT]) + cfg->rt_protocol = nla_get_u8(mdb_attrs[MDBE_ATTR_RTPROT]); + return 0; } @@ -1216,6 +1221,7 @@ static int br_mdb_config_init(struct net *net, struct sk_buff *skb, memset(cfg, 0, sizeof(*cfg)); cfg->filter_mode = MCAST_EXCLUDE; INIT_LIST_HEAD(&cfg->src_list); + cfg->rt_protocol = RTPROT_STATIC; bpm = nlmsg_data(nlh); if (!bpm->ifindex) { diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 0189fce6f3b7..73f0e98de33b 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -105,6 +105,7 @@ struct br_mdb_config { struct br_ip group; bool src_entry; u8 filter_mode; + u8 rt_protocol; struct list_head src_list; }; #endif -- 2.37.3
Ido Schimmel
2022-Oct-18 12:04 UTC
[Bridge] [RFC PATCH net-next 19/19] bridge: mcast: Support replacement of MDB port group entries
Now that user space can specify additional attributes of port group entries such as filter mode and source list, it makes sense to allow user space to atomically modify these attributes by replacing entries instead of forcing user space to delete the entries and add them back. Replace MDB port group entries when the 'NLM_F_REPLACE' flag is specified in the netlink message header. When a (*, G) entry is replaced, update the following attributes: Source list, state, filter mode, protocol and flags. If the entry is temporary and in EXCLUDE mode, reset the group timer to the group membership interval. If the entry is temporary and in INCLUDE mode, reset the source timers of associated sources to the group membership interval. Examples: # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.2 filter_mode include # bridge -d -s mdb show dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.2 permanent filter_mode include proto static 0.00 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto static 0.00 dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode include source_list 192.0.2.2/0.00,192.0.2.1/0.00 proto static 0.00 # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra # bridge -d -s mdb show dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra blocked 0.00 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra blocked 0.00 dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra 0.00 # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 temp source_list 192.0.2.4,192.0.2.3 filter_mode include proto bgp # bridge -d -s mdb show dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.4 temp filter_mode include proto bgp 0.00 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 temp filter_mode include proto bgp 0.00 dev br0 port dummy10 grp 239.1.1.1 temp filter_mode include source_list 192.0.2.4/259.44,192.0.2.3/259.44 proto bgp 0.00 Signed-off-by: Ido Schimmel <idosch at nvidia.com> --- net/bridge/br_mdb.c | 103 ++++++++++++++++++++++++++++++++++++++-- net/bridge/br_private.h | 1 + 2 files changed, 99 insertions(+), 5 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 7ee6d383ad07..b0c506a3e09e 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -802,6 +802,28 @@ __br_mdb_choose_context(struct net_bridge *br, return brmctx; } +static int br_mdb_replace_group_sg(struct br_mdb_config *cfg, + struct net_bridge_mdb_entry *mp, + struct net_bridge_port_group *pg, + struct net_bridge_mcast *brmctx, + unsigned char flags, + struct netlink_ext_ack *extack) +{ + unsigned long now = jiffies; + + pg->flags = flags; + pg->rt_protocol = cfg->rt_protocol; + if (!(flags & MDB_PG_FLAGS_PERMANENT) && !cfg->src_entry) + mod_timer(&pg->timer, + now + brmctx->multicast_membership_interval); + else + del_timer(&pg->timer); + + br_mdb_notify(cfg->br->dev, mp, pg, RTM_NEWMDB); + + return 0; +} + static int br_mdb_add_group_sg(struct br_mdb_config *cfg, struct net_bridge_mdb_entry *mp, struct net_bridge_mcast *brmctx, @@ -816,8 +838,12 @@ static int br_mdb_add_group_sg(struct br_mdb_config *cfg, (p = mlock_dereference(*pp, cfg->br)) != NULL; pp = &p->next) { if (p->key.port == cfg->p) { - NL_SET_ERR_MSG_MOD(extack, "(S, G) group is already joined by port"); - return -EEXIST; + if (!(cfg->nlflags & NLM_F_REPLACE)) { + NL_SET_ERR_MSG_MOD(extack, "(S, G) group is already joined by port"); + return -EEXIST; + } + return br_mdb_replace_group_sg(cfg, mp, p, brmctx, + flags, extack); } if ((unsigned long)p->key.port < (unsigned long)cfg->p) break; @@ -884,6 +910,7 @@ static int br_mdb_add_group_src_fwd(struct br_mdb_config *cfg, sg_cfg.src_entry = true; sg_cfg.filter_mode = MCAST_INCLUDE; sg_cfg.rt_protocol = cfg->rt_protocol; + sg_cfg.nlflags = cfg->nlflags; return br_mdb_add_group_sg(&sg_cfg, sgmp, brmctx, flags, extack); } @@ -904,7 +931,7 @@ static int br_mdb_add_group_src(struct br_mdb_config *cfg, NL_SET_ERR_MSG_MOD(extack, "Failed to add new source entry"); return -ENOSPC; } - } else { + } else if (!(cfg->nlflags & NLM_F_REPLACE)) { NL_SET_ERR_MSG_MOD(extack, "Source entry already exists"); return -EEXIST; } @@ -962,6 +989,67 @@ static int br_mdb_add_group_srcs(struct br_mdb_config *cfg, return err; } +static int br_mdb_replace_group_srcs(struct br_mdb_config *cfg, + struct net_bridge_port_group *pg, + struct net_bridge_mcast *brmctx, + struct netlink_ext_ack *extack) +{ + struct net_bridge_group_src *ent; + struct hlist_node *tmp; + int err; + + hlist_for_each_entry(ent, &pg->src_list, node) + ent->flags |= BR_SGRP_F_DELETE; + + err = br_mdb_add_group_srcs(cfg, pg, brmctx, extack); + if (err) + goto err_clear_delete; + + hlist_for_each_entry_safe(ent, tmp, &pg->src_list, node) { + if (ent->flags & BR_SGRP_F_DELETE) + br_multicast_del_group_src(ent, false); + } + + return 0; + +err_clear_delete: + hlist_for_each_entry(ent, &pg->src_list, node) + ent->flags &= ~BR_SGRP_F_DELETE; + return err; +} + +static int br_mdb_replace_group_star_g(struct br_mdb_config *cfg, + struct net_bridge_mdb_entry *mp, + struct net_bridge_port_group *pg, + struct net_bridge_mcast *brmctx, + unsigned char flags, + struct netlink_ext_ack *extack) +{ + unsigned long now = jiffies; + int err; + + err = br_mdb_replace_group_srcs(cfg, pg, brmctx, extack); + if (err) + return err; + + pg->flags = flags; + pg->filter_mode = cfg->filter_mode; + pg->rt_protocol = cfg->rt_protocol; + if (!(flags & MDB_PG_FLAGS_PERMANENT) && + cfg->filter_mode == MCAST_EXCLUDE) + mod_timer(&pg->timer, + now + brmctx->multicast_membership_interval); + else + del_timer(&pg->timer); + + br_mdb_notify(cfg->br->dev, mp, pg, RTM_NEWMDB); + + if (br_multicast_should_handle_mode(brmctx, cfg->group.proto)) + br_multicast_star_g_handle_mode(pg, cfg->filter_mode); + + return 0; +} + static int br_mdb_add_group_star_g(struct br_mdb_config *cfg, struct net_bridge_mdb_entry *mp, struct net_bridge_mcast *brmctx, @@ -977,8 +1065,12 @@ static int br_mdb_add_group_star_g(struct br_mdb_config *cfg, (p = mlock_dereference(*pp, cfg->br)) != NULL; pp = &p->next) { if (p->key.port == cfg->p) { - NL_SET_ERR_MSG_MOD(extack, "(*, G) group is already joined by port"); - return -EEXIST; + if (!(cfg->nlflags & NLM_F_REPLACE)) { + NL_SET_ERR_MSG_MOD(extack, "(*, G) group is already joined by port"); + return -EEXIST; + } + return br_mdb_replace_group_star_g(cfg, mp, p, brmctx, + flags, extack); } if ((unsigned long)p->key.port < (unsigned long)cfg->p) break; @@ -1222,6 +1314,7 @@ static int br_mdb_config_init(struct net *net, struct sk_buff *skb, cfg->filter_mode = MCAST_EXCLUDE; INIT_LIST_HEAD(&cfg->src_list); cfg->rt_protocol = RTPROT_STATIC; + cfg->nlflags = nlh->nlmsg_flags; bpm = nlmsg_data(nlh); if (!bpm->ifindex) { diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 73f0e98de33b..7831f01fa018 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -107,6 +107,7 @@ struct br_mdb_config { u8 filter_mode; u8 rt_protocol; struct list_head src_list; + u32 nlflags; }; #endif -- 2.37.3
Jakub Kicinski
2022-Oct-18 19:21 UTC
[Bridge] [RFC PATCH net-next 00/19] bridge: mcast: Extensions for EVPN
On Tue, 18 Oct 2022 15:04:01 +0300 Ido Schimmel wrote:> [ MDBE_ATTR_SRC_LIST ] // new > [ MDBE_SRC_LIST_ENTRY ] > [ MDBE_SRCATTR_ADDRESS ] > struct in_addr / struct in6_addr > [ ...]nit: I found that the MDBE_ATTR_SRC_LIST level of wrapping corresponds to how "sane" formats work, but in practice there is no need for it in netlink. You can put the entry nests directly in the outer. Saves one layer of parsing. Just thought I'd mention it, you can keep as is if you prefer.
Nikolay Aleksandrov
2022-Oct-19 13:15 UTC
[Bridge] [RFC PATCH net-next 00/19] bridge: mcast: Extensions for EVPN
On 18/10/2022 15:04, Ido Schimmel wrote:> Posting as RFC to show the whole picture. Will split into smaller > submissions for v1 and add selftests. > > tl;dr > ====> > This patchset creates feature parity between user space and the kernel > and allows the former to install and replace MDB port group entries with > a source list and associated filter mode. This is required for EVPN use > cases where multicast state is not derived from snooped IGMP/MLD > packets, but instead derived from EVPN routes exchanged by the control > plane in user space. > > Background > =========> > IGMPv3 [1] and MLDv2 [2] differ from earlier versions of the protocols > in that they add support for source-specific multicast. That is, hosts > can advertise interest in listening to a particular multicast address > only from specific source addresses or from all sources except for > specific source addresses. > > In kernel 5.10 [3][4], the bridge driver gained the ability to snoop > IGMPv3/MLDv2 packets and install corresponding MDB port group entries. > For example, a snooped IGMPv3 Membership Report that contains a single > MODE_IS_EXCLUDE record for group 239.10.10.10 with sources 192.0.2.1, > 192.0.2.2, 192.0.2.20 and 192.0.2.21 would trigger the creation of these > entries: > > # bridge -d mdb show > dev br0 port veth1 grp 239.10.10.10 src 192.0.2.21 temp filter_mode include proto kernel blocked > dev br0 port veth1 grp 239.10.10.10 src 192.0.2.20 temp filter_mode include proto kernel blocked > dev br0 port veth1 grp 239.10.10.10 src 192.0.2.2 temp filter_mode include proto kernel blocked > dev br0 port veth1 grp 239.10.10.10 src 192.0.2.1 temp filter_mode include proto kernel blocked > dev br0 port veth1 grp 239.10.10.10 temp filter_mode exclude source_list 192.0.2.21/0.00,192.0.2.20/0.00,192.0.2.2/0.00,192.0.2.1/0.00 proto kernel > > While the kernel can install and replace entries with a filter mode and > source list, user space cannot. It can only add EXCLUDE entries with an > empty source list, which is sufficient for IGMPv2/MLDv1, but not for > IGMPv3/MLDv2. > > Use cases where the multicast state is not derived from snooped packets, > but instead derived from routes exchanged by the user space control > plane require feature parity between user space and the kernel in terms > of MDB configuration. Such a use case is detailed in the next section. > > Motivation > =========> > RFC 7432 [5] defines a "MAC/IP Advertisement route" (type 2) [6] that > allows NVE switches in the EVPN network to advertise and learn > reachability information for unicast MAC addresses. Traffic destined to > a unicast MAC address can therefore be selectively forwarded to a single > NVE switch behind which the MAC is located. > > The same is not true for IP multicast traffic. Such traffic is simply > flooded as BUM to all NVE switches in the broadcast domain (BD), > regardless if a switch has interested receivers for the multicast stream > or not. This is especially problematic for overlay networks that make > heavy use of multicast. > > The issue is addressed by RFC 9251 [7] that defines a "Selective > Multicast Ethernet Tag Route" (type 6) [8] which allows NVE switches in > the EVPN network to advertise multicast streams that they are interested > in. This is done by having each switch suppress IGMP/MLD packets from > being transmitted to the NVE network and instead communicate the > information over BGP to other switches. > > As far as the bridge driver is concerned, the above means that the > multicast state (i.e., {multicast address, group timer, filter-mode, > (source records)}) for the VXLAN bridge port is not populated by the > kernel from snooped IGMP/MLD packets (they are suppressed), but instead > by user space. Specifically, by the routing daemon that is exchanging > EVPN routes with other NVE switches. > > Changes are obviously also required in the VXLAN driver, but they are > the subject of future patchsets. See the "Future work" section. > > Implementation > =============> > The user interface is extended to allow user space to specify the filter > mode of the MDB port group entry and its source list. Replace support is > also added so that user space would not need to remove an entry and > re-add it only to edit its source list or filter mode, as that would > result in packet loss. Example usage: > > # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent \ > source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra > # bridge -d -s mdb show > dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra blocked 0.00 > dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra blocked 0.00 > dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra 0.00 > > The netlink interface is extended with a few new attributes in the > RTM_NEWMDB request message: > > [ struct nlmsghdr ] > [ struct br_port_msg ] > [ MDBA_SET_ENTRY ] > struct br_mdb_entry > [ MDBA_SET_ENTRY_ATTRS ] > [ MDBE_ATTR_SOURCE ] > struct in_addr / struct in6_addr > [ MDBE_ATTR_SRC_LIST ] // new > [ MDBE_SRC_LIST_ENTRY ] > [ MDBE_SRCATTR_ADDRESS ] > struct in_addr / struct in6_addr > [ ...] > [ MDBE_ATTR_GROUP_MODE ] // new > u8 > [ MDBE_ATTR_RTPORT ] // new > u8 > > No changes are required in RTM_NEWMDB responses and notifications, as > all the information can already be dumped by the kernel today. > > Testing > ======> > Tested with existing bridge multicast selftests: bridge_igmp.sh, > bridge_mdb_port_down.sh, bridge_mdb.sh, bridge_mld.sh, > bridge_vlan_mcast.sh. > > Will add dedicated selftests in v1. > > Patchset overview > ================> > Patches #1-#8 are non-functional preparations aimed at making it easier > to add additional netlink attributes later on. They create an MDB > configuration structure into which netlink messages are parsed into. > The structure is then passed in the entry creation / deletion call chain > instead of passing the netlink attributes themselves. The same pattern > is used by other rtnetlink objects such as routes and nexthops. > > I initially tried to extend the current code, but it proved to be too > difficult, which is why I decided to refactor it to the extensible and > familiar pattern used by other rtnetlink objects. > > The plan is to submit these patches separately for v1. > > Patches #9-#15 are an additional set of non-functional preparations for > the core changes in the last patches. > > Patches #16-#17 allow user space to install (*, G) entries with a source > list and associated filter mode. Specifically, patch #16 adds the > necessary kernel plumbing and patch #17 exposes the new functionality to > user space via a few new attributes. > > Patch #18 allows user space to specify the routing protocol of new MDB > port group entries so that a routing daemon could differentiate between > entries installed by it and those installed by an administrator. > > Patch #19 allows user space to replace MDB port group entries. This is > useful, for example, when user space wants to add a new source to a > source list. Instead of deleting a (*, G) entry and re-adding it with an > extended source list (which would result in packet loss), user space can > simply replace the current entry. > > Future work > ==========> > The VXLAN driver will need to be extended with an MDB so that it could > selectively forward IP multicast traffic to NVE switches with interested > receivers instead of simply flooding it to all switches as BUM. > > The idea is to reuse the existing MDB interface for the VXLAN driver in > a similar way to how the FDB interface is shared between the bridge and > VXLAN drivers. > > From command line perspective, configuration will look as follows: > > # bridge mdb add dev br0 port vxlan0 grp 239.1.1.1 permanent \ > filter_mode exclude source_list 198.50.100.1,198.50.100.2 > > # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \ > filter_mode include source_list 198.50.100.3,198.50.100.4 \ > dst 192.0.2.1 dst_port 4789 src_vni 2 > > # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \ > filter_mode exclude source_list 198.50.100.1,198.50.100.2 \ > dst 192.0.2.2 dst_port 4789 src_vni 2 > > Where the first command is enabled by this set, but the next two will be > the subject of future work. > > From netlink perspective, the existing PF_BRIDGE/RTM_*MDB messages will > be extended to the VXLAN driver. This means that a few new attributes > will be added (e.g., 'MDBE_ATTR_SRC_VNI') and that the handlers for > these messages will need to move to net/core/rtnetlink.c. The rtnetlink > code will call into the appropriate driver based on the ifindex > specified in the ancillary header. > > iproute2 patches can be found here [9]. > > [1] https://datatracker.ietf.org/doc/html/rfc3376 > [2] https://www.rfc-editor.org/rfc/rfc3810 > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6af52ae2ed14a6bc756d5606b29097dfd76740b8 > [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68d4fd30c83b1b208e08c954cd45e6474b148c87 > [5] https://datatracker.ietf.org/doc/html/rfc7432 > [6] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2 > [7] https://datatracker.ietf.org/doc/html/rfc9251 > [8] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1 > [9] https://github.com/idosch/iproute2/commits/submit/mdb_rfc_v1 > > Ido Schimmel (19): > bridge: mcast: Centralize netlink attribute parsing > bridge: mcast: Remove redundant checks > bridge: mcast: Use MDB configuration structure where possible > bridge: mcast: Propagate MDB configuration structure further > bridge: mcast: Use MDB group key from configuration structure > bridge: mcast: Remove br_mdb_parse() > bridge: mcast: Move checks out of critical section > bridge: mcast: Remove redundant function arguments > bridge: mcast: Do not derive entry type from its filter mode > bridge: mcast: Split (*, G) and (S, G) addition into different > functions > bridge: mcast: Place netlink policy before validation functions > bridge: mcast: Add a centralized error path > bridge: mcast: Expose br_multicast_new_group_src() > bridge: mcast: Add a flag for user installed source entries > bridge: mcast: Avoid arming group timer when (S, G) corresponds to a > source > bridge: mcast: Add support for (*, G) with a source list and filter > mode > bridge: mcast: Allow user space to add (*, G) with a source list and > filter mode > bridge: mcast: Allow user space to specify MDB entry routing protocol > bridge: mcast: Support replacement of MDB port group entries > > include/uapi/linux/if_bridge.h | 21 + > net/bridge/br_mdb.c | 821 ++++++++++++++++++++++++--------- > net/bridge/br_multicast.c | 5 +- > net/bridge/br_private.h | 21 + > 4 files changed, 649 insertions(+), 219 deletions(-) >Good work, the set looks good to me. It's a natural extension to allow user-space to manipulate such mcast groups. As we discussed privately it's only missing the selftests. :) Cheers, Nik