Alexandra Winter
2022-Mar-30 11:14 UTC
[Bridge] [PATCH net-next v2] veth: Support bonding events
On 30.03.22 12:23, Nikolay Aleksandrov wrote:
> On 30/03/2022 03:54, Jakub Kicinski wrote:
>> Dropping the BPF people from CC and adding Hangbin, bridge and
>> bond/team. Please exercise some judgment when sending patches.

Thank you.
I did 'scripts/get_maintainer.pl drivers/net/veth.c'
but was a bit surprised about the outcome.

>>
>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>> Bonding drivers generate specific events during failover that trigger
>>> switch updates. When a veth device is attached to a bridge with a
>>> bond interface, we want external switches to learn about the veth
>>> devices as well.
>>>
>>> Example:
>>>
>>>   | veth_a2 | veth_b2 | veth_c2 |
>>>  ------o-----------o----------o------
>>>         \          |         /
>>>          o         o        o
>>>      veth_a1    veth_b1   veth_c1
>>>     -------------------------
>>>    |          bridge         |
>>>     -------------------------
>>>               bond0
>>>              /     \
>>>           eth0     eth1
>>>
>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>> external hardware attached to eth1.
>>>
>>> Without this patch we have seen cases where recovery after bond failover
>>> took an unacceptable amount of time (depending on timeout settings in the
>>> network).
>>>
>>> Due to the symmetric nature of veth special care is required to avoid
>>> endless notification loops. Therefore we only notify from a veth
>>> bridgeport to a peer that is not a bridgeport.
>>>
>>> References:
>>> Same handling as for macvlan:
>>> commit 4c9912556867 ("macvlan: Support bonding events")
>>> and vlan:
>>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>>
>>> Alternatives:
>>> Propagate notifier events to all ports of a bridge. IIUC, this was
>>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
>>
>> My (likely flawed) reading of Nik's argument was that (1) he was
>> concerned about GARP storms; (2) he didn't want the GARP to be
>> broadcast to all ports, just the bond that originated the request.
>>
>
> Yes, that would be ideal. Trying to avoid unnecessary bcasts, that is
> especially important for large setups with lots of devices.

One way to target the bond that originated the request, would be if the
bridge itself would do GARPs/RARPS/.., on this bond port for all MACs
that are in its FDB. What do you think about that?

>
>> I'm not sure I follow (1), as Hangbin said the event is rare, plus
>> GARP only comes from interfaces that have an IP addr, which IIUC
>> most bridge ports will not have.
>>
>
> Indeed, such setups are not the most common ones.
>
>> This patch in no way addresses (2). But then, again, if we put
>> a macvlan on top of a bridge master it will shotgun its GARPS all
>> the same. So it's not like veth would be special in that regard.
>>
>> Nik, what am I missing?
>>
>
> If we're talking about macvlan -> bridge -> bond then the bond flap's
> notify peers shouldn't reach the macvlan. Generally broadcast traffic
> is quite expensive for the bridge, I have patches that improve on the
> technical side (consider ports only for the same bcast domain), but you also
> wouldn't want unnecessary bcast packets being sent around. :)
> There are setups with tens of bond devices and propagating that to all would be
> very expensive, but most of all unnecessary. It would also hurt setups with
> a lot of vlan devices on the bridge. There are setups with hundreds of vlans
> and hundreds of macvlans on top, propagating it up would send it to all of
> them and that wouldn't scale at all, these mostly have IP addresses too.
>
> Perhaps we can enable propagation on a per-port or per-bridge basis, then we
> can avoid these walks. That is, make it opt-in.
>
>>> It also seems difficult to avoid re-bouncing the notifier.
>>
>> syzbot will make short work of this patch, I think the potential
>> for infinite loops has to be addressed somehow. IIUC this is the
>> first instance of forwarding those notifiers to a peer rather
>> than within an upper <> lower device hierarchy which is a DAG.

My concern was about Hangbin's alternative proposal to notify all
bridge ports. I hope in my proposal I was able to avoid infinite loops.

>>
>>> Signed-off-by: Alexandra Winter <wintera at linux.ibm.com>
>>> ---
>>>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 53 insertions(+)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index d29fb9759cc9..74b074453197 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>>  }
>>>
>>> +static bool netif_is_veth(const struct net_device *dev)
>>> +{
>>> +	return (dev->netdev_ops == &veth_netdev_ops);
>>
>> brackets unnecessary
>>
>>> +}
>>> +
>>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>>> +{
>>> +	struct net_device *peer;
>>> +	struct veth_priv *priv;
>>> +
>>> +	priv = netdev_priv(dev);
>>> +	peer = rtnl_dereference(priv->peer);
>>> +	/* avoid re-bounce between 2 bridges */
>>> +	if (!netif_is_bridge_port(peer))
>>> +		call_netdevice_notifiers(event, peer);
>>> +}
>>> +
>>> +/* Called under rtnl_lock */
>>> +static int veth_device_event(struct notifier_block *unused,
>>> +			     unsigned long event, void *ptr)
>>> +{
>>> +	struct net_device *dev, *lower;
>>> +	struct list_head *iter;
>>> +
>>> +	dev = netdev_notifier_info_to_dev(ptr);
>>> +
>>> +	switch (event) {
>>> +	case NETDEV_NOTIFY_PEERS:
>>> +	case NETDEV_BONDING_FAILOVER:
>>> +	case NETDEV_RESEND_IGMP:
>>> +		/* propagate to peer of a bridge attached veth */
>>> +		if (netif_is_bridge_master(dev)) {
>>
>> Having veth sift thru bridge ports seems strange.
>> In fact it could be beneficial to filter the event based on
>> port state (whether it's forwarding, vlan etc). But looking
>> at details of port state outside the bridge would be even stranger.
>>
>>> +			iter = &dev->adj_list.lower;
>>> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
>>> +			while (lower) {
>>> +				if (netif_is_veth(lower))
>>> +					veth_notify_peer(event, lower);
>>> +				lower = netdev_next_lower_dev_rcu(dev, &iter);
>>
>> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
>>
>>> +			}
>>> +		}
>>> +		break;
>>> +	default:
>>> +		break;
>>> +	}
>>> +	return NOTIFY_DONE;
>>> +}
>>> +
>>> +static struct notifier_block veth_notifier_block __read_mostly = {
>>> +	.notifier_call = veth_device_event,
>>
>> extra tab here
>>
>>> +};
>>> +
>>>  /*
>>>   * netlink interface
>>>   */
>>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>>
>>>  static __init int veth_init(void)
>>>  {
>>> +	register_netdevice_notifier(&veth_notifier_block);
>>
>> this can fail
>>
>>>  	return rtnl_link_register(&veth_link_ops);
>>>  }
>>>
>>>  static __exit void veth_exit(void)
>>>  {
>>>  	rtnl_link_unregister(&veth_link_ops);
>>> +	unregister_netdevice_notifier(&veth_notifier_block);
>>>  }
>>>
>>>  module_init(veth_init);
>>
>
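For illustration only (not part of the posted v2 patch): one way veth_init()
could handle the register_netdevice_notifier() failure Jakub points out above
is to check the return value and unwind if rtnl_link_register() then fails.
A minimal, untested sketch:

static __init int veth_init(void)
{
	int err;

	/* register_netdevice_notifier() can fail; don't ignore the error */
	err = register_netdevice_notifier(&veth_notifier_block);
	if (err)
		return err;

	err = rtnl_link_register(&veth_link_ops);
	if (err)
		unregister_netdevice_notifier(&veth_notifier_block);

	return err;
}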
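Likewise, the netdev_for_each_lower_dev_rcu() helper Jakub asks for does not
exist yet; a hypothetical sketch, modeled on the existing
netdev_for_each_lower_dev() macro and the exported netdev_next_lower_dev_rcu()
iterator in include/linux/netdevice.h, might look like this:

/* hypothetical helper, would live in include/linux/netdevice.h */
#define netdev_for_each_lower_dev_rcu(dev, ldev, iter)		\
	for (iter = &(dev)->adj_list.lower,			\
	     ldev = netdev_next_lower_dev_rcu(dev, &(iter));	\
	     ldev;						\
	     ldev = netdev_next_lower_dev_rcu(dev, &(iter)))

which would let the open-coded loop in veth_device_event() become:

	netdev_for_each_lower_dev_rcu(dev, lower, iter) {
		if (netif_is_veth(lower))
			veth_notify_peer(event, lower);
	}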
Nikolay Aleksandrov
2022-Mar-30 11:25 UTC
[Bridge] [PATCH net-next v2] veth: Support bonding events
On 30/03/2022 14:14, Alexandra Winter wrote:
>
>
> On 30.03.22 12:23, Nikolay Aleksandrov wrote:
>> On 30/03/2022 03:54, Jakub Kicinski wrote:
>>> Dropping the BPF people from CC and adding Hangbin, bridge and
>>> bond/team. Please exercise some judgment when sending patches.
> Thank you.
> I did 'scripts/get_maintainer.pl drivers/net/veth.c'
> but was a bit surprised about the outcome.
>
>>>
>>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>>> Bonding drivers generate specific events during failover that trigger
>>>> switch updates. When a veth device is attached to a bridge with a
>>>> bond interface, we want external switches to learn about the veth
>>>> devices as well.
>>>>
>>>> Example:
>>>>
>>>>   | veth_a2 | veth_b2 | veth_c2 |
>>>>  ------o-----------o----------o------
>>>>         \          |         /
>>>>          o         o        o
>>>>      veth_a1    veth_b1   veth_c1
>>>>     -------------------------
>>>>    |          bridge         |
>>>>     -------------------------
>>>>               bond0
>>>>              /     \
>>>>           eth0     eth1
>>>>
>>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>>> external hardware attached to eth1.
>>>>
>>>> Without this patch we have seen cases where recovery after bond failover
>>>> took an unacceptable amount of time (depending on timeout settings in the
>>>> network).
>>>>
>>>> Due to the symmetric nature of veth special care is required to avoid
>>>> endless notification loops. Therefore we only notify from a veth
>>>> bridgeport to a peer that is not a bridgeport.
>>>>
>>>> References:
>>>> Same handling as for macvlan:
>>>> commit 4c9912556867 ("macvlan: Support bonding events")
>>>> and vlan:
>>>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>>>
>>>> Alternatives:
>>>> Propagate notifier events to all ports of a bridge. IIUC, this was
>>>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
>>>
>>> My (likely flawed) reading of Nik's argument was that (1) he was
>>> concerned about GARP storms; (2) he didn't want the GARP to be
>>> broadcast to all ports, just the bond that originated the request.
>>>
>>
>> Yes, that would be ideal. Trying to avoid unnecessary bcasts, that is
>> especially important for large setups with lots of devices.
>
> One way to target the bond that originated the request, would be if the
> bridge itself would do GARPs/RARPS/.., on this bond port for all MACs
> that are in its FDB. What do you think about that?
>

That's a hack and you can already do it easily in user-space, you don't
need anything special in the bridge. It is also very specific, and it
should only happen in certain situations (e.g. a/b flap) which the bridge
doesn't really know about, but user-space does because it can see the
notifications and can see the bond mode.

>>
>>> I'm not sure I follow (1), as Hangbin said the event is rare, plus
>>> GARP only comes from interfaces that have an IP addr, which IIUC
>>> most bridge ports will not have.
>>>
>>
>> Indeed, such setups are not the most common ones.
>>
>>> This patch in no way addresses (2). But then, again, if we put
>>> a macvlan on top of a bridge master it will shotgun its GARPS all
>>> the same. So it's not like veth would be special in that regard.
>>>
>>> Nik, what am I missing?
>>>
>>
>> If we're talking about macvlan -> bridge -> bond then the bond flap's
>> notify peers shouldn't reach the macvlan. Generally broadcast traffic
>> is quite expensive for the bridge, I have patches that improve on the
>> technical side (consider ports only for the same bcast domain), but you also
>> wouldn't want unnecessary bcast packets being sent around. :)
>> There are setups with tens of bond devices and propagating that to all would be
>> very expensive, but most of all unnecessary. It would also hurt setups with
>> a lot of vlan devices on the bridge. There are setups with hundreds of vlans
>> and hundreds of macvlans on top, propagating it up would send it to all of
>> them and that wouldn't scale at all, these mostly have IP addresses too.
>>
>> Perhaps we can enable propagation on a per-port or per-bridge basis, then we
>> can avoid these walks. That is, make it opt-in.
>>
>>>> It also seems difficult to avoid re-bouncing the notifier.
>>>
>>> syzbot will make short work of this patch, I think the potential
>>> for infinite loops has to be addressed somehow. IIUC this is the
>>> first instance of forwarding those notifiers to a peer rather
>>> than within an upper <> lower device hierarchy which is a DAG.
>
> My concern was about Hangbin's alternative proposal to notify all
> bridge ports. I hope in my proposal I was able to avoid infinite loops.
>
>>>
>>>> Signed-off-by: Alexandra Winter <wintera at linux.ibm.com>
>>>> ---
>>>>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 53 insertions(+)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index d29fb9759cc9..74b074453197 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>>>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>>>  }
>>>>
>>>> +static bool netif_is_veth(const struct net_device *dev)
>>>> +{
>>>> +	return (dev->netdev_ops == &veth_netdev_ops);
>>>
>>> brackets unnecessary
>>>
>>>> +}
>>>> +
>>>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>>>> +{
>>>> +	struct net_device *peer;
>>>> +	struct veth_priv *priv;
>>>> +
>>>> +	priv = netdev_priv(dev);
>>>> +	peer = rtnl_dereference(priv->peer);
>>>> +	/* avoid re-bounce between 2 bridges */
>>>> +	if (!netif_is_bridge_port(peer))
>>>> +		call_netdevice_notifiers(event, peer);
>>>> +}
>>>> +
>>>> +/* Called under rtnl_lock */
>>>> +static int veth_device_event(struct notifier_block *unused,
>>>> +			     unsigned long event, void *ptr)
>>>> +{
>>>> +	struct net_device *dev, *lower;
>>>> +	struct list_head *iter;
>>>> +
>>>> +	dev = netdev_notifier_info_to_dev(ptr);
>>>> +
>>>> +	switch (event) {
>>>> +	case NETDEV_NOTIFY_PEERS:
>>>> +	case NETDEV_BONDING_FAILOVER:
>>>> +	case NETDEV_RESEND_IGMP:
>>>> +		/* propagate to peer of a bridge attached veth */
>>>> +		if (netif_is_bridge_master(dev)) {
>>>
>>> Having veth sift thru bridge ports seems strange.
>>> In fact it could be beneficial to filter the event based on
>>> port state (whether it's forwarding, vlan etc). But looking
>>> at details of port state outside the bridge would be even stranger.
>>>
>>>> +			iter = &dev->adj_list.lower;
>>>> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>> +			while (lower) {
>>>> +				if (netif_is_veth(lower))
>>>> +					veth_notify_peer(event, lower);
>>>> +				lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>
>>> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
>>>
>>>> +			}
>>>> +		}
>>>> +		break;
>>>> +	default:
>>>> +		break;
>>>> +	}
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static struct notifier_block veth_notifier_block __read_mostly = {
>>>> +	.notifier_call = veth_device_event,
>>>
>>> extra tab here
>>>
>>>> +};
>>>> +
>>>>  /*
>>>>   * netlink interface
>>>>   */
>>>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>>>
>>>>  static __init int veth_init(void)
>>>>  {
>>>> +	register_netdevice_notifier(&veth_notifier_block);
>>>
>>> this can fail
>>>
>>>>  	return rtnl_link_register(&veth_link_ops);
>>>>  }
>>>>
>>>>  static __exit void veth_exit(void)
>>>>  {
>>>>  	rtnl_link_unregister(&veth_link_ops);
>>>> +	unregister_netdevice_notifier(&veth_notifier_block);
>>>>  }
>>>>
>>>>  module_init(veth_init);
>>>
>>
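As an aside on Nik's point above that user space can already send the
gratuitous ARPs itself when it sees a bond failover notification: a minimal
sketch of such a sender follows. It is illustrative only and not from this
thread; the interface name, MAC and IP are placeholders, and it needs
CAP_NET_RAW.

/* Minimal gratuitous-ARP sender, illustrating the user-space approach:
 * a daemon that watches for bond failover events could call something
 * like this for each address it wants the external switch to relearn.
 */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <netinet/if_ether.h>	/* struct ether_arp */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int send_garp(const char *ifname, const unsigned char mac[ETH_ALEN],
		     in_addr_t ip)
{
	unsigned char frame[ETH_HLEN + sizeof(struct ether_arp)] = { 0 };
	struct ethhdr *eth = (struct ethhdr *)frame;
	struct ether_arp *arp = (struct ether_arp *)(frame + ETH_HLEN);
	struct sockaddr_ll sll = { .sll_family = AF_PACKET };
	int fd, ret;

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));
	if (fd < 0)
		return -1;

	/* Ethernet header: broadcast destination, our MAC as source */
	memset(eth->h_dest, 0xff, ETH_ALEN);
	memcpy(eth->h_source, mac, ETH_ALEN);
	eth->h_proto = htons(ETH_P_ARP);

	/* Gratuitous ARP request: sender IP == target IP, target MAC zero */
	arp->arp_hrd = htons(ARPHRD_ETHER);
	arp->arp_pro = htons(ETH_P_IP);
	arp->arp_hln = ETH_ALEN;
	arp->arp_pln = 4;
	arp->arp_op  = htons(ARPOP_REQUEST);
	memcpy(arp->arp_sha, mac, ETH_ALEN);
	memcpy(arp->arp_spa, &ip, 4);
	memcpy(arp->arp_tpa, &ip, 4);

	sll.sll_ifindex = if_nametoindex(ifname);
	sll.sll_halen = ETH_ALEN;
	memset(sll.sll_addr, 0xff, ETH_ALEN);

	ret = sendto(fd, frame, sizeof(frame), 0,
		     (struct sockaddr *)&sll, sizeof(sll));
	close(fd);
	return ret < 0 ? -1 : 0;
}

A daemon built around this could, for example, watch rtnetlink for the bond's
link changes and then call send_garp("bond0", mac, ip) for each address it
wants refreshed (again, all names here are placeholders).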
Jakub Kicinski
2022-Mar-30 15:51 UTC
[Bridge] [PATCH net-next v2] veth: Support bonding events
On Wed, 30 Mar 2022 13:14:12 +0200 Alexandra Winter wrote:
> >> This patch in no way addresses (2). But then, again, if we put
> >> a macvlan on top of a bridge master it will shotgun its GARPS all
> >> the same. So it's not like veth would be special in that regard.
> >>
> >> Nik, what am I missing?
> >
> > If we're talking about macvlan -> bridge -> bond then the bond flap's
> > notify peers shouldn't reach the macvlan.

Hm, right. I'm missing a step in my understanding. As you say bridge
does not seem to be re-broadcasting the event to its master.
So how does Alexandra catch this kind of an event? :S

	case NETDEV_NOTIFY_PEERS:
		/* propagate to peer of a bridge attached veth */
		if (netif_is_bridge_master(dev)) {

IIUC bond will notify with dev == bond netdev. Where is the event with
dev == br generated?

> > Generally broadcast traffic
> > is quite expensive for the bridge, I have patches that improve on the
> > technical side (consider ports only for the same bcast domain), but you also
> > wouldn't want unnecessary bcast packets being sent around. :)
> > There are setups with tens of bond devices and propagating that to all would be
> > very expensive, but most of all unnecessary. It would also hurt setups with
> > a lot of vlan devices on the bridge. There are setups with hundreds of vlans
> > and hundreds of macvlans on top, propagating it up would send it to all of
> > them and that wouldn't scale at all, these mostly have IP addresses too.

Ack.

> > Perhaps we can enable propagation on a per-port or per-bridge basis, then we
> > can avoid these walks. That is, make it opt-in.

Maybe opt-out? But assuming the event is only generated on active/backup
switch over - when would it be okay to ignore the notification?

> >>> It also seems difficult to avoid re-bouncing the notifier.
> >>
> >> syzbot will make short work of this patch, I think the potential
> >> for infinite loops has to be addressed somehow. IIUC this is the
> >> first instance of forwarding those notifiers to a peer rather
> >> than within an upper <> lower device hierarchy which is a DAG.
>
> My concern was about Hangbin's alternative proposal to notify all
> bridge ports. I hope in my proposal I was able to avoid infinite loops.

Possibly I'm confused as to where the notification for bridge master
gets sent..
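For what it's worth, the question hinges on which net_device the event was
raised for: a notifier handler sees exactly the device passed to
call_netdevice_notifiers(), so if (as Jakub says, "IIUC") the bonding driver
notifies with its own netdev, the netif_is_bridge_master(dev) check in the
patch would only match if something re-raised the event with the bridge as
the subject. A toy, hypothetical module illustrating that plumbing (not from
the thread, assumption labelled as such):

/* Hypothetical demo module: logs which device a NETDEV_NOTIFY_PEERS
 * event was raised for and whether that device is a bridge master.
 */
#include <linux/module.h>
#include <linux/netdevice.h>

static int demo_event(struct notifier_block *nb, unsigned long event,
		      void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	if (event == NETDEV_NOTIFY_PEERS)
		netdev_info(dev, "NETDEV_NOTIFY_PEERS, bridge master? %d\n",
			    netif_is_bridge_master(dev));
	return NOTIFY_DONE;
}

static struct notifier_block demo_nb = { .notifier_call = demo_event };

static int __init demo_init(void)
{
	return register_netdevice_notifier(&demo_nb);
}

static void __exit demo_exit(void)
{
	unregister_netdevice_notifier(&demo_nb);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");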