Alexandra Winter
2022-Mar-30 11:14 UTC
[Bridge] [PATCH net-next v2] veth: Support bonding events
On 30.03.22 12:23, Nikolay Aleksandrov wrote:
> On 30/03/2022 03:54, Jakub Kicinski wrote:
>> Dropping the BPF people from CC and adding Hangbin, bridge and
>> bond/team. Please exercise some judgment when sending patches.

Thank you.
I did 'scripts/get_maintainer.pl drivers/net/veth.c'
but was a bit surprised about the outcome.

>>
>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>> Bonding drivers generate specific events during failover that trigger
>>> switch updates. When a veth device is attached to a bridge with a
>>> bond interface, we want external switches to learn about the veth
>>> devices as well.
>>>
>>> Example:
>>>
>>>   | veth_a2 | veth_b2 | veth_c2 |
>>>  ------o-----------o----------o------
>>>         \          |         /
>>>          o         o        o
>>>      veth_a1    veth_b1   veth_c1
>>>     -------------------------
>>>    |          bridge         |
>>>     -------------------------
>>>               bond0
>>>              /     \
>>>           eth0     eth1
>>>
>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>> external hardware attached to eth1.
>>>
>>> Without this patch we have seen cases where recovery after bond failover
>>> took an unacceptable amount of time (depending on timeout settings in the
>>> network).
>>>
>>> Due to the symmetric nature of veth special care is required to avoid
>>> endless notification loops. Therefore we only notify from a veth
>>> bridgeport to a peer that is not a bridgeport.
>>>
>>> References:
>>> Same handling as for macvlan:
>>> commit 4c9912556867 ("macvlan: Support bonding events")
>>> and vlan:
>>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>>
>>> Alternatives:
>>> Propagate notifier events to all ports of a bridge. IIUC, this was
>>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
>>
>> My (likely flawed) reading of Nik's argument was that (1) he was
>> concerned about GARP storms; (2) he didn't want the GARP to be
>> broadcast to all ports, just the bond that originated the request.
>>
>
> Yes, that would be ideal. Trying to avoid unnecessary bcasts, that is
> especially important for large setups with lots of devices.

One way to target the bond that originated the request, would be if the
bridge itself would do GARPs/RARPS/.., on this bond port for all MACs
that are in its FDB. What do you think about that?

>
>> I'm not sure I follow (1), as Hangbin said the event is rare, plus
>> GARP only comes from interfaces that have an IP addr, which IIUC
>> most bridge ports will not have.
>>
>
> Indeed, such setups are not the most common ones.
>
>> This patch in no way addresses (2). But then, again, if we put
>> a macvlan on top of a bridge master it will shotgun its GARPS all
>> the same. So it's not like veth would be special in that regard.
>>
>> Nik, what am I missing?
>>
>
> If we're talking about macvlan -> bridge -> bond then the bond flap's
> notify peers shouldn't reach the macvlan. Generally broadcast traffic
> is quite expensive for the bridge, I have patches that improve on the
> technical side (consider ports only for the same bcast domain), but you also
> wouldn't want unnecessary bcast packets being sent around. :)
> There are setups with tens of bond devices and propagating that to all would be
> very expensive, but most of all unnecessary. It would also hurt setups with
> a lot of vlan devices on the bridge. There are setups with hundreds of vlans
> and hundreds of macvlans on top, propagating it up would send it to all of
> them and that wouldn't scale at all, these mostly have IP addresses too.
>
> Perhaps we can enable propagation on a per-port or per-bridge basis, then we
> can avoid these walks. That is, make it opt-in.
>
>>> It also seems difficult to avoid re-bouncing the notifier.
>>
>> syzbot will make short work of this patch, I think the potential
>> for infinite loops has to be addressed somehow. IIUC this is the
>> first instance of forwarding those notifiers to a peer rather
>> than within an upper <> lower device hierarchy which is a DAG.

My concern was about Hangbin's alternative proposal to notify all
bridge ports. I hope in my proposal I was able to avoid infinite loops.

>>
>>> Signed-off-by: Alexandra Winter <wintera at linux.ibm.com>
>>> ---
>>>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 53 insertions(+)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index d29fb9759cc9..74b074453197 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>>  }
>>>
>>> +static bool netif_is_veth(const struct net_device *dev)
>>> +{
>>> +	return (dev->netdev_ops == &veth_netdev_ops);
>>
>> brackets unnecessary
>>
>>> +}
>>> +
>>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>>> +{
>>> +	struct net_device *peer;
>>> +	struct veth_priv *priv;
>>> +
>>> +	priv = netdev_priv(dev);
>>> +	peer = rtnl_dereference(priv->peer);
>>> +	/* avoid re-bounce between 2 bridges */
>>> +	if (!netif_is_bridge_port(peer))
>>> +		call_netdevice_notifiers(event, peer);
>>> +}
>>> +
>>> +/* Called under rtnl_lock */
>>> +static int veth_device_event(struct notifier_block *unused,
>>> +			     unsigned long event, void *ptr)
>>> +{
>>> +	struct net_device *dev, *lower;
>>> +	struct list_head *iter;
>>> +
>>> +	dev = netdev_notifier_info_to_dev(ptr);
>>> +
>>> +	switch (event) {
>>> +	case NETDEV_NOTIFY_PEERS:
>>> +	case NETDEV_BONDING_FAILOVER:
>>> +	case NETDEV_RESEND_IGMP:
>>> +		/* propagate to peer of a bridge attached veth */
>>> +		if (netif_is_bridge_master(dev)) {
>>
>> Having veth sift thru bridge ports seems strange.
>> In fact it could be beneficial to filter the event based on
>> port state (whether it's forwarding, vlan etc). But looking
>> at details of port state outside the bridge would be even stranger.
>>
>>> +			iter = &dev->adj_list.lower;
>>> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
>>> +			while (lower) {
>>> +				if (netif_is_veth(lower))
>>> +					veth_notify_peer(event, lower);
>>> +				lower = netdev_next_lower_dev_rcu(dev, &iter);
>>
>> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
>>
>>> +			}
>>> +		}
>>> +		break;
>>> +	default:
>>> +		break;
>>> +	}
>>> +	return NOTIFY_DONE;
>>> +}
>>> +
>>> +static struct notifier_block veth_notifier_block __read_mostly = {
>>> +	.notifier_call = veth_device_event,
>>
>> extra tab here
>>
>>> +};
>>> +
>>>  /*
>>>   * netlink interface
>>>   */
>>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>>
>>>  static __init int veth_init(void)
>>>  {
>>> +	register_netdevice_notifier(&veth_notifier_block);
>>
>> this can fail
>>
>>>  	return rtnl_link_register(&veth_link_ops);
>>>  }
>>>
>>>  static __exit void veth_exit(void)
>>>  {
>>>  	rtnl_link_unregister(&veth_link_ops);
>>> +	unregister_netdevice_notifier(&veth_notifier_block);
>>>  }
>>>
>>>  module_init(veth_init);
>>
>
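For illustration only (not part of the posted v2 patch): one way veth_init()
could handle the register_netdevice_notifier() failure Jakub points out above
is to check the return value and unwind if rtnl_link_register() then fails.
A minimal, untested sketch:

static __init int veth_init(void)
{
	int err;

	/* register_netdevice_notifier() can fail; don't ignore the error */
	err = register_netdevice_notifier(&veth_notifier_block);
	if (err)
		return err;

	err = rtnl_link_register(&veth_link_ops);
	if (err)
		unregister_netdevice_notifier(&veth_notifier_block);

	return err;
}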
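Likewise, the netdev_for_each_lower_dev_rcu() helper Jakub asks for does not
exist yet; a hypothetical sketch, modeled on the existing
netdev_for_each_lower_dev() macro and the exported netdev_next_lower_dev_rcu()
iterator in include/linux/netdevice.h, might look like this:

/* hypothetical helper, would live in include/linux/netdevice.h */
#define netdev_for_each_lower_dev_rcu(dev, ldev, iter)		\
	for (iter = &(dev)->adj_list.lower,			\
	     ldev = netdev_next_lower_dev_rcu(dev, &(iter));	\
	     ldev;						\
	     ldev = netdev_next_lower_dev_rcu(dev, &(iter)))

which would let the open-coded loop in veth_device_event() become:

	netdev_for_each_lower_dev_rcu(dev, lower, iter) {
		if (netif_is_veth(lower))
			veth_notify_peer(event, lower);
	}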
Nikolay Aleksandrov
2022-Mar-30 11:25 UTC
[Bridge] [PATCH net-next v2] veth: Support bonding events
On 30/03/2022 14:14, Alexandra Winter wrote:
>
>
> On 30.03.22 12:23, Nikolay Aleksandrov wrote:
>> On 30/03/2022 03:54, Jakub Kicinski wrote:
>>> Dropping the BPF people from CC and adding Hangbin, bridge and
>>> bond/team. Please exercise some judgment when sending patches.
> Thank you.
> I did 'scripts/get_maintainer.pl drivers/net/veth.c'
> but was a bit surprised about the outcome.
>
>>>
>>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>>> Bonding drivers generate specific events during failover that trigger
>>>> switch updates. When a veth device is attached to a bridge with a
>>>> bond interface, we want external switches to learn about the veth
>>>> devices as well.
>>>>
>>>> Example:
>>>>
>>>>   | veth_a2 | veth_b2 | veth_c2 |
>>>>  ------o-----------o----------o------
>>>>         \          |         /
>>>>          o         o        o
>>>>      veth_a1    veth_b1   veth_c1
>>>>     -------------------------
>>>>    |          bridge         |
>>>>     -------------------------
>>>>               bond0
>>>>              /     \
>>>>           eth0     eth1
>>>>
>>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>>> external hardware attached to eth1.
>>>>
>>>> Without this patch we have seen cases where recovery after bond failover
>>>> took an unacceptable amount of time (depending on timeout settings in the
>>>> network).
>>>>
>>>> Due to the symmetric nature of veth special care is required to avoid
>>>> endless notification loops. Therefore we only notify from a veth
>>>> bridgeport to a peer that is not a bridgeport.
>>>>
>>>> References:
>>>> Same handling as for macvlan:
>>>> commit 4c9912556867 ("macvlan: Support bonding events")
>>>> and vlan:
>>>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>>>
>>>> Alternatives:
>>>> Propagate notifier events to all ports of a bridge. IIUC, this was
>>>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
>>>
>>> My (likely flawed) reading of Nik's argument was that (1) he was
>>> concerned about GARP storms; (2) he didn't want the GARP to be
>>> broadcast to all ports, just the bond that originated the request.
>>>
>>
>> Yes, that would be ideal. Trying to avoid unnecessary bcasts, that is
>> especially important for large setups with lots of devices.
>
> One way to target the bond that originated the request, would be if the
> bridge itself would do GARPs/RARPS/.., on this bond port for all MACs
> that are in its FDB. What do you think about that?
>

That's a hack and you can already do it easily in user-space, you don't
need anything special in the bridge. It is also very specific, and it
should only happen in certain situations (e.g. a/b flap) which the bridge
doesn't really know about, but user-space does because it can see the
notifications and can see the bond mode.

>>
>>> I'm not sure I follow (1), as Hangbin said the event is rare, plus
>>> GARP only comes from interfaces that have an IP addr, which IIUC
>>> most bridge ports will not have.
>>>
>>
>> Indeed, such setups are not the most common ones.
>>
>>> This patch in no way addresses (2). But then, again, if we put
>>> a macvlan on top of a bridge master it will shotgun its GARPS all
>>> the same. So it's not like veth would be special in that regard.
>>>
>>> Nik, what am I missing?
>>>
>>
>> If we're talking about macvlan -> bridge -> bond then the bond flap's
>> notify peers shouldn't reach the macvlan. Generally broadcast traffic
>> is quite expensive for the bridge, I have patches that improve on the
>> technical side (consider ports only for the same bcast domain), but you also
>> wouldn't want unnecessary bcast packets being sent around. :)
>> There are setups with tens of bond devices and propagating that to all would be
>> very expensive, but most of all unnecessary. It would also hurt setups with
>> a lot of vlan devices on the bridge. There are setups with hundreds of vlans
>> and hundreds of macvlans on top, propagating it up would send it to all of
>> them and that wouldn't scale at all, these mostly have IP addresses too.
>>
>> Perhaps we can enable propagation on a per-port or per-bridge basis, then we
>> can avoid these walks. That is, make it opt-in.
>>
>>>> It also seems difficult to avoid re-bouncing the notifier.
>>>
>>> syzbot will make short work of this patch, I think the potential
>>> for infinite loops has to be addressed somehow. IIUC this is the
>>> first instance of forwarding those notifiers to a peer rather
>>> than within an upper <> lower device hierarchy which is a DAG.
>
> My concern was about Hangbin's alternative proposal to notify all
> bridge ports. I hope in my proposal I was able to avoid infinite loops.
>
>>>
>>>> Signed-off-by: Alexandra Winter <wintera at linux.ibm.com>
>>>> ---
>>>>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 53 insertions(+)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index d29fb9759cc9..74b074453197 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>>>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>>>  }
>>>>
>>>> +static bool netif_is_veth(const struct net_device *dev)
>>>> +{
>>>> +	return (dev->netdev_ops == &veth_netdev_ops);
>>>
>>> brackets unnecessary
>>>
>>>> +}
>>>> +
>>>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>>>> +{
>>>> +	struct net_device *peer;
>>>> +	struct veth_priv *priv;
>>>> +
>>>> +	priv = netdev_priv(dev);
>>>> +	peer = rtnl_dereference(priv->peer);
>>>> +	/* avoid re-bounce between 2 bridges */
>>>> +	if (!netif_is_bridge_port(peer))
>>>> +		call_netdevice_notifiers(event, peer);
>>>> +}
>>>> +
>>>> +/* Called under rtnl_lock */
>>>> +static int veth_device_event(struct notifier_block *unused,
>>>> +			     unsigned long event, void *ptr)
>>>> +{
>>>> +	struct net_device *dev, *lower;
>>>> +	struct list_head *iter;
>>>> +
>>>> +	dev = netdev_notifier_info_to_dev(ptr);
>>>> +
>>>> +	switch (event) {
>>>> +	case NETDEV_NOTIFY_PEERS:
>>>> +	case NETDEV_BONDING_FAILOVER:
>>>> +	case NETDEV_RESEND_IGMP:
>>>> +		/* propagate to peer of a bridge attached veth */
>>>> +		if (netif_is_bridge_master(dev)) {
>>>
>>> Having veth sift thru bridge ports seems strange.
>>> In fact it could be beneficial to filter the event based on
>>> port state (whether it's forwarding, vlan etc). But looking
>>> at details of port state outside the bridge would be even stranger.
>>>
>>>> +			iter = &dev->adj_list.lower;
>>>> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>> +			while (lower) {
>>>> +				if (netif_is_veth(lower))
>>>> +					veth_notify_peer(event, lower);
>>>> +				lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>
>>> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
>>>
>>>> +			}
>>>> +		}
>>>> +		break;
>>>> +	default:
>>>> +		break;
>>>> +	}
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static struct notifier_block veth_notifier_block __read_mostly = {
>>>> +	.notifier_call = veth_device_event,
>>>
>>> extra tab here
>>>
>>>> +};
>>>> +
>>>>  /*
>>>>   * netlink interface
>>>>   */
>>>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>>>
>>>>  static __init int veth_init(void)
>>>>  {
>>>> +	register_netdevice_notifier(&veth_notifier_block);
>>>
>>> this can fail
>>>
>>>>  	return rtnl_link_register(&veth_link_ops);
>>>>  }
>>>>
>>>>  static __exit void veth_exit(void)
>>>>  {
>>>>  	rtnl_link_unregister(&veth_link_ops);
>>>> +	unregister_netdevice_notifier(&veth_notifier_block);
>>>>  }
>>>>
>>>>  module_init(veth_init);
>>>
>>
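As an aside on Nik's point above that user space can already send the
gratuitous ARPs itself when it sees a bond failover notification: a minimal
sketch of such a sender follows. It is illustrative only and not from this
thread; the interface name, MAC and IP are placeholders, and it needs
CAP_NET_RAW.

/* Minimal gratuitous-ARP sender, illustrating the user-space approach:
 * a daemon that watches for bond failover events could call something
 * like this for each address it wants the external switch to relearn.
 */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <netinet/if_ether.h>	/* struct ether_arp */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int send_garp(const char *ifname, const unsigned char mac[ETH_ALEN],
		     in_addr_t ip)
{
	unsigned char frame[ETH_HLEN + sizeof(struct ether_arp)] = { 0 };
	struct ethhdr *eth = (struct ethhdr *)frame;
	struct ether_arp *arp = (struct ether_arp *)(frame + ETH_HLEN);
	struct sockaddr_ll sll = { .sll_family = AF_PACKET };
	int fd, ret;

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));
	if (fd < 0)
		return -1;

	/* Ethernet header: broadcast destination, our MAC as source */
	memset(eth->h_dest, 0xff, ETH_ALEN);
	memcpy(eth->h_source, mac, ETH_ALEN);
	eth->h_proto = htons(ETH_P_ARP);

	/* Gratuitous ARP request: sender IP == target IP, target MAC zero */
	arp->arp_hrd = htons(ARPHRD_ETHER);
	arp->arp_pro = htons(ETH_P_IP);
	arp->arp_hln = ETH_ALEN;
	arp->arp_pln = 4;
	arp->arp_op  = htons(ARPOP_REQUEST);
	memcpy(arp->arp_sha, mac, ETH_ALEN);
	memcpy(arp->arp_spa, &ip, 4);
	memcpy(arp->arp_tpa, &ip, 4);

	sll.sll_ifindex = if_nametoindex(ifname);
	sll.sll_halen = ETH_ALEN;
	memset(sll.sll_addr, 0xff, ETH_ALEN);

	ret = sendto(fd, frame, sizeof(frame), 0,
		     (struct sockaddr *)&sll, sizeof(sll));
	close(fd);
	return ret < 0 ? -1 : 0;
}

A daemon built around this could, for example, watch rtnetlink for the bond's
link changes and then call send_garp("bond0", mac, ip) for each address it
wants refreshed (again, all names here are placeholders).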
Jakub Kicinski
2022-Mar-30 15:51 UTC
[Bridge] [PATCH net-next v2] veth: Support bonding events
On Wed, 30 Mar 2022 13:14:12 +0200 Alexandra Winter wrote:
> >> This patch in no way addresses (2). But then, again, if we put
> >> a macvlan on top of a bridge master it will shotgun its GARPS all
> >> the same. So it's not like veth would be special in that regard.
> >>
> >> Nik, what am I missing?
> >
> > If we're talking about macvlan -> bridge -> bond then the bond flap's
> > notify peers shouldn't reach the macvlan.

Hm, right. I'm missing a step in my understanding. As you say bridge
does not seem to be re-broadcasting the event to its master.
So how does Alexandra catch this kind of an event? :S

	case NETDEV_NOTIFY_PEERS:
		/* propagate to peer of a bridge attached veth */
		if (netif_is_bridge_master(dev)) {

IIUC bond will notify with dev == bond netdev. Where is the event with
dev == br generated?

> > Generally broadcast traffic
> > is quite expensive for the bridge, I have patches that improve on the
> > technical side (consider ports only for the same bcast domain), but you also
> > wouldn't want unnecessary bcast packets being sent around. :)
> > There are setups with tens of bond devices and propagating that to all would be
> > very expensive, but most of all unnecessary. It would also hurt setups with
> > a lot of vlan devices on the bridge. There are setups with hundreds of vlans
> > and hundreds of macvlans on top, propagating it up would send it to all of
> > them and that wouldn't scale at all, these mostly have IP addresses too.

Ack.

> > Perhaps we can enable propagation on a per-port or per-bridge basis, then we
> > can avoid these walks. That is, make it opt-in.

Maybe opt-out? But assuming the event is only generated on active/backup
switch over - when would it be okay to ignore the notification?

> >>> It also seems difficult to avoid re-bouncing the notifier.
> >>
> >> syzbot will make short work of this patch, I think the potential
> >> for infinite loops has to be addressed somehow. IIUC this is the
> >> first instance of forwarding those notifiers to a peer rather
> >> than within an upper <> lower device hierarchy which is a DAG.
>
> My concern was about Hangbin's alternative proposal to notify all
> bridge ports. I hope in my proposal I was able to avoid infinite loops.

Possibly I'm confused as to where the notification for bridge master
gets sent..
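For what it's worth, the question hinges on which net_device the event was
raised for: a notifier handler sees exactly the device passed to
call_netdevice_notifiers(), so if (as Jakub says, "IIUC") the bonding driver
notifies with its own netdev, the netif_is_bridge_master(dev) check in the
patch would only match if something re-raised the event with the bridge as
the subject. A toy, hypothetical module illustrating that plumbing (not from
the thread, assumption labelled as such):

/* Hypothetical demo module: logs which device a NETDEV_NOTIFY_PEERS
 * event was raised for and whether that device is a bridge master.
 */
#include <linux/module.h>
#include <linux/netdevice.h>

static int demo_event(struct notifier_block *nb, unsigned long event,
		      void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	if (event == NETDEV_NOTIFY_PEERS)
		netdev_info(dev, "NETDEV_NOTIFY_PEERS, bridge master? %d\n",
			    netif_is_bridge_master(dev));
	return NOTIFY_DONE;
}

static struct notifier_block demo_nb = { .notifier_call = demo_event };

static int __init demo_init(void)
{
	return register_netdevice_notifier(&demo_nb);
}

static void __exit demo_exit(void)
{
	unregister_netdevice_notifier(&demo_nb);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");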