thr3ads.net - Linux Ethernet Bridging - [Bridge] [PATCH V3 net-next 1/4] net: bridge: add fdb flag to extent locked port feature [Jul 2022]

Nikolay Aleksandrov

2022-Jul-06 19:38 UTC

[Bridge] [PATCH V3 net-next 1/4] net: bridge: add fdb flag to extent locked port feature

On 06/07/2022 21:13, Vladimir Oltean wrote:
> Hi Nikolay,
> 
> On Wed, May 25, 2022 at 01:18:49PM +0300, Nikolay Aleksandrov wrote:
>>>>>>>> Hi Hans,
>>>>>>>> So this approach has a fundamental problem, f->dst is changed without any synchronization
>>>>>>>> you cannot rely on it and thus you cannot account for these entries properly. We must be very
>>>>>>>> careful if we try to add any new synchronization not to affect performance as well.
>>>>>>>> More below...
>>>>>>>>
>>>>>>>>> @@ -319,6 +326,9 @@ static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f,
>>>>>>>>>  	if (test_bit(BR_FDB_STATIC, &f->flags))
>>>>>>>>>  		fdb_del_hw_addr(br, f->key.addr.addr);
>>>>>>>>>  
>>>>>>>>> +	if (test_bit(BR_FDB_ENTRY_LOCKED, &f->flags) && !test_bit(BR_FDB_OFFLOADED, &f->flags))
>>>>>>>>> +		atomic_dec(&f->dst->locked_entry_cnt);
>>>>>>>>
>>>>>>>> Sorry but you cannot do this for multiple reasons:
>>>>>>>>  - f->dst can be NULL
>>>>>>>>  - f->dst changes without any synchronization
>>>>>>>>  - there is no synchronization between fdb's flags and its ->dst
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>  Nik
>>>>>>>
>>>>>>> Hi Nik,
>>>>>>>
>>>>>>> if a port is decoupled from the bridge, the locked entries would of
>>>>>>> course be invalid, so maybe if adding and removing a port is accounted
>>>>>>> for wrt locked entries and the count of locked entries, would that not
>>>>>>> work?
>>>>>>>
>>>>>>> Best,
>>>>>>> Hans
>>>>>>
>>>>>> Hi Hans,
>>>>>> Unfortunately you need the correct amount of locked entries per-port if you want
>>>>>> to limit their number per-port, instead of globally. So you need a
>>>>>> consistent
>>>>>
>>>>> Hi Nik,
>>>>> the used dst is a port structure, so it is per-port and not globally.
>>>>>
>>>>> Best,
>>>>> Hans
>>>>>
>>>>
>>>> Yeah, I know. :) That's why I wrote it, if the limit is not a feature requirement I'd suggest
>>>> dropping it altogether, it can be enforced externally (e.g. from user-space) if needed.
>>>>
>>>> By the way just fyi net-next is closed right now due to merge window. And one more
>>>> thing please include a short log of changes between versions when you send a new one.
>>>> I had to go look for v2 to find out what changed.
>>>>
>>>
>>> Okay, I will drop the limit in the bridge module, which is an easy thing
>>> to do. :) (It is mostly there to ensure against DOS attacks if someone
>>> bombards a locked port with random mac addresses.)
>>> I have a similar limitation in the driver, which should then probably be
>>> dropped too?
>>>
>>
>> That is up to you/driver, I'd try looking for similar problems in other switch drivers
>> and check how those were handled. There are people in the CC above that can
>> directly answer that. :)
> 
> Not sure whom you're referring to?
I meant people who have dealt with hardware resource management in the drivers.
> 
> In fact I was pretty sure that I didn't see any OOM protection in the
> source code of the Linux bridge driver itself either, so I wanted to
> check that for myself, so I wrote a small "killswitch" program that's
> supposed to, well, kill a switch. It took me a while to find a few free
> hours to do the test, sorry for that.
> 
> https://github.com/vladimiroltean/killswitch/blob/master/src/killswitch.c
> 
> Sure enough, I can kill a Marvell Armada 3720 device with 1GB of RAM
> within 3 minutes of running the test program.
> 
I don't think that is new or surprising, if there isn't anything to control the
device resources you'll get there. You don't really need to write any new programs
you can easily do it with mausezahn. I have tests that add over 10 million fdbs on
devices for a few seconds.

The point is it's not the bridge's task to limit memory consumption or to watch for resource
management. You can limit new entries from the device driver (in case of swdev learning) or
you can use a daemon to watch the number of entries and disable learning. There are many
different ways to avoid this. We've discussed it before and I don't mind adding a hard fdb
per-port limit in the bridge as long as it's done properly. We've also discussed LRU and similar
algorithms for fdb learning and eviction. But any hardcoded limits or limits that can break
current default use cases are unacceptable, they must be opt-in.
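
As an illustration of what "opt-in" could mean for a per-port learning limit, here is a minimal user-space sketch. It is not bridge code: struct port_model, fdb_max, port_try_learn() and port_forget() are hypothetical names invented for this example, and the model is single-threaded, so it deliberately sidesteps the f->dst synchronization problem raised earlier in the thread. The only point it shows is that a cap of 0 keeps today's unlimited default behaviour.

```c
#include <stdbool.h>

/* Hypothetical user-space model of a bridge port; NOT the kernel's struct. */
struct port_model {
	unsigned int fdb_learned;	/* dynamically learned entries on this port */
	unsigned int fdb_max;		/* opt-in cap; 0 means "no limit" (the default) */
};

/* Returns true if a new entry may be learned, and accounts for it. */
static bool port_try_learn(struct port_model *p)
{
	/* Refuse only when a cap was explicitly configured and is reached. */
	if (p->fdb_max != 0 && p->fdb_learned >= p->fdb_max)
		return false;
	p->fdb_learned++;
	return true;
}

/* Called when an entry ages out or is deleted. */
static void port_forget(struct port_model *p)
{
	if (p->fdb_learned > 0)
		p->fdb_learned--;
}
```

With fdb_max left at 0 every call to port_try_learn() succeeds, so existing setups would see no change; learning is refused only on ports where an administrator has set a cap.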
> [  273.864203] ksoftirqd/0: page allocation failure: order:0, mode:0x40a20(GFP_ATOMIC|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
> [  273.876426] CPU: 0 PID: 12 Comm: ksoftirqd/0 Not tainted 5.18.7-rc1-00013-g52b92343db13 #74
> [  273.884775] Hardware name: CZ.NIC Turris Mox Board (DT)
> [  273.889994] Call trace:
> [  273.892437]  dump_backtrace.part.0+0xc8/0xd4
> [  273.896721]  show_stack+0x18/0x70
> [  273.900039]  dump_stack_lvl+0x68/0x84
> [  273.903703]  dump_stack+0x18/0x34
> [  273.907017]  warn_alloc+0x114/0x1a0
> [  273.910508]  __alloc_pages+0xbb0/0xbe0
> [  273.914257]  cache_grow_begin+0x60/0x300
> [  273.918183]  fallback_alloc+0x184/0x220
> [  273.922017]  ____cache_alloc_node+0x174/0x190
> [  273.926373]  kmem_cache_alloc+0x1a4/0x220
> [  273.930381]  fdb_create+0x40/0x430
> [  273.933784]  br_fdb_update+0x198/0x210
> [  273.937532]  br_handle_frame_finish+0x244/0x530
> [  273.942063]  br_handle_frame+0x1c0/0x270
> [  273.945986]  __netif_receive_skb_core.constprop.0+0x29c/0xd30
> [  273.951734]  __netif_receive_skb_list_core+0xe8/0x210
> [  273.956784]  netif_receive_skb_list_internal+0x180/0x29c
> [  273.962091]  napi_gro_receive+0x174/0x190
> [  273.966099]  mvneta_rx_swbm+0x6b8/0xb40
> [  273.969935]  mvneta_poll+0x684/0x900
> [  273.973506]  __napi_poll+0x38/0x18c
> [  273.976988]  net_rx_action+0xe8/0x280
> [  273.980643]  __do_softirq+0x124/0x2a0
> [  273.984299]  run_ksoftirqd+0x4c/0x60
> [  273.987871]  smpboot_thread_fn+0x23c/0x270
> [  273.991963]  kthread+0x10c/0x110
> [  273.995188]  ret_from_fork+0x10/0x20
> 
> (followed by lots upon lots of vomiting, followed by ...)
> 
> [  311.138590] Out of memory and no killable processes...
> [  311.143774] Kernel panic - not syncing: System is deadlocked on memory
> [  311.150295] CPU: 0 PID: 6 Comm: kworker/0:0 Not tainted 5.18.7-rc1-00013-g52b92343db13 #74
> [  311.158550] Hardware name: CZ.NIC Turris Mox Board (DT)
> [  311.163766] Workqueue: events rht_deferred_worker
> [  311.168477] Call trace:
> [  311.170916]  dump_backtrace.part.0+0xc8/0xd4
> [  311.175188]  show_stack+0x18/0x70
> [  311.178501]  dump_stack_lvl+0x68/0x84
> [  311.182159]  dump_stack+0x18/0x34
> [  311.185466]  panic+0x168/0x328
> [  311.188515]  out_of_memory+0x568/0x584
> [  311.192261]  __alloc_pages+0xb04/0xbe0
> [  311.196006]  __alloc_pages_bulk+0x15c/0x604
> [  311.200185]  alloc_pages_bulk_array_mempolicy+0xbc/0x24c
> [  311.205491]  __vmalloc_node_range+0x238/0x550
> [  311.209843]  __vmalloc_node_range+0x1c0/0x550
> [  311.214195]  kvmalloc_node+0xe0/0x124
> [  311.217856]  bucket_table_alloc.isra.0+0x40/0x150
> [  311.222554]  rhashtable_rehash_alloc.isra.0+0x20/0x8c
> [  311.227599]  rht_deferred_worker+0x7c/0x540
> [  311.231775]  process_one_work+0x1d0/0x320
> [  311.235779]  worker_thread+0x70/0x440
> [  311.239435]  kthread+0x10c/0x110
> [  311.242661]  ret_from_fork+0x10/0x20
> [  311.246238] SMP: stopping secondary CPUs
> [  311.250161] Kernel Offset: disabled
> [  311.253642] CPU features: 0x000,00020009,00001086
> [  311.258338] Memory Limit: none
> [  311.261390] ---[ end Kernel panic - not syncing: System is deadlocked on memory ]---
> 
> That can't be quite alright? Shouldn't we have some sort of protection
> in the bridge itself too, not just tell hardware driver writers to deal
> with it? Or is it somewhere, but it needs to be enabled/configured?

This is expected, if you'd like feel free to add a hard learning limit in the driver
and the bridge (again if implemented properly).
Nothing can save you if someone has L2 access to the device, they can poison any
table if learning is enabled.

Vladimir Oltean

2022-Jul-06 20:21 UTC

[Bridge] [PATCH V3 net-next 1/4] net: bridge: add fdb flag to extent locked port feature

On Wed, Jul 06, 2022 at 10:38:04PM +0300, Nikolay Aleksandrov wrote:
> I don't think that is new or surprising, if there isn't anything to control the
> device resources you'll get there. You don't really need to write any new programs
> you can easily do it with mausezahn. I have tests that add over 10 million fdbs on
> devices for a few seconds.

Of course it isn't new, but that doesn't make the situation in any way better,
quite the opposite...

> The point is it's not the bridge's task to limit memory consumption or to watch for resource
> management. You can limit new entries from the device driver (in case of swdev learning) or
> you can use a daemon to watch the number of entries and disable learning. There are many
> different ways to avoid this. We've discussed it before and I don't mind adding a hard fdb
> per-port limit in the bridge as long as it's done properly. We've also discussed LRU and similar
> algorithms for fdb learning and eviction. But any hardcoded limits or limits that can break
> current default use cases are unacceptable, they must be opt-in.

I don't think you can really say that it's not the bridge's task to
limit memory consumption when what it does is essentially allocate
memory from untrusted and unbounded user input, in kernel softirq
context.

That's in fact the problem, the kernel OOM killer will kick in, but
there will be no process to kill. This is why the kernel deadlocks on
memory and dies.

Maybe where our expectations differ is that I believe that a Linux
bridge shouldn't need gazillions of tweaks to not kill the kernel?
There are many devices in production using a bridge without such
configuration, you can't just make it opt-in.

Of course, performance under heavy stress is a separate concern, and
maybe user space monitoring would be a better idea for that.

I know you changed jobs, but did Cumulus Linux have an application to
monitor and limit the FDB entry count? Is there some standard
application which does this somewhere, or does everybody roll their own?

Anyway, limiting FDB entry count from user space is still theoretically
different from not dying. If you need to schedule a task to dispose of
the weight while the ship is sinking from softirq context, you may never
get to actually schedule that task in time. AFAIK the bridge UAPI doesn't
expose a pre-programmed limit, so what needs to be done is for user
space to manually delete entries until the count falls below the limit.
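
To make the "delete entries until the count falls below the limit" step concrete, here is a rough sketch of the selection part only. It assumes user space already holds a snapshot of the FDB with some notion of entry age; struct fdb_snapshot_entry, its last_seen field and pick_victims() are hypothetical names for this example, not part of any bridge UAPI. Obtaining the snapshot and issuing the actual deletions (for instance over netlink or with `bridge fdb del`) are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical snapshot record built by a user-space monitor. */
struct fdb_snapshot_entry {
	uint8_t mac[6];
	uint64_t last_seen;	/* assumed timestamp, larger means more recent */
};

/* qsort() comparator: oldest entries first. */
static int by_last_seen(const void *a, const void *b)
{
	const struct fdb_snapshot_entry *x = a;
	const struct fdb_snapshot_entry *y = b;

	return (x->last_seen > y->last_seen) - (x->last_seen < y->last_seen);
}

/*
 * If the snapshot holds more than 'limit' entries, sort it oldest-first and
 * return how many entries at the front of the array the caller should delete
 * to get back to the limit. Returns 0 (and leaves the array untouched) when
 * the count is already within the limit.
 */
static size_t pick_victims(struct fdb_snapshot_entry *e, size_t n, size_t limit)
{
	if (n <= limit)
		return 0;
	qsort(e, n, sizeof(*e), by_last_seen);
	return n - limit;
}
```

Oldest-first is only one possible eviction policy, chosen here for simplicity. The sketch also does nothing about the concern raised above: the monitor still has to get scheduled in time while the kernel keeps allocating entries from softirq context.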
