Nikolay Aleksandrov
2023-Aug-19 09:25 UTC
[Bridge] [PATCH] net: bridge: Fix refcnt issues in dev_ioctl
Hi Ziqi, On 8/19/23 11:10, Ziqi Zhao wrote:> In the bug reported by Syzbot, certain bridge devices would have a > leaked reference created by race conditions in dev_ioctl, specifically, > under SIOCBRADDIF or SIOCBRDELIF operations. The reference leak wouldHow would it leak a reference, could you elaborate? The reference is always taken and always released after the call.> be shown in the periodic unregister_netdevice call, which throws a > warning and cause Syzbot to report a crash. Upon inspection of theIf you reproduced it, is the device later removed or is it really stuck?> logic in dev_ioctl, it seems the reference was introduced to ensure > proper access to the bridge device after rtnl_unlock. and the latter > function is necessary to maintain the following lock order in any > bridge related ioctl calls: > > 1) br_ioctl_mutex => 2) rtnl_lock > > Conceptually, though, br_ioctl_mutex could be considered more specific > than rtnl_lock given their usages, hence swapping their order would be > a reasonable proposal. This patch changes all related call sites to > maintain the reversed order of the two locks: > > 1) rtnl_lock => 2) br_ioctl_mutex > > By doing so, the extra reference introduced in dev_ioctl is no longer > needed, and hence the reference leak bug is now resolved.IIRC there was no bug, it was a false-positive. The reference is held a bit longer but then released, so the device is deleted later. I might be remembering wrong, but I think I briefly looked into this when it was reported. If that's not the case I'd be interested to see a new report/trace because the bug might be somewhere else.> > Reported-by: syzbot+881d65229ca4f9ae8c84 at syzkaller.appspotmail.com > Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") > Signed-off-by: Ziqi Zhao <astrajoan at yahoo.com> > --- > net/bridge/br_ioctl.c | 4 ---- > net/core/dev_ioctl.c | 8 +------- > net/socket.c | 2 ++ > 3 files changed, 3 insertions(+), 11 deletions(-) >Thanks, Nik
Ziqi Zhao
2023-Aug-19 22:50 UTC
[Bridge] [PATCH] net: bridge: Fix refcnt issues in dev_ioctl
On Sat, Aug 19, 2023 at 12:25:15PM +0300, Nikolay Aleksandrov wrote: Hi Nik, Thank you so much for reviewing the patch and getting back to me!> IIRC there was no bug, it was a false-positive. The reference is held a bit > longer but then released, so the device is deleted later.> If you reproduced it, is the device later removed or is it really stuck?I ran the reproducer again without the patch and it seems you are correct. It was trying to create a very short-lived bridge, then delete it immediately in the next call. The device in question "wpan4" never showed up when I polled with `ip link` in the VM, so I'd say it did not get stuck.> How would it leak a reference, could you elaborate? > The reference is always taken and always released after the call.This was where I got a bit confused too. The system had a timeout of 140 seconds for the unregister_netdevice check. If the bridge in question was created and deleted repeatedly, the warning would indeed not be an actual reference leak. But how could its reference show up after 140 seconds if the bridge's creation and deletion were all within a couple of milliseconds? So I let the system run for a bit longer with the reproducer, and after ~200 seconds, the kernel crashed and complained that some tasks had been waiting for too long (more than 143 seconds) trying to get hold of the br_ioctl_mutex. This was also quite strange to me, since on the surface it definitely looked like a deadlock, but the strict locking order as I described previously should prevent any deadlocks from happening. Anyways, I decided to test switching up the lock order, since both the refcnt warning and the task stall seemed closely related to the above mentioned locks. When I ran the reproducer again after the patch, both the warning and the stall issue went away. So I guess the patch is still relevant in preventing bugs in some extreme cases -- although the scenario created by the reproducer would probably never happen in real usages? Please let me know whether you have any thoughts on how the above issues were triggered, and what other information I could gather to further demystify this bug. Thank you again for your help! Best regards, Ziqi
Seemingly Similar Threads
- [Bridge] [PATCH] net: bridge: Fix refcnt issues in dev_ioctl
- [Bridge] [PATCH] net: bridge: Fix refcnt issues in dev_ioctl
- [Bridge] [PATCH] net: bridge: Fix refcnt issues in dev_ioctl
- [Bridge] [syzbot] [net?] unregister_netdevice: waiting for DEV to become free (8)
- [Bridge] [PATCH] can: j1939: prevent deadlock by changing j1939_socks_lock to rwlock