Steven Rostedt
2022-Oct-27 19:55 UTC
[Bridge] [RFC][PATCH v2 19/31] timers: net: Use del_timer_shutdown() before freeing timer
On Thu, 27 Oct 2022 12:38:16 -0700 Guenter Roeck <linux at roeck-us.net> wrote:> On 10/27/22 12:27, Steven Rostedt wrote: > > On Thu, 27 Oct 2022 15:20:58 -0400 > > Steven Rostedt <rostedt at goodmis.org> wrote: > > > >>> (many more of those) > >>> ... > >>> [ 16.329989] timer_fixup_free+0x40/0x54 > >> > >> Ah, I see the issue here. Looks like the timer_fixup_free() is calling > >> itself and crashing. > >> > >> Let me take a look into that. I didn't touch the fixup code, and there > >> could be an assumption there that it's behaving with the old approach. > > > > Can you add this and see if it makes this issue go away? > > > > Yes, that fixes the crash. However, it still reports > > [ 12.235054] ------------[ cut here ]------------ > [ 12.235240] ODEBUG: free active (active state 0) object type: timer_list hint: tcp_write_timer+0x0/0x190 > [ 12.237331] WARNING: CPU: 0 PID: 310 at lib/debugobjects.c:502 debug_print_object+0xb8/0x100 > ... > [ 12.255251] Call trace: > [ 12.255305] debug_print_object+0xb8/0x100 > [ 12.255385] __debug_check_no_obj_freed+0x1d0/0x25c > [ 12.255474] debug_check_no_obj_freed+0x20/0x90 > [ 12.255555] slab_free_freelist_hook.constprop.0+0xac/0x1b0 > [ 12.255650] kmem_cache_free+0x1ac/0x500 > [ 12.255728] __sk_destruct+0x140/0x2a0 > [ 12.255805] sk_destruct+0x54/0x64 > [ 12.255877] __sk_free+0x74/0x120 > [ 12.255944] sk_free+0x64/0x8c > [ 12.256009] tcp_close+0x94/0xc0 > [ 12.256076] inet_release+0x50/0xb0 > [ 12.256145] __sock_release+0x44/0xbc > [ 12.256219] sock_close+0x18/0x30 > [ 12.256292] __fput+0x84/0x270 > [ 12.256361] ____fput+0x10/0x20 > [ 12.256426] task_work_run+0x88/0xf0 > [ 12.256499] do_exit+0x334/0xafc > [ 12.256566] do_group_exit+0x34/0x90 > [ 12.256634] __arm64_sys_exit_group+0x18/0x20 > [ 12.256713] invoke_syscall+0x48/0x114 > [ 12.256789] el0_svc_common.constprop.0+0x60/0x11c > [ 12.256874] do_el0_svc+0x30/0xd0 > [ 12.256943] el0_svc+0x48/0xc0 > [ 12.257008] el0t_64_sync_handler+0xbc/0x13c > [ 12.257086] el0t_64_sync+0x18c/0x190 > > Is that a real problem or a false positive ? I didn't see that > without your patch series (which of course might be the whole point > of the series). >I think this is indeed an issue, and I'm replying to the net patch as it has the necessary folks Cc'd. The ipv4 tcp code has: void tcp_init_xmit_timers(struct sock *sk) { inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer, &tcp_keepalive_timer); And from the above back trace: tcp_close() where I'm assuming that tcp_disconnect() or tcp_done() was called that both calls: tcp_clear_xmit_timers(sk); That calls: inet_csk_clear_xmit_timers(sk); That has: void inet_csk_clear_xmit_timers(struct sock *sk) { struct inet_connection_sock *icsk = inet_csk(sk); icsk->icsk_pending = icsk->icsk_ack.pending = 0; sk_stop_timer(sk, &icsk->icsk_retransmit_timer); sk_stop_timer(sk, &icsk->icsk_delack_timer); sk_stop_timer(sk, &sk->sk_timer); } Where: void sk_stop_timer(struct sock *sk, struct timer_list* timer) { if (del_timer(timer)) __sock_put(sk); } Hence, this is a case where we have timers that have been disabled with only del_timer() before the timers are freed. I think we need to update this code to squeeze in a del_timer_shutdown() to make sure that the timers are never restarted. There is a sk_stop_timer_sync() that I changed to use del_timer_shutdown() but that's only used in one file: net/mptcp/pm_netlink.c -- Steve
Linus Torvalds
2022-Oct-27 20:15 UTC
[Bridge] [RFC][PATCH v2 19/31] timers: net: Use del_timer_shutdown() before freeing timer
On Thu, Oct 27, 2022 at 12:55 PM Steven Rostedt <rostedt at goodmis.org> wrote:> > I think we need to update this code to squeeze in a del_timer_shutdown() to > make sure that the timers are never restarted.So the reason the networking code does this is that it can't just do the old 'sync()' thing, the timers are deleted in contexts where that isn't valid. Which is also afaik why the networking code does that whole "timer implies a refcount to the socket" and then does the if (del_timer(timer)) sock_put() thing (ie if the del_timer failed - possibly because it was already running - you leave the refcount alone). So the networking code cannot do the del_timer_shutdown() for the same reason it cannot do the del_timer_sync(): it can't afford to wait for the timer to stop running. I suspect it needs something like a new "del_timer_shutdown_async()" that isn't synchronous, but does that - acts as del_timer in that it doesn't wait, and returns a success if it could just remove the pending case - does that "mark timer for shutdown" in that success case or something similar. But the networking people will know better. Linus