Steven Rostedt
2022-Oct-27 21:07 UTC
[Bridge] [RFC][PATCH v2 19/31] timers: net: Use del_timer_shutdown() before freeing timer
On Thu, 27 Oct 2022 13:48:54 -0700 Linus Torvalds <torvalds at linux-foundation.org> wrote:> On Thu, Oct 27, 2022 at 1:34 PM Steven Rostedt <rostedt at goodmis.org> wrote: > > > > What about del_timer_try_shutdown(), that if it removes the timer, it sets > > the function to NULL (making it equivalent to a successful shutdown), > > otherwise it does nothing. Allowing the the timer to be rearmed. > > Sounds sane to me and should work, but as mentioned, I think the > networking people need to say "yeah" too. > > And maybe that function can also disallow any future re-arming even > for the case where the timer couldn't be actively removed.Well, I think this current use case will break if we prevent the timer from being rearmed or run again if it's not found. But as you said, the networking folks need to confirm or deny it. The fact that it does the sock_put() when it removes the timer makes me think that it can be called again, and we shouldn't prevent that from happening. The debug code will let us know too, as it only "frees" it for freeing if it deactivated the timer and shut it down.> > So any *currently* active timer wouldn't be waited for (either because > locking may make that a deadlock situation, or simply due to > performance issues), but at least it would guarantee that no new timer > activations can happen. > > Because I do like the whole notion of "timer has been shutdown and > cannot be used as a timer any more without re-initializing it" being a > real state - even for a timer that may be "currently in flight". > > So this all sounds very worthwhile to me, but I'm not surprised that > we have code that then knows about all the subtleties of "del_timer() > might still have a running timer" and actually take advantage of it > (where "advantage" is likely more of a "deal with the complexities" > rather than anything really positive ;)Good to hear. This has been a thorn in our side as we keep hitting these crashes in the timer code that look like a timer was freed before it triggered.> > And those existing subtle users might want particular semantics to at > least make said complexities easier. >Yeah, as someone told me recently, "If you let them play long enough without setting out the rules, they will take advantage of everything and it will be extremely hard to get them back in order". -- Steve
Steven Rostedt
2022-Oct-27 21:15 UTC
[Bridge] [RFC][PATCH v2 19/31] timers: net: Use del_timer_shutdown() before freeing timer
On Thu, 27 Oct 2022 17:07:20 -0400 Steven Rostedt <rostedt at goodmis.org> wrote:> > And maybe that function can also disallow any future re-arming even > > for the case where the timer couldn't be actively removed.The naming of the functions will depend on this. If the async version always shuts down the timer, then we should have the interface be: del_timer_shutdown() <- async del_timer_shutdown_sync <- sync As it would match the del_timer() and del_timer_sync() semantics. If shutdown only happens if the timer is removed, then I believe the current approach of del_timer_shutdown() being synchronous and del_timer_try_shutdown() being async is the way to go, as it follows more the semantics of mutex_lock() and mutex_trylock(). -- Steve
Steven Rostedt
2022-Oct-27 22:35 UTC
[Bridge] [RFC][PATCH v2 19/31] timers: net: Use del_timer_shutdown() before freeing timer
On Thu, 27 Oct 2022 17:07:20 -0400 Steven Rostedt <rostedt at goodmis.org> wrote:> Well, I think this current use case will break if we prevent the timer from > being rearmed or run again if it's not found. But as you said, the > networking folks need to confirm or deny it. > > The fact that it does the sock_put() when it removes the timer makes me > think that it can be called again, and we shouldn't prevent that from > happening. > > The debug code will let us know too, as it only "frees" it for freeing if > it deactivated the timer and shut it down.I think we have our answer from Guenter's report: Linux version 6.1.0-rc2-00138-gced58c742836 (groeck at jupiter) (aarch64-linux-gcc (GCC) 11.3.0, GNU ld (GNU Binutils) 2.39) #1 SMP PREEMPT Thu Oct 27 14:53:17 PDT 2022 [ 17.258727] ------------[ cut here ]------------ [ 17.259079] ODEBUG: free active (active state 0) object type: timer_list hint: tcp_write_timer+0x0/0x190 [ 17.259723] WARNING: CPU: 0 PID: 309 at lib/debugobjects.c:502 debug_print_object+0xb8/0x100 [ 17.259951] Modules linked in: [ 17.260249] CPU: 0 PID: 309 Comm: telnet Tainted: G N 6.1.0-rc2-00138-gced58c742836 #1 [ 17.260518] Hardware name: linux,dummy-virt (DT) [ 17.260779] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 17.260967] pc : debug_print_object+0xb8/0x100 [ 17.261096] lr : debug_print_object+0xb8/0x100 [ 17.261223] sp : ffff8000086539e0 [ 17.261324] x29: ffff8000086539e0 x28: 0000000000000004 x27: ffff0d2ac2168000 [ 17.261574] x26: 0000000000000000 x25: ffffa241e2b9de18 x24: ffffa241e4f8fcd8 [ 17.261772] x23: ffffa241e336b370 x22: ffffa241e2b9de18 x21: ffff0d2ac20c5710 [ 17.261967] x20: ffffa241e4ea2568 x19: ffffa241e3ea3c00 x18: 00000000ffffffff [ 17.262161] x17: 6c6973742068696e x16: 3a2074696d65725f x15: 6563742074797065 [ 17.262375] x14: 65203029206f626a x13: ffffa241e3ec7640 x12: 0000000000000d50 [ 17.262591] x11: 0000000000000470 x10: ffffa241e3f1f640 x9 : ffffa241e3ec7640 [ 17.262821] x8 : 00000000ffffefff x7 : ffffa241e3f1f640 x6 : 0000000000000000 [ 17.263028] x5 : ffff0d2adfebba68 x4 : 0000000000000000 x3 : 0000000000000027 [ 17.263235] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0d2ac658b340 [ 17.263528] Call trace: [ 17.263646] debug_print_object+0xb8/0x100 [ 17.263795] __debug_check_no_obj_freed+0x1d0/0x25c [ 17.263927] debug_check_no_obj_freed+0x20/0x90 [ 17.264051] slab_free_freelist_hook.constprop.0+0xac/0x1b0 [ 17.264197] kmem_cache_free+0x1ac/0x500 [ 17.264311] __sk_destruct+0x140/0x2a0 [ 17.264425] sk_destruct+0x54/0x64 [ 17.264531] __sk_free+0x74/0x120 [ 17.264636] sk_free+0x64/0x8c [ 17.264736] tcp_close+0x94/0xc0 [ 17.264840] inet_release+0x50/0xb0 [ 17.264949] __sock_release+0x44/0xbc [ 17.265061] sock_close+0x18/0x30 [ 17.265166] __fput+0x84/0x270 [ 17.265266] ____fput+0x10/0x20 [ 17.265366] task_work_run+0x88/0xf0 [ 17.265491] do_exit+0x334/0xafc [ 17.265596] do_group_exit+0x34/0x90 [ 17.265705] __arm64_sys_exit_group+0x18/0x20 [ 17.265826] invoke_syscall+0x48/0x114 [ 17.265941] el0_svc_common.constprop.0+0x60/0x11c [ 17.266070] do_el0_svc+0x30/0xd0 [ 17.266175] el0_svc+0x48/0xc0 [ 17.266276] el0t_64_sync_handler+0xbc/0x13c [ 17.266396] el0t_64_sync+0x18c/0x190 [ 17.266565] irq event stamp: 5192 [ 17.266676] hardirqs last enabled at (5191): [<ffffa241e1926a18>] __up_console_sem+0x78/0x84 [ 17.266903] hardirqs last disabled at (5192): [<ffffa241e2b4d504>] el1_dbg+0x24/0x90 [ 17.267093] softirqs last enabled at (5170): [<ffffa241e181050c>] __do_softirq+0x46c/0x5d8 [ 17.267305] softirqs last disabled at (5163): [<ffffa241e1816750>] ____do_softirq+0x10/0x20 [ 17.267506] ---[ end trace 0000000000000000 ]--- [ 17.275715] ------------[ cut here ]------------ I'll go modify that code to make it shutdown even if it returns zero. I thinks this means we'll need to change the name to: del_timer_shutdown() del_timer_shutdown_sync() But I want to confirm this first. -- Steve