Hi again, On 01/07/2019 20:31, Martin Weinelt wrote:> Hi Nik, > > On 7/1/19 7:03 PM, Nikolay Aleksandrov wrote: >> Hi Martin, >> >> On 01/07/2019 19:53, Martin Weinelt wrote: >>> Hi Nik, >>> >>> more info below. >>> >>> On 6/29/19 3:11 PM, nikolay at cumulusnetworks.com wrote: >>>> On 29 June 2019 14:54:44 EEST, Martin Weinelt <martin at linuxlounge.net> wrote: >>>>> Hello, >>>>> >>>>> we've recently been experiencing memory leaks on our Linux-based >>>>> routers, >>>>> at least as far back as v4.19.16. >>>>> >>>>> After rebuilding with KASAN it found a use-after-free in >>>>> br_multicast_rcv which I could reproduce on v5.2.0-rc6. >>>>> >>>>> Please find the KASAN report below, I'm anot sure what else to provide >>>>> so >>>>> feel free to ask. >>>>> >>>>> Best, >>>>> Martin >>>>> >>>>> >>>> >>>> Hi Martin, >>>> I'll look into this, are there any specific steps to reproduce it? >>>> >>>> Thanks, >>>> Nik >>>>> >>> Each server is a KVM Guest and has 18 bridges with the same master/slave >>> relationships: >>> >>> bridge -> batman-adv -> {l2 tunnel, virtio device} >>> >>> Linus L?ssing from the batman-adv asked me to apply this patch to help >>> debugging. >>> >>> v5.2-rc6-170-g728254541ebc with this patch yielded the following KASAN >>> report, not sure if the additional information at the end is a result of >>> the added patch though. >>> >>> Best, >>> Martin >>> >> >> I see a couple of issues that can cause out-of-bounds accesses in br_multicast.c >> more specifically there're pskb_may_pull calls and accesses to stale skb pointers. >> I've had these on my "to fix" list for some time now, will prepare, test the fixes and >> send them for review. In a few minutes I'll send a test patch for you. >> That being said, I thought you said you've been experiencing memory leaks, but below >> reports are for out-of-bounds accesses, could you please clarify if you were >> speaking about these or is there another issue as well ? >> If you're experiencing memory leaks, are you sure they're related to the bridge ? >> You could try kmemleak for those. >> >> Thank you, >> Nik >> > > we had been experiencing memory leaks on v4.19.37, thats why we started to turn on > KASAN and kmemleak in the first place. This is when we found this use-after-free. > > The memory leak exists, and is a separate issue. Apparently kmemleak does not work, > I suspect the early log size is too small > > root at gw02:~# echo scan > /sys/kernel/debug/kmemleak -bash: echo: write error: Device or resource busy > > CONFIG_HAVE_DEBUG_KMEMLEAK=y > CONFIG_DEBUG_KMEMLEAK=y > CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=400 > # CONFIG_DEBUG_KMEMLEAK_TEST is not set > # CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set > CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y > > I'll increase the early log size with the next build to try and get more information > on the memory leak, I'll open a separate thread for that then. > > Thanks, > Martin >I see, thanks for clarifying this. So on the KASAN could you please try the attached patch ? Also could you please run the br_multicast_rcv+xxx addresses through linux/scripts/faddr2line for your kernel/bridge: usage: faddr2line [--list] <object file> <func+offset> <func+offset>... Thanks, Nik -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-net-bridge-mcast-fix-possible-uses-of-stale-pointers.patch Type: text/x-patch Size: 3443 bytes Desc: not available URL: <http://lists.linuxfoundation.org/pipermail/bridge/attachments/20190701/848a5c95/attachment.bin>
Hi again, On 7/1/19 7:37 PM, Nikolay Aleksandrov wrote:> I see, thanks for clarifying this. So on the KASAN could you please try the attached patch ? > Also could you please run the br_multicast_rcv+xxx addresses through > linux/scripts/faddr2line for your kernel/bridge: > usage: faddr2line [--list] <object file> <func+offset> <func+offset>... > > Thanks, > Nik >back with a new report. This is 5.2.0-rc7 + your patch. Best, Martin $ ./faddr2line /usr/lib/debug/lib/modules/5.2.0-rc7+/kernel/net/bridge/bridge.ko br_multicast_rcv+0x4d0/0x4b00 br_multicast_rcv+0x4d0/0x4b00: __skb_header_pointer at /home/hexa/git/linux/./include/linux/skbuff.h:3476 (inlined by) skb_header_pointer at /home/hexa/git/linux/./include/linux/skbuff.h:3486 (inlined by) br_ip6_multicast_mld2_report at /home/hexa/git/linux/net/bridge/br_multicast.c:998 (inlined by) br_multicast_ipv6_rcv at /home/hexa/git/linux/net/bridge/br_multicast.c:1694 (inlined by) br_multicast_rcv at /home/hexa/git/linux/net/bridge/br_multicast.c:1729 [ 329.723036] =================================================================[ 329.732244] BUG: KASAN: stack-out-of-bounds in skb_copy_bits+0x33e/0x730 [ 329.738974] Write of size 8 at addr ffff888050f09860 by task swapper/4/0 [ 329.745754] [ 329.749528] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G OE 5.2.0-rc7+ #2 [ 329.756304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 329.764062] Call Trace: [ 329.768281] <IRQ> [ 329.772037] dump_stack+0x71/0xab [ 329.776015] print_address_description+0x6a/0x280 [ 329.780840] ? skb_copy_bits+0x33e/0x730 [ 329.784817] __kasan_report+0x152/0x1aa [ 329.788623] ? skb_copy_bits+0x33e/0x730 [ 329.792398] ? skb_copy_bits+0x33e/0x730 [ 329.796231] kasan_report+0xe/0x20 [ 329.800250] memcpy+0x34/0x50 [ 329.803716] skb_copy_bits+0x33e/0x730 [ 329.807736] br_multicast_rcv+0x4d0/0x4b00 [bridge] [ 329.811579] ? netif_receive_skb_internal+0x84/0x1a0 [ 329.815197] ? br_multicast_disable_port+0x150/0x150 [bridge] [ 329.819164] ? netif_receive_skb+0x1b/0x1e0 [ 329.823374] ? br_pass_frame_up+0x25b/0x3a0 [bridge] [ 329.828084] ? br_handle_local_finish+0x20/0x20 [bridge] [ 329.832960] ? br_fdb_update+0x10e/0x6e0 [bridge] [ 329.837599] ? br_handle_frame_finish+0x3c6/0x11d0 [bridge] [ 329.843090] br_handle_frame_finish+0x3c6/0x11d0 [bridge] [ 329.848091] ? br_pass_frame_up+0x3a0/0x3a0 [bridge] [ 329.853063] ? _raw_write_trylock+0x100/0x100 [ 329.857660] ? update_load_avg+0x1c4/0x1890 [ 329.861863] ? virtnet_probe+0x1c80/0x1c80 [virtio_net] [ 329.866355] br_handle_frame+0x731/0xd90 [bridge] [ 329.870343] ? rcu_irq_exit+0x72/0x1c0 [ 329.873708] ? br_handle_frame_finish+0x11d0/0x11d0 [bridge] [ 329.878333] ? do_IRQ+0x71/0x160 [ 329.881667] ? __update_load_avg_cfs_rq+0x2aa/0x980 [ 329.885859] ? common_interrupt+0xa/0xf [ 329.889588] ? __update_load_avg_cfs_rq+0x2aa/0x980 [ 329.894482] __netif_receive_skb_core+0xced/0x2d70 [ 329.900915] ? napi_complete_done+0x10/0x360 [ 329.905743] ? virtqueue_get_buf_ctx+0x271/0x1130 [virtio_ring] [ 329.909920] ? do_xdp_generic+0x20/0x20 [ 329.912854] ? virtqueue_napi_complete+0x39/0x70 [virtio_net] [ 329.916242] ? virtnet_poll+0x94d/0xc78 [virtio_net] [ 329.919606] ? receive_buf+0x5120/0x5120 [virtio_net] [ 329.924068] ? __netif_receive_skb_one_core+0x97/0x1d0 [ 329.929248] ? account_entity_enqueue+0x340/0x4c0 [ 329.933515] __netif_receive_skb_one_core+0x97/0x1d0 [ 329.937323] ? __netif_receive_skb_core+0x2d70/0x2d70 [ 329.941076] ? _raw_write_trylock+0x100/0x100 [ 329.944515] process_backlog+0x19c/0x650 [ 329.947618] ? update_cfs_group+0x10b/0x380 [ 329.950863] net_rx_action+0x71e/0xbc0 [ 329.953899] ? napi_complete_done+0x360/0x360 [ 329.957240] ? handle_irq_event_percpu+0xeb/0x140 [ 329.960485] ? _raw_spin_lock+0x7a/0xd0 [ 329.962980] ? _raw_write_trylock+0x100/0x100 [ 329.965598] __do_softirq+0x1db/0x5f9 [ 329.968022] irq_exit+0x123/0x150 [ 329.970338] do_IRQ+0x71/0x160 [ 329.972599] common_interrupt+0xf/0xf [ 329.975034] </IRQ> [ 329.977167] RIP: 0010:native_safe_halt+0xe/0x10 [ 329.979783] Code: 09 f9 fe 48 8b 04 24 e9 12 ff ff ff e9 07 00 00 00 0f 00 2d d4 60 52 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d c4 60 52 00 fb f4 <c3> 90 66 66 66 66 90 41 56 41 55 41 54 55 53 e8 7e 05 ba fe 65 44 [ 329.987611] RSP: 0018:ffff888050347d98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd1 [ 329.991311] RAX: ffffffffa3d1ad00 RBX: 0000000000000004 RCX: ffffffffa28bbbd6 [ 329.994903] RDX: 1ffff1100a05c5b8 RSI: 0000000000000004 RDI: ffff888050f33f38 [ 329.998432] RBP: ffffed100a05c5b8 R08: ffffed100a1e67e8 R09: ffffed100a1e67e7 [ 330.002025] R10: ffff888050f33f3b R11: ffffed100a1e67e8 R12: ffffffffa4c604c0 [ 330.005736] R13: 0000000000000004 R14: 0000000000000000 R15: ffff8880502e2dc0 [ 330.010074] ? ldsem_down_write+0x590/0x590 [ 330.012905] ? rcu_idle_enter+0x106/0x150 [ 330.016130] ? tsc_verify_tsc_adjust+0x96/0x2a0 [ 330.019068] default_idle+0x1f/0x280 [ 330.021710] do_idle+0x2d8/0x3e0 [ 330.024246] ? arch_cpu_idle_exit+0x40/0x40 [ 330.027089] cpu_startup_entry+0x19/0x20 [ 330.030120] start_secondary+0x316/0x3f0 [ 330.032770] ? set_cpu_sibling_map+0x19c0/0x19c0 [ 330.035625] secondary_startup_64+0xa4/0xb0 [ 330.038485] [ 330.040559] The buggy address belongs to the page: [ 330.043525] page:ffffea000143c240 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 [ 330.047223] flags: 0xffffc000001000(reserved) [ 330.050063] raw: 00ffffc000001000 ffffea000143c248 ffffea000143c248 0000000000000000 [ 330.053806] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 [ 330.057765] page dumped because: kasan: bad access detected [ 330.060927] [ 330.063115] Memory state around the buggy address: [ 330.066036] ffff888050f09700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 330.069509] ffff888050f09780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 330.073066] >ffff888050f09800: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 02 f4 f4 f4 [ 330.076586] ^ [ 330.079861] ffff888050f09880: f2 f2 f2 f2 04 f4 f4 f4 f2 f2 f2 f2 00 00 04 f4 [ 330.083397] ffff888050f09900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 330.086923] =================================================================[ 330.090465] Disabling lock debugging due to kernel taint