On Mon, Jan 28, 2019 at 09:28:36AM +0100, Dmitry Vyukov wrote:> > > Weird, this is the kfree() on the error path of br_multicast_new_group() > > when rhashtable_lookup_insert_fast() fails, which means the entry should > > not be linked in the rhashtable or the hlist. > > All other frees are via kfree_rcu. I'll keep looking.. > > Humm.... +rhashtable.c maintianers > > The code in br_multicast_new_group is effectively: > > mp = kzalloc(sizeof(*mp), GFP_ATOMIC); > err = rhashtable_lookup_insert_fast(&br->mdb_hash_tbl, &mp->rhnode, > br_mdb_rht_params); > if (err) > kfree(mp); > > So it looks like rhashtable_lookup_insert_fast both returned an error > and left the new object linked in the table. Since it happened only > once, it may have something to do with concurrent resizing/shrinking.I've looked through the rhashtable code in question and everything looks OK. So I suspect some earlier corruption has occured to cause this anomalous result. Is it possible to collect earlier alloc/free stack traces on the object in question? Cheers, -- Email: Herbert Xu <herbert at gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Wed, Feb 20, 2019 at 11:23 AM Herbert Xu <herbert at gondor.apana.org.au> wrote:> > On Mon, Jan 28, 2019 at 09:28:36AM +0100, Dmitry Vyukov wrote: > > > > > Weird, this is the kfree() on the error path of br_multicast_new_group() > > > when rhashtable_lookup_insert_fast() fails, which means the entry should > > > not be linked in the rhashtable or the hlist. > > > All other frees are via kfree_rcu. I'll keep looking.. > > > > Humm.... +rhashtable.c maintianers > > > > The code in br_multicast_new_group is effectively: > > > > mp = kzalloc(sizeof(*mp), GFP_ATOMIC); > > err = rhashtable_lookup_insert_fast(&br->mdb_hash_tbl, &mp->rhnode, > > br_mdb_rht_params); > > if (err) > > kfree(mp); > > > > So it looks like rhashtable_lookup_insert_fast both returned an error > > and left the new object linked in the table. Since it happened only > > once, it may have something to do with concurrent resizing/shrinking. > > I've looked through the rhashtable code in question and everything > looks OK. So I suspect some earlier corruption has occured to cause > this anomalous result.Taking into account that this still happened only once, I tend to write it off onto a previous silent memory corruption (we have dozens of known bugs that corrupt memory). So if several people already looked at it and don't see the root cause, it's probably time to stop spending time on this until we have more info. Although, there was also this one: https://groups.google.com/d/msg/syzkaller-bugs/QfCCSxdB1aM/y2cn9IZJCwAJ I have not checked if it can be the root cause of this report, but it points suspiciously close to this stack and when I looked at it, it the report looked legit.> Is it possible to collect earlier alloc/free stack traces on the object in question?You mean even before the alloc/free of the current incarnation this object? This looks challenging from memory consumption point of view. Also how many of them will we print in reports? Also the page can go through page_alloc and then tracking will be even more challenging. And in the end the object can be corrupted by an out-of-bounds write or a like. Heap object reuse quarantine should take care of the common case, and I don't see a reasonable way to handle all possible corner cases...
Hi Dmitry: On Thu, Feb 21, 2019 at 11:54:42AM +0100, Dmitry Vyukov wrote:> > Taking into account that this still happened only once, I tend to > write it off onto a previous silent memory corruption (we have dozens > of known bugs that corrupt memory). So if several people already > looked at it and don't see the root cause, it's probably time to stop > spending time on this until we have more info. > > Although, there was also this one: > https://groups.google.com/d/msg/syzkaller-bugs/QfCCSxdB1aM/y2cn9IZJCwAJ > I have not checked if it can be the root cause of this report, but it > points suspiciously close to this stack and when I looked at it, it > the report looked legit.Have you had any more reports of this kind coming from br_multicast? It looks like ommit 1515a63fc413f160d20574ab0894e7f1020c7be2 Author: Nikolay Aleksandrov <nikolay at cumulusnetworks.com> Date: Wed Apr 3 23:27:24 2019 +0300 net: bridge: always clear mcast matching struct on reports and leaves may have at least fixed the uninitialised value error. Thanks, -- Email: Herbert Xu <herbert at gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt