Nigel Metheringham
2001-Aug-06 10:43 UTC
Bug: 2.2.20pre/ext3 0.0.7a crash apparently in sys_close()
Apologies for a vague and wooly bug report, but I can't reproduce this on my test systems - I *can* reproduce it on some production ones like a flash.... but that seems to upset our operations guys :-(

I am getting a set of crashes on some boxes in the field, apparently related to high network traffic (this only occurs on boxes with ethernet connectivity back to the centre rather than the majority of boxes which have E1 connectivity back), and when the boxes are under network load. The boxes have several filesystems on a h/w RAID controller, all of which are ext3 except /boot. There should be *very* little disk traffic on these boxes in normal use, including small amounts of syslog.

The kernel is a 2.2.20pre - the couple of messages here are from 2.2.20pre8 - which also has FreeSWAN 1.91 and RAID stuff patched in (however on all of these boxes RAID is not in use since the h/w has onboard h/w RAID).

A kernel which has 0.0.7a ext3 dies in this situation.
A kernel which has 0.0.6b ext3 works without fault.

The death message is:-

    Unable to handle kernel NULL pointer dereference at virtual address 00000004
    current->tss.cr3 = 0ddf6000, %cr3 = 0ddf6000
    *pde = 00000000
    Entering kdb due to panic @ 0xc9124fe3
    eax = 0x00000006 ebx = 0x00000004 ecx = 0xcdf32000 edx = 0x00000000
    esi = 0x00000000 edi = 0xfffffff7 esp = 0xcdf32000 eip = 0xc0124fe3
    ebp = 0xbffffca0 ss = 0x00000000 cs = 0x00000010 eflags 0x00010246
    ds = 0x00000018 es = 0x00000018 origeax = 0xffffffff &regs 0xcdf33f80

[NB These boxes have Compaq Remote Insight consoles - so this message is retyped from a jpg of the console output :-( Nothing makes it into syslog]

backtrace and checks against System.map show this to be in sys_close()

A few other bombs have also been mostly in sys_close with a couple of others within schedule().

Nigel.
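[Editorial sketch] The check against System.map mentioned in the report amounts to finding the symbol with the highest address at or below the faulting eip. The following is a minimal sketch of that lookup, not the procedure actually used in the report. The helper names load_symbols and resolve are illustrative, and the addresses in SAMPLE_MAP are hypothetical placeholders chosen only so that the posted eip falls inside sys_close; real values come from the System.map built with the crashing kernel.

```python
import bisect


def load_symbols(lines):
    """Parse System.map-style lines ("address type name") into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) for the nearest symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # addr lies below the first known symbol
    base, name = symbols[i]
    return name, addr - base


# Hypothetical excerpt: these addresses are made up for illustration.
SAMPLE_MAP = """\
c0124f00 T sys_open
c0124fa0 T sys_close
c0125040 T sys_vhangup
""".splitlines()

symbols = load_symbols(SAMPLE_MAP)
name, offset = resolve(symbols, 0xc0124fe3)  # eip from the posted register dump
print(f"{name}+{offset:#x}")
```

With a real System.map, the same lookup would be run on a file read with open("System.map"); the map must match the exact kernel build, otherwise the resolved symbol is meaningless.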
Stephen C. Tweedie
2001-Aug-08 15:19 UTC
Re: Bug: 2.2.20pre/ext3 0.0.7a crash apparently in sys_close()
Hi,

On Mon, Aug 06, 2001 at 11:43:54AM +0100, Nigel Metheringham wrote:
> Apologies for a vague and wooly bug report, but I can't reproduce this
> on my test systems - I *can* reproduce it on some production ones like a
> flash.... but that seems to upset our operations guys :-(
>
> I am getting a set of crashes on some boxes in the field, apparently
> related to high network traffic (this only occurs on boxes with ethernet
> connectivity back to the centre rather than the majority of boxes which
> have E1 connectivity back), and when the boxes are under network load.
> The boxes have several filesystems on a h/w RAID controller, all of
> which are ext3 except /boot. There should be *very* little disk
> traffic on these boxes in normal use, including small amounts of syslog.
>
> The kernel is a 2.2.20pre - the couple of messages here are from
> 2.2.20pre8 - which also has FreeSWAN 1.91 and RAID stuff patched in
> (however on all of these boxes RAID is not in use since the h/w has
> onboard h/w RAID).
>
> A kernel which has 0.0.7a ext3 dies in this situation.
> A kernel which has 0.0.6b ext3 works without fault.

Is 7a the _only_ difference between a working and a non-working kernel?

Other than that, I'd probably need a bit more info than what's here to get much further with this. One thing that's always worth trying is to disable slab poisoning and see if things get better --- that's a piece of pure debugging which ext3 enables for its own benefit, but which often causes other buggy drivers to fall apart under load. The oops you posted doesn't have any obvious signs of slab debugging problems, though.

Cheers,
Stephen