Hi,
I've got serious trouble. I posted a while back about ext3 errors on
2.4.18; at the time I put them down to hard disk failure, but the
problems are occurring more and more often. Not all of our systems are
affected, but 3 have now shown this problem.
The hardware is essentially the same; the only difference is the disk
manufacturer, but we've now seen the problems on several brands of disk
(Maxtor, Seagate, IBM etc.), so I can't simply put this down to disk
failure (motherboard failure possibly, but on 3 different motherboards?).
I've included some of the kernel debug output below. There is a lot of
it, so I've only included the different types of error reported (with
the times when the problems started):
Thu 12/12/02 15:50:37.315 [KMSG:<2>EXT3-fs error (device ide0(3,10)):
ext3_free_blocks: Freeing blocks in system zones - Block = 128, count 1]
Fri 13/12/02 23:55:46.383 [KMSG:<4> <6>attempt to access beyond end of
device]
[KMSG:<6>16:05: rw=0, want=137058900, limit=39230698]
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_branches: Read
failure, inode=3567502, block=-1576348012]
[KMSG:<6>attempt to access beyond end of device]
[KMSG:<6>16:05: rw=0, want=1724901664, limit=39230698]
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_branches: Read
failure, inode=3567502, block=-642516409]
[KMSG:<6>attempt to access beyond end of device]
<snip>
Fri 13/12/02 23:55:46.411 [KMSG:<2>EXT3-fs error (device ide1(22,5)):
ext3_free_branches: Read failure, inode=3567502, block=1329885327]
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing
blocks not in datazone - block = 1874129395, count = 1]
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing
blocks not in datazone - block = 203477977, count = 1]
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing
blocks not in datazone - block = 2877124100, count = 1]
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing
blocks not in datazone - block = 103093662, count = 1]
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing
blocks not in datazone - block = 3719271906, count = 1]
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing
blocks not in datazone - block = 4274192639, count = 1]
<snip>
[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: bit
already cleared for block 9126180]
<snip>
Fri 13/12/02 23:56:00.242 [KMSG:<2>EXT3-fs error (device ide1(22,5)):
ext3_free_blocks: Freeing blocks not in datazone - block = 3305048842,
count = 1]
[KMSG:<0>Assertion failure in do_get_write_access() at
transaction.c:589: "handle->h_buffer_credits > 0"]
[KMSG:<4>invalid operand: 0000]
[KMSG:<4>CPU: 0]
[KMSG:<4>EIP: 0010:[<c0156d09>] Not tainted]
[KMSG:<4>EFLAGS: 00010286]
[KMSG:<4>eax: 00000063 ebx: cefb4ac0 ecx: c55ac3c0 edx: fffffffe]
[KMSG:<4>esi: c7083f00 edi: 00000002 ebp: c7083f00 esp: c0605c20]
[KMSG:<4>ds: 0018 es: 0018 ss: 0018]
[KMSG:<4>Process videoexe (pid: 3412, stackpage=c0605000)]
[KMSG:<4>Stack: c0232720 c02328e6 c0232700 0000024d c0232921 cce0f000
ccd24a60 cefb4ac0 ]
[KMSG:<4> cce0f094 cce0f094 00000000 00000000 cce0f000 cc434760
c01570d8 ccd24a60 ]
[KMSG:<4> cefb4ac0 00000000 00000000 c7083f00 ccd24a60 c39cf460
c0150798 ccd24a60 ]
[KMSG:<4>Call Trace: [<c01570d8>] [<c0150798>]
[<c01570e0>] [<c01508fc>]
[<c0150b98>] ]
[KMSG:<4> [<c0150a68>] [<c0150a68>] [<c0150a68>]
[<c0150c79>]
[<c0150f0b>] [<c015763c>] ]
[KMSG:<4> [<c01516c2>] [<c0151727>] [<c0121b5f>]
[<c011feae>]
[<c011ff0d>] [<c0140567>] ]
[KMSG:<4> [<c01518d1>] [<c01406bc>] [<c012cc3d>]
[<c013796a>]
[<c012dae7>] [<c012de3a>] ]
[KMSG:<4> [<c0106b87>] ]
[KMSG:<4>]
[KMSG:<4>Code: 0f 0b 83 c4 14 8b 54 24 28 8b 42 04 48 8b 4c 24 28 89 41
04 ]
^^^
That's the final nail in the coffin: the process then locks solid (but
still has threads running, which then run out of memory, and total chaos
ensues), so the box has to be powered off and on. The disks fsck when the
machine comes back up, and there are no reports of any hardware I/O errors.
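
For what it's worth, the "want=" figures in the "attempt to access beyond
end of device" lines above can be reproduced from the negative block
numbers in the ext3_free_branches lines that follow them. This is only a
rough sketch: it assumes 4 KiB filesystem blocks, 512-byte sectors, 32-bit
sector arithmetic and "want=" reported in 1 KiB units, none of which the
logs themselves state.

```python
# Sketch only: block size, sector size and 32-bit wrap-around are
# assumptions, not something the logs state.
U32 = 0xFFFFFFFF
SECTORS_PER_BLOCK = 8  # assumed: 4096-byte block / 512-byte sector


def as_unsigned32(n):
    """Reinterpret a signed 32-bit value as unsigned."""
    return n & U32


def predicted_want(block):
    """Predict the 'want=' value for a read of the given block number."""
    # Start sector of the block, truncated to 32 bits (assumed wrap).
    start_sector = (as_unsigned32(block) * SECTORS_PER_BLOCK) & U32
    # One sector past the end of the block.
    end_sector = (start_sector + SECTORS_PER_BLOCK) & U32
    # Convert 512-byte sectors to 1 KiB units.
    return end_sector >> 1


# (block from the ext3_free_branches line, want= from the line before it)
logged = [(-1576348012, 137058900), (-642516409, 1724901664)]

for block, want in logged:
    print(block, as_unsigned32(block), predicted_want(block), want)
```

Under those assumptions both predicted values agree with the logged
"want=" figures, which would be consistent with the filesystem following
garbage block pointers rather than the disk failing the reads.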
The profile of the machine is that it's doing lots of disk I/O, as it's
capturing video to disk. There are 3 partitions used, each roughly 40 GB
in size, and the problem is only occurring on one of them.
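
As a rough cross-check of the sizes: if "limit=39230698" is in 1 KiB units
and the filesystem uses 4 KiB blocks (both assumptions on my part), the
block numbers reported on ide1(22,5) can be compared against the number of
blocks the partition could actually hold:

```python
# Sketch only: 'limit=' in 1 KiB units and 4 KiB filesystem blocks are
# assumptions.
LIMIT_KIB = 39230698  # "limit=" value from the log
BLOCK_KIB = 4         # assumed filesystem block size in KiB

total_blocks = LIMIT_KIB // BLOCK_KIB
size_gb = LIMIT_KIB * 1024 / 1e9

# Block numbers reported for ide1(22,5) in the excerpt above.
blocks_from_log = [
    1329885327, 1874129395, 203477977, 2877124100, 103093662,
    3719271906, 4274192639, 9126180, 3305048842,
]

in_range = [b for b in blocks_from_log if b < total_blocks]
out_of_range = [b for b in blocks_from_log if b >= total_blocks]

print("partition size (GB):", round(size_gb, 1))
print("blocks on device:   ", total_blocks)
print("plausible blocks:   ", in_range)
print("impossible blocks:  ", out_of_range)
```

With those assumptions only the "bit already cleared" block (9126180)
falls inside the partition; every other block number is far beyond it.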
The only thing to note is that there was an issue where this partition
and another filled up and I had to make space (just by deleting files;
the maximum used space is now around 76%). Once I'd cleaned up the
filesystem, the system ran fine until the first error was reported at
15:50 on Thursday (as shown above), and then on Friday (last night) it
just went haywire.
The only other thing to note is that there was a panic on kswapd a
number of hours earlier - but I've seen these on other systems running
2.4.18 and they don't seem to cause any problems (I think).
As I've mentioned, I've seen the same behavior before on other systems.
The specs for all of them are:
Abit ST6 Motherboard with 1.2 GHz Celeron
2 x disks (varying sizes and makes)
128 MB RAM
AGP Graphics Card
Ethernet
Bt848 capture cards (2-3 depending on customer)
I'm really pulling my hair out - I don't know why they are doing this.
These are all on customer sites (they never go wrong in the office, and
each one that has gone bad has been in a different environment, i.e. warm,
cold, with no power spikes or anything reported). As you can imagine, we
are not flavor of the month at the moment, so I really need to come up
with a bullet-proof plan (one customer is on his second box, which did
the same as the first after 2 days - it ran in our office for 2 weeks with
no problems!).
I know I've probably not given enough info (sorry I can't get a better
trace of the panic) - but any help that anyone can give will really
really really be appreciated.
Thanks,
Glen