System: FreeBSD 7.1-STABLE i386 (revision 187025)

Panic message:

kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0xd2006ad0
fault code = supervisor write, page not present
instruction pointer = 0x20:0xc05623aa
stack pointer = 0x28:0xdd4f6c34
frame pointer = 0x28:0xdd4f6c40
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, def32 1, gran 1
processor eflags = resume, IOPL = 0
current process = 13 (swi4: clock)
trap number = 12
panic: page fault
KDB: stack backtrace:
db_trace_self_wrapper(c074bb2f,dd4f6b14,c05514af,c0749d10,c07b85e0,...) at 0xc0478466 = db_trace_self_wrapper+0x26
kdb_backtrace(c0749d10,c07b85e0,c073b02b,dd4f6b20,dd4f6b20,...) at 0xc057a639 = kdb_backtrace+0x29
panic(c073b02b,c0761cb4,c36104dc,1,1,...) at 0xc05514af = panic+0xaf
trap_fatal(c0761bb6,c,c3a89460,c3a8965c,c,...) at 0xc0705723 = trap_fatal+0x353
trap(dd4f6bf4) at 0xc07060ca = trap+0x10a
calltrap() at 0xc06f463b = calltrap+0x6
--- trap 0xc, eip = 0xc05623aa, esp = 0xdd4f6c34, ebp = 0xdd4f6c40 ---
callout_reset(c3a8552c,13,c0561940,c3a852b8,c3612690,...) at 0xc05623aa = callout_reset+0x14a
realitexpire(c3a852b8,2d6100,c3612690,1,dd4f6cbc,...) at 0xc0561ab6 = realitexpire+0x176
softclock(0,0,c0747617,4a1,0,...) at 0xc0562c25 = softclock+0x235
ithread_loop(c35e5a20,dd4f6d38,0,0,0,...) at 0xc053268b = ithread_loop+0x1cb
fork_exit(c05324c0,c35e5a20,dd4f6d38) at 0xc052eda1 = fork_exit+0xa1
fork_trampoline() at 0xc06f46b0 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xdd4f6d70, ebp = 0 ---

Some debugging:

(kgdb) bt
#0 doadump () at pcpu.h:196
#1 0xc05512b3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418
#2 0xc05514ff in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:574
#3 0xc0705723 in trap_fatal (frame=0xdd4f6bf4, eva=3523242704) at /usr/src/sys/i386/i386/trap.c:939
#4 0xc07060ca in trap (frame=0xdd4f6bf4) at /usr/src/sys/i386/i386/trap.c:320
#5 0xc06f463b in calltrap () at /usr/src/sys/i386/i386/exception.s:159
#6 0xc05623aa in callout_reset (c=0xc3a8552c, to_ticks=19, ftn=0xc0561940 <realitexpire>, arg=0xc3a852b8) at /usr/src/sys/kern/kern_timeout.c:471
#7 0xc0561ab6 in realitexpire (arg=0xc3a852b8) at /usr/src/sys/kern/kern_time.c:684
#8 0xc0562c25 in softclock (dummy=0x0) at /usr/src/sys/kern/kern_timeout.c:274
#9 0xc053268b in ithread_loop (arg=0xc35e5a20) at /usr/src/sys/kern/kern_intr.c:1088
#10 0xc052eda1 in fork_exit (callout=0xc05324c0 <ithread_loop>, arg=0xc35e5a20, frame=0xdd4f6d38) at /usr/src/sys/kern/kern_fork.c:804
#11 0xc06f46b0 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:264
(kgdb) fr 6
#6 0xc05623aa in callout_reset (c=0xc3a8552c, to_ticks=19, ftn=0xc0561940 <realitexpire>, arg=0xc3a852b8) at /usr/src/sys/kern/kern_timeout.c:471
471 /usr/src/sys/kern/kern_timeout.c: No such file or directory.
        in /usr/src/sys/kern/kern_timeout.c
(kgdb) p *c
$1 = {c_links = {sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0xd2006ad0}}, c_time = 2974104, c_arg = 0xc3a852b8, c_func = 0xc0561940 <realitexpire>, c_mtx = 0x0, c_flags = 22}
(kgdb) p c->c_links.tqe.tqe_prev
$2 = (struct callout **) 0xd2006ad0
(kgdb) p *c->c_links.tqe.tqe_prev
Cannot access memory at address 0xd2006ad0
(kgdb) p callwheel[c->c_time & callwheelmask]
$4 = {tqh_first = 0x0, tqh_last = 0xd2006ad0}

The code:

467         c->c_arg = arg;
468         c->c_flags |= (CALLOUT_ACTIVE | CALLOUT_PENDING);
469         c->c_func = ftn;
470         c->c_time = ticks + to_ticks;
471         TAILQ_INSERT_TAIL(&callwheel[c->c_time & callwheelmask],
472             c, c_links.tqe);

Additional info:
I recently added some new memory to this system.
The memory survived several passes of memtest86 before booting to FreeBSD.
It also survived one pass after the incident.
Still, I wouldn't exclude the possibility of it being bad.

Small analysis:
If this is not because of bad memory, then it probably means that a struct callout was deallocated somewhere earlier (possibly as part of a bigger object) but was not unregistered/removed from the callout mechanism. I guess it is quite hard to backtrack that now.
All I can say is that nothing "funny" was happening on the machine from the point of view of attaching/detaching any HW or loading/unloading modules or anything like that. Just "normal" work. So it could be something that is always "on", like the network stack or the ata subsystem, etc.

--
Andriy Gapon
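Editorial sketch, not part of the original post: the kgdb output above shows the bucket's tqh_last and the new element's tqe_prev both equal to 0xd2006ad0, which is also the fault virtual address. That is consistent with how TAILQ_INSERT_TAIL from sys/queue.h works: it copies tqh_last into the element's tqe_prev and then stores the element through *tqh_last, so a corrupted tqh_last turns into a stray write. The userland C fragment below illustrates the invariant for an empty queue. The names struct item, empty_head_is_sane and checked_insert_tail are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct callout; only the linkage matters. */
struct item {
	int id;
	TAILQ_ENTRY(item) link;
};
TAILQ_HEAD(item_list, item);

/*
 * For an empty tail queue, tqh_last must point at the head's own
 * tqh_first. A head with tqh_first == NULL and any other tqh_last
 * (as in the dump above) is corrupt. Non-empty heads are not
 * checked here.
 */
bool
empty_head_is_sane(struct item_list *head)
{
	return head->tqh_first != NULL || head->tqh_last == &head->tqh_first;
}

/*
 * Insert at the tail, but check the invariant first so that a bad
 * head fails an assertion instead of writing through a wild pointer.
 */
void
checked_insert_tail(struct item_list *head, struct item *it)
{
	assert(empty_head_is_sane(head));
	TAILQ_INSERT_TAIL(head, it, link);
}
```

With a head initialized by TAILQ_INIT, checked_insert_tail behaves like a plain TAILQ_INSERT_TAIL. With a head in the state shown by kgdb (tqh_first = 0x0, tqh_last pointing elsewhere), the assertion fires before the store that caused the page fault. This only detects the symptom; it says nothing about whether the cause is a stale callout or bad hardware.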
Andriy Gapon
2009-Jan-28 04:27 UTC
problem with "cold" hardware? [Was: panic in callout_reset: bad link in callwheel]
on 24/01/2009 13:00 Andriy Gapon said the following:
[snip]
> Additional info:
> I recently added some new memory to this system.
> The memory survived several passes of memtest86 before booting to
> FreeBSD. It also survived one pass after the incident.
> Still I wouldn't exclude a possibility of it being bad.

I think I have established that the crash was caused by a hardware issue.
I had another panic at a different place but with similar diagnostics: a bad pointer passed to a call. Fortunately, the second time the pointer was to a well-known long-lived object, so I was able to compare the bad pointer to the actual address. It turned out that a single bit was flipped.

Then I realized that in both cases I saw the panics after "very cold" boots, i.e. the system had been powered down for more than 1 hour before the boot.

So I ran memtest86 again, this time also after a long power-off, and it reported lots of errors. I restarted memtest86 10 minutes later and then it could not find any errors in any tests.

Previously I had heard about problems with hardware running hot, but not with it being "cold". I put the word in quotes because the system is in a room with normal room temperature.

Any guesses what hardware part might be acting up like this?

--
Andriy Gapon
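Editorial sketch, not part of the original post: the comparison described above (bad pointer versus the known address of a long-lived object) can be done by XORing the two values and checking that exactly one bit is set in the result. The function name single_flipped_bit is hypothetical, and 32-bit values are assumed because the system is i386.

```c
#include <stdint.h>

/*
 * Returns the index (0-31) of the one bit in which the two values
 * differ, or -1 if they are equal or differ in more than one bit.
 */
int
single_flipped_bit(uint32_t expected, uint32_t observed)
{
	uint32_t diff = expected ^ observed;
	int bit = 0;

	/* diff & (diff - 1) clears the lowest set bit; nonzero means >1 bit. */
	if (diff == 0 || (diff & (diff - 1)) != 0)
		return -1;
	/* Shift the single set bit down to find its position. */
	while ((diff >>= 1) != 0)
		bit++;
	return bit;
}
```

A non-negative result is a strong hint of a memory or bus error rather than a software bug, since stray software writes rarely change exactly one bit of a pointer. The actual addresses from the second panic are not given in the thread, so none are shown here.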
Andrew Snow
2009-Jan-28 11:22 UTC
problem with "cold" hardware? [Was: panic in callout_reset: bad link in callwheel]
Andriy Gapon wrote:
> Previously I heard about problems with hardware running hot, but not
> with it being "cold". I put the word in quotes, because the system is in
> a room with normal room temperature.
>
> Any guesses what hardware part might be acting up like this?

Power supply. Give all the capacitors a visual check. Or you may be drawing too much power from your rated supply.

- Andrew
Ulf Zimmermann
2009-Jan-28 12:55 UTC
problem with "cold" hardware? [Was: panic in callout_reset: bad link in callwheel]
On Thu, Jan 29, 2009 at 06:22:26AM +1100, Andrew Snow wrote:
> Andriy Gapon wrote:
> >Previously I heard about problems with hardware running hot, but not
> >with it being "cold". I put the word in quotes, because the system is in
> >a room with normal room temperature.
> >
> >Any guesses what hardware part might be acting up like this?
>
> Power supply. Give all the capacitors a visual check. Or you may be
> drawing too much power from your rated supply.

Another thing could be bad soldering of the memory slot. It might not have a full contact when at room temperature, but as it heats up by 10-20C inside the case it might expand and give full contact. This could apply to copper runs on the board, contact points from the board to the memory slot, contact from the slot to the memory.

--
Regards, Ulf.

---------------------------------------------------------------------
Ulf Zimmermann, 1525 Pacific Ave., Alameda, CA-94501, #: 510-865-0204
You can find my resume at: http://www.Alameda.net/~ulf/resume.html
Andriy Gapon
2009-Feb-02 03:36 UTC
problem with "cold" hardware? [Was: panic in callout_reset: bad link in callwheel]
on 28/01/2009 21:22 Andrew Snow said the following:
> Andriy Gapon wrote:
>> Previously I heard about problems with hardware running hot, but not
>> with it being "cold". I put the word in quotes, because the system is in
>> a room with normal room temperature.
>>
>> Any guesses what hardware part might be acting up like this?
>
> Power supply. Give all the capacitors a visual check. Or you may be
> drawing too much power from your rated supply.

Right on target.
I opened the PSU after replacing it. Visually it looks OK (to me); nevertheless, I have verified that the fault was in it.
Thank you and everybody who helped!

--
Andriy Gapon