I just got a panic followed by a reboot a few seconds after running "portsnap update". /var/log/messages shows the following:

Jul 7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
Jul 7 03:49:38 atom kernel: spin lock 0xffffffff80b3edc0 (sched lock 1) held by 0xffffff00017d8370 (tid 100054) too long
Jul 7 03:49:38 atom kernel: panic: spin lock held too long

/var/crash looks empty. This is a system running official 7.2-p1 binaries, since I am using freebsd-update to keep up with the patches (just updated to -p2 after this panic), running with very low load, mostly serving files to my home network over Samba and running a few irssi instances in a screen.

What do I need to do to catch more information if/when this happens again?

- Sincerely,
Dan Naumov

2009/7/7 Dan Naumov <dan.naumov@gmail.com>:
> I just got a panic followed by a reboot a few seconds after running
> "portsnap update". /var/log/messages shows the following:
>
> Jul 7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
> Jul 7 03:49:38 atom kernel: spin lock 0xffffffff80b3edc0 (sched lock
> 1) held by 0xffffff00017d8370 (tid 100054) too long
> Jul 7 03:49:38 atom kernel: panic: spin lock held too long

That's a known bug, affecting -CURRENT as well. The cpustop IPI is handled through an NMI, which means it can interrupt a CPU at any moment, even while it is holding a spinlock, violating one well-known FreeBSD rule. That means a CPU can stop itself while its thread is holding the sched lock spinlock and never release it (there is no way, short of something highly hackish, to fix that). In the meantime, hardclock() wants to schedule something else to run and gets stuck on the thread lock.

The ideal fix would involve not using an NMI for serving the cpustop, while having a cheap way (one that does not make the common path too expensive) to tell hardclock() to avoid scheduling while a cpustop is in flight.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein
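
[Editor's illustration] The failure described above can be sketched as a toy two-CPU model. This is not kernel code and not the patch discussed later in the thread: `simulate`, `SPIN_LIMIT`, and the `defer_while_stopping` flag are hypothetical names invented for this sketch, and the tick numbers are arbitrary. It only shows why a stop delivered while the sched lock is held leaves the other CPU spinning until the "held too long" check fires, and how a "do not schedule while a cpustop is in flight" flag would avoid the spin.

```python
SPIN_LIMIT = 5  # hypothetical: ticks CPU 1 may spin before the check fires


def simulate(stop_tick, defer_while_stopping=False, ticks=20):
    """Toy model of the scenario in the message above.

    CPU 0 holds the sched lock from tick 0 and would release it at tick 2.
    A cpustop arriving at stop_tick freezes CPU 0 immediately, as an NMI
    would, even if it still holds the lock. From tick 1 onward, CPU 1's
    hardclock() tries to take the same lock on every tick.
    """
    lock_owner = 0        # CPU 0 holds the lock from the start
    cpu0_stopped = False
    spins = 0             # ticks CPU 1 has spent spinning on the lock
    for tick in range(ticks):
        if tick == stop_tick:
            cpu0_stopped = True       # NMI: no chance to drop the lock first
        if tick == 2 and not cpu0_stopped and lock_owner == 0:
            lock_owner = None         # normal release by CPU 0
        if tick >= 1:
            if defer_while_stopping and cpu0_stopped:
                continue              # hypothetical fix: skip scheduling
            if lock_owner is None:
                return "scheduled"    # CPU 1 acquired the lock
            spins += 1
            if spins > SPIN_LIMIT:
                return "panic: spin lock held too long"
    return "deferred"


if __name__ == "__main__":
    print(simulate(stop_tick=None))                           # no cpustop
    print(simulate(stop_tick=1))                              # stop while lock is held
    print(simulate(stop_tick=1, defer_while_stopping=True))   # with the flag
```

In this model the stop at tick 1 lands while CPU 0 still owns the lock, so the release at tick 2 never happens. The flag only stands in for the "cheap way to tell hardclock()" idea mentioned above; how a real kernel would implement it is not specified in the thread.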

> I hope you can get some crash dumps for the developers to look at,
> Attilio was trying to help me but sadly the machine had to be put into
> active use so I could no longer play with FreeBSD due to unsolved
> instability.

I want to help investigate this problem too, but I remember that /var/crash was empty after the panic, just the same as in your situation. So are there other ways to get the crash dumps? My machine has also been put into service now, but I think a short downtime should be OK.

Regards,
C.C.

> > I hope you can get some crash dumps for the developers to look at,
> > Attilio was trying to help me but sadly the machine had to be put into
> > active use so I could no longer play with FreeBSD due to unsolved
> > instability.
>
> I want to help investigate this problem too, but I remember that
> /var/crash was empty after the panic, just the same as in your
> situation.
> So are there other ways to get the crash dumps?
> My machine has also been put into service now, but I think a short
> downtime should be OK.

Could that one (on i386) be related?
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584

> Could that one (on i386) be related?
> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584
>

I have no idea about it, but I can tell the difference...
My machine panics randomly rather than on shutdown, and I remember that it failed to write a core dump. It also failed to reboot automatically.

Regards,
CC

2009/7/22 C. C. Tang <hiyorin@gmail.com>:
>> Could that one (on i386) be related?
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584
>>
>
> I have no idea about it, but I can tell the difference...
> My machine panics randomly rather than on shutdown, and I remember that it
> failed to write a core dump. It also failed to reboot automatically.

Is your problem on -CURRENT and amd64?
At some point there was a problem with PAT support (and tlb_shootdowns() could lead to a livelock hanging forever, leading to such a bug) but I expect it is fixed now. Can you try with a fresh -CURRENT, if possible?

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein

On Wednesday 22 July 2009 12:30:47 pm Attilio Rao wrote:
> 2009/7/22 C. C. Tang <hiyorin@gmail.com>:
> >> Could that one (on i386) be related?
> >> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584
> >>
> >
> > I have no idea about it, but I can tell the difference...
> > My machine panics randomly rather than on shutdown, and I remember that it
> > failed to write a core dump. It also failed to reboot automatically.
>
> Is your problem on -CURRENT and amd64?
> At some point there was a problem with PAT support (and
> tlb_shootdowns() could lead to a livelock hanging forever, leading to
> such a bug) but I expect it is fixed now.

That only happens in a test kernel module and not in stock FreeBSD (and no, it is not yet fixed).

--
John Baldwin

2009/7/7 Dan Naumov <dan.naumov@gmail.com>:
> I just got a panic followed by a reboot a few seconds after running
> "portsnap update". /var/log/messages shows the following:
>
> Jul 7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
> Jul 7 03:49:38 atom kernel: spin lock 0xffffffff80b3edc0 (sched lock
> 1) held by 0xffffff00017d8370 (tid 100054) too long
> Jul 7 03:49:38 atom kernel: panic: spin lock held too long
>
> /var/crash looks empty. This is a system running official 7.2-p1
> binaries, since I am using freebsd-update to keep up with the patches
> (just updated to -p2 after this panic), running with very low load,
> mostly serving files to my home network over Samba and running a few
> irssi instances in a screen. What do I need to do to catch more
> information if/when this happens again?

Dan, is that machine equipped with Hyperthreading?

Attilio

--
Peace can only be achieved by understanding - A. Einstein

2009/9/22 C. C. Tang <hiyorin@gmail.com>:
>>>> I have patched the sched_ule.c and did a make buildkernel & make
>>>> installkernel (is buildworld and installworld necessary?), rebooted and
>>>> the machine is running now.
>>>> I will post here again if there is any update.
>
> My server has been up for 3.5 days now with HyperThreading & powerd enabled.
> No panic has occurred yet.

How long did it usually take to panic?

Attilio

--
Peace can only be achieved by understanding - A. Einstein

Attilio Rao wrote:
> 2009/9/22 C. C. Tang <hiyorin@gmail.com>:
>>>>> I have patched the sched_ule.c and did a make buildkernel & make
>>>>> installkernel (is buildworld and installworld necessary?), rebooted and
>>>>> the machine is running now.
>>>>> I will post here again if there is any update.
>> My server has been up for 3.5 days now with HyperThreading & powerd enabled.
>> No panic has occurred yet.
>
> How long did it usually take to panic?
>
> Attilio

It is rather random, but it will usually panic within one week.
Anyway, my server will keep running and I will report if it has any problems.

Thanks,
C.C.

C. C. Tang wrote:
> Attilio Rao wrote:
>> 2009/9/22 C. C. Tang <hiyorin@gmail.com>:
>>>>>> I have patched the sched_ule.c and did a make buildkernel & make
>>>>>> installkernel (is buildworld and installworld necessary?),
>>>>>> rebooted and the machine is running now.
>>>>>> I will post here again if there is any update.
>>> My server has been up for 3.5 days now with HyperThreading & powerd enabled.
>>> No panic has occurred yet.
>>
>> How long did it usually take to panic?
>>
>> Attilio
>
> It is rather random, but it will usually panic within one week.
> Anyway, my server will keep running and I will report if it has any problems.
>
> Thanks,
> C.C.

My server has been up for 9.5 days now. It seems to be working fine.

C.C.

2009/9/28 C. C. Tang <hiyorin@gmail.com>:
> C. C. Tang wrote:
>> Attilio Rao wrote:
>>> 2009/9/22 C. C. Tang <hiyorin@gmail.com>:
>>>>>>> I have patched the sched_ule.c and did a make buildkernel & make
>>>>>>> installkernel (is buildworld and installworld necessary?), rebooted
>>>>>>> and the machine is running now.
>>>>>>> I will post here again if there is any update.
>>>>
>>>> My server has been up for 3.5 days now with HyperThreading & powerd enabled.
>>>> No panic has occurred yet.
>>>
>>> How long did it usually take to panic?
>>>
>>> Attilio
>>
>> It is rather random, but it will usually panic within one week.
>> Anyway, my server will keep running and I will report if it has any
>> problems.
>>
>> Thanks,
>> C.C.
>
> My server has been up for 9.5 days now. It seems to be working fine.

The patch has been committed to STABLE_7 as well.

Attilio

--
Peace can only be achieved by understanding - A. Einstein