Hello!

I am running CentOS-5 with the latest kernel available by default (2.6.23). I installed it on a Dell XPS machine with an Intel Quad processor (4 parallel CPUs). I use it to run a computational program, and I need to keep the program running continuously for 1-2 months. I generally boot into runlevel 3 with networking on and without X, then use ssh from another machine to connect and start the program with the "nohup" utility.

However, the system automatically gets suspended (the computational program stops, ssh stops working, the whole OS seems to freeze) after 4-5 hours. I have stopped the "acpid" daemon and booted the kernel with the "acpi=off" option in "grub.conf", but it did not help. The kernel log (/var/log/messages) doesn't show anything special. From the moment of suspension, the kernel also stops logging to "/var/log/messages".

Please help me out. I think there is a kernel problem. I have run programs for days and days continuously on FC5 (which had an older kernel). I can't use FC5 or an older version of CentOS because I need GCC-4.1.2+ to compile a parallel OpenMP program.

Thank you,
Chandra
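For readers unfamiliar with the "nohup" workflow described above, here is a minimal sketch of starting a long job over ssh so that it keeps running after logout. The helper name run_detached, the log file name, and the program name are placeholders, not anything from the original post.

```shell
#!/bin/sh
# run_detached LOGFILE COMMAND [ARGS...]
# Hypothetical helper: start COMMAND so that it survives the end of the
# ssh session, and print its PID.
run_detached() {
    logfile=$1
    shift
    # nohup makes the command ignore SIGHUP; stdin is detached and both
    # stdout and stderr are collected in the log file.
    nohup "$@" > "$logfile" 2>&1 < /dev/null &
    # $! is the PID of the job just placed in the background.
    echo $!
}

# Example (the program name is a placeholder):
#   pid=$(run_detached run.log ./my_simulation)
#   kill -0 "$pid" && echo "still running"
```

Recording the PID makes it possible to check from a later ssh session whether the job is still alive. This only covers the session side, since nohup cannot help when the whole machine freezes.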
On Tue, Feb 05, 2008 at 04:31:57PM +0900, Chandra wrote:
> Hello!
>
> I am running CentOS-5 with latest kernel available by default (2.6.23).

NO!
You are no longer running CentOS-5 if you change your kernel for your own version...

Tru
--
Tru Huynh (mirrors, CentOS-3 i386/x86_64 Package Maintenance)
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xBEFA581B
> > I am running CentOS-5 with latest kernel available by default (2.6.23).
> NO!
> You are no longer running CentOS-5 if you change your kernel for
> your own version...

Dear Tru,

Thank you for your mail. I didn't change anything at all; it is just the default installation, so I AM running CentOS-5. In fact, when I ran "# rpm -q kernel", it didn't give me any result. I didn't go into the details, assuming that something was missing at the trailing end of my command, i.e. kernel-<version-no>.

Anyway, I would appreciate it if you could help me get my problem solved.

Thanks a lot,
Chandra
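One possible explanation for "rpm -q kernel" finding nothing: the grub.conf line quoted later in this thread boots vmlinuz-2.6.18-53.el5PAE, and on CentOS 5 the PAE kernel is packaged as kernel-PAE rather than kernel. The sketch below maps a "uname -r" string to the matching package name. The function name is hypothetical, and the mapping assumes the EL5 convention of appending the variant (PAE, xen) to the release string.

```shell
#!/bin/sh
# kernel_rpm_for RELEASE
# Hypothetical helper: print the RPM name expected to provide the kernel
# whose `uname -r` output is RELEASE (EL5 naming convention assumed).
kernel_rpm_for() {
    release=$1
    case $release in
        # Variant kernels carry a suffix that becomes part of the package name.
        *PAE) echo "kernel-PAE-${release%PAE}" ;;
        *xen) echo "kernel-xen-${release%xen}" ;;
        # Plain kernels map directly to kernel-<version>-<release>.
        *)    echo "kernel-$release" ;;
    esac
}

# On the machine itself:
#   uname -r
#   rpm -q "$(kernel_rpm_for "$(uname -r)")"
```

Running "uname -r" alone already settles which kernel version is booted, which is the point under dispute above.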
Chandra <shekharc.2004 at gmail.com> wrote:
> However, the system automatically gets suspended (the computational
> program stops, ssh stops working, whole the OS seems to be freezing)
> after 4-5 hours. I have stopped the "acpid" daemon and boot the kernel
> with "acpi=off" option in "grub.conf" but no help. The kernel log (
> /var/log/messages) doesn't show anything special. After the instant of
> suspension, kernel also stops logging into "/var/log/messages".

What do you have to do to get the box out of "suspend"?

If the system is frozen and you have to reboot the box to "unfreeze" it, I'd guess it's a heat issue.

Cheers,
Dave
--
Politics, n. Strife of interests masquerading as a contest of principles.
-- Ambrose Bierce
I don't think that is the "harmless" error message mentioned in the release notes, as that one had to do with the "crash kernel".

I saw this same error on a Dell AMD system. It seems the motherboard in that system didn't do ACPI IRQ routing as the kernel expected, and the system experienced a lot of random problems until "acpi=noirq" was passed as a kernel option to disable ACPI IRQ routing, falling back to APIC IRQ routing. If that still gives you problems, then you may need to use "irq=poll", which forces the kernel to poll for IRQ changes.

-Ross

----- Original Message -----
From: centos-bounces at centos.org <centos-bounces at centos.org>
To: CentOS mailing list <centos at centos.org>
Sent: Wed Feb 06 08:03:47 2008
Subject: Re: [CentOS] Re: system gets suspended automatically!

On Wed, 2008-02-06 at 21:48 +0900, Chandra wrote:
> ==========================================================
> AN ERROR IS SHOWING UP AT BOOT TIME. It seems to be a BUG:
> ==========================================================
> Memory for crash kernel (0x0 to 0x0) notwithin permissible range
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> Red Hat nash version 5.1.19.6 starting
> Welcome to CentOS release 5 (Final)
> ....
> .....
> and continues normal booting.
>
> Any idea how to deal with it.
> Please note that it has 4 CPUs.
>
> Thanks a lot,
> - Chandra

Check the Release Notes. It is apparently harmless. I see it on all my CentOS 5.1 machines.

B.J.
> > =======================================================================
> > Memory for crash kernel (0x0 to 0x0) notwithin permissible range
> > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > Red Hat nash version 5.1.19.6 starting
> > Welcome to CentOS release 5 (Final)
> > ....
> > .....
> > and continues normal booting.

2008/2/6 Ross S. W. Walker <rwalker at medallion.com>:
> I don't think that is the "harmless" error message mentioned in the release
> notes as that had to do with the "crash kernel".
>
> I saw this same error on a Dell AMD system. It seems the motherboard in
> that system didn't do ACPI IRQ routing as the kernel expected and
> experienced a lot of random problems until "acpi=noirq" was passed as a
> kernel option to disable ACPI IRQ routing defaulting back to the APIC IRQ
> routing. If that still gives you problems then you may need to use
> "irq=poll" which forces the kernel to poll for IRQ changes.

First of all, I am sorry for my late reply; I was very busy.

Well, "acpi=noirq" didn't work, but after using the "irq=poll" option, the message "MP-BIOS bug: 8254 timer not connected to IO-APIC" stopped appearing. However, the message "Memory for crash kernel (0x0 to 0x0) notwithin permissible range" still appears.

I have started my computational program after booting the OS with the "irq=poll" option. I will report later whether it really worked and whether the system stops freezing when the program runs for a long time.

This is the kernel line in grub.conf:

kernel /boot/vmlinuz-2.6.18-53.el5PAE ro root=LABEL=/12 irq=poll early-login quiet

Also, the "acpid" daemon is not running.

===========================================================================

2008/2/6 Tru Huynh <tru at centos.org>:
> Looks like some hardware crash to me, otherwise you would have
> some logs for oops/hangs.
>
> Can you make available somewhere your /var/log/messages (don't
> send a few MB file to the list)
> and the /proc/cmdline content ?
>
> You said you used "acpi=off" and acpid disabled is it still the case?
>
> ~> chkconfig --list cpuspeed
> cpuspeed 0:off 1:on 2:off 3:off 4:off 5:off 6:off

As far as the kernel log messages and the content of "/proc/cmdline" are concerned, I will certainly make these available if the aforementioned "irq=poll" option also fails. And yes, until last time, the "acpi=off, noapic" options were passed to the kernel and acpid was kept stopped.

The output of "chkconfig --list cpuspeed" is:

cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:off

However, the "service cpuspeed start" and "service cpuspeed stop" commands don't show any message. Also, the GUI for controlling services (system-config-services) shows that cpuspeed is stopped. So I guess cpuspeed has no effect. Anyway, I will report the details a little later, after I finish checking the "irq=poll" option.

================================================================

In the meantime, I also verified that it is NOT a hardware problem. I installed FC5 on one of the other partitions and ran a SERIAL version of the same program (i.e. no OpenMP, gcc without the -fopenmp flag), and it didn't freeze at all. Well, I had to pass the "noapic" option during this installation, and it didn't recognize my network card ;). When I ran the PARALLEL version of the program (gcc with the -fopenmp option), it ran for a few hours and stopped with an error message something like "libopenmp: not sufficient memory...allocating 60 bytes". However, the system didn't hang or reboot. So I believe it has something to do with OpenMP, not the hardware.

Anyway, thank you for all your replies. I will keep posting updates here.

- Chandra
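The two "chkconfig --list cpuspeed" outputs in this message differ in runlevel 3, which is the runlevel the machine boots into. As a small sketch, the check can be scripted as follows; the function name is hypothetical, and the input is assumed to be a single line in the format shown above.

```shell
#!/bin/sh
# enabled_in_runlevel LEVEL "CHKCONFIG_LINE"
# Hypothetical helper: succeed if the chkconfig line marks the service
# as "on" for the given runlevel.
enabled_in_runlevel() {
    level=$1
    line=$2
    # Split the line on whitespace and look for a field such as "3:on".
    for field in $line; do
        [ "$field" = "$level:on" ] && return 0
    done
    return 1
}

# On the machine itself:
#   if enabled_in_runlevel 3 "$(chkconfig --list cpuspeed)"; then
#       echo "cpuspeed is configured to start in runlevel 3"
#   fi
```

Note that chkconfig only reports what is configured to start at boot. It does not say whether the service is running at the moment.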
2008/2/6 Ross S. W. Walker <rwalker at medallion.com>:
> I don't think that is the "harmless" error message mentioned in the release
> notes as that had to do with the "crash kernel".
>
> I saw this same error on a Dell AMD system. It seems the motherboard in
> that system didn't do ACPI IRQ routing as the kernel expected and
> experienced a lot of random problems until "acpi=noirq" was passed as a
> kernel option to disable ACPI IRQ routing defaulting back to the APIC IRQ
> routing. If that still gives you problems then you may need to use
> "irq=poll" which forces the kernel to poll for IRQ changes.
>
> -Ross

Thanks a lot for the tip. This seems to have worked. My system has been running continuously for the last 45 hours without any hang. This is the miracle grub.conf entry:

kernel /boot/vmlinuz-2.6.18-53.el5PAE ro root=LABEL=/12 irq=poll acpi=off noapic nolapic early-login quiet

with the acpid daemon off. It is working well with the OpenMP parallelization. When I tried without the "noapic nolapic" options in grub.conf, the system worked with serial code but hung when OpenMP was used for parallelization.

Anyway, thanks a lot for all your responses. I don't know much about this, but perhaps when the kernel detects multiple CPUs, the "irq=poll" entry should be added by default. It may be useful in solving a lot of such problems (well, just a thought) (-__^)

-Chandra
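To confirm which options the running kernel was actually booted with, the contents of /proc/cmdline can be checked word by word. The sketch below uses a hypothetical helper name and the option list from the entry above. One caveat: the parameter listed in the kernel's kernel-parameters.txt is spelled "irqpoll" (no equals sign), so the thread does not establish whether "irq=poll" as typed had any effect of its own.

```shell
#!/bin/sh
# has_boot_option OPTION "CMDLINE"
# Hypothetical helper: succeed if OPTION appears as a whole word in the
# given kernel command line.
has_boot_option() {
    option=$1
    cmdline=$2
    # Exact word match, so "noapic" is not confused with "nolapic".
    for word in $cmdline; do
        [ "$word" = "$option" ] && return 0
    done
    return 1
}

# On the machine itself:
#   for opt in acpi=off noapic nolapic irqpoll; do
#       if has_boot_option "$opt" "$(cat /proc/cmdline)"; then
#           echo "$opt: present"
#       else
#           echo "$opt: missing"
#       fi
#   done
```

Because disabling the local APIC can restrict some kernels to a single processor, it may also be worth confirming that all four CPUs are still online, for example with "grep -c ^processor /proc/cpuinfo".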