David McGiven
2013-Nov-15 11:11 UTC
[CentOS] Crash and automatical reboot when using the NVIDIA card
Hello there,

I'm running a Supermicro server with the latest CentOS 6.4 (kernel: 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).

Some time after I start using the GPU for HPC calculations, the server crashes and reboots itself. This happens every time. I know it will reboot, but I don't know when. Sometimes it's 20 minutes after I start using it. Sometimes it's 2 hours.

If I unplug the GPU card and put some stress on the server, it works OK. So I suspect there's a bug in the kernel/nvidia driver.

I can't find any messages in /var/log/messages.

What should I do? Should I file a bug on the CentOS bug tracking system? Is there any way I can gather more information? The server is in a remote location, so I have a hard time accessing the console.

Thanks.
John Doe
2013-Nov-15 11:21 UTC
[CentOS] Crash and automatical reboot when using the NVIDIA card
From: David McGiven <davidmcgivenn at gmail.com>

> I'm running a Supermicro server with the latest CentOS 6.4 versions (kernel
> : 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).
> A few minutes after using the GPU for doing some HPC calculations, the
> server crashes and reboots itself. This is happening every time. I know it
> will be rebooted but I don't know when. Sometimes it's 20 minutes after
> starting using it. Sometimes it's 2 hours.
> If I unplug the GPU card and put some stress on the server, it works ok. So
> I suspect there's a bug in the kernel/nvidia driver.
> I can't find any messages on /var/log/messages.

Did you check the IPMI logs? The first thing that comes to my mind is overheating. Maybe dump the temperatures to a log file every minute and, after the next reboot, check whether there is a peak... Or maybe a freeze plus the watchdog kicking in?

JD
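The temperature-logging idea above could be sketched roughly as follows. This is only a minimal sketch under assumptions not stated in the thread: Python 3.7+ is available on the server, `nvidia-smi` is on the PATH and the installed driver supports its `--query-gpu` option, and the log path and all function names are hypothetical. Each sample is fsynced so the last line written before a hard reset has a chance of surviving on disk.

```python
import os
import subprocess
import time
from datetime import datetime, timezone

# Assumption: nvidia-smi is installed and the driver supports --query-gpu.
GPU_TEMP_CMD = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_temps(output):
    """Parse nvidia-smi output (one integer per GPU, one per line).

    Lines that are not plain integers (e.g. "N/A") are skipped.
    """
    temps = []
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():
            temps.append(int(line))
    return temps


def format_log_line(when, temps):
    """Build one log line: UTC timestamp, then one temperature per GPU."""
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    return " ".join([stamp] + [str(t) for t in temps])


def append_durably(path, line):
    """Append a line and force it to disk before returning."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def hottest_sample(log_lines):
    """Return (timestamp, temperature) of the hottest sample, or None."""
    best = None
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # timestamp only, no readings
        peak = max(int(x) for x in fields[1:])
        if best is None or peak > best[1]:
            best = (fields[0], peak)
    return best


def run_logger(path="/var/log/gpu-temps.log", interval=60):
    """Sample GPU temperatures forever, once per `interval` seconds.

    The default path is a hypothetical choice; adjust as needed.
    """
    while True:
        result = subprocess.run(GPU_TEMP_CMD, capture_output=True, text=True)
        temps = parse_gpu_temps(result.stdout)
        append_durably(path, format_log_line(datetime.now(timezone.utc), temps))
        time.sleep(interval)
```

Calling `run_logger()` in the background before starting the GPU job, and then passing the lines of the log file to `hottest_sample()` after the next reboot, would show whether the temperature climbed before the crash. If `ipmitool` is installed and the board's BMC is reachable, `ipmitool sel elist` and `ipmitool sdr type Temperature` are the usual ways to look at the IPMI event log and the board's own temperature sensors.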
Ron Young
2013-Nov-15 15:06 UTC
[CentOS] Crash and automatical reboot when using the NVIDIA card
I am forced to use a windoze 7 box, and recently MS decided in its infinite wisdom to update the nvidia driver via windoze update. My machine immediately started showing the same symptoms David is having: hanging at indeterminate times, and even a BSOD twice. It would do this even when idle during the night.

Googling for an answer turned up a post on a forum linked from the nvidia web site, suggesting that there were a lot of problems with the current version and that we should reinstall back-level drivers. The post suggested going back to 314.22. I did so and have not had a single problem since. YMMV.

Regards,
Ron Young
Panruo Wu
2013-Nov-22 19:36 UTC
[CentOS] Crash and automatical reboot when using the NVIDIA card
David McGiven <davidmcgivenn at ...> writes:

> Hello there,
>
> I'm running a Supermicro server with the latest CentOS 6.4 versions (kernel
> : 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).
>
> A few minutes after using the GPU for doing some HPC calculations, the
> server crashes and reboots itself. This is happening every time. I know it
> will be rebooted but I don't know when. Sometimes it's 20 minutes after
> starting using it. Sometimes it's 2 hours.
>
> If I unplug the GPU card and put some stress on the server, it works ok. So
> I suspect there's a bug in the kernel/nvidia driver.
>
> I can't find any messages on /var/log/messages.
>
> What should I do ? Should I file a bug on the centos bugtracking system ?
> Is there anyway I can gather more information ? The server is in a remote
> location so I have a hard time accessing the console.
>
> Thanks.

Hi there,

I have the same problem on all 4 of my Supermicro machines. I don't know why it happens, but the nvidia driver seems to be to blame in my case. I'm using CentOS 6.3 with nVidia driver version 304.54 or 319.37.

Best,
Panruo