dwight at supercomputer.org
2010-Apr-30 16:32 UTC
[Xen-devel] XCP: Crashes on dual Xeon HP ProLiant systems
Is anyone else running the latest XCP on HP ProLiant DL380 systems? Or a similar dual Xeon 8-core system? I'm seeing spontaneous reboots when under a load.

Specifically, when 4 Windows HVMs are loaded, I haven't noticed any reboots yet. But when running 7 or 8, the system will reboot within minutes. Very little information appears on the console.

I built a debugging version of the hypervisor, which changed the behavior; the system managed to stay up for 2-3 hours with 7 VMs running. However, it again spontaneously rebooted, with no real messages on the console as to why.

I can send out the console log messages this evening, along with the system information if there's interest. Alas, I don't have access to these items at the moment.

I have also been running memtest86 overnight. As of 1.5 hours into the test, there were no errors. But there are 48 GB of RAM on the system, so the testing wasn't complete when I left.

Any suggestions here? I was going to build a 32-bit kernel from the latest patches, but it appears CentOS 5.4 Xen is also not stable on these systems. I had trouble getting the kernel to build here, with various errors. The most notable of which was:

----------------------
  CC      arch/x86/kernel/acpi/processor.o
In file included from arch/x86/kernel/acpi/processor.c:8:
include/linux/kernel.h:185: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
The bug is not reproducible, so it is likely a hardware or OS problem.
make[2]: *** [arch/x86/kernel/acpi/processor.o] Error 1
make[1]: *** [arch/x86/kernel/acpi] Error 2
make: *** [arch/x86/kernel] Error 2
----------------------

This was with a 64-bit Dom0 and a 32-bit Fedora 11 VM.

A 64-bit DomU works just fine. I know the stock XCP kernel is 32-bit. Are there any issues running a 64-bit XCP kernel, other than a slight degradation in speed?

-dwight- _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
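Since an "internal compiler error" from gcc is the one concrete symptom in the report above, it may be worth flagging automatically when builds run unattended. The following is only a minimal sketch: the function name `find_compiler_crashes` and the regular expression are illustrative assumptions, matched against the log format shown in this message (with an optional column number, as newer compilers print one), not something gcc or XCP provides.

```python
import re

# Assumed pattern, based on the gcc line quoted in this thread:
#   include/linux/kernel.h:185: internal compiler error: Segmentation fault
# The optional (?::\d+)? allows for a column number after the line number.
ICE_PATTERN = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+)(?::\d+)?:\s*"
    r"internal compiler error:\s*(?P<reason>.+)$",
    re.MULTILINE,
)


def find_compiler_crashes(log_text):
    """Return a list of (file, line, reason) for each compiler crash in a build log."""
    return [
        (m.group("file"), int(m.group("line")), m.group("reason").strip())
        for m in ICE_PATTERN.finditer(log_text)
    ]


# Sample input copied from the build output in this message.
sample_log = """\
  CC      arch/x86/kernel/acpi/processor.o
In file included from arch/x86/kernel/acpi/processor.c:8:
include/linux/kernel.h:185: internal compiler error: Segmentation fault
make[2]: *** [arch/x86/kernel/acpi/processor.o] Error 1
"""

crashes = find_compiler_crashes(sample_log)
print(crashes)
```

A crash that moves to a different file on each run would be consistent with gcc's own note that a non-reproducible failure points at the hardware or OS rather than at the source being compiled.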
Pasi Kärkkäinen
2010-Apr-30 18:20 UTC
Re: [Xen-devel] XCP: Crashes on dual Xeon HP ProLiant systems
On Fri, Apr 30, 2010 at 09:32:37AM -0700, dwight at supercomputer.org wrote:
> Is anyone else running the latest XCP on HP ProLiant DL380
> systems? Or a similar dual Xeon 8-core system? I'm seeing
> spontaneous reboots when under a load.
>
> Specifically, when 4 Windows HVMs are loaded, I haven't noticed
> any reboots yet. But when running 7 or 8, the system will
> reboot within minutes. Very little information appears on
> the console.
>
> I built a debugging version of the hypervisor, which changed
> the behavior; the system managed to stay up for 2-3 hours
> with 7 VMs running. However, it again spontaneously rebooted,
> with no real messages on the console as to why.
>
> I can send out the console log messages this evening, along
> with the system information if there's interest. Alas, I
> don't have access to these items at the moment.
>
> I have also been running memtest86 overnight. As of 1.5 hours into
> the test, there were no errors. But there are 48 GB of RAM
> on the system, so the testing wasn't complete when I left.
>
> Any suggestions here? I was going to build a 32-bit kernel
> from the latest patches, but it appears Centos 5.4 Xen is
> also not stable on these systems. I had trouble getting
> the kernel to build here, with various errors. The most
> notable of which was:
>
> ----------------------
>   CC      arch/x86/kernel/acpi/processor.o
> In file included from arch/x86/kernel/acpi/processor.c:8:
> include/linux/kernel.h:185: internal compiler error: Segmentation fault
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <http://bugzilla.redhat.com/bugzilla> for instructions.
> The bug is not reproducible, so it is likely a hardware or OS problem.
> make[2]: *** [arch/x86/kernel/acpi/processor.o] Error 1
> make[1]: *** [arch/x86/kernel/acpi] Error 2
> make: *** [arch/x86/kernel] Error 2
> ----------------------
>

Uhm.. the compiler really shouldn't crash.

Are you sure your hardware is OK? If the stock EL5.4 Xen also crashes, it could be broken hardware?

Did you try running memtest86+ ?

Is baremetal Linux stable, if you run for example "make -j8 bzImage && make -j8 modules && make clean" kernel build in a loop?

-- Pasi
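The kernel-build loop suggested above could be driven by a small script along these lines. This is a rough sketch, not a tested burn-in tool: the three make commands come from the message, while `run_build_loop`, `BUILD_STEPS`, and the `max_iterations` and `cwd` parameters are names made up for illustration. It assumes it is started from (or pointed at) a configured kernel source tree.

```python
import subprocess
import time

# Build steps taken from the suggestion in this message.
BUILD_STEPS = ("make -j8 bzImage", "make -j8 modules", "make clean")


def run_build_loop(steps=BUILD_STEPS, max_iterations=None, cwd=None):
    """Repeat the build steps until one fails or max_iterations is reached.

    Returns (completed_iterations, failed_step); failed_step is None when
    every requested iteration finished cleanly. With max_iterations=None the
    loop only ends on a failure (or when the machine goes down).
    """
    completed = 0
    start = time.time()
    while max_iterations is None or completed < max_iterations:
        for step in steps:
            # shell=True so each step is run exactly as it would be typed.
            result = subprocess.run(step, shell=True, cwd=cwd)
            if result.returncode != 0:
                hours = (time.time() - start) / 3600
                print(f"FAILED after {completed} full iterations "
                      f"({hours:.2f} h): {step}", flush=True)
                return completed, step
        completed += 1
        print(f"iteration {completed} ok", flush=True)
    return completed, None
```

Calling `run_build_loop(cwd="/usr/src/linux")` (the path is only an example) would keep building until a step returns a non-zero exit status. A spontaneous reboot will of course kill the script as well, so the console output should be captured somewhere that survives the reset, such as a serial console.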
Ian Campbell
2010-Apr-30 19:15 UTC
Re: [Xen-devel] XCP: Crashes on dual Xeon HP ProLiant systems
On Fri, 2010-04-30 at 17:32 +0100, dwight at supercomputer.org wrote:
>
> A 64-bit DomU works just fine. I know the stock XCP
> kernel is 32-bits. Are there any issues running a 64-bit
> XCP kernel, other than a slight degradation in speed?

The XCP domain 0 kernel is only tested in 32 bit (PAE) configurations. I'd expect it to work for 64 bit but wouldn't necessarily bet on it.

Ian
dwight at supercomputer.org
2010-May-01 21:06 UTC
Re: [Xen-devel] XCP: Crashes on dual Xeon HP ProLiant systems
On Friday 30 April 2010 11:20:07 am Pasi Kärkkäinen wrote:
> On Fri, Apr 30, 2010 at 09:32:37AM -0700, dwight at supercomputer.org wrote:
> > Is anyone else running the latest XCP on HP ProLiant DL380
> > systems? Or a similar dual Xeon 8-core system? I'm seeing
> > spontaneous reboots when under a load.
...
>
> Uhm.. the compiler really shouldn't crash.
>
> Are you sure your hardware is OK? If the stock EL5.4 Xen also
> crashes, it could be broken hardware?
>
> Did you try running memtest86+ ?
>
> Is baremetal Linux stable, if you run for example
> "make -j8 bzImage && make -j8 modules && make clean" kernel build
> in a loop?
>
> -- Pasi

Thank you for your reply, Pasi.

I agree that the compiler shouldn't crash. That's definitely rude behavior.

It might well be broken hardware. I had been thinking it was more likely an issue between the older CentOS Xen and this much newer Xeon hardware, and that the "hardware or OS problem" gcc was complaining about was an issue with the virtualized hardware.

But yesterday I ran into a different issue, which leads me to believe that it is either a physical hardware or Dom0 OS issue. On the machine which was running XCP, I tried installing 64-bit CentOS 5.4. The installation crashed, two separate times. The first time I didn't have a log file (since it was a video-based installation). The second time through, though, I used the iLO virtualized serial port, and I could see that the installation crashed about halfway through. Again, a spontaneous reboot, as XCP experienced.

I talked to one of the guys in the lab, who has done far more installations on these ProLiant (and Dell) boxes than I have, and he was quite familiar with this. He said that on some of these boxes (both HP and Dell), the 64-bit CentOS 5.4 install will crash, but supposedly the 32-bit installation will work. He also said that CentOS 5.3, both 32 and 64 bit, works fine.

I realize that this is anecdotal, and I don't have any more information here (as to the CPUs and hardware), but I thought that this was interesting.

At this point, I don't trust either the hardware or the OS, so I'm going to start a full diagnostics run using a suite that I've put together over the past 15 years, which has served me very well in qualifying boxes.

memtest86 is one of these. I mentioned earlier that I had started an overnight run of this on both boxes. I can now report that both have passed. After 12+ hours, they had gone successfully through two separate runs without error.

Next up is prime95, with the torture test. Nothing else comes close to exercising the CPU, as indicated by the heat given off during this test. This will also be a test of the cooling.

If that passes, then I'm going to exercise the disk subsystem. One of these tests is very similar to what you suggested: multiple rebuilds of the kernel, but from scratch each time.

Frankly, though, I'm going to see if I can get a different ProLiant box. Nonetheless, I want the data on this one. I'm hoping that I can detect a box which will fail before I run XCP on it.

I'll post the results when I have them, hopefully in a couple of days.

-dwight-
dwight at supercomputer.org
2010-May-01 21:07 UTC
Re: [Xen-devel] XCP: Crashes on dual Xeon HP ProLiant systems
On Friday 30 April 2010 12:15:38 pm Ian Campbell wrote:
> On Fri, 2010-04-30 at 17:32 +0100, dwight at supercomputer.org wrote:
> > A 64-bit DomU works just fine. I know the stock XCP
> > kernel is 32-bits. Are there any issues running a 64-bit
> > XCP kernel, other than a slight degradation in speed?
>
> The XCP domain 0 kernel is only tested in 32 bit (PAE)
> configurations. I'd expect it to work for 64 bit but wouldn't
> necessarily bet on it.
>
> Ian

Thank you, Ian. That was helpful.

-dwight-
dwight at supercomputer.org
2010-May-24 16:35 UTC
Re: [Xen-devel] XCP: Epilog - Crashes on dual Xeon HP ProLiant systems
On Friday 30 April 2010 09:32:37 am I wrote:
> Is anyone else running the latest XCP on HP ProLiant DL380
> systems? Or a similar dual Xeon 8-core system? I'm seeing
> spontaneous reboots when under a load.
>

I wanted to follow up to the list on this issue, particularly in case someone else comes across this with the ProLiant series in the future.

The bottom line is that it was a firmware issue (actually, at least two different components needed a firmware update).

Thanks to Pasi and Ian for the replies and suggestions.

Also, I was able to repeat the odd behavior of 64-bit CentOS 5.4 not installing, while the 32-bit version worked. This also went away after the firmware upgrade.

Here are some more details which probably aren't of interest to the list, but I'm sending them along in the hopes of sparing someone else who comes across this and does a Google search.

The key test here was running a continual loop of a -j8 kernel build, from scratch. One test failed after 14 hours; another after 9 hours. memtest86 and prime95 in torture test mode worked fine.

It looks like we got some machines from one of the early manufacturing runs back in July. HP has put a lot of effort into fixing a number of issues since then. One needs at least the general firmware update ISO from their website, which is presently at Version 9. This is necessary, but not sufficient. One of our machines would still crash (though 64-bit CentOS would now install). The final missing piece was a CPLD update, which HP support was kind enough to quickly send me.

With that, all machines have been running XCP and numerous VMs quite solidly under a heavy load.

In spite of these problems, I have to give kudos to HP for the support effort that they've put into fixing all of these problems over the past year. Some manufacturers wouldn't put nearly as much effort into it.

Thanks again,
-dwight-
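Because the failures in this thread were spontaneous reboots, a long-running test gets no chance to report how far it got. One way to recover figures like "failed after 14 hours" afterwards is to have the test loop append a heartbeat line that is forced to disk on every iteration. The sketch below is an assumption about how that could be done; `write_heartbeat`, `survived_hours`, and the log format are illustrative and were not part of the original test suite.

```python
import os
import time


def write_heartbeat(path, iteration, now=None):
    """Append 'unix_timestamp iteration' to the log and force it to disk."""
    stamp = time.time() if now is None else now
    with open(path, "a") as log:
        log.write(f"{stamp:.0f} {iteration}\n")
        log.flush()
        # fsync so the line is on disk even if the box resets right after.
        os.fsync(log.fileno())


def survived_hours(path):
    """Hours between the first and last heartbeat recorded in the log."""
    with open(path) as log:
        stamps = [float(line.split()[0]) for line in log if line.strip()]
    if len(stamps) < 2:
        return 0.0
    return (stamps[-1] - stamps[0]) / 3600.0
```

Calling `write_heartbeat("burnin.log", n)` at the end of each build iteration (the file name is just an example) and `survived_hours("burnin.log")` after the machine comes back up gives the time between the first and last completed iterations before the crash.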