Hello everyone,

I have an extremely annoying freeze problem with Xen that I can't get fixed, or at least debugged. It's a bit of a long story.

I ordered an x86_64-based colo server in the middle of last year to run Xen and a couple of personal domUs on it. The box kept freezing all the time. I tried a lot of things to debug it and could not get a handle on it. The setup is described in
http://thread.gmane.org/gmane.comp.emulators.xen.user/25347/focus=25500
and http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1007 .

Shortly after those mails (mid-July), after my hoster had swapped each and every part in this box, they finally replaced the previous VIA-based board with one with an AMD/ATI chipset, and suddenly the box was rock stable. During the last 10 months I did not have a single crash. It first ran a self-compiled 3.1.0, was then changed to a Debian lenny userland and hypervisor, got a self-compiled dom0 kernel based on Ubuntu Gutsy in January, and the fresh Debian 3.2.1 hypervisor at the end of May. No problems whatsoever.

A few days ago the box crashed and did not come back online, even after issuing a hardware reset command. The IP-KVM my hoster connected showed that the box was waiting for a keypress in the BIOS, with a message saying that a previous POST had been interrupted, which might be caused by overclocking (definitely not in use). When you pressed a key the box booted fine but crashed within minutes, again dying in the BIOS. Definitely a hardware defect. After almost all parts were replaced (CPU, RAM, power supply, fans) the box did not crash in the BIOS anymore, but suddenly started to experience the dom0 hangs again. The software setup had not been changed since January (the Gutsy kernel installation) and had been rebooted a couple of times after that for maintenance, so it should definitely be fine.

I thought that maybe the board was faulty and got it changed to another one, an nForce 560 based MSI K9N NEO-F V3. Still the same crashes. Except for the hard disk, the hardware has been completely replaced.

I tried changing the dom0 kernel to the Ubuntu Hardy 2.6.24-18-xen distribution kernel, and I tried numerous boot options for the hypervisor (noacpi, nolapic, watchdog) and the dom0 kernel (swiotlb, now trying acpi=off and noapic). The problem is always the same: after some hours the box freezes. There are no error messages in the log or on the console, nothing. I still cannot send the 3*Ctrl-a to the box using the IP-KVM, so I can't tell whether dom0 or the hypervisor crashed, but I can tell that nothing whatsoever responds anymore.

Does anyone have any idea how to debug this further? Any options I might try to at least better understand this issue?

svr01:~# dpkg -l | grep xen
ii  libxenstore3.0              3.2.1-1
ii  linux-image-2.6.24-18-xen   2.6.24-18.32  Linux
ii  xen-hypervisor-3.2-1-amd64  3.2.1-1       The Xen
ii  xen-tools                   3.9-3         Tools
ii  xen-utils-3.2-1             3.2.1-1       XEN
ii  xen-utils-common            3.2.0-2       XEN
ii  xenstore-utils              3.2.1-1

Bernhard

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
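Since nothing reaches the log or the console before the freeze, one way to at least narrow down when it happens is a heartbeat file in dom0 that is forced to disk at a fixed interval. The following is only a minimal Python sketch and not something taken from this thread: the path /var/tmp/heartbeat.log, the 10-second interval, and the function names are assumptions chosen for illustration. It also records the load averages, so that a load-related pattern would show up in the last records before a freeze.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one 'timestamp load1 load5 load15' record and force it to disk."""
    now = time.time() if now is None else now
    load1, load5, load15 = os.getloadavg()
    line = "%d %.2f %.2f %.2f\n" % (now, load1, load5, load15)
    with open(path, "a") as fh:
        fh.write(line)
        fh.flush()
        # fsync so the record is on disk even if the box freezes right after.
        os.fsync(fh.fileno())
    return line


def last_heartbeat(path):
    """Return the timestamp of the last complete record, or None if there is none."""
    last = None
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            # Skip a record that was cut off in the middle of a write.
            if len(fields) == 4 and line.endswith("\n"):
                last = int(fields[0])
    return last


def run(path="/var/tmp/heartbeat.log", interval=10):
    """Write a heartbeat forever; path and interval are illustrative defaults."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

Calling run() in the background in dom0 and, after the next reset, calling last_heartbeat() on the same file gives the freeze time to within one interval, which can then be matched against cron jobs, backups, or domU activity. It does not tell whether dom0 or the hypervisor hung.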
Looks like headaches ;)

Anyway, I found the Ubuntu Gutsy and Ubuntu Hardy (x32_64) kernels bleeding edge, but NOT stable for production use...

If your hoster really has changed all relevant parts, I assume some obscure BIOS settings were "changed". In general, I would suggest moving from gaming hardware to server hardware ;)

Based on Xensource's 2.6.18-8, I've built our "primary" kernel with added Areca RAID and fixed 3ware RAID support. I have found it rock-solid on a bunch of different Xeon and Opteron driven boards, though it lacks disklabel support.

Feel free to take it (and get a higher MTBF for debugging...)
http://boreas.netz-haut.net/pub/kernelpack-2.6.18.8-xen-2008.tar.gz

Cheers,
Stephan

Bernhard Schmidt wrote:
> Does anyone have any idea how to debug this further? Any options I might
> try to at least better understand this issue?

--
Stephan Seitz
Senior System Administrator

*netz-haut* e.K.
multimediale kommunikation

zweierweg 22
97074 würzburg

fon: +49 931 2876247
fax: +49 931 2876248

web: www.netz-haut.de <http://www.netz-haut.de/>

registriergericht: amtsgericht würzburg, hra 5054
Stephan Seitz <s.seitz@netz-haut.de> wrote:

> Anyway, I found the Ubuntu Gutsy and Ubuntu Hardy (x32_64) kernels
> bleeding edge, but NOT stable for production use...

Ubuntu Gutsy has been rock-stable for me since it was released. Hardy kernels have several stability issues, which usually manifest in an OOPS either in dom0 or domU, but so far I haven't seen a single box freeze without any message at all.

> If your hoster did have changed all relevant parts, I assume there are
> some obscure BIOS settings "changed". In general, I would suggest to
> move from gaming hardware to server hardware ;)

The crashes (re)appeared when everything except (!) the board was changed.

I would happily move to server hardware if
a) I could find another hoster with native and decent IPv6 on the Ethernet in a price range suitable for end users
b) there were a guarantee that these issues won't happen again

b) is the key here. I have installed a number of Xen boxes, both x32 and x64, with Debian-Xen, Ubuntu-Xen or homebrew Xen. Some of them had their share of issues (flaky drivers, flaky RAM, flaky kernels), but none ever failed on me like that. Especially since basically the whole system was changed between the old instabilities in mid-2007 and the recent ones. I totally fail to understand this. Be it game-server hardware or expensive hardware, the system should be stable.

> Feel free to take it (and get a higher MTBF for debugging...)
> http://boreas.netz-haut.net/pub/kernelpack-2.6.18.8-xen-2008.tar.gz

I'll try that kernel, but I'm pretty sure it will fail as well.

Bernhard
Since you have IPKVM access, ask your host to boot up memtest86 for you and let your box "cook" for 12 to 24 hours, as these kinds of problems are always hardware-related in my experience. I run a hosting shop and memtest86 is standard operating procedure for all server deployments. It will identify hardware issues with the motherboard/RAM/CPU usually within minutes, and heat-related issues usually within 24 hours. I would recommend not taking the IPKVM off while it's under test, for various reasons. You'll either get a bad memory block, a lockup, a reboot, or several passes with no error of any kind. HTH.

-Ray

On Sun, Jun 8, 2008 at 5:52 PM, Bernhard Schmidt <berni@birkenwald.de> wrote:
> The crashes (re)appeared when everything except (!) the board was
> changed.
On Mon, Jun 09, 2008 at 02:24:26AM -0400, Ray Barnes wrote:

Hi Ray,

> Since you have IPKVM access, ask your host to boot up memtest86 for
> you and let your box "cook" for 12 to 24 hours, as these kinds of
> problems are always hardware-related in my experience. I run a
> hosting shop and memtest86 is standard operating procedure for all
> server deployments. It will identify hardware issues with the
> motherboard/RAM/cpu usually within minutes, and heat-related issues
> usually within 24 hours. I would recommend not taking the IPKVM off
> while it's under test for various reasons. You'll either get a bad
> memory block, a lockup, a reboot, or several passes with no error of
> any kind. HTH.

If history is any indicator (as I said, I already had the same problem last year), memtest86+ will run fine. But I'm running it nonetheless; the first pass has completed without any errors.

Bernhard
Bernhard Schmidt wrote:
> If history is any indicator (as I said I had the same problem already
> last year) memtest86+ will run fine. But I'm running it nonetheless,
> first pass has been completed without any errors.

Hello,

any news on this issue? I am experiencing this very same freezing issue with this software installed:

ii  libxenstore3.0                    3.2.1-2                Xenstore communications library for Xen
ii  linux-image-2.6.18-6-xen-amd64    2.6.18.dfsg.1-22etch2  Linux 2.6.18 image on AMD64
ii  linux-modules-2.6.18-6-xen-amd64  2.6.18.dfsg.1-22etch2  Linux 2.6.18 modules on AMD64
ii  xen-hypervisor-3.2-1-amd64        3.2.1-2                The Xen Hypervisor on AMD64
ii  xen-tools                         3.9-3                  Tools to manage Debian XEN virtual servers
ii  xen-utils-3.2-1                   3.2.1-2                XEN administrative tools
ii  xen-utils-common                  3.2.0-2                XEN administrative tools - common files
ii  xenstore-utils                    3.2.1-2                Xenstore utilities for Xen

Freezing episodes happen every day or so. KVM access is dead, there are no messages on the console, and there is no keyboard access. The hardware has tested OK.

Thanks,
regards,
Sergi
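With two affected hosts now reported, it can help to see at a glance which Xen packages they share at exactly the same version. The sketch below is an illustration under stated assumptions, not a tool anyone in the thread used: the function names are made up, and the two listings are copied from the dpkg output posted by Bernhard and Sergi with the description column dropped.

```python
BERNHARD = """\
ii libxenstore3.0 3.2.1-1
ii linux-image-2.6.24-18-xen 2.6.24-18.32
ii xen-hypervisor-3.2-1-amd64 3.2.1-1
ii xen-tools 3.9-3
ii xen-utils-3.2-1 3.2.1-1
ii xen-utils-common 3.2.0-2
ii xenstore-utils 3.2.1-1
"""

SERGI = """\
ii libxenstore3.0 3.2.1-2
ii linux-image-2.6.18-6-xen-amd64 2.6.18.dfsg.1-22etch2
ii linux-modules-2.6.18-6-xen-amd64 2.6.18.dfsg.1-22etch2
ii xen-hypervisor-3.2-1-amd64 3.2.1-2
ii xen-tools 3.9-3
ii xen-utils-3.2-1 3.2.1-2
ii xen-utils-common 3.2.0-2
ii xenstore-utils 3.2.1-2
"""


def parse_dpkg_lines(text):
    """Map package name to version for 'ii <name> <version> ...' lines."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii":
            packages[fields[1]] = fields[2]
    return packages


def compare(a, b):
    """Return (same version, different version, present on one host only)."""
    same = sorted(n for n in a if n in b and a[n] == b[n])
    differ = sorted(n for n in a if n in b and a[n] != b[n])
    only_one = sorted(set(a) ^ set(b))
    return same, differ, only_one


same, differ, only_one = compare(parse_dpkg_lines(BERNHARD), parse_dpkg_lines(SERGI))
print("same version:     ", same)
print("different version:", differ)
print("on one host only: ", only_one)
```

For these two listings, only xen-tools and xen-utils-common are identical. The hypervisor and Xen utilities differ in Debian revision (3.2.1-1 versus 3.2.1-2), and the dom0 kernels are entirely different (2.6.24 versus 2.6.18). The comparison by itself does not point to a cause.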
I recall 3.2.1 being extremely stable, and I have yet to see a software issue under Xen that did not produce a kernel "oops". Did you actually run memtest86 in accordance with my recommendations, or use some other method?

-Ray

On Thu, Jan 8, 2009 at 12:02 PM, Sergi Seira <s.seira@cdmon.com> wrote:
> Freezing episodes happen every day or so. KVM access is dead, no
> messages on console, no keyboard access ... hardware is tested ok.
Ray Barnes wrote:
> I recall 3.2.1 to be extremely stable, and I have yet to see a
> software issue under Xen where it did not produce a kernel "oops".
> Did you actually run memtest86 in accordance with my recommendations,
> or some other method?

Hello Ray,

sorry for the delay. I ran a memtest and it passed with no errors. It's a 16 GB server, so a memtest needs 5 to 6 hours to complete.

I would like to think this is a hardware issue, but I have another server where I recently installed Xen 3.2.1 under Debian 4.0, and it froze once, 7 days ago. The server I originally complained about freezes every 2 to 3 days. It had 13 paravirtual domUs but now has only two, and it had a freezing episode yesterday.

Anyway, I know this is hard to resolve. This thread is the only one I found on the web that matches the issue, so it's more likely to be a hardware/driver problem than a Xen problem. I don't like, though, that I can't pinpoint where the problem is.

Thanks for your interest.

Regards,
Sergi
I read the full thread today. It looks like the problems I have here with Solaris Nevada SXCE / xVM hypervisor (Sun's Xen). I'm running Gentoo x64 PV guests and sometimes, on the second machine, Windows 2008 x64 HVM.

Both boxes (Sun Fire X4150, 2x quad-core Harpertown Xeon) have shown immediate reboots without any noticeable reason. Sun support has tracked it down: there are no hardware problems, and it looks like a problem with the Xen/xVM hypervisor.

Sometimes I have had the feeling that it depends on system load, but then the box crashed with half of the normal domain count on the morning of 24.12.2007, in other words with no load at all. The hardware log shows strange ACPI events, though, so it could be that the hypervisor does something strange.

We also had this problem earlier; then it disappeared, and that first time we failed to track it all the way down.

Florian

Sergi Seira wrote:
> I found on the web only this thread matching the issue, so it's more
> likely to be a hardware/driver problem than a xen problem. I don't
> like, though, that I can't point where the problem is.