Heiko Wundram
2012-Mar-27 12:48 UTC
Debugging (possible) Xen-related hang-issues without the possibility of attaching serial console to capture Xen output
Hey all!

I'm currently trying to debug a (possibly!) Xen-related issue (with kernel 3.2.9-vanilla as Dom0 and Xen 4.1.2 [almost] vanilla, except for a patch for CVE-2012-0029), where a system freezes without my being able to identify a triggering event. The system console is completely unresponsive when a hang occurs (i.e., it displays the motd and the login prompt, but no Dom0 traceback or anything related), and I'm unable to pinpoint the exact source of the hang because I can't reproduce it under laboratory conditions on similar hardware (which is the worst part about these hangs). So, basically, it might actually not be Xen-related but [hosting-]environment-related.

I'd already written about this some time ago, and the suggestion to enable network logging (through netconsole) is implemented on the system in question, but when the system locks up I'm also not seeing any output in the netconsole listener process on the remote logging machine. It seems that the Dom0 kernel is not "the guilty party" when these hangs occur.

Normally I'd now attach a serial console to the system to check whether the hypervisor logs any fatal errors when the system hangs, but as the machine in question is located at a service provider (which I can't change easily and which adamantly refuses to help with debugging by attaching a serial console), I'm unable to get at Xen's serial output on the host in question.

Is there any, _any_ other way to get at hypervisor output in the case of a crash? I.e., some form of memory buffering of the output messages which I might later be able to find in the host memory after rebooting to a "working" state, or something similar? I know that FreeBSD and (lately) Linux implement this kind of crash-debugging helper - is there any possibility of getting something similar working with Xen?
Reverting to a kdump/kexec-enabled kernel (i.e., 2.6.18) for the Dom0 is (pretty much) impossible due to the hardware of the system in use, but if someone points me at a 2.6.34+ Xen-kexec/kdump-enabled Dom0 kernel, I'd also be happy to give that a try (I've not found any, or rather: kdump/kexec also doesn't work with the OpenSUSE-up-patched xenified 2.6.38 I had in use before switching to 3.2.x-vanilla, but the problems [hangs] were similar with the older kernel).

Thanks for any answer!

-- --- Heiko.
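For the netconsole setup mentioned above, it can help to know exactly when the last message arrived before a hang. The following is only a minimal sketch of a timestamping UDP receiver for the remote logging machine, not the listener actually used in this thread. It assumes netconsole sends to UDP port 6666 (its documented default target port); the log file name and the function names are made up for illustration.

```python
import socket
import time


def format_record(payload: bytes, sender: str, now: float) -> str:
    """Prefix one netconsole datagram with a UTC timestamp and the sender address."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender} {text}"


def listen(port: int = 6666, logfile: str = "netconsole.log") -> None:
    """Append every received datagram to logfile, flushing after each write.

    The port is an assumption (netconsole's default target port); the
    log file name is hypothetical.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    with open(logfile, "a", encoding="utf-8") as log:
        while True:
            payload, (host, _port) = sock.recvfrom(65535)
            log.write(format_record(payload, host, time.time()) + "\n")
            log.flush()  # keep the file current in case this machine is interrupted
```

Calling `listen()` starts the receive loop and runs until interrupted. Note that this only captures what the Dom0 kernel sends through netconsole; it does not by itself capture hypervisor output.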
Tom Snowhorn
2012-Mar-27 23:41 UTC
Re: Debugging (possible) Xen-related hang-issues without the possibility of attaching serial console to capture Xen output
(Forgive me, I'm replying to the digest.)

What leads you to believe it's a "Xen-related hang issue"? I'd be more inclined to suspect the service provider gave you a box with bad RAM (which is why we 'memtest86' all of ours before giving them to customers). It sounds like you have an IPKVM, so see if you can boot memtest if you haven't already done so. If that's not readily available, try the built-in kernel memtest functions (included with late-model 2.6 kernels), though I understand their tests are less conclusive than bare-metal memtest86.

Rgds, TS
Heiko Wundram
2012-Mar-28 09:04 UTC
Re: Debugging (possible) Xen-related hang-issues without the possibility of attaching serial console to capture Xen output
On 28.03.2012 01:41, Tom Snowhorn wrote:
> What leads you to believe it's a "Xen-related hang issue"? I'd be
> more inclined to suspect the service provider gave you a box with bad
> RAM (which is why we 'memtest86' all of ours before giving them to
> customers). It sounds like you have an IPKVM, so see if you can boot
> memtest if you haven't already done so. If that's not readily
> available, try the built-in kernel memtest functions (included with
> late-model 2.6 kernels), though I understand their tests are less
> conclusive than bare-metal memtest86.

I've not run a memtest on the (currently failing) hardware (yet), due to the extended downtime that would be incurred (which is even less acceptable than regular reboots for the systems running on the hosting system), but I've had the provider swap all components (except the HDs) three times so far to try to exclude hardware-related issues (i.e., I've had three different mobos of the same make, with different RAM [also from different vendors, Samsung and Kingston], different [Intel] CPUs in the same family, different PSUs, different Adaptec HW-RAID controllers).

That's why my current top suspect is either a datacenter-related issue (failing power, broken mains, something "funky" like that) or a Xen-related issue; I'd simply like to make sure that I can exclude the latter (i.e., confirm that the system really "bombs" hard, and that this is not some problem between the board's ACPI or the like and Xen which the hypervisor does diagnose and which causes it to panic). I'm not using PCI passthrough or any other "modern" virtualization technology besides HVM - which makes this implausible (in my "world"), though.

I still have access to one of the former machines (which were swapped) which also hung regularly (but I couldn't cause it to hang with the "laboratory" setup I mentioned in the original mail); I'll check whether memtest reports anything on that machine.

Thanks for the hint!

-- --- Heiko.
Tom Snowhorn
2012-Mar-28 13:03 UTC
Re: Debugging (possible) Xen-related hang-issues without the possibility of attaching serial console to capture Xen output
Got it. I've seen hundreds, if not thousands, of Kingston sticks and never encountered a bad one, let alone 3 times in a row from a reputable manufacturer. I'm guessing your problem is not confined to the box's hardware itself.

I'm having trouble "reading between the lines" of your description. What occurs with regard to the machine when this problem happens? A reboot? You mentioned the word "hang", but you also mentioned that the datacenter might be losing power or similar, so I'm guessing the machine reboots some of the time? If that's the case, are you using the "noreboot" kernel flag?

Rgds, TS
Heiko Wundram
2012-Mar-28 14:21 UTC
Re: Debugging (possible) Xen-related hang-issues without the possibility of attaching serial console to capture Xen output
On 28.03.2012 15:03, Tom Snowhorn wrote:
> Got it. I've seen hundreds, if not thousands of Kingston sticks and
> never encountered a bad one, let alone 3 times in a row from a
> reputable manufacturer. I'm guessing your problem is not confined to
> the box's hardware itself.

No, it's not confined to the specific hardware that I originally noted - as detailed in the last mail, I've had the same hangs on three similar (i.e. same mobo type, same RAM [by amount, but not make], same Adaptec RAID PCIe controller, same PSU type) but (besides the HDs, which were always carried over) different boxes. I'll not get another (different) one soon to test on; the datacenter is pretty adamant about that. ;-)

When switching boxes or as emergency measures, I've tried several combinations of Xen/Dom0 kernels (kernel: xenified 2.6.x, vanilla 3.0, vanilla 3.1.x, vanilla 3.2.6/.9; Xen: 4.0.1 up to 4.1.2, all revisions, no -RCs), and all of them fail similarly; see below for some more detail.

To detail the hardware: the board is an MSI X58 Pro-E (MS-7522) with current BIOS (I guess that's V8.15, but I don't have a chance to look now), the CPUs I tested were all i7 920+, with no disks attached to the mainboard but rather to a PCIe SATA RAID controller, an Adaptec 5405, configured as RAID-10. The network card is an Intel-based dual-port 1GBit card, also connected via PCIe. RAM is/was either Kingston or Samsung, 4GB modules, 24GB in total socketed on the mainboard in 6 banks.

> I'm having trouble "reading between the lines" of your description.
> What occurs with regard to the machine when this problem happens? A
> reboot? You mentioned the word "hang", but you also mentioned that
> the datacenter might be losing power or similar, so I'm guessing the
> machine reboots some of the time?
> If that's the case, are you using the "noreboot" kernel flag?

Sorry that I wasn't really clear in the description (I've talked too much with colleagues of mine about this, and as such probably take "info" for granted): the system does hang hard (i.e., it doesn't reboot). I've turned off console blanking for the Dom0 kernel, and the Dom0 login prompt remains visible on the IP-KVM connected to the host, but the system is otherwise "frozen", i.e. pressing Caps/Num Lock doesn't change the state of the corresponding keyboard LED (which the datacenter staff have "physically" confirmed for me), and there is no backtrace on the console, so I'm pretty sure that it's _not_ the Dom0 kernel which is panicking (at least it's not showing signs of panic).

Whether Xen panics, I can't say; that's why I tried to get the datacenter to attach a serial console to the system in question, so that I might be able to grab Xen's output and actually diagnose a possible hardware incompatibility, but as the original mail stated: I can't get the datacenter provider to attach a serial cable... Leading to my "cry for help." :-)

Why I hinted at the datacenter possibly being responsible for the hangs: I wouldn't actually be surprised (because I've seen that in our own datacenter, albeit that was very exceptional) if some voltage fluctuation due to too many systems being connected to the same circuit caused servers which were under high stress at that time to freeze. We had similar symptoms (non-reproducible "hangs") at our own datacenter when too many servers were connected to a 16A circuit and peak usage for the connected servers seemed to be just above that; moving some of them to a different circuit, without restacking them in the rack(s), cleared the hangs for all of them.

The same goes for temperature: I've not seen exceptionally high temperatures on the system sensors, but operating at 70°C for the southbridge is somewhat high in my book - a recipe for hangs.
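As a rough illustration of the circuit-overload scenario described above, the check is simple arithmetic: sum the peak power draw of every server on the circuit, divide by the mains voltage, and compare the result against the breaker rating. The sketch below makes several assumptions that are not from this thread: a 230 V supply, hypothetical per-server peak wattages, and no allowance for power factor or inrush current. Only the 16A rating comes from the description.

```python
from typing import Sequence


def peak_current_amps(peak_watts: Sequence[float], volts: float = 230.0) -> float:
    """Total current drawn if every server peaks at once (I = P / U)."""
    return sum(peak_watts) / volts


def overloaded(peak_watts: Sequence[float],
               breaker_amps: float = 16.0,
               volts: float = 230.0) -> bool:
    """True if the combined peak draw exceeds the breaker rating."""
    return peak_current_amps(peak_watts, volts) > breaker_amps


# Hypothetical example: servers peaking at 400 W each (assumed figure).
nine_servers = [400.0] * 9    # 3600 W / 230 V, below 16 A
ten_servers = [400.0] * 10    # 4000 W / 230 V, above 16 A
print(overloaded(nine_servers), overloaded(ten_servers))
```

With these assumed numbers, one additional server is enough to push the combined peak draw past the rating, which matches the "just above that" situation described.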
What I'm currently trying to do is to take Xen out of the loop (by making sure that it's not Xen that's hanging/going into "debug" mode), which would then leave a discussion with the hosting provider about rehousing the corresponding server(s). That is why a memory dump, or some form of access to Xen's memory after I reset the system to get it back up and running, would be extremely valuable.

Thanks for your help!

-- --- Heiko.

_______________________________________________
Xen-users mailing list
Xen-users@lists.xen.org
http://lists.xen.org/xen-users
Miles Fidelman
2012-Mar-28 14:46 UTC
Re: Debugging (possible) Xen-related hang-issues without the possibility of attaching serial console to capture Xen output
Two thoughts:

1. Have you thought of installing an IPMI card into your box (if available), or hooking up a serial-over-IP box like this one - http://www.lantronix.com/device-networking/external-device-servers/eds1100_eds2100.html - I've found remote access to a serial port to be invaluable.

2. I used to have a really bad crash/hang/reboot issue with my Xen installation. It turned out that the solution was to pin VMs to CPUs - otherwise there was an intermittent scheduling conflict that would simply come up once in a while and crash things. Never could actually find a trace of it in the logs, but pinning CPUs eliminated the issue. Not sure if that problem has been solved in later releases.

Miles Fidelman

-- 
In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra
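The pinning suggestion in point 2 can be sketched as a small helper that assigns each vCPU its own physical CPU and emits the corresponding `xl vcpu-pin <domain> <vcpu> <cpu>` invocations. This is an illustrative sketch only, not the configuration used in the thread: the one-to-one layout is an assumption, the guest name `guest1` is hypothetical, and on installations still using the `xm` toolstack the command name would differ.

```python
from typing import Dict, List


def plan_pinning(vcpus_per_domain: Dict[str, int],
                 physical_cpus: int) -> List[List[str]]:
    """Return one `xl vcpu-pin` argument list per vCPU, pinned one-to-one.

    Physical CPUs are handed out in order, starting at 0, so no two
    vCPUs share a physical CPU. Raises ValueError if there are more
    vCPUs than physical CPUs.
    """
    commands: List[List[str]] = []
    next_cpu = 0
    for domain, vcpu_count in vcpus_per_domain.items():
        for vcpu in range(vcpu_count):
            if next_cpu >= physical_cpus:
                raise ValueError("not enough physical CPUs for one-to-one pinning")
            commands.append(["xl", "vcpu-pin", domain, str(vcpu), str(next_cpu)])
            next_cpu += 1
    return commands


# Hypothetical layout: Dom0 with 2 vCPUs and one guest with 2 vCPUs on 8 CPUs.
for cmd in plan_pinning({"Domain-0": 2, "guest1": 2}, physical_cpus=8):
    print(" ".join(cmd))
```

The helper only prints the commands; running them (for example via `subprocess.run(cmd, check=True)`) is left to the operator, so the plan can be reviewed first.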
Heiko Wundram
2012-Mar-28 14:56 UTC
Re: Debugging (possible) Xen-related hang-issues without the possibility of attaching serial console to capture Xen output
On 28.03.2012 16:46, Miles Fidelman wrote:
> 1. Have you thought of installing an IPMI card into your box (if
> available), or hooking up a serial-over-IP box like this one -
> http://www.lantronix.com/device-networking/external-device-servers/eds1100_eds2100.html
> - I've found remote access to a serial port to be invaluable.

I tried to do the latter, but alas, that's not possible due to the restrictions the datacenter imposes on user hardware being deployed in their housing facility. There's not much I can really do about that (except switch datacenters, which is somewhere down the road...); the replica hardware I've tested locally (which is similar and for which I can access the serial console) does not crash (or at least I can't get it to crash using the test workloads I throw at it, which should be similar to the workloads on the deployed system).

> 2. I used to have a really bad crash/hang/reboot issue with my xen
> installation. It turned out that the solution was to pin VMs to CPUs
> - otherwise there was an intermittent scheduling conflict that would
> simply come up once in a while and crash things. Never could actually
> find trace of it in logs, but pinning CPUs eliminated the issue. Not
> sure if that problem has been solved in later releases.

That's interesting, and something that I'll gladly give a try.

Thanks for the hint!

-- --- Heiko.