I have a single VM (of 11) that has a recurring problem. This image has moved from machine to machine, with the problem following it. The image has also been rebuilt from scratch, and the problem recurred. It would appear that something in the behaviour of this VM causes it to crash and causes Xend to become unhappy.

The problem presents as:

Domain crashes, becomes zombie.
xm destroy will not destroy the zombie.
xm create will not start it or any other domain (Hotplug Scripts not working).

The only solution appears to be a reboot of the host machine. Stopping and restarting xend/xendomains does not solve the problem.

This particular VM is our continuous build system. It is building code pretty much all day long, and does very heavy NFS ops.

The host machine is using Xen 3.0.2 running on FC5 2.6.17-1.2174_FC5xen0 (using the yum packages). The guest OS is FC4 with 2.6.17-1.2174_FC5xenU.

The problem only presents itself on this VM. It is actually an identical copy of the other 11 VMs, all of which are development boxes using NFS. The issue appears to occur only because of the volume of work the problem image does.

One of the bits of help I need is in knowing where to get the information necessary to solve the problem. I've attached the bit of the xend.log that involves the crash and subsequent failed restarts.

# xm info
host                   : pdev0
release                : 2.6.17-1.2174_FC5xen0
version                : #1 SMP Tue Aug 8 16:26:11 EDT 2006
machine                : x86_64
nr_cpus                : 2
nr_nodes               : 1
sockets_per_node       : 2
cores_per_socket       : 1
threads_per_core       : 1
cpu_mhz                : 2390
hw_caps                : 00000000:00000000:078bfbff:e3d3fbff:00000000:00000010:00000001
total_memory           : 8128
free_memory            : 1413
xen_major              : 3
xen_minor              : 0
xen_extra              : -unstable
xen_caps               : xen-3.0-x86_64
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
cc_compile_by          : brewbuilder
cc_compile_domain      : build.redhat.com
cc_compile_date        : Tue Aug 8 15:25:03 EDT 2006

# cat /proc/cpuinfo
processor        : 0
vendor_id        : AuthenticAMD
cpu family       : 15
model            : 37
model name       : AMD Opteron(tm) Processor 250
stepping         : 1
cpu MHz          : 2390.648
cache size       : 1024 KB
fpu              : yes
fpu_exception    : yes
cpuid level      : 1
wp               : yes
flags            : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm
bogomips         : 5978.35
TLB size         : 1024 4K pages
clflush size     : 64
cache_alignment  : 64
address sizes    : 40 bits physical, 48 bits virtual
power management : ts fid vid ttp

processor        : 1
vendor_id        : AuthenticAMD
cpu family       : 15
model            : 37
model name       : AMD Opteron(tm) Processor 250
stepping         : 1
cpu MHz          : 2390.648
cache size       : 1024 KB
fpu              : yes
fpu_exception    : yes
cpuid level      : 1
wp               : yes
flags            : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm
bogomips         : 5978.35
TLB size         : 1024 4K pages
clflush size     : 64
cache_alignment  : 64
address sizes    : 40 bits physical, 48 bits virtual
power management : ts fid vid ttp

(The machine has been bounced since, because I had to get the image back in service, so I don't know how useful this will be.)

# xm dmesg
 __  __            _____  ___                     _        _     _
 \ \/ /___ _ __   |___ / / _ \    _   _ _ __  ___| |_ __ _| |__ | | ___
  \  // _ \ '_ \    |_ \| | | |__| | | | '_ \/ __| __/ _` | '_ \| |/ _ \
  /  \  __/ | | |  ___) | |_| |__| |_| | | | \__ \ || (_| | |_) | |  __/
 /_/\_\___|_| |_| |____(_)___/    \__,_|_| |_|___/\__\__,_|_.__/|_|\___|

 http://www.cl.cam.ac.uk/netos/xen
 University of Cambridge Computer Laboratory

 Xen version 3.0-unstable (brewbuilder@build.redhat.com) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) Tue Aug 8 15:25:03 EDT 2006
 Latest ChangeSet: unavailable

(XEN) Command line: /boot/xen.gz-2.6.17-1.2174_FC5
(XEN) Physical RAM map:
(XEN)  0000000000000000 - 000000000009a000 (usable)
(XEN)  000000000009a000 - 00000000000a0000 (reserved)
(XEN)  00000000000d0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000fbf70000 (usable)
(XEN)  00000000fbf70000 - 00000000fbf77000 (ACPI data)
(XEN)  00000000fbf77000 - 00000000fbf80000 (ACPI NVS)
(XEN)  00000000fbf80000 - 00000000fc000000 (reserved)
(XEN)  00000000fec00000 - 00000000fec00400 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000fff80000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000200000000 (usable)
(XEN) System RAM: 8127MB (8322088kB)
(XEN) Xen heap: 13MB (14020kB)
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) found SMP MP-table at 000f7de0
(XEN) DMI present.
(XEN) Using APIC driver default
(XEN) ACPI: RSDP (v002 PTLTD ) @ 0x00000000000f7db0
(XEN) ACPI: XSDT (v001 PTLTD XSDT 0x06040000 LTP 0x00000000) @ 0x00000000fbf74bd4
(XEN) ACPI: FADT (v003 SUN V20z 0x06040000 PTEC 0x000f4240) @ 0x00000000fbf76c0c
(XEN) ACPI: HPET (v001 Sun V20z 0x06040000 PTEC 0x00000000) @ 0x00000000fbf76d00
(XEN) ACPI: MADT (v001 PTLTD APIC 0x06040000 LTP 0x00000000) @ 0x00000000fbf76d38
(XEN) ACPI: SPCR (v001 PTLTD $UCRTBL$ 0x06040000 PTL 0x00000001) @ 0x00000000fbf76dae
(XEN) ACPI: SSDT (v001 SUN V20z 0x06040000 LTP 0x00000001) @ 0x00000000fbf76dfe
(XEN) ACPI: SSDT (v001 SUN V20z 0x06040000 LTP 0x00000001) @ 0x00000000fbf76e9b
(XEN) ACPI: SRAT (v001 SUN V20z 0x06040000 SUN 0x00000001) @ 0x00000000fbf76f38
(XEN) ACPI: DSDT (v001 Sun V20z 0x06040000 MSFT 0x0100000e) @ 0x0000000000000000
(XEN) ACPI: Local APIC address 0xfee00000
(XEN) ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
(XEN) Processor #0 15:5 APIC version 16
(XEN) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
(XEN) Processor #1 15:5 APIC version 16
(XEN) ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
(XEN) ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
(XEN) IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
(XEN) ACPI: IOAPIC (id[0x03] address[0xfd000000] gsi_base[24])
(XEN) IOAPIC[1]: apic_id 3, version 17, address 0xfd000000, GSI 24-27
(XEN) ACPI: IOAPIC (id[0x04] address[0xfd001000] gsi_base[28])
(XEN) IOAPIC[2]: apic_id 4, version 17, address 0xfd001000, GSI 28-31
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
(XEN) ACPI: IRQ0 used by override.
(XEN) ACPI: IRQ2 used by override.
(XEN) Enabling APIC mode: Flat. Using 3 I/O APICs
(XEN) ACPI: HPET id: 0x102282a0 base: 0xfed00000
(XEN) Using ACPI (MADT) for SMP configuration information
(XEN) Initializing CPU#0
(XEN) Detected 2390.648 MHz processor.
(XEN) CPU0: AMD Flush Filter disabled
(XEN) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
(XEN) CPU: L2 Cache: 1024K (64 bytes/line)
(XEN) Intel machine check architecture supported.
(XEN) Intel machine check reporting enabled on CPU#0.
(XEN) CPU0: AMD Opteron(tm) Processor 250 stepping 01
(XEN) Booting processor 1/1 eip 90000
(XEN) Initializing CPU#1
(XEN) CPU1: AMD Flush Filter disabled
(XEN) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
(XEN) CPU: L2 Cache: 1024K (64 bytes/line)
(XEN) AMD: Disabling C1 Clock Ramping Node #0
(XEN) AMD: Disabling C1 Clock Ramping Node #1
(XEN) Intel machine check architecture supported.
(XEN) Intel machine check reporting enabled on CPU#1.
(XEN) CPU1: AMD Opteron(tm) Processor 250 stepping 01
(XEN) Total of 2 processors activated.
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using new ACK method
(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=0 pin2=0
(XEN) checking TSC synchronization across 2 CPUs: passed.
(XEN) Platform timer is 14.318MHz HPET
(XEN) Brought up 2 CPUs
(XEN) Machine check exception polling timer started.
(XEN) *** LOADING DOMAIN 0 ***
(XEN) Domain 0 kernel supports features = { 0000000f }.
(XEN) Domain 0 kernel requires features = { 00000000 }.
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   000000000e000000->0000000010000000 (2010971 pages to be allocated)
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff80200000->ffffffff80619108
(XEN)  Init. ramdisk: ffffffff8061a000->ffffffff808db000
(XEN)  Phys-Mach map: ffffffff808db000->ffffffff81842ad8
(XEN)  Start info:    ffffffff81843000->ffffffff81844000
(XEN)  Page tables:   ffffffff81844000->ffffffff81855000
(XEN)  Boot stack:    ffffffff81855000->ffffffff81856000
(XEN)  TOTAL:         ffffffff80000000->ffffffff81c00000
(XEN)  ENTRY ADDRESS: ffffffff80200000
(XEN) Dom0 has maximum 2 VCPUs
(XEN) Initrd len 0x2c1000, start at 0xffffffff8061a000
(XEN) Scrubbing Free RAM: ..................................................................................done.
(XEN) Xen trace buffers: disabled
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen).

---

Any help would be appreciated.

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users

Petersson, Mats
2006-Oct-26 16:37 UTC
RE: [Xen-users] Domain Crash and Xend can't restart
> -----Original Message-----
> From: xen-users-bounces@lists.xensource.com
> [mailto:xen-users-bounces@lists.xensource.com] On Behalf Of
> Mike Lemoine
> Sent: 26 October 2006 17:16
> To: Xen-users
> Subject: [Xen-users] Domain Crash and Xend can't restart
>
> I have a single VM (of 11) that has a recurring problem. This image has
> moved from machine to machine, with the problem following it. This image
> has been rebuilt from scratch, and the problem recurred. It would appear
> that there is something in the behaviour of this VM which causes it to crash
> and causes Xend to become unhappy.

Do you have a log from an attempt to boot it once it's been "broken"? I think that would be useful.

I would also suggest that you try the latest stable Xen (3.0.3), because there have been quite a few changes since August...

[Snip logs etc]

--
Mats

On 26 Oct 2006 at 10:15, Mike Lemoine wrote:

[...]
> # xm dmesg
>  __  __            _____  ___                     _        _     _
>  \ \/ /___ _ __   |___ / / _ \    _   _ _ __  ___| |_ __ _| |__ | | ___
>   \  // _ \ '_ \    |_ \| | | |__| | | | '_ \/ __| __/ _` | '_ \| |/ _ \
>   /  \  __/ | | |  ___) | |_| |__| |_| | | | \__ \ || (_| | |_) | |  __/
>  /_/\_\___|_| |_| |____(_)___/    \__,_|_| |_|___/\__\__,_|_.__/|_|\___|
>
>  http://www.cl.cam.ac.uk/netos/xen
>  University of Cambridge Computer Laboratory
>
>  Xen version 3.0-unstable (brewbuilder@build.redhat.com) (gcc version 4.1.1
> 20060525 (Red Hat 4.1.1-1)) Tue Aug 8 15:25:03 EDT 2006
>  Latest ChangeSet: unavailable

The version shipped with SLES10 at least has no "unstable":

 __  __            _____  ___   ____      ___   ___ _____ _  _   ___      ___  _____
 \ \/ /___ _ __   |___ / / _ \ |___ \    / _ \ / _ \___  | || | / _ \    / _ \|___  |
  \  // _ \ '_ \    |_ \| | | |  __) |  | | | | (_) | / /| || || (_) |__| | | |  / /
  /  \  __/ | | |  ___) | |_| | / __/   | |_| |\__, |/ / |__   _\__, |__| |_| | / /
 /_/\_\___|_| |_| |____(_)___(_)_____|___\___/   /_//_/     |_|   /_/    \___(_)_/
                                    |_____|

 http://www.cl.cam.ac.uk/netos/xen
 University of Cambridge Computer Laboratory

 Xen version 3.0.2_09749-0.7 (abuild@suse.de) (gcc version 4.1.0 (SUSE Linux)) Thu Jul 20 04:32:25 UTC 2006
 Latest ChangeSet: 09749
[...]

Maybe try your offending VM with SLES10.

Regards,
Ulrich

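The "unstable" in the banner comes from the same version fields that `xm info` reports, so the build can be checked without reading the boot banner. A minimal sketch, assuming `xm info` prints one `key : value` pair per line as in the original post; the helper names `parse_xm_info` and `xen_version` are made up for illustration, and the sample lines are copied from that post.

```python
def parse_xm_info(text):
    """Parse 'key : value' lines from `xm info` output into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only: values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def xen_version(info):
    """Join the three version fields (hypothetical helper, not an xm feature)."""
    return f"{info['xen_major']}.{info['xen_minor']}{info['xen_extra']}"


# Sample lines copied from the `xm info` output in the original post.
SAMPLE = """\
release                : 2.6.17-1.2174_FC5xen0
version                : #1 SMP Tue Aug 8 16:26:11 EDT 2006
xen_major              : 3
xen_minor              : 0
xen_extra              : -unstable
xen_caps               : xen-3.0-x86_64
"""

info = parse_xm_info(SAMPLE)
version = xen_version(info)
print("Xen version:", version)
print("Looks like a development build:", "unstable" in info["xen_extra"])
```

On a live host the same functions could be fed the captured output of `xm info` instead of the sample string.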
Happened again. While the problem VM is useless, the console of an idle VM I keep on the machine (for exactly this purpose) had this in its buffer:

BUG: sleeping function called from invalid context at kernel/workqueue.c:270
in_atomic():0, irqs_disabled():1

Call Trace: <ffffffff80294742>{flush_workqueue+26}
       <ffffffff80370914>{blkif_free+67}
       <ffffffff80371135>{blkfront_resume+37}
       <ffffffff8036d62c>{xenbus_exists+33}
       <ffffffff8029767d>{keventd_create_kthread+0}
       <ffffffff8036e724>{resume_dev+88}
       <ffffffff8036a265>{__do_suspend+0}
       <ffffffff8036e6cc>{resume_dev+0}
       <ffffffff80363f23>{bus_for_each_dev+67}
       <ffffffff8036a265>{__do_suspend+0}
       <ffffffff8036e63e>{xenbus_resume+37}
       <ffffffff8036a7ce>{__do_suspend+1385}
       <ffffffff80282af9>{__wake_up_common+62}
       <ffffffff8029767d>{keventd_create_kthread+0}
       <ffffffff8029767d>{keventd_create_kthread+0}
       <ffffffff80237f3f>{kthread+212}
       <ffffffff8026721a>{child_rip+8}
       <ffffffff8029767d>{keventd_create_kthread+0}
       <ffffffff80237e6b>{kthread+0}
       <ffffffff80267212>{child_rip+0}

On 10/26/06 10:37 AM, "Petersson, Mats" <Mats.Petersson@amd.com> wrote:

>> -----Original Message-----
>> From: xen-users-bounces@lists.xensource.com
>> [mailto:xen-users-bounces@lists.xensource.com] On Behalf Of
>> Mike Lemoine
>> Sent: 26 October 2006 17:16
>> To: Xen-users
>> Subject: [Xen-users] Domain Crash and Xend can't restart
>>
>> I have a single VM (of 11) that has a recurring problem. This image has
>> moved from machine to machine, with the problem following it. This image
>> has been rebuilt from scratch, and the problem recurred. It would appear
>> that there is something in the behaviour of this VM which causes it to crash
>> and causes Xend to become unhappy.
>
> Do you have a log from an attempt to boot it once it's been "broken"? I
> think that would be useful.
>
> I would also suggest that you try the latest stable Xen (3.0.3), because
> there has been quite a few changes since August...
>
> [Snip logs etc]
>
> --
> Mats

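For a bug report it can help to reduce the trace above to a plain list of function names. A rough sketch in Python, assuming every entry has the `<address>{symbol+offset}` shape seen in this trace; `parse_trace` is a made-up helper name. Treating `+0` entries as stale function pointers left on the stack rather than real return addresses is a heuristic assumed here, not something the kernel output states.

```python
import re

# One trace entry looks like: <ffffffff80294742>{flush_workqueue+26}
ENTRY_RE = re.compile(r"<([0-9a-f]+)>\{(\w+)\+(\d+)\}")

# The call trace from the console buffer, copied from the message above.
TRACE = """
<ffffffff80294742>{flush_workqueue+26} <ffffffff80370914>{blkif_free+67}
<ffffffff80371135>{blkfront_resume+37} <ffffffff8036d62c>{xenbus_exists+33}
<ffffffff8029767d>{keventd_create_kthread+0} <ffffffff8036e724>{resume_dev+88}
<ffffffff8036a265>{__do_suspend+0} <ffffffff8036e6cc>{resume_dev+0}
<ffffffff80363f23>{bus_for_each_dev+67} <ffffffff8036a265>{__do_suspend+0}
<ffffffff8036e63e>{xenbus_resume+37} <ffffffff8036a7ce>{__do_suspend+1385}
<ffffffff80282af9>{__wake_up_common+62} <ffffffff8029767d>{keventd_create_kthread+0}
<ffffffff8029767d>{keventd_create_kthread+0} <ffffffff80237f3f>{kthread+212}
<ffffffff8026721a>{child_rip+8} <ffffffff8029767d>{keventd_create_kthread+0}
<ffffffff80237e6b>{kthread+0} <ffffffff80267212>{child_rip+0}
"""


def parse_trace(text):
    """Return (address, symbol, offset) tuples in the order they appear."""
    return [(int(addr, 16), symbol, int(offset))
            for addr, symbol, offset in ENTRY_RE.findall(text)]


frames = parse_trace(TRACE)

# Heuristic (an assumption): keep only entries with a non-zero offset.
likely = [symbol for _, symbol, offset in frames if offset > 0]

# Print outermost caller first, innermost (where the warning fired) last.
for symbol in reversed(likely):
    print(symbol)
```

Read this way, the warning fires in flush_workqueue, reached through blkif_free and blkfront_resume, with __do_suspend and xenbus_resume further out, which suggests the guest's suspend/resume path.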
Petersson, Mats
2006-Oct-27 17:20 UTC
RE: [Xen-users] Domain Crash and Xend can't restart
Seems like something you should put into bugzilla. "sleeping function called from ..." is definitely broken code: it means that some code tried to put the current thread to sleep while in a mode where you can't go to sleep (it is not a good idea to sleep in the kernel when interrupts are disabled).

--
Mats

> -----Original Message-----
> From: Mike Lemoine [mailto:mlemoine@FireEye.com]
> Sent: 27 October 2006 18:13
> To: Petersson, Mats; Xen-users
> Subject: Re: [Xen-users] Domain Crash and Xend can't restart
>
> Happened again. While the problem VM is useless, connecting to the console
> of an idle VM I had on the machine (for exactly this purpose) had this in
> the buffer:
>
> BUG: sleeping function called from invalid context at kernel/workqueue.c:270
> in_atomic():0, irqs_disabled():1
>
> Call Trace: <ffffffff80294742>{flush_workqueue+26}
>        <ffffffff80370914>{blkif_free+67}
>        <ffffffff80371135>{blkfront_resume+37}
>        <ffffffff8036d62c>{xenbus_exists+33}
>        <ffffffff8029767d>{keventd_create_kthread+0}
>        <ffffffff8036e724>{resume_dev+88}
>        <ffffffff8036a265>{__do_suspend+0}
>        <ffffffff8036e6cc>{resume_dev+0}
>        <ffffffff80363f23>{bus_for_each_dev+67}
>        <ffffffff8036a265>{__do_suspend+0}
>        <ffffffff8036e63e>{xenbus_resume+37}
>        <ffffffff8036a7ce>{__do_suspend+1385}
>        <ffffffff80282af9>{__wake_up_common+62}
>        <ffffffff8029767d>{keventd_create_kthread+0}
>        <ffffffff8029767d>{keventd_create_kthread+0}
>        <ffffffff80237f3f>{kthread+212}
>        <ffffffff8026721a>{child_rip+8}
>        <ffffffff8029767d>{keventd_create_kthread+0}
>        <ffffffff80237e6b>{kthread+0}
>        <ffffffff80267212>{child_rip+0}

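The two numbers printed under the BUG line are the inputs to the kernel's debug check for sleeping in the wrong context. A simplified model in Python, assuming the check only looks at those two flags; the real kernel check has further conditions (for example rate limiting) that are left out, and the function name `might_sleep_warning` is made up for this illustration.

```python
def might_sleep_warning(in_atomic, irqs_disabled, where):
    """Simplified model of the kernel's might_sleep() debug check.

    Returns the warning text if sleeping is not allowed in this context,
    or None if sleeping would be fine.
    """
    if in_atomic or irqs_disabled:
        return (
            f"BUG: sleeping function called from invalid context at {where}\n"
            f"in_atomic():{int(in_atomic)}, irqs_disabled():{int(irqs_disabled)}"
        )
    return None


# Values taken from the console output in this thread:
# not atomic, but interrupts are disabled.
report = might_sleep_warning(False, True, "kernel/workqueue.c:270")
print(report)

# With interrupts enabled and no atomic section, nothing is reported.
quiet = might_sleep_warning(False, False, "kernel/workqueue.c:270")
print(quiet)
```

In the reported case only the second flag is set, which fits the explanation above: the code path reached a function that may sleep while interrupts were disabled.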