xming wrote:
> The symptom is (with a lot of subjective judgment) when there is a lot (or
> too quick) output on the console of the domU (hvc0 connected with either
> "xm crea file.cfg -c" or "xm cons id") the whole PV domU hangs. It will
> really hang at random places, sometimes right after init and sometimes
> after I logged in and just generate some output (on hvc0) like "find /". IIRC
> I have never seen a hang before init.
>

OK, I misunderstood your original report to mean that something was
complaining about "too much" output. You're saying that lots of console
output seems to lock the domain.

I've had a report that heavy disk IO seems to lock up as well. Perhaps
they're both related to high event rates. Do you think you could try an
IO-intensive workload to see if you can get a similar lockup?

When the domain is locked up, what does /usr/lib/xen/bin/xenctx say?

Hm. Rather than backing out the structure-change patch, could you try
this workaround:

diff -r be3ca4e0e19e arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c  Thu Jan 17 14:25:07 2008 -0800
+++ b/arch/x86/xen/enlighten.c  Thu Jan 17 16:37:42 2008 -0800
@@ -95,7 +95,7 @@ struct shared_info *HYPERVISOR_shared_in
  *
  * 0: not available, 1: available
  */
-static int have_vcpu_info_placement = 1;
+static int have_vcpu_info_placement = 0;
 
 static void __init xen_vcpu_setup(int cpu)
 {

Reverting the structure shape could cause crashes or random data
corruption, but it has the side-effect of disabling the vcpu_info
structure placement mechanism. This patch disables it cleanly.

Thanks,
    J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
> OK, I misunderstood your original report to mean that something was
> complaining about "too much" output. You're saying that lots of console
> output seems to lock the domain.

Sorry about that, and yes that is the case.

> I've had a report that heavy disk IO seems to lock up as well. Perhaps
> they're both related to high event rates. Do you think you could try an
> IO-intensive workload to see if you can get a similar lockup?

IO-intensive locks up too (see below)

> When the domain is locked up, what does /usr/lib/xen/bin/xenctx say?

see below

> Hm. Rather than backing out the structure-change patch, could you try
> this workaround:
>
> diff -r be3ca4e0e19e arch/x86/xen/enlighten.c
> --- a/arch/x86/xen/enlighten.c  Thu Jan 17 14:25:07 2008 -0800
> +++ b/arch/x86/xen/enlighten.c  Thu Jan 17 16:37:42 2008 -0800
> @@ -95,7 +95,7 @@ struct shared_info *HYPERVISOR_shared_in
>   *
>   * 0: not available, 1: available
>   */
> -static int have_vcpu_info_placement = 1;
> +static int have_vcpu_info_placement = 0;
>
>  static void __init xen_vcpu_setup(int cpu)
>  {

First of all this patch solves the lock-ups, it works as advertised :)
The DomU works as before. Just for the record, for people trying to apply
this to 2.6.23.x: you need to change the /x86/ to /i386/, because the
unified x86 tree only exists since 2.6.24.

I tried to create 2 tests, one is IO intensive and the other is console
output intensive:

test1. bonnie++ -s 1024 -u nobody
test2. for i in `seq 1 50000`; do echo 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ; done

In all cases where it crashed (hung) there was no oops/panic.

Scenario 1 (booted 2.6.23.14 as is)
--------------------------------------------------
(but with init=/bin/bash, otherwise I couldn't get a prompt)

test1: crashed

# /usr/lib/xen/bin/xenctx 108
eip: c037c0c7
esp: c0343f90
eax: 00000000 ebx: 00000001 ecx: 00000000 edx: c0342000
esi: c0373004 edi: c1210df4 ebp: 00001b7d
cs: 00000061 ds: 0000007b fs: 000000d8 gs: 00000000

Stack:
c0100add c0378980 c0101962 c0104821 c120a000 c0378df4 c0348cff 00000025
c0348430 00000004 00009000 00006df4 00ea1000 c0363be0 c0343fe8 c03dd007
00000000 c0343fec c0349868 c0343fe0 178bc1f1 00002001 01020800 00060fb1
00000000 c03dd000 00000000 00000000

Code:
cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 06 00 00 00 cd 82 <c3> cc
cc cc cc cc cc cc cc cc cc

Call Trace:
[<c037c0c7>] <--
[<c0100add>]
[<c0378980>]
[<c0101962>]
[<c0104821>]
[<c120a000>]
[<c0378df4>]
[<c0348cff>]
[<c0348430>]
[<c0363be0>]
[<c0343fe8>]
[<c03dd007>]
[<c0343fec>]
[<c0349868>]
[<c0343fe0>]
[<178bc1f1>]
[<c03dd000>]

test2: crashed after many many retries and sometimes with strange output

00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAA
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
0000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ

# /usr/lib/xen/bin/xenctx 113
eip: c037c0c7
esp: c0343f90
eax: 00000000 ebx: 00000001 ecx: 00000000 edx: c0342000
esi: c0373004 edi: c1210df4 ebp: 00001b7d
cs: 00000061 ds: 0000007b fs: 000000d8 gs: 00000000

Stack:
c0100add c0378980 c0101962 c0104821 c120a000 c0378df4 c0348cff 00000025
c0348430 00000004 00009000 00006df4 00ea1000 c0363be0 c0343fe8 c03dd007
00000000 c0343fec c0349868 c0343fe0 178bc1f1 00002001 00020800 00060fb1
00000000 c03dd000 00000000 00000000

Code:
cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 06 00 00 00 cd 82 <c3> cc
cc cc cc cc cc cc cc cc cc

Call Trace:
[<c037c0c7>] <--
[<c0100add>]
[<c0378980>]
[<c0101962>]
[<c0104821>]
[<c120a000>]
[<c0378df4>]
[<c0348cff>]
[<c0348430>]
[<c0363be0>]
[<c0343fe8>]
[<c03dd007>]
[<c0343fec>]
[<c0349868>]
[<c0343fe0>]
[<178bc1f1>]
[<c03dd000>]

Scenario 2 (have_vcpu_info_placement = 0)
--------------------------------------------------------------

test1: no crash
test2: no crash, but occasionally I still get funny output like this

00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
0000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
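The garbled test2 lines are easier to count than to eyeball. A minimal Python sketch, assuming the console output has been captured into a list of lines; the names `classify_line` and `summarize` and the category labels are made up for illustration and are not part of any Xen tool:

```python
from collections import Counter

# The exact string echoed by test2 above.
EXPECTED = "00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ"


def classify_line(line, expected=EXPECTED):
    """Label one captured console line (labels are illustrative only)."""
    if line == expected:
        return "ok"
    # Two or more echoes run together: the newline between them was lost.
    if len(line) > len(expected) and line == expected * (len(line) // len(expected)):
        return "joined"
    # A proper prefix of the expected string: the tail is missing.
    if line and expected.startswith(line):
        return "truncated"
    # Anything else, e.g. extra or substituted characters.
    return "other"


def summarize(lines):
    """Count how many captured lines fall into each category."""
    return Counter(classify_line(line) for line in lines)
```

Feeding it a saved console log (for example `summarize(open("console.log").read().splitlines())`, where `console.log` is a hypothetical capture file) gives a per-category count that can be compared between the two scenarios and between kernels.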
xming wrote:
>> OK, I misunderstood your original report to mean that something was
>> complaining about "too much" output. You're saying that lots of console
>> output seems to lock the domain.
>>
>
> Sorry about that, and yes that is the case.
>
>
>> I've had a report that heavy disk IO seems to lock up as well. Perhaps
>> they're both related to high event rates. Do you think you could try an
>> IO-intensive workload to see if you can get a similar lockup?
>>
>
> IO-intensive locks up too (see below)
>
>
>> When the domain is locked up, what does /usr/lib/xen/bin/xenctx say?
>>
>
> see below
>
>
>> Hm. Rather than backing out the structure-change patch, could you try
>> this workaround:
>>
>> diff -r be3ca4e0e19e arch/x86/xen/enlighten.c
>> --- a/arch/x86/xen/enlighten.c  Thu Jan 17 14:25:07 2008 -0800
>> +++ b/arch/x86/xen/enlighten.c  Thu Jan 17 16:37:42 2008 -0800
>> @@ -95,7 +95,7 @@ struct shared_info *HYPERVISOR_shared_in
>>   *
>>   * 0: not available, 1: available
>>   */
>> -static int have_vcpu_info_placement = 1;
>> +static int have_vcpu_info_placement = 0;
>>
>>  static void __init xen_vcpu_setup(int cpu)
>>  {
>>
>
> First of all this patch solves the lock-ups, it works as advertised :)

OK, good. I guess events are getting lost somewhere with vcpu_info
placement.

> # /usr/lib/xen/bin/xenctx 113
>

Would it be possible to map the eip and some top parts of the stack back
to kernel symbols? Seems to be the same place in both traces, which is
interesting.

> eip: c037c0c7
> esp: c0343f90
> eax: 00000000 ebx: 00000001 ecx: 00000000 edx: c0342000
> esi: c0373004 edi: c1210df4 ebp: 00001b7d
> cs: 00000061 ds: 0000007b fs: 000000d8 gs: 00000000
>
> Stack:
> c0100add c0378980 c0101962 c0104821 c120a000 c0378df4 c0348cff 00000025
> c0348430 00000004 00009000 00006df4 00ea1000 c0363be0 c0343fe8 c03dd007
> 00000000 c0343fec c0349868 c0343fe0 178bc1f1 00002001 00020800 00060fb1
> 00000000 c03dd000 00000000 00000000
>
> Code:
> cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 06 00 00 00 cd 82 <c3> cc
> cc cc cc cc cc cc cc cc cc
>
> Call Trace:
> [<c037c0c7>] <--
> [<c0100add>]
> [<c0378980>]
> [<c0101962>]
> [<c0104821>]
> [<c120a000>]
> [<c0378df4>]
> [<c0348cff>]
> [<c0348430>]
> [<c0363be0>]
> [<c0343fe8>]
> [<c03dd007>]
> [<c0343fec>]
> [<c0349868>]
> [<c0343fe0>]
> [<178bc1f1>]
> [<c03dd000>]
>
> Scenario 2 (have_vcpu_info_placement = 0)
> --------------------------------------------------------------
>
> test1: no crash
> test2: no crash, but occasionally I still get funny output like this
>
> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 0000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>

Hm, I guess some of the output is getting dropped. Does this happen with
2.6.18-xen?

    J
On Jan 18, 2008 5:19 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> > First of all this patch solves the lock-ups, it works as advertised :)
>
> OK, good. I guess events are getting lost somewhere with vcpu_info
> placement.

> Would it be possible to map the eip and some top parts of the stack back
> to kernel symbols? Seems to be the same place in both traces, which is
> interesting.

Can you tell me how, or show me some pointers?

> > Scenario 2 (have_vcpu_info_placement = 0)
> > --------------------------------------------------------------
> >
> > test1: no crash
> > test2: no crash, but occasionally I still get funny output like this
> >
> > 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 0000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> > 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
> >
>
> Hm, I guess some of the output is getting dropped. Does this happen
> with 2.6.18-xen?

yes it does

00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
0000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ

# uname -a
Linux builder 2.6.18-xen-r8 #3 SMP Thu Dec 20 15:07:20 CET 2007 i686 AMD Athlon(tm) X2 Dual Core Processor BE-2300 AuthenticAMD GNU/Linux

cheers

xming
xming wrote:
>> Would it be possible to map the eip and some top parts of the stack back
>> to kernel symbols? Seems to be the same place in both traces, which is
>> interesting.
>>
>
> Can you tell me how, or show me some pointers?
>

Do "nm -n vmlinux" on the kernel to get an address-sorted list of
symbols, and then look to see what's near the eip (c037c0c7) and near
the top of the stack (c0100add, c0378980, c0101962, ...). Some of these
may be in data, or other strange places, but the ones which correspond
to code are interesting.

>>> Scenario 2 (have_vcpu_info_placement = 0)
>>> --------------------------------------------------------------
>>>
>>> test1: no crash
>>> test2: no crash, but occasionally I still get funny output like this
>>>
>>> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 0000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 000AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>> 00AAAAAAAAAAAAAAAAAAAAAAAAAAZZ
>>>
>>>
>> Hm, I guess some of the output is getting dropped. Does this happen
>> with 2.6.18-xen?
>>
>
> yes it does
>

OK, good. I Didn't Break It (tm) ;)

    J
On Jan 18, 2008 6:26 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> >> Would it be possible to map the eip and some top parts of the stack back
> >> to kernel symbols? Seems to be the same place in both traces, which is
> >> interesting.

> Do "nm -n vmlinux" on the kernel to get an address-sorted list of
> symbols, and then look to see what's near the eip (c037c0c7) and near
> the top of the stack (c0100add, c0378980, c0101962, ...). Some of these
> may be in data, or other strange places, but the ones which correspond
> to code are interesting.

OK, I have done some of them, but I still don't know what I should be
looking at. Do you mean code related to xen or code related to
have_vcpu_info_placement? Please be patient with me :)

I'll just paste some of the results (around those addresses) here:

c037b000 B empty_zero_page
c037c000 B hypercall_page
c037d000 B system_state

c0100a00 t xen_cpuid
c0100a80 t xen_set_debugreg
c0100a90 t xen_get_debugreg
c0100aa0 t xen_save_fl
c0100ac0 t xen_irq_disable
c0100ad0 t xen_safe_halt
c0100af0 t xen_halt
c0100b20 t xen_store_tr
c0100b30 t cvt_gate_to_trap
c0100bb0 t xen_io_delay

c0378980 D per_cpu__irq_stat
c03789c0 d per_cpu__runqueues
c0378df4 D __per_cpu_end

c01018b0 t xen_flush_tlb_single
c0101940 t xen_idle
c0101980 T xen_setup_features
c01019c0 T xen_mc_flush
c0101aa0 T xen_mc_callback

c0104710 T kernel_thread
c01047c0 T cpu_idle
c0104840 T cpu_idle_wait
c0104940 T exit_thread

c0103fe4 T xen_irq_enable_direct
c0103ff1 T xen_irq_enable_direct_reloc
c0103ff5 T xen_irq_enable_direct_end
c0103ff8 T xen_irq_disable_direct
c0104000 T xen_irq_disable_direct_end
c0104004 T xen_save_fl_direct
c0104011 T xen_save_fl_direct_end
c0104014 T xen_restore_fl_direct
c010402b T xen_restore_fl_direct_reloc

c03483f0 t maxcpus
c0348430 t unknown_bootoption
c0348610 T parse_early_param

> >> Hm, I guess some of the output is getting dropped. Does this happen
> >> with 2.6.18-xen?

> > yes it does

> OK, good. I Didn't Break It (tm) ;)

So no fix from you? :)

Thanks
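The lookup described above (find the symbol at or just below each address in the "nm -n" list) can be automated. A minimal Python sketch, using a handful of the lines pasted in this message as sample input; `parse_nm` and `nearest_symbol` are illustrative names, not an existing tool:

```python
import bisect

# A few lines copied from the nm output pasted in this message.
SAMPLE = """\
c0100ad0 t xen_safe_halt
c0100af0 t xen_halt
c0101940 t xen_idle
c0101980 T xen_setup_features
c01047c0 T cpu_idle
c0104840 T cpu_idle_wait
c037c000 B hypercall_page
c037d000 B system_state
"""


def parse_nm(text):
    """Parse `nm -n` style output into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and entries without an address
        addr, _kind, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def nearest_symbol(symbols, addr):
    """Return 'name+0xoffset' for the last symbol at or below addr."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # address is below every known symbol
    base, name = symbols[i]
    return f"{name}+{addr - base:#x}"


if __name__ == "__main__":
    syms = parse_nm(SAMPLE)
    for addr in (0xC037C0C7, 0xC0100ADD, 0xC0101962, 0xC0104821):
        print(f"{addr:08x} {nearest_symbol(syms, addr)}")
```

With this sample the eip lands in hypercall_page and the code addresses near the top of the stack land in xen_safe_halt, xen_idle and cpu_idle. Against a full vmlinux the whole "nm -n" output should be fed in, since a partial list can attribute an address to the wrong symbol, and stack words that fall in data are not return addresses.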
xming wrote:
> ok I have done some of them, but I still don't know what I should be looking
> at. Do you mean code related to xen or code related to have_vcpu_info_placement?
> Please be patient with me :)
>
> I just paste some of the result (around those addresses) here:
>

Thanks, that answers that particular question; the vcpu is blocked
waiting for something to happen, which probably means it missed the
event which was supposed to wake it up. Why is another question. At
least there's a workaround, and that workaround gives me some clue where
to look.

BTW, is it an SMP or UP domain? Does it make a difference?

>> OK, good. I Didn't Break It (tm) ;)
>>
>
> So no fix from you? :)
>

Maybe when I have nothing else to do.

    J
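One way to cross-check the "blocked in a hypercall" reading: the eip lies inside hypercall_page, which holds one small stub per hypercall number. A Python sketch, assuming the usual 32-byte stub size for the Xen hypercall page (not stated in this thread) and decoding only the three instructions visible in the "Code:" dump; the helper names are illustrative:

```python
HYPERCALL_PAGE = 0xC037C000  # from the nm output: "c037c000 B hypercall_page"
STUB_SIZE = 32               # assumption: one 32-byte stub per hypercall number
PAGE_SIZE = 4096


def hypercall_slot(eip, page=HYPERCALL_PAGE, stub_size=STUB_SIZE):
    """Return (hypercall number, byte offset inside its stub) for an eip."""
    offset = eip - page
    if not 0 <= offset < PAGE_SIZE:
        raise ValueError("eip is not inside the hypercall page")
    return divmod(offset, stub_size)


def decode_stub(code):
    """Decode 'mov $imm32,%eax; int $0x82; ret' and return imm32, else None."""
    if len(code) >= 8 and code[0] == 0xB8 and code[5:8] == bytes([0xCD, 0x82, 0xC3]):
        return int.from_bytes(code[1:5], "little")
    return None


# Bytes from the "Code:" section of the xenctx dump, starting at b8.
STUB = bytes.fromhex("b8 06 00 00 00 cd 82 c3")

if __name__ == "__main__":
    print(hypercall_slot(0xC037C0C7))
    print(decode_stub(STUB))
```

Both calculations point at hypercall number 6, with the eip on the ret that follows the int $0x82. From memory of Xen's public headers, 6 is the scheduler-operation hypercall and ebx = 1 in the register dump would be SCHEDOP_block; that mapping comes from outside this thread and should be verified against the headers of the Xen version in use.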
On Jan 20, 2008 7:37 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> xming wrote:
> > ok I have done some of them, but I still don't know what I should be looking
> > at. Do you mean code related to xen or code related to have_vcpu_info_placement?
> > Please be patient with me :)
> >
> > I just paste some of the result (around those addresses) here:
> >
>
> Thanks, that answers that particular question; the vcpu is blocked
> waiting for something to happen, which probably means it missed the
> event which was supposed to wake it up. Why is another question. At
> least there's a workaround, and that workaround gives me some clue where
> to look.

Want me to test it?

> BTW, is it an SMP or UP domain? Does it make a difference?

It doesn't matter, I tried vcpu=1 and vcpu=2, unless you want me to try
to recompile a UP kernel?

> >> OK, good. I Didn't Break It (tm) ;)
> >
> > So no fix from you? :)
>
> Maybe when I have nothing else to do.

I'll wait, or should I poke xen-devel?
xming wrote:
>> Thanks, that answers that particular question; the vcpu is blocked
>> waiting for something to happen, which probably means it missed the
>> event which was supposed to wake it up. Why is another question. At
>> least there's a workaround, and that workaround gives me some clue where
>> to look.
>>
>
> Want me to test it?
>

I'll probably look at this when my current batch of work is under
control. In the meantime, I'll submit the workaround patch to keep
people happy. The only downside is a small performance hit.

>> BTW, is it an SMP or UP domain? Does it make a difference?
>>
>
> It doesn't matter, I tried vcpu=1 and vcpu=2, unless you want me to try
> to recompile a UP kernel?
>

It would be an interesting datapoint, but I don't think it will make a
difference.

>> Maybe when I have nothing else to do.
>>
>
> I'll wait, or should I poke xen-devel?
>

Poke xen-devel.

    J