(no dice on xen-users, let's try xen-devel...)

Hi,

I'm running a HA Xen cluster, where the dom0s are crosslinked via a null modem serial cable for heartbeat redundancy. This works most of the time, but the serial connection is very unreliable, dropping characters all the time, with lots of messages like "ttyS0: 2 input overrun(s)" in dmesg. There is no such problem when running the same kernel on bare metal. The link is running at 9600 baud, so the system should easily cope, but it looks like the serial interrupt isn't serviced promptly enough under Xen. I'm running Xen 4.0.1 now with kernel 2.6.32 (stock Debian squeeze), but the problem isn't specific to this setup: Xen 3.2 with kernel 2.6.18 had much the same issue. Raising dom0's scheduling weight didn't help much (or at all), and pinning all domUs to CPU1-3 and vcpu0 of dom0 to CPU0 actually made the problem worse.

$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  1:          9          0          0          0  xen-pirq-ioapic-edge  i8042
  4:   31712520          0          0          0  xen-pirq-ioapic-edge  serial
[...]

Is there some known solution to this problem? It feels like overly large dom0 interrupt latency... maybe caused by the single-threaded hypervisor? Comments more than welcome!
--
Thanks,
Feri.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
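A rough sketch of why 9600 baud ought to be easy: with 8N1 framing every character costs 10 bits on the wire, and a 16550A-compatible UART (an assumption; the thread does not name the chip) buffers received bytes in a 16-byte FIFO. The receive trigger level of 8 bytes used below is also an assumed setting. The free FIFO space multiplied by the per-character time approximates how long the receive interrupt can wait before the UART overruns.

```python
def char_time_s(baud: int, bits_per_char: int = 10) -> float:
    """Seconds needed to shift one character over the line.

    8N1 framing = 1 start bit + 8 data bits + 1 stop bit = 10 bits per character.
    """
    return bits_per_char / baud


def overrun_budget_s(baud: int, fifo_size: int = 16, trigger_level: int = 8,
                     bits_per_char: int = 10) -> float:
    """Rough time from the RX interrupt being raised until the FIFO is full.

    Assumes the interrupt fires once `trigger_level` bytes are queued and that
    nothing drains the FIFO until the handler runs. fifo_size=16 models a
    16550A-compatible UART; trigger_level=8 is an assumed setting.
    """
    free_slots = fifo_size - trigger_level
    return free_slots * char_time_s(baud, bits_per_char)


for baud in (9600, 57600):
    print(f"{baud:>6} baud: {char_time_s(baud) * 1e3:.3f} ms per char, "
          f"about {overrun_budget_s(baud) * 1e3:.2f} ms of slack")
```

Under these assumptions a character arrives about every 1.04 ms at 9600 baud, leaving roughly 8 ms of slack; at the 57600 baud used for the console later in the thread the slack shrinks to roughly 1.4 ms.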
On Sun, Mar 20, 2011 at 12:02:08PM +0100, Ferenc Wagner wrote:
> (no dice on xen-users, let's try xen-devel...)
>
> Hi,
>
> I'm running a HA Xen cluster, where the dom0s are crosslinked via a null
> modem serial cable for heartbeat redundancy. This works most of the
> time, but the serial connection is very unreliable, dropping characters

If you use the hypervisor serial connection "console=com1 com1=11152..." and re-route the console output in Linux kernel to it (console=hvc0), does this problem disappear?

> [...]
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> writes:

> On Sun, Mar 20, 2011 at 12:02:08PM +0100, Ferenc Wagner wrote:
>
>> I'm running a HA Xen cluster, where the dom0s are crosslinked via a
>> null modem serial cable for heartbeat redundancy. This works most of
>> the time, but the serial connection is very unreliable, dropping
>> characters all the time, with lots of messages like "ttyS0: 2 input
>> overrun(s)" in dmesg.
>
> If you use the hypervisor serial connection "console=com1 com1=11152..."
> and re-route the console output in Linux kernel to it (console=hvc0),
> does this problem disappear?

Hi,

That's indeed a valid point which I forgot to discuss above, sorry. I actually use the second serial port as the Xen and Linux console: "com2=auto,8n1 console=com2L,vga" on the Xen command line and "console=hvc0 earlyprintk=xen" on the Linux command line. The first serial port is used by heartbeat (the software) as a safety (parallel) communication channel in case of a network partition. I'm not sure if heartbeat would accept /dev/hvcX as a serial device, but even now the Xen (and Linux) serial console is lossy as hell (at 57600 baud, as dictated by the DRAC management card in the machine and configured by the boot loader). However, this lossage isn't accompanied by any warning. I hope this answers your question. I also run the exact same hardware configuration without Xen, and the Linux console is perfectly solid in that case (like it should under hardware flow control). As an illustration, here's the serial output of querying ioapic info on the Xen console:

(XEN) number of MP IRQ sources: 16.
(XEN) number of IO-APIC #8 registers: 16.
(XEN) number of IO-APIC #9 registers: 16.
(XEN) number of IO-APIC #10 registers: 16.
(XEN) testing the IO APIC.......................
(XEN) IO APIC #8......
(XEN) .... register #00: 00000000
(XEN) ....... : physical APIC id: 00
(XEN) ....... : Delivery Type: 0
(XEN) ....... : LTS : 0
(XEN) .... register #01: 000F0011
(XEN) ....... : max redirection entries: 000F
(XEN) ....... : PRQ implemented: 0
(XEN) ....... : IO APIC version: 0011
(XEN) .... register #02: 00000000
(XEN) ....... : arbitration: 00
(XEN) .... IRQ redirection table:
(XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
(XEN) 00 0FF 0F 1 0 0 0 0 0 0 F0
(XEN) 01 001 01 0 0 0 0 0 1 1 28
(XEN) 02 001 01 0 0 0 0 0 1 1 30
(XEN) 03 0FF 0F 0 0 0 0 0 1 1 F2
(XEN) 04 001 01 0 0 0 0 0 1 1 38
(XEN) 05 001 01 0 0 0 0 0 1 1 40
(XEN) 06 001 01 0 0 0 0 0 1 1 48
(XEN) 07 001 01 0 0 0 0 0 1 1 50
(XEN) 08 001 01 0 0 0 0 0 1 1 58
(XEN) 09 001 01 0 0 0 0 0 1 1 60
(XEN) 0a 001 01 0 0 0 0 0 1 1 68
(XEN) 0b 001 01 0 1 0 1 0 1 1 70
(XEN) 0c 001 01 0 0 0 0 0 1 1 78
(XEN) 0d 00F 0F 1 0 0 0 0 1 1 88
(XEN) 0e 001 01 0 0 0 0 0 1 1 90
(XEN) 0f 001 01 0 0 0 0 0 1 1 98
(XEN) IO APIC #9......
(XEN) .... register #00: 00000000
(XEN) ....... : physical APIC id: 00
(XEN) ....... : Delivery Type: 0
(XEN) ....... : LTS : 0
(XEN) .... register #01: 000F0011
(XEN) ....... : max redirection entries: 000F
(XEN) ....... : PRQ implemented: 0
(XEN) ....... : IO APIC version: 0011
(XEN) .... register #02: 00000000
(XEN) ....... : arbitration: 00
(XEN) .... IRQ redirection table:
(XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
(XEN) 00 000 00 1 0 0 0 0 0 0 00
(XEN) 01 000 00 1 0 0 0 0 0 0 00
(XEN) 02 000 00 1 0 0 0 0 0 0 00
(XEN) 03 000 00 1 0 0 0 0 0 0 00
(XEN) 04 000 00 1 0 0 0 0 0 0 00
(XEN) 05 000 00 1 0 0 0 0 0 0 00
(XEN) 06 000 00 1 0 0 0 0 0 0 00
(XEN) 07 00F 0F 1 1 0 1 0 1 1 A0
(XEN) 08 000 00 1 0 0 0 0 0 0 00
(XEN) 09 000 00 1 0 0 0 0 0 0 00
(XEN) 0a 000 00 1 0 0 0 0 0 0 00
(XEN) 0b 000 00 1 0 0 0 0 0 0 00
(XEN) 0c 001 01 0 1 0 1 0 1 1 A8
(XEN) 0d 001 01 0 1 0 1 0 1 1 B8
(XN) 0e 001 01 1 0 1 0 1 1 B0(XEN) 0f 001 0 0 1 0 0 1 1 C0 (XEN) IO AP #10...... (XE .... register 0: 00000000 (X) ....... : ysical APIC id:0 (XEN) ...... : Delivery Te: 0 (XEN) ..... : LTS : 0 (XEN) .. register #01000F0011 (XEN)...... : maredirection entes: 000F XEN) ....... : IO APIC versi: 0011 (XEN) .. register #02:0000000 (XEN) ..... : arbration: 00 (XE .... IRQ redirtion table: (X) NR Log Phy Mk Trig IRR Pol at Dest Deli Ve: (XEN) 0000 00 1 0 0 0 0 0 0 00 (XEN)01 000 00 1 0 0 0 0 0 00 (N) 02 000 00 0 0 0 0 0 0 (XEN) 03 000 1 0 0 0 0 0 00 (XEN) 04 0 00 1 0 0 0 0 0 00 (XEN) 5 000 00 1 0 0 0 0 00 (X) 06 000 00 1 0 0 0 0 0 00(XEN) 07 000 0 1 0 0 0 0 0 00 (XEN) 08 0 00 1 0 0 0 0 00 (XEN) 000 00 1 0 0 0 0 0 00 XEN) 0b 000 001 0 0 0 0 0 0 00 (XEN) 0c 0000 1 0 0 0 0 0 00 (XEN) 0000 00 1 0 0 0 0 0 0 00 (XEN 0e 000 00 1 0 0 0 0 0 0 00 EN) 0f 000 00 1 0 0 0 0 0 0 (XEN) Using veor-based indexi (XEN) IRQ to n mappings: (X) IRQ240 -> 0:0(XEN) IRQ40 -> 1 (XEN) IRQ48 0:2 (XEN) IRQ2 -> 0:3 (XEN)RQ56 -> 0:4 XEN) IRQ72 -> 05 (XEN) IRQ80 -0:7 (XEN) IRQ8-> 0:8 (XEN) I96 -> 0:9 XEN) IRQ112 -> 110 (XEN) IRQ12-> 0:12 (XEN) Q136 -> 0:13 (N) IRQ144 -> 0: (XEN) IRQ152 0:15 (XEN) IR60 -> 1:7 XEN) IRQ184 -> 132 (XEN) IRQ17-> 1:14 (XEN) Q192 -> 1:15 (N) ...........................

After all, this may also result from the lack of flow control under Xen. And that's why I haven't complained about this console corruption before: there's no option for flow control amongst the Xen console settings (like the "r" suffix of the Linux console= option). But /dev/ttyS0 runs with crtscts, so flow control should be active on the heartbeat link at least. However, since CTS/RTS flow control is implemented in software, we're back to interrupt latency problems.
--
Thanks,
Feri.
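The crtscts setting mentioned for /dev/ttyS0 is the CRTSCTS bit in the c_cflag word of the port's termios settings (the same bit `stty crtscts` toggles). A minimal sketch for checking it from Python's standard `termios` module; the default device path is simply the heartbeat port named above, and the helper names are illustrative only.

```python
import os
import termios


def has_rtscts(cflag: int) -> bool:
    """Return True if RTS/CTS hardware flow control is set in a termios c_cflag."""
    return bool(cflag & termios.CRTSCTS)


def port_has_rtscts(path: str = "/dev/ttyS0") -> bool:
    """Open a serial device read-only and report whether CRTSCTS is enabled.

    O_NOCTTY keeps the port from becoming our controlling terminal and
    O_NONBLOCK avoids waiting for carrier detect. Needs read permission
    on the device node.
    """
    fd = os.open(path, os.O_RDONLY | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        # tcgetattr returns [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
        cflag = termios.tcgetattr(fd)[2]
    finally:
        os.close(fd)
    return has_rtscts(cflag)
```

On the heartbeat host, `port_has_rtscts("/dev/ttyS0")` would be expected to return True while the port is configured with crtscts as described.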
Konrad Rzeszutek Wilk
2011-Mar-22 15:27 UTC
Re: [Xen-devel] Re: dom0 serial input overruns
> the boot loader). However, this lossage isn't accompanied by any

Oooh, strange.

> warning. I hope this answers your question. I also run the exact same
> hardware configuration without Xen, and the Linux console is perfectly
> solid in that case (like it should under hardware flow control). As an

OK.

> illustration, here's the serial output of querying ioapic info on the
> Xen console:

Can you do '*' in the debug console? There are some other ones that I am curious about. Mainly, what vector the IRQ 4 in your Linux dom0 maps to. Looking at your IOAPIC output:

> (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
> (XEN) 04 001 01 0 0 0 0 0 1 1 38

it should be going to the first CPU (CPU 0), but I am not sure about the other flags... We had some issue with Xen 4.0 where IRQs below 16 would not be set correctly: they would only go to the first CPU instead of being broadcast to all of them. Look for a thread from 'M A Young' about keyboard issues.

.. If you boot just baremetal (so no Xen) and you give it 'apic=debug', it should print the IOAPIC output. Can you see what this shows? I am very curious to see if GSI 4 has similar-looking flags set (the vector value is going to be different).
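Comparing redirection-table rows (for example against the bare-metal `apic=debug` output requested above) is easier once each column is labeled. A small sketch that maps one intact `(XEN)` row onto the header's column names; it only labels the fields and does not interpret the flag bits. Rows mangled by the lossy console are rejected with an error rather than guessed at. The function name is illustrative only.

```python
from typing import Dict

# Column names exactly as printed in the "(XEN) NR Log Phy Mask ..." header.
COLUMNS = ["NR", "Log", "Phy", "Mask", "Trig", "IRR", "Pol",
           "Stat", "Dest", "Deli", "Vect"]


def parse_redir_row(line: str) -> Dict[str, int]:
    """Split one redirection-table row into named integer fields.

    All columns in the dump are hexadecimal (e.g. Vect "38", NR "0b"), so
    every field is parsed with base 16. Raises ValueError for rows with the
    wrong number of fields or non-hex tokens, as in corrupted output.
    """
    fields = line.replace("(XEN)", "", 1).split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}: {line!r}")
    return {name: int(value, 16) for name, value in zip(COLUMNS, fields)}


row = parse_redir_row("(XEN) 04 001 01 0 0 0 0 0 1 1 38")
print(row["NR"], row["Mask"], hex(row["Vect"]))  # 4 0 0x38
```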
This issue has been around for a long time. Pinning one CPU to dom0 makes things a little better, but you can still run into lost characters on serial input at anything above 14400 bps. I was fighting this back in 2007 when I was trying to get some serial-attached Point-of-Sale equipment to work with Xen. Pinning dom0 and making the qemu serial device model read more aggressively from the serial device got things reliable at 9.6 and 14.4 kbps, which was all my customer at the time needed, but any speed above that still had the issues described here. From memory, it didn't make a whole lot of difference whether you let dom0 or the hypervisor handle the serial ports.

Regards,
Trolle

On Tue, Mar 22, 2011 at 11:27 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> [...]
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> writes:

> wferi writes:
>
>> the Xen (and Linux) serial console is also lossy as hell [...]
>> However, this lossage isn't accompanied by any warning.
>
> Oooh, strange.

I have to take this back, partly. Although the bare Linux serial console is *much* more reliable (I couldn't trigger much visible corruption with a simple 'while echo " X"; do :; done' loop, as I could under Xen), even that heavily loses characters during the bootup message storm when going through the Serial-over-LAN thingie. Now I took that out of the picture entirely, using a physical serial connection instead. This made a world of difference: bootup logs are pretty much perfect now, and even the above while loop seldom produces a single wiggle (57600 baud). See http://apt.niif.hu/xen_bootup.log for a good example (the stray character before "Allocated console ring" seems fully deterministic). I'll test the same console setup under bare Linux tomorrow, maybe that won't make a single error...

Still, these (infrequent) glitches over hvc0 go unnoticed by the system, as far as I can tell.

> Can you do '*' in the debug console? There are some other ones that I
> am curious about. Mainly, what vector the IRQ 4 in your Linux dom0 maps to.
> Looking at your IOAPIC output:
>
>> (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
>> (XEN) 04 001 01 0 0 0 0 0 1 1 38
>
> it should be going to the first CPU (CPU 0), but I am not sure about
> the other flags... We had some issue with Xen 4.0 where IRQs below
> 16 would not be set correctly: they would only go to the first CPU
> instead of being broadcast to all of them.

/proc/interrupts certainly agrees with this conjecture:

           CPU0       CPU1       CPU2       CPU3
  4:     467344          0          0          0  xen-pirq-ioapic-edge  serial

> Look for a thread from 'M A Young' about keyboard issues.

Long thread, I'll read through it tomorrow. Meanwhile, please find the requested output at http://apt.niif.hu/xen_full_debug.log.

> .. If you boot just baremetal (so no Xen) and you give it 'apic=debug',
> it should print the IOAPIC output. Can you see what this shows? I am
> very curious to see if GSI 4 has similar-looking flags set (the vector
> value is going to be different).

I hope http://apt.niif.hu/apic_debug.log contains the info you need.

I doubt it's related, but on some bootups I get various errors (with different traces but always at the same point), which don't seem to harm the operation of the system. Still, maybe you can make something out of them: http://apt.niif.hu/xen_boot_errors.log.
--
Thanks,
Feri.
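Whether the serial interrupt is ever delivered anywhere other than CPU0 can be checked by reading /proc/interrupts a few times. A minimal sketch that extracts the per-CPU counters for one IRQ; it takes the file contents as a string (for example `open("/proc/interrupts").read()`), and the sample reuses the IRQ 4 line quoted above. The function name is illustrative only.

```python
from typing import List, Optional


def irq_cpu_counts(proc_interrupts: str, irq: str) -> Optional[List[int]]:
    """Return the per-CPU counters for one IRQ from /proc/interrupts text.

    The first line lists the CPU columns; each following line starts with
    "<irq>:" and then one counter per CPU, followed by chip/handler names.
    Returns None if the IRQ is not present.
    """
    lines = proc_interrupts.strip().splitlines()
    ncpus = len(lines[0].split())
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() == irq:
            return [int(tok) for tok in rest.split()[:ncpus]]
    return None


sample = """\
           CPU0       CPU1       CPU2       CPU3
  4:     467344          0          0          0  xen-pirq-ioapic-edge  serial
"""
print(irq_cpu_counts(sample, "4"))  # [467344, 0, 0, 0]
```

If every counter but the first stays at zero across repeated samples, the IRQ is being serviced on CPU0 alone, as observed here.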
Ferenc Wagner <wferi@niif.hu> writes:

> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> writes:
>
>> wferi writes:
>>
>>> the Xen (and Linux) serial console is also lossy as hell [...]
>>> However, this lossage isn't accompanied by any warning.
>>
>> Oooh, strange.
>
> I have to take this back, partly. [...] I'll test the same console
> setup under bare Linux tomorrow, maybe that won't make a single error...

Yes, testing confirms that bare Linux is even better: I couldn't notice a single missing character (vs. some glitch every couple of seconds under Xen).

> Still, these (infrequent) glitches over hvc0 go unnoticed by the system,
> as far as I can tell.

Who should notice this, after all? hvc0 itself probably not, being a virtual device. Does the Xen serial driver detect overruns?

>> Look for a thread from 'M A Young' about keyboard issues.
>
> Long thread, I'll read through it tomorrow.

There's a good chance I'm running without the fix mentioned in the conclusion (I have yet to check), and my serial interrupts really hit CPU0 alone. But my serial connection mostly works, not totally dead like in that thread.
--
Thanks,
Feri.
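Since the test loop sends the same line over and over, the "glitch every couple of seconds" comparison can be made mechanically on the receiving side instead of by eye. A sketch, assuming the output from the far end of the link has been captured to a string (for instance with a terminal program's logging); the expected line is a parameter because the exact echoed text is incidental. The function name is illustrative only.

```python
def count_glitches(captured: str, expected_line: str) -> int:
    """Count received lines that differ from the line the sender keeps repeating.

    `captured` is what arrived on the far end of the serial link while the
    sender ran a loop printing `expected_line` over and over. Dropped
    characters show up as shortened lines, and a dropped newline merges two
    lines into one. The first and last lines are skipped because the capture
    may start or stop in the middle of a line; a trailing CR (from CR/LF
    translation on the sending tty) is ignored.
    """
    lines = captured.split("\n")[1:-1]
    return sum(1 for line in lines if line.rstrip("\r") != expected_line)


sample = " X\n X\n X\nX\n X\n X X\n X\n"
print(count_glitches(sample, " X"))  # 2
```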
Konrad Rzeszutek Wilk
2011-Mar-24 11:54 UTC
Re: [Xen-devel] Re: dom0 serial input overruns
On Wed, Mar 23, 2011 at 07:57:08PM +0100, Ferenc Wagner wrote:
> Ferenc Wagner <wferi@niif.hu> writes:
> [...]
> > Still, these (infrequent) glitches over hvc0 go unnoticed by the system,
> > as far as I can tell.
>
> Who should notice this, after all? hvc0 itself probably not, being a
> virtual device. Does the Xen serial driver detect overruns?

It is not a serial driver anymore. It uses some other type of API that does not have all of the fancy serial support. No idea, actually, how it does flow control.

> > > Look for a thread from 'M A Young' about keyboard issues.
> >
> > Long thread, I'll read through it tomorrow.
>
> There's a good chance I'm running without the fix mentioned in the
> conclusion (I have yet to check), and my serial interrupts really hit CPU0
> alone. But my serial connection mostly works, not totally dead like in
> that thread.

That is good. Happy to have been able to fix your problem by just talking about it!

BTW, I haven't looked at the logs in detail; a little swamped right now.
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> writes:

> On Wed, Mar 23, 2011 at 07:57:08PM +0100, Ferenc Wagner wrote:
>
>> [...]
>> Who should notice this, after all? hvc0 itself probably not, being a
>> virtual device. Does the Xen serial driver detect overruns?
>
> It is not a serial driver anymore. It uses some other type of API that
> does not have all of the fancy serial support. No idea, actually, how it
> does flow control.

I guess you mean hvc. Yes, that surely shadows all the serial stuff which Xen does. But I'm afraid the Xen serial driver underneath has no flow control at all. At least I couldn't find it in the code last time I checked (maybe a year ago).

>>>> Look for a thread from 'M A Young' about keyboard issues.
>>>
>>> Long thread, I'll read through it tomorrow.
>>
>> There's a good chance I'm running without the fix mentioned in the
>> conclusion (I have yet to check), and my serial interrupts really hit CPU0
>> alone. But my serial connection mostly works, not totally dead like in
>> that thread.
>
> That is good. Happy to have been able to fix your problem by just
> talking about it!

Er, no, my serial overrun problem isn't fixed at all; it's still present. It just isn't a show stopper: the application copes with this amount of data loss. Still, the volume of kernel and application logs it generates (especially under load) is more than annoying.

> BTW, I haven't looked at the logs in detail; a little swamped right now.

Sure, getting dom0 support into mainline comes first, definitely. :) Thanks for pushing that along so nicely. The boot logs aren't particularly interesting, except maybe for the one with Xen and Linux errors. It's possible that it should be reported separately...
--
Thanks,
Feri.
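Besides the "input overrun(s)" lines in dmesg, the Linux serial core keeps per-port error counters, exposed in /proc/tty/driver/serial (usually readable only by root), where "oe:" is the overrun count. Sampling it periodically would show how fast overruns accumulate under load without wading through logs. A sketch, assuming the usual line format of that file (error counters appear only once they are non-zero); the function name is illustrative only.

```python
import re
from typing import Dict

# Assumed format: "<line>: uart:<type> port:... irq:... tx:N rx:N [fe:N] [pe:N] [brk:N] [oe:N] ..."
_PORT_RE = re.compile(r"^\s*(\d+):\s*uart:(\S+)")
_OE_RE = re.compile(r"\boe:(\d+)")


def overrun_counts(text: str) -> Dict[int, int]:
    """Map each detected serial line number to its overrun ("oe:") counter.

    Error counters that are still zero are omitted from the file, so a
    missing "oe:" field is taken as 0. Ports reported as uart:unknown
    (no UART detected) are skipped, as is the "serinfo:" header line.
    """
    counts: Dict[int, int] = {}
    for line in text.splitlines():
        m = _PORT_RE.match(line)
        if m is None or m.group(2) == "unknown":
            continue
        oe = _OE_RE.search(line)
        counts[int(m.group(1))] = int(oe.group(1)) if oe else 0
    return counts
```

For example, `overrun_counts(open("/proc/tty/driver/serial").read())[0]` (run as root) would give the current overrun count for ttyS0.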