Tomasz Wroblewski
2013-Aug-26 09:17 UTC
[PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
Jan Beulich
2013-Aug-26 11:17 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
>>> On 26.08.13 at 11:17, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> - fix occasional xen boot hang whilst using PCI uart. Dom0 kernel disables ioport responses
>   during PCI system initialisation, causing xen hang if __ns16550_poll() routine happens to
>   be scheduled during that time. Detect and exit. Amended ns16550_ioport_invalid function
>   to only check IER register, which contains three reserved (always 0) bits, therefore
>   it's sufficient for this test.

And this was observed with 4.4-unstable? I'm asking because I would at a first glance have thought that taking care of this ought to be a desirable side effect of calling pci_hide_device().

> +static int ns16550_ioport_invalid(struct ns16550 *uart)
> +{
> +    return (((unsigned char)ns_read_reg(uart, UART_IER)) == 0xff);
> +}

Why checking just one register is sufficient when originally

> -static int ns16550_ioport_invalid(struct ns16550 *uart)
> -{
> -    return ((((unsigned char)ns_read_reg(uart, UART_LSR)) == 0xff) &&
> -            (((unsigned char)ns_read_reg(uart, UART_MCR)) == 0xff) &&
> -            (((unsigned char)ns_read_reg(uart, UART_IER)) == 0xff) &&
> -            (((unsigned char)ns_read_reg(uart, UART_IIR)) == 0xff) &&
> -            (((unsigned char)ns_read_reg(uart, UART_LCR)) == 0xff));
> -}

we checked five also needs some better explanation.

Jan
Tomasz Wroblewski
2013-Aug-26 11:39 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
On 08/26/2013 01:17 PM, Jan Beulich wrote:
>>>> On 26.08.13 at 11:17, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> - fix occasional xen boot hang whilst using PCI uart. Dom0 kernel disables ioport responses
>>   during PCI system initialisation, causing xen hang if __ns16550_poll() routine happens to
>>   be scheduled during that time. Detect and exit. Amended ns16550_ioport_invalid function
>>   to only check IER register, which contains three reserved (always 0) bits, therefore
>>   it's sufficient for this test.
> And this was observed with 4.4-unstable? I'm asking because I
> would at a first glance have thought that taking care of this
> ought to be a desirable side effect of calling pci_hide_device().

This was observed with stable 4.3 - it seems to be doing the pci_hide_device as well, so I don't think this affects, or was it bugfixed later? I'm not entirely sure how pci_hide_device is supposed to work though - in my dom0, on 4.3, I am seeing the pci serial card used by xen console, so maybe it is bugged? (or I misunderstand it).

>> +static int ns16550_ioport_invalid(struct ns16550 *uart)
>> +{
>> +    return (((unsigned char)ns_read_reg(uart, UART_IER)) == 0xff);
>> +}
> Why checking just one register is sufficient when originally
>
>> -static int ns16550_ioport_invalid(struct ns16550 *uart)
>> -{
>> -    return ((((unsigned char)ns_read_reg(uart, UART_LSR)) == 0xff) &&
>> -            (((unsigned char)ns_read_reg(uart, UART_MCR)) == 0xff) &&
>> -            (((unsigned char)ns_read_reg(uart, UART_IER)) == 0xff) &&
>> -            (((unsigned char)ns_read_reg(uart, UART_IIR)) == 0xff) &&
>> -            (((unsigned char)ns_read_reg(uart, UART_LCR)) == 0xff));
>> -}
> we checked five also needs some better explanation.

I believe it's enough to test IER register since it contains 3 reserved bits which are always 0 during normal operation, therefore the condition will never hit then. Made this as a mini optimisation since this function would now be called more frequently.

> Jan
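The argument for the single-register test can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: `struct fake_uart`, `fake_read_reg()` and the two `ioport_invalid_*` helpers are hypothetical stand-ins, not the Xen driver's code, and the model simply assumes that a device with I/O decoding disabled reads back as 0xff. Only the register offsets are the standard 16550 ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard 16550 register offsets (relative to the port base). */
#define UART_IER 0x01 /* interrupt enable */
#define UART_IIR 0x02 /* interrupt identification */
#define UART_LCR 0x03 /* line control */
#define UART_MCR 0x04 /* modem control */
#define UART_LSR 0x05 /* line status */

#define FAKE_NREGS 8

/* Hypothetical model of a UART: a register file plus a flag saying
 * whether the device currently responds to I/O port accesses. */
struct fake_uart {
    uint8_t regs[FAKE_NREGS];
    bool io_enabled;
};

/* Assumption of the model: an undecoded port reads as all ones. */
uint8_t fake_read_reg(const struct fake_uart *uart, unsigned int reg)
{
    assert(reg < FAKE_NREGS);
    return uart->io_enabled ? uart->regs[reg] : 0xff;
}

/* Single-register variant: if the upper IER bits always read 0 on a
 * live device, IER can never legitimately read 0xff. */
bool ioport_invalid_ier_only(const struct fake_uart *uart)
{
    return fake_read_reg(uart, UART_IER) == 0xff;
}

/* Five-register variant, mirroring the check that the patch removed. */
bool ioport_invalid_five_regs(const struct fake_uart *uart)
{
    return fake_read_reg(uart, UART_LSR) == 0xff &&
           fake_read_reg(uart, UART_MCR) == 0xff &&
           fake_read_reg(uart, UART_IER) == 0xff &&
           fake_read_reg(uart, UART_IIR) == 0xff &&
           fake_read_reg(uart, UART_LCR) == 0xff;
}
```

In this model both variants agree whenever IER's upper bits read 0 on a live device, which is the premise stated in the message above; the single-register form just needs one port read instead of five.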
Jan Beulich
2013-Aug-26 12:54 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
>>> On 26.08.13 at 13:39, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> On 08/26/2013 01:17 PM, Jan Beulich wrote:
>>>>> On 26.08.13 at 11:17, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>>> - fix occasional xen boot hang whilst using PCI uart. Dom0 kernel disables ioport responses
>>>   during PCI system initialisation, causing xen hang if __ns16550_poll() routine happens to
>>>   be scheduled during that time. Detect and exit. Amended ns16550_ioport_invalid function
>>>   to only check IER register, which contains three reserved (always 0) bits, therefore
>>>   it's sufficient for this test.
>> And this was observed with 4.4-unstable? I'm asking because I
>> would at a first glance have thought that taking care of this
>> ought to be a desirable side effect of calling pci_hide_device().
> This was observed with stable 4.3 - it seems to be doing the
> pci_hide_device as well, so I don't think this affects, or was it
> bugfixed later? I'm not entirely sure how pci_hide_device is supposed to
> work though - in my dom0, on 4.3, I am seeing the pci serial card used
> by xen console, so maybe it is bugged? (or I misunderstand it).

Wait, yes, pci_ro_device() is what would be needed to drop Dom0 writes to the device's config space. But we don't want this if at all possible, as there may be other devices (more serial ports and/or one or more parallel ports) on the same card, and we want to allow Dom0 to drive those.

Nevertheless, the approach of your patch in simply giving up the device (even if only temporarily) looks questionable to me. We'd rather need to restore full access to it I would think. But yes, this hypervisor and Dom0 playing with the same device is sort of a gray area.

>>> +static int ns16550_ioport_invalid(struct ns16550 *uart)
>>> +{
>>> +    return (((unsigned char)ns_read_reg(uart, UART_IER)) == 0xff);
>>> +}
>> Why checking just one register is sufficient when originally
>>
>>> -static int ns16550_ioport_invalid(struct ns16550 *uart)
>>> -{
>>> -    return ((((unsigned char)ns_read_reg(uart, UART_LSR)) == 0xff) &&
>>> -            (((unsigned char)ns_read_reg(uart, UART_MCR)) == 0xff) &&
>>> -            (((unsigned char)ns_read_reg(uart, UART_IER)) == 0xff) &&
>>> -            (((unsigned char)ns_read_reg(uart, UART_IIR)) == 0xff) &&
>>> -            (((unsigned char)ns_read_reg(uart, UART_LCR)) == 0xff));
>>> -}
>> we checked five also needs some better explanation.
> I believe it's enough to test IER register since it contains 3 reserved
> bits which are always 0 during normal operation, therefore the condition
> will never hit then. Made this as a mini optimisation since this
> function would now be called more frequently.

I assumed it was something like this. But that needs to be said in the patch description.

Jan
Tomasz Wroblewski
2013-Aug-26 13:25 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
>>> And this was observed with 4.4-unstable? I'm asking because I
>>> would at a first glance have thought that taking care of this
>>> ought to be a desirable side effect of calling pci_hide_device().
>> This was observed with stable 4.3 - it seems to be doing the
>> pci_hide_device as well, so I don't think this affects, or was it
>> bugfixed later? I'm not entirely sure how pci_hide_device is supposed to
>> work though - in my dom0, on 4.3, I am seeing the pci serial card used
>> by xen console, so maybe it is bugged? (or I misunderstand it).
> Wait, yes, pci_ro_device() is what would be needed to drop
> Dom0 writes to the device's config space. But we don't want
> this if at all possible, as there may be other devices (more
> serial ports and/or one or more parallel ports) on the same
> card, and we want to allow Dom0 to drive those.
>
> Nevertheless, the approach of your patch in simply giving up
> the device (even if only temporarily) looks questionable to me.
> We'd rather need to restore full access to it I would think. But
> yes, this hypervisor and Dom0 playing with the same device is
> sort of a gray area.

Restore ioport access at the start of poll routine (if not on) and disable it again at the end (if was not on)? I might do that (if you really prefer), but intuitively that seems more likely to produce side effects in dom0 kernel than skipping a poll in xen.

>>>> +static int ns16550_ioport_invalid(struct ns16550 *uart)
>>>> +{
>>>> +    return (((unsigned char)ns_read_reg(uart, UART_IER)) == 0xff);
>>>> +}
>>> Why checking just one register is sufficient when originally
>>>
>>>> -static int ns16550_ioport_invalid(struct ns16550 *uart)
>>>> -{
>>>> -    return ((((unsigned char)ns_read_reg(uart, UART_LSR)) == 0xff) &&
>>>> -            (((unsigned char)ns_read_reg(uart, UART_MCR)) == 0xff) &&
>>>> -            (((unsigned char)ns_read_reg(uart, UART_IER)) == 0xff) &&
>>>> -            (((unsigned char)ns_read_reg(uart, UART_IIR)) == 0xff) &&
>>>> -            (((unsigned char)ns_read_reg(uart, UART_LCR)) == 0xff));
>>>> -}
>>> we checked five also needs some better explanation.
>> I believe it's enough to test IER register since it contains 3 reserved
>> bits which are always 0 during normal operation, therefore the condition
>> will never hit then. Made this as a mini optimisation since this
>> function would now be called more frequently.
> I assumed it was something like this. But that needs to be said in
> the patch description.

Yeah, I did mention it in the desc though, to quote: "Amended ns16550_ioport_invalid function to only check IER register, which contains three reserved (always 0) bits, therefore it's sufficient for this test", thought that was enough.

> Jan
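The alternative floated above, re-enabling I/O decoding for the duration of a poll and putting it back afterwards, could be sketched roughly as follows. This is a toy model, not Xen code: `struct fake_pci_dev` and `poll_with_io_restored()` are hypothetical names, and the command register is a plain field rather than a real config-space access. The only facts taken as given are that the PCI command register lives at config offset 0x04 and that its bit 0 enables I/O space decoding.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND    0x04   /* config-space offset of the command register */
#define PCI_COMMAND_IO 0x0001 /* bit 0: respond to I/O space accesses */

/* Hypothetical model of the serial card: just its command register
 * and a counter of polls that actually touched the hardware. */
struct fake_pci_dev {
    uint16_t command;
    unsigned int polls_done;
};

/* Temporarily switch I/O decoding on (if it was off), poll, then
 * restore the previous state so the other owner sees no change. */
void poll_with_io_restored(struct fake_pci_dev *dev)
{
    int was_enabled = (dev->command & PCI_COMMAND_IO) != 0;

    if (!was_enabled)
        dev->command |= PCI_COMMAND_IO;

    /* The real poll body would read and write UART registers here. */
    assert(dev->command & PCI_COMMAND_IO);
    dev->polls_done++;

    if (!was_enabled)
        dev->command &= (uint16_t)~PCI_COMMAND_IO;
}
```

The sketch makes the concern in the message visible: between the two command-register writes the device decodes I/O ports that dom0 believes are switched off, which is exactly the kind of side effect being weighed against simply skipping a poll.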
Jan Beulich
2013-Aug-26 13:52 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
>>> On 26.08.13 at 15:25, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> Nevertheless, the approach of your patch in simply giving up
>> the device (even if only temporarily) looks questionable to me.
>> We'd rather need to restore full access to it I would think. But
>> yes, this hypervisor and Dom0 playing with the same device is
>> sort of a gray area.
> Restore ioport access at the start of poll routine (if not on) and
> disable it again at the end (if was not on)? I might do that (if you
> really prefer), but intuitively that seems more likely to produce side
> effects in dom0 kernel than skipping a poll in xen.

As long as it's guaranteed to only be a poll (or a few of them) being affected, this is fine. But what if an interrupt is being used?

Jan
Tomasz Wroblewski
2013-Aug-26 15:09 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
On 08/26/2013 03:52 PM, Jan Beulich wrote:
>>>> On 26.08.13 at 15:25, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>>> Nevertheless, the approach of your patch in simply giving up
>>> the device (even if only temporarily) looks questionable to me.
>>> We'd rather need to restore full access to it I would think. But
>>> yes, this hypervisor and Dom0 playing with the same device is
>>> sort of a gray area.
>> Restore ioport access at the start of poll routine (if not on) and
>> disable it again at the end (if was not on)? I might do that (if you
>> really prefer), but intuitively that seems more likely to produce side
>> effects in dom0 kernel than skipping a poll in xen.
> As long as it's guaranteed to only be a poll (or a few of them) being
> affected, this is fine. But what if an interrupt is being used?

I'm probably missing something so can you elaborate on this? Probably not what you are asking, but ns16550_interrupt function currently doesn't hang when ioports are disabled as a byproduct of the "while ( !(ns_read_reg(uart, IIR) & IIR_NOINT) )" test in there, which already causes it to break out on 0xFF regs.

> Jan
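Why the interrupt loop quoted above falls through when every register reads 0xff can be shown in isolation. The following is a minimal sketch, assuming the usual 16550 convention that bit 0 of IIR is set when no interrupt is pending; `service_interrupts()` and its array-of-reads interface are hypothetical and only imitate the shape of the loop condition, not the real handler.

```c
#include <stddef.h>
#include <stdint.h>

#define UART_IIR_NOINT 0x01 /* IIR bit 0 set: no interrupt pending */

/* Walk a recorded sequence of IIR reads the way the quoted loop
 * would: keep servicing while the "no interrupt" bit is clear.
 * Returns how many iterations did any work. */
unsigned int service_interrupts(const uint8_t *iir_reads, size_t n)
{
    unsigned int handled = 0;
    size_t i = 0;

    while (i < n && !(iir_reads[i] & UART_IIR_NOINT)) {
        handled++; /* the real handler would drain RX/TX here */
        i++;
    }
    return handled;
}
```

A read of 0xff has bit 0 set, so the condition is false on the first iteration and the loop body never runs; a sequence such as 0x04, 0x02, 0x01 would be serviced twice before the loop stops.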
Jan Beulich
2013-Aug-26 15:26 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
>>> On 26.08.13 at 17:09, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> On 08/26/2013 03:52 PM, Jan Beulich wrote:
>>>>> On 26.08.13 at 15:25, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>>>> Nevertheless, the approach of your patch in simply giving up
>>>> the device (even if only temporarily) looks questionable to me.
>>>> We'd rather need to restore full access to it I would think. But
>>>> yes, this hypervisor and Dom0 playing with the same device is
>>>> sort of a gray area.
>>> Restore ioport access at the start of poll routine (if not on) and
>>> disable it again at the end (if was not on)? I might do that (if you
>>> really prefer), but intuitively that seems more likely to produce side
>>> effects in dom0 kernel than skipping a poll in xen.
>> As long as it's guaranteed to only be a poll (or a few of them) being
>> affected, this is fine. But what if an interrupt is being used?
> I'm probably missing something so can you elaborate on this? Probably
> not what you are asking, but ns16550_interrupt function currently
> doesn't hang when ioports are disabled as a byproduct of the "while
> ( !(ns_read_reg(uart, IIR) & IIR_NOINT) )" test in there, which already
> causes it to break out on 0xFF regs.

My question was along the lines of "If I/O port access is disabled, isn't the whole driver screwed (even if only temporarily)?" And if the answer to this is "yes" (I can't see it to be "no"), dealing with this likely requires more than the change you proposed.

Jan
Tomasz Wroblewski
2013-Aug-26 16:12 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
On 08/26/2013 05:26 PM, Jan Beulich wrote:
>>>> On 26.08.13 at 17:09, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> On 08/26/2013 03:52 PM, Jan Beulich wrote:
>>>>>> On 26.08.13 at 15:25, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>>>>> Nevertheless, the approach of your patch in simply giving up
>>>>> the device (even if only temporarily) looks questionable to me.
>>>>> We'd rather need to restore full access to it I would think. But
>>>>> yes, this hypervisor and Dom0 playing with the same device is
>>>>> sort of a gray area.
>>>> Restore ioport access at the start of poll routine (if not on) and
>>>> disable it again at the end (if was not on)? I might do that (if you
>>>> really prefer), but intuitively that seems more likely to produce side
>>>> effects in dom0 kernel than skipping a poll in xen.
>>> As long as it's guaranteed to only be a poll (or a few of them) being
>>> affected, this is fine. But what if an interrupt is being used?
>> I'm probably missing something so can you elaborate on this? Probably
>> not what you are asking, but ns16550_interrupt function currently
>> doesn't hang when ioports are disabled as a byproduct of the "while
>> ( !(ns_read_reg(uart, IIR) & IIR_NOINT) )" test in there, which already
>> causes it to break out on 0xFF regs.
> My question was along the lines of "If I/O port access is disabled,
> isn't the whole driver screwed (even if only temporarily)?" And if
> the answer to this is "yes" (I can't see it to be "no"), dealing with
> this likely requires more than the change you proposed.

It could be, I only have empirical evidence of not noticing any serial out hiccups during dom0 kernel init. Since this is a small driver and it seems to primarily interact with the I/O only in ns16550_interrupt, ns16550_poll, ns16550_tx_ready, putc, getc (and in some init functions but these will only be called before dom0 boot), I thought that:

* ns16550_interrupt will be fine with IO ports disabled, it'll just exit
* ns16550_poll will be fine with the posted patch, it'll exit
* ns16550_getc looks like it has a potential of producing 0xFF characters incorrectly, so maybe would need a port test as well
* ns16550_putc should be fine since write to ioport will just be dropped
* ns16550_tx_ready should be fine, it will return 1 if ioports are disabled which is what it needs to be returning to avoid spinning in serial.c

So besides possibly the extra check in getc, not really sure what else can be done better here.

> Jan
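The getc concern in the list above can be made concrete with a small model. This is a sketch under stated assumptions, not the driver's code: `struct fake_uart`, `fake_read_reg()`, `getc_unguarded()` and `getc_guarded()` are hypothetical, the unguarded routine only imitates the usual "check data-ready, then read the receive buffer" pattern, and the model assumes a disabled port reads back 0xff. The register offsets and the LSR data-ready bit are the standard 16550 ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define UART_RBR    0x00 /* receive buffer */
#define UART_IER    0x01 /* interrupt enable */
#define UART_LSR    0x05 /* line status */
#define UART_LSR_DR 0x01 /* LSR bit 0: received data ready */

#define FAKE_NREGS 8

/* Hypothetical UART model: register file plus an I/O-decoding flag. */
struct fake_uart {
    uint8_t regs[FAKE_NREGS];
    bool io_enabled;
};

/* Assumption of the model: an undecoded port reads as all ones. */
uint8_t fake_read_reg(const struct fake_uart *uart, unsigned int reg)
{
    assert(reg < FAKE_NREGS);
    return uart->io_enabled ? uart->regs[reg] : 0xff;
}

/* Without a port test: LSR reads 0xff, so "data ready" looks set and
 * a bogus 0xff character is returned. */
bool getc_unguarded(const struct fake_uart *uart, char *pc)
{
    if (!(fake_read_reg(uart, UART_LSR) & UART_LSR_DR))
        return false;
    *pc = (char)fake_read_reg(uart, UART_RBR);
    return true;
}

/* With the extra port test suggested in the list above. */
bool getc_guarded(const struct fake_uart *uart, char *pc)
{
    if (fake_read_reg(uart, UART_IER) == 0xff)
        return false; /* port not responding: report "no character" */
    return getc_unguarded(uart, pc);
}
```

With I/O disabled the unguarded variant reports a received 0xff byte, while the guarded one reports nothing; with I/O enabled both behave identically.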
Jan Beulich
2013-Aug-27 06:55 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
>>> On 26.08.13 at 18:12, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> On 08/26/2013 05:26 PM, Jan Beulich wrote:
>> My question was along the lines of "If I/O port access is disabled,
>> isn't the whole driver screwed (even if only temporarily)?" And if
>> the answer to this is "yes" (I can't see it to be "no"), dealing with
>> this likely requires more than the change you proposed.
> It could be, I only have empirical evidence of not noticing any serial
> out hiccups during dom0 kernel init. Since this is a small driver and
> it seems to primarily interact with the I/O only in ns16550_interrupt,
> ns16550_poll, ns16550_tx_ready, putc, getc (and in some init functions
> but these will only be called before dom0 boot), I thought that:
>
> * ns16550_interrupt will be fine with IO ports disabled, it'll just exit

Ah, right, the flag being tested is a "no-interrupt-pending" one. Good.

> * ns16550_poll will be fine with the posted patch, it'll exit
> * ns16550_getc looks like it has a potential of producing 0xFF
>   characters incorrectly, so maybe would need a port test as well

Right.

> * ns16550_putc should be fine since write to ioport will just be dropped

Ideally it would of course postpone the writes, see below.

> * ns16550_tx_ready should be fine, it will return 1 if ioports are
>   disabled which is what it needs to be returning to avoid spinning in
>   serial.c

Actually, one could probably tweak this so that at least in the non-sync, non-log-everything case serial.c would prefer buffering over calling ->putc() (and hence dropping), by allowing ->tx_ready() to indicate this "disconnected" state via a special return value.

Jan
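The "special return value" idea can be sketched as a toy transmit path. Everything here is hypothetical and chosen for illustration: the `TX_DISCONNECTED` sentinel, `struct fake_port`, `fake_tx_ready()` and `fake_serial_putc()` are not names from Xen's serial.c, and the buffering policy (fixed-size buffer, overflow dropped, flush on the next successful write) is just one possible choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_DISCONNECTED (-1) /* hypothetical sentinel from tx_ready */
#define BUF_SIZE  16
#define WIRE_SIZE 64

/* Hypothetical port model: pending characters plus a record of what
 * actually reached the (simulated) hardware. */
struct fake_port {
    bool io_enabled;
    char buf[BUF_SIZE];   /* buffered characters, oldest first */
    size_t buf_len;
    char wire[WIRE_SIZE]; /* characters written to the device */
    size_t wire_len;
};

/* Returns 1 when a character can be sent, or the sentinel when the
 * device is not responding to I/O at all. */
int fake_tx_ready(const struct fake_port *p)
{
    return p->io_enabled ? 1 : TX_DISCONNECTED;
}

static void wire_emit(struct fake_port *p, char c)
{
    assert(p->wire_len < WIRE_SIZE);
    p->wire[p->wire_len++] = c;
}

/* Generic-layer putc: buffer while disconnected instead of writing
 * to a port that would silently drop the byte. */
void fake_serial_putc(struct fake_port *p, char c)
{
    if (fake_tx_ready(p) == TX_DISCONNECTED) {
        if (p->buf_len < BUF_SIZE)
            p->buf[p->buf_len++] = c; /* overflow is dropped */
        return;
    }
    /* Device is back: flush anything buffered, then send c. */
    for (size_t i = 0; i < p->buf_len; i++)
        wire_emit(p, p->buf[i]);
    p->buf_len = 0;
    wire_emit(p, c);
}
```

In this model, characters written while the port is unreachable are held back and appear in order once I/O decoding returns, rather than vanishing as they would if tx_ready simply returned 1.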
Tomasz Wroblewski
2013-Aug-27 08:52 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
On 08/27/2013 08:55 AM, Jan Beulich wrote:
>>>> On 26.08.13 at 18:12, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> On 08/26/2013 05:26 PM, Jan Beulich wrote:
>>> My question was along the lines of "If I/O port access is disabled,
>>> isn't the whole driver screwed (even if only temporarily)?" And if
>>> the answer to this is "yes" (I can't see it to be "no"), dealing with
>>> this likely requires more than the change you proposed.
>> It could be, I only have empirical evidence of not noticing any serial
>> out hiccups during dom0 kernel init. Since this is a small driver and
>> it seems to primarily interact with the I/O only in ns16550_interrupt,
>> ns16550_poll, ns16550_tx_ready, putc, getc (and in some init functions
>> but these will only be called before dom0 boot), I thought that:
>>
>> * ns16550_interrupt will be fine with IO ports disabled, it'll just exit
> Ah, right, the flag being tested is a "no-interrupt-pending" one.
> Good.
>
>> * ns16550_poll will be fine with the posted patch, it'll exit
>> * ns16550_getc looks like it has a potential of producing 0xFF
>>   characters incorrectly, so maybe would need a port test as well
> Right.
>
>> * ns16550_putc should be fine since write to ioport will just be dropped
> Ideally it would of course postpone the writes, see below.
>
>> * ns16550_tx_ready should be fine, it will return 1 if ioports are
>>   disabled which is what it needs to be returning to avoid spinning in
>>   serial.c
> Actually, one could probably tweak this so that at least in the
> non-sync, non-log-everything case serial.c would prefer buffering
> over calling ->putc() (and hence dropping), by allowing
> ->tx_ready() to indicate this "disconnected" state via a special
> return value.

Aha okay. I'm trying to come up with something along the above lines then. I think I'll split this patch into two, one for the timer list fix (which I think is not controversial) and other for the hang/port unavailable case.

> Jan
Jan Beulich
2013-Aug-27 09:01 UTC
Re: [PATCH] PCI uart: fix boot hang, and second S3 resume inactive timer list corruption
>>> On 27.08.13 at 10:52, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> I think I'll split this patch into two, one for the timer list fix
> (which I think is not controversial) and other for the hang/port
> unavailable case.

That'd be nice, yes.

Jan