Looking at cloning this logic to more properly support MCE, I see two issues
with this code:

- by using iret, the NMI is being acknowledged to the CPU, and since nothing
  was done to address its reason, I can't see why it shouldn't re-trigger
  right after that iret (unless it was sent as an IPI)
- by re-issuing it on vector 31, the resulting interrupt will have lower
  priority than any external interrupt, hence all pending interrupts will be
  serviced before getting to actually handle the NMI; ideally this should use
  the highest possible vector, but since priorities are grouped anyway, at
  least allocating the vector from the high-priority pool would seem necessary

Thanks, Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On 15/5/07 15:46, "Jan Beulich" <jbeulich@novell.com> wrote:

> - by using iret, the NMI is being acknowledged to the CPU, and since nothing
>   was done to address its reason, I can't see why it shouldn't re-trigger
>   right after that iret (unless it was sent as an IPI)

Yes, it's good enough for watchdog and oprofile. Level-triggered external
NMIs will of course be a problem. We could possibly work around this by
masking LINT1 if we are CPU0 (and, of course, if the LAPIC is enabled) and
then unmasking only at the end of the real NMI handler. And of course x86/64
doesn't have this problem at all, and practically speaking is pretty much the
only hypervisor build that vendors seem to care about.

> - by re-issuing it on vector 31, the resulting interrupt will have lower
>   priority than any external interrupt, hence all pending interrupts will be
>   serviced before getting to actually handle the NMI; ideally this should
>   use the highest possible vector, but since priorities are grouped anyway,
>   at least allocating the vector from the high-priority pool would seem
>   necessary

Yes, this is true.

 -- Keir
>>> Keir Fraser <keir@xensource.com> 15.05.07 17:00 >>>
>On 15/5/07 15:46, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> - by using iret, the NMI is being acknowledged to the CPU, and since nothing
>>   was done to address its reason, I can't see why it shouldn't re-trigger
>>   right after that iret (unless it was sent as an IPI)
>
>Yes, it's good enough for watchdog and oprofile. Level-triggered external
>NMIs will of course be a problem. We could possibly work around this by
>masking LINT1 if we are CPU0 (and, of course, if the LAPIC is enabled) and
>then unmasking only at the end of the real NMI handler. And of course x86/64
>doesn't have this problem at all, and practically speaking is pretty much the
>only hypervisor build that vendors seem to care about.

What if we removed the deferral altogether, and made the NMI handler store
into the outermost frame (after all, the selector registers have fixed places
on that frame), marking that frame accordingly so that overwriting the values
saved this way can be avoided in the interrupted save sequence? The store
would be necessary only if %ds and %es are both neither __HYPERVISOR_DS nor
null (neatly avoiding special-casing the vm86-mode entry in the outer frame),
and would add an extra branch to __SAVE_ALL_PRE, plus splitting the selector
register stores into moving %ds and %es into general purpose registers,
testing the flag the NMI or MCE handlers may set, and storing the GPRs into
the frame if the flag was clear.

Jan
On 16/5/07 09:17, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Yes, it's good enough for watchdog and oprofile. Level-triggered external
>> NMIs will of course be a problem. We could possibly work around this by
>> masking LINT1 if we are CPU0 (and, of course, if the LAPIC is enabled) and
>> then unmasking only at the end of the real NMI handler. And of course
>> x86/64 doesn't have this problem at all, and practically speaking is
>> pretty much the only hypervisor build that vendors seem to care about.
>
> What if we removed the deferral altogether, and made the NMI handler
> store into the outermost frame (after all, the selector registers have
> fixed places on that frame), marking that frame accordingly so that
> overwriting the values saved this way can be avoided in the interrupted
> save sequence (would be necessary only if both %ds and %es are neither
> __HYPERVISOR_DS nor null [neatly avoiding special-casing the vm86-mode
> entry in the outer frame], and would add an extra branch to __SAVE_ALL_PRE
> plus splitting the selector register stores into moving %ds and %es into
> general purpose registers, testing the flag NMI or MCE handlers may set,
> and storing the GPRs into the frame if the flag was clear).

It sounds a bit painful. Also it's the exit-to-guest path that is more of a
pain to deal with. In this case we may have restored a segment register by
the time we take the NMI. What do we do in this case about restoring the
segment register safely? Races in updating the GDT/LDT may mean that the
reload may still fault, even though it didn't just before; also we may need
to do work in Xen (e.g., shadow-mode stuff) in interrupts-enabled context to
fix up a #GP or #PF.

 -- Keir
>It sounds a bit painful. Also it's the exit-to-guest path that is more of a
>pain to deal with. In this case we may have restored a segment register by
>the time we take the NMI. What do we do in this case about restoring the
>segment register safely? Races in updating the GDT/LDT may mean that the
>reload may still fault, even though it didn't just before; also we may need
>to do work in Xen (e.g., shadow-mode stuff) in interrupts-enabled context to
>fix up a #GP or #PF.

Indeed. Nevertheless, for non-restartable MCEs, deferral is impossible, and
hence some mechanism would still be needed (unless we say the machine's
going down anyway in this case and we don't care about getting a proper
reason logged).

Jan
On 16/5/07 11:10, "Jan Beulich" <jbeulich@novell.com> wrote:

>> It sounds a bit painful. Also it's the exit-to-guest path that is more of
>> a pain to deal with. In this case we may have restored a segment register
>> by the time we take the NMI. What do we do in this case about restoring
>> the segment register safely? Races in updating the GDT/LDT may mean that
>> the reload may still fault, even though it didn't just before; also we may
>> need to do work in Xen (e.g., shadow-mode stuff) in interrupts-enabled
>> context to fix up a #GP or #PF.
>
> Indeed. Nevertheless, for non-restartable MCEs, deferral is impossible, and
> hence some mechanism would still be needed (unless we say the machine's
> going down anyway in this case and we don't care about getting a proper
> reason logged).

You mean, like with a #DF, that sometimes a #MC may have a bogus CS:EIP and
so you cannot IRET from it? I'm not sure how much we care about losing these
and turning a possibly-informative crash into an ugly and confusing crash.

Personally I've never seen a #MC or had one reported to me, restartable or
not. Maybe I'm lucky. :-)

 -- Keir
>>> Keir Fraser <keir@xensource.com> 16.05.07 14:32 >>>
>On 16/5/07 11:10, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>>> It sounds a bit painful. Also it's the exit-to-guest path that is more of
>>> a pain to deal with. In this case we may have restored a segment register
>>> by the time we take the NMI. What do we do in this case about restoring
>>> the segment register safely? Races in updating the GDT/LDT may mean that
>>> the reload may still fault, even though it didn't just before; also we
>>> may need to do work in Xen (e.g., shadow-mode stuff) in
>>> interrupts-enabled context to fix up a #GP or #PF.
>>
>> Indeed. Nevertheless, for non-restartable MCEs, deferral is impossible,
>> and hence some mechanism would still be needed (unless we say the
>> machine's going down anyway in this case and we don't care about getting
>> a proper reason logged).
>
>You mean, like with a #DF, that sometimes a #MC may have a bogus CS:EIP and
>so you cannot IRET from it? I'm not sure how much we care about losing these
>and turning a possibly-informative crash into an ugly and confusing crash.

Yes, that's what I mean. And I'm not so much concerned about turning a (very
rare) 'nice' crash into an 'ugly' one, but more about the fact that until
the system actually crashes it may continue to execute for a short while,
possibly making a data corruption situation worse.

>Personally I've never seen a #MC or had one reported to me, restartable or
>not. Maybe I'm lucky. :-)

I did see quite a few non-restartable ones (on native Linux), and it took me
some time to actually get the system into a state where I could see the
related messages before it rebooted. I don't have that system anymore,
though, otherwise I might have been able to use it for testing purposes
here.

Jan
>>> Keir Fraser <keir@xensource.com> 16.05.07 10:28 >>>
>On 16/5/07 09:17, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>>> Yes, it's good enough for watchdog and oprofile. Level-triggered external
>>> NMIs will of course be a problem. We could possibly work around this by
>>> masking LINT1 if we are CPU0 (and, of course, if the LAPIC is enabled)
>>> and then unmasking only at the end of the real NMI handler. And of course
>>> x86/64 doesn't have this problem at all, and practically speaking is
>>> pretty much the only hypervisor build that vendors seem to care about.
>>
>> What if we removed the deferral altogether, and made the NMI handler
>> store into the outermost frame (after all, the selector registers have
>> fixed places on that frame), marking that frame accordingly so that
>> overwriting the values saved this way can be avoided in the interrupted
>> save sequence (would be necessary only if both %ds and %es are neither
>> __HYPERVISOR_DS nor null [neatly avoiding special-casing the vm86-mode
>> entry in the outer frame], and would add an extra branch to __SAVE_ALL_PRE
>> plus splitting the selector register stores into moving %ds and %es into
>> general purpose registers, testing the flag NMI or MCE handlers may set,
>> and storing the GPRs into the frame if the flag was clear).
>
>It sounds a bit painful. Also it's the exit-to-guest path that is more of a
>pain to deal with. In this case we may have restored a segment register by
>the time we take the NMI. What do we do in this case about restoring the
>segment register safely? Races in updating the GDT/LDT may mean that the
>reload may still fault, even though it didn't just before; also we may need
>to do work in Xen (e.g., shadow-mode stuff) in interrupts-enabled context to
>fix up a #GP or #PF.

I think I found a pretty reasonable solution, which I'm attaching in its
current (3.1-based) form, together with the prior (untested) variant that
copied the NMI behavior. Even if it doesn't apply to -unstable, I'd be glad
if you could have a brief look to see whether you consider the approach too
intrusive (in which case there would be no point in trying to bring it
forward to -unstable).

Jan
On 21/5/07 15:01, "Jan Beulich" <jbeulich@novell.com> wrote:

> I think I found a pretty reasonable solution, which I'm attaching in its
> current (3.1-based) form, together with the prior (untested) variant that
> copied the NMI behavior. Even if it doesn't apply to -unstable, I'd be glad
> if you could have a brief look to see whether you consider the approach too
> intrusive (in which case there would be no point in trying to bring it
> forward to -unstable).

You'll have to explain how the changes to the x86_32 entry.S work. The rest
of the patch looks good.

 -- Keir
>>> Keir Fraser <keir@xensource.com> 21.05.07 16:17 >>>
>On 21/5/07 15:01, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> I think I found a pretty reasonable solution, which I'm attaching in its
>> current (3.1-based) form, together with the prior (untested) variant that
>> copied the NMI behavior. Even if it doesn't apply to -unstable, I'd be
>> glad if you could have a brief look to see whether you consider the
>> approach too intrusive (in which case there would be no point in trying
>> to bring it forward to -unstable).
>
>You'll have to explain how the changes to the x86_32 entry.S work. The rest
>of the patch looks good.

The idea is to always check values read from %ds and %es against
__HYPERVISOR_DS, and only store into the current frame (all normal handlers)
or the outermost one (NMI and MCE) if the value read is different. That way,
any NMI or MCE occurring during frame setup will store selectors not saved
so far on behalf of the interrupted handler, with that interrupted handler
either having managed to read the guest selector (in which case it can store
it regardless of whether the NMI/MCE kicked in between the read and the
store) or finding __HYPERVISOR_DS already in the register, in which case
it'll know not to store (as the nested handler would have done the store).

For the restore portion this makes use of the fact that there's exactly one
such code sequence, and by moving the selector restore part past all other
restores (including all stack pointer adjustments) the NMI/MCE handlers can
safely detect whether any selector would have been restored already (by
range-checking EIP) and move EIP back to the beginning of the selector
restore sequence without having to play with the stack pointer itself or any
other GPR.

Jan
Well, this sounds fine to me. If you port it I'll apply it. I would prefer
it as a separate patch from the rest of the MCA/MCE changes really, but if
that's a pain then don't worry about it.

 -- Keir

On 21/5/07 15:34, "Jan Beulich" <jbeulich@novell.com> wrote:

> The idea is to always check values read from %ds and %es against
> __HYPERVISOR_DS, and only store into the current frame (all normal
> handlers) or the outermost one (NMI and MCE) if the value read is
> different. That way, any NMI or MCE occurring during frame setup will
> store selectors not saved so far on behalf of the interrupted handler,
> with that interrupted handler either having managed to read the guest
> selector (in which case it can store it regardless of whether the NMI/MCE
> kicked in between the read and the store) or finding __HYPERVISOR_DS
> already in the register, in which case it'll know not to store (as the
> nested handler would have done the store).
>
> For the restore portion this makes use of the fact that there's exactly
> one such code sequence, and by moving the selector restore part past all
> other restores (including all stack pointer adjustments) the NMI/MCE
> handlers can safely detect whether any selector would have been restored
> already (by range-checking EIP) and move EIP back to the beginning of the
> selector restore sequence without having to play with the stack pointer
> itself or any other GPR.