Jan, Keir,

Recently we designed Xen vMCE as attached. Please kindly help me to
review it; any comments/suggestions are appreciated.

Thanks,
Jinsong
>>> On 20.06.12 at 18:13, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Recently we designed Xen vMCE as attached. Please kindly help me to
> review it; any comments/suggestions are appreciated.

The concept looks quite okay, provided no OS has a problem with the
limitations imposed (most notably the restriction to a single reporting
bank, particularly in the context of e.g. Linux partly ignoring the
first bank under some conditions, iirc).

As to not needing any migration-specific adjustments - what if a
migration is in progress when an event needs to be delivered?

Jan
Jan Beulich wrote:
>>>> On 20.06.12 at 18:13, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> Recently we designed Xen vMCE as attached. Please kindly help me to
>> review it; any comments/suggestions are appreciated.
>
> The concept looks quite okay, provided no OS has a problem with
> the limitations imposed (most notably the restriction to a single
> reporting bank, particularly in the context of e.g. Linux partly
> ignoring the first bank under some conditions, iirc).

The 'bank0 skipping' quirk applies only to older-model CPUs. I think we
have 2 options:
1). Still use 1 bank and simply ignore the issue. Even if the guest runs
on a platform with the bank0 quirk, when the hypervisor injects a vMCE#
the guest skips bank0, so its MCE logic will think it detected a
spurious MCE and kill itself. Considering the bank0 quirk only affects
old CPUs, this is acceptable;
2). Use 32 banks.

In fact, a third option is to use 1 bank but have the hypervisor kill
the guest when it detects the bank0 quirk. This would have the same
effect as option 1, so I prefer to let the guest kill itself.

>
> As to not needing any migration-specific adjustments - what if
> a migration is in progress when an event needs to be delivered?
>
> Jan

If a migration is in progress when an event needs to be delivered, we
abort the migration.

Thanks,
Jinsong
>>> On 22.06.12 at 12:40, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Jan Beulich wrote:
>>>>> On 20.06.12 at 18:13, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>>> Recently we designed Xen vMCE as attached. Please kindly help me to
>>> review it; any comments/suggestions are appreciated.
>>
>> The concept looks quite okay, provided no OS has a problem with
>> the limitations imposed (most notably the restriction to a single
>> reporting bank, particularly in the context of e.g. Linux partly
>> ignoring the first bank under some conditions, iirc).
>
> The 'bank0 skipping' quirk applies only to older-model CPUs. I think
> we have 2 options:
> 1). Still use 1 bank and simply ignore the issue. Even if the guest
> runs on a platform with the bank0 quirk, when the hypervisor injects
> a vMCE# the guest skips bank0, so its MCE logic will think it
> detected a spurious MCE and kill itself. Considering the bank0 quirk
> only affects old CPUs, this is acceptable;
> 2). Use 32 banks.
>
> In fact, a third option is to use 1 bank but have the hypervisor kill
> the guest when it detects the bank0 quirk. This would have the same
> effect as option 1, so I prefer to let the guest kill itself.

Out of these, I'd actually favor using 32 banks. Using 2 banks instead
of 1 might be another option.

>> As to not needing any migration-specific adjustments - what if
>> a migration is in progress when an event needs to be delivered?
>
> If a migration is in progress when an event needs to be delivered, we
> abort the migration.

Is there a way the hypervisor can tell the tools to abort a migration?
Or are you meaning to say such functionality would need to be added?

One other concern that occurred to me long after having sent the
original response: your proposal aims at a fixed, unmodifiable vMCE
interface. How is that going to be forward compatible? I.e. consider
you had made this proposal before the SRAO/SRAR changes went in - would
the same interface (with the same set of capability bits set/clear)
still be suitable?

I think that we minimally need to retain the MCG_CAP register as being
of potentially variable content (and hence needing saving/restoring on
migration). To support this in a forward-compatible manner, we may have
to have a way to tell the hypervisor, e.g. via a command line option,
which extra MSRs have to be treated read-as-zero/writes-ignored upon
guest accesses.

Jan
Jan Beulich wrote:
>>>> On 22.06.12 at 12:40, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> Jan Beulich wrote:
>>>>>> On 20.06.12 at 18:13, "Liu, Jinsong" <jinsong.liu@intel.com>
>>>>>> wrote:
>>>> Recently we designed Xen vMCE as attached. Please kindly help me
>>>> to review it; any comments/suggestions are appreciated.
>>>
>>> The concept looks quite okay, provided no OS has a problem with
>>> the limitations imposed (most notably the restriction to a single
>>> reporting bank, particularly in the context of e.g. Linux partly
>>> ignoring the first bank under some conditions, iirc).
>>
>> The 'bank0 skipping' quirk applies only to older-model CPUs. I think
>> we have 2 options:
>> 1). Still use 1 bank and simply ignore the issue. Even if the guest
>> runs on a platform with the bank0 quirk, when the hypervisor injects
>> a vMCE# the guest skips bank0, so its MCE logic will think it
>> detected a spurious MCE and kill itself. Considering the bank0 quirk
>> only affects old CPUs, this is acceptable;
>> 2). Use 32 banks.
>>
>> In fact, a third option is to use 1 bank but have the hypervisor
>> kill the guest when it detects the bank0 quirk. This would have the
>> same effect as option 1, so I prefer to let the guest kill itself.
>
> Out of these, I'd actually favor using 32 banks. Using 2 banks
> instead of 1 might be another option.

Yes, 2 or 32 banks are both OK. Let's use 32 (or 2, if it turns out to
be better for some other reason - let me confirm inside Intel first).

>
>>> As to not needing any migration-specific adjustments - what if
>>> a migration is in progress when an event needs to be delivered?
>>
>> If a migration is in progress when an event needs to be delivered,
>> we abort the migration.
>
> Is there a way the hypervisor can tell the tools to abort a
> migration? Or are you meaning to say such functionality would
> need to be added?

I didn't check the migration code, but I think it could be done by:
1). setting a flag indicating the case 'migration is in progress while
an event is delivered';
2). at the last handshake stage of the migration (i.e. from A to B),
checking the flag and, if it is set, aborting the migration.
The guest would then continue running on A, and quit on B after a
timeout.

>
> One other concern that occurred to me long after having sent the
> original response: your proposal aims at a fixed, unmodifiable vMCE
> interface. How is that going to be forward compatible? I.e. consider
> you had made this proposal before the SRAO/SRAR changes went in -
> would the same interface (with the same set of capability bits
> set/clear) still be suitable?

Yes, since it's a purely s/w-emulated interface. In the case where SRAO
or SRAR is not supported by the h/w platform, it's still OK, since in
that case the hypervisor doesn't need to deliver SRAO or SRAR to the
guest at all. The emulated vMCE interface just tells the guest that it
runs on a virtual platform with those well-defined capabilities.

>
> I think that we minimally need to retain the MCG_CAP register
> as being of potentially variable content (and hence needing
> saving/restoring on migration). To support this in a
> forward-compatible manner, we may have to have a way to tell
> the hypervisor, e.g. via a command line option, which extra
> MSRs have to be treated read-as-zero/writes-ignored upon
> guest accesses.
>
> Jan

Seems unnecessary, reason as above.

Thanks,
Jinsong
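[A minimal sketch in C of the two-step check Jinsong describes above;
the per-domain flag and the handshake hook are hypothetical names for
illustration, not actual Xen/libxc interfaces.]

#include <stdbool.h>

/* Hypothetical per-domain migration state (illustration only). */
struct migration_state {
    bool in_progress;    /* set when live migration starts              */
    bool vmce_pending;   /* set by the hypervisor if a vMCE# is injected
                            while in_progress is true                   */
};

/*
 * Called at the last handshake stage, just before the destination (B)
 * is allowed to resume the guest.  A nonzero return aborts the
 * migration, so the guest keeps running on the source (A) and the
 * stale copy on B is torn down after a timeout.
 */
static int final_handshake_check(const struct migration_state *ms)
{
    if (ms->vmce_pending)
        return -1;       /* vMCE# delivered mid-migration: abort */
    return 0;            /* safe to complete the migration       */
}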
>>> On 22.06.12 at 15:46, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Jan Beulich wrote:
>> One other concern that occurred to me long after having sent the
>> original response: your proposal aims at a fixed, unmodifiable vMCE
>> interface. How is that going to be forward compatible? I.e. consider
>> you had made this proposal before the SRAO/SRAR changes went in -
>> would the same interface (with the same set of capability bits
>> set/clear) still be suitable?
>
> Yes, since it's a purely s/w-emulated interface. In the case where
> SRAO or SRAR is not supported by the h/w platform, it's still OK,
> since in that case the hypervisor doesn't need to deliver SRAO or
> SRAR to the guest at all. The emulated vMCE interface just tells the
> guest that it runs on a virtual platform with those well-defined
> capabilities.

I probably didn't express well enough what I want you to check:
consider you had done the current re-design work without SRAO/SRAR in
mind (e.g. a couple of years ago). Would the end result have been the
same? Namely, would the bits you nominate to be set/clear in MCG_CAP
be the same?

>> I think that we minimally need to retain the MCG_CAP register
>> as being of potentially variable content (and hence needing
>> saving/restoring on migration). To support this in a
>> forward-compatible manner, we may have to have a way to tell
>> the hypervisor, e.g. via a command line option, which extra
>> MSRs have to be treated read-as-zero/writes-ignored upon
>> guest accesses.
>
> Seems unnecessary, reason as above.

So going forward you see no possibility of additions to the interface
that might warrant allowing more bits to be set in MCG_CAP than you
define to be set here? That really looks unrealistic to me.

Jan
>>> I think that we minimally need to retain the MCG_CAP register
>>> as being of potentially variable content (and hence needing
>>> saving/restoring on migration). To support this in a
>>> forward-compatible manner, we may have to have a way to tell
>>> the hypervisor, e.g. via a command line option, which extra
>>> MSRs have to be treated read-as-zero/writes-ignored upon
>>> guest accesses.
>>
>> Seems unnecessary, reason as above.
>
> So going forward you see no possibility of additions to the
> interface that might warrant allowing more bits to be set in
> MCG_CAP than you define to be set here? That really looks
> unrealistic to me.

More bits can be added to MCG_CAP - but this becomes a hard problem for
migration, because most OS guests only look at MCG_CAP at boot time
(Linux certainly does) ... so if you migrate and change MCG_CAP to
match the new host, the guest will have no idea that things changed.

The MCG_SER_P bit is easy to deal with:
1) You already know about it.
2) You can safely set it when the guest is running on a host that does
not support software error recovery ... the guest will just think that
SRAO and SRAR events are possible, but they will never actually occur.
Setting it does make sure that the guest will be ready should you
migrate to a system that really does support it.

Future bits would have to be dealt with on a case-by-case basis. The
same general model would apply ... set the bit everywhere ... but you
might need some Xen changes to virtualize any new resources and
capabilities that were described by the new bit.

Side question: when a guest is migrated from one machine to another,
does the version of Xen running on source and target have to be the
same? This tactic will not work if you migrate a guest running on Xen
version 99, which does have the code to virtualize some new capability,
to a machine running Xen version 98 that doesn't.

-Tony
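[To illustrate the "set MCG_SER_P everywhere" idea, one way a
guest-visible MCG_CAP could be composed. The bit positions follow the
architectural definitions; which bits to expose is only an example
policy, not the proposal's actual choice.]

#include <stdint.h>

#define MCG_SER_P       (1ULL << 24)            /* software error recovery */
#define MCG_BANKCNT(n)  ((uint64_t)(n) & 0xff)  /* bank count, bits 0-7    */

/*
 * Example guest-visible MCG_CAP: a fixed bank count plus MCG_SER_P set
 * unconditionally, so the guest is prepared for SRAO/SRAR even when
 * the current host cannot generate them (they then simply never occur).
 */
static inline uint64_t vmce_mcg_cap(unsigned int nr_banks)
{
    return MCG_BANKCNT(nr_banks) | MCG_SER_P;
}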
> 1). Still use 1 bank and simply ignore the issue. Even if the guest
> runs on a platform with the bank0 quirk, when the hypervisor injects
> a vMCE# the guest skips bank0, so its MCE logic will think it
> detected a spurious MCE and kill itself. Considering the bank0 quirk
> only affects old CPUs, this is acceptable;
> 2). Use 32 banks.
>
> In fact, a third option is to use 1 bank but have the hypervisor kill
> the guest when it detects the bank0 quirk. This would have the same
> effect as option 1, so I prefer to let the guest kill itself.

Don't you control what CPUID is shown to the guest? Under what
circumstances would you tell the guest that it is running on an AMD-K7
or an Intel family 6 with model < 0x1A? Surely for migration reasons
you need to present the same virtualized family/model all the time ...
so just don't use ones that cause problems.

If that isn't an option - then say there are 2 banks and have Xen
ignore bank 0 (make MC0_STATUS always appear to contain 0) and put all
the errors into bank 1. If you tell the guest there are 32 banks, it
will read all of them, which means a lot of pointless exits to the
hypervisor.

-Tony
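[A rough sketch of the read side of the "2 banks, bank 0 always clear"
suggestion. The MSR layout (IA32_MCi_CTL/STATUS/ADDR/MISC at
0x400 + 4*i) is architectural, but the function and the way virtual
bank 1 state is stored are made up for illustration.]

#include <stdint.h>
#include <stdbool.h>

#define MSR_IA32_MC0_CTL 0x400u   /* bank i registers live at 0x400 + 4*i:
                                     CTL, STATUS, ADDR, MISC             */
#define VMCE_NR_BANKS    2        /* what MCG_CAP would report           */

/*
 * Handle a guest RDMSR of a banked MCA register.  Bank 0 always reads
 * as zero; all virtual error state is kept in (and delivered through)
 * bank 1, whose four registers are passed in mc1_regs[].
 * Returns false if the MSR is outside the emulated bank range.
 */
static bool vmce_rdmsr_bank(uint32_t msr, const uint64_t mc1_regs[4],
                            uint64_t *val)
{
    if (msr < MSR_IA32_MC0_CTL ||
        msr >= MSR_IA32_MC0_CTL + 4 * VMCE_NR_BANKS)
        return false;

    uint32_t bank = (msr - MSR_IA32_MC0_CTL) / 4;
    uint32_t reg  = (msr - MSR_IA32_MC0_CTL) % 4;

    *val = (bank == 0) ? 0 : mc1_regs[reg];
    return true;
}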
>>> On 22.06.12 at 17:21, "Luck, Tony" <tony.luck@intel.com> wrote:
>>>> I think that we minimally need to retain the MCG_CAP register
>>>> as being of potentially variable content (and hence needing
>>>> saving/restoring on migration). To support this in a
>>>> forward-compatible manner, we may have to have a way to tell
>>>> the hypervisor, e.g. via a command line option, which extra
>>>> MSRs have to be treated read-as-zero/writes-ignored upon
>>>> guest accesses.
>>>
>>> Seems unnecessary, reason as above.
>>
>> So going forward you see no possibility of additions to the
>> interface that might warrant allowing more bits to be set in
>> MCG_CAP than you define to be set here? That really looks
>> unrealistic to me.
>
> More bits can be added to MCG_CAP - but this becomes a hard
> problem for migration, because most OS guests only look at
> MCG_CAP at boot time (Linux certainly does) ... so if you migrate
> and change MCG_CAP to match the new host, the guest will have
> no idea that things changed.

And it doesn't have to - respective events will just not occur anymore.
All we need to make sure is that if it accesses an MSR that's no longer
backed by anything in hardware, we don't inject a #GP (but instead
return e.g. all zeros to reads, and drop all writes).

> The MCG_SER_P bit is easy to deal with:
> 1) You already know about it.
> 2) You can safely set it when the guest is running on a host that
> does not support software error recovery ... the guest will just
> think that SRAO and SRAR events are possible, but they will never
> actually occur. Setting it does make sure that the guest will be
> ready should you migrate to a system that really does support it.

And so is likely going to be the case for at least some of the
additions to come.

> Future bits would have to be dealt with on a case-by-case basis.
> The same general model would apply ... set the bit everywhere ...
> but you might need some Xen changes to virtualize any new resources
> and capabilities that were described by the new bit.
>
> Side question: when a guest is migrated from one machine to another,
> does the version of Xen running on source and target have to be the
> same? This tactic will not work if you migrate a guest running on
> Xen version 99, which does have the code to virtualize some new
> capability, to a machine running Xen version 98 that doesn't.

No, that's not a requirement, and hence we need to create a model that
will at least always allow upward migration (destination host running a
higher Xen version than the source), and that preserves as much
functionality as possible when moving between hosts with different MCA
capabilities.

Jan
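[The "no #GP, read-as-zero, writes-ignored" behaviour Jan describes
could look roughly like this; msr_in_vmce_interface() stands in for
whatever check the hypervisor would really perform and is purely
hypothetical.]

#include <stdint.h>
#include <stdbool.h>

/* Placeholder: would check whether the MSR is part of the exposed vMCE
 * interface (and thus handled by the normal emulation path). */
static bool msr_in_vmce_interface(uint32_t msr)
{
    (void)msr;
    return false;
}

/*
 * Fallback handling for MCA MSRs no longer backed by anything on the
 * current host: reads return zero, writes are dropped, and no #GP is
 * injected.  Returns true if the access was fully handled here.
 */
static bool vmce_msr_fallback(bool is_write, uint32_t msr, uint64_t *val)
{
    if (msr_in_vmce_interface(msr))
        return false;            /* normal emulation path handles it */

    if (!is_write)
        *val = 0;                /* read-as-zero                     */
    /* writes are silently ignored - no #GP in either case           */
    return true;
}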
>>> On 22.06.12 at 17:29, "Luck, Tony" <tony.luck@intel.com> wrote:
>> 1). Still use 1 bank and simply ignore the issue. Even if the guest
>> runs on a platform with the bank0 quirk, when the hypervisor injects
>> a vMCE# the guest skips bank0, so its MCE logic will think it
>> detected a spurious MCE and kill itself. Considering the bank0 quirk
>> only affects old CPUs, this is acceptable;
>> 2). Use 32 banks.
>>
>> In fact, a third option is to use 1 bank but have the hypervisor
>> kill the guest when it detects the bank0 quirk. This would have the
>> same effect as option 1, so I prefer to let the guest kill itself.
>
> Don't you control what CPUID is shown to the guest? Under what
> circumstances would you tell the guest that it is running on an
> AMD-K7 or an Intel family 6 with model < 0x1A? Surely for migration
> reasons you need to present the same virtualized family/model all the
> time ... so just don't use ones that cause problems.

I don't think family/model/stepping are frequently faked; it's normally
just the various feature flags that get normalized to the smallest
common set.

> If that isn't an option - then say there are 2 banks and have Xen
> ignore bank 0 (make MC0_STATUS always appear to contain 0) and put
> all the errors into bank 1. If you tell the guest there are 32 banks,
> it will read all of them, which means a lot of pointless exits to the
> hypervisor.

Indeed, emulating too many banks can have its own downsides. Yet I
don't think we're really concerned about performance when handling
machine checks. But having more than one usable bank must have
advantages, else hardware wouldn't implement things that way.

Jan
> Yet I don't think we're really concerned about performance when
> handling machine checks. But having more than one usable bank must
> have advantages, else hardware wouldn't implement things that way.

The primary reason for multiple banks is h/w design ... the silicon
implementing the bank is generally included in the component that
generates the error. E.g. there may be multiple memory controllers on a
die, each with its own bank. H/W designers hate running long "wires"
across the die, as it messes up their layout options.

There may be some secondary side benefit that independent errors might
be reported to different banks, and so avoid some overwrite problems.
But I don't think that Xen has a big worry with overwrite, does it? In
general, the errors that you will show to the guest are ones that you
expect the guest to handle immediately (e.g. SRAO and SRAR, signaled
with a machine check). You do not log any corrected errors to the guest
(they can't do anything useful with them). You certainly don't log any
errors that are not signaled. So you should never have any errors
hanging around in banks for long periods that would get overwritten.

-Tony
>>> On 22.06.12 at 18:21, "Luck, Tony" <tony.luck@intel.com> wrote:
> There may be some secondary side benefit that independent errors
> might be reported to different banks, and so avoid some overwrite
> problems. But I don't think that Xen has a big worry with overwrite,
> does it? In general, the errors that you will show to the guest are
> ones that you expect the guest to handle immediately (e.g. SRAO and
> SRAR, signaled with a machine check). You do not log any corrected
> errors to the guest (they can't do anything useful with them). You
> certainly don't log any errors that are not signaled. So you should
> never have any errors hanging around in banks for long periods that
> would get overwritten.

The problem is the determination of what "long" means here. The target
vCPU may not get scheduled for extended periods of time (raising its
priority would have other undesirable implications), so the risk of
overwrite exists from that perspective.

However, assuming that reportable errors get associated with a
particular vCPU, and that such a vCPU won't be able to execute any
further guest code prior to the delivery of the exception, the only
real risk here would be if the vMCE handler itself raised another
event. That, I agree with Jinsong, can well be treated as fatal
(killing the guest, provided the event gets properly logged so the
admin isn't left in the dark regarding the unexpected death of the VM),
mostly matching what a single-bank hardware implementation would result
in.

Jan
Jan Beulich wrote:
>>>> On 22.06.12 at 15:46, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> Jan Beulich wrote:
>>> One other concern that occurred to me long after having sent the
>>> original response: your proposal aims at a fixed, unmodifiable vMCE
>>> interface. How is that going to be forward compatible? I.e.
>>> consider you had made this proposal before the SRAO/SRAR changes
>>> went in - would the same interface (with the same set of capability
>>> bits set/clear) still be suitable?
>>
>> Yes, since it's a purely s/w-emulated interface. In the case where
>> SRAO or SRAR is not supported by the h/w platform, it's still OK,
>> since in that case the hypervisor doesn't need to deliver SRAO or
>> SRAR to the guest at all. The emulated vMCE interface just tells the
>> guest that it runs on a virtual platform with those well-defined
>> capabilities.
>
> I probably didn't express well enough what I want you to check:
> consider you had done the current re-design work without SRAO/SRAR
> in mind (e.g. a couple of years ago). Would the end result have been
> the same? Namely, would the bits you nominate to be set/clear in
> MCG_CAP be the same?
>
>>> I think that we minimally need to retain the MCG_CAP register
>>> as being of potentially variable content (and hence needing
>>> saving/restoring on migration). To support this in a
>>> forward-compatible manner, we may have to have a way to tell
>>> the hypervisor, e.g. via a command line option, which extra
>>> MSRs have to be treated read-as-zero/writes-ignored upon
>>> guest accesses.
>>
>> Seems unnecessary, reason as above.
>
> So going forward you see no possibility of additions to the
> interface that might warrant allowing more bits to be set in
> MCG_CAP than you define to be set here? That really looks
> unrealistic to me.
>
> Jan

Sorry for misunderstanding your meaning in my last email; please ignore
it. I agree that MCG_CAP should be saved/restored on migration,
considering that in the future some CAP bit may be added.

Thanks,
Jinsong
On Fri, Jun 22, 2012 at 4:21 PM, Luck, Tony <tony.luck@intel.com> wrote:
>>>> I think that we minimally need to retain the MCG_CAP register
>>>> as being of potentially variable content (and hence needing
>>>> saving/restoring on migration). To support this in a
>>>> forward-compatible manner, we may have to have a way to tell
>>>> the hypervisor, e.g. via a command line option, which extra
>>>> MSRs have to be treated read-as-zero/writes-ignored upon
>>>> guest accesses.
>>>
>>> Seems unnecessary, reason as above.
>>
>> So going forward you see no possibility of additions to the
>> interface that might warrant allowing more bits to be set in
>> MCG_CAP than you define to be set here? That really looks
>> unrealistic to me.
>
> More bits can be added to MCG_CAP - but this becomes a hard
> problem for migration, because most OS guests only look at
> MCG_CAP at boot time (Linux certainly does) ... so if you migrate
> and change MCG_CAP to match the new host, the guest will have
> no idea that things changed.

Typically, if you have a heterogeneous pool (i.e., different processors
with different CPUID bits), you have to calculate the minimum subset of
features available on all processors and only expose those to the
guest. It wouldn't be too hard to extend that to vMCE if we had to.

Alternately, as Jan says, we could just fake things out when we need
to.

 -George
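[A small sketch of the "minimum subset across the pool" calculation
George mentions, extended to MCG_CAP as one possible way of doing it;
the structures are illustrative, not an actual toolstack API.]

#include <stdint.h>
#include <stddef.h>

/* Illustrative per-host capability summary. */
struct host_caps {
    uint64_t cpuid_feat;   /* normalized CPUID feature flags */
    uint64_t mcg_cap;      /* host MCG_CAP value             */
};

/*
 * Compute what may safely be exposed to guests in a heterogeneous
 * pool: feature bits are intersected across all hosts, and the MCA
 * bank count (low 8 bits of MCG_CAP) is taken as the minimum.
 */
static struct host_caps pool_min_caps(const struct host_caps *hosts,
                                      size_t nr_hosts)
{
    struct host_caps min = hosts[0];

    for (size_t i = 1; i < nr_hosts; i++) {
        uint64_t banks     = hosts[i].mcg_cap & 0xff;
        uint64_t min_banks = min.mcg_cap & 0xff;

        min.cpuid_feat &= hosts[i].cpuid_feat;
        min.mcg_cap     = ((min.mcg_cap & hosts[i].mcg_cap) & ~0xffULL) |
                          (banks < min_banks ? banks : min_banks);
    }
    return min;
}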