As I had expressed before, I'm thinking that the current way of handling
the top level of PAE paging is inappropriate, even after the above-4G
adjustments that cured part of the problem. This is specifically because
- the handling here isn't consistent with how hardware behaves in the same
  situation (though the Xen behavior is probably within range of the
  generic architecture specification), in that the processor reads the 4
  top level entries when CR3 gets re-loaded (and hence doesn't try to
  access them later in any way), while Xen treats them (including
  potential updates to them) just like entries on any other level of the
  hierarchy
- the guest still needs to allocate a full page, even though only the
  first 32 bytes of it are actually used
- the shadowing done in Xen could be avoided altogether by following
  hardware behavior.

Just now I found that there is a resulting issue for the 32on64 work I'm
doing: Since none of the entries 4...511 of the PMD get initialized in
Linux, and since Xen nevertheless has to validate all 512 entries (in
order to avoid making available translations that could be used during
speculative execution), the validation has the potential to fail (and does
in reality), resulting in the guest dying. The only option I presently see
is to special case the compatibility guest in the l3 handling and (I
really hate to do that) clear out the 508 supposedly unused entries (or at
least clear their present bits), meaning that no guest may ever make
clever assumptions and try to store some other data in the unused portion
of the pgd page.

Thanks for sharing any other ideas on how to overcome this problem,

Jan
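A minimal sketch, in C, of the zapping option Jan describes; the function
and macro names are hypothetical, not the actual Xen code:

#include <stdint.h>

#define TOP_LEVEL_ENTRIES 512
#define COMPAT_L3_ENTRIES   4

typedef struct { uint64_t entry; } pgentry_t;

/* Force entries 4..511 of a compat guest's top-level page to be
 * non-present (here cleared outright, though clearing just the
 * present bit would do), so that validating all 512 entries can no
 * longer trip over uninitialized guest data. */
static void compat_zap_upper_entries(pgentry_t *top_page)
{
    unsigned int i;

    for ( i = COMPAT_L3_ENTRIES; i < TOP_LEVEL_ENTRIES; i++ )
        top_page[i].entry = 0;
}

The trade-off Jan notes still applies: once Xen writes to this page, a
guest can no longer stash unrelated data in the unused 4064 bytes.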
> As I had expressed before, I'm thinking that the current way of handling
> the top level of PAE paging is inappropriate, even after the above-4G
> adjustments that cured part of the problem. This is specifically because
> - the handling here isn't consistent with how hardware behaves in the
> same situation (though the Xen behavior is probably within range of the
> generic architecture specification), in that the processor reads the 4
> top level entries when CR3 gets re-loaded (and hence doesn't try to
> access them later in any way), while Xen treats them (including
> potential updates to them) just like entries on any other level of the
> hierarchy
> - the guest still needs to allocate a full page, even though only the
> first 32 bytes of it are actually used
> - the shadowing done in Xen could be avoided altogether by following
> hardware behavior.
>
> Just now I found that there is a resulting issue for the 32on64 work I'm
> doing: Since none of the entries 4...511 of the PMD get initialized in
> Linux, and since Xen nevertheless has to validate all 512 entries (in
> order to avoid making available translations that could be used during
> speculative execution), the validation has the potential to fail (and
> does in reality), resulting in the guest dying. The only option I
> presently see is to special case the compatibility guest in the l3
> handling and (I really hate to do that) clear out the 508 supposedly
> unused entries (or at least clear their present bits), meaning that no
> guest may ever make clever assumptions and try to store some other data
> in the unused portion of the pgd page.

Why not just have a fixed per-vcpu L4 and L3, into which the 4 PAE L3's
get copied on every cr3 load? It's most analogous to what happens today.

We've thought of removing the page-size restriction on PAE L3's in the
past, but it's pretty low down the priority list as it typically doesn't
cost a great deal of memory.

Ian
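A sketch of the scheme Ian suggests, with hypothetical structure and
function names: the 4 guest PDPTEs are copied into a fixed, Xen-owned
per-vcpu L3 on each cr3 load, mirroring the hardware's read-at-cr3-load
behavior.

#include <stdint.h>
#include <string.h>

typedef struct { uint64_t entry; } pgentry_t;

struct vcpu_pae {
    pgentry_t fixed_l4[512];  /* per-vcpu; set up once to reference fixed_l3 */
    pgentry_t fixed_l3[512];  /* per-vcpu; slots 0..3 refreshed below */
};

/* Called on every guest cr3 (re)load: snapshot the guest's 4 PAE L3
 * entries, just as the processor re-reads the PDPTEs at cr3 load. */
static void pae_cr3_load(struct vcpu_pae *v, const pgentry_t *guest_l3)
{
    memcpy(v->fixed_l3, guest_l3, 4 * sizeof(pgentry_t));
}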
>Why not just have a fixed per-vcpu L4 and L3, into which the 4 PAE L3's
>get copied on every cr3 load?
>It's most analogous to what happens today.

In the shadowing (PAE, 32bit) case (a code path that, as I said, I'd
rather see ripped out). In the general 64-bit case, this would add another
(needless) distinct code path. I think I still like better the idea of
clearing out the final 508 entries.

>We've thought of removing the page-size restriction on PAE L3's in the
>past, but it's pretty low down the priority list as it typically doesn't
>cost a great deal of memory.

Ah. I would have felt differently.

Jan
>> Why not just have a fixed per-vcpu L4 and L3, into which the 4 PAE L3's
>> get copied on every cr3 load?
>> It's most analogous to what happens today.
>
> In the shadowing (PAE, 32bit) case (a code path that, as I said, I'd
> rather see ripped out).

Why? It's essential to allow PAE PGDs to live above 4GB, which is a PITA
otherwise.

> In the general 64-bit case, this would add another
> (needless) distinct code path. I think I still like better the idea of
> clearing out the final 508 entries.
>
>> We've thought of removing the page-size restriction on PAE L3's in the
>> past, but it's pretty low down the priority list as it typically
>> doesn't cost a great deal of memory.
>
> Ah. I would have felt differently.

Most machines probably only have a hundred processes (we can exclude
kernel threads and threads in general), hence maybe a few hundred KB
wasted, tops.

If we did remove the size restriction, we'd still want to put them in
their own slab cache rather than the general 32b cache, as you don't want
them being shared with other non-PGD data. This is a PITA that mandates
how we handle shadowing of PAE PGDs in the HVM case where we can't control
what they're allocated alongside.

Ian
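For illustration only, a dedicated cache along the lines Ian sketches
might look like this in Linux; the cache name, the init hook, and the
choice of flags are assumptions, not existing kernel code:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/types.h>

#define PAE_PGD_SIZE (4 * sizeof(u64))  /* 4 PDPTEs = 32 bytes */

static struct kmem_cache *pae_pgd_cache;

static int __init pae_pgd_cache_init(void)
{
    /* 32-byte objects, 32-byte aligned (CR3 holds bits 31:5 of the
     * PDPT base in PAE mode), kept apart from general-purpose 32-byte
     * allocations so a PGD never shares a page with non-PGD data. */
    pae_pgd_cache = kmem_cache_create("pae_pgd", PAE_PGD_SIZE,
                                      PAE_PGD_SIZE, SLAB_PANIC, NULL);
    return pae_pgd_cache ? 0 : -ENOMEM;
}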
On 19/10/06 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:

> Just now I found that there is a resulting issue for the 32on64 work I'm
> doing: Since none of the entries 4...511 of the PMD get initialized in
> Linux, and since Xen nevertheless has to validate all 512 entries (in
> order to avoid making available translations that could be used during
> speculative execution), the validation has the potential to fail (and
> does in reality), resulting in the guest dying. The only option I
> presently see is to special case the compatibility guest in the l3
> handling and (I really hate to do that) clear out the 508 supposedly
> unused entries (or at least clear their present bits), meaning that no
> guest may ever make clever assumptions and try to store some other data
> in the unused portion of the pgd page.

Either copy the PGDs out into a shadow L3, as we do for PAE Xen today. Or,
as you say, zap the 508 unused entries. No guest uses them -- I'm pretty
sure Linux is the only PAE-capable guest (others are non-pae or 64-bit).
Storing other stuff in the page would be inconvenient anyway since it has
to be read-only.

 -- Keir
On 19/10/06 12:18, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Why not just have a fixed per-vcpu L4 and L3, into which the 4 PAE L3's
>> get copied on every cr3 load?
>> It's most analogous to what happens today.
>
> In the shadowing (PAE, 32bit) case (a code path that, as I said, I'd
> rather see ripped out). In the general 64-bit case, this would add
> another (needless) distinct code path. I think I still like better the
> idea of clearing out the final 508 entries.

If we allowed non-pae-aligned L3s then you'd have no choice but to shadow
anyway, as that would be the only way to make the guest mappings appear at
the correct place in the 64-bit address space.

 -- Keir

>> We've thought of removing the page-size restriction on PAE L3's in the
>> past, but it's pretty low down the priority list as it typically
>> doesn't cost a great deal of memory.
>
> Ah. I would have felt differently.
NetWare is also a PAE guest, but doesn't put anything in the rest of the
page, so zapping would be fine for NetWare.

- Bruce

>>> On 10/19/2006 at 6:56 AM, in message
<C15D34A3.2CB1%Keir.Fraser@cl.cam.ac.uk>, Keir Fraser
<Keir.Fraser@cl.cam.ac.uk> wrote:

> On 19/10/06 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> Just now I found that there is a resulting issue for the 32on64 work
>> I'm doing: Since none of the entries 4...511 of the PMD get initialized
>> in Linux, and since Xen nevertheless has to validate all 512 entries
>> (in order to avoid making available translations that could be used
>> during speculative execution), the validation has the potential to fail
>> (and does in reality), resulting in the guest dying. The only option I
>> presently see is to special case the compatibility guest in the l3
>> handling and (I really hate to do that) clear out the 508 supposedly
>> unused entries (or at least clear their present bits), meaning that no
>> guest may ever make clever assumptions and try to store some other data
>> in the unused portion of the pgd page.
>
> Either copy the PGDs out into a shadow L3, as we do for PAE Xen today.
> Or, as you say, zap the 508 unused entries. No guest uses them -- I'm
> pretty sure Linux is the only PAE-capable guest (others are non-pae or
> 64-bit). Storing other stuff in the page would be inconvenient anyway
> since it has to be read-only.
>
> -- Keir
>If we allowed non-pae-aligned L3s then you'd have no choice but to shadow
>anyway, as that would be the only way to make the guest mappings appear
>at the correct place in the 64-bit address space.

It's not really shadowing, since there is no need to monitor changes. I'd
therefore rather call it snapshotting.

Jan
On 19/10/06 15:24, "Jan Beulich" <jbeulich@novell.com> wrote:

>> If we allowed non-pae-aligned L3s then you'd have no choice but to
>> shadow anyway, as that would be the only way to make the guest mappings
>> appear at the correct place in the 64-bit address space.
>
> It's not really shadowing, since there is no need to monitor changes.
> I'd therefore rather call it snapshotting.

But you do need to track updates, right? What if the guest zaps an L3
entry in an in-use PGD table and then flushes TLBs?

 -- Keir
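A sketch of the update tracking Keir is pointing at, with hypothetical
names: the snapshot has to be refreshed on a guest TLB flush as well as on
cr3 load, otherwise a zapped L3 entry in an in-use PGD would linger in the
copy.

#include <stdint.h>
#include <string.h>

typedef struct { uint64_t entry; } pgentry_t;

struct pae_snapshot {
    pgentry_t l3[4];             /* Xen-held copy of the 4 PDPTEs */
    const pgentry_t *guest_l3;   /* the guest's live top-level entries */
};

/* Invoked for guest cr3 loads *and* TLB flushes: re-read the guest's
 * 4 L3 entries so a zapped entry disappears from the snapshot at the
 * same point hardware would drop it. */
static void snapshot_refresh(struct pae_snapshot *s)
{
    memcpy(s->l3, s->guest_l3, sizeof(s->l3));
}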
> Date: Thu, 19 Oct 2006 13:56:51 +0100
> From: Keir Fraser <Keir.Fraser@cl.cam.ac.uk>
> Subject: Re: [Xen-devel] PAE issue (32-on-64 work)
> To: Jan Beulich <jbeulich@novell.com>, <xen-devel@lists.xensource.com>
> Message-ID: <C15D34A3.2CB1%Keir.Fraser@cl.cam.ac.uk>
> Content-Type: text/plain; charset="US-ASCII"
>
> On 19/10/06 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> Just now I found that there is a resulting issue for the 32on64 work
>> I'm doing: Since none of the entries 4...511 of the PMD get initialized
>> in Linux, and since Xen nevertheless has to validate all 512 entries
>> (in order to avoid making available translations that could be used
>> during speculative execution), the validation has the potential to fail
>> (and does in reality), resulting in the guest dying. The only option I
>> presently see is to special case the compatibility guest in the l3
>> handling and (I really hate to do that) clear out the 508 supposedly
>> unused entries (or at least clear their present bits), meaning that no
>> guest may ever make clever assumptions and try to store some other data
>> in the unused portion of the pgd page.
>
> Either copy the PGDs out into a shadow L3, as we do for PAE Xen today.
> Or, as you say, zap the 508 unused entries. No guest uses them -- I'm
> pretty sure Linux is the only PAE-capable guest (others are non-pae or
> 64-bit). Storing other stuff in the page would be inconvenient anyway
> since it has to be read-only.
>
> -- Keir

I just now happen to be changing the Solaris 32 bit domains to support
PAE on Xen, purposely to be able to use the 32-on-64 capabilities as they
become available.

The code path in Solaris currently supports 2 possibilities for PAE top
level tables. The normal code we use on bare metal allocates only 1 page
that all CPUs share for the top level page table. For example, cpu0 uses
the 1st four quads for its current process' L3, cpu1 uses the next four,
etc. On context switch or cr3 reload we (re)copy in the 4 entries of the
process for that CPU's section of the page.

That code path is, as so much of the 32 bit PAE support, a special case.
So it was easily turned off and made to just use an entire page for each
unique top level L3 on Xen. I did that just for my initial bring up on PAE
Xen, but was hoping to go back to some form of the optimized version next.

Joe
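A sketch of the Solaris bare-metal arrangement Joe describes, with
hypothetical names: a single shared page in which cpu N owns quadwords
4N..4N+3, refreshed with the incoming process's 4 entries on context
switch.

#include <stdint.h>
#include <string.h>

#define ENTRIES_PER_CPU 4   /* four quadwords per CPU slot */

typedef struct { uint64_t entry; } pgentry_t;

/* One page shared by all CPUs (512 entries allow up to 128 CPUs). */
static pgentry_t shared_top_level[512];

/* On context switch or cr3 reload, copy the incoming process's 4 L3
 * entries into this CPU's slot; cr3 then points at that slot (the
 * actual cr3 write is omitted here). */
static void pae_ctx_switch(unsigned int cpu, const pgentry_t *proc_l3)
{
    memcpy(&shared_top_level[cpu * ENTRIES_PER_CPU], proc_l3,
           ENTRIES_PER_CPU * sizeof(pgentry_t));
}

As Keir notes in the follow-up, the per-process page-sized L3 is the
cheaper route under Xen, since recopying on every cr3 load hits the slow
L3-update path.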
On 19/10/06 17:19, "Joe Bonasera" <joe.bonasera@sun.com> wrote:

> The code path in Solaris currently supports 2 possibilities for PAE top
> level tables. The normal code we use on bare metal allocates only 1 page
> that all CPUs share for the top level page table. For example, cpu0 uses
> the 1st four quads for its current process' L3, cpu1 uses the next four,
> etc. On context switch or cr3 reload we (re)copy in the 4 entries of the
> process for that CPU's section of the page.
>
> That code path is, as so much of the 32 bit PAE support, a special case.
> So it was easily turned off and made to just use an entire page for each
> unique top level L3 on Xen. I did that just for my initial bring up on
> PAE Xen, but was hoping to go back to some form of the optimized version
> next.

You should just allocate a page-sized L3 per process and be done with it.
A page overhead per process is nothing to be concerned about (clearly the
overhead can be even bigger if, for example, you run 4-level tables on
x86_64). Recopying the L3 entries every TLB flush will *not* perform well
on current Xen.

 -- Keir
Keir Fraser wrote:

> On 19/10/06 17:19, "Joe Bonasera" <joe.bonasera@sun.com> wrote:
>
>> The code path in Solaris currently supports 2 possibilities for PAE top
>> level tables. The normal code we use on bare metal allocates only 1
>> page that all CPUs share for the top level page table. For example,
>> cpu0 uses the 1st four quads for its current process' L3, cpu1 uses the
>> next four, etc. On context switch or cr3 reload we (re)copy in the 4
>> entries of the process for that CPU's section of the page.
>>
>> That code path is, as so much of the 32 bit PAE support, a special
>> case. So it was easily turned off and made to just use an entire page
>> for each unique top level L3 on Xen. I did that just for my initial
>> bring up on PAE Xen, but was hoping to go back to some form of the
>> optimized version next.
>
> You should just allocate a page-sized L3 per process and be done with
> it. A page overhead per process is nothing to be concerned about
> (clearly the overhead can be even bigger if, for example, you run
> 4-level tables on x86_64). Recopying the L3 entries every TLB flush will
> *not* perform well on current Xen.

Well we actually don't do complete TLB flushes very often at all,
essentially only the first time a new L3 entry is created by a running
process, which for most processes means never - as >1Gig processes are
rare. So it shouldn't matter if they hit one or two slowish flushes.

Are there any other performance implications to watch out for?

Joe
On 19/10/06 6:22 pm, "Joe Bonasera" <joe.bonasera@sun.com> wrote:

>> You should just allocate a page-sized L3 per process and be done with
>> it. A page overhead per process is nothing to be concerned about
>> (clearly the overhead can be even bigger if, for example, you run
>> 4-level tables on x86_64). Recopying the L3 entries every TLB flush
>> will *not* perform well on current Xen.
>
> Well we actually don't do complete TLB flushes very often at all,
> essentially only the first time a new L3 entry is created by a running
> process, which for most processes means never - as >1Gig processes are
> rare. So it shouldn't matter if they hit one or two slowish flushes.
>
> Are there any other performance implications to watch out for?

I don't think so. Just remember that PAE L3 entry updates are not fast. We
really expect it only to happen on process creation/destruction (or
similar frequency).

 -- Keir