Konrad Rzeszutek Wilk
2011-Nov-29 18:33 UTC
[PATCH 0 of 2] Documentation patches for HYPERVISOR_mmu_update (v1).
This series documents some of the requirements for using HYPERVISOR_mmu_update that are not spelled out in detail. It is x86_64 and Linux centric, since those are the only ones I am familiar with. I can extend it in the future to describe other architectures as well.
Konrad Rzeszutek Wilk
2011-Nov-29 18:33 UTC
[PATCH 1 of 2] doc: Update MMU_NORMAL_PT_UPDATE requirements
# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1322586121 18000
# Node ID 789429b7859a6791cb0a08ba93064b50c9272218
# Parent a2cb7ed6d0a2ee5aecb3a988750ce9c8d8b718ee
doc: Update MMU_NORMAL_PT_UPDATE requirements.

There are some implicit requirements when using the hypercall which are not mentioned. Mainly the requirement that the pagetable be RO.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r a2cb7ed6d0a2 -r 789429b7859a xen/include/public/xen.h
--- a/xen/include/public/xen.h Mon Nov 28 17:42:40 2011 +0000
+++ b/xen/include/public/xen.h Tue Nov 29 12:02:01 2011 -0500
@@ -187,6 +187,40 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
  * FD == DOMID_XEN: Map restricted areas of Xen's heap space.
  * ptr[:2] -- Machine address of the page-table entry to modify.
  * val -- Value to write.
+ *
+ * There also certain implicit requirements when using this hypercall. The
+ * pages that make up a pagetable must be mapped read-only in the guest.
+ * This prevents uncontrolled guest updates to the pagetable. Xen strictly
+ * enforces this, and will disallow any pagetable update which will end up
+ * mapping pagetable page RW, and will disallow using any writable page as a
+ * pagetable. In practice it means that when constructing a page table for a
+ * process, thread, etc, we MUST be very dilligient in following these rules:
+ *  1). Start with top-level page (PGD or in Xen language: L4). Fill out
+ *      the entries.
+ *  2). Keep on going, filling out the upper (PUD or L3), and middle (PMD
+ *      or L2).
+ *  3). Start filling out the PTE table (L1) with the PTE entries. Once
+ *      done, make sure to set each of those entries to RO (so writeable bit
+ *      is unset). Once that has been completed, set the PMD (L2) for this
+ *      PTE table as RO.
+ *  4). When completed with all of the PMD (L2) entries, and all of them have
+ *      been set to RO, make sure to set RO the PUD (L3). Do the same
+ *      operation on PGD (L4) pagetable entries that have a PUD (L3) entry.
+ *  5). Now before you can use those pages (so setting the cr3), you MUST also
+ *      pin them so that the hypervisor can verify the entries. This is done
+ *      via the HYPERVISOR_mmuext_op(MMUEXT_PIN_L4_TABLE, guest physical frame
+ *      number of the PGD (L4)). And this point the HYPERVISOR_mmuext_op(
+ *      MMUEXT_NEW_BASEPTR, guest physical frame number of the PGD (L4)) can be
+ *      issued.
+ * For 32-bit guests, the L4 is not used (as there is less pagetables), so
+ * instead use L3.
+ * At this point the pagetables can be modified using the MMU_NORMAL_PT_UPDATE
+ * hypercall. Also if so desired the OS can also try to write to the PTE
+ * and be trapped by the hypervisor (as the PTE entry is RO).
+ *
+ * To deallocate the pages, the operations are the reverse of the steps
+ * mentioned above. The argument is MMUEXT_UNPIN_TABLE for all levels and the
+ * pagetable MUST not be in use (meaning that the cr3 is not set to it).
  *
  * ptr[1:0] == MMU_MACHPHYS_UPDATE:
  * Updates an entry in the machine->pseudo-physical mapping table.
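
[Editorial sketch, not part of the patch.] The comment above describes two mechanical pieces: clearing the writeable bit (bit 1) so a pagetable page is mapped RO, and encoding the command in ptr[1:0] of an update request. The C below illustrates both under stated assumptions. The struct name mmu_update_sketch and the helpers pte_make_ro and make_normal_update are hypothetical names chosen for illustration; the struct is a local stand-in with one 64-bit ptr and one 64-bit val, and no hypercall is issued.

```c
#include <assert.h>
#include <stdint.h>

/* Command value carried in ptr[1:0], as named in xen.h. */
#define MMU_NORMAL_PT_UPDATE 0

/* "writeable" bit of a page-table entry (bit 1). */
#define PTE_RW (1ULL << 1)

/* Hypothetical local stand-in for one update request. */
struct mmu_update_sketch {
    uint64_t ptr;  /* machine address of the PTE, command in bits 1:0 */
    uint64_t val;  /* value to write into the PTE */
};

/* Clear the writeable bit so the resulting mapping is read-only. */
static uint64_t pte_make_ro(uint64_t val)
{
    return val & ~PTE_RW;
}

/* Build one MMU_NORMAL_PT_UPDATE request for the PTE at pte_maddr. */
static struct mmu_update_sketch make_normal_update(uint64_t pte_maddr,
                                                   uint64_t val)
{
    struct mmu_update_sketch req;

    /* PTEs are 8 bytes wide, so the low address bits are free for the command. */
    assert((pte_maddr & 7) == 0);
    req.ptr = pte_maddr | MMU_NORMAL_PT_UPDATE;
    req.val = val;
    return req;
}
```

A guest following step 3 would pass pte_make_ro(val) as the value for every entry that maps a pagetable page, then hand the requests to the real hypercall.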
Konrad Rzeszutek Wilk
2011-Nov-29 18:33 UTC
[PATCH 2 of 2] doc: Update MMU_NORMAL_PT_UPDATE about the val
# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1322591439 18000
# Node ID f2d0c5fee64cdad8c9c04bd2c7cf44a7183345d5
# Parent 789429b7859a6791cb0a08ba93064b50c9272218
doc: Update MMU_NORMAL_PT_UPDATE about the val.

The val is used as the pagetable entry with the machine frame number and some page table bits. This explains what those page table bits are.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r 789429b7859a -r f2d0c5fee64c xen/include/public/xen.h
--- a/xen/include/public/xen.h Tue Nov 29 12:02:01 2011 -0500
+++ b/xen/include/public/xen.h Tue Nov 29 13:30:39 2011 -0500
@@ -231,6 +231,72 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
  * ptr[1:0] == MMU_PT_UPDATE_PRESERVE_AD:
  * As MMU_NORMAL_PT_UPDATE above, but A/D bits currently in the PTE are ORed
  * with those in @val.
+ *
+ * @val is usually the machine frame number along with some attributes.
+ * The attributes by default follow the architecture defined bits. Meaning that
+ * if this is a X86_64 machine and four page table layout is used, the layout
+ * of val is:
+ *  - 63 if set means No execute (NX)
+ *  - 46-13 the machine frame number
+ *  - 12 available for guest
+ *  - 11 available for guest
+ *  - 10 available for guest
+ *  - 9 available for guest
+ *  - 8 global
+ *  - 7 PAT (PSE is disabled, must use hypercall to make 4MB or 2MB pages)
+ *  - 6 dirty
+ *  - 5 accessed
+ *  - 4 page cached disabled
+ *  - 3 page write through
+ *  - 2 userspace accessible
+ *  - 1 writeable
+ *  - 0 present
+ *
+ * The one bits that does not fit with the default layout is the PAGE_PSE
+ * also called PAGE_PAT). The MMUEXT_[UN]MARK_SUPER arguments to the
+ * HYPERVISOR_mmuext_op serve as mechanism to set a pagetable to be 4MB
+ * (or 2MB) instead of using the PAGE_PSE bit.
+ *
+ * The reason that the PAGE_PSE (bit 7) is not being utilized is due to Xen
+ * using it as the Page Attribute Table (PAT) bit - for details on it please
+ * refer to Intel SDM 10.12. The PAT allows to set the caching attributes of
+ * pages instead of using MTRRs.
+ *
+ * The PAT MSR is as follow (it is a 64-bit value, each entry is 8 bits):
+ *           PAT4                 PAT0
+ * +---+----+----+----+-----+----+----+
+ *  WC | WC | WB | UC | UC- | WC | WB |  <= Linux
+ * +---+----+----+----+-----+----+----+
+ *  WC | WT | WB | UC | UC- | WT | WB |  <= BIOS (default when machine boots)
+ * +---+----+----+----+-----+----+----+
+ *  WC | WP | WC | UC | UC- | WT | WB |  <= Xen
+ * +---+----+----+----+-----+----+----+
+ *
+ * The lookup of this index table translates to looking up
+ * Bit 7, Bit 4, and Bit 3 of val entry:
+ *
+ * PAT/PSE (bit 7) ... PCD (bit 4) .. PWT (bit 3).
+ *
+ * If all bits are off, then we are using PAT0. If bit 3 turned on,
+ * then we are using PAT1, if bit 3 and bit 4, then PAT2..
+ *
+ * As you can see, the Linux PAT1 translates to PAT4 under Xen. Which means
+ * that if a guest that follows Linux's PAT setup and would like to set Write
+ * Combined on pages it MUST use PAT4 entry. Meaning that Bit 7 (PAGE_PAT) is
+ * set. For example, under Linux it only uses PAT0, PAT1, and PAT2 for the
+ * caching as:
+ *
+ *  WB = none (so PAT0)
+ *  WC = PWT (bit 3 on)
+ *  UC = PWT | PCD (bit 3 and 4 are on).
+ *
+ * To make it work with Xen, it needs to translate the WC bit as so:
+ *
+ *  PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3
+ *
+ * And to translate back it would:
+ *
+ *  PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.
  */
 #define MMU_NORMAL_PT_UPDATE 0 /* checked '*ptr = val'. ptr is MA. */
 #define MMU_MACHPHYS_UPDATE 1 /* ptr = MA of frame to modify entry for */
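
[Editorial sketch, not part of the patch.] The WC translation described at the end of the comment can be illustrated with a few lines of C. This is a minimal sketch under assumptions: the macro names PTE_PWT, PTE_PCD, PTE_PAT and the helpers pat_index, wc_linux_to_xen and wc_xen_to_linux are hypothetical names chosen here, and the bit positions (3, 4, 7) are the ones listed in the comment. Only an entry whose cache bits are exactly "PWT alone" (Linux WC) or "PAT alone" (Xen WC) is rewritten; everything else passes through unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Cache-control bits of val, at the positions listed in the comment. */
#define PTE_PWT (1ULL << 3)
#define PTE_PCD (1ULL << 4)
#define PTE_PAT (1ULL << 7)
#define PTE_CACHE_MASK (PTE_PAT | PTE_PCD | PTE_PWT)

/* PAT index selected by val: bit 7 is the high bit, bit 4 the middle, bit 3 the low. */
static unsigned pat_index(uint64_t val)
{
    return (unsigned)((((val >> 7) & 1) << 2) |
                      (((val >> 4) & 1) << 1) |
                      ((val >> 3) & 1));
}

/* Linux WC (PWT only, PAT1) becomes Xen WC (PAT only, PAT4). */
static uint64_t wc_linux_to_xen(uint64_t val)
{
    if ((val & PTE_CACHE_MASK) == PTE_PWT) {
        val = (val & ~PTE_PWT) | PTE_PAT;
        assert(pat_index(val) == 4);
    }
    return val;
}

/* Xen WC (PAT only, PAT4) becomes Linux WC (PWT only, PAT1). */
static uint64_t wc_xen_to_linux(uint64_t val)
{
    if ((val & PTE_CACHE_MASK) == PTE_PAT) {
        val = (val & ~PTE_PAT) | PTE_PWT;
        assert(pat_index(val) == 1);
    }
    return val;
}
```

Note that with the index formed this way, bit 4 alone selects index 2 and bits 3 and 4 together select index 3, which is how the architectural PAT lookup is defined; the helper follows that encoding.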
Ian Jackson
2011-Nov-29 18:47 UTC
Re: [PATCH 0 of 2] Documentation patches for HYPERVISOR_mmu_update (v1).
Konrad Rzeszutek Wilk writes ("[PATCH 0 of 2] Documentation patches for HYPERVISOR_mmu_update (v1)."):
> Documenting some of the requirements when using HYPERVISOR_mmu_update
> that are not spelled in details. It is x86_64 and Linux centric
> since those are the only ones that I am quite familiar with.

Both applied, thanks.

Ian.