Mark Johnson ( / Joe B.) wrote:
> Joe's response..
>
> ---
>
>
> -------- Original Message --------
> Subject: Re: [Fwd: Re: [xen-discuss] xvm 32-bit: excessive number of
>  pagefaults after fork() / copy-on-wri]
> Date: Thu, 24 Apr 2008 12:00:41 -0700
> From: Joe B...
>
>
> >
> >
> > The hypervisor is told to invalidate the page that contains the
> > PTE (via HYPERVISOR_update_va_mapping, va cda02000 flags 2),
> > but the CPU / MMU isn't told that the mapping for the virtual
> > stack address 8047000 has changed. Isn't it possible that the
> > CPU / MMU / TLB has cached the information "virtual stack address
> > 8047000 is not a valid address" after the call to x86pte_inval()?
>
> Nope. No hardware ever works this way. It only caches information in
> the TLB when the lowest bit (PRESENT) is set in the PTE. Therefore,
> when setting an entry that is zero to a non-zero value, an INVLPG
> instruction is never needed on hardware.
Hmm, Ok.
But some processors seem to cache non-leaf page table entries;
according to a comment I found in uts/i86pc/vm/htable.c, function
unlink_ptp():
	/*
	 * When a top level VLP page table entry changes, we must issue
	 * a reload of cr3 on all processors.
	 *
	 * If we don't need do do that, then we still have to INVLPG against
	 * an address covered by the inner page table, as the latest processors
	 * have TLB-like caches for non-leaf page table entries.
	 */
	if (!(hat->hat_flags & HAT_FREEING)) {
		hat_tlb_inval(hat, (higher->ht_flags & HTABLE_VLP) ?
		    DEMAP_ALL_ADDR : old->ht_vaddr);
	}
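To make the two cases concrete, here is a toy C model of the leaf-entry rule Joe describes. This is a sketch, not kernel code: invlpg() is just a counter standing in for the real instruction, and set_pte() / PT_PRESENT are invented names.

```c
#include <stdint.h>

#define PT_PRESENT 0x1ULL

/* Stand-in for the INVLPG instruction: just count invalidations. */
static int invlpg_count;
static void invlpg(uintptr_t va) { (void)va; invlpg_count++; }

/* Install a leaf PTE, invalidating only when the old entry was
 * present: a zero -> non-zero transition can never have been cached
 * in the TLB, so no flush is architecturally required for it. */
static void set_pte(uint64_t *ptep, uint64_t new_pte, uintptr_t va)
{
	uint64_t old = *ptep;

	*ptep = new_pte;
	if (old & PT_PRESENT)
		invlpg(va);
}
```

The htable.c comment above is exactly where this leaf rule stops being sufficient: newer CPUs also cache non-leaf (paging-structure) entries, so unlink_ptp() must flush even though no present leaf entry changed.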
An additional thing that appears to be important with this xvm
performance / spurious page fault problem is that the sequence
of events is something like this:
- fork() is called, which creates page tables for the child
  process, but there are no valid page table entries for the stack
- on return from the kernel, the __forkx code in libc.so triggers a
  read stack page fault first; this creates a leaf page table for
  the stack and installs a readonly copy of a page from the parent
  process' stack (this is the only page in the leaf page table)
- next, there is a write attempt to the readonly stack page; a
  writable copy of the page is created and installed in the page
  table. This is done by removing the readonly page and invalidating
  the TLB for the stack VA; after that the new writable page is
  installed.
| Since there is exactly one page at the time the readonly page gets
| removed, the whole leaf page table for the stack is removed, too.
| (ht->ht_valid_cnt == 0 in htable_release(), so that unlink_ptp()
| gets called)
And when the new writable stack page is installed, no leaf page table
is found, so a new leaf page table must be constructed before the
writable stack pte can be installed.
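That lifecycle can be modeled with a toy counter. Everything here is invented for illustration (leaf_allocs and the function names are not the kernel's); only the valid_cnt idea mirrors ht_valid_cnt:

```c
/* Toy model of the stack leaf page table's lifecycle. */
static int leaf_allocs;		/* how many times the leaf table is built */
static int valid_cnt = -1;	/* -1: no leaf table exists */

static void install_stack_pte(void)
{
	if (valid_cnt < 0) {	/* no leaf table: build one */
		leaf_allocs++;
		valid_cnt = 0;
	}
	valid_cnt++;
}

static void remove_stack_pte(void)
{
	if (--valid_cnt == 0)
		valid_cnt = -1;	/* ht_valid_cnt == 0: table freed */
}
```

Running install / remove / install in the order described above yields two leaf-table allocations, matching the destroy-and-recreate cycle.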
When I change the __forkx code in libc.so.1 like this, the spurious
page faults are gone (I removed all of my extra TLB flush workarounds
from the kernel):
% more .hg/patches/xvm_32bit_fork_performance
diff --git a/usr/src/lib/libc/i386/sys/forkx.s b/usr/src/lib/libc/i386/sys/forkx.s
--- a/usr/src/lib/libc/i386/sys/forkx.s
+++ b/usr/src/lib/libc/i386/sys/forkx.s
@@ -48,8 +48,17 @@
pushl $0
pushl %ecx
SYSTRAP_2RVALS(forksys)
+ movl %eax, 4(%esp) /* write something to the stack, this avoids
+ * page faulting twice. Otherwise we fault
+ * on the next "popl" installing a shared RO
+ * page from the parent, and get a COW fault
+ * on the next "movl %ecx, (%esp)" instruction.
+ *
+ * This works around a fork performance problem
+ * in xvm 32-bit PV dom0/domU.
+ */
popl %ecx
- movl %ecx, 0(%esp)
+ movl %ecx, (%esp)
SYSCERROR
testl %edx, %edx
jz 1f /* jump if parent */
Btw, instead of using "movl %eax, 4(%esp)" to write something to the
current stack page, I can also use a read like "movl -4096(%esp), %ecx"
so that there is another page that prevents destroying the stack leaf
page table on the COW fault.
Both of the above changes avoid destroying and recreating
the leaf page table for the stack VA area.
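As a sanity check on that reasoning, here is a toy model of the two access orders. touch() and the page states are invented for this sketch; the fault counts follow the description above, not measurements:

```c
/* Toy model of the stack page states after fork():
 *   *page_ro == -1 -> no mapping yet
 *   *page_ro ==  1 -> readonly page shared with the parent
 *   *page_ro ==  0 -> private writable page
 * touch() simulates one memory access and counts the page faults
 * it would take. */
static int faults;

static void touch(int is_write, int *page_ro)
{
	if (*page_ro < 0) {
		/* first access faults and installs a page: a write
		 * gets a private writable page immediately, a read
		 * gets the shared readonly copy */
		*page_ro = is_write ? 0 : 1;
		faults++;
	} else if (is_write && *page_ro == 1) {
		/* write to the readonly page: COW fault replaces it */
		*page_ro = 0;
		faults++;
	}
}
```

Read-then-write (the original popl/movl order) takes a read fault plus a COW fault; writing first takes a single write fault, so the readonly page is never installed and the leaf table is never torn down in between.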
> > htable.c x86pte_set() does a TLB flush when the old PTE
> > referred to a referenced page, but it doesn't update the TLB
> > when an empty PTE was replaced with a new translation:
> >
> > 2090 /*
> ...
> >
> >
> >
> > Btw. I've been experimenting with this change to x86pte_set()
> > (lines 2103 ... 2111 added):
> >
> > 2090 /*
> ...
> > 2111 #endif
> >
> >
> > With xpv_page_fault_hack := 0 I get the original code.
> >
> > With xpv_page_fault_hack := 2 I try to do an INVLPG on the
> > newly installed translation. But that hasn't fixed the issue...
> >
> > But with xpv_page_fault_hack := 1 the entire TLB gets flushed
> > when installing new stack pages, and now:
>
> That would confirm a bug in Xen.. both xen_flush_tlb() and xen_flush_va()
> should have identical behavior here - and shouldn't be necessary either.
>
> >
> > 1. the libMicro-0.4.0 fork_100 test runs ~30x faster in a 32-bit PV
> >    domU!  (800 seconds -> 28 seconds)
> > 2. ./boot/solaris/bin/create_ramdisk runs ~4x faster in a 32-bit PV
> >    domU!  (2 minutes -> 36 seconds)
> >
> >
> > So it seems that there is an issue with the TLB in 32-bit xVM PV
> > doms...
>
> The bug is probably in TLB flushing management in the Xen code itself.
> I know they've said in the past that they do all kinds of very crafty
> optimizations to avoid unnecessary invalidates in the hypervisor.
They do seem to defer flushing the whole TLB; see the DOP_FLUSH_TLB /
queue_deferred_ops() code in xen.hg/xen/mm.c.
INVLPG isn't deferred, as far as I understand it.
> I suspect they've got a bug.
It seems the problem is that the MMU does not notice that the leaf page
table for the stack got replaced, and keeps looking at old cached page
table data, or something like that. Flushing the whole TLB with a %CR3
write works around the problem.
Otherwise I seem to get spurious page faults in a loop until a process
switch happens, which installs a new %CR3 VA/PA translation table and
breaks the loop.