Chris Lalancette
2009-Jul-02 14:47 UTC
[Xen-devel] Latency spike during page_scrub_softirq
All,

This is a topic which has been brought up before, but is still something that plagues certain machines.  Let me describe the test scenario, the problem as I see it, and some of the data that I've collected.

TEST
----
The test is on a large AMD NUMA machine with 128GB of memory and 32 cpus (8 x quad-core), memory interleaved, running RHEL-5.4 Xen (although I believe the issue probably affects upstream Xen as well).  I install 2 RHEL-5.3 guests, one with 32GB of memory, and one with 64GB of memory.  On the first guest, I run a continuous ping (just out to the default gateway).  While that ping test is running, on the dom0 I do "xm destroy <64GB_guest>".  This takes a while to complete (as expected), but what is not expected is some huge jumps in the ping responses on the 32GB domain.  For instance, in the test I'm currently running, normal ping response time is ~0.5ms, but during the xm destroy of the other domain the ping response can jump all the way up to 3000 (or more) ms.  Once the big domain destroy is finished, everything returns to normal.

PROBLEM
-------
From what I can tell, the problem lies in page_scrub_softirq().  As a first test, I disabled page-scrubbing completely (obviously insecure, but just a test).  With no page-scrubbing at all, and direct memory freeing in free_domheap_pages(), no delays of the kind experienced in the original test were seen.  As a second test, I implemented the page scrubbing inside free_domheap_pages(), and again, no spikes at all were seen.

I then put things back like they were, and instrumented page_scrub_softirq().  Now, the serialize_lock at the top of the function makes sure only one CPU at a time comes in here.  However, when I instrumented the rest of the function, I found that when a CPU was in here doing work, it was spending 80-95% of its time waiting to get the page_scrub_lock (I have raw numbers, if you want to see them).  At first I thought this was purely contention with the other page_scrub_lock user in free_domheap_pages().  However, after changing the spin_lock(&page_scrub_lock) into a spin_trylock() inside page_scrub_softirq(), I still saw the spikes in the ping test, even though my instrumentation showed I was only waiting 20-30% of the time on the spinlock.  So I can't fully explain the rest of the spike.  Any ideas?  Other things I should probe?

SOLUTION
--------
There are a couple of solutions that I can think of:

1)  Just clear the pages inside free_domheap_pages().  I tried this with a 64GB guest as mentioned above, and I didn't see any ill effects from doing so.  It seems like this might actually be a valid way to go, although then a single CPU is doing all of the work of freeing the pages (might be a problem on UP systems).

2)  Clear the pages inside free_domheap_pages(), but do some kind of yield every once in a while (a rough sketch of this idea is below).  I don't know how feasible this would be.

3)  Do a lockless FIFO between free_domheap_pages() and page_scrub_softirq() (since that is all it really is).  While this would certainly work, it seems like a bit of overengineering for this problem.

Other ideas?  I'm happy to try to implement these, I'm just not sure what we would prefer to do.

-- 
Chris Lalancette
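To make option (2) concrete, here is a rough, untested sketch of what scrubbing in the free path with an occasional yield might look like.  scrub_one_page() and page_set_owner() are the existing helpers seen later in this thread, and process_pending_softirqs() is a real Xen helper; the batch size, the free_heap_pages(pg, order) call, and the assumption that it is safe to process softirqs at this point are illustrative guesses, not something verified here.

    /* Hypothetical sketch of option (2): scrub a dying domain's pages
     * synchronously in free_domheap_pages(), but let other softirqs run
     * every so often so one CPU does not disappear for seconds at a time.
     * The batch size (256 pages) is arbitrary, and whether calling
     * process_pending_softirqs() is safe in this context is an assumption. */
    for ( i = 0; i < (1 << order); i++ )
    {
        page_set_owner(&pg[i], NULL);
        scrub_one_page(&pg[i]);

        /* Periodically service pending softirqs (e.g. the timer) so the
         * ping-latency spike described above does not simply move here. */
        if ( (i & 0xff) == 0xff )
            process_pending_softirqs();
    }
    free_heap_pages(pg, order);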
On 02/07/2009 15:47, "Chris Lalancette" <clalance@redhat.com> wrote:

> There are a couple of solutions that I can think of:
> 1)  Just clear the pages inside free_domheap_pages().  I tried this with a
> 64GB guest as mentioned above, and I didn't see any ill effects from doing
> so.  It seems like this might actually be a valid way to go, although then
> a single CPU is doing all of the work of freeing the pages (might be a
> problem on UP systems).

Now that domain destruction is preemptible all the way back up to libxc, I think the page-scrub queue is not so much required.  And it seems it never worked very well anyway!  I will remove it.

This may make 'xm destroy' operations take a while, but actually this may be more sensibly handled by punting the destroy hypercall into another thread at dom0 userspace level, rather than doing the shonky 'scheduling' we attempt in Xen itself right now.

 -- Keir
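For reference, one way the userspace-thread idea could look.  This is purely an illustrative sketch, not code from xend or libxc: xc_domain_destroy() is the real libxc call, but the int-handle signature is assumed from the libxc of that era, and the wrapper names and structure are hypothetical.

    /* Illustrative only: run the (now preemptible) domain-destroy hypercall
     * from a worker thread in dom0 userspace, so the management tool returns
     * immediately instead of blocking for a large guest. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <xenctrl.h>

    struct destroy_job {
        int       xc_handle;   /* from xc_interface_open() */
        uint32_t  domid;
    };

    static void *destroy_worker(void *arg)
    {
        struct destroy_job *job = arg;

        /* The long-running, preemptible hypercall happens here. */
        xc_domain_destroy(job->xc_handle, job->domid);
        free(job);
        return NULL;
    }

    /* Fire-and-forget destroy: the caller is not held up while Xen scrubs
     * and frees the guest's memory.  Hypothetical helper name. */
    static int destroy_domain_async(int xc_handle, uint32_t domid)
    {
        struct destroy_job *job = malloc(sizeof(*job));
        pthread_t tid;

        if ( job == NULL )
            return -1;
        job->xc_handle = xc_handle;
        job->domid = domid;
        if ( pthread_create(&tid, NULL, destroy_worker, job) )
        {
            free(job);
            return -1;
        }
        pthread_detach(tid);
        return 0;
    }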
Chris Lalancette
2009-Jul-03 07:32 UTC
Re: [Xen-devel] Latency spike during page_scrub_softirq
Keir Fraser wrote:
> On 02/07/2009 15:47, "Chris Lalancette" <clalance@redhat.com> wrote:
>
>> There are a couple of solutions that I can think of:
>> 1)  Just clear the pages inside free_domheap_pages().  I tried this with a
>> 64GB guest as mentioned above, and I didn't see any ill effects from doing
>> so.  It seems like this might actually be a valid way to go, although then
>> a single CPU is doing all of the work of freeing the pages (might be a
>> problem on UP systems).
>
> Now that domain destruction is preemptible all the way back up to libxc, I
> think the page-scrub queue is not so much required. And it seems it never
> worked very well anyway! I will remove it.
>
> This may make 'xm destroy' operations take a while, but actually this may be
> more sensibly handled by punting the destroy hypercall into another thread
> at dom0 userspace level, rather than doing the shonky 'scheduling' we
> attempt in Xen itself right now.

Yep, agreed, and I see you've committed as c/s 19886.  Except...

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
...
@@ -1247,10 +1220,7 @@ void free_domheap_pages(struct page_info
         for ( i = 0; i < (1 << order); i++ )
         {
             page_set_owner(&pg[i], NULL);
-            spin_lock(&page_scrub_lock);
-            page_list_add(&pg[i], &page_scrub_list);
-            scrub_pages++;
-            spin_unlock(&page_scrub_lock);
+            scrub_one_page(&pg[i]);
         }
     }
 }

This hunk actually needs to free the page as well, with free_heap_pages().

-- 
Chris Lalancette
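For clarity, the combined result Chris is pointing at would look roughly like the following.  The free_heap_pages(pg, order) signature is assumed from the tree being patched; the actual follow-up changeset may differ.

    /* Rough sketch of the fix being described: scrub each page of the dying
     * domain's allocation, then actually return the whole order-sized block
     * to the heap rather than leaking it. */
    for ( i = 0; i < (1 << order); i++ )
    {
        page_set_owner(&pg[i], NULL);
        scrub_one_page(&pg[i]);
    }
    free_heap_pages(pg, order);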
On 03/07/2009 08:32, "Chris Lalancette" <clalance@redhat.com> wrote:

> This hunk actually needs to free the page as well, with free_heap_pages().

Yeah, oops!

 -- Keir