Hey Tim,

I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
machines the end result was a slower boot time!

The culprit looks to be the heap_lock, which is taken and released
for every MFN. (For fun I added a bit of code to do batches of 32
MFNs and iterate over those 32 MFNs while holding the lock - that
did make it a bit faster, but not by much.)

What I am wondering is:
 - Have you ever thought about optimizing this? If so, how?
 - Another idea to potentially make this faster is to separate the
   scrubbing into two stages:
   1) (under the heap_lock) reserve/take a giant set of MFN pages
      (perhaps also consulting NUMA affinity). This would be
      usurping the whole heap[zone].
   2) Hand them out to the CPUs to scrub (done without holding the
      spinlock). The heap[zone] would be split equally amongst the
      CPUs.
   3) Goto 1 until done.
 - Look at how the Linux kernel does this for examples.

Thanks!
On 15/07/13 16:15, Konrad Rzeszutek Wilk wrote:
> Hey Tim,
>
> I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
> boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
> each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
> machines the end result was a slower boot time!
>
> The culprit looks to be the heap_lock, which is taken and released
> for every MFN. (For fun I added a bit of code to do batches of 32
> MFNs and iterate over those 32 MFNs while holding the lock - that
> did make it a bit faster, but not by much.)
>
> What I am wondering is:
>  - Have you ever thought about optimizing this? If so, how?
>  - Another idea to potentially make this faster is to separate the
>    scrubbing into two stages:
>    1) (under the heap_lock) reserve/take a giant set of MFN pages
>       (perhaps also consulting NUMA affinity). This would be
>       usurping the whole heap[zone].
>    2) Hand them out to the CPUs to scrub (done without holding the
>       spinlock). The heap[zone] would be split equally amongst the
>       CPUs.
>    3) Goto 1 until done.
>  - Look at how the Linux kernel does this for examples.
>
> Thanks!

Hi Konrad,

Did you see a patch I posted for this last year?
http://lists.xen.org/archives/html/xen-devel/2012-05/msg00701.html

Unfortunately I made some minor errors and it didn't apply cleanly,
but I'll fix it up now and repost so you can test it.

Malcolm

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Hi,

At 11:15 -0400 on 15 Jul (1373886925), Konrad Rzeszutek Wilk wrote:
> Hey Tim,
>
> I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
> boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
> each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
> machines the end result was a slower boot time!
>
> The culprit looks to be the heap_lock, which is taken and released
> for every MFN. (For fun I added a bit of code to do batches of 32
> MFNs and iterate over those 32 MFNs while holding the lock - that
> did make it a bit faster, but not by much.)
>
> What I am wondering is:
>  - Have you ever thought about optimizing this? If so, how?

Malcolm Crossley posted an RFC patch a while ago to do this kind of
stuff -- it parcelled out RAM to socket-local CPUs and IIRC took the
heap_lock once for all on the coordinating CPU.

http://lists.xen.org/archives/html/xen-devel/2012-05/msg00701.html

AIUI he's going to send a v2 now that 4.3 is done.

Tim.
On Mon, Jul 15, 2013 at 04:46:37PM +0100, Malcolm Crossley wrote:
> On 15/07/13 16:15, Konrad Rzeszutek Wilk wrote:
> > Hey Tim,
> >
> > I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
> > boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
> > each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
> > machines the end result was a slower boot time!
> >
> > The culprit looks to be the heap_lock, which is taken and released
> > for every MFN. (For fun I added a bit of code to do batches of 32
> > MFNs and iterate over those 32 MFNs while holding the lock - that
> > did make it a bit faster, but not by much.)
> >
> > What I am wondering is:
> >  - Have you ever thought about optimizing this? If so, how?
> >  - Another idea to potentially make this faster is to separate the
> >    scrubbing into two stages:
> >    1) (under the heap_lock) reserve/take a giant set of MFN pages
> >       (perhaps also consulting NUMA affinity). This would be
> >       usurping the whole heap[zone].
> >    2) Hand them out to the CPUs to scrub (done without holding the
> >       spinlock). The heap[zone] would be split equally amongst the
> >       CPUs.
> >    3) Goto 1 until done.
> >  - Look at how the Linux kernel does this for examples.
> >
> > Thanks!
>
> Hi Konrad,
>
> Did you see a patch I posted for this last year?
> http://lists.xen.org/archives/html/xen-devel/2012-05/msg00701.html

I did not.

> Unfortunately I made some minor errors and it didn't apply cleanly,
> but I'll fix it up now and repost so you can test it.

Ah, it follows similar logic to mine. I used a tasklet but you are
using IPIs; that might be better. I will wait for your patch and test
it out.

Thanks!
On Mon, Jul 15, 2013 at 04:46:37PM +0100, Malcolm Crossley wrote:
> On 15/07/13 16:15, Konrad Rzeszutek Wilk wrote:
> > Hey Tim,
> >
> > I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
> > boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
> > each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
> > machines the end result was a slower boot time!
> >
> > The culprit looks to be the heap_lock, which is taken and released
> > for every MFN. (For fun I added a bit of code to do batches of 32
> > MFNs and iterate over those 32 MFNs while holding the lock - that
> > did make it a bit faster, but not by much.)
> >
> > What I am wondering is:
> >  - Have you ever thought about optimizing this? If so, how?
> >  - Another idea to potentially make this faster is to separate the
> >    scrubbing into two stages:
> >    1) (under the heap_lock) reserve/take a giant set of MFN pages
> >       (perhaps also consulting NUMA affinity). This would be
> >       usurping the whole heap[zone].
> >    2) Hand them out to the CPUs to scrub (done without holding the
> >       spinlock). The heap[zone] would be split equally amongst the
> >       CPUs.
> >    3) Goto 1 until done.
> >  - Look at how the Linux kernel does this for examples.
> >
> > Thanks!
>
> Hi Konrad,
>
> Did you see a patch I posted for this last year?
> http://lists.xen.org/archives/html/xen-devel/2012-05/msg00701.html
>
> Unfortunately I made some minor errors and it didn't apply cleanly,
> but I'll fix it up now and repost so you can test it.

I took a stab at it (your updated one), and this is what I found
(this is on a 4-CPU box; the numbers are cycle counts):

14112560772 <- original
14006409540 <- mine (tasklet), using the old per-MFN heap_lock
 1331412384 <- Malcolm's IPI (heap_lock held for a long time)
 1374497324 <- mine (tasklet), heap_lock held for a long time

Meaning that your usage of IPIs is superior. The heap_lock is held
for chunk_size across all of the CPUs, and that looks OK to me.
Looking forward to seeing you post the patch. Thanks!