Hey Tim,

I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
machines the end result was a slower boot time!

The culprit looks to be the heap_lock, which is taken and released
for every MFN. (For fun I added a bit of code to do batches of 32
MFNs and iterate over those 32 MFNs while holding the lock - that
did make it a bit faster, but not by much.)

What I am wondering is:
 - Have you ever thought about optimizing this? If so, how?
 - Another idea to potentially make this faster is to separate the
   scrubbing into two stages:
   1) (under the heap_lock) reserve/take a giant set of MFN pages
      (perhaps also consulting NUMA affinity). This would be
      usurping the whole heap[zone].
   2) Hand them out to the CPUs to scrub (done without holding the
      spinlock). The heap[zone] would be split equally amongst the
      CPUs.
   3) Goto 1 until done.
 - Look at how the Linux kernel does this for examples.

Thanks!
On 15/07/13 16:15, Konrad Rzeszutek Wilk wrote:
> Hey Tim,
>
> I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
> boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
> each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
> machines the end result was a slower boot time!
>
> The culprit looks to be the heap_lock, which is taken and released
> for every MFN. (For fun I added a bit of code to do batches of 32
> MFNs and iterate over those 32 MFNs while holding the lock - that
> did make it a bit faster, but not by much.)
>
> What I am wondering is:
>  - Have you ever thought about optimizing this? If so, how?
>  - Another idea to potentially make this faster is to separate the
>    scrubbing into two stages:
>    1) (under the heap_lock) reserve/take a giant set of MFN pages
>       (perhaps also consulting NUMA affinity). This would be
>       usurping the whole heap[zone].
>    2) Hand them out to the CPUs to scrub (done without holding the
>       spinlock). The heap[zone] would be split equally amongst the
>       CPUs.
>    3) Goto 1 until done.
>  - Look at how the Linux kernel does this for examples.
>
> Thanks!

Hi Konrad,

Did you see a patch I posted for this last year?
http://lists.xen.org/archives/html/xen-devel/2012-05/msg00701.html

Unfortunately I made some minor errors and it didn't apply cleanly,
but I'll fix it up now and repost so you can test it.

Malcolm

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Hi,

At 11:15 -0400 on 15 Jul (1373886925), Konrad Rzeszutek Wilk wrote:
> Hey Tim,
>
> I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
> boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
> each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
> machines the end result was a slower boot time!
>
> The culprit looks to be the heap_lock, which is taken and released
> for every MFN. (For fun I added a bit of code to do batches of 32
> MFNs and iterate over those 32 MFNs while holding the lock - that
> did make it a bit faster, but not by much.)
>
> What I am wondering is:
>  - Have you ever thought about optimizing this? If so, how?

Malcolm Crossley posted an RFC patch a while ago to do this kind of
stuff -- it parcelled out RAM to socket-local CPUs and IIRC took the
heap_lock once for all on the coordinating CPU.

http://lists.xen.org/archives/html/xen-devel/2012-05/msg00701.html

AIUI he's going to send a v2 now that 4.3 is done.

Tim.
On Mon, Jul 15, 2013 at 04:46:37PM +0100, Malcolm Crossley wrote:
> On 15/07/13 16:15, Konrad Rzeszutek Wilk wrote:
> > Hey Tim,
> >
> > I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
> > boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
> > each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
> > machines the end result was a slower boot time!
> >
> > The culprit looks to be the heap_lock, which is taken and released
> > for every MFN. (For fun I added a bit of code to do batches of 32
> > MFNs and iterate over those 32 MFNs while holding the lock - that
> > did make it a bit faster, but not by much.)
> >
> > What I am wondering is:
> >  - Have you ever thought about optimizing this? If so, how?
> >  - Another idea to potentially make this faster is to separate the
> >    scrubbing into two stages:
> >    1) (under the heap_lock) reserve/take a giant set of MFN pages
> >       (perhaps also consulting NUMA affinity). This would be
> >       usurping the whole heap[zone].
> >    2) Hand them out to the CPUs to scrub (done without holding the
> >       spinlock). The heap[zone] would be split equally amongst the
> >       CPUs.
> >    3) Goto 1 until done.
> >  - Look at how the Linux kernel does this for examples.
> >
> > Thanks!
>
> Hi Konrad,
>
> Did you see a patch I posted for this last year?
> http://lists.xen.org/archives/html/xen-devel/2012-05/msg00701.html

I did not.

> Unfortunately I made some minor errors and it didn't apply cleanly,
> but I'll fix it up now and repost so you can test it.

Ah, it follows similar logic to mine. I used a tasklet but you are
using IPIs; that might be better. I will wait for your patch and test
it out.

Thanks!
On Mon, Jul 15, 2013 at 04:46:37PM +0100, Malcolm Crossley wrote:
> On 15/07/13 16:15, Konrad Rzeszutek Wilk wrote:
> > Hey Tim,
> >
> > I was looking at making the 'Scrubbing Free RAM:' code faster on 1TB
> > boxes with 128 CPUs. Naively, I wrote code that set up a tasklet on
> > each CPU to scrub a swath of MFNs. Unfortunately, even on 8-VCPU
> > machines the end result was a slower boot time!
> >
> > The culprit looks to be the heap_lock, which is taken and released
> > for every MFN. (For fun I added a bit of code to do batches of 32
> > MFNs and iterate over those 32 MFNs while holding the lock - that
> > did make it a bit faster, but not by much.)
> >
> > What I am wondering is:
> >  - Have you ever thought about optimizing this? If so, how?
> >  - Another idea to potentially make this faster is to separate the
> >    scrubbing into two stages:
> >    1) (under the heap_lock) reserve/take a giant set of MFN pages
> >       (perhaps also consulting NUMA affinity). This would be
> >       usurping the whole heap[zone].
> >    2) Hand them out to the CPUs to scrub (done without holding the
> >       spinlock). The heap[zone] would be split equally amongst the
> >       CPUs.
> >    3) Goto 1 until done.
> >  - Look at how the Linux kernel does this for examples.
> >
> > Thanks!
>
> Hi Konrad,
>
> Did you see a patch I posted for this last year?
> http://lists.xen.org/archives/html/xen-devel/2012-05/msg00701.html
>
> Unfortunately I made some minor errors and it didn't apply cleanly,
> but I'll fix it up now and repost so you can test it.

I took a stab at it (your updated one), and this is what I found
(this is on a 4-CPU box; the numbers are cycle counts):

14112560772 <- original
14006409540 <- mine (tasklet), using the old per-MFN heap_lock
 1331412384 <- Malcolm's IPI (heap_lock held for a long time)
 1374497324 <- mine (tasklet), heap_lock held for a long time

Meaning that your usage of IPIs is superior. The heap_lock is held
for chunk_size across all of the CPUs, and that looks OK to me.
Looking forward to seeing you post the patch. Thanks!