Hi,

Over the last year we have tried many times to get acceptable performance from pv_ops kernels.

Tests were done with 1, 2, 4 and 8 cores.  The more cores, the lower the score.

Inside the domU all cores are visible, and top -s shows all cores in use.
xentop in dom0 never shows over 99% cpu.

A 2.6.18.8-xenU kernel shows over 700% cpu and its scores are about 8x the pv_ops scores.

Any ideas?

John


1 core

BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066476 132875660   1% /

Start Benchmark Run: Tue May 18 13:54:54 BST 2010
 13:54:54 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:06:12 BST 2010
 14:06:12 up 11 min,  2 users,  load average: 11.48, 5.20, 2.43

                                            INDEX VALUES
TEST                                      BASELINE      RESULT      INDEX

Dhrystone 2 using register variables      376783.7   8950813.0      237.6
Double-Precision Whetstone                    83.1      2103.7      253.2
Execl Throughput                             188.3      1568.4       83.3
File Copy 1024 bufsize 2000 maxblocks       2672.0     64198.0      240.3
File Copy 256 bufsize 500 maxblocks         1077.0     17781.0      165.1
File Read 4096 bufsize 8000 maxblocks      15382.0    643717.0      418.5
Pipe-based Context Switching               15448.6     85379.4       55.3
Pipe Throughput                           111814.6    478490.1       42.8
Process Creation                             569.3      3329.6       58.5
Shell Scripts (8 concurrent)                  44.8       380.7       85.0
System Call Overhead                      114433.5    498712.3       43.6
                                                                =========
     FINAL SCORE                                                    114.1

2 cores

=============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066548 132875588   1% /

Start Benchmark Run: Tue May 18 14:07:27 BST 2010
 14:07:27 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:18:04 BST 2010
 14:18:04 up 10 min,  1 user,  load average: 12.78, 5.53, 2.49

                                            INDEX VALUES
TEST                                      BASELINE      RESULT      INDEX

Dhrystone 2 using register variables      376783.7  10124838.6      268.7
Double-Precision Whetstone                    83.1      1188.7      143.0
Execl Throughput                             188.3      1596.2       84.8
File Copy 1024 bufsize 2000 maxblocks       2672.0     58323.0      218.3
File Copy 256 bufsize 500 maxblocks         1077.0     17776.0      165.1
File Read 4096 bufsize 8000 maxblocks      15382.0    568217.0      369.4
Pipe-based Context Switching               15448.6     86111.3       55.7
Pipe Throughput                           111814.6    469957.8       42.0
Process Creation                             569.3      3298.1       57.9
Shell Scripts (8 concurrent)                  44.8       378.9       84.6
System Call Overhead                      114433.5    532828.4       46.6
                                                                =========
     FINAL SCORE                                                    107.9

4 cores

=============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066628 132875508   1% /

Start Benchmark Run: Tue May 18 14:19:17 BST 2010
 14:19:17 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:29:53 BST 2010
 14:29:53 up 10 min,  1 user,  load average: 13.59, 6.35, 2.97

                                            INDEX VALUES
TEST                                      BASELINE      RESULT      INDEX

Dhrystone 2 using register variables      376783.7  10185429.8      270.3
Double-Precision Whetstone                    83.1       759.8       91.4
Execl Throughput                             188.3      1386.2       73.6
File Copy 1024 bufsize 2000 maxblocks       2672.0     62331.0      233.3
File Copy 256 bufsize 500 maxblocks         1077.0     16492.0      153.1
File Read 4096 bufsize 8000 maxblocks      15382.0    563402.0      366.3
Pipe-based Context Switching               15448.6     87176.0       56.4
Pipe Throughput                           111814.6    481068.1       43.0
Process Creation                             569.3      3128.9       55.0
Shell Scripts (8 concurrent)                  44.8       394.9       88.1
System Call Overhead                      114433.5    539996.1       47.2
                                                                =========
     FINAL SCORE                                                    102.6

8 cores

=============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2, 8 threads)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066680 132875456   1% /

Start Benchmark Run: Tue May 18 14:30:59 BST 2010
 14:30:59 up 0 min,  1 user,  load average: 0.07, 0.02, 0.00

End Benchmark Run: Tue May 18 14:42:52 BST 2010
 14:42:52 up 12 min,  1 user,  load average: 25.56, 10.84, 4.96

                                            INDEX VALUES
TEST                                      BASELINE      RESULT      INDEX

Dhrystone 2 using register variables      376783.7   9972130.3      264.7
Double-Precision Whetstone                    83.1       755.2       90.9
Execl Throughput                             188.3      1584.7       84.2
File Copy 1024 bufsize 2000 maxblocks       2672.0     58981.0      220.7
File Copy 256 bufsize 500 maxblocks         1077.0     16904.0      157.0
File Read 4096 bufsize 8000 maxblocks      15382.0    557735.0      362.6
Pipe-based Context Switching               15448.6     80738.2       52.3
Pipe Throughput                           111814.6    450891.2       40.3
Process Creation                             569.3      2948.5       51.8
Shell Scripts (8 concurrent)                  44.8       378.1       84.4
System Call Overhead                      114433.5    537443.2       47.0
                                                                =========
     FINAL SCORE                                                    100.9

--
Professional hosting without compromise
www.clustered.net
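For anyone trying to reproduce a run like this, the moving parts are roughly as follows.  This is only a sketch: the config values and commands are assumptions for illustration, not details taken from the report.

    # domU config fragment; vcpus is the only value varied between runs (hypothetical numbers)
    vcpus  = 8
    memory = 2048

    # in dom0, while the benchmark runs: per-domain CPU usage, refreshed every second
    xentop -d 1

    # inside the domU: confirm the expected number of vcpus is actually online
    grep -c '^processor' /proc/cpuinfo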
Jeremy Fitzhardinge
2010-May-18 18:38 UTC
Re: [Xen-devel] Poor SMP performance pv_ops domU
On 05/18/2010 10:34 AM, John Morrison wrote:
> Hi,
>
> Over the last year we have tried many times to get acceptable performance from pv_ops kernels.
>
> Tests were done with 1, 2, 4 and 8 cores.  The more cores, the lower the score.
>
> Inside the domU all cores are visible, and top -s shows all cores in use.
> xentop in dom0 never shows over 99% cpu.
>
> A 2.6.18.8-xenU kernel shows over 700% cpu and its scores are about 8x the pv_ops scores.
>
> Any ideas?

Well, I guess some kind of bad serialization is going on in there, and
it should be fairly obvious with a bit of examination.

Have you tried building your own pvops domU kernels?  Does enabling PV
spinlocks make any difference?  Also, enabling some of the lock
debugging/profiling/contention monitoring stuff may give useful results.

Can you post the corresponding 2.6.18 results?  Are there specific
sub-tests which show the effect more strongly than the others?

How does the 2.6.32 kernel fare when booted native?

Thanks,
    J
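The PV spinlock and lock-contention options mentioned above are ordinary Kconfig switches.  A sketch of how they might be enabled and inspected follows; the CONFIG symbols are the standard kernel ones, everything else is illustrative.

    # in the domU kernel .config (then rebuild and reboot the guest)
    CONFIG_PARAVIRT_SPINLOCKS=y
    CONFIG_LOCK_STAT=y

    # inside the running domU: clear the counters, run the benchmark,
    # then look for locks with large contention counts and wait times
    echo 0 > /proc/lock_stat
    head -50 /proc/lock_stat

Note that CONFIG_LOCK_STAT itself adds overhead, so it is useful for diagnosis rather than for the benchmark numbers themselves.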
I've tried various kernels today - pv_ops seems to use only 1 core out of 8.

PV spinlocks make no difference.

The thing that sticks out most is that I cannot get the dom0 (xen-3.4.2) to show more than about 99.7% cpu usage for any pv_ops kernel.

#!/usr/bin/perl

while () {}

Running 8 of these loads 2.6.18.8-xenU to nearly 800% cpu as shown in dom0;
running the same 8 in any pv_ops kernel only gets as high as about 99.7%.

Inside both the pv_ops and xenU kernels, top -s shows all 8 cores being used.

John

On 18 May 2010, at 19:38, Jeremy Fitzhardinge wrote:

> Well, I guess some kind of bad serialization is going on in there, and
> it should be fairly obvious with a bit of examination.
>
> Have you tried building your own pvops domU kernels?  Does enabling PV
> spinlocks make any difference?  Also, enabling some of the lock
> debugging/profiling/contention monitoring stuff may give useful results.
>
> Can you post the corresponding 2.6.18 results?  Are there specific
> sub-tests which show the effect more strongly than the others?
>
> How does the 2.6.32 kernel fare when booted native?
>
> Thanks,
>     J
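A minimal sketch of the spin-loop test described above (the loop count matches the 8-vcpu case; the explicit while (1) condition is used here for clarity):

    # inside the domU: start 8 CPU-bound busy loops
    for i in $(seq 1 8); do
        perl -e 'while (1) {}' &
    done

    # inside the domU: per-core utilisation (press 1 in top for the per-CPU view)
    top

    # in dom0: with 8 busy vcpus the domain would be expected to approach 800% here
    xentop -d 1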
Jeremy Fitzhardinge
2010-May-19 17:44 UTC
Re: [Xen-devel] Poor SMP performance pv_ops domU
On 05/19/2010 09:24 AM, John Morrison wrote:
> I've tried various kernels today - pv_ops seems to use only 1 core out of 8.
>
> PV spinlocks make no difference.
>
> The thing that sticks out most is that I cannot get the dom0 (xen-3.4.2) to show more than about 99.7% cpu usage for any pv_ops kernel.
>
> #!/usr/bin/perl
>
> while () {}
>
> Running 8 of these loads 2.6.18.8-xenU to nearly 800% cpu as shown in dom0;
> running the same 8 in any pv_ops kernel only gets as high as about 99.7%.

What tool are you using to show CPU use?

> Inside both the pv_ops and xenU kernels, top -s shows all 8 cores being used.

I tried to reproduce this:

  1. I created a 4 vcpu pvops PV domain (4 pcpu host)
  2. Confirmed that all 4 vcpus are present with "cat /proc/cpuinfo" in
     the domain
  3. Ran 4 instances of `perl -e "while(){}" &` in the domain
  4. "top" within the domain shows 99% overall user time, no stolen
     time, with the perl processes each using 99% cpu time
  5. In dom0, "watch -n 1 xl vcpu-list <domain>" shows all 4 vcpus are
     consuming 1 vcpu second per second
  6. Running a spin loop in dom0 makes top within the domain show
     16-25% stolen time

Aside from top showing "99%" rather than ~400% as one might expect, it
all seems OK, and it looks like the vcpus are actually getting all the
CPU they're asking for.  I think the 99 vs 400 difference is just a
change in how the kernel shows its accounting (since there's been a lot
of change in that area between .18 and .32, including a whole new
scheduler).

If you're seeing a real performance regression between .18 and .32,
that's interesting, but it would be useful to make sure you're comparing
apples to apples; in particular, isolating any performance effect
inherent in Linux's own change from .18 to .32 from any effect of pvops
vs xenU.

So, things to try:

  * make sure all the vcpus are actually enabled within your domain;
    if you're adding them after the domain has booted, you need to make
    sure they get hot-plugged properly
  * make sure you don't have any expensive debug options enabled in
    your kernel config
  * run your benchmark on the 2.6.32 kernel booted native and compare
    it to pvops running under Xen
  * compare it with the Novell 2.6.32 non-pvops kernel
  * try pinning the vcpus to physical cpus to eliminate any Xen
    scheduler effects

Thanks,
    J
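The per-vcpu accounting check in step 5, and the pinning suggestion from the list above, can be approximated as follows.  "guest" is a placeholder domain name; on a xend-based 3.4 toolstack the same sub-commands exist under xm instead of xl.

    # in dom0: sample accumulated vcpu time twice, one second apart;
    # a fully busy vcpu should gain roughly 1 second of "Time(s)" per wall-clock second
    xl vcpu-list guest; sleep 1; xl vcpu-list guest

    # in dom0: pin vcpu N of the guest to physical cpu N,
    # taking the Xen scheduler out of the picture
    for n in 0 1 2 3 4 5 6 7; do
        xl vcpu-pin guest $n $n
    done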
Jeremy Fitzhardinge
2010-May-19 19:48 UTC
Re: [Xen-devel] Poor SMP performance pv_ops domU
(Re-added cc: xen-devel)

On 05/19/2010 12:41 PM, John Morrison wrote:
> xentop for the cpu usage.
>
> We see the performance of a single core in the domU when running a pv_ops kernel.
> Reboot the domU with 2.6.18.8-xenU and performance jumps nearly 8-fold.

Could you reproduce my experiment?  If you look at the CPU time
accumulated by each vcpu, is it incrementing at less than 1 vcpu
second/second?

> Pinned all 8 cpus - still the same results.
>
> Tried bare metal - much better results.

What do you mean by "much better"?  How does it compare to domU 2.6.18?

> We have seen this over 18 months on all pv kernels we try.
>
> It's not any specific kernel - all pv kernels we try have the same performance impact.

Do you mean pvops, or all PV Xen kernels?  How do the recent Novell
Xenlinux kernels perform?

Have you verified there are no expensive debug options enabled?

BTW, is it a 32 or 64-bit guest?

    J