Hi,

Over the last year we have tried many times to get acceptable performance from pv_ops kernels.

Tests were done with 1, 2, 4 and 8 cores.  The more cores, the lower the score.

Inside the domU all cores are visible, and top -s shows all cores in use.
xentop in dom0 never shows over 99% cpu.

A 2.6.18.8-xenU kernel shows over 700% cpu and its scores are about 8x the pv_ops scores.

Any ideas?

John


1 core

BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066476 132875660   1% /

Start Benchmark Run: Tue May 18 13:54:54 BST 2010
 13:54:54 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:06:12 BST 2010
 14:06:12 up 11 min,  2 users,  load average: 11.48, 5.20, 2.43

                                            INDEX VALUES
TEST                                      BASELINE      RESULT      INDEX

Dhrystone 2 using register variables      376783.7   8950813.0      237.6
Double-Precision Whetstone                    83.1      2103.7      253.2
Execl Throughput                             188.3      1568.4       83.3
File Copy 1024 bufsize 2000 maxblocks       2672.0     64198.0      240.3
File Copy 256 bufsize 500 maxblocks         1077.0     17781.0      165.1
File Read 4096 bufsize 8000 maxblocks      15382.0    643717.0      418.5
Pipe-based Context Switching               15448.6     85379.4       55.3
Pipe Throughput                           111814.6    478490.1       42.8
Process Creation                             569.3      3329.6       58.5
Shell Scripts (8 concurrent)                  44.8       380.7       85.0
System Call Overhead                      114433.5    498712.3       43.6
                                                                =========
     FINAL SCORE                                                    114.1

2 cores

=============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066548 132875588   1% /

Start Benchmark Run: Tue May 18 14:07:27 BST 2010
 14:07:27 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:18:04 BST 2010
 14:18:04 up 10 min,  1 user,  load average: 12.78, 5.53, 2.49

                                            INDEX VALUES
TEST                                      BASELINE      RESULT      INDEX

Dhrystone 2 using register variables      376783.7  10124838.6      268.7
Double-Precision Whetstone                    83.1      1188.7      143.0
Execl Throughput                             188.3      1596.2       84.8
File Copy 1024 bufsize 2000 maxblocks       2672.0     58323.0      218.3
File Copy 256 bufsize 500 maxblocks         1077.0     17776.0      165.1
File Read 4096 bufsize 8000 maxblocks      15382.0    568217.0      369.4
Pipe-based Context Switching               15448.6     86111.3       55.7
Pipe Throughput                           111814.6    469957.8       42.0
Process Creation                             569.3      3298.1       57.9
Shell Scripts (8 concurrent)                  44.8       378.9       84.6
System Call Overhead                      114433.5    532828.4       46.6
                                                                =========
     FINAL SCORE                                                    107.9

4 cores

=============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066628 132875508   1% /

Start Benchmark Run: Tue May 18 14:19:17 BST 2010
 14:19:17 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:29:53 BST 2010
 14:29:53 up 10 min,  1 user,  load average: 13.59, 6.35, 2.97

                                            INDEX VALUES
TEST                                      BASELINE      RESULT      INDEX

Dhrystone 2 using register variables      376783.7  10185429.8      270.3
Double-Precision Whetstone                    83.1       759.8       91.4
Execl Throughput                             188.3      1386.2       73.6
File Copy 1024 bufsize 2000 maxblocks       2672.0     62331.0      233.3
File Copy 256 bufsize 500 maxblocks         1077.0     16492.0      153.1
File Read 4096 bufsize 8000 maxblocks      15382.0    563402.0      366.3
Pipe-based Context Switching               15448.6     87176.0       56.4
Pipe Throughput                           111814.6    481068.1       43.0
Process Creation                             569.3      3128.9       55.0
Shell Scripts (8 concurrent)                  44.8       394.9       88.1
System Call Overhead                      114433.5    539996.1       47.2
                                                                =========
     FINAL SCORE                                                    102.6

8 cores

=============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2, 8 threads)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066680 132875456   1% /

Start Benchmark Run: Tue May 18 14:30:59 BST 2010
 14:30:59 up 0 min,  1 user,  load average: 0.07, 0.02, 0.00

End Benchmark Run: Tue May 18 14:42:52 BST 2010
 14:42:52 up 12 min,  1 user,  load average: 25.56, 10.84, 4.96

                                            INDEX VALUES
TEST                                      BASELINE      RESULT      INDEX

Dhrystone 2 using register variables      376783.7   9972130.3      264.7
Double-Precision Whetstone                    83.1       755.2       90.9
Execl Throughput                             188.3      1584.7       84.2
File Copy 1024 bufsize 2000 maxblocks       2672.0     58981.0      220.7
File Copy 256 bufsize 500 maxblocks         1077.0     16904.0      157.0
File Read 4096 bufsize 8000 maxblocks      15382.0    557735.0      362.6
Pipe-based Context Switching               15448.6     80738.2       52.3
Pipe Throughput                           111814.6    450891.2       40.3
Process Creation                             569.3      2948.5       51.8
Shell Scripts (8 concurrent)                  44.8       378.1       84.4
System Call Overhead                      114433.5    537443.2       47.0
                                                                =========
     FINAL SCORE                                                    100.9

--
Professional hosting without compromise
www.clustered.net
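For anyone trying to reproduce a run like this, the moving parts are roughly as follows.  This is only a sketch: the config values and commands are assumptions for illustration, not details taken from the report.

    # domU config fragment; vcpus is the only value varied between runs (hypothetical numbers)
    vcpus  = 8
    memory = 2048

    # in dom0, while the benchmark runs: per-domain CPU usage, refreshed every second
    xentop -d 1

    # inside the domU: confirm the expected number of vcpus is actually online
    grep -c '^processor' /proc/cpuinfo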
Jeremy Fitzhardinge
2010-May-18 18:38 UTC
Re: [Xen-devel] Poor SMP performance pv_ops domU
On 05/18/2010 10:34 AM, John Morrison wrote:
> Hi,
>
> Over the last year we have tried many times to get acceptable performance from pv_ops kernels.
>
> Tests were done with 1, 2, 4 and 8 cores.  The more cores, the lower the score.
>
> Inside the domU all cores are visible, and top -s shows all cores in use.
> xentop in dom0 never shows over 99% cpu.
>
> A 2.6.18.8-xenU kernel shows over 700% cpu and its scores are about 8x the pv_ops scores.
>
> Any ideas?

Well, I guess some kind of bad serialization is going on in there, and
it should be fairly obvious with a bit of examination.

Have you tried building your own pvops domU kernels?  Does enabling PV
spinlocks make any difference?  Also, enabling some of the lock
debugging/profiling/contention monitoring stuff may give useful results.

Can you post the corresponding 2.6.18 results?  Are there specific
sub-tests which show the effect more strongly than the others?

How does the 2.6.32 kernel fare when booted native?

Thanks,
    J
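The PV spinlock and lock-contention options mentioned above are ordinary Kconfig switches.  A sketch of how they might be enabled and inspected follows; the CONFIG symbols are the standard kernel ones, everything else is illustrative.

    # in the domU kernel .config (then rebuild and reboot the guest)
    CONFIG_PARAVIRT_SPINLOCKS=y
    CONFIG_LOCK_STAT=y

    # inside the running domU: clear the counters, run the benchmark,
    # then look for locks with large contention counts and wait times
    echo 0 > /proc/lock_stat
    head -50 /proc/lock_stat

Note that CONFIG_LOCK_STAT itself adds overhead, so it is useful for diagnosis rather than for the benchmark numbers themselves.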
I've tried various kernels today - pv_ops seems to use only 1 core out of 8.

PV spinlocks make no difference.

The thing that sticks out most is that I cannot get the dom0 (xen-3.4.2) to show more than about 99.7% cpu usage for any pv_ops kernel.

#!/usr/bin/perl

while () {}

Running 8 of these loads 2.6.18.8-xenU to nearly 800% cpu as shown in dom0;
running the same 8 in any pv_ops kernel only gets as high as about 99.7%.

Inside both the pv_ops and xenU kernels, top -s shows all 8 cores being used.

John

On 18 May 2010, at 19:38, Jeremy Fitzhardinge wrote:

> Well, I guess some kind of bad serialization is going on in there, and
> it should be fairly obvious with a bit of examination.
>
> Have you tried building your own pvops domU kernels?  Does enabling PV
> spinlocks make any difference?  Also, enabling some of the lock
> debugging/profiling/contention monitoring stuff may give useful results.
>
> Can you post the corresponding 2.6.18 results?  Are there specific
> sub-tests which show the effect more strongly than the others?
>
> How does the 2.6.32 kernel fare when booted native?
>
> Thanks,
>     J
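A minimal sketch of the spin-loop test described above (the loop count matches the 8-vcpu case; the explicit while (1) condition is used here for clarity):

    # inside the domU: start 8 CPU-bound busy loops
    for i in $(seq 1 8); do
        perl -e 'while (1) {}' &
    done

    # inside the domU: per-core utilisation (press 1 in top for the per-CPU view)
    top

    # in dom0: with 8 busy vcpus the domain would be expected to approach 800% here
    xentop -d 1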
Jeremy Fitzhardinge
2010-May-19 17:44 UTC
Re: [Xen-devel] Poor SMP performance pv_ops domU
On 05/19/2010 09:24 AM, John Morrison wrote:
> I've tried various kernels today - pv_ops seems to use only 1 core out of 8.
>
> PV spinlocks make no difference.
>
> The thing that sticks out most is that I cannot get the dom0 (xen-3.4.2) to show more than about 99.7% cpu usage for any pv_ops kernel.
>
> #!/usr/bin/perl
>
> while () {}
>
> Running 8 of these loads 2.6.18.8-xenU to nearly 800% cpu as shown in dom0;
> running the same 8 in any pv_ops kernel only gets as high as about 99.7%.

What tool are you using to show CPU use?

> Inside both the pv_ops and xenU kernels, top -s shows all 8 cores being used.

I tried to reproduce this:

  1. I created a 4 vcpu pvops PV domain (4 pcpu host)
  2. Confirmed that all 4 vcpus are present with "cat /proc/cpuinfo" in
     the domain
  3. Ran 4 instances of `perl -e "while(){}" &` in the domain
  4. "top" within the domain shows 99% overall user time, no stolen
     time, with the perl processes each using 99% cpu time
  5. In dom0, "watch -n 1 xl vcpu-list <domain>" shows all 4 vcpus are
     consuming 1 vcpu second per second
  6. Running a spin loop in dom0 makes top within the domain show
     16-25% stolen time

Aside from top showing "99%" rather than ~400% as one might expect, it
all seems OK, and it looks like the vcpus are actually getting all the
CPU they're asking for.  I think the 99 vs 400 difference is just a
change in how the kernel shows its accounting (since there's been a lot
of change in that area between .18 and .32, including a whole new
scheduler).

If you're seeing a real performance regression between .18 and .32,
that's interesting, but it would be useful to make sure you're comparing
apples to apples; in particular, isolating any performance effect
inherent in Linux's own change from .18 to .32 from any effect of pvops
vs xenU.

So, things to try:

  * make sure all the vcpus are actually enabled within your domain;
    if you're adding them after the domain has booted, you need to make
    sure they get hot-plugged properly
  * make sure you don't have any expensive debug options enabled in
    your kernel config
  * run your benchmark on the 2.6.32 kernel booted native and compare
    it to pvops running under Xen
  * compare it with the Novell 2.6.32 non-pvops kernel
  * try pinning the vcpus to physical cpus to eliminate any Xen
    scheduler effects

Thanks,
    J
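The per-vcpu accounting check in step 5, and the pinning suggestion from the list above, can be approximated as follows.  "guest" is a placeholder domain name; on a xend-based 3.4 toolstack the same sub-commands exist under xm instead of xl.

    # in dom0: sample accumulated vcpu time twice, one second apart;
    # a fully busy vcpu should gain roughly 1 second of "Time(s)" per wall-clock second
    xl vcpu-list guest; sleep 1; xl vcpu-list guest

    # in dom0: pin vcpu N of the guest to physical cpu N,
    # taking the Xen scheduler out of the picture
    for n in 0 1 2 3 4 5 6 7; do
        xl vcpu-pin guest $n $n
    done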
Jeremy Fitzhardinge
2010-May-19 19:48 UTC
Re: [Xen-devel] Poor SMP performance pv_ops domU
(Re-added cc: xen-devel)

On 05/19/2010 12:41 PM, John Morrison wrote:
> xentop for the cpu usage.
>
> We see the performance of a single core in the domU when running a pv_ops kernel.
> Reboot the domU with 2.6.18.8-xenU and performance jumps nearly 8-fold.

Could you reproduce my experiment?  If you look at the CPU time
accumulated by each vcpu, is it incrementing at less than 1 vcpu
second/second?

> Pinned all 8 cpus - still the same results.
>
> Tried bare metal - much better results.

What do you mean by "much better"?  How does it compare to domU 2.6.18?

> We have seen this over 18 months on all pv kernels we try.
>
> It's not any specific kernel - all pv kernels we try have the same performance impact.

Do you mean pvops, or all PV Xen kernels?  How do the recent Novell
Xenlinux kernels perform?

Have you verified there are no expensive debug options enabled?

BTW, is it a 32 or 64-bit guest?

    J