Hello,

COS performance testing uncovered COS performance problems in tests that were not expected to be problematic. The tests are:

(3a) parallel file creation from several clients, each client creating files in its own directory;
(3b) parallel file creation from several clients in one shared directory.

The test results are published in Bugzilla bug 17235.

The slowdown depends on hardware; on the test clusters Washie1 and Washie3 it is quite large. The difference between COS=0 and COS=1 was about 20% initially, and PTLDEBUG=0 reduces the slowdown to 12-13% (the COS patch adds debugging stuff like ENTRY/EXIT to ptlrpc routines).

The COS code finds no inter-client dependencies and adds no extra commits during those tests: there are about 60 commits in each 300-second test with both COS=0 and COS=1, i.e. one ldiskfs commit every 5 seconds. However, the oprofile results are terrible for COS=1.

COS code profiling results:

== COS=1 ==
376924  4.5556  ptlrpc.ko    ptlrpc     ptlrpc_server_handle_reply
288668  3.4889  vmlinux      vmlinux    .text.task_rq_lock
281528  3.4026  ptlrpc.ko    ptlrpc     ldlm_resource_find
180107  2.1768  vmlinux      vmlinux    .text.__wake_up_common
173166  2.0929  obdclass.ko  obdclass   htable_lookup
166829  2.0163  e1000e.ko    e1000e     e1000_irq_enable
150765  1.8222  vmlinux      vmlinux    schedule
127059  1.5357  e1000e.ko    e1000e     e1000_intr_msi
119343  1.4424  ptlrpc.ko    ptlrpc     target_send_reply
118733  1.4350  lvfs.ko      lvfs       lprocfs_counter_add
99568   1.2034  vmlinux      vmlinux    __find_get_block
93356   1.1283  vmlinux      vmlinux    __switch_to
91259   1.1030  oprofiled    oprofiled  (no symbols)
91147   1.1016  vmlinux      vmlinux    try_to_wake_up
89665   1.0837  vmlinux      vmlinux    kfree
87774   1.0609  vmlinux      vmlinux    module_refcount
85149   1.0291  vmlinux      vmlinux    memset
84822   1.0252  vmlinux      vmlinux    mwait_idle
78850   0.9530  vmlinux      vmlinux    __kmalloc
70324   0.8500  ptlrpc.ko    ptlrpc     ptlrpc_main
69333   0.8380  vmlinux      vmlinux    memmove

== COS=0 ==
237314  2.7642  e1000e.ko    e1000e     e1000_irq_enable
211304  2.4613  obdclass.ko  obdclass   htable_lookup
197515  2.3007  vmlinux      vmlinux    schedule
180431  2.1017  e1000e.ko    e1000e     e1000_intr_msi
128113  1.4923  vmlinux      vmlinux    __switch_to
127683  1.4873  lvfs.ko      lvfs       lprocfs_counter_add
122777  1.4301  vmlinux      vmlinux    kfree
122584  1.4279  vmlinux      vmlinux    module_refcount
116800  1.3605  vmlinux      vmlinux    __find_get_block
110418  1.2862  vmlinux      vmlinux    mwait_idle
101909  1.1870  vmlinux      vmlinux    memset
101100  1.1776  oprofiled    oprofiled  (no symbols)
96397   1.1228  vmlinux      vmlinux    __kmalloc
91745   1.0686  ptlrpc.ko    ptlrpc     ptlrpc_main
80605   0.9389  ksocklnd.ko  ksocklnd   ksocknal_scheduler
79782   0.9293  vmlinux      vmlinux    kmem_cache_free
76968   0.8965  mds.ko       mds        mds_lov_dump_objids
75094   0.8747  vmlinux      vmlinux    .text.task_rq_lock
75087   0.8746  vmlinux      vmlinux    memmove
72627   0.8460  vmlinux      vmlinux    find_get_page
67242   0.7832  vmlinux      vmlinux    kmem_cache_alloc

== events distribution in ptlrpc_server_handle_reply, the most CPU-consuming function ==
CPU: P4 / Xeon, speed 2800.2 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
vma               samples  %       image name  app name  symbol name
00000000000498f0  376924   4.5556  ptlrpc.ko   ptlrpc    ptlrpc_server_handle_reply
  00000000000498f0       1  2.7e-04
  00000000000498f4      10  0.0027
  0000000000049905       3  8.0e-04
  0000000000049912      22  0.0058
  0000000000049972       1  2.7e-04
  0000000000049977     658  0.1746
  000000000004997e      24  0.0064
  0000000000049a23       5  0.0013
  0000000000049a35      21  0.0056
  0000000000049a80       1  2.7e-04
  0000000000049a85       4  0.0011
  0000000000049a92       2  5.3e-04
  0000000000049a99      22  0.0058
  0000000000049aa9       1  2.7e-04
  0000000000049ab5  371741  98.6249
  ...
                    ^^^^^^ spin_lock(&obd->obd_uncommitted_replies_lock);

I have a patch to avoid the use of obd_uncommitted_replies_lock in ptlrpc_server_handle_reply, but it has minimal effect: ptlrpc_server_handle_reply is still the most CPU-consuming function because of svc->srv_lock contention.

I think the problem is that COS defers processing of replies until transaction commit time. When a commit happens, the MDS has to process thousands of replies (about 14k replies per commit in test 3a) in a short period of time. I guess the MDT service threads are all woken up and spin trying to get the service srv_lock. Processing of new requests may also suffer from this.

I ran the tests with CONFIG_DEBUG_SPINLOCK_SLEEP debugging compiled into the kernel; it found no sleep-under-spinlock bugs.

Further optimization may include:
1. per-reply spin locks (an illustrative sketch follows this message);
2. per-CPU structures and threads to process reply queues.

Any comments?

Thanks.

PS. The test results are much better when the MDS server is the sata20 machine with 4 cores (the MDS from Washie1 has 2 cores); COS=0 and COS=1 show only a 3% difference:

COS=1
Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300 secs)
Rate: 3096.94 creates/sec (total: 2 threads 929083 creates 300 secs)

COS=0
Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301 secs)
Rate: 3152.89 creates/sec (total: 2 threads 945868 creates 300 secs)

--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
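For illustration, a minimal sketch (in kernel-style C) of the per-reply-lock idea proposed above. The names here (struct reply_state, rs_lock, rs_obd_list, commit_replies) are assumptions made up for the example, not the actual ptlrpc structures or the patch attached to the bug; the point is only that the commit callback detaches the whole uncommitted list in O(1) under a list lock and then finishes each reply under its own lock, so service threads never pile up on one service-wide spinlock.

/* Illustrative sketch only: hypothetical names, not the real ptlrpc
 * structures or the patch from the bug. */
#include <linux/spinlock.h>
#include <linux/list.h>

struct reply_state {
        spinlock_t       rs_lock;      /* per-reply lock (optimization idea 1) */
        struct list_head rs_obd_list;  /* linkage in the uncommitted-replies list */
        int              rs_committed;
};

/*
 * Commit callback: splice the whole batch of uncommitted replies off the
 * shared list in O(1) while holding the list lock, then finish each reply
 * under its own rs_lock, so threads handling new requests do not contend
 * on a single service-wide spinlock for every reply.
 */
static void commit_replies(spinlock_t *list_lock, struct list_head *uncommitted)
{
        struct reply_state *rs, *next;
        LIST_HEAD(batch);

        spin_lock(list_lock);
        list_splice_init(uncommitted, &batch);
        spin_unlock(list_lock);

        list_for_each_entry_safe(rs, next, &batch, rs_obd_list) {
                spin_lock(&rs->rs_lock);
                rs->rs_committed = 1;
                list_del_init(&rs->rs_obd_list);
                /* ... release lock references, free or re-queue the reply ... */
                spin_unlock(&rs->rs_lock);
        }
}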
Try to profile with a single CPU? You'll probably get an idea of how a "per-cpu" approach can help.

thanks, Alex

Alexander Zarochentsev wrote:
> I have a patch to avoid the use of obd_uncommitted_replies_lock
> in ptlrpc_server_handle_reply, but it has minimal effect:
> ptlrpc_server_handle_reply is still the most CPU-consuming function
> because of svc->srv_lock contention.
>
> I think the problem is that COS defers processing of replies until
> transaction commit time. When a commit happens, the MDS has to process
> thousands of replies (about 14k replies per commit in test 3a) in a
> short period of time. I guess the MDT service threads are all woken up
> and spin trying to get the service srv_lock. Processing of new requests
> may also suffer from this.
>
> I ran the tests with CONFIG_DEBUG_SPINLOCK_SLEEP debugging compiled
> into the kernel; it found no sleep-under-spinlock bugs.
>
> Further optimization may include:
> 1. per-reply spin locks;
> 2. per-CPU structures and threads to process reply queues.
>
> Any comments?
>
> Thanks.
>
> PS. The test results are much better when the MDS server is the sata20
> machine with 4 cores (the MDS from Washie1 has 2 cores); COS=0 and
> COS=1 show only a 3% difference:
>
> COS=1
> Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300 secs)
> Rate: 3096.94 creates/sec (total: 2 threads 929083 creates 300 secs)
>
> COS=0
> Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301 secs)
> Rate: 3152.89 creates/sec (total: 2 threads 945868 creates 300 secs)
On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> I think the problem is that COS defers processing of replies until
> transaction commit time. When a commit happens, the MDS has to process
> thousands of replies (about 14k replies per commit in test 3a) in a
> short period of time. I guess the MDT service threads are all woken up
> and spin trying to get the service srv_lock. Processing of new requests
> may also suffer from this.

Can you please explain what replies are being blocked? It can't be the create replies, or the clients would be blocked waiting after starting a single create each.

I think the thread and lock contention is only part of the issue - if all of these replies are blocked until transaction commit, this wastes all of the bandwidth on the network while the replies are being held.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas,

Zam doesn't mean that the reply messages are being blocked - he means the reply state that is holding lock references. Previously those got cleaned up one at a time as the repACKs came in; now they get scheduled by the 1000s on the commit callback. I'm betting that we're getting thundering herds of MDS threads at this time and that we should have dedicated 1-per-CPU threads doing the cleanup, to minimize interference with normal request processing. Bzzz thinks it's purely spinlock contention that's the source of the inefficiency. So Zam is going to run an experiment with a single-CPU MDS to see if the performance issue goes away. I'm betting it won't and Bzzz is betting it will. We should see soon...

Cheers,
Eric

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 11 October 2008 5:00 PM
> To: Alexander Zarochentsev
> Cc: lustre-devel at lists.lustre.org; lustre-recovery at Sun.COM
> Subject: Re: [Lustre-devel] COS performance issues
>
> On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> > I think the problem is that COS defers processing of replies until
> > transaction commit time. When a commit happens, the MDS has to process
> > thousands of replies (about 14k replies per commit in test 3a) in a
> > short period of time. I guess the MDT service threads are all woken up
> > and spin trying to get the service srv_lock. Processing of new requests
> > may also suffer from this.
>
> Can you please explain what replies are being blocked? It can't be the
> create replies, or the clients would be blocked waiting after starting a
> single create each.
>
> I think the thread and lock contention is only part of the issue - if all
> of these replies are blocked until transaction commit, this wastes all of
> the bandwidth on the network while the replies are being held.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
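As an illustration of the 1-per-CPU cleanup-thread suggestion above, here is a hedged sketch in kernel-style C. The names (reply_handling_thread, hrt_*, reply_handler_main) are made up for the example and are not the code that eventually landed; the idea is simply that each CPU gets its own queue, lock and worker, so commit-time cleanup neither wakes a herd of MDS service threads nor funnels through one service-wide lock.

/* Sketch only: hypothetical per-CPU reply-cleanup thread, not Lustre code. */
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/wait.h>

struct reply_handling_thread {
        spinlock_t        hrt_lock;   /* protects hrt_queue only */
        struct list_head  hrt_queue;  /* reply states scheduled at commit time */
        wait_queue_head_t hrt_wait;
};

static int reply_handler_main(void *arg)
{
        struct reply_handling_thread *hrt = arg;

        while (!kthread_should_stop()) {
                LIST_HEAD(batch);

                if (wait_event_interruptible(hrt->hrt_wait,
                                             !list_empty(&hrt->hrt_queue) ||
                                             kthread_should_stop()))
                        continue;   /* interrupted; re-check the stop condition */

                spin_lock(&hrt->hrt_lock);
                list_splice_init(&hrt->hrt_queue, &batch);
                spin_unlock(&hrt->hrt_lock);

                /* Drain the batch without touching any service-wide lock, so
                 * ordinary MDS service threads keep handling new requests. */
                /* ... release lock references and free each reply state ... */
        }
        return 0;
}

/*
 * One such thread would be started per CPU, e.g. created with
 * kthread_create() and pinned with kthread_bind(task, cpu); the commit
 * callback would then append each reply state to the queue of the CPU it
 * was created on and wake only that thread.
 */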
On 11 October 2008 20:00:10 Andreas Dilger wrote:
> On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> > I think the problem is that COS defers processing of replies until
> > transaction commit time. When a commit happens, the MDS has to process
> > thousands of replies (about 14k replies per commit in test 3a) in a
> > short period of time. I guess the MDT service threads are all woken
> > up and spin trying to get the service srv_lock. Processing of new
> > requests may also suffer from this.
>
> Can you please explain what replies are being blocked? It can't be
> the create replies, or the clients would be blocked waiting after
> starting a single create each.
>
> I think the thread and lock contention is only part of the issue - if
> all of these replies are blocked until transaction commit, this wastes
> all of the bandwidth on the network while the replies are being held.

As Eric already explained, the replies themselves have already been sent, but the reply state objects remain in memory. Their cleanup is almost completely serialized by the per-service spinlock. That may be only part of the problem, but it is the first one to solve.

> Cheers, Andreas

WBR,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
> Try to profile with a single CPU? You'll probably get an idea of how
> a "per-cpu" approach can help.

I booted the MDS server with the maxcpus=1 kernel parameter and here are the results:

cos=0
2039.31 creates/sec (total: 2 threads 611794 creates 300 secs)
2037.80 creates/sec (total: 2 threads 611341 creates 300 secs)
2076.21 creates/sec (total: 2 threads 622864 creates 300 secs)

cos=1
1874.93 creates/sec (total: 2 threads 564354 creates 301 secs)
1923.97 creates/sec (total: 2 threads 577191 creates 300 secs)
1892.61 creates/sec (total: 2 threads 567783 creates 300 secs)
1874.74 creates/sec (total: 2 threads 562421 creates 300 secs)

Unfortunately, profiling info isn't available yet: these results were obtained with SLES10, which can boot with maxcpus=1 but has no oprofile installed.

> Alexander Zarochentsev wrote:
> > I have a patch to avoid the use of obd_uncommitted_replies_lock in
> > ptlrpc_server_handle_reply, but it has minimal effect [...]

--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
It would be good to look at the profiles, as the next one was ldlm_resource_get().

thanks, Alex

Alexander Zarochentsev wrote:
> On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
>> Try to profile with a single CPU? You'll probably get an idea of how
>> a "per-cpu" approach can help.
>
> I booted the MDS server with the maxcpus=1 kernel parameter and here
> are the results:
>
> cos=0
> 2039.31 creates/sec (total: 2 threads 611794 creates 300 secs)
> 2037.80 creates/sec (total: 2 threads 611341 creates 300 secs)
> 2076.21 creates/sec (total: 2 threads 622864 creates 300 secs)
>
> cos=1
> 1874.93 creates/sec (total: 2 threads 564354 creates 301 secs)
> 1923.97 creates/sec (total: 2 threads 577191 creates 300 secs)
> 1892.61 creates/sec (total: 2 threads 567783 creates 300 secs)
> 1874.74 creates/sec (total: 2 threads 562421 creates 300 secs)
>
> Unfortunately, profiling info isn't available yet: these results were
> obtained with SLES10, which can boot with maxcpus=1 but has no oprofile
> installed.
> [...]
On 12 October 2008 23:12:10 Alex Zhuravlev wrote:
> It would be good to look at the profiles, as the next one was
> ldlm_resource_get().

Alex, look, ptlrpc_server_handle_reply has gone:

CPU: P4 / Xeon, speed 2800.25 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %       image name   app name   symbol name
312318   4.7857  vmlinux      vmlinux    try_to_wake_up
259690   3.9792  ptlrpc.ko    ptlrpc     ldlm_resource_find
175916   2.6956  vmlinux      vmlinux    __wake_up_common
160912   2.4657  e1000.ko     e1000      e1000_irq_enable
157031   2.4062  obdclass.ko  obdclass   htable_lookup
134309   2.0580  e1000.ko     e1000      e1000_intr_msi
121261   1.8581  lvfs.ko      lvfs       lprocfs_counter_add
87613    1.3425  vmlinux      vmlinux    __find_get_block
87562    1.3417  oprofiled    oprofiled  (no symbols)
77926    1.1941  vmlinux      vmlinux    memset
74561    1.1425  vmlinux      vmlinux    __kmalloc
70952    1.0872  vmlinux      vmlinux    __switch_to
67288    1.0311  mds.ko       mds        mds_lov_dump_objids
67085    1.0279  vmlinux      vmlinux    memmove
66521    1.0193  vmlinux      vmlinux    kfree
55112    0.8445  ptlrpc.ko    ptlrpc     ptlrpc_main
54357    0.8329  vmlinux      vmlinux    schedule
53501    0.8198  vmlinux      vmlinux    find_get_page

> thanks, Alex
>
> Alexander Zarochentsev wrote:
> > On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
> >> Try to profile with a single CPU? You'll probably get an idea of
> >> how a "per-cpu" approach can help.
> >
> > I booted the MDS server with the maxcpus=1 kernel parameter [...]

--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
Cool! Do you have a profile with COS disabled? How is it different?

I guess ldlm_resource_find() is a result of COS, as now we've got 10K times more locks and resources, but what about try_to_wake_up() and __wake_up_common()?

thanks, Alex

Alexander Zarochentsev wrote:
> On 12 October 2008 23:12:10 Alex Zhuravlev wrote:
>> It would be good to look at the profiles, as the next one was
>> ldlm_resource_get().
>
> Alex, look, ptlrpc_server_handle_reply has gone:
>
> CPU: P4 / Xeon, speed 2800.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples  %       image name   app name   symbol name
> 312318   4.7857  vmlinux      vmlinux    try_to_wake_up
> 259690   3.9792  ptlrpc.ko    ptlrpc     ldlm_resource_find
> 175916   2.6956  vmlinux      vmlinux    __wake_up_common
> 160912   2.4657  e1000.ko     e1000      e1000_irq_enable
> 157031   2.4062  obdclass.ko  obdclass   htable_lookup
> [...]
On 13 October 2008 19:04:51 Alex Zhuravlev wrote:
> Cool! Do you have a profile with COS disabled? How is it different?
> I guess ldlm_resource_find() is a result of COS, as now we've got 10K
> times more locks and resources, but what about try_to_wake_up() and
> __wake_up_common()?

What I got after a number of performance optimizations:

332019   4.4010  ptlrpc.ko    ptlrpc     ldlm_resource_find
190465   2.5247  obdclass.ko  obdclass   htable_lookup
175260   2.3231  e1000.ko     e1000      e1000_irq_enable
146324   1.9396  vmlinux      vmlinux    schedule
139799   1.8531  e1000.ko     e1000      e1000_intr_msi
121729   1.6136  lvfs.ko      lvfs       lprocfs_counter_add
105164   1.3940  vmlinux      vmlinux    __find_get_block
102296   1.3560  vmlinux      vmlinux    memset
96728    1.2822  vmlinux      vmlinux    __switch_to
92123    1.2211  vmlinux      vmlinux    kfree
90847    1.2042  oprofiled    oprofiled  (no symbols)
84152    1.1155  vmlinux      vmlinux    mwait_idle
82378    1.0919  vmlinux      vmlinux    __kmalloc
80368    1.0653  vmlinux      vmlinux    try_to_wake_up
79110    1.0486  mds.ko       mds        mds_lov_dump_objids
71730    0.9508  vmlinux      vmlinux    memmove

ptlrpc_server_handle_reply has gone thanks to the per-reply spin lock now implemented, and try_to_wake_up has dropped thanks to the dedicated reply-handling threads; only ldlm_resource_find and htable_lookup are left, which I hope to fix with the next patch.

The massive create tests show the following results:

COS=1
Rate: 2085.73 creates/sec (total: 2 threads 625719 creates 300 secs)
Rate: 2072.38 creates/sec (total: 2 threads 621713 creates 300 secs)
Rate: 2053.80 creates/sec (total: 2 threads 616141 creates 300 secs)

COS=0
Rate: 2128.22 creates/sec (total: 2 threads 638465 creates 300 secs)
Rate: 2151.05 creates/sec (total: 2 threads 645316 creates 300 secs)
Rate: 2169.91 creates/sec (total: 2 threads 650974 creates 300 secs)

Clean HEAD
Rate: 2199.67 creates/sec (total: 2 threads 659900 creates 300 secs)
Rate: 2213.95 creates/sec (total: 2 threads 664184 creates 300 secs)
Rate: 2186.74 creates/sec (total: 2 threads 658209 creates 301 secs)
Rate: 2206.10 creates/sec (total: 2 threads 661831 creates 300 secs)
Rate: 2217.49 creates/sec (total: 2 threads 665247 creates 300 secs)
Rate: 2230.86 creates/sec (total: 2 threads 669258 creates 300 secs)

WBR,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
On Oct 17, 2008, at 01:07, Alexander Zarochentsev wrote:
> The massive create tests show the following results:
>
> COS=1
> Rate: 2085.73 creates/sec (total: 2 threads 625719 creates 300 secs)
> Rate: 2072.38 creates/sec (total: 2 threads 621713 creates 300 secs)
> Rate: 2053.80 creates/sec (total: 2 threads 616141 creates 300 secs)
>
> COS=0
> Rate: 2128.22 creates/sec (total: 2 threads 638465 creates 300 secs)
> Rate: 2151.05 creates/sec (total: 2 threads 645316 creates 300 secs)
> Rate: 2169.91 creates/sec (total: 2 threads 650974 creates 300 secs)
>
> Clean HEAD
> Rate: 2199.67 creates/sec (total: 2 threads 659900 creates 300 secs)
> Rate: 2213.95 creates/sec (total: 2 threads 664184 creates 300 secs)
> Rate: 2186.74 creates/sec (total: 2 threads 658209 creates 301 secs)
> Rate: 2206.10 creates/sec (total: 2 threads 661831 creates 300 secs)
> Rate: 2217.49 creates/sec (total: 2 threads 665247 creates 300 secs)
> Rate: 2230.86 creates/sec (total: 2 threads 669258 creates 300 secs)

Any idea why COS=0 was consistently a bit slower than HEAD?

robert
On 18 October 2008 00:55:03 Robert Read wrote:
> On Oct 17, 2008, at 01:07, Alexander Zarochentsev wrote:
> > The massive create tests show the following results:
> > [...]
>
> Any idea why COS=0 was consistently a bit slower than HEAD?

I only suspect the ENTRY/EXIT stuff I added to several ptlrpc functions. Next time I will try to build Lustre with CDEBUG_ENTRY_EXIT=0.

> robert

The latest results I got after increasing the ldlm resource hash table size are:

zam at lts-head:~$ zgrep Rate: m-ht-improved-WASHIE3.4-20081017.log
COS = 1
Rate: 1887.57 creates/sec (total: 2 threads 566271 creates 300 secs)
Rate: 1980.18 creates/sec (total: 2 threads 594053 creates 300 secs)
Rate: 1923.76 creates/sec (total: 2 threads 577129 creates 300 secs)
COS = 0
Rate: 1804.92 creates/sec (total: 2 threads 541475 creates 300 secs)
Rate: 1915.29 creates/sec (total: 2 threads 574587 creates 300 secs)
Rate: 2025.70 creates/sec (total: 2 threads 607711 creates 300 secs)

CLEAN HEAD:
zam at lts-head:~$ zgrep Rate: m-cleanhead-WASHIE3.4-20081017.log
Rate: 1986.13 creates/sec (total: 2 threads 595839 creates 300 secs)
Rate: 2009.84 creates/sec (total: 2 threads 602952 creates 300 secs)
Rate: 1887.05 creates/sec (total: 2 threads 566115 creates 300 secs)
Rate: 2060.59 creates/sec (total: 2 threads 618178 creates 300 secs)
Rate: 1868.89 creates/sec (total: 2 threads 560666 creates 300 secs)
Rate: 1976.95 creates/sec (total: 2 threads 594084 creates 301 secs)
zam at lts-head:~$

Average values:
COS=1 : 1930.5
COS=0 : 1915.3
HEAD  : 1964.9

These results were obtained on another test cluster, so they are not directly comparable with the previous ones.
The profiling chart for COS=1; ldlm_resource_find is no longer in the top 20:

198152   2.9052  e1000.ko     e1000      e1000_irq_enable
158535   2.3243  e1000.ko     e1000      e1000_intr_msi
152375   2.2340  vmlinux      vmlinux    schedule
148709   2.1803  obdclass.ko  obdclass   htable_lookup
109440   1.6045  lvfs.ko      lvfs       lprocfs_counter_add
102773   1.5068  vmlinux      vmlinux    __switch_to
95607    1.4017  vmlinux      vmlinux    __find_get_block
90447    1.3261  vmlinux      vmlinux    kfree
88713    1.3006  vmlinux      vmlinux    mwait_idle
84511    1.2390  vmlinux      vmlinux    try_to_wake_up
83801    1.2286  vmlinux      vmlinux    memset
79735    1.1690  vmlinux      vmlinux    __kmalloc
78594    1.1523  oprofiled    oprofiled  (no symbols)
71184    1.0437  ksocklnd.ko  ksocklnd   ksocknal_scheduler
65613    0.9620  vmlinux      vmlinux    kmem_cache_free
64525    0.9460  mds.ko       mds        mds_lov_dump_objids
64208    0.9414  ptlrpc.ko    ptlrpc     ptlrpc_main
63260    0.9275  vmlinux      vmlinux    memmove
56148    0.8232  vmlinux      vmlinux    find_get_page
55011    0.8065  vmlinux      vmlinux    kmem_cache_alloc

I have attached all performance-enhancement patches (except the latest one, the ldlm resource hash table size increase) to the COS bug (bug 15393).

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
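As a rough illustration of why enlarging the ldlm resource hash table pushes ldlm_resource_find and htable_lookup down the profile, here is a back-of-the-envelope calculation. The numbers below (14,000 live resources, the old and new bucket counts) are assumptions chosen for the example, not the actual Lustre hash sizes; with chained hashing a successful lookup examines roughly 1 + load/2 entries, so multiplying the bucket count divides the chain-walking work correspondingly.

/* Back-of-the-envelope illustration only: the resource count and the
 * old/new bucket counts are assumptions, not measured Lustre values. */
#include <stdio.h>

int main(void)
{
        double resources = 14000.0;                 /* roughly one COS commit batch */
        unsigned long buckets[] = { 1024, 16384 };  /* hypothetical old/new table sizes */

        for (int i = 0; i < 2; i++) {
                double load = resources / buckets[i];
                /* expected entries examined by a successful chained-hash lookup */
                printf("%5lu buckets: load %.1f, ~%.1f entries scanned per lookup\n",
                       buckets[i], load, 1.0 + load / 2.0);
        }
        return 0;
}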