Hello,

COS performance testing uncovered COS performance problems in tests that were not expected to be problematic. The tests are:

(3a) parallel file creation from several clients, each client creating files in its own directory;
(3b) parallel file creation from several clients in one shared directory.

The test results are published in Bugzilla bug 17235.

The slowdown depends on hardware; on the test clusters Washie1 and Washie3 it is quite large. The difference between COS=0 and COS=1 was about 20% initially, and PTLDEBUG=0 reduces the slowdown to 12-13% (the COS patch adds debugging stuff like ENTRY/EXIT to ptlrpc routines).

The COS code finds no inter-client dependencies and adds no extra commits during those tests: there are about 60 commits in each 300-second test with both COS=0 and COS=1, i.e. one ldiskfs commit every 5 seconds. However, the oprofile results are terrible for COS=1.

COS code profiling results:

== COS=1 ==
376924  4.5556  ptlrpc.ko    ptlrpc     ptlrpc_server_handle_reply
288668  3.4889  vmlinux      vmlinux    .text.task_rq_lock
281528  3.4026  ptlrpc.ko    ptlrpc     ldlm_resource_find
180107  2.1768  vmlinux      vmlinux    .text.__wake_up_common
173166  2.0929  obdclass.ko  obdclass   htable_lookup
166829  2.0163  e1000e.ko    e1000e     e1000_irq_enable
150765  1.8222  vmlinux      vmlinux    schedule
127059  1.5357  e1000e.ko    e1000e     e1000_intr_msi
119343  1.4424  ptlrpc.ko    ptlrpc     target_send_reply
118733  1.4350  lvfs.ko      lvfs       lprocfs_counter_add
99568   1.2034  vmlinux      vmlinux    __find_get_block
93356   1.1283  vmlinux      vmlinux    __switch_to
91259   1.1030  oprofiled    oprofiled  (no symbols)
91147   1.1016  vmlinux      vmlinux    try_to_wake_up
89665   1.0837  vmlinux      vmlinux    kfree
87774   1.0609  vmlinux      vmlinux    module_refcount
85149   1.0291  vmlinux      vmlinux    memset
84822   1.0252  vmlinux      vmlinux    mwait_idle
78850   0.9530  vmlinux      vmlinux    __kmalloc
70324   0.8500  ptlrpc.ko    ptlrpc     ptlrpc_main
69333   0.8380  vmlinux      vmlinux    memmove

== COS=0 ==
237314  2.7642  e1000e.ko    e1000e     e1000_irq_enable
211304  2.4613  obdclass.ko  obdclass   htable_lookup
197515  2.3007  vmlinux      vmlinux    schedule
180431  2.1017  e1000e.ko    e1000e     e1000_intr_msi
128113  1.4923  vmlinux      vmlinux    __switch_to
127683  1.4873  lvfs.ko      lvfs       lprocfs_counter_add
122777  1.4301  vmlinux      vmlinux    kfree
122584  1.4279  vmlinux      vmlinux    module_refcount
116800  1.3605  vmlinux      vmlinux    __find_get_block
110418  1.2862  vmlinux      vmlinux    mwait_idle
101909  1.1870  vmlinux      vmlinux    memset
101100  1.1776  oprofiled    oprofiled  (no symbols)
96397   1.1228  vmlinux      vmlinux    __kmalloc
91745   1.0686  ptlrpc.ko    ptlrpc     ptlrpc_main
80605   0.9389  ksocklnd.ko  ksocklnd   ksocknal_scheduler
79782   0.9293  vmlinux      vmlinux    kmem_cache_free
76968   0.8965  mds.ko       mds        mds_lov_dump_objids
75094   0.8747  vmlinux      vmlinux    .text.task_rq_lock
75087   0.8746  vmlinux      vmlinux    memmove
72627   0.8460  vmlinux      vmlinux    find_get_page
67242   0.7832  vmlinux      vmlinux    kmem_cache_alloc

== events distribution in ptlrpc_server_handle_reply, the most CPU-consuming function ==
CPU: P4 / Xeon, speed 2800.2 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
vma               samples  %       image name  app name  symbol name
00000000000498f0  376924   4.5556  ptlrpc.ko   ptlrpc    ptlrpc_server_handle_reply
  00000000000498f0       1  2.7e-04
  00000000000498f4      10  0.0027
  0000000000049905       3  8.0e-04
  0000000000049912      22  0.0058
  0000000000049972       1  2.7e-04
  0000000000049977     658  0.1746
  000000000004997e      24  0.0064
  0000000000049a23       5  0.0013
  0000000000049a35      21  0.0056
  0000000000049a80       1  2.7e-04
  0000000000049a85       4  0.0011
  0000000000049a92       2  5.3e-04
  0000000000049a99      22  0.0058
  0000000000049aa9       1  2.7e-04
  0000000000049ab5  371741  98.6249
  ...
                    ^^^^^^ spin_lock(&obd->obd_uncommitted_replies_lock);

I have a patch to avoid the use of obd_uncommitted_replies_lock in ptlrpc_server_handle_reply, but it has minimal effect: ptlrpc_server_handle_reply is still the most CPU-consuming function because of svc->srv_lock contention.

I think the problem is that COS defers processing of replies until transaction commit time. When a commit happens, the MDS has to process thousands of replies (about 14k replies per commit in test 3a) in a short period of time. I guess the MDT service threads are all woken up and spin trying to get the service srv_lock. Processing of new requests may also suffer from this.

I ran the tests with CONFIG_DEBUG_SPINLOCK_SLEEP debugging compiled into the kernel; it found no sleep-under-spinlock bugs.

Further optimization may include:
1. per-reply spin locks (an illustrative sketch follows this message);
2. per-CPU structures and threads to process reply queues.

Any comments?

Thanks.

PS. The test results are much better when the MDS server is the sata20 machine with 4 cores (the MDS from Washie1 has 2 cores); COS=0 and COS=1 show only a 3% difference:

COS=1
Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300 secs)
Rate: 3096.94 creates/sec (total: 2 threads 929083 creates 300 secs)

COS=0
Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301 secs)
Rate: 3152.89 creates/sec (total: 2 threads 945868 creates 300 secs)

--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
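For illustration, a minimal sketch (in kernel-style C) of the per-reply-lock idea proposed above. The names here (struct reply_state, rs_lock, rs_obd_list, commit_replies) are assumptions made up for the example, not the actual ptlrpc structures or the patch attached to the bug; the point is only that the commit callback detaches the whole uncommitted list in O(1) under a list lock and then finishes each reply under its own lock, so service threads never pile up on one service-wide spinlock.

/* Illustrative sketch only: hypothetical names, not the real ptlrpc
 * structures or the patch from the bug. */
#include <linux/spinlock.h>
#include <linux/list.h>

struct reply_state {
        spinlock_t       rs_lock;      /* per-reply lock (optimization idea 1) */
        struct list_head rs_obd_list;  /* linkage in the uncommitted-replies list */
        int              rs_committed;
};

/*
 * Commit callback: splice the whole batch of uncommitted replies off the
 * shared list in O(1) while holding the list lock, then finish each reply
 * under its own rs_lock, so threads handling new requests do not contend
 * on a single service-wide spinlock for every reply.
 */
static void commit_replies(spinlock_t *list_lock, struct list_head *uncommitted)
{
        struct reply_state *rs, *next;
        LIST_HEAD(batch);

        spin_lock(list_lock);
        list_splice_init(uncommitted, &batch);
        spin_unlock(list_lock);

        list_for_each_entry_safe(rs, next, &batch, rs_obd_list) {
                spin_lock(&rs->rs_lock);
                rs->rs_committed = 1;
                list_del_init(&rs->rs_obd_list);
                /* ... release lock references, free or re-queue the reply ... */
                spin_unlock(&rs->rs_lock);
        }
}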
Try to profile with a single CPU? You'll probably get an idea of how a "per-cpu" approach can help.

thanks, Alex

Alexander Zarochentsev wrote:
> I have a patch to avoid the use of obd_uncommitted_replies_lock
> in ptlrpc_server_handle_reply, but it has minimal effect:
> ptlrpc_server_handle_reply is still the most CPU-consuming function
> because of svc->srv_lock contention.
>
> I think the problem is that COS defers processing of replies until
> transaction commit time. When a commit happens, the MDS has to process
> thousands of replies (about 14k replies per commit in test 3a) in a
> short period of time. I guess the MDT service threads are all woken up
> and spin trying to get the service srv_lock. Processing of new requests
> may also suffer from this.
>
> I ran the tests with CONFIG_DEBUG_SPINLOCK_SLEEP debugging compiled
> into the kernel; it found no sleep-under-spinlock bugs.
>
> Further optimization may include:
> 1. per-reply spin locks;
> 2. per-CPU structures and threads to process reply queues.
>
> Any comments?
>
> Thanks.
>
> PS. The test results are much better when the MDS server is the sata20
> machine with 4 cores (the MDS from Washie1 has 2 cores); COS=0 and
> COS=1 show only a 3% difference:
>
> COS=1
> Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300 secs)
> Rate: 3096.94 creates/sec (total: 2 threads 929083 creates 300 secs)
>
> COS=0
> Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301 secs)
> Rate: 3152.89 creates/sec (total: 2 threads 945868 creates 300 secs)
On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> I think the problem is that COS defers processing of replies until
> transaction commit time. When a commit happens, the MDS has to process
> thousands of replies (about 14k replies per commit in test 3a) in a
> short period of time. I guess the MDT service threads are all woken up
> and spin trying to get the service srv_lock. Processing of new requests
> may also suffer from this.

Can you please explain what replies are being blocked? It can't be the create replies, or the clients would be blocked waiting after starting a single create each.

I think the thread and lock contention is only part of the issue - if all of these replies are blocked until transaction commit, this wastes all of the bandwidth on the network while the replies are being held.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas,

Zam doesn't mean that the reply messages are being blocked - he means the reply state that is holding lock references. Previously those got cleaned up one at a time as the repACKs came in; now they get scheduled by the 1000s on the commit callback. I'm betting that we're getting thundering herds of MDS threads at this time and that we should have dedicated 1-per-CPU threads doing the cleanup, to minimize interference with normal request processing. Bzzz thinks it's purely spinlock contention that's the source of the inefficiency. So Zam is going to run an experiment with a single-CPU MDS to see if the performance issue goes away. I'm betting it won't and Bzzz is betting it will. We should see soon...

Cheers,
Eric

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 11 October 2008 5:00 PM
> To: Alexander Zarochentsev
> Cc: lustre-devel at lists.lustre.org; lustre-recovery at Sun.COM
> Subject: Re: [Lustre-devel] COS performance issues
>
> On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> > I think the problem is that COS defers processing of replies until
> > transaction commit time. When a commit happens, the MDS has to process
> > thousands of replies (about 14k replies per commit in test 3a) in a
> > short period of time. I guess the MDT service threads are all woken up
> > and spin trying to get the service srv_lock. Processing of new requests
> > may also suffer from this.
>
> Can you please explain what replies are being blocked? It can't be the
> create replies, or the clients would be blocked waiting after starting a
> single create each.
>
> I think the thread and lock contention is only part of the issue - if all
> of these replies are blocked until transaction commit, this wastes all of
> the bandwidth on the network while the replies are being held.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
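As an illustration of the 1-per-CPU cleanup-thread suggestion above, here is a hedged sketch in kernel-style C. The names (reply_handling_thread, hrt_*, reply_handler_main) are made up for the example and are not the code that eventually landed; the idea is simply that each CPU gets its own queue, lock and worker, so commit-time cleanup neither wakes a herd of MDS service threads nor funnels through one service-wide lock.

/* Sketch only: hypothetical per-CPU reply-cleanup thread, not Lustre code. */
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/wait.h>

struct reply_handling_thread {
        spinlock_t        hrt_lock;   /* protects hrt_queue only */
        struct list_head  hrt_queue;  /* reply states scheduled at commit time */
        wait_queue_head_t hrt_wait;
};

static int reply_handler_main(void *arg)
{
        struct reply_handling_thread *hrt = arg;

        while (!kthread_should_stop()) {
                LIST_HEAD(batch);

                if (wait_event_interruptible(hrt->hrt_wait,
                                             !list_empty(&hrt->hrt_queue) ||
                                             kthread_should_stop()))
                        continue;   /* interrupted; re-check the stop condition */

                spin_lock(&hrt->hrt_lock);
                list_splice_init(&hrt->hrt_queue, &batch);
                spin_unlock(&hrt->hrt_lock);

                /* Drain the batch without touching any service-wide lock, so
                 * ordinary MDS service threads keep handling new requests. */
                /* ... release lock references and free each reply state ... */
        }
        return 0;
}

/*
 * One such thread would be started per CPU, e.g. created with
 * kthread_create() and pinned with kthread_bind(task, cpu); the commit
 * callback would then append each reply state to the queue of the CPU it
 * was created on and wake only that thread.
 */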
On 11 October 2008 20:00:10 Andreas Dilger wrote:
> On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> > I think the problem is that COS defers processing of replies until
> > transaction commit time. When a commit happens, the MDS has to process
> > thousands of replies (about 14k replies per commit in test 3a) in a
> > short period of time. I guess the MDT service threads are all woken
> > up and spin trying to get the service srv_lock. Processing of new
> > requests may also suffer from this.
>
> Can you please explain what replies are being blocked? It can't be
> the create replies, or the clients would be blocked waiting after
> starting a single create each.
>
> I think the thread and lock contention is only part of the issue - if
> all of these replies are blocked until transaction commit, this wastes
> all of the bandwidth on the network while the replies are being held.

As Eric already explained, the replies themselves have already been sent, but the reply state objects remain in memory. Their cleanup is almost completely serialized by the per-service spinlock. That may be only part of the problem, but it is the first one to solve.

> Cheers, Andreas

WBR,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
> Try to profile with a single CPU? You'll probably get an idea of how
> a "per-cpu" approach can help.

I booted the MDS server with the maxcpus=1 kernel parameter and here are the results:

cos=0
2039.31 creates/sec (total: 2 threads 611794 creates 300 secs)
2037.80 creates/sec (total: 2 threads 611341 creates 300 secs)
2076.21 creates/sec (total: 2 threads 622864 creates 300 secs)

cos=1
1874.93 creates/sec (total: 2 threads 564354 creates 301 secs)
1923.97 creates/sec (total: 2 threads 577191 creates 300 secs)
1892.61 creates/sec (total: 2 threads 567783 creates 300 secs)
1874.74 creates/sec (total: 2 threads 562421 creates 300 secs)

Unfortunately, profiling info isn't available yet: these results were obtained with SLES10, which can boot with maxcpus=1 but has no oprofile installed.

> Alexander Zarochentsev wrote:
> > I have a patch to avoid the use of obd_uncommitted_replies_lock in
> > ptlrpc_server_handle_reply, but it has minimal effect [...]

--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
It would be good to look at the profiles, as the next one was ldlm_resource_get().

thanks, Alex

Alexander Zarochentsev wrote:
> On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
>> Try to profile with a single CPU? You'll probably get an idea of how
>> a "per-cpu" approach can help.
>
> I booted the MDS server with the maxcpus=1 kernel parameter and here
> are the results:
>
> cos=0
> 2039.31 creates/sec (total: 2 threads 611794 creates 300 secs)
> 2037.80 creates/sec (total: 2 threads 611341 creates 300 secs)
> 2076.21 creates/sec (total: 2 threads 622864 creates 300 secs)
>
> cos=1
> 1874.93 creates/sec (total: 2 threads 564354 creates 301 secs)
> 1923.97 creates/sec (total: 2 threads 577191 creates 300 secs)
> 1892.61 creates/sec (total: 2 threads 567783 creates 300 secs)
> 1874.74 creates/sec (total: 2 threads 562421 creates 300 secs)
>
> Unfortunately, profiling info isn't available yet: these results were
> obtained with SLES10, which can boot with maxcpus=1 but has no oprofile
> installed.
> [...]
On 12 October 2008 23:12:10 Alex Zhuravlev wrote:
> It would be good to look at the profiles, as the next one was
> ldlm_resource_get().

Alex, look, ptlrpc_server_handle_reply has gone:

CPU: P4 / Xeon, speed 2800.25 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %       image name   app name   symbol name
312318   4.7857  vmlinux      vmlinux    try_to_wake_up
259690   3.9792  ptlrpc.ko    ptlrpc     ldlm_resource_find
175916   2.6956  vmlinux      vmlinux    __wake_up_common
160912   2.4657  e1000.ko     e1000      e1000_irq_enable
157031   2.4062  obdclass.ko  obdclass   htable_lookup
134309   2.0580  e1000.ko     e1000      e1000_intr_msi
121261   1.8581  lvfs.ko      lvfs       lprocfs_counter_add
87613    1.3425  vmlinux      vmlinux    __find_get_block
87562    1.3417  oprofiled    oprofiled  (no symbols)
77926    1.1941  vmlinux      vmlinux    memset
74561    1.1425  vmlinux      vmlinux    __kmalloc
70952    1.0872  vmlinux      vmlinux    __switch_to
67288    1.0311  mds.ko       mds        mds_lov_dump_objids
67085    1.0279  vmlinux      vmlinux    memmove
66521    1.0193  vmlinux      vmlinux    kfree
55112    0.8445  ptlrpc.ko    ptlrpc     ptlrpc_main
54357    0.8329  vmlinux      vmlinux    schedule
53501    0.8198  vmlinux      vmlinux    find_get_page

> thanks, Alex
>
> Alexander Zarochentsev wrote:
> > On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
> >> Try to profile with a single CPU? You'll probably get an idea of
> >> how a "per-cpu" approach can help.
> >
> > I booted the MDS server with the maxcpus=1 kernel parameter [...]

--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
Cool! Do you have a profile with COS disabled? How is it different?

I guess ldlm_resource_find() is a result of COS, as now we've got 10K times more locks and resources, but what about try_to_wake_up() and __wake_up_common()?

thanks, Alex

Alexander Zarochentsev wrote:
> On 12 October 2008 23:12:10 Alex Zhuravlev wrote:
>> It would be good to look at the profiles, as the next one was
>> ldlm_resource_get().
>
> Alex, look, ptlrpc_server_handle_reply has gone:
>
> CPU: P4 / Xeon, speed 2800.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples  %       image name   app name   symbol name
> 312318   4.7857  vmlinux      vmlinux    try_to_wake_up
> 259690   3.9792  ptlrpc.ko    ptlrpc     ldlm_resource_find
> 175916   2.6956  vmlinux      vmlinux    __wake_up_common
> 160912   2.4657  e1000.ko     e1000      e1000_irq_enable
> 157031   2.4062  obdclass.ko  obdclass   htable_lookup
> [...]
On 13 October 2008 19:04:51 Alex Zhuravlev wrote:
> Cool! Do you have a profile with COS disabled? How is it different?
> I guess ldlm_resource_find() is a result of COS, as now we've got 10K
> times more locks and resources, but what about try_to_wake_up() and
> __wake_up_common()?

What I got after a number of performance optimizations:

332019   4.4010  ptlrpc.ko    ptlrpc     ldlm_resource_find
190465   2.5247  obdclass.ko  obdclass   htable_lookup
175260   2.3231  e1000.ko     e1000      e1000_irq_enable
146324   1.9396  vmlinux      vmlinux    schedule
139799   1.8531  e1000.ko     e1000      e1000_intr_msi
121729   1.6136  lvfs.ko      lvfs       lprocfs_counter_add
105164   1.3940  vmlinux      vmlinux    __find_get_block
102296   1.3560  vmlinux      vmlinux    memset
96728    1.2822  vmlinux      vmlinux    __switch_to
92123    1.2211  vmlinux      vmlinux    kfree
90847    1.2042  oprofiled    oprofiled  (no symbols)
84152    1.1155  vmlinux      vmlinux    mwait_idle
82378    1.0919  vmlinux      vmlinux    __kmalloc
80368    1.0653  vmlinux      vmlinux    try_to_wake_up
79110    1.0486  mds.ko       mds        mds_lov_dump_objids
71730    0.9508  vmlinux      vmlinux    memmove

ptlrpc_server_handle_reply has gone thanks to the per-reply spin lock now implemented, and try_to_wake_up has dropped thanks to the dedicated reply-handling threads; only ldlm_resource_find and htable_lookup are left, which I hope to fix with the next patch.

The massive create tests show the following results:

COS=1
Rate: 2085.73 creates/sec (total: 2 threads 625719 creates 300 secs)
Rate: 2072.38 creates/sec (total: 2 threads 621713 creates 300 secs)
Rate: 2053.80 creates/sec (total: 2 threads 616141 creates 300 secs)

COS=0
Rate: 2128.22 creates/sec (total: 2 threads 638465 creates 300 secs)
Rate: 2151.05 creates/sec (total: 2 threads 645316 creates 300 secs)
Rate: 2169.91 creates/sec (total: 2 threads 650974 creates 300 secs)

Clean HEAD
Rate: 2199.67 creates/sec (total: 2 threads 659900 creates 300 secs)
Rate: 2213.95 creates/sec (total: 2 threads 664184 creates 300 secs)
Rate: 2186.74 creates/sec (total: 2 threads 658209 creates 301 secs)
Rate: 2206.10 creates/sec (total: 2 threads 661831 creates 300 secs)
Rate: 2217.49 creates/sec (total: 2 threads 665247 creates 300 secs)
Rate: 2230.86 creates/sec (total: 2 threads 669258 creates 300 secs)

WBR,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
On Oct 17, 2008, at 01:07, Alexander Zarochentsev wrote:
> The massive create tests show the following results:
>
> COS=1
> Rate: 2085.73 creates/sec (total: 2 threads 625719 creates 300 secs)
> Rate: 2072.38 creates/sec (total: 2 threads 621713 creates 300 secs)
> Rate: 2053.80 creates/sec (total: 2 threads 616141 creates 300 secs)
>
> COS=0
> Rate: 2128.22 creates/sec (total: 2 threads 638465 creates 300 secs)
> Rate: 2151.05 creates/sec (total: 2 threads 645316 creates 300 secs)
> Rate: 2169.91 creates/sec (total: 2 threads 650974 creates 300 secs)
>
> Clean HEAD
> Rate: 2199.67 creates/sec (total: 2 threads 659900 creates 300 secs)
> Rate: 2213.95 creates/sec (total: 2 threads 664184 creates 300 secs)
> Rate: 2186.74 creates/sec (total: 2 threads 658209 creates 301 secs)
> Rate: 2206.10 creates/sec (total: 2 threads 661831 creates 300 secs)
> Rate: 2217.49 creates/sec (total: 2 threads 665247 creates 300 secs)
> Rate: 2230.86 creates/sec (total: 2 threads 669258 creates 300 secs)

Any idea why COS=0 was consistently a bit slower than HEAD?

robert
On 18 October 2008 00:55:03 Robert Read wrote:
> On Oct 17, 2008, at 01:07, Alexander Zarochentsev wrote:
> > The massive create tests show the following results:
> > [...]
>
> Any idea why COS=0 was consistently a bit slower than HEAD?

I only suspect the ENTRY/EXIT stuff I added to several ptlrpc functions. Next time I will try to build Lustre with CDEBUG_ENTRY_EXIT=0.

> robert

The latest results I got after increasing the ldlm resource hash table size are:

zam at lts-head:~$ zgrep Rate: m-ht-improved-WASHIE3.4-20081017.log
COS = 1
Rate: 1887.57 creates/sec (total: 2 threads 566271 creates 300 secs)
Rate: 1980.18 creates/sec (total: 2 threads 594053 creates 300 secs)
Rate: 1923.76 creates/sec (total: 2 threads 577129 creates 300 secs)
COS = 0
Rate: 1804.92 creates/sec (total: 2 threads 541475 creates 300 secs)
Rate: 1915.29 creates/sec (total: 2 threads 574587 creates 300 secs)
Rate: 2025.70 creates/sec (total: 2 threads 607711 creates 300 secs)

CLEAN HEAD:
zam at lts-head:~$ zgrep Rate: m-cleanhead-WASHIE3.4-20081017.log
Rate: 1986.13 creates/sec (total: 2 threads 595839 creates 300 secs)
Rate: 2009.84 creates/sec (total: 2 threads 602952 creates 300 secs)
Rate: 1887.05 creates/sec (total: 2 threads 566115 creates 300 secs)
Rate: 2060.59 creates/sec (total: 2 threads 618178 creates 300 secs)
Rate: 1868.89 creates/sec (total: 2 threads 560666 creates 300 secs)
Rate: 1976.95 creates/sec (total: 2 threads 594084 creates 301 secs)
zam at lts-head:~$

Average values:
COS=1 : 1930.5
COS=0 : 1915.3
HEAD  : 1964.9

These results were obtained on another test cluster, so they are not directly comparable with the previous ones.
The profiling chart for COS=1; ldlm_resource_find is no longer in the top 20:

198152   2.9052  e1000.ko     e1000      e1000_irq_enable
158535   2.3243  e1000.ko     e1000      e1000_intr_msi
152375   2.2340  vmlinux      vmlinux    schedule
148709   2.1803  obdclass.ko  obdclass   htable_lookup
109440   1.6045  lvfs.ko      lvfs       lprocfs_counter_add
102773   1.5068  vmlinux      vmlinux    __switch_to
95607    1.4017  vmlinux      vmlinux    __find_get_block
90447    1.3261  vmlinux      vmlinux    kfree
88713    1.3006  vmlinux      vmlinux    mwait_idle
84511    1.2390  vmlinux      vmlinux    try_to_wake_up
83801    1.2286  vmlinux      vmlinux    memset
79735    1.1690  vmlinux      vmlinux    __kmalloc
78594    1.1523  oprofiled    oprofiled  (no symbols)
71184    1.0437  ksocklnd.ko  ksocklnd   ksocknal_scheduler
65613    0.9620  vmlinux      vmlinux    kmem_cache_free
64525    0.9460  mds.ko       mds        mds_lov_dump_objids
64208    0.9414  ptlrpc.ko    ptlrpc     ptlrpc_main
63260    0.9275  vmlinux      vmlinux    memmove
56148    0.8232  vmlinux      vmlinux    find_get_page
55011    0.8065  vmlinux      vmlinux    kmem_cache_alloc

I have attached all performance-enhancement patches (except the latest one, the ldlm resource hash table size increase) to the COS bug (bug 15393).

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
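As a rough illustration of why enlarging the ldlm resource hash table pushes ldlm_resource_find and htable_lookup down the profile, here is a back-of-the-envelope calculation. The numbers below (14,000 live resources, the old and new bucket counts) are assumptions chosen for the example, not the actual Lustre hash sizes; with chained hashing a successful lookup examines roughly 1 + load/2 entries, so multiplying the bucket count divides the chain-walking work correspondingly.

/* Back-of-the-envelope illustration only: the resource count and the
 * old/new bucket counts are assumptions, not measured Lustre values. */
#include <stdio.h>

int main(void)
{
        double resources = 14000.0;                 /* roughly one COS commit batch */
        unsigned long buckets[] = { 1024, 16384 };  /* hypothetical old/new table sizes */

        for (int i = 0; i < 2; i++) {
                double load = resources / buckets[i];
                /* expected entries examined by a successful chained-hash lookup */
                printf("%5lu buckets: load %.1f, ~%.1f entries scanned per lookup\n",
                       buckets[i], load, 1.0 + load / 2.0);
        }
        return 0;
}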