Hello!

A few days ago a question arose (probably not for the first time): how quickly can we create files on the MDS (ignoring OST object creation for now), given super-low-latency I/O and a super-low-latency network? The branch is b1_6, though I plan to run HEAD like this too.

I decided that a fully local mount would probably be a good enough stand-in for the super-low-latency network (Eric has some problems with this, arguing that the lo LND does a local memory copy, which should be somewhat expensive). For the super-low-latency I/O I made some loop files on tmpfs (I did not want to go through a real ramdisk, since I think that cuts out some of the block-searching logic). Naturally, debug was disabled.

The test was creation of 150000 files in a single directory on Lustre (not in the root). As expected, we are 100% CPU-bound in this scenario. My somewhat dated Athlon X2 2.1GHz achieves around 4.6k creates/sec; the problem is that there is pretty big variability (easily 10% both ways). Anyway, I collected some oprofile traces, and the overall picture is pretty similar across the runs: we spend 45+% of the time in the core kernel, 20+% in ptlrpc+ldlm, 9% in ldiskfs, 5-7% in lnet, ~3% in mds, ~3% in jbd.

Looking at individual functions, the main offender is do_gettimeofday with 12+% of total CPU time, but the callgraph implies it is not called from Lustre. Second worst is memset_c with 4+%; what is interesting here is that 79% of the calls to long memsets are from init_inode, zeroing out the i_dquot field (600+ bytes on x86_64, 440 bytes on i686), and another 16% are from .text.d_free, which I am not sure how to interpret. Next is add_dirent_to_buf (in ldiskfs) with 2.5-3.5%. Somewhat surprisingly, ptlrpc_main takes 2.4% of CPU as well. There is also plenty of schedule-related activity (schedule, try_to_wake_up, and friends), which combined adds up to a not-so-small 5.4% of CPU.

I decided to look at ptlrpc_main to see what's so bad about it, and sure enough, I see recent code that basically makes it run at least twice for every incoming request, and in different threads at that. After a patch (below), create speed seemed better, but was still pretty variable. oprofile shows that scheduling activity decreased significantly (and ptlrpc_main easily takes 30% less time now), but then other code starts to take more time at random (especially the ldiskfs-related code and the memsets).

One of my suspicions is CPU ping-pong of tasks (I thought that dual-core CPUs do not have all that much penalty for switching CPUs, but maybe I was wrong about that). I pinned all MDS threads to one CPU and ran createmany from the other; that bumped the overall create rate to around 5k creates/sec, and with the ptlrpc_main patch to 5130 creates/sec.

Another thing I noticed is that add_wait_queue_exclusive() adds tasks to the tail of the waitq, on the logic that shared waiters go to the beginning since they need to be woken up anyway, and we only need to wake up one exclusive task. This does not bode well for our usage scenario: all of our service threads are exclusive waiters, so we constantly rotate through all of them, throwing away threads that are hot in the cache and replacing them with stale old ones (well, not the code, but the stack and such). As an experiment, I replaced __add_wait_queue_tail with __add_wait_queue in add_wait_queue_exclusive() on top of all the previous changes (see the sketch below), and I now consistently see 5370 creates/sec.

I guess I need to run the client and server on separate nodes, since that might cut the variability, but then getting a super-fast network would be a lot harder.
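For reference, the wait queue experiment amounts to a one-line change to the stock 2.6 function from kernel/wait.c (shown here in full; with it, exclusive waiters queue LIFO, so wake_up() picks the most recently idled, cache-hot service thread):

void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
{
        unsigned long flags;

        wait->flags |= WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        /* was: __add_wait_queue_tail(q, wait) -- FIFO among exclusive
         * waiters, i.e. wake_up() always picks the coldest thread */
        __add_wait_queue(q, wait);
        spin_unlock_irqrestore(&q->lock, flags);
}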
The tests were pretty simplistic in the sense that there was not a lot of contention on the MDS (due to the single-threaded nature of the creates), but somewhat interesting anyway. I wonder if anyone has ideas about other reasons for the big variability in individual function times for the same workload.

--- ptlrpc/service.c	11 Feb 2009 08:42:10 -0000	1.114.30.44
+++ ptlrpc/service.c	14 Feb 2009 08:43:24 -0000
@@ -1156,7 +1156,6 @@ ptlrpc_server_handle_req_in(struct ptlrp
         rc = ptlrpc_server_request_add(svc, req);
         if (rc)
                 GOTO(err_req, rc);
-        cfs_waitq_signal(&svc->srv_waitq);
         RETURN(1);
 
 err_req:
@@ -1654,14 +1653,15 @@ static int ptlrpc_main(void *arg)
                 if (!list_empty(&svc->srv_reply_queue))
                         ptlrpc_server_handle_reply(svc);
 
+recheck_queue:
                 if (!list_empty(&svc->srv_req_in_queue)) {
                         /* Process all incoming reqs before handling any */
                         ptlrpc_server_handle_req_in(svc);
                         /* but limit ourselves in case of flood */
                         if (counter++ < 1000)
-                                continue;
-                        counter = 0;
+                                goto recheck_queue;
                 }
+                counter = 0;
 
                 if (svc->srv_at_check)
                         ptlrpc_at_check_timed(svc);

Bye, Oleg
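To spell out the double run: before the patch, the thread that moved a request off srv_req_in_queue signalled srv_waitq (waking a second thread) and then came back around its own loop for the same request. A condensed, annotated sketch of the patched intake path (names as in b1_6; the rest of the service thread loop is omitted):

/* In ptlrpc_server_handle_req_in(): the dropped cfs_waitq_signal()
 * used to wake a second service thread for a request this thread had
 * just moved to the processing queue -- the current thread is about
 * to handle it anyway, so the wakeup only created a race for it. */

/* In ptlrpc_main(): drain the incoming queue in this thread ... */
recheck_queue:
        if (!list_empty(&svc->srv_req_in_queue)) {
                ptlrpc_server_handle_req_in(svc);
                /* ... but cap the drain so a flood of incoming requests
                 * cannot starve reply handling and AT checks below. */
                if (counter++ < 1000)
                        goto recheck_queue;
        }
        counter = 0;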
Hello!

On Feb 15, 2009, at 10:02 AM, Nic Henke wrote:

> Was this all of the changes? Why remove the cfs_waitq_signal?

Yes, it was. We remove cfs_waitq_signal because it wakes up another thread to process a message that we have just moved from the incoming queue to the "to be processed" queue. It helps in my case because I only have one message waiting at any given time. If there is more than one message waiting, the result is not entirely clear, but I think it should be fine as well: essentially, every incoming message woke up one processing thread, and then those threads race to reach the incoming message queue first, each picks one request at a time and puts it into the processing queue, then checks whether there are more incoming messages (if the MDS is lightly loaded, there are likely none, because another thread already took care of them), and then processes one request from the processing queue.

I suspect it would be beneficial to process only one incoming message and immediately start handling it, to avoid handling it on a potentially cache-cold CPU, if there is no high-priority handler registered. If there is a high-priority handler registered, we can exit the incoming-message processing early once we meet a high-priority message (a sketch follows this message).

> We are having mdsrate issues on 1.6.5 as well - but so far we are
> not CPU bound yet. We'll be trying things like increasing the number
> of MDS threads and the create_count for the OSTs - If we are not CPU
> bound, we are waiting on something else.

Are you not CPU-bound on the MDS? How many clients (as in separate client nodes) do you run mdsrate on, against an MDS with how many CPUs? Have you tried eliminating the object creation overhead just to see how much effect that has (the mdsrate mknod option)?

Bye, Oleg
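A hypothetical sketch of the "handle one incoming message immediately" idea from the message above, just to make it concrete (this is not b1_6 code, and the srv_hpreq_handler field is an assumption used for illustration):

        /* Inside the ptlrpc_main() service loop: if no high-priority
         * request handler is registered, take exactly one incoming
         * message and service it on this same, still cache-hot thread
         * instead of queueing it for a potentially cache-cold peer. */
        if (!list_empty(&svc->srv_req_in_queue)) {
                ptlrpc_server_handle_req_in(svc);  /* take just one */
                if (svc->srv_hpreq_handler == NULL) {
                        ptlrpc_server_handle_request(svc, thread);
                        continue;
                }
                /* An HP handler exists: keep draining the incoming
                 * queue so a high-priority message is classified early,
                 * and stop draining as soon as one is seen. */
        }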
Hello!

OK, a follow-up to my findings. First of all, it turned out that in this specific (and quite useless) case of a lot of mknods we do an extra, unneeded lookup RPC where we can do without one (also in the somewhat more useful mkdir case). I filed bug 18534 with a prototype fix. With that fix on top of all previous changes, I now see a create rate of ~8.8k creates/sec (with peaks into 9.3k territory).

I also performed tests on HEAD, and it performs significantly worse (5.8k at most), even with all the same fixes ported from b1_6. The CPU load remains the same, and I used lockmeter to verify that there is no significant lock contention. Looking into the oprofile results, it looks like all code just became slower (judging by more hits in various areas for the same workload). Also, some parts of the code are now more heavily loaded (ptlrpc+ldlm in HEAD draws more CPU time, and the llite portion of the code takes 50% more time, though that is still only from 2% to 3% of total CPU time, not all that significant). Another problem on HEAD is huge variability between runs: easily +/- 50% between runs for HEAD, where the b1_6 results are pretty close together. I have no idea why the variability occurs, and I do not yet see anything obvious that would explain the sudden overall performance degradation of the HEAD code either.

I have put my oprofile results at http://linuxhacker.ru/lustre-profile — the file suffixes are -calls for callgraphs, -func for CPU usage per function, and no suffix for CPU usage per module. The oprofile and oprofiled binaries were subtracted from the results. There are three runs each of b1_6 and HEAD; a fresh filesystem was created every time (this is the only way to do it on HEAD now, since you cannot remove 150k files from the same dir due to bug 17560). All three runs of b1_6 gave ~7.8k creates/sec (the lower speed is due to oprofile taking CPU). HEAD runs 1 and 3 gave 3.0k/sec, while run 2 produced 5.1k/sec. It is interesting that in the slower HEAD runs lnet takes 60% more wallclock CPU time than in the faster run, and that the faster HEAD run has lnet at the same wall time as the b1_6 runs. I do not know yet whether that means HEAD sometimes decides to double the amount of traffic for some reason, or whether there is another explanation.

As a comparison, I ran createmany with the same parameters on plain ldiskfs. The measurements are less fine-grained there, since we just divide the number of creates performed by the time in seconds, and creating 150k files on ldiskfs takes only 5-6 seconds, so the rate is 25k-30k creates/sec.

Bye, Oleg
Alex Zhuravlev
2009-Feb-16 15:26 UTC
[Lustre-devel] Some investigations on MDS creation rate
Oleg Drokin wrote:
> I also performed tests on HEAD, and it performs significantly worse
> (5.8k at most), even with all the same fixes ported from b1_6.
> The CPU load remains the same, and I used lockmeter to verify that
> there is no significant lock contention.
> Looking into the oprofile results, it looks like all code just
> became slower (judging by more hits in various areas for the same
> workload).
> Also, some parts of the code are now more heavily loaded (ptlrpc+ldlm
> in HEAD draws more CPU time, and the llite portion of the code takes
> 50% more time

notice HEAD takes two ldlm locks for each create, 1.6 takes one.

> I have no idea why the variability occurs, and I do not yet see
> anything obvious that would explain the sudden overall performance
> degradation of the HEAD code either.

it would be interesting to gather L1/L2 cache hit/miss with oprofile for 1.6 and HEAD.

thanks, Alex
Hello!

On Feb 16, 2009, at 10:26 AM, Alex Zhuravlev wrote:

> Oleg Drokin wrote:
>> I also performed tests on HEAD, and it performs significantly worse
>> (5.8k at most), even with all the same fixes ported from b1_6.
>> The CPU load remains the same, and I used lockmeter to verify that
>> there is no significant lock contention.
>> Looking into the oprofile results, it looks like all code just
>> became slower (judging by more hits in various areas for the same
>> workload).
>> Also, some parts of the code are now more heavily loaded (ptlrpc+ldlm
>> in HEAD draws more CPU time, and the llite portion of the code takes
>> 50% more time
> notice HEAD takes two ldlm locks for each create, 1.6 takes one.

What is the other one? I am looking at mdt_md_create and I see it locks the parent, but not the child, just like b1_6.

Bye, Oleg
Alex Zhuravlev
2009-Feb-17 06:17 UTC
[Lustre-devel] Some investigations on MDS creation rate
Oleg Drokin wrote:
>> notice HEAD takes two ldlm locks for each create, 1.6 takes one.
>
> What is the other one? I am looking at mdt_md_create and I see it locks
> the parent, but not the child, just like b1_6.

one lock on the whole parent directory (to invalidate the client's cache), another lock on the part of the parent directory covering the given name hash. this is a workaround for slow creates from multiple clients due to the rep-ack mechanism.

thanks, Alex
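A purely conceptual sketch of the two-lock scheme Alex describes, with made-up names (the real code lives in mdt's locking paths; the bucket count and hash function are illustrative assumptions):

#include <stdint.h>

#define DIR_HASH_BUCKETS 128            /* assumed granularity */

struct res_id { uint64_t fid; uint32_t bucket; };

/* Lock A: the parent directory as a whole.  Taking a conflicting
 * server-side lock here invalidates the clients' cached view of the
 * directory on every create. */
static struct res_id whole_dir_res(uint64_t parent_fid)
{
        struct res_id r = { parent_fid, UINT32_MAX };
        return r;
}

/* Lock B: only the slice of the directory that the new name hashes
 * into.  Creates of different names from different clients serialize
 * on B only when their hashes collide, so waiting out rep-ack on one
 * create does not stall every other create in the same directory. */
static struct res_id name_slice_res(uint64_t parent_fid, const char *name)
{
        uint32_t h = 5381;              /* djb2, stand-in for the real hash */

        while (*name)
                h = h * 33 + (unsigned char)*name++;

        struct res_id r = { parent_fid, h % DIR_HASH_BUCKETS };
        return r;
}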