Hello!

A few days ago a question arose (probably not for the first time): how quickly can we create files on the MDS (ignoring OST object creation for now), given super-low-latency I/O and a super-low-latency network? The branch is b1_6, though I plan to run HEAD like this too.

I decided that a fully local mount would probably be a good enough stand-in for the super-low-latency network (Eric has some problems with this, arguing that the lo LND does a local memory copy, which should be somewhat expensive). For the super-low-latency I/O I made some loop files on tmpfs (I did not want to go through a real ramdisk, since I think that cuts out some of the block-searching logic). Naturally, debug was disabled.

The test was creation of 150000 files in a single directory on Lustre (not in the root). As expected, we are 100% CPU-bound in this scenario. My somewhat dated Athlon X2 2.1GHz achieves around 4.6k creates/sec; the problem is that there is pretty big variability (easily 10% both ways). Anyway, I collected some oprofile traces, and the overall picture is pretty similar across the runs: we spend 45+% of the time in the core kernel, 20+% in ptlrpc+ldlm, 9% in ldiskfs, 5-7% in lnet, ~3% in mds, ~3% in jbd.

Looking at individual functions, the main offender is do_gettimeofday with 12+% of total CPU time, but the callgraph implies it is not called from Lustre. Second worst is memset_c with 4+%; what is interesting here is that 79% of the calls to long memsets are from init_inode, zeroing out the i_dquot field (600+ bytes on x86_64, 440 bytes on i686), and another 16% are from .text.d_free, which I am not sure how to interpret. Next is add_dirent_to_buf (in ldiskfs) with 2.5-3.5%. Somewhat surprisingly, ptlrpc_main takes 2.4% of CPU as well. There is also plenty of schedule-related activity (schedule, try_to_wake_up, and friends), which combined adds up to a not-so-small 5.4% of CPU.

I decided to look at ptlrpc_main to see what's so bad about it, and sure enough, I see recent code that basically makes it run at least twice for every incoming request, and in different threads at that. After a patch (below), create speed seemed better, but was still pretty variable. oprofile shows that scheduling activity decreased significantly (and ptlrpc_main easily takes 30% less time now), but then other code starts to take more time at random (especially the ldiskfs-related code and the memsets).

One of my suspicions is CPU ping-pong of tasks (I thought that dual-core CPUs do not have all that much penalty for switching CPUs, but maybe I was wrong about that). I pinned all MDS threads to one CPU and ran createmany from the other; that bumped the overall create rate to around 5k creates/sec, and with the ptlrpc_main patch to 5130 creates/sec.

Another thing I noticed is that add_wait_queue_exclusive() adds tasks to the tail of the waitq, on the logic that shared waiters go to the beginning since they need to be woken up anyway, and we only need to wake up one exclusive task. This does not bode well for our usage scenario: all of our service threads are exclusive waiters, so we constantly rotate through all of them, throwing away threads that are hot in the cache and replacing them with stale old ones (well, not the code, but the stack and such). As an experiment, I replaced __add_wait_queue_tail with __add_wait_queue in add_wait_queue_exclusive() on top of all the previous changes (see the sketch below), and I now consistently see 5370 creates/sec.

I guess I need to run the client and server on separate nodes, since that might cut the variability, but then getting a super-fast network would be a lot harder.
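For reference, the wait queue experiment amounts to a one-line change to the stock 2.6 function from kernel/wait.c (shown here in full; with it, exclusive waiters queue LIFO, so wake_up() picks the most recently idled, cache-hot service thread):

void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
{
        unsigned long flags;

        wait->flags |= WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        /* was: __add_wait_queue_tail(q, wait) -- FIFO among exclusive
         * waiters, i.e. wake_up() always picks the coldest thread */
        __add_wait_queue(q, wait);
        spin_unlock_irqrestore(&q->lock, flags);
}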
The tests were pretty simplistic in the sense that there was not a lot of contention on the MDS (due to the single-threaded nature of the creates), but somewhat interesting anyway. I wonder if anyone has ideas about other reasons for the big variability in individual function times for the same workload.

--- ptlrpc/service.c	11 Feb 2009 08:42:10 -0000	1.114.30.44
+++ ptlrpc/service.c	14 Feb 2009 08:43:24 -0000
@@ -1156,7 +1156,6 @@ ptlrpc_server_handle_req_in(struct ptlrp
         rc = ptlrpc_server_request_add(svc, req);
         if (rc)
                 GOTO(err_req, rc);
-        cfs_waitq_signal(&svc->srv_waitq);
         RETURN(1);
 
 err_req:
@@ -1654,14 +1653,15 @@ static int ptlrpc_main(void *arg)
                 if (!list_empty(&svc->srv_reply_queue))
                         ptlrpc_server_handle_reply(svc);
 
+recheck_queue:
                 if (!list_empty(&svc->srv_req_in_queue)) {
                         /* Process all incoming reqs before handling any */
                         ptlrpc_server_handle_req_in(svc);
                         /* but limit ourselves in case of flood */
                         if (counter++ < 1000)
-                                continue;
-                        counter = 0;
+                                goto recheck_queue;
                 }
+                counter = 0;
 
                 if (svc->srv_at_check)
                         ptlrpc_at_check_timed(svc);

Bye, Oleg
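To spell out the double run: before the patch, the thread that moved a request off srv_req_in_queue signalled srv_waitq (waking a second thread) and then came back around its own loop for the same request. A condensed, annotated sketch of the patched intake path (names as in b1_6; the rest of the service thread loop is omitted):

/* In ptlrpc_server_handle_req_in(): the dropped cfs_waitq_signal()
 * used to wake a second service thread for a request this thread had
 * just moved to the processing queue -- the current thread is about
 * to handle it anyway, so the wakeup only created a race for it. */

/* In ptlrpc_main(): drain the incoming queue in this thread ... */
recheck_queue:
        if (!list_empty(&svc->srv_req_in_queue)) {
                ptlrpc_server_handle_req_in(svc);
                /* ... but cap the drain so a flood of incoming requests
                 * cannot starve reply handling and AT checks below. */
                if (counter++ < 1000)
                        goto recheck_queue;
        }
        counter = 0;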
Hello!

On Feb 15, 2009, at 10:02 AM, Nic Henke wrote:

> Was this all of the changes? Why remove the cfs_waitq_signal?

Yes, it was. We remove cfs_waitq_signal because it wakes up another thread to process a message that we have just moved from the incoming queue to the "to be processed" queue. It helps in my case because I only have one message waiting at any given time. If there is more than one message waiting, the result is not entirely clear, but I think it should be fine as well: essentially, every incoming message woke up one processing thread, and then those threads race to reach the incoming message queue first, each picks one request at a time and puts it into the processing queue, then checks whether there are more incoming messages (if the MDS is lightly loaded, there are likely none, because another thread already took care of them), and then processes one request from the processing queue.

I suspect it would be beneficial to process only one incoming message and immediately start handling it, to avoid handling it on a potentially cache-cold CPU, if there is no high-priority handler registered. If there is a high-priority handler registered, we can exit the incoming-message processing early once we meet a high-priority message (a sketch follows this message).

> We are having mdsrate issues on 1.6.5 as well - but so far we are
> not CPU bound yet. We'll be trying things like increasing the number
> of MDS threads and the create_count for the OSTs - If we are not CPU
> bound, we are waiting on something else.

Are you not CPU-bound on the MDS? How many clients (as in separate client nodes) do you run mdsrate on, against an MDS with how many CPUs? Have you tried eliminating the object creation overhead just to see how much effect that has (the mdsrate mknod option)?

Bye, Oleg
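A hypothetical sketch of the "handle one incoming message immediately" idea from the message above, just to make it concrete (this is not b1_6 code, and the srv_hpreq_handler field is an assumption used for illustration):

        /* Inside the ptlrpc_main() service loop: if no high-priority
         * request handler is registered, take exactly one incoming
         * message and service it on this same, still cache-hot thread
         * instead of queueing it for a potentially cache-cold peer. */
        if (!list_empty(&svc->srv_req_in_queue)) {
                ptlrpc_server_handle_req_in(svc);  /* take just one */
                if (svc->srv_hpreq_handler == NULL) {
                        ptlrpc_server_handle_request(svc, thread);
                        continue;
                }
                /* An HP handler exists: keep draining the incoming
                 * queue so a high-priority message is classified early,
                 * and stop draining as soon as one is seen. */
        }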
Hello!

OK, a follow-up to my findings. First of all, it turned out that in this specific (and quite useless) case of a lot of mknods we do an extra, unneeded lookup RPC where we can do without one (also in the somewhat more useful mkdir case). I filed bug 18534 with a prototype fix. With that fix on top of all previous changes, I now see a create rate of ~8.8k creates/sec (with peaks into 9.3k territory).

I also performed tests on HEAD, and it performs significantly worse (5.8k at most), even with all the same fixes ported from b1_6. The CPU load remains the same, and I used lockmeter to verify that there is no significant lock contention. Looking into the oprofile results, it looks like all code just became slower (judging by more hits in various areas for the same workload). Also, some parts of the code are now more heavily loaded (ptlrpc+ldlm in HEAD draws more CPU time, and the llite portion of the code takes 50% more time, though that is still only from 2% to 3% of total CPU time, not all that significant). Another problem on HEAD is huge variability between runs: easily +/- 50% between runs for HEAD, where the b1_6 results are pretty close together. I have no idea why the variability occurs, and I do not yet see anything obvious that would explain the sudden overall performance degradation of the HEAD code either.

I have put my oprofile results at http://linuxhacker.ru/lustre-profile — the file suffixes are -calls for callgraphs, -func for CPU usage per function, and no suffix for CPU usage per module. The oprofile and oprofiled binaries were subtracted from the results. There are three runs each of b1_6 and HEAD; a fresh filesystem was created every time (this is the only way to do it on HEAD now, since you cannot remove 150k files from the same dir due to bug 17560). All three runs of b1_6 gave ~7.8k creates/sec (the lower speed is due to oprofile taking CPU). HEAD runs 1 and 3 gave 3.0k/sec, while run 2 produced 5.1k/sec. It is interesting that in the slower HEAD runs lnet takes 60% more wallclock CPU time than in the faster run, and that the faster HEAD run has lnet at the same wall time as the b1_6 runs. I do not know yet whether that means HEAD sometimes decides to double the amount of traffic for some reason, or whether there is another explanation.

As a comparison, I ran createmany with the same parameters on plain ldiskfs. The measurements are less fine-grained there, since we just divide the number of creates performed by the time in seconds, and creating 150k files on ldiskfs takes only 5-6 seconds, so the rate is 25k-30k creates/sec.

Bye, Oleg
Alex Zhuravlev
2009-Feb-16 15:26 UTC
[Lustre-devel] Some investigations on MDS creation rate
Oleg Drokin wrote:
> I also performed tests on HEAD, and it performs significantly worse
> (5.8k at most), even with all the same fixes ported from b1_6.
> The CPU load remains the same, and I used lockmeter to verify that
> there is no significant lock contention.
> Looking into the oprofile results, it looks like all code just
> became slower (judging by more hits in various areas for the same
> workload).
> Also, some parts of the code are now more heavily loaded (ptlrpc+ldlm
> in HEAD draws more CPU time, and the llite portion of the code takes
> 50% more time

notice HEAD takes two ldlm locks for each create, 1.6 takes one.

> I have no idea why the variability occurs, and I do not yet see
> anything obvious that would explain the sudden overall performance
> degradation of the HEAD code either.

it would be interesting to gather L1/L2 cache hit/miss with oprofile for 1.6 and HEAD.

thanks, Alex
Hello!

On Feb 16, 2009, at 10:26 AM, Alex Zhuravlev wrote:

> Oleg Drokin wrote:
>> I also performed tests on HEAD, and it performs significantly worse
>> (5.8k at most), even with all the same fixes ported from b1_6.
>> The CPU load remains the same, and I used lockmeter to verify that
>> there is no significant lock contention.
>> Looking into the oprofile results, it looks like all code just
>> became slower (judging by more hits in various areas for the same
>> workload).
>> Also, some parts of the code are now more heavily loaded (ptlrpc+ldlm
>> in HEAD draws more CPU time, and the llite portion of the code takes
>> 50% more time
> notice HEAD takes two ldlm locks for each create, 1.6 takes one.

What is the other one? I am looking at mdt_md_create and I see it locks the parent, but not the child, just like b1_6.

Bye, Oleg
Alex Zhuravlev
2009-Feb-17 06:17 UTC
[Lustre-devel] Some investigations on MDS creation rate
Oleg Drokin wrote:
>> notice HEAD takes two ldlm locks for each create, 1.6 takes one.
>
> What is the other one? I am looking at mdt_md_create and I see it locks
> the parent, but not the child, just like b1_6.

one lock on the whole parent directory (to invalidate the client's cache), another lock on the part of the parent directory covering the given name hash. this is a workaround for slow creates from multiple clients due to the rep-ack mechanism.

thanks, Alex
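A purely conceptual sketch of the two-lock scheme Alex describes, with made-up names (the real code lives in mdt's locking paths; the bucket count and hash function are illustrative assumptions):

#include <stdint.h>

#define DIR_HASH_BUCKETS 128            /* assumed granularity */

struct res_id { uint64_t fid; uint32_t bucket; };

/* Lock A: the parent directory as a whole.  Taking a conflicting
 * server-side lock here invalidates the clients' cached view of the
 * directory on every create. */
static struct res_id whole_dir_res(uint64_t parent_fid)
{
        struct res_id r = { parent_fid, UINT32_MAX };
        return r;
}

/* Lock B: only the slice of the directory that the new name hashes
 * into.  Creates of different names from different clients serialize
 * on B only when their hashes collide, so waiting out rep-ack on one
 * create does not stall every other create in the same directory. */
static struct res_id name_slice_res(uint64_t parent_fid, const char *name)
{
        uint32_t h = 5381;              /* djb2, stand-in for the real hash */

        while (*name)
                h = h * 33 + (unsigned char)*name++;

        struct res_id r = { parent_fid, h % DIR_HASH_BUCKETS };
        return r;
}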