Eric Barton
2009-Jul-29 15:37 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Oleg, I'm replying via lustre-devel since this is of general interest.  Comments inline...

> Hello!
>
> I looked into the current lnet code for smp scalability and had some
> discussion with Liang, and I think there are some topics we need to cover.
>
> With ofed 1.3 all interrupts arrive at a single cpu; that cpu looks at some
> data (currently the sending NID) and puts the message into a processing
> queue for whichever CPU matches the hash.  This is already not ideal (even
> with all the boost we are supposedly getting) - it means each queue lock is
> constantly bouncing between the interrupt-receiving cpu and the handling
> CPU.  With ofed 1.4 interrupts are distributed across many cpus, which in
> my opinion has the potential to make the above case even worse: now the
> locks would be bouncing across multiple cpus (not sure if that makes for
> more overhead or the same).
>
> Now consider that we decide to implement somewhat better cpu scheduling
> than that for the MDS (and possibly OSTs too, though that is debatable and
> needs some measurements) - we definitely want hashing based on object IDs.

The advantage of hashing on client NID is that we can hash consistently at
all stack levels without layering violations.  If clients aren't contending
for the same objects, do we get the same benefits with hashing on NID as we
get hashing on object ID?

> The idea was to offload this task to a lustre-provided event callback, but
> that seems to mean we add another cpu rescheduling point that way (in
> addition to the one described above).  Liang told me that we cannot avoid
> the first switch since the interrupt handler cannot process the actual
> message received, as this involves accessing and updating per-NID
> information (credits and stuff), and if we do it on multiple CPUs (in the
> case of ofed 1.4 and other lnds that can have multiple cpus serving
> interrupts) that potentially means a lot of lock contention when a single
> client's requests arrive on multiple cpus.

My own belief is that most if not all performance-critical use cases involve
many more clients than there are server CPUs - i.e. we don't lose by trying
to keep a single client's RPCs local to 1 CPU.  Note that this means looking
through the LND protocol level into the LNET header as early as possible.

> I wonder if this could be relieved somehow?  The ability to call the lustre
> request callback from the interrupt, so that we do only one cpu
> rescheduling, would be great.

This is not an option.  Ensuring that the only code allowed to block
interrupts is the LND has avoided _many_ nasty real-time issues which will
return immediately if we relax this rule.

> (Of course we can try to encode this information somewhere in the actual
> message header, like the xid now, where the lnet interrupt handler can
> access it and use it in its hash algorithm, but that way we give away a lot
> of flexibility, so this is not the best solution, I would think.)

It would be better to add an additional "hints" field to LNET messages which
could be used for this purpose.

> Another scenario that I have not seen discussed but that is potentially
> pretty important for the MDS is the ability to route expected messages (the
> ones like the rep-ack reply) to a specific cpu regardless of what NID they
> came from.  E.g. if we rescheduled an MDS request to some CPU and this is a
> difficult reply, we definitely want the confirmation to be processed on the
> same cpu that sent the reply originally, since it references all the locks
> supposedly served by that CPU, etc.  This is better to happen within LNET.
> I guess a similar thing might be beneficial to clients too, where a reply
> is received on the same CPU that sent the original request, in the hope
> that the cache is still valid and the processing would be so much faster as
> a result.

You could use a "hints" field in the LNET header for this.

> I wonder if there are any ways to influence which CPU receives the
> interrupt initially that we can exploit to avoid the cpu switches
> completely if possible?  Should we investigate polling after a certain
> threshold of incoming messages is met?

Layers below the LND should already be doing interrupt coalescing.

Have we got any measurements to show the impact of handling the message on a
different CPU from the initial interrupt?  If we can keep everything on 1 CPU
once we're in thread context, is 1 switch like this such a big deal?

> Perhaps for RDMA-incapable LNDs we can save on switches by redirecting the
> transfer straight into the buffer registered by the target processing CPU
> and signalling that thread in a cheaper way than double spinlock taking +
> wakeup, or does that become irrelevant due to all the overhead of the
> non-RDMA transfer?

RDMA shouldn't be involved in the message handling for which we need to
improve SMP scaling.  Since RDMA always involves an additional network
round-trip to set up the transfer and may also require mapping buffers into
network VM, anything "small" (<= 4K including LND and LNET protocol overhead)
is transferred by message passing - i.e. received first into dedicated
network buffers and then copied out.  This copying is done in thread context
in the LND, as is the event callback.  So if we do as much as possible in
request_in_callback() (e.g. initial unpacking, AT processing etc.) we'll be
running on the same CPU LNET used to handle the message.

I've attached Liang's measurements where he changed request_in_callback() to
enqueue incoming requests on per-CPU queues.  The measurements were taken
with a 16-core server and 40 clients using DDR IB.  The results show similar
performance gains to those seen with LNET self-test when requests are always
queued to the same CPU.  When requests are queued to a different CPU, total
throughput can fall by as much as ~60%.  However, keep in mind that even with
this unnecessary switch, the total throughput is still getting on for 10x
better than current releases.

    Lustre       LNET           LND
    GETATTR      PUT(request)   client->server: IMMEDIATE
                 PUT(reply)     server->client: IMMEDIATE
    BRW WRITE    PUT(request)   client->server: IMMEDIATE
                 GET(bulk)      server->client: GET_REQ
                                client->server: RDMA + GET_DONE
                 PUT(reply)     server->client: IMMEDIATE
    BRW READ     PUT(request)   client->server: IMMEDIATE
                 PUT(bulk)      server->client: PUT_REQ
                                client->server: PUT_ACK
                                server->client: RDMA + PUT_DONE
                 PUT(reply)     server->client: IMMEDIATE

Peak getattr performance of ~630,000 RPCs/sec translates into the same number
of LND messages per second in both directions.  Peak write performance of
~990MB/s with 4K requests translates to 253,440 write RPCs/sec and 506,880
LND messages per second in both directions.  Similarly, ~640MB/s reads
translates to 163,840 read RPCs/sec, 327,680 incoming LND messages per second
and 491,520 outgoing LND messages per second.

> Also on the lustre front - something I plan to tackle, though I am not yet
> completely sure how: Lustre has a concept of reserving one thread for
> difficult-reply handling + one thread for high-priority message handling
> (if enabled).  In the SMP scalability branch that potentially becomes 2x
> num_cpus reserved threads per service, since naturally a rep-ack reply or
> high-priority message might now arrive on any cpu separately (and message
> queues are per cpu) - that seems like huge overkill to me.  I see that
> there are separate reply-handling threads in HEAD now, so perhaps this
> could be greatly simplified by proper usage of those.  The high-priority
> case seems harder to improve, though.

These threads are required in case all normal service threads are blocking.
I don't suppose this can be a performance-critical case, so violating CPU
affinity for the sake of deadlock avoidance seems OK.  However, is 1 extra
thread per CPU such a big deal?  We'll have 10s-100s of them in any case.

> Does anybody else have any extra thoughts for lustre-side improvements we
> can get off this?

I think we need measurements to prove/disprove whether object affinity trumps
client affinity.

> Bye,
> Oleg

--
Cheers,
Eric

[Attachment: echo_perf.pdf (application/pdf, 108069 bytes)]
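As a concrete illustration of the per-CPU queueing idea above - a standalone
sketch with invented names, not the actual LND or request_in_callback() code:
the sender NID is read from the LNET header as early as possible and hashed
once, so every request from a given client lands on the same per-CPU queue.

    /* Illustrative model only: hypothetical types and names, not Lustre code.
     * Shows the idea of hashing the sender NID as early as possible so that a
     * single client's requests always land on the same per-CPU queue. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CPUS 16

    struct incoming_msg {
        uint64_t src_nid;            /* sender NID, read from the LNET header */
        struct incoming_msg *next;
    };

    struct cpu_queue {
        struct incoming_msg *head, *tail;   /* per-CPU list; lock omitted */
        unsigned long        nqueued;
    };

    static struct cpu_queue queues[NUM_CPUS];

    /* Simple NID hash: any stable function of the NID works, as long as every
     * stack level that wants CPU affinity applies the same one. */
    static unsigned int nid_to_cpu(uint64_t nid)
    {
        return (unsigned int)((nid * 2654435761ULL) >> 32) % NUM_CPUS;
    }

    /* Called from the receive path (thread context in the LND, not the IRQ
     * handler) once the LNET header has been peeked at. */
    static void enqueue_by_nid(struct incoming_msg *msg)
    {
        struct cpu_queue *q = &queues[nid_to_cpu(msg->src_nid)];

        msg->next = NULL;
        if (q->tail)
            q->tail->next = msg;
        else
            q->head = msg;
        q->tail = msg;
        q->nqueued++;
    }

    int main(void)
    {
        struct incoming_msg msgs[4] = {
            { .src_nid = 0x10001 }, { .src_nid = 0x10002 },
            { .src_nid = 0x10001 }, { .src_nid = 0x10003 },
        };

        for (int i = 0; i < 4; i++)
            enqueue_by_nid(&msgs[i]);

        for (int c = 0; c < NUM_CPUS; c++)
            if (queues[c].nqueued)
                printf("cpu %d: %lu request(s)\n", c, queues[c].nqueued);
        return 0;
    }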
Oleg Drokin
2009-Jul-29 16:01 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Hello!

On Jul 29, 2009, at 11:37 AM, Eric Barton wrote:

>> Now consider that we decide to implement somewhat better cpu scheduling
>> than that for the MDS (and possibly OSTs too, though that is debatable
>> and needs some measurements) - we definitely want hashing based on
>> object IDs.
> The advantage of hashing on client NID is that we can hash consistently at
> all stack levels without layering violations.  If clients aren't contending
> for the same objects, do we get the same benefits with hashing on NID as we
> get hashing on object ID?

Yes.  If clients are not contending we get the same benefits, but that never
happens in the real world.  Creates in the same dir are a contention point on
the dir, and there is no point in scheduling all those clients on different
cpus and letting them serialize when we could free those cpus for some other
set of clients doing something else.  I guess this is less important for
OSTs, since we do not recommend letting multiple clients access the same
objects anyway, but in the case where this happens the benefit of serializing
still might be there (though for a non-recommended use case) due to reduced
contention.

>> The idea was to offload this task to a lustre-provided event callback, but
>> that seems to mean we add another cpu rescheduling point that way (in
>> addition to the one described above).  Liang told me that we cannot avoid
>> the first switch since the interrupt handler cannot process the actual
>> message received, as this involves accessing and updating per-NID
>> information (credits and stuff), and if we do it on multiple CPUs (in the
>> case of ofed 1.4 and other lnds that can have multiple cpus serving
>> interrupts) that potentially means a lot of lock contention when a single
>> client's requests arrive on multiple cpus.
> My own belief is that most if not all performance-critical use cases
> involve many more clients than there are server CPUs - i.e. we don't lose
> by trying to keep a single client's RPCs local to 1 CPU.  Note that this
> means looking through the LND protocol level into the LNET header as early
> as possible.

Absolutely.  I am mostly in agreement with you on this, except for the
above-mentioned shared create (or any shared access, really) case.

>> (Of course we can try to encode this information somewhere in the actual
>> message header, like the xid now, where the lnet interrupt handler can
>> access it and use it in its hash algorithm, but that way we give away a
>> lot of flexibility, so this is not the best solution, I would think.)
> It would be better to add an additional "hints" field to LNET messages
> which could be used for this purpose.

Yup.  We need an API for lustre to specify those hints when passing a message
to lnet.  The big question here is: should we then allow lnet to actually use
this hint directly?  If yes, we lose a lot of flexibility.  Suppose we have a
contended object1 with a big queue of requests piled up for it.
Theoretically, in the future we might be able to detect this situation, and
when a request arrives for another object2 whose hash would also send it to
the cpu that is now busy working through all the object1 accesses, we could
schedule it to a different cpu that is completely idle at the moment (and
remember that all requests for object2 should now go to that different cpu).

>> Another scenario that I have not seen discussed but that is potentially
>> pretty important for the MDS is the ability to route expected messages
>> (the ones like the rep-ack reply) to a specific cpu regardless of what NID
>> they came from.  E.g. if we rescheduled an MDS request to some CPU and
>> this is a difficult reply, we definitely want the confirmation to be
>> processed on the same cpu that sent the reply originally, since it
>> references all the locks supposedly served by that CPU, etc.  This is
>> better to happen within LNET.  I guess a similar thing might be beneficial
>> to clients too, where a reply is received on the same CPU that sent the
>> original request, in the hope that the cache is still valid and the
>> processing would be so much faster as a result.
> You could use a "hints" field in the LNET header for this.

Actually, the big difference from the above-mentioned hints is that in this
case we need no API.  Essentially lnet needs to be smart enough to recognize
a reply as something that should go to the same cpu the original message was
sent from.

>> I wonder if there are any ways to influence which CPU receives the
>> interrupt initially that we can exploit to avoid the cpu switches
>> completely if possible?  Should we investigate polling after a certain
>> threshold of incoming messages is met?
> Layers below the LND should already be doing interrupt coalescing.
>
> Have we got any measurements to show the impact of handling the message on
> a different CPU from the initial interrupt?  If we can keep everything on
> 1 CPU once we're in thread context, is 1 switch like this such a big deal?

I do not have any measurements, but I remember Liang did some tests and each
cpu switch is pretty expensive.  And this would already be the second cpu
switch.

>> Perhaps for RDMA-incapable LNDs we can save on switches by redirecting the
>> transfer straight into the buffer registered by the target processing CPU
>> and signalling that thread in a cheaper way than double spinlock taking +
>> wakeup, or does that become irrelevant due to all the overhead of the
>> non-RDMA transfer?
> RDMA shouldn't be involved in the message handling for which we need to
> improve SMP scaling.  Since RDMA always involves an additional network
> round-trip to set up the transfer and may also require mapping buffers into
> network VM, anything "small" (<= 4K including LND and LNET protocol
> overhead) is transferred by message passing - i.e. received first into
> dedicated network buffers and then copied out.  This copying is done in
> thread context in the LND, as is the event callback.

Well, I guess I used the wrong word.  By RDMA I meant a process in which the
message arrives in a registered buffer and then we are signalled that the
message is there, as opposed to a scheme where first we get a signal that a
message is about to arrive and we still have a chance to decide where to land
it.

>> Also on the lustre front - something I plan to tackle, though I am not yet
>> completely sure how: Lustre has a concept of reserving one thread for
>> difficult-reply handling + one thread for high-priority message handling
>> (if enabled).  In the SMP scalability branch that potentially becomes 2x
>> num_cpus reserved threads per service, since naturally a rep-ack reply or
>> high-priority message might now arrive on any cpu separately (and message
>> queues are per cpu) - that seems like huge overkill to me.  I see that
>> there are separate reply-handling threads in HEAD now, so perhaps this
>> could be greatly simplified by proper usage of those.  The high-priority
>> case seems harder to improve, though.
> These threads are required in case all normal service threads are blocking.
> I don't suppose this can be a performance-critical case, so violating CPU
> affinity for the sake of deadlock avoidance seems OK.  However, is 1 extra
> thread per CPU such a big deal?  We'll have 10s-100s of them in any case.

Well, I am not sure if this is a big deal or not yet.  That's why I am
raising the question.

>> Does anybody else have any extra thoughts for lustre-side improvements we
>> can get off this?
> I think we need measurements to prove/disprove whether object affinity
> trumps client affinity.

Absolutely.  And we need to make sure we measure both kinds of workloads,
shared and non-shared.

Bye,
Oleg
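A minimal sketch of the hint-plus-override idea discussed above (all names
invented, not an actual LNET interface): the hint normally determines the CPU
by a plain hash, but the server keeps the freedom to re-route a hot object's
requests to an idle CPU, which is the flexibility Oleg is worried about
losing if lnet applied the hint blindly.

    /* Illustrative sketch only - hypothetical names, not the LNET API.
     * The sender attaches a hint (e.g. an object id) to the message; the
     * receiver normally hashes the hint to a CPU, but keeps a small override
     * table so that requests for a hot object can be re-routed to an idle
     * CPU without the sender knowing anything about server CPU numbering. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CPUS       16
    #define OVERRIDE_SLOTS 64

    struct sched_override {
        uint64_t hint;    /* object the override applies to */
        int      cpu;     /* CPU currently assigned to it, -1 if slot unused */
    };

    static struct sched_override overrides[OVERRIDE_SLOTS];

    static void overrides_init(void)
    {
        for (int i = 0; i < OVERRIDE_SLOTS; i++)
            overrides[i].cpu = -1;
    }

    static int hint_to_cpu(uint64_t hint)
    {
        struct sched_override *ov = &overrides[hint % OVERRIDE_SLOTS];

        /* Honour an explicit override first... */
        if (ov->cpu >= 0 && ov->hint == hint)
            return ov->cpu;
        /* ...otherwise fall back to a plain hash of the hint. */
        return (int)(hint % NUM_CPUS);
    }

    /* Re-route all future requests carrying 'hint' to 'idle_cpu'. */
    static void reroute_hint(uint64_t hint, int idle_cpu)
    {
        struct sched_override *ov = &overrides[hint % OVERRIDE_SLOTS];

        ov->hint = hint;
        ov->cpu  = idle_cpu;
    }

    int main(void)
    {
        overrides_init();
        printf("object 17 -> cpu %d\n", hint_to_cpu(17));
        reroute_hint(17, 9);    /* hash target is busy with a hot object */
        printf("object 17 -> cpu %d\n", hint_to_cpu(17));
        return 0;
    }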
Ricardo M. Correia
2009-Jul-29 18:55 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
On Wed, 2009-07-29 at 12:01 -0400, Oleg Drokin wrote:

> Yes.  If clients are not contending we get the same benefits, but that
> never happens in the real world.  Creates in the same dir are a contention
> point on the dir, and there is no point in scheduling all those clients on
> different cpus and letting them serialize when we could free those cpus for
> some other set of clients doing something else.

Will this still be true with the DMU?

I don't know which locks are involved at the Lustre level, but the DMU itself
does leaf-level locking of ZAP structures (used for directories).

Cheers,
Ricardo
Oleg Drokin
2009-Jul-29 19:05 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Hello!

On Jul 29, 2009, at 2:55 PM, Ricardo M. Correia wrote:

> On Wed, 2009-07-29 at 12:01 -0400, Oleg Drokin wrote:
>> Yes.  If clients are not contending we get the same benefits, but that
>> never happens in the real world.  Creates in the same dir are a contention
>> point on the dir, and there is no point in scheduling all those clients on
>> different cpus and letting them serialize when we could free those cpus
>> for some other set of clients doing something else.
> Will this still be true with the DMU?

To some degree.  Even without the DMU, HEAD has pdirops enabled, which
reduces the locking unit to a certain subset of hash values only.  Keep in
mind that aside from the dir content locking there is another point of
serialization, which is locking the entire directory inode to update times,
size and possibly link count (normally this is quite a small section that is
lost in the noise of the search through the directory, but when you have a
lot of CPUs, bouncing these pages and locks around cpu caches might raise the
level of overhead significantly).

> I don't know which locks are involved at the Lustre level, but the DMU
> itself does leaf-level locking of ZAP structures (used for directories).

Possibly more than one leaf, since in order to ensure you are not inserting a
duplicate entry (with a big enough dir) you need to lock all the blocks that
could contain entries in a given hash range, or however the ZIL(?) orders
entries.

Bye,
Oleg
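A toy sketch of the hash-split locking idea mentioned above (invented names,
not the actual pdirops code): only the hash bucket covering the new name is
locked for the insert, so non-colliding creates in one directory can proceed
in parallel, while the inode attribute update remains the short serialized
section Oleg describes.

    /* Toy model of hash-split directory locking (names invented for
     * illustration, not the real pdirops implementation). */
    #include <pthread.h>
    #include <stdint.h>

    #define DIR_LOCK_BUCKETS 128

    struct dir_inode {
        pthread_mutex_t attr_lock;                      /* times, size, nlink */
        pthread_mutex_t bucket_lock[DIR_LOCK_BUCKETS];  /* name-hash ranges */
        uint64_t        mtime;
    };

    static void dir_init(struct dir_inode *dir)
    {
        pthread_mutex_init(&dir->attr_lock, NULL);
        for (int i = 0; i < DIR_LOCK_BUCKETS; i++)
            pthread_mutex_init(&dir->bucket_lock[i], NULL);
        dir->mtime = 0;
    }

    static uint32_t name_hash(const char *name)
    {
        uint32_t h = 5381;

        while (*name)
            h = h * 33 + (unsigned char)*name++;
        return h;
    }

    static void dir_insert(struct dir_inode *dir, const char *name)
    {
        pthread_mutex_t *bl =
            &dir->bucket_lock[name_hash(name) % DIR_LOCK_BUCKETS];

        pthread_mutex_lock(bl);
        /* ...duplicate check and entry insertion within this hash range... */
        pthread_mutex_unlock(bl);

        /* Every create still funnels through this short attribute update. */
        pthread_mutex_lock(&dir->attr_lock);
        dir->mtime++;                    /* stand-in for a timestamp update */
        pthread_mutex_unlock(&dir->attr_lock);
    }

    int main(void)
    {
        struct dir_inode dir;

        dir_init(&dir);
        dir_insert(&dir, "file-a");
        dir_insert(&dir, "file-b");
        return 0;
    }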
Nicolas Williams
2009-Jul-29 19:22 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
On Wed, Jul 29, 2009 at 04:37:29PM +0100, Eric Barton wrote:
> > Also on the lustre front - something I plan to tackle, though I am not
> > yet completely sure how: Lustre has a concept of reserving one thread for
> > difficult-reply handling + one thread for high-priority message handling
> > (if enabled).  In the SMP scalability branch that potentially becomes 2x
> > num_cpus reserved threads per service, since naturally a rep-ack reply or
> > high-priority message might now arrive on any cpu separately (and message
> > queues are per cpu) - that seems like huge overkill to me.  I see that
> > there are separate reply-handling threads in HEAD now, so perhaps this
> > could be greatly simplified by proper usage of those.  The high-priority
> > case seems harder to improve, though.
>
> These threads are required in case all normal service threads are blocking.
> I don't suppose this can be a performance-critical case, so violating CPU
> affinity for the sake of deadlock avoidance seems OK.  However, is 1 extra
> thread per CPU such a big deal?  We'll have 10s-100s of them in any case.

Probably not.  You could have a single thread per CPU if everything was
written in async I/O, continuation-passing style (CPS), blocking only in a
per-CPU event loop.  That'd reduce context switches, but it'd increase the
amount of context being saved and reloaded as that one thread services each
event/event completion.  In other words, you'd still have context switches!
Also, the code would get insanely complicated -- CPS is for compilers, not
humans (nor do we have Scheme-like continuations in C or in the Linux kernel,
and if we did they'd add quite a bit of run-time overhead too).  And kernels
are not usually written this way either, so it may not even be feasible.  The
thread model is just easier to code to.

> > Does anybody else have any extra thoughts for lustre-side improvements we
> > can get off this?
>
> I think we need measurements to prove/disprove whether object affinity
> trumps client affinity.

If we have secure PTLRPC in the picture then client affinity is more likely
to trump object affinity: the per-client state - keys, key schedules and
sequence number windows - may add up to enough to matter.  (Of course, we
could have multiple streams per client, so that a client could be serviced by
multiple server CPUs.)

Nico
--
Andreas Dilger
2009-Jul-30 01:40 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
On Jul 29, 2009 16:37 +0100, Eric Barton wrote:
> > Now consider that we decide to implement somewhat better cpu scheduling
> > than that for the MDS (and possibly OSTs too, though that is debatable
> > and needs some measurements) - we definitely want hashing based on object
> > IDs.
>
> The advantage of hashing on client NID is that we can hash consistently at
> all stack levels without layering violations.  If clients aren't contending
> for the same objects, do we get the same benefits with hashing on NID as we
> get hashing on object ID?

The problem with hashing on the NID, and only doing micro-benchmarks that
parallelize trivially, is that we are missing very important factors in the
overall performance.  I don't at all object to optimizing the LNET code to be
very scalable in this way, but this isn't the end goal.

I can imagine that keeping the initial message handling (LND processing,
credits, etc.) on a per-NID basis to be CPU local is fine.  However, the
amount of state and locks involved at the Lustre level will far exceed the
connection state at the LNET level, and we need to optimize the place that
has the most overhead.  IMHO, that means having request processing affinity
at a FID level (parent directory, target inode, file offset, etc.).

As can be seen with the echo client graphs, sure we "lose" a lot of "no-op
getattr" performance when we go to 100% ping-pong requests (i.e. no NID
affinity at all), but in absolute terms we still get 250k RPCs/sec even with
no NID affinity.  In contrast, file read and write with 1MB RPCs will
saturate the network at 1000-2000 RPCs/sec, so whether we can handle 250k or
650k empty RPCs/sec is totally meaningless.  I suspect the same would hold
true for the getattr tests if they had to actually do an inode lookup and
read actual data.  If the getattr requests are scheduled to the CPU where the
inode is cached then the real-life performance will be maximized.  It won't
be 650k RPCs/sec, but I don't think that is achievable in most real workloads
anyway.

> > The idea was to offload this task to a lustre-provided event callback,
> > but that seems to mean we add another cpu rescheduling point that way (in
> > addition to the one described above).  Liang told me that we cannot avoid
> > the first switch since the interrupt handler cannot process the actual
> > message received, as this involves accessing and updating per-NID
> > information (credits and stuff), and if we do it on multiple CPUs (in the
> > case of ofed 1.4 and other lnds that can have multiple cpus serving
> > interrupts) that potentially means a lot of lock contention when a single
> > client's requests arrive on multiple cpus.
>
> My own belief is that most if not all performance-critical use cases
> involve many more clients than there are server CPUs - i.e. we don't lose
> by trying to keep a single client's RPCs local to 1 CPU.  Note that this
> means looking through the LND protocol level into the LNET header as early
> as possible.

Let us separate the initial handling of the request at the LNET/LND level
from the hand-off of the request structure itself to the Lustre service
thread.  If we can process the LNET-level locking/accounting in a NID/CPU-
affine manner, and all that is cross-CPU is the request buffer, maybe that is
the lowest-cost "request context switch" that is possible.

AFAIK, it is the OST service thread that is doing the initialization of the
bulk buffers, and not the LNET code, so we don't have a huge amount of data
that needs to be shipped between cores.  If we can also avoid lock ping-pong
on the request queues as requests are being assigned at the Lustre level,
that is ideal.

I think this would be possible by e.g. having multiple per-CPU request
"queuelets" (batches of requests that are handled as a unit, instead of
having per-request processing).  See the ASCII art below for reference.

The IRQ handler puts incoming requests on a CPU-affine list of some sort.
Each request is put into a CPU-affine list by NID hash to minimize peer
processing overhead (credits, etc.).  We get a list of requests that need to
be scheduled to a CPU based on the content of the message, and that
scheduling has to be done outside of the IRQ context.

The LNET code now does the receive processing (still on the same CPU) to call
the req_in handler (CPU request scheduler, possibly the very same as the NRS)
to determine which core will do the full Lustre processing of the request.
The CPU request scheduler will add these requests to one of
$num_active_cpus() _local_ queuelets (q$cpunr.$batchnr) until it is full, or
some deadline (possibly load related) is passed.  At that point the finished
queuelet is moved to the target CPU's local staging area (S$cpunr).

    IRQ handler          LNET/req_sched             OST thread
    -----------          --------------             ----------
    [request]
        |
        v
    CPU-affine list(s)
                         CPU-affine list(s)
                          |    |    |    |
                          v    v    v    v
                         q0.4 q1.3 q2.2 q3.4
                                                    S0->q0.1->Q0 (CPU 0 threads)
                                                    S0->q0.2->Q0 (CPU 0 threads)
                         q0.3 (finished) -> S0
                                                    S0->q0.3->Q0 (CPU 0 threads)
                                                    S1->q1.0->Q0 (CPU 1 threads)
                         q1.1 (finished) -> S1
                                                    S1->q1.1->Q0 (CPU 1 threads)
                         q1.2 (finished) -> S1
                                                    S1->q1.2->Q0 (CPU 1 threads)
                                                    S2->q1.1->Q0 (CPU 2 threads)
                         q2.1 (finished) -> S2
                                                    S2->q2.1->Q0 (CPU 2 threads)
                                                    S3->q3.1->Q0 (CPU 3 threads)
                         q3.2 (finished) -> S3
                                                    S3->q3.2->Q0 (CPU 3 threads)
                         q3.3 (finished) -> S3
                                                    S3->q3.3->Q0 (CPU 3 threads)

As the service threads process requests they periodically check for new
queuelets in their CPU-local staging area and move them to their local
request queue (Q$cpunr).  The requests are processed one at a time by the
CPU-local service threads, as they are today, from their request queue Q.

What is important is that the cross-CPU lock contention is limited to the
handoff of a large number of requests at a time (i.e. the whole queuelet)
instead of happening on a per-request basis, so the lock contention on the
Lustre request queue is orders of magnitude lower.  Also, since the per-CPU
service threads can remove whole queuelets from the staging area at one time,
they will not be holding this lock for any length of time, ensuring the LNET
threads are not blocked.

> > (Of course we can try to encode this information somewhere in the actual
> > message header, like the xid now, where the lnet interrupt handler can
> > access it and use it in its hash algorithm, but that way we give away a
> > lot of flexibility, so this is not the best solution, I would think.)
>
> It would be better to add an additional "hints" field to LNET messages
> which could be used for this purpose.

I'm not against this either.  I think it is a reasonable approach, but the
hints need to be independent of whatever mechanism the server is using for
scheduling.  That means a hint is not a CPU number or anything like that, but
rather e.g. a parent FID number (MDS) or an (object number XOR file offset in
GB).  We might want to have a "major" hint (e.g. parent FID, object number)
and a "minor" hint (e.g. child hash, file offset in GB) to let the server do
load balancing as appropriate.

Consider the OSS case where a large file is being read by many clients.
With NID affinity, there will essentially be completely random cross-CPU
memory accesses.  With object number + offset-in-GB affinity, the file extent
locking and memory accesses will be CPU affine, so there is minimal cross-CPU
memory access.

> > Another scenario that I have not seen discussed but that is potentially
> > pretty important for the MDS is the ability to route expected messages
> > (the ones like the rep-ack reply) to a specific cpu regardless of what
> > NID they came from.  E.g. if we rescheduled an MDS request to some CPU
> > and this is a difficult reply, we definitely want the confirmation to be
> > processed on the same cpu that sent the reply originally, since it
> > references all the locks supposedly served by that CPU, etc.  This is
> > better to happen within LNET.  I guess a similar thing might be
> > beneficial to clients too, where a reply is received on the same CPU that
> > sent the original request, in the hope that the cache is still valid and
> > the processing would be so much faster as a result.
>
> You could use a "hints" field in the LNET header for this.

These should really be "cookies" provided by the server, rather than hints
generated by the client, but the mechanism could be the same.

> These threads are required in case all normal service threads are blocking.
> I don't suppose this can be a performance-critical case, so violating CPU
> affinity for the sake of deadlock avoidance seems OK.  However, is 1 extra
> thread per CPU such a big deal?  We'll have 10s-100s of them in any case.

I agree.  Until this is shown to be a major issue I don't think it is worth
the investment of any time to fix.

> > Does anybody else have any extra thoughts for lustre-side improvements we
> > can get off this?
>
> I think we need measurements to prove/disprove whether object affinity
> trumps client affinity.

Yes, that is the critical question.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
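A compact model of the queuelet hand-off described above (invented names, not
Lustre code); the point it illustrates is that the target CPU's lock is taken
once per batch rather than once per request, and the target drains its whole
staging area in one short critical section.

    /* Toy model of the "queuelet" hand-off: requests bound for a target CPU
     * are collected into a local batch; only when the batch is full (or a
     * deadline passes) is the target CPU's staging lock taken once to splice
     * the whole batch over. */
    #include <pthread.h>
    #include <stdio.h>

    #define QUEUELET_MAX 32

    struct request {
        struct request *next;
    };

    struct queuelet {                 /* local to the scheduling CPU */
        struct request *head, *tail;
        int             count;
    };

    struct staging_area {             /* one per target CPU */
        pthread_mutex_t lock;         /* the only cross-CPU lock in this model */
        struct request *head, *tail;
    };

    /* Splice the whole local batch into the target staging area: one lock
     * round-trip no matter how many requests are in the batch. */
    static void queuelet_flush(struct queuelet *q, struct staging_area *s)
    {
        if (!q->head)
            return;

        pthread_mutex_lock(&s->lock);
        if (s->tail)
            s->tail->next = q->head;
        else
            s->head = q->head;
        s->tail = q->tail;
        pthread_mutex_unlock(&s->lock);

        q->head = q->tail = NULL;
        q->count = 0;
    }

    /* Called on the scheduling CPU for each incoming request. */
    static void queuelet_add(struct queuelet *q, struct request *req,
                             struct staging_area *target)
    {
        req->next = NULL;
        if (q->tail)
            q->tail->next = req;
        else
            q->head = req;
        q->tail = req;

        if (++q->count >= QUEUELET_MAX)
            queuelet_flush(q, target);
    }

    /* Called by a service thread on the target CPU: detach everything in one
     * short critical section, then walk the batch with no lock held. */
    static struct request *staging_drain(struct staging_area *s)
    {
        struct request *batch;

        pthread_mutex_lock(&s->lock);
        batch = s->head;
        s->head = s->tail = NULL;
        pthread_mutex_unlock(&s->lock);
        return batch;
    }

    int main(void)
    {
        static struct request reqs[70];
        struct staging_area s = { .lock = PTHREAD_MUTEX_INITIALIZER };
        struct queuelet q = { 0 };
        int n = 0;

        for (int i = 0; i < 70; i++)
            queuelet_add(&q, &reqs[i], &s);
        queuelet_flush(&q, &s);   /* deadline expired: push the partial batch */

        for (struct request *r = staging_drain(&s); r; r = r->next)
            n++;
        printf("drained %d requests\n", n);
        return 0;
    }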
Liang Zhen
2009-Jul-30 09:25 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Reply inline.....

Andreas Dilger wrote:
> Let us separate the initial handling of the request at the LNET/LND level
> from the hand-off of the request structure itself to the Lustre service
> thread.  If we can process the LNET-level locking/accounting in a NID/CPU-
> affine manner, and all that is cross-CPU is the request buffer, maybe that
> is the lowest-cost "request context switch" that is possible.
>
> AFAIK, it is the OST service thread that is doing the initialization of the
> bulk buffers, and not the LNET code, so we don't have a huge amount of data
> that needs to be shipped between cores.  If we can also avoid lock
> ping-pong on the request queues as requests are being assigned at the
> Lustre level, that is ideal.
>
> I think this would be possible by e.g. having multiple per-CPU request
> "queuelets" (batches of requests that are handled as a unit, instead of
> having per-request processing).  See the ASCII art below for reference.
>
> The IRQ handler puts incoming requests on a CPU-affine list of some sort.
> Each request is put into a CPU-affine list by NID hash to minimize peer
> processing overhead (credits, etc.).  We get a list of requests that need
> to be scheduled to a CPU based on the content of the message, and that
> scheduling has to be done outside of the IRQ context.
>
> The LNET code now does the receive processing (still on the same CPU) to
> call the req_in handler (CPU request scheduler, possibly the very same as
> the NRS) to determine which core will do the full Lustre processing of the
> request.  The CPU request scheduler will add these requests to one of
> $num_active_cpus() _local_ queuelets (q$cpunr.$batchnr) until it is full,
> or some deadline (possibly load related) is passed.  At that point the
> finished queuelet is moved to the target CPU's local staging area
> (S$cpunr).
>
>     IRQ handler          LNET/req_sched             OST thread
>     -----------          --------------             ----------
>     [request]
>         |
>         v
>     CPU-affine list(s)
>                          CPU-affine list(s)
>                           |    |    |    |
>                           v    v    v    v
>                          q0.4 q1.3 q2.2 q3.4
>                                                     S0->q0.1->Q0 (CPU 0 threads)
>                                                     S0->q0.2->Q0 (CPU 0 threads)
>                          q0.3 (finished) -> S0
>                                                     S0->q0.3->Q0 (CPU 0 threads)
>                                                     S1->q1.0->Q0 (CPU 1 threads)
>                          q1.1 (finished) -> S1
>                                                     S1->q1.1->Q0 (CPU 1 threads)
>                          q1.2 (finished) -> S1
>                                                     S1->q1.2->Q0 (CPU 1 threads)
>                                                     S2->q1.1->Q0 (CPU 2 threads)
>                          q2.1 (finished) -> S2
>                                                     S2->q2.1->Q0 (CPU 2 threads)
>                                                     S3->q3.1->Q0 (CPU 3 threads)
>                          q3.2 (finished) -> S3
>                                                     S3->q3.2->Q0 (CPU 3 threads)
>                          q3.3 (finished) -> S3
>                                                     S3->q3.3->Q0 (CPU 3 threads)
>
> As the service threads process requests they periodically check for new
> queuelets in their CPU-local staging area and move them to their local
> request queue (Q$cpunr).  The requests are processed one at a time by the
> CPU-local service threads, as they are today, from their request queue Q.

So the queuelets could be: a) pushed to the target CPU if the local CPU has
collected enough messages for that target; b) polled by the target CPU if the
target CPU is idle.  Case a) is good and can reduce contention, but for b),
if the service threads (of each CPU) poll periodically from all the other
CPUs, there could be an unnecessary delay (the poll interval) whenever those
queuelets never fill up, unless the local CPU "peeks" at the message queue on
the target CPU in the callback and posts the message there directly (instead
of to the local CPU's queuelet) when that queue is empty.  However, there
could be another problem: the "peek" is not a light operation even if it
needs no lock, because the target CPU is likely modifying its own request
queue (exclusive access), so the "peek" is already a cache sync-up.

> What is important is that the cross-CPU lock contention is limited to the
> handoff of a large number of requests at a time (i.e. the whole queuelet)
> instead of happening on a per-request basis, so the lock contention on the
> Lustre request queue is orders of magnitude lower.  Also, since the per-CPU
> service threads can remove whole queuelets from the staging area at one
> time, they will not be holding this lock for any length of time, ensuring
> the LNET threads are not blocked.
>
>>> (Of course we can try to encode this information somewhere in the actual
>>> message header, like the xid now, where the lnet interrupt handler can
>>> access it and use it in its hash algorithm, but that way we give away a
>>> lot of flexibility, so this is not the best solution, I would think.)
>>
>> It would be better to add an additional "hints" field to LNET messages
>> which could be used for this purpose.

I'm quite confused here: I think Oleg was talking about incoming requests,
but the LNet message is totally invisible in interrupt handlers, as the LNet
message is created by lnet_parse(), which is called by the LND scheduler
later (after being woken up by the interrupt handler).

> I'm not against this either.  I think it is a reasonable approach, but the
> hints need to be independent of whatever mechanism the server is using for
> scheduling.  That means a hint is not a CPU number or anything like that,
> but rather e.g. a parent FID number (MDS) or an (object number XOR file
> offset in GB).  We might want to have a "major" hint (e.g. parent FID,
> object number) and a "minor" hint (e.g. child hash, file offset in GB) to
> let the server do load balancing as appropriate.
>
> Consider the OSS case where a large file is being read by many clients.
> With NID affinity, there will essentially be completely random cross-CPU
> memory accesses.  With object number + offset-in-GB affinity, the file
> extent locking and memory accesses will be CPU affine, so there is minimal
> cross-CPU memory access.
>
>>> Another scenario that I have not seen discussed but that is potentially
>>> pretty important for the MDS is the ability to route expected messages
>>> (the ones like the rep-ack reply) to a specific cpu regardless of what
>>> NID they came from.  E.g. if we rescheduled an MDS request to some CPU
>>> and this is a difficult reply, we definitely want the confirmation to be
>>> processed on the same cpu that sent the reply originally, since it
>>> references all the locks supposedly served by that CPU, etc.  This is
>>> better to happen within LNET.  I guess a similar thing might be
>>> beneficial to clients too, where a reply is received on the same CPU that
>>> sent the original request, in the hope that the cache is still valid and
>>> the processing would be so much faster as a result.
>>
>> You could use a "hints" field in the LNET header for this.

That's about the outgoing LNet message when sending a reply; however, sending
a message still needs to go through the "connection" & "peer" of LNet and the
LND as well, and finally go out through the connection of the network stack,
all of which are bound to a CPU hashed by NID (again).  So I think unless we
create connections on all CPUs for each client, a switch like that is
unavoidable.  Actually, I think the fact is: LNet & LND are using NID
affinity for connection & peer, while Lustre is using object affinity, so we
need to switch CPU.  If we want to go through the stack without switching
CPU, then we have to cancel NID affinity in LNet, which means we need a
global peer table and a global lock to serialize; then we can schedule
send/receive on any CPU as we want, but "global" comes back again...

>>> get off this?
>>
>> I think we need measurements to prove/disprove whether object affinity
>> trumps client affinity.
>
> Yes, that is the critical question.

Totally agree....

Regards
Liang

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
Oleg Drokin
2009-Jul-30 13:35 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Hello!

On Jul 30, 2009, at 5:25 AM, Liang Zhen wrote:

>>>> Another scenario that I have not seen discussed but that is potentially
>>>> pretty important for the MDS is the ability to route expected messages
>>>> (the ones like the rep-ack reply) to a specific cpu regardless of what
>>>> NID they came from.  E.g. if we rescheduled an MDS request to some CPU
>>>> and this is a difficult reply, we definitely want the confirmation to be
>>>> processed on the same cpu that sent the reply originally, since it
>>>> references all the locks supposedly served by that CPU, etc.  This is
>>>> better to happen within LNET.  I guess a similar thing might be
>>>> beneficial to clients too, where a reply is received on the same CPU
>>>> that sent the original request, in the hope that the cache is still
>>>> valid and the processing would be so much faster as a result.
>>> You could use a "hints" field in the LNET header for this.
> That's about the outgoing LNet message when sending a reply; however,
> sending a message still needs to go through the "connection" & "peer" of
> LNet and the LND as well, and finally go out through the connection of the
> network stack, all of which are bound to a CPU hashed by NID (again).

Nothing prevents us from introducing an extra argument for the event handler,
though?

Bye,
Oleg
Liang Zhen
2009-Jul-30 13:53 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Oleg Drokin wrote:
> Hello!
>
> On Jul 30, 2009, at 5:25 AM, Liang Zhen wrote:
>>>>> Another scenario that I have not seen discussed but that is potentially
>>>>> pretty important for the MDS is the ability to route expected messages
>>>>> (the ones like the rep-ack reply) to a specific cpu regardless of what
>>>>> NID they came from.  E.g. if we rescheduled an MDS request to some CPU
>>>>> and this is a difficult reply, we definitely want the confirmation to
>>>>> be processed on the same cpu that sent the reply originally, since it
>>>>> references all the locks supposedly served by that CPU, etc.  This is
>>>>> better to happen within LNET.  I guess a similar thing might be
>>>>> beneficial to clients too, where a reply is received on the same CPU
>>>>> that sent the original request, in the hope that the cache is still
>>>>> valid and the processing would be so much faster as a result.
>>>> You could use a "hints" field in the LNET header for this.
>> That's about the outgoing LNet message when sending a reply; however,
>> sending a message still needs to go through the "connection" & "peer" of
>> LNet and the LND as well, and finally go out through the connection of the
>> network stack, all of which are bound to a CPU hashed by NID (again).
>
> Nothing prevents us from introducing an extra argument for the event
> handler, though?

We actually don't need to do that in my branch: when we send a reply, LNet
generates a cookie (MD handle) for the reply buffer which already contains
the current CPU id, and the remote peer sends back the ACK with the same
cookie (MD handle), so the ACK matches the sending CPU id and the callback
runs on that same CPU.  So that's some work we have already done, :)

Regards
Liang

> Bye,
> Oleg
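For illustration, a sketch of the general technique Liang describes (the bit
layout here is invented, not LNet's actual MD handle/cookie format): the CPU
number is folded into the cookie when the reply buffer is posted, so the ACK
that echoes the cookie back can be steered to the same CPU without any table
lookup.

    /* Illustration only: made-up cookie layout, not LNet's real format. */
    #include <stdint.h>
    #include <stdio.h>

    #define COOKIE_CPU_BITS  6                      /* up to 64 CPUs */
    #define COOKIE_CPU_MASK  ((1ULL << COOKIE_CPU_BITS) - 1)

    /* Fold the posting CPU into the low bits of an otherwise unique cookie. */
    static uint64_t cookie_generate(unsigned int cpu, uint64_t seq)
    {
        return (seq << COOKIE_CPU_BITS) | (cpu & COOKIE_CPU_MASK);
    }

    /* Recover the CPU from the cookie echoed back in the ACK. */
    static unsigned int cookie_to_cpu(uint64_t cookie)
    {
        return (unsigned int)(cookie & COOKIE_CPU_MASK);
    }

    int main(void)
    {
        uint64_t cookie = cookie_generate(5, 123456); /* reply posted on CPU 5 */

        /* ...cookie travels with the reply; the peer echoes it in the ACK... */
        printf("ACK handled on CPU %u\n", cookie_to_cpu(cookie));
        return 0;
    }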
Andreas Dilger
2009-Jul-30 23:19 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
On Jul 30, 2009 17:25 +0800, Liang Zhen wrote:
> Andreas Dilger wrote:
>> The IRQ handler puts incoming requests on a CPU-affine list of some sort.
>> Each request is put into a CPU-affine list by NID hash to minimize peer
>> processing overhead (credits, etc.).  We get a list of requests that need
>> to be scheduled to a CPU based on the content of the message, and that
>> scheduling has to be done outside of the IRQ context.
>>
>> The LNET code now does the receive processing (still on the same CPU) to
>> call the req_in handler (CPU request scheduler, possibly the very same as
>> the NRS) to determine which core will do the full Lustre processing of the
>> request.  The CPU request scheduler will add these requests to one of
>> $num_active_cpus() _local_ queuelets (q$cpunr.$batchnr) until it is full,
>> or some deadline (possibly load related) is passed.  At that point the
>> finished queuelet is moved to the target CPU's local staging area
>> (S$cpunr).

Note also that some kinds of replies (OBD_PING, for example) could be
completed entirely by ptlrpc_server_handle_req_in() without invoking any
context switching.

>> As the service threads process requests they periodically check for new
>> queuelets in their CPU-local staging area and move them to their local
>> request queue (Q$cpunr).  The requests are processed one at a time by the
>> CPU-local service threads, as they are today, from their request queue Q.
>
> So the queuelets could be: a) pushed to the target CPU if the local CPU has
> collected enough messages for that target; b) polled by the target CPU if
> the target CPU is idle.  Case a) is good and can reduce contention, but for
> b), if the service threads (of each CPU) poll periodically from all the
> other CPUs, there could be an unnecessary delay (the poll interval)
> whenever those queuelets never fill up, unless the local CPU "peeks" at the
> message queue on the target CPU in the callback and posts the message there
> directly (instead of to the local CPU's queuelet) when that queue is empty.
> However, there could be another problem: the "peek" is not a light
> operation even if it needs no lock, because the target CPU is likely
> modifying its own request queue (exclusive access), so the "peek" is
> already a cache sync-up.

I don't think ALL service threads would necessarily poll for queuelets.  As
you suggest, any polling would be lightweight.  We might not have polling at
all, however.  The LNET code could make a decision based on message arrival
rate, whether there are other unhandled queuelets in the staging list, and a
maximum delay (deadline).  That said, if the service threads are idle and
there are requests to be processed then some lock contention is acceptable,
since the system cannot be too busy at that time.  That wouldn't have to be
polling, but rather a wakeup of a single thread waiting on the request queue.

>>>> (Of course we can try to encode this information somewhere in the actual
>>>> message header, like the xid now, where the lnet interrupt handler can
>>>> access it and use it in its hash algorithm, but that way we give away a
>>>> lot of flexibility, so this is not the best solution, I would think.)
>>>
>>> It would be better to add an additional "hints" field to LNET messages
>>> which could be used for this purpose.
>
> I'm quite confused here: I think Oleg was talking about incoming requests,
> but the LNet message is totally invisible in interrupt handlers, as the
> LNet message is created by lnet_parse(), which is called by the LND
> scheduler later (after being woken up by the interrupt handler).

Would the hints have to be down in the LND-specific headers?  In any case,
something that can be accessed as easily as the NID.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
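A sketch of the OBD_PING fast path mentioned above (invented names, not the
actual ptlrpc_server_handle_req_in() code): a trivial request carries no
per-object state worth scheduling for, so it can be answered inline in the
request-in handler, on whatever CPU LNET already chose, and never incur the
hand-off to a service thread.

    /* Sketch only, with invented names and opcodes. */
    enum req_opcode { REQ_OBD_PING, REQ_OTHER };    /* illustrative opcodes */

    struct request {
        enum req_opcode opcode;
        /* ... unpacked header, reply buffer, etc. ... */
    };

    static void send_empty_reply(struct request *req)    { (void)req; /* ... */ }
    static void enqueue_for_service(struct request *req) { (void)req; /* ... */ }

    /* Runs in the incoming-request callback, before any queueing decision. */
    static void handle_req_in(struct request *req)
    {
        if (req->opcode == REQ_OBD_PING) {
            send_empty_reply(req);       /* answered inline, no context switch */
            return;
        }
        enqueue_for_service(req);        /* everything else goes to a thread */
    }

    int main(void)
    {
        struct request ping = { .opcode = REQ_OBD_PING };

        handle_req_in(&ping);
        return 0;
    }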