Eric Barton
2009-Jul-29 15:37 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Oleg, I'm replying via lustre-devel since this is of general interest.  Comments inline...

> Hello!
>
> I looked into the current lnet code for smp scalability and had some
> discussion with Liang, and I think there are some topics we need to cover.
>
> With ofed 1.3 all interrupts arrive at a single cpu; that cpu looks at some
> data (currently the sending NID) and puts the message into a processing
> queue for whichever CPU matches the hash.  This is already not ideal (even
> with all the boost we are supposedly getting) - it means each queue lock is
> constantly bouncing between the interrupt-receiving cpu and the handling
> CPU.  With ofed 1.4 interrupts are distributed across many cpus, which in
> my opinion has the potential to make the above case even worse: now the
> locks would be bouncing across multiple cpus (not sure if that makes for
> more overhead or the same).
>
> Now consider that we decide to implement somewhat better cpu scheduling
> than that for the MDS (and possibly OSTs too, though that is debatable and
> needs some measurements) - we definitely want hashing based on object IDs.

The advantage of hashing on client NID is that we can hash consistently at
all stack levels without layering violations.  If clients aren't contending
for the same objects, do we get the same benefits with hashing on NID as we
get hashing on object ID?

> The idea was to offload this task to a lustre-provided event callback, but
> that seems to mean we add another cpu rescheduling point that way (in
> addition to the one described above).  Liang told me that we cannot avoid
> the first switch since the interrupt handler cannot process the actual
> message received, as this involves accessing and updating per-NID
> information (credits and stuff), and if we do it on multiple CPUs (in the
> case of ofed 1.4 and other lnds that can have multiple cpus serving
> interrupts) that potentially means a lot of lock contention when a single
> client's requests arrive on multiple cpus.

My own belief is that most if not all performance-critical use cases involve
many more clients than there are server CPUs - i.e. we don't lose by trying
to keep a single client's RPCs local to 1 CPU.  Note that this means looking
through the LND protocol level into the LNET header as early as possible.

> I wonder if this could be relieved somehow?  The ability to call the lustre
> request callback from the interrupt, so that we do only one cpu
> rescheduling, would be great.

This is not an option.  Ensuring that the only code allowed to block
interrupts is the LND has avoided _many_ nasty real-time issues which will
return immediately if we relax this rule.

> (Of course we can try to encode this information somewhere in the actual
> message header, like the xid now, where the lnet interrupt handler can
> access it and use it in its hash algorithm, but that way we give away a lot
> of flexibility, so this is not the best solution, I would think.)

It would be better to add an additional "hints" field to LNET messages which
could be used for this purpose.

> Another scenario that I have not seen discussed but that is potentially
> pretty important for the MDS is the ability to route expected messages (the
> ones like the rep-ack reply) to a specific cpu regardless of what NID they
> came from.  E.g. if we rescheduled an MDS request to some CPU and this is a
> difficult reply, we definitely want the confirmation to be processed on the
> same cpu that sent the reply originally, since it references all the locks
> supposedly served by that CPU, etc.  This is better to happen within LNET.
> I guess a similar thing might be beneficial to clients too, where a reply
> is received on the same CPU that sent the original request, in the hope
> that the cache is still valid and the processing would be so much faster as
> a result.

You could use a "hints" field in the LNET header for this.

> I wonder if there are any ways to influence which CPU receives the
> interrupt initially that we can exploit to avoid the cpu switches
> completely if possible?  Should we investigate polling after a certain
> threshold of incoming messages is met?

Layers below the LND should already be doing interrupt coalescing.

Have we got any measurements to show the impact of handling the message on a
different CPU from the initial interrupt?  If we can keep everything on 1 CPU
once we're in thread context, is 1 switch like this such a big deal?

> Perhaps for RDMA-incapable LNDs we can save on switches by redirecting the
> transfer straight into the buffer registered by the target processing CPU
> and signalling that thread in a cheaper way than double spinlock taking +
> wakeup, or does that become irrelevant due to all the overhead of the
> non-RDMA transfer?

RDMA shouldn't be involved in the message handling for which we need to
improve SMP scaling.  Since RDMA always involves an additional network
round-trip to set up the transfer and may also require mapping buffers into
network VM, anything "small" (<= 4K including LND and LNET protocol overhead)
is transferred by message passing - i.e. received first into dedicated
network buffers and then copied out.  This copying is done in thread context
in the LND, as is the event callback.  So if we do as much as possible in
request_in_callback() (e.g. initial unpacking, AT processing etc.) we'll be
running on the same CPU LNET used to handle the message.

I've attached Liang's measurements where he changed request_in_callback() to
enqueue incoming requests on per-CPU queues.  The measurements were taken
with a 16-core server and 40 clients using DDR IB.  The results show similar
performance gains to those seen with LNET self-test when requests are always
queued to the same CPU.  When requests are queued to a different CPU, total
throughput can fall by as much as ~60%.  However, keep in mind that even with
this unnecessary switch, the total throughput is still getting on for 10x
better than current releases.

    Lustre       LNET           LND
    GETATTR      PUT(request)   client->server: IMMEDIATE
                 PUT(reply)     server->client: IMMEDIATE
    BRW WRITE    PUT(request)   client->server: IMMEDIATE
                 GET(bulk)      server->client: GET_REQ
                                client->server: RDMA + GET_DONE
                 PUT(reply)     server->client: IMMEDIATE
    BRW READ     PUT(request)   client->server: IMMEDIATE
                 PUT(bulk)      server->client: PUT_REQ
                                client->server: PUT_ACK
                                server->client: RDMA + PUT_DONE
                 PUT(reply)     server->client: IMMEDIATE

Peak getattr performance of ~630,000 RPCs/sec translates into the same number
of LND messages per second in both directions.  Peak write performance of
~990MB/s with 4K requests translates to 253,440 write RPCs/sec and 506,880
LND messages per second in both directions.  Similarly, ~640MB/s reads
translates to 163,840 read RPCs/sec, 327,680 incoming LND messages per second
and 491,520 outgoing LND messages per second.

> Also on the lustre front - something I plan to tackle, though I am not yet
> completely sure how: Lustre has a concept of reserving one thread for
> difficult-reply handling + one thread for high-priority message handling
> (if enabled).  In the SMP scalability branch that potentially becomes 2x
> num_cpus reserved threads per service, since naturally a rep-ack reply or
> high-priority message might now arrive on any cpu separately (and message
> queues are per cpu) - that seems like huge overkill to me.  I see that
> there are separate reply-handling threads in HEAD now, so perhaps this
> could be greatly simplified by proper usage of those.  The high-priority
> case seems harder to improve, though.

These threads are required in case all normal service threads are blocking.
I don't suppose this can be a performance-critical case, so violating CPU
affinity for the sake of deadlock avoidance seems OK.  However, is 1 extra
thread per CPU such a big deal?  We'll have 10s-100s of them in any case.

> Does anybody else have any extra thoughts for lustre-side improvements we
> can get off this?

I think we need measurements to prove/disprove whether object affinity trumps
client affinity.

> Bye,
> Oleg

--
Cheers,
Eric

[Attachment: echo_perf.pdf (application/pdf, 108069 bytes)]
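As a concrete illustration of the per-CPU queueing idea above - a standalone
sketch with invented names, not the actual LND or request_in_callback() code:
the sender NID is read from the LNET header as early as possible and hashed
once, so every request from a given client lands on the same per-CPU queue.

    /* Illustrative model only: hypothetical types and names, not Lustre code.
     * Shows the idea of hashing the sender NID as early as possible so that a
     * single client's requests always land on the same per-CPU queue. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CPUS 16

    struct incoming_msg {
        uint64_t src_nid;            /* sender NID, read from the LNET header */
        struct incoming_msg *next;
    };

    struct cpu_queue {
        struct incoming_msg *head, *tail;   /* per-CPU list; lock omitted */
        unsigned long        nqueued;
    };

    static struct cpu_queue queues[NUM_CPUS];

    /* Simple NID hash: any stable function of the NID works, as long as every
     * stack level that wants CPU affinity applies the same one. */
    static unsigned int nid_to_cpu(uint64_t nid)
    {
        return (unsigned int)((nid * 2654435761ULL) >> 32) % NUM_CPUS;
    }

    /* Called from the receive path (thread context in the LND, not the IRQ
     * handler) once the LNET header has been peeked at. */
    static void enqueue_by_nid(struct incoming_msg *msg)
    {
        struct cpu_queue *q = &queues[nid_to_cpu(msg->src_nid)];

        msg->next = NULL;
        if (q->tail)
            q->tail->next = msg;
        else
            q->head = msg;
        q->tail = msg;
        q->nqueued++;
    }

    int main(void)
    {
        struct incoming_msg msgs[4] = {
            { .src_nid = 0x10001 }, { .src_nid = 0x10002 },
            { .src_nid = 0x10001 }, { .src_nid = 0x10003 },
        };

        for (int i = 0; i < 4; i++)
            enqueue_by_nid(&msgs[i]);

        for (int c = 0; c < NUM_CPUS; c++)
            if (queues[c].nqueued)
                printf("cpu %d: %lu request(s)\n", c, queues[c].nqueued);
        return 0;
    }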
Oleg Drokin
2009-Jul-29 16:01 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Hello!

On Jul 29, 2009, at 11:37 AM, Eric Barton wrote:

>> Now consider that we decide to implement somewhat better cpu scheduling
>> than that for the MDS (and possibly OSTs too, though that is debatable
>> and needs some measurements) - we definitely want hashing based on
>> object IDs.
> The advantage of hashing on client NID is that we can hash consistently at
> all stack levels without layering violations.  If clients aren't contending
> for the same objects, do we get the same benefits with hashing on NID as we
> get hashing on object ID?

Yes.  If clients are not contending we get the same benefits, but that never
happens in the real world.  Creates in the same dir are a contention point on
the dir, and there is no point in scheduling all those clients on different
cpus and letting them serialize when we could free those cpus for some other
set of clients doing something else.  I guess this is less important for
OSTs, since we do not recommend letting multiple clients access the same
objects anyway, but in the case where this happens the benefit of serializing
still might be there (though for a non-recommended use case) due to reduced
contention.

>> The idea was to offload this task to a lustre-provided event callback, but
>> that seems to mean we add another cpu rescheduling point that way (in
>> addition to the one described above).  Liang told me that we cannot avoid
>> the first switch since the interrupt handler cannot process the actual
>> message received, as this involves accessing and updating per-NID
>> information (credits and stuff), and if we do it on multiple CPUs (in the
>> case of ofed 1.4 and other lnds that can have multiple cpus serving
>> interrupts) that potentially means a lot of lock contention when a single
>> client's requests arrive on multiple cpus.
> My own belief is that most if not all performance-critical use cases
> involve many more clients than there are server CPUs - i.e. we don't lose
> by trying to keep a single client's RPCs local to 1 CPU.  Note that this
> means looking through the LND protocol level into the LNET header as early
> as possible.

Absolutely.  I am mostly in agreement with you on this, except for the
above-mentioned shared create (or any shared access, really) case.

>> (Of course we can try to encode this information somewhere in the actual
>> message header, like the xid now, where the lnet interrupt handler can
>> access it and use it in its hash algorithm, but that way we give away a
>> lot of flexibility, so this is not the best solution, I would think.)
> It would be better to add an additional "hints" field to LNET messages
> which could be used for this purpose.

Yup.  We need an API for lustre to specify those hints when passing a message
to lnet.  The big question here is: should we then allow lnet to actually use
this hint directly?  If yes, we lose a lot of flexibility.  Suppose we have a
contended object1 with a big queue of requests piled up for it.
Theoretically, in the future we might be able to detect this situation, and
when a request arrives for another object2 whose hash would also send it to
the cpu that is now busy working through all the object1 accesses, we could
schedule it to a different cpu that is completely idle at the moment (and
remember that all requests for object2 should now go to that different cpu).

>> Another scenario that I have not seen discussed but that is potentially
>> pretty important for the MDS is the ability to route expected messages
>> (the ones like the rep-ack reply) to a specific cpu regardless of what NID
>> they came from.  E.g. if we rescheduled an MDS request to some CPU and
>> this is a difficult reply, we definitely want the confirmation to be
>> processed on the same cpu that sent the reply originally, since it
>> references all the locks supposedly served by that CPU, etc.  This is
>> better to happen within LNET.  I guess a similar thing might be beneficial
>> to clients too, where a reply is received on the same CPU that sent the
>> original request, in the hope that the cache is still valid and the
>> processing would be so much faster as a result.
> You could use a "hints" field in the LNET header for this.

Actually, the big difference from the above-mentioned hints is that in this
case we need no API.  Essentially lnet needs to be smart enough to recognize
a reply as something that should go to the same cpu the original message was
sent from.

>> I wonder if there are any ways to influence which CPU receives the
>> interrupt initially that we can exploit to avoid the cpu switches
>> completely if possible?  Should we investigate polling after a certain
>> threshold of incoming messages is met?
> Layers below the LND should already be doing interrupt coalescing.
>
> Have we got any measurements to show the impact of handling the message on
> a different CPU from the initial interrupt?  If we can keep everything on
> 1 CPU once we're in thread context, is 1 switch like this such a big deal?

I do not have any measurements, but I remember Liang did some tests and each
cpu switch is pretty expensive.  And this would already be the second cpu
switch.

>> Perhaps for RDMA-incapable LNDs we can save on switches by redirecting the
>> transfer straight into the buffer registered by the target processing CPU
>> and signalling that thread in a cheaper way than double spinlock taking +
>> wakeup, or does that become irrelevant due to all the overhead of the
>> non-RDMA transfer?
> RDMA shouldn't be involved in the message handling for which we need to
> improve SMP scaling.  Since RDMA always involves an additional network
> round-trip to set up the transfer and may also require mapping buffers into
> network VM, anything "small" (<= 4K including LND and LNET protocol
> overhead) is transferred by message passing - i.e. received first into
> dedicated network buffers and then copied out.  This copying is done in
> thread context in the LND, as is the event callback.

Well, I guess I used the wrong word.  By RDMA I meant a process in which the
message arrives in a registered buffer and then we are signalled that the
message is there, as opposed to a scheme where first we get a signal that a
message is about to arrive and we still have a chance to decide where to land
it.

>> Also on the lustre front - something I plan to tackle, though I am not yet
>> completely sure how: Lustre has a concept of reserving one thread for
>> difficult-reply handling + one thread for high-priority message handling
>> (if enabled).  In the SMP scalability branch that potentially becomes 2x
>> num_cpus reserved threads per service, since naturally a rep-ack reply or
>> high-priority message might now arrive on any cpu separately (and message
>> queues are per cpu) - that seems like huge overkill to me.  I see that
>> there are separate reply-handling threads in HEAD now, so perhaps this
>> could be greatly simplified by proper usage of those.  The high-priority
>> case seems harder to improve, though.
> These threads are required in case all normal service threads are blocking.
> I don't suppose this can be a performance-critical case, so violating CPU
> affinity for the sake of deadlock avoidance seems OK.  However, is 1 extra
> thread per CPU such a big deal?  We'll have 10s-100s of them in any case.

Well, I am not sure if this is a big deal or not yet.  That's why I am
raising the question.

>> Does anybody else have any extra thoughts for lustre-side improvements we
>> can get off this?
> I think we need measurements to prove/disprove whether object affinity
> trumps client affinity.

Absolutely.  And we need to make sure we measure both kinds of workloads,
shared and non-shared.

Bye,
Oleg
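A minimal sketch of the hint-plus-override idea discussed above (all names
invented, not an actual LNET interface): the hint normally determines the CPU
by a plain hash, but the server keeps the freedom to re-route a hot object's
requests to an idle CPU, which is the flexibility Oleg is worried about
losing if lnet applied the hint blindly.

    /* Illustrative sketch only - hypothetical names, not the LNET API.
     * The sender attaches a hint (e.g. an object id) to the message; the
     * receiver normally hashes the hint to a CPU, but keeps a small override
     * table so that requests for a hot object can be re-routed to an idle
     * CPU without the sender knowing anything about server CPU numbering. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CPUS       16
    #define OVERRIDE_SLOTS 64

    struct sched_override {
        uint64_t hint;    /* object the override applies to */
        int      cpu;     /* CPU currently assigned to it, -1 if slot unused */
    };

    static struct sched_override overrides[OVERRIDE_SLOTS];

    static void overrides_init(void)
    {
        for (int i = 0; i < OVERRIDE_SLOTS; i++)
            overrides[i].cpu = -1;
    }

    static int hint_to_cpu(uint64_t hint)
    {
        struct sched_override *ov = &overrides[hint % OVERRIDE_SLOTS];

        /* Honour an explicit override first... */
        if (ov->cpu >= 0 && ov->hint == hint)
            return ov->cpu;
        /* ...otherwise fall back to a plain hash of the hint. */
        return (int)(hint % NUM_CPUS);
    }

    /* Re-route all future requests carrying 'hint' to 'idle_cpu'. */
    static void reroute_hint(uint64_t hint, int idle_cpu)
    {
        struct sched_override *ov = &overrides[hint % OVERRIDE_SLOTS];

        ov->hint = hint;
        ov->cpu  = idle_cpu;
    }

    int main(void)
    {
        overrides_init();
        printf("object 17 -> cpu %d\n", hint_to_cpu(17));
        reroute_hint(17, 9);    /* hash target is busy with a hot object */
        printf("object 17 -> cpu %d\n", hint_to_cpu(17));
        return 0;
    }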
Ricardo M. Correia
2009-Jul-29 18:55 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
On Wed, 2009-07-29 at 12:01 -0400, Oleg Drokin wrote:

> Yes.  If clients are not contending we get the same benefits, but that
> never happens in the real world.  Creates in the same dir are a contention
> point on the dir, and there is no point in scheduling all those clients on
> different cpus and letting them serialize when we could free those cpus for
> some other set of clients doing something else.

Will this still be true with the DMU?

I don't know which locks are involved at the Lustre level, but the DMU itself
does leaf-level locking of ZAP structures (used for directories).

Cheers,
Ricardo
Oleg Drokin
2009-Jul-29 19:05 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Hello!

On Jul 29, 2009, at 2:55 PM, Ricardo M. Correia wrote:

> On Wed, 2009-07-29 at 12:01 -0400, Oleg Drokin wrote:
>> Yes.  If clients are not contending we get the same benefits, but that
>> never happens in the real world.  Creates in the same dir are a contention
>> point on the dir, and there is no point in scheduling all those clients on
>> different cpus and letting them serialize when we could free those cpus
>> for some other set of clients doing something else.
> Will this still be true with the DMU?

To some degree.  Even without the DMU, HEAD has pdirops enabled, which
reduces the locking unit to a certain subset of hash values only.  Keep in
mind that aside from the dir content locking there is another point of
serialization, which is locking the entire directory inode to update times,
size and possibly link count (normally this is quite a small section that is
lost in the noise of the search through the directory, but when you have a
lot of CPUs, bouncing these pages and locks around cpu caches might raise the
level of overhead significantly).

> I don't know which locks are involved at the Lustre level, but the DMU
> itself does leaf-level locking of ZAP structures (used for directories).

Possibly more than one leaf, since in order to ensure you are not inserting a
duplicate entry (with a big enough dir) you need to lock all the blocks that
could contain entries in a given hash range, or however the ZIL(?) orders
entries.

Bye,
Oleg
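A toy sketch of the hash-split locking idea mentioned above (invented names,
not the actual pdirops code): only the hash bucket covering the new name is
locked for the insert, so non-colliding creates in one directory can proceed
in parallel, while the inode attribute update remains the short serialized
section Oleg describes.

    /* Toy model of hash-split directory locking (names invented for
     * illustration, not the real pdirops implementation). */
    #include <pthread.h>
    #include <stdint.h>

    #define DIR_LOCK_BUCKETS 128

    struct dir_inode {
        pthread_mutex_t attr_lock;                      /* times, size, nlink */
        pthread_mutex_t bucket_lock[DIR_LOCK_BUCKETS];  /* name-hash ranges */
        uint64_t        mtime;
    };

    static void dir_init(struct dir_inode *dir)
    {
        pthread_mutex_init(&dir->attr_lock, NULL);
        for (int i = 0; i < DIR_LOCK_BUCKETS; i++)
            pthread_mutex_init(&dir->bucket_lock[i], NULL);
        dir->mtime = 0;
    }

    static uint32_t name_hash(const char *name)
    {
        uint32_t h = 5381;

        while (*name)
            h = h * 33 + (unsigned char)*name++;
        return h;
    }

    static void dir_insert(struct dir_inode *dir, const char *name)
    {
        pthread_mutex_t *bl =
            &dir->bucket_lock[name_hash(name) % DIR_LOCK_BUCKETS];

        pthread_mutex_lock(bl);
        /* ...duplicate check and entry insertion within this hash range... */
        pthread_mutex_unlock(bl);

        /* Every create still funnels through this short attribute update. */
        pthread_mutex_lock(&dir->attr_lock);
        dir->mtime++;                    /* stand-in for a timestamp update */
        pthread_mutex_unlock(&dir->attr_lock);
    }

    int main(void)
    {
        struct dir_inode dir;

        dir_init(&dir);
        dir_insert(&dir, "file-a");
        dir_insert(&dir, "file-b");
        return 0;
    }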
Nicolas Williams
2009-Jul-29 19:22 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
On Wed, Jul 29, 2009 at 04:37:29PM +0100, Eric Barton wrote:
> > Also on the lustre front - something I plan to tackle, though I am not
> > yet completely sure how: Lustre has a concept of reserving one thread for
> > difficult-reply handling + one thread for high-priority message handling
> > (if enabled).  In the SMP scalability branch that potentially becomes 2x
> > num_cpus reserved threads per service, since naturally a rep-ack reply or
> > high-priority message might now arrive on any cpu separately (and message
> > queues are per cpu) - that seems like huge overkill to me.  I see that
> > there are separate reply-handling threads in HEAD now, so perhaps this
> > could be greatly simplified by proper usage of those.  The high-priority
> > case seems harder to improve, though.
>
> These threads are required in case all normal service threads are blocking.
> I don't suppose this can be a performance-critical case, so violating CPU
> affinity for the sake of deadlock avoidance seems OK.  However, is 1 extra
> thread per CPU such a big deal?  We'll have 10s-100s of them in any case.

Probably not.  You could have a single thread per CPU if everything was
written in async I/O, continuation-passing style (CPS), blocking only in a
per-CPU event loop.  That'd reduce context switches, but it'd increase the
amount of context being saved and reloaded as that one thread services each
event/event completion.  In other words, you'd still have context switches!
Also, the code would get insanely complicated -- CPS is for compilers, not
humans (nor do we have Scheme-like continuations in C or in the Linux kernel,
and if we did they'd add quite a bit of run-time overhead too).  And kernels
are not usually written this way either, so it may not even be feasible.  The
thread model is just easier to code to.

> > Does anybody else have any extra thoughts for lustre-side improvements we
> > can get off this?
>
> I think we need measurements to prove/disprove whether object affinity
> trumps client affinity.

If we have secure PTLRPC in the picture then client affinity is more likely
to trump object affinity: the per-client state - keys, key schedules and
sequence number windows - may add up to enough to matter.  (Of course, we
could have multiple streams per client, so that a client could be serviced by
multiple server CPUs.)

Nico
--
Andreas Dilger
2009-Jul-30 01:40 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
On Jul 29, 2009 16:37 +0100, Eric Barton wrote:
> > Now consider that we decide to implement somewhat better cpu scheduling
> > than that for the MDS (and possibly OSTs too, though that is debatable
> > and needs some measurements) - we definitely want hashing based on object
> > IDs.
>
> The advantage of hashing on client NID is that we can hash consistently at
> all stack levels without layering violations.  If clients aren't contending
> for the same objects, do we get the same benefits with hashing on NID as we
> get hashing on object ID?

The problem with hashing on the NID, and only doing micro-benchmarks that
parallelize trivially, is that we are missing very important factors in the
overall performance.  I don't at all object to optimizing the LNET code to be
very scalable in this way, but this isn't the end goal.

I can imagine that keeping the initial message handling (LND processing,
credits, etc.) on a per-NID basis to be CPU local is fine.  However, the
amount of state and locks involved at the Lustre level will far exceed the
connection state at the LNET level, and we need to optimize the place that
has the most overhead.  IMHO, that means having request processing affinity
at a FID level (parent directory, target inode, file offset, etc.).

As can be seen with the echo client graphs, sure we "lose" a lot of "no-op
getattr" performance when we go to 100% ping-pong requests (i.e. no NID
affinity at all), but in absolute terms we still get 250k RPCs/sec even with
no NID affinity.  In contrast, file read and write with 1MB RPCs will
saturate the network at 1000-2000 RPCs/sec, so whether we can handle 250k or
650k empty RPCs/sec is totally meaningless.  I suspect the same would hold
true for the getattr tests if they had to actually do an inode lookup and
read actual data.  If the getattr requests are scheduled to the CPU where the
inode is cached then the real-life performance will be maximized.  It won't
be 650k RPCs/sec, but I don't think that is achievable in most real workloads
anyway.

> > The idea was to offload this task to a lustre-provided event callback,
> > but that seems to mean we add another cpu rescheduling point that way (in
> > addition to the one described above).  Liang told me that we cannot avoid
> > the first switch since the interrupt handler cannot process the actual
> > message received, as this involves accessing and updating per-NID
> > information (credits and stuff), and if we do it on multiple CPUs (in the
> > case of ofed 1.4 and other lnds that can have multiple cpus serving
> > interrupts) that potentially means a lot of lock contention when a single
> > client's requests arrive on multiple cpus.
>
> My own belief is that most if not all performance-critical use cases
> involve many more clients than there are server CPUs - i.e. we don't lose
> by trying to keep a single client's RPCs local to 1 CPU.  Note that this
> means looking through the LND protocol level into the LNET header as early
> as possible.

Let us separate the initial handling of the request at the LNET/LND level
from the hand-off of the request structure itself to the Lustre service
thread.  If we can process the LNET-level locking/accounting in a NID/CPU-
affine manner, and all that is cross-CPU is the request buffer, maybe that is
the lowest-cost "request context switch" that is possible.

AFAIK, it is the OST service thread that is doing the initialization of the
bulk buffers, and not the LNET code, so we don't have a huge amount of data
that needs to be shipped between cores.  If we can also avoid lock ping-pong
on the request queues as requests are being assigned at the Lustre level,
that is ideal.

I think this would be possible by e.g. having multiple per-CPU request
"queuelets" (batches of requests that are handled as a unit, instead of
having per-request processing).  See the ASCII art below for reference.

The IRQ handler puts incoming requests on a CPU-affine list of some sort.
Each request is put into a CPU-affine list by NID hash to minimize peer
processing overhead (credits, etc.).  We get a list of requests that need to
be scheduled to a CPU based on the content of the message, and that
scheduling has to be done outside of the IRQ context.

The LNET code now does the receive processing (still on the same CPU) to call
the req_in handler (CPU request scheduler, possibly the very same as the NRS)
to determine which core will do the full Lustre processing of the request.
The CPU request scheduler will add these requests to one of
$num_active_cpus() _local_ queuelets (q$cpunr.$batchnr) until it is full, or
some deadline (possibly load related) is passed.  At that point the finished
queuelet is moved to the target CPU's local staging area (S$cpunr).

    IRQ handler          LNET/req_sched             OST thread
    -----------          --------------             ----------
    [request]
        |
        v
    CPU-affine list(s)
                         CPU-affine list(s)
                          |    |    |    |
                          v    v    v    v
                         q0.4 q1.3 q2.2 q3.4
                                                    S0->q0.1->Q0 (CPU 0 threads)
                                                    S0->q0.2->Q0 (CPU 0 threads)
                         q0.3 (finished) -> S0
                                                    S0->q0.3->Q0 (CPU 0 threads)
                                                    S1->q1.0->Q0 (CPU 1 threads)
                         q1.1 (finished) -> S1
                                                    S1->q1.1->Q0 (CPU 1 threads)
                         q1.2 (finished) -> S1
                                                    S1->q1.2->Q0 (CPU 1 threads)
                                                    S2->q1.1->Q0 (CPU 2 threads)
                         q2.1 (finished) -> S2
                                                    S2->q2.1->Q0 (CPU 2 threads)
                                                    S3->q3.1->Q0 (CPU 3 threads)
                         q3.2 (finished) -> S3
                                                    S3->q3.2->Q0 (CPU 3 threads)
                         q3.3 (finished) -> S3
                                                    S3->q3.3->Q0 (CPU 3 threads)

As the service threads process requests they periodically check for new
queuelets in their CPU-local staging area and move them to their local
request queue (Q$cpunr).  The requests are processed one at a time by the
CPU-local service threads, as they are today, from their request queue Q.

What is important is that the cross-CPU lock contention is limited to the
handoff of a large number of requests at a time (i.e. the whole queuelet)
instead of happening on a per-request basis, so the lock contention on the
Lustre request queue is orders of magnitude lower.  Also, since the per-CPU
service threads can remove whole queuelets from the staging area at one time,
they will not be holding this lock for any length of time, ensuring the LNET
threads are not blocked.

> > (Of course we can try to encode this information somewhere in the actual
> > message header, like the xid now, where the lnet interrupt handler can
> > access it and use it in its hash algorithm, but that way we give away a
> > lot of flexibility, so this is not the best solution, I would think.)
>
> It would be better to add an additional "hints" field to LNET messages
> which could be used for this purpose.

I'm not against this either.  I think it is a reasonable approach, but the
hints need to be independent of whatever mechanism the server is using for
scheduling.  That means a hint is not a CPU number or anything like that, but
rather e.g. a parent FID number (MDS) or an (object number XOR file offset in
GB).  We might want to have a "major" hint (e.g. parent FID, object number)
and a "minor" hint (e.g. child hash, file offset in GB) to let the server do
load balancing as appropriate.

Consider the OSS case where a large file is being read by many clients.
With NID affinity, there will essentially be completely random cross-CPU
memory accesses.  With object number + offset-in-GB affinity, the file extent
locking and memory accesses will be CPU affine, so there is minimal cross-CPU
memory access.

> > Another scenario that I have not seen discussed but that is potentially
> > pretty important for the MDS is the ability to route expected messages
> > (the ones like the rep-ack reply) to a specific cpu regardless of what
> > NID they came from.  E.g. if we rescheduled an MDS request to some CPU
> > and this is a difficult reply, we definitely want the confirmation to be
> > processed on the same cpu that sent the reply originally, since it
> > references all the locks supposedly served by that CPU, etc.  This is
> > better to happen within LNET.  I guess a similar thing might be
> > beneficial to clients too, where a reply is received on the same CPU that
> > sent the original request, in the hope that the cache is still valid and
> > the processing would be so much faster as a result.
>
> You could use a "hints" field in the LNET header for this.

These should really be "cookies" provided by the server, rather than hints
generated by the client, but the mechanism could be the same.

> These threads are required in case all normal service threads are blocking.
> I don't suppose this can be a performance-critical case, so violating CPU
> affinity for the sake of deadlock avoidance seems OK.  However, is 1 extra
> thread per CPU such a big deal?  We'll have 10s-100s of them in any case.

I agree.  Until this is shown to be a major issue I don't think it is worth
the investment of any time to fix.

> > Does anybody else have any extra thoughts for lustre-side improvements we
> > can get off this?
>
> I think we need measurements to prove/disprove whether object affinity
> trumps client affinity.

Yes, that is the critical question.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
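A compact model of the queuelet hand-off described above (invented names, not
Lustre code); the point it illustrates is that the target CPU's lock is taken
once per batch rather than once per request, and the target drains its whole
staging area in one short critical section.

    /* Toy model of the "queuelet" hand-off: requests bound for a target CPU
     * are collected into a local batch; only when the batch is full (or a
     * deadline passes) is the target CPU's staging lock taken once to splice
     * the whole batch over. */
    #include <pthread.h>
    #include <stdio.h>

    #define QUEUELET_MAX 32

    struct request {
        struct request *next;
    };

    struct queuelet {                 /* local to the scheduling CPU */
        struct request *head, *tail;
        int             count;
    };

    struct staging_area {             /* one per target CPU */
        pthread_mutex_t lock;         /* the only cross-CPU lock in this model */
        struct request *head, *tail;
    };

    /* Splice the whole local batch into the target staging area: one lock
     * round-trip no matter how many requests are in the batch. */
    static void queuelet_flush(struct queuelet *q, struct staging_area *s)
    {
        if (!q->head)
            return;

        pthread_mutex_lock(&s->lock);
        if (s->tail)
            s->tail->next = q->head;
        else
            s->head = q->head;
        s->tail = q->tail;
        pthread_mutex_unlock(&s->lock);

        q->head = q->tail = NULL;
        q->count = 0;
    }

    /* Called on the scheduling CPU for each incoming request. */
    static void queuelet_add(struct queuelet *q, struct request *req,
                             struct staging_area *target)
    {
        req->next = NULL;
        if (q->tail)
            q->tail->next = req;
        else
            q->head = req;
        q->tail = req;

        if (++q->count >= QUEUELET_MAX)
            queuelet_flush(q, target);
    }

    /* Called by a service thread on the target CPU: detach everything in one
     * short critical section, then walk the batch with no lock held. */
    static struct request *staging_drain(struct staging_area *s)
    {
        struct request *batch;

        pthread_mutex_lock(&s->lock);
        batch = s->head;
        s->head = s->tail = NULL;
        pthread_mutex_unlock(&s->lock);
        return batch;
    }

    int main(void)
    {
        static struct request reqs[70];
        struct staging_area s = { .lock = PTHREAD_MUTEX_INITIALIZER };
        struct queuelet q = { 0 };
        int n = 0;

        for (int i = 0; i < 70; i++)
            queuelet_add(&q, &reqs[i], &s);
        queuelet_flush(&q, &s);   /* deadline expired: push the partial batch */

        for (struct request *r = staging_drain(&s); r; r = r->next)
            n++;
        printf("drained %d requests\n", n);
        return 0;
    }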
Liang Zhen
2009-Jul-30 09:25 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Reply inline.....

Andreas Dilger wrote:
> Let us separate the initial handling of the request at the LNET/LND level
> from the hand-off of the request structure itself to the Lustre service
> thread.  If we can process the LNET-level locking/accounting in a NID/CPU-
> affine manner, and all that is cross-CPU is the request buffer, maybe that
> is the lowest-cost "request context switch" that is possible.
>
> AFAIK, it is the OST service thread that is doing the initialization of the
> bulk buffers, and not the LNET code, so we don't have a huge amount of data
> that needs to be shipped between cores.  If we can also avoid lock
> ping-pong on the request queues as requests are being assigned at the
> Lustre level, that is ideal.
>
> I think this would be possible by e.g. having multiple per-CPU request
> "queuelets" (batches of requests that are handled as a unit, instead of
> having per-request processing).  See the ASCII art below for reference.
>
> The IRQ handler puts incoming requests on a CPU-affine list of some sort.
> Each request is put into a CPU-affine list by NID hash to minimize peer
> processing overhead (credits, etc.).  We get a list of requests that need
> to be scheduled to a CPU based on the content of the message, and that
> scheduling has to be done outside of the IRQ context.
>
> The LNET code now does the receive processing (still on the same CPU) to
> call the req_in handler (CPU request scheduler, possibly the very same as
> the NRS) to determine which core will do the full Lustre processing of the
> request.  The CPU request scheduler will add these requests to one of
> $num_active_cpus() _local_ queuelets (q$cpunr.$batchnr) until it is full,
> or some deadline (possibly load related) is passed.  At that point the
> finished queuelet is moved to the target CPU's local staging area
> (S$cpunr).
>
>     IRQ handler          LNET/req_sched             OST thread
>     -----------          --------------             ----------
>     [request]
>         |
>         v
>     CPU-affine list(s)
>                          CPU-affine list(s)
>                           |    |    |    |
>                           v    v    v    v
>                          q0.4 q1.3 q2.2 q3.4
>                                                     S0->q0.1->Q0 (CPU 0 threads)
>                                                     S0->q0.2->Q0 (CPU 0 threads)
>                          q0.3 (finished) -> S0
>                                                     S0->q0.3->Q0 (CPU 0 threads)
>                                                     S1->q1.0->Q0 (CPU 1 threads)
>                          q1.1 (finished) -> S1
>                                                     S1->q1.1->Q0 (CPU 1 threads)
>                          q1.2 (finished) -> S1
>                                                     S1->q1.2->Q0 (CPU 1 threads)
>                                                     S2->q1.1->Q0 (CPU 2 threads)
>                          q2.1 (finished) -> S2
>                                                     S2->q2.1->Q0 (CPU 2 threads)
>                                                     S3->q3.1->Q0 (CPU 3 threads)
>                          q3.2 (finished) -> S3
>                                                     S3->q3.2->Q0 (CPU 3 threads)
>                          q3.3 (finished) -> S3
>                                                     S3->q3.3->Q0 (CPU 3 threads)
>
> As the service threads process requests they periodically check for new
> queuelets in their CPU-local staging area and move them to their local
> request queue (Q$cpunr).  The requests are processed one at a time by the
> CPU-local service threads, as they are today, from their request queue Q.

So the queuelets could be: a) pushed to the target CPU if the local CPU has
collected enough messages for that target; b) polled by the target CPU if the
target CPU is idle.  Case a) is good and can reduce contention, but for b),
if the service threads (of each CPU) poll periodically from all the other
CPUs, there could be an unnecessary delay (the poll interval) whenever those
queuelets never fill up, unless the local CPU "peeks" at the message queue on
the target CPU in the callback and posts the message there directly (instead
of to the local CPU's queuelet) when that queue is empty.  However, there
could be another problem: the "peek" is not a light operation even if it
needs no lock, because the target CPU is likely modifying its own request
queue (exclusive access), so the "peek" is already a cache sync-up.

> What is important is that the cross-CPU lock contention is limited to the
> handoff of a large number of requests at a time (i.e. the whole queuelet)
> instead of happening on a per-request basis, so the lock contention on the
> Lustre request queue is orders of magnitude lower.  Also, since the per-CPU
> service threads can remove whole queuelets from the staging area at one
> time, they will not be holding this lock for any length of time, ensuring
> the LNET threads are not blocked.
>
>>> (Of course we can try to encode this information somewhere in the actual
>>> message header, like the xid now, where the lnet interrupt handler can
>>> access it and use it in its hash algorithm, but that way we give away a
>>> lot of flexibility, so this is not the best solution, I would think.)
>>
>> It would be better to add an additional "hints" field to LNET messages
>> which could be used for this purpose.

I'm quite confused here: I think Oleg was talking about incoming requests,
but the LNet message is totally invisible in interrupt handlers, as the LNet
message is created by lnet_parse(), which is called by the LND scheduler
later (after being woken up by the interrupt handler).

> I'm not against this either.  I think it is a reasonable approach, but the
> hints need to be independent of whatever mechanism the server is using for
> scheduling.  That means a hint is not a CPU number or anything like that,
> but rather e.g. a parent FID number (MDS) or an (object number XOR file
> offset in GB).  We might want to have a "major" hint (e.g. parent FID,
> object number) and a "minor" hint (e.g. child hash, file offset in GB) to
> let the server do load balancing as appropriate.
>
> Consider the OSS case where a large file is being read by many clients.
> With NID affinity, there will essentially be completely random cross-CPU
> memory accesses.  With object number + offset-in-GB affinity, the file
> extent locking and memory accesses will be CPU affine, so there is minimal
> cross-CPU memory access.
>
>>> Another scenario that I have not seen discussed but that is potentially
>>> pretty important for the MDS is the ability to route expected messages
>>> (the ones like the rep-ack reply) to a specific cpu regardless of what
>>> NID they came from.  E.g. if we rescheduled an MDS request to some CPU
>>> and this is a difficult reply, we definitely want the confirmation to be
>>> processed on the same cpu that sent the reply originally, since it
>>> references all the locks supposedly served by that CPU, etc.  This is
>>> better to happen within LNET.  I guess a similar thing might be
>>> beneficial to clients too, where a reply is received on the same CPU that
>>> sent the original request, in the hope that the cache is still valid and
>>> the processing would be so much faster as a result.
>>
>> You could use a "hints" field in the LNET header for this.

That's about the outgoing LNet message when sending a reply; however, sending
a message still needs to go through the "connection" & "peer" of LNet and the
LND as well, and finally go out through the connection of the network stack,
all of which are bound to a CPU hashed by NID (again).  So I think unless we
create connections on all CPUs for each client, a switch like that is
unavoidable.  Actually, I think the fact is: LNet & LND are using NID
affinity for connection & peer, while Lustre is using object affinity, so we
need to switch CPU.  If we want to go through the stack without switching
CPU, then we have to cancel NID affinity in LNet, which means we need a
global peer table and a global lock to serialize; then we can schedule
send/receive on any CPU as we want, but "global" comes back again...

>>> get off this?
>>
>> I think we need measurements to prove/disprove whether object affinity
>> trumps client affinity.
>
> Yes, that is the critical question.

Totally agree....

Regards
Liang

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
Oleg Drokin
2009-Jul-30 13:35 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Hello!

On Jul 30, 2009, at 5:25 AM, Liang Zhen wrote:

>>>> Another scenario that I have not seen discussed but that is potentially
>>>> pretty important for the MDS is the ability to route expected messages
>>>> (the ones like the rep-ack reply) to a specific cpu regardless of what
>>>> NID they came from.  E.g. if we rescheduled an MDS request to some CPU
>>>> and this is a difficult reply, we definitely want the confirmation to be
>>>> processed on the same cpu that sent the reply originally, since it
>>>> references all the locks supposedly served by that CPU, etc.  This is
>>>> better to happen within LNET.  I guess a similar thing might be
>>>> beneficial to clients too, where a reply is received on the same CPU
>>>> that sent the original request, in the hope that the cache is still
>>>> valid and the processing would be so much faster as a result.
>>> You could use a "hints" field in the LNET header for this.
> That's about the outgoing LNet message when sending a reply; however,
> sending a message still needs to go through the "connection" & "peer" of
> LNet and the LND as well, and finally go out through the connection of the
> network stack, all of which are bound to a CPU hashed by NID (again).

Nothing prevents us from introducing an extra argument for the event handler,
though?

Bye,
Oleg
Liang Zhen
2009-Jul-30 13:53 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
Oleg Drokin wrote:
> Hello!
>
> On Jul 30, 2009, at 5:25 AM, Liang Zhen wrote:
>>>>> Another scenario that I have not seen discussed but that is potentially
>>>>> pretty important for the MDS is the ability to route expected messages
>>>>> (the ones like the rep-ack reply) to a specific cpu regardless of what
>>>>> NID they came from.  E.g. if we rescheduled an MDS request to some CPU
>>>>> and this is a difficult reply, we definitely want the confirmation to
>>>>> be processed on the same cpu that sent the reply originally, since it
>>>>> references all the locks supposedly served by that CPU, etc.  This is
>>>>> better to happen within LNET.  I guess a similar thing might be
>>>>> beneficial to clients too, where a reply is received on the same CPU
>>>>> that sent the original request, in the hope that the cache is still
>>>>> valid and the processing would be so much faster as a result.
>>>> You could use a "hints" field in the LNET header for this.
>> That's about the outgoing LNet message when sending a reply; however,
>> sending a message still needs to go through the "connection" & "peer" of
>> LNet and the LND as well, and finally go out through the connection of the
>> network stack, all of which are bound to a CPU hashed by NID (again).
>
> Nothing prevents us from introducing an extra argument for the event
> handler, though?

We actually don't need to do that in my branch: when we send a reply, LNet
generates a cookie (MD handle) for the reply buffer which already contains
the current CPU id, and the remote peer sends back the ACK with the same
cookie (MD handle), so the ACK matches the sending CPU id and the callback
runs on that same CPU.  So that's some work we have already done, :)

Regards
Liang

> Bye,
> Oleg
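For illustration, a sketch of the general technique Liang describes (the bit
layout here is invented, not LNet's actual MD handle/cookie format): the CPU
number is folded into the cookie when the reply buffer is posted, so the ACK
that echoes the cookie back can be steered to the same CPU without any table
lookup.

    /* Illustration only: made-up cookie layout, not LNet's real format. */
    #include <stdint.h>
    #include <stdio.h>

    #define COOKIE_CPU_BITS  6                      /* up to 64 CPUs */
    #define COOKIE_CPU_MASK  ((1ULL << COOKIE_CPU_BITS) - 1)

    /* Fold the posting CPU into the low bits of an otherwise unique cookie. */
    static uint64_t cookie_generate(unsigned int cpu, uint64_t seq)
    {
        return (seq << COOKIE_CPU_BITS) | (cpu & COOKIE_CPU_MASK);
    }

    /* Recover the CPU from the cookie echoed back in the ACK. */
    static unsigned int cookie_to_cpu(uint64_t cookie)
    {
        return (unsigned int)(cookie & COOKIE_CPU_MASK);
    }

    int main(void)
    {
        uint64_t cookie = cookie_generate(5, 123456); /* reply posted on CPU 5 */

        /* ...cookie travels with the reply; the peer echoes it in the ACK... */
        printf("ACK handled on CPU %u\n", cookie_to_cpu(cookie));
        return 0;
    }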
Andreas Dilger
2009-Jul-30 23:19 UTC
[Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
On Jul 30, 2009 17:25 +0800, Liang Zhen wrote:
> Andreas Dilger wrote:
>> The IRQ handler puts incoming requests on a CPU-affine list of some sort.
>> Each request is put into a CPU-affine list by NID hash to minimize peer
>> processing overhead (credits, etc.).  We get a list of requests that need
>> to be scheduled to a CPU based on the content of the message, and that
>> scheduling has to be done outside of the IRQ context.
>>
>> The LNET code now does the receive processing (still on the same CPU) to
>> call the req_in handler (CPU request scheduler, possibly the very same as
>> the NRS) to determine which core will do the full Lustre processing of the
>> request.  The CPU request scheduler will add these requests to one of
>> $num_active_cpus() _local_ queuelets (q$cpunr.$batchnr) until it is full,
>> or some deadline (possibly load related) is passed.  At that point the
>> finished queuelet is moved to the target CPU's local staging area
>> (S$cpunr).

Note also that some kinds of replies (OBD_PING, for example) could be
completed entirely by ptlrpc_server_handle_req_in() without invoking any
context switching.

>> As the service threads process requests they periodically check for new
>> queuelets in their CPU-local staging area and move them to their local
>> request queue (Q$cpunr).  The requests are processed one at a time by the
>> CPU-local service threads, as they are today, from their request queue Q.
>
> So the queuelets could be: a) pushed to the target CPU if the local CPU has
> collected enough messages for that target; b) polled by the target CPU if
> the target CPU is idle.  Case a) is good and can reduce contention, but for
> b), if the service threads (of each CPU) poll periodically from all the
> other CPUs, there could be an unnecessary delay (the poll interval)
> whenever those queuelets never fill up, unless the local CPU "peeks" at the
> message queue on the target CPU in the callback and posts the message there
> directly (instead of to the local CPU's queuelet) when that queue is empty.
> However, there could be another problem: the "peek" is not a light
> operation even if it needs no lock, because the target CPU is likely
> modifying its own request queue (exclusive access), so the "peek" is
> already a cache sync-up.

I don't think ALL service threads would necessarily poll for queuelets.  As
you suggest, any polling would be lightweight.  We might not have polling at
all, however.  The LNET code could make a decision based on message arrival
rate, whether there are other unhandled queuelets in the staging list, and a
maximum delay (deadline).  That said, if the service threads are idle and
there are requests to be processed then some lock contention is acceptable,
since the system cannot be too busy at that time.  That wouldn't have to be
polling, but rather a wakeup of a single thread waiting on the request queue.

>>>> (Of course we can try to encode this information somewhere in the actual
>>>> message header, like the xid now, where the lnet interrupt handler can
>>>> access it and use it in its hash algorithm, but that way we give away a
>>>> lot of flexibility, so this is not the best solution, I would think.)
>>>
>>> It would be better to add an additional "hints" field to LNET messages
>>> which could be used for this purpose.
>
> I'm quite confused here: I think Oleg was talking about incoming requests,
> but the LNet message is totally invisible in interrupt handlers, as the
> LNet message is created by lnet_parse(), which is called by the LND
> scheduler later (after being woken up by the interrupt handler).

Would the hints have to be down in the LND-specific headers?  In any case,
something that can be accessed as easily as the NID.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
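A sketch of the OBD_PING fast path mentioned above (invented names, not the
actual ptlrpc_server_handle_req_in() code): a trivial request carries no
per-object state worth scheduling for, so it can be answered inline in the
request-in handler, on whatever CPU LNET already chose, and never incur the
hand-off to a service thread.

    /* Sketch only, with invented names and opcodes. */
    enum req_opcode { REQ_OBD_PING, REQ_OTHER };    /* illustrative opcodes */

    struct request {
        enum req_opcode opcode;
        /* ... unpacked header, reply buffer, etc. ... */
    };

    static void send_empty_reply(struct request *req)    { (void)req; /* ... */ }
    static void enqueue_for_service(struct request *req) { (void)req; /* ... */ }

    /* Runs in the incoming-request callback, before any queueing decision. */
    static void handle_req_in(struct request *req)
    {
        if (req->opcode == REQ_OBD_PING) {
            send_empty_reply(req);       /* answered inline, no context switch */
            return;
        }
        enqueue_for_service(req);        /* everything else goes to a thread */
    }

    int main(void)
    {
        struct request ping = { .opcode = REQ_OBD_PING };

        handle_req_in(&ping);
        return 0;
    }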