Hi,

I've been discussing how to improve LNET performance on multi-core
nodes with the Lustre networking team. The current code uses a single
spinlock for almost everything - this has the advantage of simplicity
but at the obvious expense of serialising everything.

Where usage is simply to safeguard global table handling (e.g. to add
an MD to the global MD hash table), we're inclined to stick with the
single global lock. In such cases, it's not clear concurrency gains
will justify additional code complication. So we're really focussing
on the 2 main issues where the global lock can potentially be held
for extended periods.

1. ME matching.

When an LNET GET or PUT arrives, it must be matched against an ME
queued on its destination portal. There can potentially be many
thousands of these and they can potentially be searched sequentially
with the global lock held.

In fact, Lustre uses MEs in only 2 ways - (a) match-any for incoming
RPC requests and (b) match-unique for everything else. (a) is
effectively the only place Lustre does actual message queueing - (b)
is used for everything else (i.e. bulk data and RPC replies) and I
think of it as an RDMA, where the matchbits+peerid is the RDMA
descriptor.

(a) always matches the first entry in the list of MEs, so even though
we post 1,000s of request buffers on a server, we never search the
whole list. (b) can potentially lead to a full list search.

The change we're considering is to detect when a portal is used
exclusively for match-unique MEs (situation (b) - we already use
different portals for (a) and (b)) and match using a hash table
rather than a list search.

2. EQ callbacks.

EQ callbacks are currently serialised by the single global LNET lock.
Holding the lock for the duration of the callback greatly simplifies
the LNET event handling code, but unfortunately is open to abuse by
the handler itself (see below).

The change we're considering is to use one lock per EQ so that we
get better concurrency by using many EQs. This avoids complicating
the existing EQ locking code, but it does require Lustre changes.
However making Lustre use a pool of EQs (say 1 per CPU) should be a
very simple and self-contained change.

I'd appreciate any feedback on these suggested changes.

--

While I'm talking about EQ callbacks, I _do_ think there is still a
need to restructure some of Lustre's event handlers. EQ callbacks are
meant to provide notification and nothing else. Originally they could
even be called in interrupt context, so all you are supposed to do in
them is update status and schedule anything significant for later.
Nowadays, EQ callbacks are only called in thread context, but the
general guidance on doing very little apart from notification
remains.

Except things are never quite as black-and-white as that. For
example, request_out_callback() has always called
ptlrpc_req_finished() - and whereas this most usually decrements a
refcount and maybe frees the request, during shutdown this can
actually recurse into a complete import cleanup.

This blatant flouting of the EQ callback rules has never been fixed
since it didn't actually break anything and didn't hurt performance
during normal operation. However problems like this can be compounded
as new features are developed (e.g. additional code in these
callbacks to support secure ptlrpc). So I think it's time to review
what can happen inside the lustre event handlers and consider what
might need to be restructured.
Even with improved EQ handler concurrency, you're still tying down an
LNET thread for the duration of the EQ callback with possible
unforeseen consequences. For example, some LNDs use a pool of worker
threads - one thread per CPU. If the LND assigns particular
connections to particular worker threads (e.g. socklnd), none of
these connections can make progress while the worker thread is
executing the event callback handler.

Cheers,
    Eric
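[To make the hash-table idea in (1) concrete, below is a minimal,
self-contained sketch of hashing match-unique MEs by match bits and
peer NID instead of walking a list. The names (struct me_sketch,
me_hash_key, me_lookup) and the bucket count are hypothetical
illustrations, not existing LNET code, and locking is omitted.]

#include <stdint.h>
#include <stddef.h>

#define ME_HASH_BUCKETS 256U            /* assumption: power of two */

struct me_sketch {
        uint64_t          match_bits;   /* fully specified: ignore_bits == 0 */
        uint64_t          peer_nid;
        struct me_sketch *next;         /* bucket chain */
};

static struct me_sketch *me_hash[ME_HASH_BUCKETS];

static unsigned int me_hash_key(uint64_t match_bits, uint64_t peer_nid)
{
        /* any reasonable mix of matchbits and peer NID will do */
        uint64_t h = match_bits ^ (peer_nid * 0x9e3779b97f4a7c15ULL);

        return (unsigned int)(h & (ME_HASH_BUCKETS - 1));
}

/* Replaces the sequential search on a portal that holds only
 * match-unique MEs; match-any portals keep the existing list walk. */
static struct me_sketch *me_lookup(uint64_t match_bits, uint64_t peer_nid)
{
        struct me_sketch *me;

        for (me = me_hash[me_hash_key(match_bits, peer_nid)];
             me != NULL; me = me->next) {
                if (me->match_bits == match_bits && me->peer_nid == peer_nid)
                        return me;
        }
        return NULL;
}

[Insertion would add the ME to its bucket under the portal lock;
whether the peer NID even needs to be part of the key depends on
whether the match bits alone are unique, which is touched on in the
replies below.]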
...snip...

> 2. EQ callbacks.
>
> EQ callbacks are currently serialised by the single global LNET
> lock. Holding the lock for the duration of the callback greatly
> simplifies the LNET event handling code, but unfortunately is open
> to abuse by the handler itself (see below).
>
> The change we're considering is to use one lock per EQ so that we
> get better concurrency by using many EQs. This avoids complicating
> the existing EQ locking code, but it does require Lustre changes.
> However making Lustre use a pool of EQs (say 1 per CPU) should be a
> very simple and self-contained change.

This doesn't sound so attractive. Isn't it possible to hide this
under the LNET API?

Peter
Eric Barton writes:
> Hi,
> [...]
> I'd appreciate any feedback on these suggested changes.

What about starting with a single lock for callbacks, _different_
from the lock protecting ME matching? Also, what exactly does the
callback lock protect? Maybe it can be replaced with a read-write
lock?

        Nikita.
Eric,

Reply inline...

Eric Barton wrote:
> The change we're considering is to detect when a portal is used
> exclusively for match-unique MEs (situation (b) - we already use
> different portals for (a) and (b)) and match using a hash table
> rather than a list search.

If we can always ignore the "ignore_bits" of the ME (never used by
Lustre), we can hash MEs by match_bits; otherwise we can only hash
the NID of the peer, which seems less reasonable to me.

> The change we're considering is to use one lock per EQ so that we
> get better concurrency by using many EQs. This avoids complicating
> the existing EQ locking code, but it does require Lustre changes.
> However making Lustre use a pool of EQs (say 1 per CPU) should be a
> very simple and self-contained change.

Isaac and I discussed this and we think:

1. We can create an array of locks for each EQ (for example, NCPUs
locks per EQ), and hash each MD (e.g. by handle cookie) to one of
these locks. That gives concurrent eq_callbacks without losing the
order of events for each MD, and the upper layers wouldn't see any
change.

2. We change ptlrpc_master_callback so that it makes a copy of the
event, queues it in a FIFO and returns; another thread processes the
events in this FIFO and runs the callbacks one by one. That
guarantees event order and calls the real callbacks without holding
lnet_lock.
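[For illustration, here is a minimal userspace sketch of suggestion 1
above: an array of locks per EQ, with the lock chosen by hashing the
MD's handle cookie, so that events for one MD stay ordered while
callbacks for different MDs can run concurrently. The types and names
are hypothetical and pthread mutexes stand in for the kernel locks.]

#include <pthread.h>
#include <stdint.h>

#define NR_CB_LOCKS 8                     /* assumption: e.g. NCPUs */

struct event_sketch {
        uint64_t md_cookie;               /* identifies the MD */
        int      status;
};

struct eq_sketch {
        /* initialised with pthread_mutex_init() when the EQ is allocated */
        pthread_mutex_t cb_locks[NR_CB_LOCKS];
        void          (*callback)(struct event_sketch *ev);
};

/* Deliver one event: take only the lock slot for this MD, so
 * callbacks for different MDs on the same EQ may run in parallel
 * while callbacks for any one MD remain serialised and in order. */
static void eq_deliver(struct eq_sketch *eq, struct event_sketch *ev)
{
        pthread_mutex_t *lock = &eq->cb_locks[ev->md_cookie % NR_CB_LOCKS];

        pthread_mutex_lock(lock);
        eq->callback(ev);
        pthread_mutex_unlock(lock);
}

[As Eric notes further down, the current code additionally guarantees
that the callback and any auto-unlink complete before LNetEQPoll()
can return, so a real implementation would also have to preserve, or
deliberately relax, that guarantee.]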
Liang Zhen wrote:
> 2. We change ptlrpc_master_callback so that it makes a copy of the
> event, queues it in a FIFO and returns; another thread processes
> the events in this FIFO and runs the callbacks one by one. That
> guarantees event order and calls the real callbacks without holding
> lnet_lock.

We could even have an eq_callback_thread (or a pool of threads)
inside LNet: lnet_enq_event_locked() would enqueue the event and wake
up the callback thread, so we wouldn't need to change ptlrpc at all.
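[A rough userspace sketch of that queued-event idea, just to show its
shape - the FIFO, the enqueue-and-wake step, and a callback thread
draining events in order with no global lock held. All names are
hypothetical and pthread primitives stand in for the kernel ones.]

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct ev_node {
        struct ev_node *next;
        uint64_t        md_cookie;      /* event payload, simplified */
        int             status;
};

static struct ev_node  *ev_head, *ev_tail;
static pthread_mutex_t  ev_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   ev_cond = PTHREAD_COND_INITIALIZER;
static void           (*real_callback)(struct ev_node *ev);  /* set at init */

/* Where lnet_enq_event_locked() would run today: append a copy of
 * the event to the FIFO and wake the callback thread. */
static void enq_event(struct ev_node *node)
{
        pthread_mutex_lock(&ev_lock);
        node->next = NULL;
        if (ev_tail != NULL)
                ev_tail->next = node;
        else
                ev_head = node;
        ev_tail = node;
        pthread_cond_signal(&ev_cond);
        pthread_mutex_unlock(&ev_lock);
}

/* The callback thread: drain events in order and run the real
 * callbacks with no global LNET lock held. */
static void *eq_callback_thread(void *arg)
{
        (void)arg;

        for (;;) {
                struct ev_node *node;

                pthread_mutex_lock(&ev_lock);
                while (ev_head == NULL)
                        pthread_cond_wait(&ev_cond, &ev_lock);
                node = ev_head;
                ev_head = node->next;
                if (ev_head == NULL)
                        ev_tail = NULL;
                pthread_mutex_unlock(&ev_lock);

                real_callback(node);
                free(node);
        }
        return NULL;
}

[The replies below weigh this shape against the extra context switch
and latency it adds per event.]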
Folks,

> Liang Zhen wrote:
>> 2. We change ptlrpc_master_callback so that it makes a copy of the
>> event, queues it in a FIFO and returns; another thread processes
>> the events in this FIFO and runs the callbacks one by one. That
>> guarantees event order and calls the real callbacks without holding
>> lnet_lock.
>
> We could even have an eq_callback_thread (or a pool of threads)
> inside LNet: lnet_enq_event_locked() would enqueue the event and
> wake up the callback thread, so we wouldn't need to change ptlrpc
> at all.

I dislike the idea of introducing any additional callback-devoted
threads because 1) it would spoil the original design of callbacks as
light-weight notifications and 2) introduce additional latency. I'd
prefer to see per-MD locks (or a per-EQ array of locks, that's quite
equivalent) to serialize calling the callbacks associated with any
particular MD. This approach looks more natural and "right" than
inventing callback-threads.

Sincerely,
Maxim
I think it's a requirement that software using the APIs is not aware
of this (i.e. no significant changes in ptlrpc). As long as the
solution is compatible with that, this looks fine.

Peter

On 8/13/08 5:04 AM, "Maxim V. Patlasov" <Maxim.Patlasov at Sun.COM>
wrote:

> I dislike the idea of introducing any additional callback-devoted
> threads because 1) it would spoil the original design of callbacks
> as light-weight notifications and 2) introduce additional latency.
> I'd prefer to see per-MD locks (or a per-EQ array of locks, that's
> quite equivalent) to serialize calling the callbacks associated
> with any particular MD. This approach looks more natural and
> "right" than inventing callback-threads.
Thank you for all your feedback.

--

Braam wrote...

> > The change we're considering is to use one lock per EQ so that we
> > get better concurrency by using many EQs. This avoids
> > complicating the existing EQ locking code, but it does require
> > Lustre changes. However making Lustre use a pool of EQs (say 1
> > per CPU) should be a very simple and self-contained change.
>
> This doesn't sound so attractive. Isn't it possible to hide this
> under the LNET API?

Indeed, but it's not so simple - see Liang/Isaac's suggestion and my
comment below.

--

Nikita wrote...

> What about starting with a single lock for callbacks, _different_
> from the lock protecting ME matching? Also, what exactly does the
> callback lock protect? Maybe it can be replaced with a read-write
> lock?

Indeed - that would allow us to determine whether we really need to
work further on EQ callback concurrency.

--

Liang Zhen wrote...

> > The change we're considering is to detect when a portal is used
> > exclusively for match-unique MEs (situation (b) - we already use
> > different portals for (a) and (b)) and match using a hash table
> > rather than a list search.
>
> If we can always ignore the "ignore_bits" of the ME (never used by
> Lustre), we can hash MEs by match_bits; otherwise we can only hash
> the NID of the peer, which seems less reasonable to me.

The "ignore_bits" parameter _is_ used by lustre. The 2 usages I
mentioned were "match any", where the peer ID is don't-care and
ignore_bits is -1, and "match unique", where the peer ID is fully
specified and ignore_bits is 0.

> Isaac and I discussed this and we think:
> 1. We can create an array of locks for each EQ (for example, NCPUs
> locks per EQ), and hash each MD (e.g. by handle cookie) to one of
> these locks. That gives concurrent eq_callbacks without losing the
> order of events for each MD, and the upper layers wouldn't see any
> change.

Yes, this ensures callbacks on each MD remain ordered - however the
current code also guarantees that the callback and any MD
auto-unlinking completes before LNetEQPoll() can return. We have to
verify that relaxing ordering here is OK or else do some similar
lock-hashing, say on the EQ slot.

> We could even have an eq_callback_thread (or a pool of threads)
> inside LNet: lnet_enq_event_locked() would enqueue the event and
> wake up the callback thread, so we wouldn't need to change ptlrpc
> at all.

That adds unnecessary context switching. EQ callbacks may happen in
the context either of the thread doing a PUT or GET (PUT buffered
immediately or you're using the lolnd), or more normally, of an LND
worker thread. That's plenty of potential concurrency we can exploit.

Cheers,
    Eric
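[To make the ignore_bits point concrete: the Portals-style matching
rule, as I understand it (the peer ID check is separate and omitted
here), only compares the bits not covered by the ME's ignore mask, so
ignore_bits = -1 matches any bits and ignore_bits = 0 demands an
exact match. A tiny self-contained sketch:]

#include <stdint.h>

/* Bits covered by ignore_bits are not compared.
 * ignore_bits == ~0ULL  -> match-any (incoming RPC request buffers)
 * ignore_bits == 0      -> match-unique (bulk data, RPC replies)    */
static int me_matches(uint64_t me_match_bits, uint64_t me_ignore_bits,
                      uint64_t incoming_bits)
{
        return ((me_match_bits ^ incoming_bits) & ~me_ignore_bits) == 0;
}

[Only the ignore_bits == 0 case lends itself to the hash lookup
sketched earlier, which is why the proposal hinges on detecting
portals that carry match-unique MEs exclusively.]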
Eric Barton wrote:
>> Isaac and I discussed this and we think:
>> 1. We can create an array of locks for each EQ (for example, NCPUs
>> locks per EQ), and hash each MD (e.g. by handle cookie) to one of
>> these locks. That gives concurrent eq_callbacks without losing the
>> order of events for each MD, and the upper layers wouldn't see any
>> change.
>
> Yes, this ensures callbacks on each MD remain ordered - however the
> current code also guarantees that the callback and any MD
> auto-unlinking completes before LNetEQPoll() can return. We have to
> verify that relaxing ordering here is OK or else do some similar
> lock-hashing, say on the EQ slot.

Yes, then we would need two lock-tables just for serializing
eq_callback and LNetEQPoll, and I think that part of the code would
be complex while the only benefit is concurrency of eq_callback.
After thinking it over again, I prefer a design like this:

1. Have a different hash table for each kind of LNet descriptor
handle.

2. lnet_eq_lock protects the EQ-handle table and all EQ related
operations.

3. lnet_portal_lock protects the ME-handle table and all ME related
operations; attach/detach/match of MDs needs this lock too.

4. lnet_md_lock (a lock-table): each slot of the lock-table protects
one MD-handle table (MDs are hashed into the different MD-tables by
handle cookie). It also protects all MD operations in that MD-table,
and eq_callbacks are serialized by it as well.

5. It's legal to hold lnet_md_lock and then take lnet_eq_lock or
lnet_portal_lock.

6. lnet_lock protects the rest of LNet (peer table, router buffers,
credits...).

Now let's see some examples:

1. LNetMDBind: only needs lnet_md_lock(i) to insert into the
MD-table.

2. LNetMDAttach: needs lnet_md_lock(i) to insert into the MD-table,
then lnet_portal_lock() to attach to the ME.

3. LNetPut: needs lnet_md_lock(i) to find the MD, then lnet_lock()
for the rest of the operations.

4. lnet_parse_put: needs lnet_portal_lock() for matching.

5. lnet_finalize: needs lnet_md_lock(i) (and maybe lnet_portal_lock
for a short while if the MD is on a portal) to check the status of
the MD and serialize the callback and md_unlink, lnet_eq_lock to
enqueue the event, then lnet_lock to return credits etc.

6. LNetEQPoll: needs lnet_eq_lock.

Although this adds more new locks, it's natural to have different
locks for different objects, and everything still follows the
original logic of LNet. The implementation is very straightforward -
most of the time we just need to replace LNET_LOCK with the new lock
name. All LNet interfaces can benefit from a change like this: we
will not only have better concurrency of eq_callback, we will also
have concurrency of all the APIs.

(If we want, we could even have an individual lock for each portal
based on this design (save the portal index in the ME handle),
although I don't think it's really necessary...)
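[As a way of visualising point 4 of the design above, here is a
hedged userspace sketch of a lock-table keyed by MD handle cookie.
The names and pthread mutexes are stand-ins (assumptions), not actual
LNet code.]

#include <pthread.h>
#include <stdint.h>

#define LNET_MD_LOCK_SLOTS 64    /* assumption: e.g. a multiple of NCPUs */

struct md_sketch {
        uint64_t          cookie;        /* handle cookie */
        struct md_sketch *next;
};

struct md_table_slot {
        /* initialised with pthread_mutex_init() at startup;
         * plays the role of "lnet_md_lock(i)" in the text */
        pthread_mutex_t   lock;
        struct md_sketch *hash_head;     /* MD-handle table for this slot */
};

static struct md_table_slot md_slots[LNET_MD_LOCK_SLOTS];

/* Hash an MD handle cookie to its slot; the slot's lock then covers
 * the MD-handle table, all MD operations for MDs in that table, and
 * serialisation of their eq_callbacks. */
static struct md_table_slot *lnet_md_slot(uint64_t cookie)
{
        return &md_slots[(unsigned int)(cookie % LNET_MD_LOCK_SLOTS)];
}

[LNetMDBind would then take only the slot lock to insert the MD
(example 1 above), while lnet_finalize would take it first and then
lnet_eq_lock, consistent with the ordering rule in point 5.]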