Hi,

I've been discussing how to improve LNET performance on multi-core
nodes with the Lustre networking team. The current code uses a single
spinlock for almost everything - this has the advantage of simplicity
but at the obvious expense of serialising everything.

Where usage is simply to safeguard global table handling (e.g. to add
an MD to the global MD hash table), we're inclined to stick with the
single global lock. In such cases, it's not clear concurrency gains
will justify additional code complication. So we're really focussing
on the 2 main issues where the global lock can potentially be held
for extended periods.

1. ME matching.

When an LNET GET or PUT arrives, it must be matched against an ME
queued on its destination portal. There can potentially be many
thousands of these and they can potentially be searched sequentially
with the global lock held.

In fact, Lustre uses MEs in only 2 ways - (a) match-any for incoming
RPC requests and (b) match-unique for everything else. (a) is
effectively the only place Lustre does actual message queueing - (b)
is used for everything else (i.e. bulk data and RPC replies) and I
think of it as an RDMA, where the matchbits+peerid is the RDMA
descriptor.

(a) always matches the first entry in the list of MEs, so even though
we post 1,000s of request buffers on a server, we never search the
whole list. (b) can potentially lead to a full list search.

The change we're considering is to detect when a portal is used
exclusively for match-unique MEs (situation (b) - we already use
different portals for (a) and (b)) and match using a hash table
rather than a list search.

2. EQ callbacks.

EQ callbacks are currently serialised by the single global LNET lock.
Holding the lock for the duration of the callback greatly simplifies
the LNET event handling code, but unfortunately is open to abuse by
the handler itself (see below).

The change we're considering is to use one lock per EQ so that we
get better concurrency by using many EQs. This avoids complicating
the existing EQ locking code, but it does require Lustre changes.
However making Lustre use a pool of EQs (say 1 per CPU) should be a
very simple and self-contained change.

I'd appreciate any feedback on these suggested changes.

--

While I'm talking about EQ callbacks, I _do_ think there is still a
need to restructure some of Lustre's event handlers. EQ callbacks are
meant to provide notification and nothing else. Originally they could
even be called in interrupt context, so all you are supposed to do in
them is update status and schedule anything significant for later.
Nowadays, EQ callbacks are only called in thread context, but the
general guidance on doing very little apart from notification
remains.

Except things are never quite as black-and-white as that. For
example, request_out_callback() has always called
ptlrpc_req_finished() - and whereas this most usually decrements a
refcount and maybe frees the request, during shutdown this can
actually recurse into a complete import cleanup.

This blatant flouting of the EQ callback rules has never been fixed
since it didn't actually break anything and didn't hurt performance
during normal operation. However problems like this can be compounded
as new features are developed (e.g. additional code in these
callbacks to support secure ptlrpc). So I think it's time to review
what can happen inside the lustre event handlers and consider what
might need to be restructured.
Even with improved EQ handler concurrency, you're still tying down an
LNET thread for the duration of the EQ callback with possible
unforeseen consequences. For example, some LNDs use a pool of worker
threads - one thread per CPU. If the LND assigns particular
connections to particular worker threads (e.g. socklnd), none of
these connections can make progress while the worker thread is
executing the event callback handler.

Cheers,
    Eric
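[To make the hash-table idea in (1) concrete, below is a minimal,
self-contained sketch of hashing match-unique MEs by match bits and
peer NID instead of walking a list. The names (struct me_sketch,
me_hash_key, me_lookup) and the bucket count are hypothetical
illustrations, not existing LNET code, and locking is omitted.]

#include <stdint.h>
#include <stddef.h>

#define ME_HASH_BUCKETS 256U            /* assumption: power of two */

struct me_sketch {
        uint64_t          match_bits;   /* fully specified: ignore_bits == 0 */
        uint64_t          peer_nid;
        struct me_sketch *next;         /* bucket chain */
};

static struct me_sketch *me_hash[ME_HASH_BUCKETS];

static unsigned int me_hash_key(uint64_t match_bits, uint64_t peer_nid)
{
        /* any reasonable mix of matchbits and peer NID will do */
        uint64_t h = match_bits ^ (peer_nid * 0x9e3779b97f4a7c15ULL);

        return (unsigned int)(h & (ME_HASH_BUCKETS - 1));
}

/* Replaces the sequential search on a portal that holds only
 * match-unique MEs; match-any portals keep the existing list walk. */
static struct me_sketch *me_lookup(uint64_t match_bits, uint64_t peer_nid)
{
        struct me_sketch *me;

        for (me = me_hash[me_hash_key(match_bits, peer_nid)];
             me != NULL; me = me->next) {
                if (me->match_bits == match_bits && me->peer_nid == peer_nid)
                        return me;
        }
        return NULL;
}

[Insertion would add the ME to its bucket under the portal lock;
whether the peer NID even needs to be part of the key depends on
whether the match bits alone are unique, which is touched on in the
replies below.]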
...snip...

> 2. EQ callbacks.
>
> EQ callbacks are currently serialised by the single global LNET
> lock. Holding the lock for the duration of the callback greatly
> simplifies the LNET event handling code, but unfortunately is open
> to abuse by the handler itself (see below).
>
> The change we're considering is to use one lock per EQ so that we
> get better concurrency by using many EQs. This avoids complicating
> the existing EQ locking code, but it does require Lustre changes.
> However making Lustre use a pool of EQs (say 1 per CPU) should be a
> very simple and self-contained change.

This doesn't sound so attractive. Isn't it possible to hide this
under the LNET API?

Peter
Eric Barton writes:
> Hi,
> [...]
> I'd appreciate any feedback on these suggested changes.

What about starting with a single lock for callbacks, _different_
from the lock protecting ME matching? Also, what exactly does the
callback lock protect? Maybe it can be replaced with a read-write
lock?

        Nikita.
Eric,

Reply inline...

Eric Barton wrote:
> The change we're considering is to detect when a portal is used
> exclusively for match-unique MEs (situation (b) - we already use
> different portals for (a) and (b)) and match using a hash table
> rather than a list search.

If we can always ignore the "ignore_bits" of the ME (never used by
Lustre), we can hash MEs by match_bits; otherwise we can only hash
the NID of the peer, which seems less reasonable to me.

> The change we're considering is to use one lock per EQ so that we
> get better concurrency by using many EQs. This avoids complicating
> the existing EQ locking code, but it does require Lustre changes.
> However making Lustre use a pool of EQs (say 1 per CPU) should be a
> very simple and self-contained change.

Isaac and I discussed this and we think:

1. We can create an array of locks for each EQ (for example, NCPUs
locks per EQ), and hash each MD (e.g. by handle cookie) to one of
these locks. That gives concurrent eq_callbacks without losing the
order of events for each MD, and the upper layers wouldn't see any
change.

2. We change ptlrpc_master_callback so that it makes a copy of the
event, queues it in a FIFO and returns; another thread processes the
events in this FIFO and runs the callbacks one by one. That
guarantees event order and calls the real callbacks without holding
lnet_lock.
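[For illustration, here is a minimal userspace sketch of suggestion 1
above: an array of locks per EQ, with the lock chosen by hashing the
MD's handle cookie, so that events for one MD stay ordered while
callbacks for different MDs can run concurrently. The types and names
are hypothetical and pthread mutexes stand in for the kernel locks.]

#include <pthread.h>
#include <stdint.h>

#define NR_CB_LOCKS 8                     /* assumption: e.g. NCPUs */

struct event_sketch {
        uint64_t md_cookie;               /* identifies the MD */
        int      status;
};

struct eq_sketch {
        /* initialised with pthread_mutex_init() when the EQ is allocated */
        pthread_mutex_t cb_locks[NR_CB_LOCKS];
        void          (*callback)(struct event_sketch *ev);
};

/* Deliver one event: take only the lock slot for this MD, so
 * callbacks for different MDs on the same EQ may run in parallel
 * while callbacks for any one MD remain serialised and in order. */
static void eq_deliver(struct eq_sketch *eq, struct event_sketch *ev)
{
        pthread_mutex_t *lock = &eq->cb_locks[ev->md_cookie % NR_CB_LOCKS];

        pthread_mutex_lock(lock);
        eq->callback(ev);
        pthread_mutex_unlock(lock);
}

[As Eric notes further down, the current code additionally guarantees
that the callback and any auto-unlink complete before LNetEQPoll()
can return, so a real implementation would also have to preserve, or
deliberately relax, that guarantee.]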
Liang Zhen wrote:
> 2. We change ptlrpc_master_callback so that it makes a copy of the
> event, queues it in a FIFO and returns; another thread processes
> the events in this FIFO and runs the callbacks one by one. That
> guarantees event order and calls the real callbacks without holding
> lnet_lock.

We could even have an eq_callback_thread (or a pool of threads)
inside LNet: lnet_enq_event_locked() would enqueue the event and wake
up the callback thread, so we wouldn't need to change ptlrpc at all.
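[A rough userspace sketch of that queued-event idea, just to show its
shape - the FIFO, the enqueue-and-wake step, and a callback thread
draining events in order with no global lock held. All names are
hypothetical and pthread primitives stand in for the kernel ones.]

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct ev_node {
        struct ev_node *next;
        uint64_t        md_cookie;      /* event payload, simplified */
        int             status;
};

static struct ev_node  *ev_head, *ev_tail;
static pthread_mutex_t  ev_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   ev_cond = PTHREAD_COND_INITIALIZER;
static void           (*real_callback)(struct ev_node *ev);  /* set at init */

/* Where lnet_enq_event_locked() would run today: append a copy of
 * the event to the FIFO and wake the callback thread. */
static void enq_event(struct ev_node *node)
{
        pthread_mutex_lock(&ev_lock);
        node->next = NULL;
        if (ev_tail != NULL)
                ev_tail->next = node;
        else
                ev_head = node;
        ev_tail = node;
        pthread_cond_signal(&ev_cond);
        pthread_mutex_unlock(&ev_lock);
}

/* The callback thread: drain events in order and run the real
 * callbacks with no global LNET lock held. */
static void *eq_callback_thread(void *arg)
{
        (void)arg;

        for (;;) {
                struct ev_node *node;

                pthread_mutex_lock(&ev_lock);
                while (ev_head == NULL)
                        pthread_cond_wait(&ev_cond, &ev_lock);
                node = ev_head;
                ev_head = node->next;
                if (ev_head == NULL)
                        ev_tail = NULL;
                pthread_mutex_unlock(&ev_lock);

                real_callback(node);
                free(node);
        }
        return NULL;
}

[The replies below weigh this shape against the extra context switch
and latency it adds per event.]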
Folks,

> Liang Zhen wrote:
>> 2. We change ptlrpc_master_callback so that it makes a copy of the
>> event, queues it in a FIFO and returns; another thread processes
>> the events in this FIFO and runs the callbacks one by one. That
>> guarantees event order and calls the real callbacks without holding
>> lnet_lock.
>
> We could even have an eq_callback_thread (or a pool of threads)
> inside LNet: lnet_enq_event_locked() would enqueue the event and
> wake up the callback thread, so we wouldn't need to change ptlrpc
> at all.

I dislike the idea of introducing any additional callback-devoted
threads because 1) it would spoil the original design of callbacks as
light-weight notifications and 2) introduce additional latency. I'd
prefer to see per-MD locks (or a per-EQ array of locks, that's quite
equivalent) to serialize calling the callbacks associated with any
particular MD. This approach looks more natural and "right" than
inventing callback-threads.

Sincerely,
Maxim
I think it's a requirement that software using the APIs is not aware
of this (i.e. no significant changes in ptlrpc). As long as the
solution is compatible with that, this looks fine.

Peter

On 8/13/08 5:04 AM, "Maxim V. Patlasov" <Maxim.Patlasov at Sun.COM>
wrote:

> I dislike the idea of introducing any additional callback-devoted
> threads because 1) it would spoil the original design of callbacks
> as light-weight notifications and 2) introduce additional latency.
> I'd prefer to see per-MD locks (or a per-EQ array of locks, that's
> quite equivalent) to serialize calling the callbacks associated
> with any particular MD. This approach looks more natural and
> "right" than inventing callback-threads.
Thank you for all your feedback.

--

Braam wrote...

> > The change we're considering is to use one lock per EQ so that we
> > get better concurrency by using many EQs. This avoids
> > complicating the existing EQ locking code, but it does require
> > Lustre changes. However making Lustre use a pool of EQs (say 1
> > per CPU) should be a very simple and self-contained change.
>
> This doesn't sound so attractive. Isn't it possible to hide this
> under the LNET API?

Indeed, but it's not so simple - see Liang/Isaac's suggestion and my
comment below.

--

Nikita wrote...

> What about starting with a single lock for callbacks, _different_
> from the lock protecting ME matching? Also, what exactly does the
> callback lock protect? Maybe it can be replaced with a read-write
> lock?

Indeed - that would allow us to determine whether we really need to
work further on EQ callback concurrency.

--

Liang Zhen wrote...

> > The change we're considering is to detect when a portal is used
> > exclusively for match-unique MEs (situation (b) - we already use
> > different portals for (a) and (b)) and match using a hash table
> > rather than a list search.
>
> If we can always ignore the "ignore_bits" of the ME (never used by
> Lustre), we can hash MEs by match_bits; otherwise we can only hash
> the NID of the peer, which seems less reasonable to me.

The "ignore_bits" parameter _is_ used by lustre. The 2 usages I
mentioned were "match any", where the peer ID is don't-care and
ignore_bits is -1, and "match unique", where the peer ID is fully
specified and ignore_bits is 0.

> Isaac and I discussed this and we think:
> 1. We can create an array of locks for each EQ (for example, NCPUs
> locks per EQ), and hash each MD (e.g. by handle cookie) to one of
> these locks. That gives concurrent eq_callbacks without losing the
> order of events for each MD, and the upper layers wouldn't see any
> change.

Yes, this ensures callbacks on each MD remain ordered - however the
current code also guarantees that the callback and any MD
auto-unlinking completes before LNetEQPoll() can return. We have to
verify that relaxing ordering here is OK or else do some similar
lock-hashing, say on the EQ slot.

> We could even have an eq_callback_thread (or a pool of threads)
> inside LNet: lnet_enq_event_locked() would enqueue the event and
> wake up the callback thread, so we wouldn't need to change ptlrpc
> at all.

That adds unnecessary context switching. EQ callbacks may happen in
the context either of the thread doing a PUT or GET (PUT buffered
immediately or you're using the lolnd), or more normally, of an LND
worker thread. That's plenty of potential concurrency we can exploit.

Cheers,
    Eric
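[To make the ignore_bits point concrete: the Portals-style matching
rule, as I understand it (the peer ID check is separate and omitted
here), only compares the bits not covered by the ME's ignore mask, so
ignore_bits = -1 matches any bits and ignore_bits = 0 demands an
exact match. A tiny self-contained sketch:]

#include <stdint.h>

/* Bits covered by ignore_bits are not compared.
 * ignore_bits == ~0ULL  -> match-any (incoming RPC request buffers)
 * ignore_bits == 0      -> match-unique (bulk data, RPC replies)    */
static int me_matches(uint64_t me_match_bits, uint64_t me_ignore_bits,
                      uint64_t incoming_bits)
{
        return ((me_match_bits ^ incoming_bits) & ~me_ignore_bits) == 0;
}

[Only the ignore_bits == 0 case lends itself to the hash lookup
sketched earlier, which is why the proposal hinges on detecting
portals that carry match-unique MEs exclusively.]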
Eric Barton wrote:
>> Isaac and I discussed this and we think:
>> 1. We can create an array of locks for each EQ (for example, NCPUs
>> locks per EQ), and hash each MD (e.g. by handle cookie) to one of
>> these locks. That gives concurrent eq_callbacks without losing the
>> order of events for each MD, and the upper layers wouldn't see any
>> change.
>
> Yes, this ensures callbacks on each MD remain ordered - however the
> current code also guarantees that the callback and any MD
> auto-unlinking completes before LNetEQPoll() can return. We have to
> verify that relaxing ordering here is OK or else do some similar
> lock-hashing, say on the EQ slot.

Yes, then we would need two lock-tables just for serializing
eq_callback and LNetEQPoll, and I think that part of the code would
be complex while the only benefit is concurrency of eq_callback.
After thinking it over again, I prefer a design like this:

1. Have a different hash table for each kind of LNet descriptor
handle.

2. lnet_eq_lock protects the EQ-handle table and all EQ related
operations.

3. lnet_portal_lock protects the ME-handle table and all ME related
operations; attach/detach/match of MDs needs this lock too.

4. lnet_md_lock (a lock-table): each slot of the lock-table protects
one MD-handle table (MDs are hashed into the different MD-tables by
handle cookie). It also protects all MD operations in that MD-table,
and eq_callbacks are serialized by it as well.

5. It's legal to hold lnet_md_lock and then take lnet_eq_lock or
lnet_portal_lock.

6. lnet_lock protects the rest of LNet (peer table, router buffers,
credits...).

Now let's see some examples:

1. LNetMDBind: only needs lnet_md_lock(i) to insert into the
MD-table.

2. LNetMDAttach: needs lnet_md_lock(i) to insert into the MD-table,
then lnet_portal_lock() to attach to the ME.

3. LNetPut: needs lnet_md_lock(i) to find the MD, then lnet_lock()
for the rest of the operations.

4. lnet_parse_put: needs lnet_portal_lock() for matching.

5. lnet_finalize: needs lnet_md_lock(i) (and maybe lnet_portal_lock
for a short while if the MD is on a portal) to check the status of
the MD and serialize the callback and md_unlink, lnet_eq_lock to
enqueue the event, then lnet_lock to return credits etc.

6. LNetEQPoll: needs lnet_eq_lock.

Although this adds more new locks, it's natural to have different
locks for different objects, and everything still follows the
original logic of LNet. The implementation is very straightforward -
most of the time we just need to replace LNET_LOCK with the new lock
name. All LNet interfaces can benefit from a change like this: we
will not only have better concurrency of eq_callback, we will also
have concurrency of all the APIs.

(If we want, we could even have an individual lock for each portal
based on this design (save the portal index in the ME handle),
although I don't think it's really necessary...)
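[As a way of visualising point 4 of the design above, here is a
hedged userspace sketch of a lock-table keyed by MD handle cookie.
The names and pthread mutexes are stand-ins (assumptions), not actual
LNet code.]

#include <pthread.h>
#include <stdint.h>

#define LNET_MD_LOCK_SLOTS 64    /* assumption: e.g. a multiple of NCPUs */

struct md_sketch {
        uint64_t          cookie;        /* handle cookie */
        struct md_sketch *next;
};

struct md_table_slot {
        /* initialised with pthread_mutex_init() at startup;
         * plays the role of "lnet_md_lock(i)" in the text */
        pthread_mutex_t   lock;
        struct md_sketch *hash_head;     /* MD-handle table for this slot */
};

static struct md_table_slot md_slots[LNET_MD_LOCK_SLOTS];

/* Hash an MD handle cookie to its slot; the slot's lock then covers
 * the MD-handle table, all MD operations for MDs in that table, and
 * serialisation of their eq_callbacks. */
static struct md_table_slot *lnet_md_slot(uint64_t cookie)
{
        return &md_slots[(unsigned int)(cookie % LNET_MD_LOCK_SLOTS)];
}

[LNetMDBind would then take only the slot lock to insert the MD
(example 1 above), while lnet_finalize would take it first and then
lnet_eq_lock, consistent with the ordering rule in point 5.]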