vitaly@clusterfs.com
2007-Jan-05 09:46 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=10589

Created an attachment (id=9277) --> (https://bugzilla.lustre.org/attachment.cgi?id=9277&action=view)
HLD v1

The HLD is ready for inspection.
green@clusterfs.com
2007-Jan-10 06:43 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
            What               |Removed                     |Added
----------------------------------------------------------------------------
Attachment #9277 Flag          |review?(green@clusterfs.com)|review-

(From update of attachment 9277)

> The period of time that a lock is allowed to be alive. When expired, the lock is to be cancelled.
> LDLM_MAX_ALIVE

Don't make it static, it should be adjustable through /proc.

> int ldlm_resource_get_unused(ldlm_resource *res, list_head *head, __u64 bits)
> This method gathers all the unused locks on the resource @res into the @head. Only those inodebits locks are added to @head that match by the policy @bits. The further attempts to find or cancel locks inserted into the list @head will not find nor be able to cancel them.

I think it should be applicable to any lock, not just inodebits locks. You want this same feature for extent locks too, for example (on destroy). Perhaps it should take a policy here and gather all locks that intersect policy-wise. In fact I see you use this function from OSC too, and bits there make zero sense. Also, throughout the HLD you keep talking about the MDS; this should work on both MDS and OST.

> int ll_mdc_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc, void *data, int flag)
> The callback that is called when a blocking request is received or a cancel request is about to be sent. It changes all the metadata and data guarded with the cancelled lock.

It does not change metadata, it uncaches/invalidates it.

> void lustre_swab_ldlm_request(struct ldlm_request *rq)
> Convert the ldlm_request structure from the little-endian wire format to CPU-endian format.

In fact the wire format is the CPU-endian format of the sending CPU. Swabbing only happens if the nodes at the two ends of the wire are differently-endian, as I understand it.

Use case missing: extent lock enqueue cancelling old locks from the LRU beforehand.

Logic, in ldlm_lru_get_redundant:

  LDLM_LOCK_GET(lock);
  spin_unlock(&ns->ns_unused_lock);
  lock_res_and_lock(lock);

At this point the lock might have been removed from the LRU by another process, and not necessarily for cancel (you check only that case). You need to verify it is still in the LRU, e.g. by calling ldlm_lock_remove_from_lru (you need to do that anyway to remove the lock from the LRU); see the example in ldlm_cancel_lru and the sketch after this comment.

In ll_mdc_blocking_ast:
You cannot use ll_have_md_lock, I think. It matches locks with LDLM_FL_CBPENDING set, and therefore will always match (this very lock you are cancelling) and will always return 1.

In osc_destroy:
We want to set LDLM_FL_DISCARD_DATA in the flags of the locks before cancelling them, so that whatever dirty data we have is not flushed to the server.

Elsewhere in mdc*/osc_destroy:
There is no point in searching locks when OBD_CONNECT_EARLY_CANCEL is set, I think? Or do you think we can save on the blocking AST? Then for such a case cancelling ASTs should be sent to the servers (and ldlm_resource_cancel_unused must not set the cancelling flags); failing to send these ASTs will cause the server to send blocking ASTs for these locks anyway (locks the client no longer has), and then to evict the client because it never replies with a cancel.

In lustre_swab_ldlm_request:
Why no swabbing for the lock handles themselves if there are more than 2?

Locking:
> While we are sending a batched cancel request, a lock may be cancelled on the server in parallel. Then this lock will not be found by its handle on the server when our batched cancel request is handled. This is also considered a valid race.

This cannot happen. The server won't cancel a lock without confirmation from the client, so either it degenerates into the 1st case in the locking section, or the lock is already cancelled client-side and then we won't find it on the client and won't send the handle to the server.
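As an illustration of the LRU walk with that re-check, a minimal sketch follows. The function name ldlm_lru_get_redundant and the exact signature are taken from the HLD under review, and the surrounding field names (ns_unused_lock, ns_unused_list, l_bl_ast) are assumptions; this is not code from a released Lustre tree.

/* Sketch: drain up to @count unused locks from the namespace LRU into
 * @cancels.  After ns_unused_lock is dropped, the lock may have left the
 * LRU, so ldlm_lock_remove_from_lru() is used both to re-check LRU
 * membership and to unlink the lock. */
static int ldlm_lru_get_redundant(struct ldlm_namespace *ns,
                                  struct list_head *cancels, int count)
{
        struct ldlm_lock *lock;
        int added = 0;

        while (added < count) {
                spin_lock(&ns->ns_unused_lock);
                if (list_empty(&ns->ns_unused_list)) {
                        spin_unlock(&ns->ns_unused_lock);
                        break;
                }
                lock = list_entry(ns->ns_unused_list.next,
                                  struct ldlm_lock, l_lru);
                LDLM_LOCK_GET(lock);
                spin_unlock(&ns->ns_unused_lock);

                lock_res_and_lock(lock);
                /* A parallel thread (lru cancel, blocking AST, ...) may
                 * have taken the lock off the LRU while ns_unused_lock was
                 * dropped; only a lock we unlink ourselves is ours to
                 * batch. */
                if (!ldlm_lock_remove_from_lru(lock)) {
                        unlock_res_and_lock(lock);
                        LDLM_LOCK_PUT(lock);
                        continue;
                }

                /* Mark it cancelling so it is not matched or granted again
                 * while it waits on the batched cancel list. */
                lock->l_flags |= LDLM_FL_CBPENDING;
                unlock_res_and_lock(lock);

                list_add(&lock->l_bl_ast, cancels);
                added++;
        }
        return added;
}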
vitaly@clusterfs.com
2007-Jan-10 12:27 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
(In reply to comment #27)

> > int ldlm_resource_get_unused(ldlm_resource *res, list_head *head, __u64 bits)
> > This method gathers all the unused locks on the resource @res into the @head. Only those inodebits locks are added to @head that match by the policy @bits. The further attempts to find or cancel locks inserted into the list @head will not find nor be able to cancel them.
>
> I think it should be applicable to any lock, not just inodebits locks.

Yes, it is, but @bits are checked for inodebits locks only.

> You want this same feature for extent locks too, for example (on destroy). Perhaps it should take a policy here and gather all locks that intersect policy-wise. In fact I see you use this function from OSC too, and bits there make zero sense.

Right, no defect here.

> Elsewhere in mdc*/osc_destroy:
> There is no point in searching locks when OBD_CONNECT_EARLY_CANCEL is set, I think? Or do you think we can save on the blocking AST?

The blocking AST will be sent by the server anyway, but we can save on the later cancel RPC.

> Then for such a case cancelling ASTs should be sent to the servers (and ldlm_resource_cancel_unused must not set the cancelling flags); failing to send these ASTs will cause the server to send blocking ASTs for these locks anyway (locks the client no longer has), and then to evict the client because it never replies with a cancel.

That does not seem correct: the server will get a blocking-AST reply with an EINVAL error, which is currently considered a legal race in ldlm_handle_ast_error(). After that the server does not wait for a cancel; it just cancels its lock and proceeds without evicting the client. Meanwhile the cancel RPC is not sent back to the server, which saves one RPC.

> In lustre_swab_ldlm_request:
> Why no swabbing for the lock handles themselves if there are more than 2?

Lock handles are opaque and are not swabbed.

> Locking:
> > While we are sending a batched cancel request, a lock may be cancelled on the server in parallel. Then this lock will not be found by its handle on the server when our batched cancel request is handled. This is also considered a valid race.
>
> This cannot happen. The server won't cancel a lock without confirmation from the client, so either it degenerates into the 1st case in the locking section, or the lock is already cancelled client-side and then we won't find it on the client and won't send the handle to the server.

Yes, this is a repetition of the 1st case of the locking section. I will remove it.
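The server-side behaviour referred to here can be pictured with a simplified handler. This is only an illustration of the described flow, not the actual ldlm_handle_ast_error() source; where exactly the server copy of the lock is cancelled, and the helpers shown (ldlm_failed_ast, ldlm_lock_cancel), are assumptions used to make the race concrete.

/* Illustration: the client answers a blocking AST with -EINVAL because it
 * already cancelled the lock locally and skipped the cancel RPC.  The
 * server treats this as a normal race and does not evict the client. */
static int ldlm_handle_ast_error(struct ldlm_lock *lock,
                                 struct ptlrpc_request *req,
                                 int rc, const char *ast_type)
{
        if (rc == -ETIMEDOUT) {
                /* No reply at all: the client is dead or unreachable and
                 * has to be evicted. */
                ldlm_failed_ast(lock, rc, ast_type);
                return rc;
        }

        if (rc == -EINVAL) {
                /* The client no longer knows the lock handle: legal race.
                 * Cancel the server copy, do not wait for a cancel RPC,
                 * do not evict the client. */
                LDLM_DEBUG(lock, "client returned -EINVAL from %s AST, "
                           "normal race", ast_type);
                ldlm_lock_cancel(lock);
                return 0;
        }

        if (rc != 0)
                ldlm_failed_ast(lock, rc, ast_type);

        return rc;
}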
vitaly@clusterfs.com
2007-Jan-12 13:43 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
            What               |Removed |Added
----------------------------------------------------------------------------
Attachment #9277 is obsolete   |0       |1

Created an attachment (id=9331) --> (https://bugzilla.lustre.org/attachment.cgi?id=9331&action=view)
HLD v2

The fixed version includes some bugfixes according to Oleg's inspection:
-- lock alive time limit is tunable through /proc;
-- use case is added: extent lock enqueue cancels unused locks beforehand;
-- ldlm_lru_get_redundant() logic is fixed;
-- osc_destroy sets LDLM_FL_DISCARD_DATA on the locks it cancels beforehand;
-- some other minor fixes.
green@clusterfs.com
2007-Jan-12 16:19 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
            What               |Removed                     |Added
----------------------------------------------------------------------------
Attachment #9331 Flag          |review?(green@clusterfs.com)|review-

(From update of attachment 9331)

Requirements:
> The client cancels all the conflict locks for the current metadata operation on the client side beforehand.

I think you should drop the word 'metadata' here and everywhere else in this section. We want to cancel locks beforehand even for non-metadata operations like destroy (and maybe truncate and read/write in the future).

ldlm_resource_cancel_unused is poorly named, there is nothing about "unused" in it; perhaps it is better to rename it to ldlm_resource_cancel_list?

Like I already said, ldlm_resource_get_unused should take a policy argument, or maybe we need specific versions of that function for different lock types? Hardcoding bits is confusing, I think. We may well want to match certain subsets of extent locks very soon now.

In ldlm_resource_get_unused() you happily use l_bl_ast, but I think we need some proof that there is no parallel user of that field either (like lru cancel).

Code calling ldlm_lru_get_redundant now does not set the instant-cancel flag at all, which will result in cancel RPCs being sent even if the server supports bundled cancels. This is bad. I think you only need to use ldlm_lru_get_redundant (in ldlm_cli_enqueue) if the server supports bundled cancels, and you should always set the LDLM_FL_CANCEL_ON_BLOCK flag; see the sketch after this comment.

In section Scalability and performance:
Currently there is a case where performance gets worse and the number of RPCs grows: when you bundle a cancel (and cancel the lock), but before you finally send your RPC some parallel thread gets that lock again. That is one RPC more (to get the lock).

Alternatives:
Like we discussed, we might bundle cancels for conflicting PR and PW locks when getting intersecting PW/PR locks, but this is counterproductive. A better course of action here would be to revive our lock-converting code and improve it to allow for extent growing (I made a comment in bug 2746 to better explain that).
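A minimal sketch of that check, assuming the names used in the HLD: ldlm_lru_get_redundant (with the extra lock_flags argument that v3 of the HLD later adds) and OBD_CONNECT_EARLY_CANCEL for bundled-cancel support. The wrapper name ldlm_prepare_bundled_cancels is hypothetical; this is not code from a released tree.

/* Sketch: gather LRU locks into the enqueue request only when the server
 * can take bundled cancels, and mark them CANCEL_ON_BLOCK so the client
 * never generates a standalone cancel RPC for them. */
static int ldlm_prepare_bundled_cancels(struct obd_export *exp,
                                        struct ldlm_namespace *ns,
                                        struct list_head *cancels, int max)
{
        if (!(exp->exp_connect_flags & OBD_CONNECT_EARLY_CANCEL))
                return 0;       /* old server: plain enqueue, LRU untouched */

        /* The caller packs the returned number of lock handles from
         * @cancels into the enqueue request; the server cancels them
         * before granting the new lock, so no separate cancel RPC and no
         * blocking-AST round trip is needed for these locks. */
        return ldlm_lru_get_redundant(ns, cancels, max,
                                      LDLM_FL_CANCEL_ON_BLOCK);
}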
vitaly@clusterfs.com
2007-Jan-12 17:11 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
            What               |Removed |Added
----------------------------------------------------------------------------
Attachment #9331 is obsolete   |0       |1

Created an attachment (id=9334) --> (https://bugzilla.lustre.org/attachment.cgi?id=9334&action=view)
HLD v3

The following fixes are included according to Oleg's inspection:
-- ldlm_resource_get_unused() is given a policy argument instead of bits;
-- the initially introduced l_bl_ast usage in ldlm_cancel_lru() is removed (it seems there are no more users of l_bl_ast on the client, thus ldlm_resource_get_unused() should work properly);
-- ldlm_lru_get_redundant() has gotten another parameter, lock_flags, which is installed on the matched locks;
-- ldlm_cli_enqueue() calls ldlm_lru_get_redundant() only when the server supports bundled cancels, and passes CANCEL_ON_BLOCK to it to avoid sending separate cancels;
-- some other minor bugfixes and documentation improvements.
vitaly@clusterfs.com
2007-Jan-15 07:45 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
            What               |Removed |Added
----------------------------------------------------------------------------
Attachment #9334 is obsolete   |0       |1

Created an attachment (id=9340) --> (https://bugzilla.lustre.org/attachment.cgi?id=9340&action=view)
HLD v4

The following fixes are included according to Oleg's inspection:
-- lock_handle2 is not used as the parent lock handle anymore;
-- ldlm_cli_enqueue reserves one more lock handle, as the 1st is occupied;
-- spelling mistakes.
vitaly@clusterfs.com
2007-Jan-16 16:35 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
            What               |Removed |Added
----------------------------------------------------------------------------
Attachment #9340 is obsolete   |0       |1

Created an attachment (id=9348) --> (https://bugzilla.lustre.org/attachment.cgi?id=9348&action=view)
HLD v5

The following fixes are included according to Alex's inspection:
-- as ldlm_cli_enqueue does not always create the request, ldlm_resource_get_unused() is added to the callers which do (mdc_enqueue, osc_enqueue);
-- missed ldlm_lock_reprocess and dput calls are added;
-- cleanups.
green@clusterfs.com
2007-Jan-17 12:46 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
            What               |Removed                     |Added
----------------------------------------------------------------------------
Attachment #9348 Flag          |review?(green@clusterfs.com)|review+

(From update of attachment 9348)

Now that I think about it, you should probably mention somewhere in the alternatives that the LMV code, in the case of clustered metadata, should offer similar facilities and take care of all the complexity, because then you suddenly want to cancel locks in different namespaces on different metadata servers.