vitaly@clusterfs.com
2007-Jan-05  09:46 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the
following link: https://bugzilla.lustre.org/show_bug.cgi?id=10589
Created an attachment (id=9277)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9277&action=view)
HLD v1
The HLD is ready for inspection.
green@clusterfs.com
2007-Jan-10  06:43 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
           What                     |Removed                     |Added
----------------------------------------------------------------------------
         Attachment #9277 Flag      |review?(green@clusterfs.com)|review-
(From update of attachment 9277)
> The period of time that a lock is allowed to be alive. When expired, the
> lock is to be cancelled.
> LDLM_MAX_ALIVE
Don't make it static; it should be adjustable through /proc.
> int ldlm_resource_get_unused(ldlm_resource *res, list_head *head, __u64 bits)
> This method gathers all the unused locks on the resource @res into @head.
> Only those inodebits locks that match the policy @bits are added to @head.
> Further attempts to find or cancel the locks inserted into @head will
> neither find them nor be able to cancel them.
I think it should be applicable to any lock, not just inodebits locks.
You want this same feature for extent locks too, for example (on destroy).
Perhaps it should take a policy here and gather all the locks that intersect
policy-wise. In fact I see you use this function from OSC too, and bits there
make zero sense.
Also, throughout the HLD you keep talking about the MDS; this should work on
both the MDS and the OST.
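
For illustration only, a minimal sketch of the gathering step under discussion,
with the policy argument suggested above instead of hardcoded bits. The
ldlm_policy_match() helper is hypothetical, and the lock/resource field names
and flags used below are assumptions, not the HLD interface:

    /* Hedged sketch: ldlm_policy_match() is a hypothetical helper; field
     * names and flags are assumptions based on the usual ldlm conventions. */
    int ldlm_resource_get_unused(struct ldlm_resource *res,
                                 struct list_head *head,
                                 ldlm_policy_data_t *policy)
    {
            struct ldlm_lock *lock, *next;
            int count = 0;

            lock_res(res);
            list_for_each_entry_safe(lock, next, &res->lr_granted, l_res_link) {
                    /* skip locks that are in use or already being cancelled */
                    if (lock->l_readers || lock->l_writers ||
                        (lock->l_flags & (LDLM_FL_CBPENDING | LDLM_FL_CANCELING)))
                            continue;
                    /* match by policy for any lock type, not by inodebits only */
                    if (policy && !ldlm_policy_match(lock, policy))
                            continue;
                    /* mark the lock so later matches/cancels skip it, and take
                     * it off the lru while we hold a reference */
                    lock->l_flags |= LDLM_FL_CANCELING;
                    ldlm_lock_remove_from_lru(lock);
                    LDLM_LOCK_GET(lock);
                    list_add(&lock->l_bl_ast, head);
                    count++;
            }
            unlock_res(res);
            return count;
    }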
> int ll_mdc_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
> void *data, int flag)
> The callback that is called when a blocking request is received or a cancel
> request is about to be sent. It changes all the metadata and data guarded by
> the cancelled lock.
It does not change metadata; it uncaches/invalidates it.
> void lustre_swab_ldlm_request(struct ldlm_request *rq)
> Convert the ldlm_request structure from the little-endian wire format to
> CPU-endian format.
In fact the wire format is the CPU-endian format of the sending CPU; swabbing
only happens if the nodes at the ends of the wire have different endianness,
as I understand it.
Use case missing:
extent lock enqueue cancelling old locks from the LRU beforehand.
Logic:
In ldlm_lru_get_redundant:
    LDLM_LOCK_GET(lock);
    spin_unlock(&ns->ns_unused_lock);
    lock_res_and_lock(lock);
At this point the lock might have been removed from the LRU by another
process (and not necessarily for cancel, which is the only case you check)!
You need to verify it is still in the LRU, e.g. by calling
ldlm_lock_remove_from_lru (you need to call it anyway to remove the lock from
the LRU); see the example in ldlm_cancel_lru.
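
A minimal sketch of the re-check being asked for, modelled on ldlm_cancel_lru();
the surrounding loop structure and variable names are assumptions:

    spin_lock(&ns->ns_unused_lock);
    while (!list_empty(&ns->ns_unused_list)) {
            struct ldlm_lock *lock = list_entry(ns->ns_unused_list.next,
                                                struct ldlm_lock, l_lru);

            LDLM_LOCK_GET(lock);
            spin_unlock(&ns->ns_unused_lock);
            lock_res_and_lock(lock);

            /* While ns_unused_lock was dropped another thread may have taken
             * the lock out of the lru (for reuse, not only for cancel);
             * ldlm_lock_remove_from_lru() returns 0 in that case. */
            if (!ldlm_lock_remove_from_lru(lock)) {
                    unlock_res_and_lock(lock);
                    LDLM_LOCK_PUT(lock);
                    spin_lock(&ns->ns_unused_lock);
                    continue;
            }

            /* ... mark the lock for cancel and move it to the gathered list ... */

            unlock_res_and_lock(lock);
            spin_lock(&ns->ns_unused_lock);
    }
    spin_unlock(&ns->ns_unused_lock);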
In ll_mdc_blocking_ast:
You cannot use ll_have_md_lock, I think. It matches locks with
LDLM_FL_CBPENDING set and therefore will always match (the very lock you are
cancelling) and will always return 1.
In osc_destroy: we want to set LDLM_FL_DISCARD_DATA in the flags of the locks
before cancelling them, so that whatever dirty data we have is not flushed to
the server.
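
A sketch of that osc_destroy() change, assuming the locks have already been
gathered onto a local list through l_bl_ast (the list name is an assumption):

    struct ldlm_lock *lock;

    /* mark every gathered lock so the dirty pages under it are discarded
     * rather than flushed before the cancel (or bundled destroy) goes out */
    list_for_each_entry(lock, &cancels, l_bl_ast) {
            lock_res_and_lock(lock);
            lock->l_flags |= LDLM_FL_DISCARD_DATA;
            unlock_res_and_lock(lock);
    }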
Elsewhere in mdc_*/osc_destroy:
there is no point in searching for locks in the OBD_CONNECT_EARLY_CANCEL case,
I think? Or do you think we can save on the blocking AST? Then for such a case
cancel ASTs should be sent to the servers (and ldlm_resource_cancel_unused must
not set the cancelling flags); failing to send these ASTs will cause the server
to still send blocking ASTs for these locks (which the client no longer has),
and then to evict the client because it never replies with a cancel.
In lustre_swab_ldlm_request:
Why no swabbing for lock handles themselves if there are more than 2?
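
For context, the HLD's swab helper presumably ends up looking something like
the sketch below; the exact field names of the batched-cancel wire format are
an assumption here:

    void lustre_swab_ldlm_request(struct ldlm_request *rq)
    {
            __swab32s(&rq->lock_flags);
            lustre_swab_ldlm_lock_desc(&rq->lock_desc);
            __swab32s(&rq->lock_count);  /* number of bundled cancel handles */
            /* rq->lock_handle[] is not swabbed; hence the question above */
    }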
Locking:
> While we are sending a batched cancel request, a lock may be cancelled on
> the server in parallel. Then this lock will not be found by its handle on
> the server when our batched cancel request is handled. This is also
> considered a valid race.
This cannot happen. The server won't cancel a lock without confirmation from
the client, so either it degenerates into the 1st case in the locking section,
or the lock is already cancelled client-side, in which case we won't find it
on the client and won't send the handle to the server.
vitaly@clusterfs.com
2007-Jan-10  12:27 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
(In reply to comment #27)
> > int ldlm_resource_get_unused(ldlm_resource *res, list_head *head,
> > __u64 bits)
> > This method gathers all the unused locks on the resource @res into
> > the @head. Only those inodebits locks are added to @head that match
> > by the policy @bits. The further attempts to find or cancel locks
> > inserted into the list @head, will not find nor be able to cancel them.
> I think it should be applicable to any lock, not just inodebit locks.
Yes, it is, but @bits are checked for inodebits locks only.
> You want this same feature for extent locks too, for example (on destroy).
> perhaps it should take policy here and gather all locks that intersect
> policy-wise.
> In fact I see you use this function from OSC too, and bits there just
> make zero sense.
Right, no defect here.
> Elsewhere in mdc* osc_destroy:
> there is no point in searching locks when OBD_CONNECT_EARLY_CANCEL I think?
> Or do you think we can save on blocking ast?
The blocking AST will be sent by the server anyway, but we can save on the
later cancel RPC.
> Then for such a case cancelling
> ASTs should be sent to servers (and ldlm_resource_cancel_unused must not
> set the cancelling flags), failing to send these ASTs will cause server
> to send blocking ASTs for these locks still (that client does not have
> already), and then evicting client because it never replies with cancel.
That does not seem correct: the server will get a blocking-AST reply with an
EINVAL error, which is currently considered a legal race in
ldlm_handle_ast_error() (sketched after this comment). After that the server
does not wait for a cancel; it just cancels its lock and proceeds without
evicting the client. Meanwhile the cancel RPC is not sent back to the server,
which saves 1 RPC.
> In lustre_swab_ldlm_request:
> Why no swabbing for lock handles themselves if there are more than 2?
Lock handles are opaque and are not swabbed.
> Locking:
> > While we are sending a batched cancel request, a lock may be cancelled
> > on server in parallel. Then this lock will not be found by its handle
> > on the server when our batched cancel request will be handled. This
> > is also considered as a valid race.
> This cannot happen. Server won't cancel lock without confirmation from
> the client, so either it degenerates into 1st case in locking section
> or the lock is already canceled client-side and then we won't find it
> on client and won't send the handle to server.
Yes, this is a repeat of the 1st case of the locking section. I will remove
this one.
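
To illustrate the race handling referred to above, a paraphrased sketch with a
hypothetical wrapper name; this is not the actual ldlm_handle_ast_error() body:

    static int handle_blocking_ast_reply(struct ldlm_lock *lock, int rc)
    {
            if (rc == -EINVAL) {
                    /* the client has already cancelled this lock locally:
                     * treat it as a legal race, drop the server-side copy
                     * and carry on without evicting the client */
                    ldlm_lock_cancel(lock);
                    return 0;
            }
            if (rc)
                    return rc;   /* real AST failure: caller evicts the client */
            return 0;            /* AST accepted: wait for the client's cancel */
    }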
vitaly@clusterfs.com
2007-Jan-12  13:43 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
           What                     |Removed                     |Added
----------------------------------------------------------------------------
       Attachment #9277 is obsolete |0                           |1
Created an attachment (id=9331)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9331&action=view)
HLD v2
The fixed version includes some bug fixes according to Oleg's inspection:
-- the lock alive-time limit is tunable through /proc;
-- a use case is added: extent lock enqueue cancels unused locks beforehand;
-- ldlm_lru_get_redundant() logic is fixed;
-- osc_destroy sets LDLM_FL_DISCARD_DATA on the locks it cancels beforehand;
-- some other minor fixes.
green@clusterfs.com
2007-Jan-12  16:19 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
           What                     |Removed                     |Added
----------------------------------------------------------------------------
         Attachment #9331 Flag      |review?(green@clusterfs.com)|review-
(From update of attachment 9331)
Requirements:
> The client cancels all the conflicting locks for the current metadata
> operation on the client side beforehand.
I think you should drop the word 'metadata' here and everywhere else in this
section. We want to cancel locks beforehand even for non-metadata operations
like destroy (and maybe truncate and read/write in the future).
ldlm_resource_cancel_unused is poorly named; there is nothing "unused" about
it. Perhaps it would be better to rename it to ldlm_resource_cancel_list?
As I already said, ldlm_resource_get_unused should take a policy argument, OR
maybe we need specific versions of that function for different lock types?
Hardcoding bits is confusing, I think. We might well want to match certain
subsets of extent locks very soon now.
In ldlm_resource_get_unused() you happily use l_bl_ast, but I think we need
some proof that there is no parallel user of that field either (like LRU
cancel).
The code calling ldlm_lru_get_redundant now does not set the instant-cancel
flag at all, which will result in cancel RPCs being sent even if the server
supports bundled cancels. This is bad. I think you only need to use
ldlm_lru_get_redundant (in ldlm_cli_enqueue) if the server supports bundled
cancels, and always set the LDLM_FL_CANCEL_ON_BLOCK flag.
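
A sketch of the gating being asked for in ldlm_cli_enqueue; the connect-flag
test and the exact ldlm_lru_get_redundant() signature are assumptions, not the
agreed interface:

    LIST_HEAD(cancels);
    int count = 0;

    /* only gather lru locks if the server can take the cancels bundled in
     * the enqueue request; otherwise a separate cancel RPC would be needed
     * and nothing is saved */
    if (exp->exp_connect_flags & OBD_CONNECT_EARLY_CANCEL)
            count = ldlm_lru_get_redundant(ns, &cancels, 1,
                                           LDLM_FL_CANCEL_ON_BLOCK);
    /* the flag ensures no standalone cancel RPC is sent later for these
     * locks when their blocking ASTs arrive */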
In the Scalability and Performance section:
Currently there is a case where performance gets worse and the number of RPCs
grows: when you bundle a cancel (and cancel the lock), but before you finally
send your RPC, some parallel thread takes that lock again. That's one RPC more
(to get the lock).
Alternatives:
As we discussed, we might bundle cancels for conflicting PR and PW locks when
getting intersecting PW/PR locks, but this is counterproductive. A better
course of action here would be to revive our lock-converting code and improve
it to allow for extent growing (I made a comment in bug 2746 to explain that
better).
vitaly@clusterfs.com
2007-Jan-12  17:11 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
           What                     |Removed                     |Added
----------------------------------------------------------------------------
       Attachment #9331 is obsolete |0                           |1
Created an attachment (id=9334)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9334&action=view)
HLD v3
The following fixes are included according to Oleg's inspection:
-- ldlm_resource_get_unused() is given a policy argument instead of bits;
-- the initially introduced l_bl_ast usage in ldlm_cancel_lru() is removed
(it seems there are no other users of l_bl_ast on the client, so
ldlm_resource_get_unused() should work properly);
-- ldlm_lru_get_redundant() gets another parameter, lock_flags, which is
installed on the matched locks;
-- ldlm_cli_enqueue() calls ldlm_lru_get_redundant() only when the server
supports bundled cancels, and passes CANCEL_ON_BLOCK to it to avoid sending
separate cancels;
-- some other minor bug fixes and documentation improvements.
vitaly@clusterfs.com
2007-Jan-15  07:45 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
           What                     |Removed                     |Added
----------------------------------------------------------------------------
       Attachment #9334 is obsolete |0                           |1
Created an attachment (id=9340)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9340&action=view)
HLD v4
The following fixes are included according to Oleg's inspection:
-- lock_handle2 is no longer used as the parent lock handle;
-- ldlm_cli_enqueue reserves 1 more lock handle since the 1st one is occupied;
-- spelling mistakes are fixed.
vitaly@clusterfs.com
2007-Jan-16  16:35 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
           What                     |Removed                     |Added
----------------------------------------------------------------------------
       Attachment #9340 is obsolete |0                           |1
Created an attachment (id=9348)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9348&action=view)
HLD v5
The following fixes are included according to Alex's inspection:
-- as ldlm_cli_enqueue does not always create the request, the
ldlm_resource_get_unused() call is added to the callers that do create it
(mdc_enqueue, osc_enqueue);
-- the missing ldlm_lock_reprocess and dput calls are added;
-- cleanups.
green@clusterfs.com
2007-Jan-17  12:46 UTC
[Lustre-devel] [Bug 10589] metadata RPC reduction (e.g. for rm performance)
           What                     |Removed                     |Added
----------------------------------------------------------------------------
         Attachment #9348 Flag      |review?(green@clusterfs.com)|review+
(From update of attachment 9348)
Now that I think about it, you should probably mention somewhere in the
alternatives that the LMV code, in the clustered metadata case, should offer
similar facilities and take care of all the complexity, because now you
suddenly want to cancel locks in different namespaces on different metadata
servers.