Hi Oleg -
Eric will probably fix this along the lines you suggested.
However, I would like to hear why there are no race conditions in this
protocol. How can we be certain that a canceled lock isn't used by the
client? Also, is there a point to having these locks at all?
thanks.
- Peter -
Eric Barton wrote:
> <green> eeb: it seems a major assumption about the possibility of
> reply-less communication between client and server is broken by
> credits
> <eeb> credits are designed to limit the # of LND buffers we have to
> post but require a responsive peer
> <green> right. so if we have enough credits - everything should
> work
> I mean - we can send a lot of rpcs from servers without
> wedging anything - and we already post a lot of buffers at
> liblustre side without actually having a chance to use them
> <eeb> green: increasing credits lets you have more outstanding
> communications at the expense of memory
> <green> right. With liblustre we have tons of credits posted already
> and no real chance to use them except for mds that can send
> AST cancels (this usage is already broken if more than
> peercredits messages need to be posted). also there are connect
> replies and pings that are async and they depend on a number
> of credits that is a direct function of the number of targets per
> server
> we cannot really wait for pings to arrive, btw, what if it
> takes several seconds for a ping to be processed? what if it is
> 80 seconds? That's a colossal waste of time
> <eeb> What I meant by my comment is that I don't think "tinkering"
> with tunables is going to solve this issue - we have to design
> the solution for liblustre failover - until then, we know that
> ensuring the liblustre client is responsive to the network at
> all times when communications may be directed at them (the
> current working assumption) is a correct solution
> <green> eeb: except we have a feature that directly depends on the ability
> to send traffic to an unresponsive liblustre client and for it to
> succeed
> eeb: And, for all cases but those ASTs from MDS we can
> directly predict how many packets we can expect servers to
> send to liblustre client at any given time if liblustre is
> unresponsive
> <eeb> which feature? pinger replies? yes you're right - that
> violates the assumption that liblustre always waits for RPC
> replies
> <green> feature is "instant lock cancels" from MDS.
> there we send ASTs to clients and do not wait for them, clients
> will process those ASTs when they wake up.
> <eeb> looks like we have a communications issue between people - it
> has always been an article of faith as far as I have been
> aware that liblustre receives only solicited communications
> and doesn't return until it has received them
> <green> no. This instant cancel feature was developed with dependence
> on unlimited amount of communication to be sent to liblustre
> and received there in mind, even if liblustre is not listening
> i think we even discussed that with you at the time and you
> said it's possible. I think credits did not exist back then
> <eeb> really?
> <green> it was around 1.4.5 timeframe or even earlier. Did we always
> have credits?
> <eeb> the ptllnd has always had a credit flow scheme
> also, "large" GET/PUT (> 0.5K) has always relied on a
> responsive liblustre
> <green> eeb: Ah. Well, we are not speaking of large RPCs here anyway.
> eeb: So, if we find this function of how many rpcs might be
> sent from servers to unresponsive clients (pings & async
> connects, not taking ASTs into account now) - we can tune
> peercredits settings accordingly based on number of targets
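
If I follow the tuning green proposes here, the bound works out to something
like the sketch below. This is only a rough illustration with invented names
(these are not real ptllnd tunables), and the per-target count is an
assumption rather than a measured figure:

    /* Rough sketch with invented names - not real ptllnd tunables.
     * Lower bound on the peer credits a server needs so its async connect
     * and ping replies can reach a liblustre client that is not listening. */
    static int min_peer_credits(int num_targets, int async_rpcs_per_target)
    {
            /* async_rpcs_per_target covers the connect and ping replies
             * that may be outstanding toward one target at any given time */
            return num_targets * async_rpcs_per_target;
    }
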
> <eeb> now you're putting your finger on our fundamental disagreement
> - I think we need a "designed" solution not a "we can get by
> because we happen not to have exceeded this or that limit"
> <green> This should mitigate 10706 and other similar issues I think?
> At the expense of higher memory consumption on liblustre nodes
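
The memory cost conceded here is easy to estimate. This is only a
back-of-the-envelope sketch; the buffer size below is an assumption, not the
real ptllnd value:

    #include <stddef.h>

    /* Back-of-the-envelope sketch of the memory cost of posting more
     * receive buffers; RX_BUFFER_SIZE is an assumed per-buffer size,
     * not a real ptllnd constant. */
    #define RX_BUFFER_SIZE (4 << 10)        /* assume 4 KiB small-message buffers */

    static size_t posted_buffer_memory(int num_buffers)
    {
            return (size_t)num_buffers * RX_BUFFER_SIZE;
    }
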
> <eeb> Peter (Braam) and I have already talked a bit about doing what
> you want - it's really required if liblustre is to be suitable
> for asynchronous I/O - but it's really a _big_deal_ to
> implement
> (where "what you want" == "arbitrary outstanding
> communications with an unresponsive liblustre")
> <green> eeb: async i/o? I believe you!!
> eeb: As far as a design goes. Do you not think that
> collecting all use cases of servers sending RPCs to unresponsive
> clients and ensuring we always have enough credits
> constitutes a design too?
> <eeb> I can't totally agree I'm afraid - assuming by collecting use
> cases you mean enumerating _all_ allowable use cases,
> sufficient credits is still only 1 of the requirements for
> operation with an unresponsive liblustre client
> <green> eeb: what are the others?
> <eeb> maximum PUT payload to an unresponsive client
> are GETs to unresponsive clients required?
> <green> well, maximum put payload could also be calculated. We can
> forbid GETs to unresponsive clients altogether (they do not happen
> anyway or we would have noticed)
> eeb: so what do you think?
> eeb: btw, is there a way to get peercredits from inside
> liblustre?
> <eeb> I'm very reluctant to overturn the "nothing tried to talk to
> liblustre clients unless liblustre has control" assumption
> until we're _sure_ we understand all the consequences
> s/nothing tried/nothing may try/
> <green> but such talk happening now is a matter of fact
> and can't be easily fixed
> <eeb> similarly, I'm not nearly _sure_ I understand all the
> consequences of that
> <green> hm, I thought it all was simple - for PUTs, if there is a
> buffer posted - it's successful (credits aside), if no buffers
> - message is lost. gets would timeout ;)
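
As I read the delivery semantics being described here, they amount to the
following. This is an illustrative sketch only; the names are invented and
this is not LNET code:

    /* Illustrative only - not LNET code, all names are made up.
     * Delivery semantics described above for a liblustre client that is
     * not currently listening on the network. */
    enum delivery_outcome { DELIVERED, DROPPED, TIMED_OUT };

    static enum delivery_outcome put_outcome(int rx_buffers_posted)
    {
            /* credits aside: a PUT lands only if a receive buffer is
             * posted, otherwise the message is silently lost */
            return rx_buffers_posted > 0 ? DELIVERED : DROPPED;
    }

    static enum delivery_outcome get_outcome(void)
    {
            /* a GET needs the client to serve the source buffer, so
             * against an unresponsive client it can only time out */
            return TIMED_OUT;
    }
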
> <eeb> do you have a formula to compute the number of buffers we need
> to post to prevent messages being lost? How can it be OK for
> arbitrary messages to be lost? If not, how do we discriminate
> between messages we can lose and those we can't?
> <green> Formula is simple - we need to post at least as many buffers
> as there might be messages. Losing messages is not ok, of
> course, but at least lnet server side would be aware of that
> immediately, I presume? And so no 50-second
> timeout. Generally we know that the maximal number of messages to
> be sent to liblustre is not greater than there are targets on
> the server, plus whatever more messages lnet might be sending.
> the only exception is MDS ASTs - if we can know the number of
> peercredits from liblustre - we can set lru_size to that
> number so that MDS won't try to send more cancels than we can
> tolerate
> (I count possible RPCs to an unresponsive client by taking the
> fact that currently there might be only one async rpc from the client to
> every target - this is not taking into account MDS ASTs,
> again. But on MDS we only have one target anyway)
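
Putting green's formula in one place, the buffer count would be roughly the
sum below. Again just a sketch with invented names; the one-async-rpc-per-target
figure is his assumption, and MDS cancel ASTs are capped by clamping lru_size
to the peer credit count:

    /* Sketch of the formula above - invented names, not real Lustre symbols.
     * Buffers liblustre should keep posted so nothing the servers send is
     * dropped while the client is not listening. */
    static int min_posted_buffers(int num_targets, int lru_size)
    {
            int async_rpcs = num_targets;   /* at most one async rpc per target */
            int mds_asts   = lru_size;      /* at most one cancel AST per cached lock */

            return async_rpcs + mds_asts;
    }
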
> <eeb> server-side LNET will only become aware of dropped messages if
> we start using ACKed Cray Portals PUTs
> <green> oh.
> <eeb> But actually I don't think we can design a fix right here on
> IRC - neither of us are fully informed about all the issues
> AFAICS
> <green> well, if messages were dropped - we can make lustre cause a
> reconnect. No messages should be dropped of course ;)
> <eeb> probably we have most of the info between us, but I don't
> think we can simply jump at what look like promising fixes
> <green> eeb: Due to my unawareness of all the underlying stuff, the fix
> actually looks pretty simple to me - as having enough credits
> available ;)
> <eeb> yes - I realise
> <green> what we know for sure: if peercredits is less than the number of
> targets on a server - credits WOULD BE exhausted
> <eeb> I'm really sorry to appear so uncooperative, but I need to
> take this a bit more slowly and devote 100% of my mind to it
> rather than replying "off the cuff" while I'm really busy on
> other things
> <green> eeb: Sure, take your time. I am just trying to make sure you
> understand my idea of things.
> So is there a way to know the peercredits number from liblustre to
> adjust lru_size accordingly?
> <eeb> what is "lru_size"?
> BTW, to help me think about this, can you tell me all the
> places servers can talk to clients when clients are not
> blocking in liblustre?
> <green> this is the amount of unused locks a client is allowed to cache
> unused locks are locks we can get ASTs for.
> so the only places that can send to clients as I see it right
> now - async replies - currently known ones are connect and ping
> replies, all other places wait for replies. And ASTs - can
> happen on MDS only, we can regulate this number with amount of
> locks we hold per client
> thankfully, we do not have any real async i/o ;)
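
So the complete list of unsolicited server-to-client traffic mentioned in
this log appears to be just three cases. The naming below is purely
illustrative, not real Lustre identifiers:

    /* Purely illustrative - not real Lustre identifiers.  The unsolicited
     * server-to-client traffic enumerated above. */
    enum unsolicited_traffic {
            UNSOL_CONNECT_REPLY,    /* async reply to a client connect */
            UNSOL_PING_REPLY,       /* async reply to a pinger ping */
            UNSOL_MDS_CANCEL_AST,   /* MDS lock cancel AST, bounded by lru_size */
    };
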
> <eeb> what do you think we gain by running the pinger on liblustre
> clients?
> <green> currently failover depends on this, I believe. I need to
> refresh my memory on the exact details. I remember that without
> the pinger, failover does not know what targets to choose
> <eeb> does that mean we could disable the pinger until we implement
> liblustre failover correctly (this includes ptllnd fixes)
> <green> no. Cray actually uses failover on some of their
> installations. It is not proper failover, but rather an
> ability to choose between several targets for the same service if
> the main target is down for some reason
> <eeb> I would like braam to be involved in this discussion
> <green> I have no objections
> but this is probably too low-level for him
> eeb: so are we done with this topic or is there anything else
> I can do for you with this issue?
> <eeb> I'll come and hassle you when I have some time...
> <green> ok. I just would like to highlight that this bug is regarded
> as high priority by cray. they expect another system with high
> number of targets per oss to be entered into production soon
> and there they won't be able to use the current workaround of
> disabling the pinger
> <eeb> green: yes - I'm aware of that - it will get my full attention
> RSN
> <green> great. thanks
>
> Cheers,
> Eric
>
> ---------------------------------------------------
> |Eric Barton Barton Software |
> |9 York Gardens Tel: +44 (117) 330 1575 |
> |Clifton Mobile: +44 (7909) 680 356 |
> |Bristol BS8 4LL Fax: call first |
> |United Kingdom E-Mail: eeb@bartonsoftware.com|
> ---------------------------------------------------