Hi Oleg -
Eric will probably fix this along the lines you suggested.
However, I would like to hear why there are no race conditions in this
protocol. How can we be certain that a canceled lock isn't used by the
client? Also, is there a point to having these locks at all?
thanks.
- Peter -
Eric Barton wrote:
> <green> eeb: it seems a major assumption about the possibility of
> reply-less communication between client and server is broken by
> credits
> <eeb> credits are designed to limit the # of LND buffers we have to
> post but require a responsive peer
> <green> right. so if we have enough credits - everything should
> work
> I mean - we can send a lot of rpcs from servers without
> wedging anything - and we already post a lot of buffers at
> liblustre side without actually having a chance to use them
> <eeb> green: increasing credits lets you have more outstanding
> communications at the expense of memory
> <green> right. With liblustre we have tons of credits posted already
> and no real chance to use them except for mds that can send
> AST cancels (this usage is already broken if more than
> peercredits messages need to be posted). also there are connect
> replies and pings that are async and they depend on a number
> of credits that is a direct function of the number of targets per
> server
> we cannot really wait for pings to arrive, btw, what if it
> takes several seconds for a ping to be processed? what if it is
> 80 seconds? That's a colossal waste of time
> <eeb> What I meant by my comment is that I don't think "tinkering"
> with tunables is going to solve this issue - we have to design
> the solution for liblustre failover - until then, we know that
> ensuring the liblustre client is responsive to the network at
> all times when communications may be directed at them (the
> current working assumption) is a correct solution
> <green> eeb: except we have a feature that directly depends on the ability
> to send traffic to an unresponsive liblustre client and for it to
> succeed
> eeb: And, for all cases but those ASTs from MDS we can
> directly predict how many packets we can expect servers to
> send to liblustre client at any given time if liblustre is
> unresponsive
> <eeb> which feature? pinger replies? yes you're right - that
> violates the assumption that liblustre always waits for RPC
> replies
> <green> feature is "instant lock cancels" from MDS.
> there we send ASTs to clients and do not wait for them, clients
> will process those ASTs when they wake up.
> <eeb> looks like we have a communications issue between people - it
> has always been an article of faith as far as I have been
> aware that liblustre receives only solicited communications
> and doesn't return until it has received them
> <green> no. This instant cancel feature was developed with dependence
> on unlimited amount of communication to be sent to liblustre
> and received there in mind, even if liblustre is not listening
> i think we even discussed that with you at the time and you
> said it's possible. I think credits did not exist back then
> <eeb> really?
> <green> it was around 1.4.5 timeframe or even earlier. Did we always
> have credits?
> <eeb> the ptllnd has always had a credit flow scheme
> also, "large" GET/PUT (> 0.5K) has always relied on a
> responsive liblustre
> <green> eeb: Ah. Well, we are not speaking of large RPCs here anyway.
> eeb: So, if we find this function of how many rpcs might be
> sent from servers to unresponsive clients (pings & async
> connects, not taking ASTs into account now) - we can tune
> peercredits settings accordingly based on number of targets
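
If I follow the tuning green proposes here, the bound works out to something
like the sketch below. This is only a rough illustration with invented names
(these are not real ptllnd tunables), and the per-target count is an
assumption rather than a measured figure:

    /* Rough sketch with invented names - not real ptllnd tunables.
     * Lower bound on the peer credits a server needs so its async connect
     * and ping replies can reach a liblustre client that is not listening. */
    static int min_peer_credits(int num_targets, int async_rpcs_per_target)
    {
            /* async_rpcs_per_target covers the connect and ping replies
             * that may be outstanding toward one target at any given time */
            return num_targets * async_rpcs_per_target;
    }
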
> <eeb> now you're putting your finger on our fundamental disagreement
> - I think we need a "designed" solution not a "we can get by
> because we happen not to have exceeded this or that limit"
> <green> This should mitigate 10706 and other similar issues I think?
> At the expense of higher memory consumption on liblustre nodes
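
The memory cost conceded here is easy to estimate. This is only a
back-of-the-envelope sketch; the buffer size below is an assumption, not the
real ptllnd value:

    #include <stddef.h>

    /* Back-of-the-envelope sketch of the memory cost of posting more
     * receive buffers; RX_BUFFER_SIZE is an assumed per-buffer size,
     * not a real ptllnd constant. */
    #define RX_BUFFER_SIZE (4 << 10)        /* assume 4 KiB small-message buffers */

    static size_t posted_buffer_memory(int num_buffers)
    {
            return (size_t)num_buffers * RX_BUFFER_SIZE;
    }
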
> <eeb> Peter (Braam) and I have already talked a bit about doing what
> you want - it's really required if liblustre is to be suitable
> for asynchronous I/O - but it's really a _big_deal_ to
> implement
> (where "what you want" == "arbitrary outstanding
> communications with an unresponsive liblustre")
> <green> eeb: async i/o? I believe you!!
> eeb: As far as a design goes. Do you not think that
> collecting all use cases of servers sending RPCs to unresponsive
> clients and ensuring we always have enough credits
> constitutes a design too?
> <eeb> I can't totally agree I'm afraid - assuming by collecting use
> cases you mean enumerating _all_ allowable use cases,
> sufficient credits is still only 1 of the requirements for
> operation with an unresponsive liblustre client
> <green> eeb: what are the others?
> <eeb> maximum PUT payload to an unresponsive client
> are GETs to unresponsive clients required?
> <green> well, maximum put payload could also be calculated. We can
> forbid GETs to unresponsive clients altogether (they do not happen
> anyway or we would have noticed)
> eeb: so what do you think?
> eeb: btw, is there a way to get peercredits from inside
> liblustre?
> <eeb> I'm very reluctant to overturn the "nothing tried to talk to
> liblustre clients unless liblustre has control" assumption
> until we're _sure_ we understand all the consequences
> s/nothing tried/nothing may try/
> <green> but such talk happening now is a matter of fact
> and can't be easily fixed
> <eeb> similarly, I'm not nearly _sure_ I understand all the
> consequences of that
> <green> hm, I thought it all was simple - for PUTs, if there is a
> buffer posted - it's successful (credits aside), if no buffers
> - message is lost. gets would timeout ;)
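
As I read the delivery semantics being described here, they amount to the
following. This is an illustrative sketch only; the names are invented and
this is not LNET code:

    /* Illustrative only - not LNET code, all names are made up.
     * Delivery semantics described above for a liblustre client that is
     * not currently listening on the network. */
    enum delivery_outcome { DELIVERED, DROPPED, TIMED_OUT };

    static enum delivery_outcome put_outcome(int rx_buffers_posted)
    {
            /* credits aside: a PUT lands only if a receive buffer is
             * posted, otherwise the message is silently lost */
            return rx_buffers_posted > 0 ? DELIVERED : DROPPED;
    }

    static enum delivery_outcome get_outcome(void)
    {
            /* a GET needs the client to serve the source buffer, so
             * against an unresponsive client it can only time out */
            return TIMED_OUT;
    }
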
> <eeb> do you have a formula to compute the number of buffers we need
> to post to prevent messages being lost? How can it be OK for
> arbitrary messages to be lost? If not, how do we discriminate
> between messages we can lose and those we can't?
> <green> Formula is simple - we need to post at least as many buffers
> as there might be messages. Losing messages is not ok, of
> course, but at least lnet server side would be aware of that
> immediately, I presume? And so no 50-second
> timeout. Generally we know that the maximal number of messages to
> be sent to liblustre is not greater than there are targets on
> the server, plus whatever more messages lnet might be sending.
> the only exception is MDS ASTs - if we can know the number of
> peercredits from liblustre - we can set lru_size to that
> number so that MDS won't try to send more cancels than we can
> tolerate
> (I count possible RPCs to an unresponsive client by taking the
> fact that currently there might be only one async rpc from the client to
> every target - this is not taking into account MDS ASTs,
> again. But on MDS we only have one target anyway)
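
Putting green's formula in one place, the buffer count would be roughly the
sum below. Again just a sketch with invented names; the one-async-rpc-per-target
figure is his assumption, and MDS cancel ASTs are capped by clamping lru_size
to the peer credit count:

    /* Sketch of the formula above - invented names, not real Lustre symbols.
     * Buffers liblustre should keep posted so nothing the servers send is
     * dropped while the client is not listening. */
    static int min_posted_buffers(int num_targets, int lru_size)
    {
            int async_rpcs = num_targets;   /* at most one async rpc per target */
            int mds_asts   = lru_size;      /* at most one cancel AST per cached lock */

            return async_rpcs + mds_asts;
    }
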
> <eeb> server-side LNET will only become aware of dropped messages if
> we start using ACKed Cray Portals PUTs
> <green> oh.
> <eeb> But actually I don't think we can design a fix right here on
> IRC - neither of us are fully informed about all the issues
> AFAICS
> <green> well, if messages were dropped - we can make lustre cause a
> reconnect. No messages should be dropped of course ;)
> <eeb> probably we have most of the info between us, but I don't
> think we can simply jump at what look like promising fixes
> <green> eeb: Due to my unawareness of all the underlying stuff, the fix
> actually looks pretty simple to me - as having enough credits
> available ;)
> <eeb> yes - I realise
> <green> what we know for sure: if peercredits is less than the number of
> targets on a server - credits WOULD BE exhausted
> <eeb> I'm really sorry to appear so uncooperative, but I need to
> take this a bit more slowly and devote 100% of my mind to it
> rather than replying "off the cuff" while I'm really busy on
> other things
> <green> eeb: Sure, take your time. I am just trying to make sure you
> understand my idea of things.
> So is there a way to know the peercredits number from liblustre to
> adjust lru_size accordingly?
> <eeb> what is "lru_size"?
> BTW, to help me think about this, can you tell me all the
> places servers can talk to clients when clients are not
> blocking in liblustre?
> <green> this is the amount of unused locks a client is allowed to cache
> unused locks are locks we can get ASTs for.
> so the only places that can send to clients as I see it right
> now - async replies - currently known ones are connect and ping
> replies, all other places wait for replies. And ASTs - can
> happen on MDS only, we can regulate this number with amount of
> locks we hold per client
> thankfully, we do not have any real async i/o ;)
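
So the complete list of unsolicited server-to-client traffic mentioned in
this log appears to be just three cases. The naming below is purely
illustrative, not real Lustre identifiers:

    /* Purely illustrative - not real Lustre identifiers.  The unsolicited
     * server-to-client traffic enumerated above. */
    enum unsolicited_traffic {
            UNSOL_CONNECT_REPLY,    /* async reply to a client connect */
            UNSOL_PING_REPLY,       /* async reply to a pinger ping */
            UNSOL_MDS_CANCEL_AST,   /* MDS lock cancel AST, bounded by lru_size */
    };
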
> <eeb> what do you think we gain by running the pinger on liblustre
> clients?
> <green> currently failover depends on this, I believe. I need to
> refresh my memory on the exact details. I remember that without
> the pinger, failover does not know what targets to choose
> <eeb> does that mean we could disable the pinger until we implement
> liblustre failover correctly (this includes ptllnd fixes)
> <green> no. Cray actually uses failover on some of their
> installations. It is not proper failover, but rather an
> ability to choose between several targets for the same service if
> the main target is down for some reason
> <eeb> I would like braam to be involved in this discussion
> <green> I have no objections
> but this is probably too low-level for him
> eeb: so are we done with this topic or is there anything else
> I can do for you with this issue?
> <eeb> I'll come and hassle you when I have some time...
> <green> ok. I just would like to highlight that this bug is regarded
> as high priority by cray. they expect another system with high
> number of targets per oss to be entered into production soon
> and there they won't be able to use the current workaround of
> disabling the pinger
> <eeb> green: yes - I'm aware of that - it will get my full attention
> RSN
> <green> great. thanks
>
> Cheers,
> Eric
>
> ---------------------------------------------------
> |Eric Barton Barton Software |
> |9 York Gardens Tel: +44 (117) 330 1575 |
> |Clifton Mobile: +44 (7909) 680 356 |
> |Bristol BS8 4LL Fax: call first |
> |United Kingdom E-Mail: eeb@bartonsoftware.com|
> ---------------------------------------------------