Something for recovery experts...

Communications may time out for non-fatal reasons, e.g.:

1. Adaptive timeouts were too aggressive (e.g. if server load has suddenly become extreme).

2. An LNET router has failed but one or more of its peers hasn't detected this yet.

When a Lustre client times out an RPC it sent to a server, it (a) allows pending signals to be delivered (i.e. you can now ^C the process doing the I/O) and (b) tries to reconnect and/or fail over. If it reconnects and confirms that the server has not rebooted, the RPC is resent and may now succeed.

This should work for all "normal" RPCs (i.e. all RPCs apart from ldlm callbacks (ASTs)), since the server knows whether it actually processed the RPC or not and can handle the resent request appropriately.

However, I think there is a problem if the RPC is an ldlm callback. In this case, the Lustre server sends the RPC to the Lustre client and, AFAIK, the request is not resent if it times out. If the request is a blocking AST, the Lustre client isn't notified to clean its cache and cancel locks - and it risks being evicted.

How should this be handled?
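For concreteness, a rough sketch of the client-side path described above; every name here is an illustrative stand-in, not a real ptlrpc symbol:

    enum rpc_status { RPC_OK, RPC_INTERRUPTED, RPC_FAILED };

    struct rpc_request {
            int opc;               /* operation code */
            int interruptible;     /* set once the RPC has timed out */
    };

    /* hypothetical helpers standing in for the ptlrpc/import machinery */
    extern int  send_and_wait(struct rpc_request *req);       /* 0 on success */
    extern int  signal_pending_for(struct rpc_request *req);
    extern int  reconnected_without_reboot(void);             /* 1 = server did not reboot */
    extern void enter_recovery(void);

    static enum rpc_status client_rpc_with_resend(struct rpc_request *req)
    {
            for (;;) {
                    if (send_and_wait(req) == 0)
                            return RPC_OK;

                    /* (a) once the RPC has timed out the wait becomes
                     *     interruptible, so ^C can now kill the I/O */
                    req->interruptible = 1;
                    if (signal_pending_for(req))
                            return RPC_INTERRUPTED;

                    /* (b) reconnect and/or fail over; if the server has not
                     *     rebooted, resend the same request -- for "normal"
                     *     RPCs the server can recognise the resend and reply
                     *     consistently */
                    if (reconnected_without_reboot())
                            continue;

                    enter_recovery();      /* server rebooted: replay/recovery */
                    return RPC_FAILED;
            }
    }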
Andreas has been suggesting re-transmission of these callback (aka AST) RPCs for years. If we think it through carefully, it might be a simple solution.

Peter

On 6/4/08 6:25 AM, "Eric Barton" <eeb at sun.com> wrote:

> However, I think there is a problem if the RPC is an ldlm callback. In
> this case, the Lustre server sends the RPC to the Lustre client and,
> AFAIK, the request is not resent if it times out. If the request is a
> blocking AST, the Lustre client isn't notified to clean its cache and
> cancel locks - and it risks being evicted.
>
> How should this be handled?
> [...]
Andreas Dilger
2008-Jun-04 22:20 UTC
[Lustre-devel] hiding non-fatal communications errors
On Jun 04, 2008 14:17 -0700, Peter J. Braam wrote:
> Andreas has been suggesting re-transmission of these callback (aka AST)
> RPCs for years. If we think it through carefully, it might be a simple
> solution.

Yes, server->client resends, at least to a limited extent, would help in the case of short-term network partitioning or e.g. a suddenly-failed router.

We have some amount of "RPC resend before recovery" support for bulk RPCs in the case of checksum errors - e.g. retry the bulk RPC 5 times on a checksum error before returning an IO error to the application. I suspect this could be adapted to allow a fixed number of retries for server-originated RPCs also. In the case of LDLM blocking callbacks sent to a client, a resend is currently harmless (either the client is already processing the callback, or the lock was cancelled).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
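To make the "fixed number of retries" idea concrete, a minimal sketch of what the server-side loop might look like; the structures and function names are illustrative, not Lustre's real symbols:

    #define AST_MAX_RESENDS 5

    struct client_export_stub { int id; };
    struct ldlm_lock_stub     { struct client_export_stub *exp; };

    /* placeholders for the real send/check/evict paths */
    extern int  send_blocking_ast(struct ldlm_lock_stub *lock);   /* 0 = acknowledged */
    extern int  lock_already_cancelled(struct ldlm_lock_stub *lock);
    extern void evict_client(struct client_export_stub *exp);

    static void server_blocking_ast_with_resend(struct ldlm_lock_stub *lock)
    {
            int tries;

            for (tries = 0; tries < AST_MAX_RESENDS; tries++) {
                    if (send_blocking_ast(lock) == 0)
                            return;         /* client got the callback */

                    /* a resend is harmless: the client is either already
                     * processing the callback or has already cancelled */
                    if (lock_already_cancelled(lock))
                            return;
            }

            evict_client(lock->exp);        /* only after the retry budget runs out */
    }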
> -----Original Message-----
> From: Peter.Braam at Sun.COM [mailto:Peter.Braam at Sun.COM]
> Sent: 04 June 2008 10:17 PM
> To: Eric Barton; 'Lustre Development Mailing List'
> Subject: Re: [Lustre-devel] hiding non-fatal communications errors
>
> Andreas has been suggesting re-transmission of these callback (aka AST)
> RPCs for years. If we think it through carefully, it might be a simple
> solution.

Yes - carefully is the watchword - I suspect lock callback RPCs (aka ASTs) have some fundamentally different properties. Nathan and I seemed to touch on this when we last chatted about related AT issues.

Any volunteers to s/we/me/ ?
Hello!

On Jun 4, 2008, at 6:20 PM, Andreas Dilger wrote:
> I suspect this could be adapted to allow a fixed number of retries for
> server-originated RPCs also. In the case of LDLM blocking callbacks sent
> to a client, a resend is currently harmless (either the client is already
> processing the callback, or the lock was cancelled).

We need to be careful here and decide on a good strategy for when to resend. E.g. a recent case at ORNL (even if a bit pathological) is that they pound through thousands of clients to 4 OSSes via 2 routers. That creates request waiting lists on the OSSes well into the tens of thousands. When we block on a lock and send a blocking AST to the client, it quickly turns around and sends in its data... which lands at the end of our list and takes hundreds of seconds to reach (more than obd_timeout, obviously). No matter how much you resend, it won't help.

Now a good argument is that before we kill such clients (or do any sort of resend), perhaps it makes sense to check the incoming queue to see if there is anything from them? On the other hand, that would be like half of a request scheduler, probably, and with such queues it would take ages, I guess.

BTW, the AT code changes lock waiting from obd_timeout to obd_timeout/2 - why is that? (when AT is disabled). All this is bug 15332.

Or was the resend meant just for the initial RPC where we do not get a confirmation soon? Yes, there it makes sense to retry soon, but the case above still needs to be considered, since currently we do not retry writeouts either, which has just as bad an effect on dirty client caches, and of course all of the above is very true in such cases too. Also, without the LNET patch in 15332, where small messages are prioritized on routers, it is way too easy to time out the AST response because of router congestion, and no amount of resending would help then.

Bye,
    Oleg
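To illustrate the "check the incoming queue before eviction" idea - shown here as a toy linear scan for clarity only; with queues in the tens of thousands you would want per-export lists instead (see later in the thread):

    #include <stdbool.h>
    #include <stddef.h>

    struct incoming_req {
            int                  client_id;
            int                  opc;       /* e.g. a write or a lock cancel */
            struct incoming_req *next;
    };

    /* true if the client already has work (writeouts, cancels) waiting */
    static bool client_has_queued_work(const struct incoming_req *head, int client_id)
    {
            const struct incoming_req *req;

            for (req = head; req != NULL; req = req->next)
                    if (req->client_id == client_id)
                            return true;
            return false;
    }

    /* eviction policy: postpone eviction while the queue shows progress */
    static bool should_evict(const struct incoming_req *head, int client_id)
    {
            return !client_has_queued_work(head, client_id);
    }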
On Jun 4, 2008, at 21:12, Oleg Drokin wrote:
> E.g. a recent case at ORNL (even if a bit pathological) is that they pound
> through thousands of clients to 4 OSSes via 2 routers. That creates request
> waiting lists on the OSSes well into the tens of thousands. When we block on
> a lock and send a blocking AST to the client, it quickly turns around and
> sends in its data... which lands at the end of our list and takes hundreds
> of seconds to reach (more than obd_timeout, obviously). No matter how much
> you resend, it won't help.

This looks like the poster child for adaptive timeouts, although we might need some version of the early margin update patch on 15501. Have you tried enabling AT?

> BTW, the AT code changes lock waiting from obd_timeout to obd_timeout/2 -
> why is that? (when AT is disabled). All this is bug 15332.

Maybe that was done to discourage people from disabling AT? Seriously, though, I don't know why that was changed. Perhaps it was done on b1_6 before AT landed?

robert
Hello!

On Jun 5, 2008, at 12:42 PM, Robert Read wrote:
> This looks like the poster child for adaptive timeouts, although we might
> need some version of the early margin update patch on 15501. Have you
> tried enabling AT?

The problem is that AT does not handle this specific case: there is no way to deliver an "early reply" from a client to a server saying "I am working on it", other than just sending the dirty data - and the dirty data gets into a queue for way too long. There are no timed-out requests; the only thing timing out is the lock that is not cancelled in time.

AT was not tried - this is hard to do at ORNL, as the client side is a Cray XT4 machine, and updating clients is hard. So they are on 1.4.11 of some sort. They can easily update servers, but that won't help, of course.

> Maybe that was done to discourage people from disabling AT? Seriously,
> though, I don't know why that was changed. Perhaps it was done on b1_6
> before AT landed?

hm, indeed. I see this change in 1.6.3.

Bye,
    Oleg
Why can we not send early replies?

On 6/5/08 9:59 AM, "Oleg Drokin" <Oleg.Drokin at Sun.COM> wrote:
> The problem is that AT does not handle this specific case: there is no way
> to deliver an "early reply" from a client to a server saying "I am working
> on it", other than just sending the dirty data - and the dirty data gets
> into a queue for way too long. There are no timed-out requests; the only
> thing timing out is the lock that is not cancelled in time.
> [...]
On 6/5/08 9:42 AM, "Robert Read" <rread at sun.com> wrote:
> On Jun 4, 2008, at 21:12, Oleg Drokin wrote:
>> E.g. a recent case at ORNL (even if a bit pathological) is that they pound
>> through thousands of clients to 4 OSSes via 2 routers. That creates request
>> waiting lists on the OSSes well into the tens of thousands. When we block on
>> a lock and send a blocking AST to the client, it quickly turns around and
>> sends in its data... which lands at the end of our list and takes hundreds
>> of seconds to reach (more than obd_timeout, obviously). No matter how much
>> you resend, it won't help.

I think this is an SNS issue. Eric?

Peter
Hello!

Because there is no way to deliver them. We send our first acknowledgement of AST reception and it is delivered fast - that is the reply. What is left is to send the actual dirty data and then the cancel request. These are not replies, but stand-alone client-generated RPCs, and we cannot cancel locks while dirty data is not flushed. Just inventing some sort of ldlm "I am still alive" RPC to send periodically instead of cancels is dangerous - the data-sending part could be wedged for unrelated reasons, not only because of contention but due to some client problem, and if we prolong locks by other means, that could potentially wedge all access to that part of the file forever. And the dirty data itself takes too long to get to actual server processing.

One of the solutions here is the request scheduler, or some stand-alone part of it that could peek early into RPCs as they arrive, so that when the decision is being made about client eviction, we can quickly see what is in the queue from that client and perhaps, based on this data, postpone the eviction. This was discussed on the ORNL call. Andreas said that AT already looks into incoming RPCs before processing, to get an idea of expected service times; perhaps it would not be too hard to add some logic that would link requests to the exports they came from, for further analysis if the need arises.

Bye,
    Oleg

On Jun 5, 2008, at 11:29 PM, Peter Braam wrote:
> Why can we not send early replies?
> [...]
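For reference, the client-side sequence described above, sketched with illustrative names (not the real ldlm/osc entry points):

    struct lock_stub { int handle; };

    /* placeholders for the real reply/flush/cancel paths */
    extern void reply_to_blocking_ast(struct lock_stub *lock);          /* small, fast */
    extern int  flush_dirty_pages_under_lock(struct lock_stub *lock);   /* bulk writes */
    extern void send_lock_cancel(struct lock_stub *lock);               /* the cancel RPC */

    static int client_handle_blocking_ast(struct lock_stub *lock)
    {
            /* 1. Acknowledge the AST right away -- this is the only reply
             *    the server sees until the cancel finally arrives. */
            reply_to_blocking_ast(lock);

            /* 2. Flush the dirty data covered by the lock.  These are
             *    stand-alone RPCs that queue behind everyone else's work
             *    on a busy server. */
            if (flush_dirty_pages_under_lock(lock) != 0)
                    return -1;

            /* 3. Only then may the lock be cancelled. */
            send_lock_cancel(lock);
            return 0;
    }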
Ah yes. So monitoring progress is the only thing we can do, and with SNS you will be able to get that information long before the request is being handled.

Peter

On 6/5/08 8:38 PM, "Oleg Drokin" <Oleg.Drokin at Sun.COM> wrote:
> One of the solutions here is the request scheduler, or some stand-alone
> part of it that could peek early into RPCs as they arrive, so that when
> the decision is being made about client eviction, we can quickly see what
> is in the queue from that client and perhaps, based on this data, postpone
> the eviction.
> [...]
Andreas Dilger
2008-Jun-06 04:41 UTC
[Lustre-devel] hiding non-fatal communications errors
On Jun 05, 2008 20:40 -0700, Peter J. Braam wrote:
> Ah yes. So monitoring progress is the only thing we can do, and with SNS
> you will be able to get that information long before the request is being
> handled.

You mean NRS, instead of SNS, right?

> On 6/5/08 8:38 PM, "Oleg Drokin" <Oleg.Drokin at Sun.COM> wrote:
>> Andreas said that AT already looks into incoming RPCs before processing,
>> to get an idea of expected service times; perhaps it would not be too
>> hard to add some logic that would link requests to the exports they came
>> from, for further analysis if the need arises.

I think hooking the requests into the exports at arrival time is fairly straightforward, and is an easy first step toward implementing the NRS.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
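A minimal illustration of that hook, with hypothetical names and a stub list implementation standing in for the kernel's:

    struct list_node { struct list_node *next, *prev; };

    struct export_stub {
            struct list_node queued_reqs;   /* all pending requests from this client */
    };

    struct request_stub {
            struct export_stub *exp;        /* who sent it */
            struct list_node    exp_link;   /* membership in exp->queued_reqs */
    };

    /* call on exp->queued_reqs when the export is created */
    static void list_init(struct list_node *head)
    {
            head->next = head;
            head->prev = head;
    }

    static void list_add_tail_node(struct list_node *node, struct list_node *head)
    {
            node->prev = head->prev;
            node->next = head;
            head->prev->next = node;
            head->prev = node;
    }

    /* called from the equivalent of ptlrpc_server_handle_req_in(), which
     * already peeks at the request header for adaptive timeouts */
    static void hook_req_into_export(struct request_stub *req)
    {
            list_add_tail_node(&req->exp_link, &req->exp->queued_reqs);
    }

The eviction code can then walk a short per-client list instead of scanning a service queue that is tens of thousands of entries long.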
Oleg's comments about congestion and the ORNL discussions I've been involved in are effectively presenting arguments for allowing expedited communications. This is possible but comes at a cost.

The "proper" implementation effectively holds an uncongested network in reserve for expedited communications. That's a high price to pay because it pretty well means doubling up all the LNET state - twice the number of queues/sockets/queuepairs/connections. That's unavoidable since we're using these structures for backpressure, and once they're "full" you can only bypass them with an additional connection.

You can go a long way towards the ideal by allowing prioritised traffic to "overtake" everywhere apart from the wire - i.e. all packets serialise once they have been passed to the comms APIs below the LNDs, but take priority within the LNDs, LNET (including routers) and ptlrpc.

It's hard to say without further thought and/or experiment whether either of these alternatives actually solves the problem in all envisaged use cases and doesn't just shift it elsewhere. For example, even the "proper" implementation could end up with a logjam on both the low- and high-priority networks in pathological use cases. And I'm not ready to believe that increasing the number of priority levels can add anything fundamental to the argument.

I think our best bet is to find a way to keep congestion to a minimum in the first place, so that peer ping latency in a single-priority network can be bounded and kept relatively short (seconds, not minutes). Unfortunately, the current algorithms for exploiting network and disk bandwidth are unbelievably simplistic and invite congestion. Increasing the number of service threads until performance levels off completely ignores the issue of service latency. Allowing a single client to post sufficient traffic to max out the network is fine when it's the only one, but mad when it's one of 100000. We're tuning systems to the point of instability, so of course timeouts have to become unmanageably long.

Scheduling can be a subtle problem where "obvious" solutions can have non-obvious consequences. But it might be a start to give servers more dynamic control over the number of concurrent requests individual clients are allowed to submit, so that when many clients are active, individual clients submit only one RPC at a time, and when few clients are active, concurrency on these clients can increase.

Cheers,
    Eric
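As a strawman for that last paragraph, the per-client credit could simply be derived from the number of currently active clients; the constants below are assumptions for illustration only, not tuned values:

    #define SERVER_TOTAL_SLOTS 512   /* assumed total request slots on the server */
    #define MAX_RPCS_IN_FLIGHT   8   /* per-client ceiling when the server is idle */

    static int per_client_credit(int active_clients)
    {
            int credit;

            if (active_clients <= 0)
                    return MAX_RPCS_IN_FLIGHT;

            credit = SERVER_TOTAL_SLOTS / active_clients;
            if (credit > MAX_RPCS_IN_FLIGHT)
                    credit = MAX_RPCS_IN_FLIGHT;
            if (credit < 1)
                    credit = 1;       /* 100000 active clients -> one RPC at a time */

            return credit;
    }

The server would hand the current credit back to each client (e.g. piggy-backed on replies) so clients throttle themselves before the network congests.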
Sorry yes, network request scheduling - which is, btw, the most basic instance of a secondary resource management protocol, as Eric described in his post.

Peter

On 6/5/08 10:41 PM, "Andreas Dilger" <adilger at sun.com> wrote:
> You mean NRS, instead of SNS, right?
> [...]
> I think hooking the requests into the exports at arrival time is fairly
> straightforward, and is an easy first step toward implementing the NRS.
Nathaniel Rutman
2008-Jun-19 20:24 UTC
[Lustre-devel] hiding non-fatal communications errors
Eric Barton wrote:
> Oleg's comments about congestion and the ORNL discussions I've been
> involved in are effectively presenting arguments for allowing expedited
> communications. This is possible but comes at a cost.
>
> The "proper" implementation effectively holds an uncongested network in
> reserve for expedited communications. That's a high price to pay because
> it pretty well means doubling up all the LNET state - twice the number of
> queues/sockets/queuepairs/connections.
> [...]

That's assuming network congestion is the cause of the lock timeout. What if the server disk is busy doing who knows what, and the client's cache-flush RPCs are all sitting on the server in the request queue, just waiting for some disk time? Furthermore, assume that a bunch of other clients are all doing the same thing, so that we can't simply prioritize this client's RPCs over everybody else's.

I think the method suggested by Oleg has the most potential in this case: "sniff" the incoming RPCs to see if they are cache flushes, and do not decide to evict those clients until after those RPCs have been processed. As mentioned, we already sniff the incoming requests to check adaptive timeout deadlines (ptlrpc_server_handle_req_in).

One further thing I would like to do is respond to "easy" RPCs immediately (in a reserved thread). "Easy" would certainly include pings, and maybe others that have no disk access. This would allow us to free up LNET buffers and other resources, prevent us from evicting clients "we haven't heard from in X seconds" (although I just realized we could fix that right now in ptlrpc_server_handle_req_in), and more quickly determine network and server loading remotely.
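A toy sketch of that "easy RPC" fast path; the opcode value and helper names are made up for illustration and this is not the real ptlrpc code:

    enum { OPC_PING = 400 };            /* illustrative opcode value only */

    struct req_stub { int opc; };

    /* placeholders for the real reply/enqueue paths */
    extern void send_reply_now(struct req_stub *req);              /* no disk access needed */
    extern void enqueue_for_service_threads(struct req_stub *req);

    static int req_is_easy(const struct req_stub *req)
    {
            return req->opc == OPC_PING;    /* could grow to other diskless ops */
    }

    /* run in a reserved thread as each request arrives */
    static void handle_req_in(struct req_stub *req)
    {
            if (req_is_easy(req)) {
                    /* frees the LNET buffer early and refreshes the
                     * "last heard from this client" timestamp */
                    send_reply_now(req);
                    return;
            }
            enqueue_for_service_threads(req);
    }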