Something for recovery experts...

Communications may time out for non-fatal reasons, e.g.:

1. Adaptive timeouts were too aggressive (e.g. if server load has suddenly become extreme).

2. An LNET router has failed but one or more of its peers hasn't detected this yet.

When a Lustre client times out an RPC it sent to a server, it (a) allows pending signals to be delivered (i.e. you can now ^C the process doing the I/O) and (b) tries to reconnect and/or fail over. If it reconnects and confirms that the server has not rebooted, the RPC is resent and may now succeed.

This should work for all "normal" RPCs (i.e. all RPCs apart from ldlm callbacks (ASTs)), since the server knows whether it actually processed the RPC or not and can handle the resent request appropriately.

However, I think there is a problem if the RPC is an ldlm callback. In this case, the Lustre server sends the RPC to the Lustre client and, AFAIK, the request is not resent if it times out. If the request is a blocking AST, the Lustre client isn't notified to clean its cache and cancel locks - and it risks being evicted.

How should this be handled?
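For concreteness, a rough sketch of the client-side path described above; every name here is an illustrative stand-in, not a real ptlrpc symbol:

    enum rpc_status { RPC_OK, RPC_INTERRUPTED, RPC_FAILED };

    struct rpc_request {
            int opc;               /* operation code */
            int interruptible;     /* set once the RPC has timed out */
    };

    /* hypothetical helpers standing in for the ptlrpc/import machinery */
    extern int  send_and_wait(struct rpc_request *req);       /* 0 on success */
    extern int  signal_pending_for(struct rpc_request *req);
    extern int  reconnected_without_reboot(void);             /* 1 = server did not reboot */
    extern void enter_recovery(void);

    static enum rpc_status client_rpc_with_resend(struct rpc_request *req)
    {
            for (;;) {
                    if (send_and_wait(req) == 0)
                            return RPC_OK;

                    /* (a) once the RPC has timed out the wait becomes
                     *     interruptible, so ^C can now kill the I/O */
                    req->interruptible = 1;
                    if (signal_pending_for(req))
                            return RPC_INTERRUPTED;

                    /* (b) reconnect and/or fail over; if the server has not
                     *     rebooted, resend the same request -- for "normal"
                     *     RPCs the server can recognise the resend and reply
                     *     consistently */
                    if (reconnected_without_reboot())
                            continue;

                    enter_recovery();      /* server rebooted: replay/recovery */
                    return RPC_FAILED;
            }
    }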
Andreas has been suggesting re-transmission of these callback (aka AST) RPCs for years. If we think it through carefully, it might be a simple solution.

Peter

On 6/4/08 6:25 AM, "Eric Barton" <eeb at sun.com> wrote:

> However, I think there is a problem if the RPC is an ldlm callback. In
> this case, the Lustre server sends the RPC to the Lustre client and,
> AFAIK, the request is not resent if it times out. If the request is a
> blocking AST, the Lustre client isn't notified to clean its cache and
> cancel locks - and it risks being evicted.
>
> How should this be handled?
> [...]
Andreas Dilger
2008-Jun-04 22:20 UTC
[Lustre-devel] hiding non-fatal communications errors
On Jun 04, 2008 14:17 -0700, Peter J. Braam wrote:
> Andreas has been suggesting re-transmission of these callback (aka AST)
> RPCs for years. If we think it through carefully, it might be a simple
> solution.

Yes, server->client resends, at least to a limited extent, would help in the case of short-term network partitioning or e.g. a suddenly-failed router.

We have some amount of "RPC resend before recovery" support for bulk RPCs in the case of checksum errors - e.g. retry the bulk RPC 5 times on a checksum error before returning an IO error to the application. I suspect this could be adapted to allow a fixed number of retries for server-originated RPCs also. In the case of LDLM blocking callbacks sent to a client, a resend is currently harmless (either the client is already processing the callback, or the lock was cancelled).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
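To make the "fixed number of retries" idea concrete, a minimal sketch of what the server-side loop might look like; the structures and function names are illustrative, not Lustre's real symbols:

    #define AST_MAX_RESENDS 5

    struct client_export_stub { int id; };
    struct ldlm_lock_stub     { struct client_export_stub *exp; };

    /* placeholders for the real send/check/evict paths */
    extern int  send_blocking_ast(struct ldlm_lock_stub *lock);   /* 0 = acknowledged */
    extern int  lock_already_cancelled(struct ldlm_lock_stub *lock);
    extern void evict_client(struct client_export_stub *exp);

    static void server_blocking_ast_with_resend(struct ldlm_lock_stub *lock)
    {
            int tries;

            for (tries = 0; tries < AST_MAX_RESENDS; tries++) {
                    if (send_blocking_ast(lock) == 0)
                            return;         /* client got the callback */

                    /* a resend is harmless: the client is either already
                     * processing the callback or has already cancelled */
                    if (lock_already_cancelled(lock))
                            return;
            }

            evict_client(lock->exp);        /* only after the retry budget runs out */
    }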
> -----Original Message-----
> From: Peter.Braam at Sun.COM [mailto:Peter.Braam at Sun.COM]
> Sent: 04 June 2008 10:17 PM
> To: Eric Barton; 'Lustre Development Mailing List'
> Subject: Re: [Lustre-devel] hiding non-fatal communications errors
>
> Andreas has been suggesting re-transmission of these callback (aka AST)
> RPCs for years. If we think it through carefully, it might be a simple
> solution.

Yes - carefully is the watchword - I suspect lock callback RPCs (aka ASTs) have some fundamentally different properties. Nathan and I seemed to touch on this when we last chatted about related AT issues.

Any volunteers to s/we/me/ ?
Hello!

On Jun 4, 2008, at 6:20 PM, Andreas Dilger wrote:
> I suspect this could be adapted to allow a fixed number of retries for
> server-originated RPCs also. In the case of LDLM blocking callbacks sent
> to a client, a resend is currently harmless (either the client is already
> processing the callback, or the lock was cancelled).

We need to be careful here and decide on a good strategy for when to resend. E.g. a recent case at ORNL (even if a bit pathological) is that they pound through thousands of clients to 4 OSSes via 2 routers. That creates request waiting lists on the OSSes well into the tens of thousands. When we block on a lock and send a blocking AST to the client, it quickly turns around and sends in its data... which lands at the end of our list and takes hundreds of seconds to reach (more than obd_timeout, obviously). No matter how much you resend, it won't help.

Now a good argument is that before we kill such clients (or do any sort of resend), perhaps it makes sense to check the incoming queue to see if there is anything from them? On the other hand, that would be like half of a request scheduler, probably, and with such queues it would take ages, I guess.

BTW, the AT code changes lock waiting from obd_timeout to obd_timeout/2 - why is that? (when AT is disabled). All this is bug 15332.

Or was the resend meant just for the initial RPC where we do not get a confirmation soon? Yes, there it makes sense to retry soon, but the case above still needs to be considered, since currently we do not retry writeouts either, which has just as bad an effect on dirty client caches, and of course all of the above is very true in such cases too. Also, without the LNET patch in 15332, where small messages are prioritized on routers, it is way too easy to time out the AST response because of router congestion, and no amount of resending would help then.

Bye,
    Oleg
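To illustrate the "check the incoming queue before eviction" idea - shown here as a toy linear scan for clarity only; with queues in the tens of thousands you would want per-export lists instead (see later in the thread):

    #include <stdbool.h>
    #include <stddef.h>

    struct incoming_req {
            int                  client_id;
            int                  opc;       /* e.g. a write or a lock cancel */
            struct incoming_req *next;
    };

    /* true if the client already has work (writeouts, cancels) waiting */
    static bool client_has_queued_work(const struct incoming_req *head, int client_id)
    {
            const struct incoming_req *req;

            for (req = head; req != NULL; req = req->next)
                    if (req->client_id == client_id)
                            return true;
            return false;
    }

    /* eviction policy: postpone eviction while the queue shows progress */
    static bool should_evict(const struct incoming_req *head, int client_id)
    {
            return !client_has_queued_work(head, client_id);
    }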
On Jun 4, 2008, at 21:12, Oleg Drokin wrote:
> E.g. a recent case at ORNL (even if a bit pathological) is that they pound
> through thousands of clients to 4 OSSes via 2 routers. That creates request
> waiting lists on the OSSes well into the tens of thousands. When we block on
> a lock and send a blocking AST to the client, it quickly turns around and
> sends in its data... which lands at the end of our list and takes hundreds
> of seconds to reach (more than obd_timeout, obviously). No matter how much
> you resend, it won't help.

This looks like the poster child for adaptive timeouts, although we might need some version of the early margin update patch on 15501. Have you tried enabling AT?

> BTW, the AT code changes lock waiting from obd_timeout to obd_timeout/2 -
> why is that? (when AT is disabled). All this is bug 15332.

Maybe that was done to discourage people from disabling AT? Seriously, though, I don't know why that was changed. Perhaps it was done on b1_6 before AT landed?

robert
Hello!

On Jun 5, 2008, at 12:42 PM, Robert Read wrote:
> This looks like the poster child for adaptive timeouts, although we might
> need some version of the early margin update patch on 15501. Have you
> tried enabling AT?

The problem is that AT does not handle this specific case: there is no way to deliver an "early reply" from a client to a server saying "I am working on it", other than just sending the dirty data - and the dirty data gets into a queue for way too long. There are no timed-out requests; the only thing timing out is the lock that is not cancelled in time.

AT was not tried - this is hard to do at ORNL, as the client side is a Cray XT4 machine, and updating clients is hard. So they are on 1.4.11 of some sort. They can easily update servers, but that won't help, of course.

> Maybe that was done to discourage people from disabling AT? Seriously,
> though, I don't know why that was changed. Perhaps it was done on b1_6
> before AT landed?

hm, indeed. I see this change in 1.6.3.

Bye,
    Oleg
Why can we not send early replies?

On 6/5/08 9:59 AM, "Oleg Drokin" <Oleg.Drokin at Sun.COM> wrote:
> The problem is that AT does not handle this specific case: there is no way
> to deliver an "early reply" from a client to a server saying "I am working
> on it", other than just sending the dirty data - and the dirty data gets
> into a queue for way too long. There are no timed-out requests; the only
> thing timing out is the lock that is not cancelled in time.
> [...]
On 6/5/08 9:42 AM, "Robert Read" <rread at sun.com> wrote:
> On Jun 4, 2008, at 21:12, Oleg Drokin wrote:
>> E.g. a recent case at ORNL (even if a bit pathological) is that they pound
>> through thousands of clients to 4 OSSes via 2 routers. That creates request
>> waiting lists on the OSSes well into the tens of thousands. When we block on
>> a lock and send a blocking AST to the client, it quickly turns around and
>> sends in its data... which lands at the end of our list and takes hundreds
>> of seconds to reach (more than obd_timeout, obviously). No matter how much
>> you resend, it won't help.

I think this is an SNS issue. Eric?

Peter
Hello!

Because there is no way to deliver them. We send our first acknowledgement of AST reception and it is delivered fast - that is the reply. What is left is to send the actual dirty data and then the cancel request. These are not replies, but stand-alone client-generated RPCs, and we cannot cancel locks while dirty data is not flushed. Just inventing some sort of ldlm "I am still alive" RPC to send periodically instead of cancels is dangerous - the data-sending part could be wedged for unrelated reasons, not only because of contention but due to some client problem, and if we prolong locks by other means, that could potentially wedge all access to that part of the file forever. And the dirty data itself takes too long to get to actual server processing.

One of the solutions here is the request scheduler, or some stand-alone part of it that could peek early into RPCs as they arrive, so that when the decision is being made about client eviction, we can quickly see what is in the queue from that client and perhaps, based on this data, postpone the eviction. This was discussed on the ORNL call. Andreas said that AT already looks into incoming RPCs before processing, to get an idea of expected service times; perhaps it would not be too hard to add some logic that would link requests to the exports they came from, for further analysis if the need arises.

Bye,
    Oleg

On Jun 5, 2008, at 11:29 PM, Peter Braam wrote:
> Why can we not send early replies?
> [...]
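For reference, the client-side sequence described above, sketched with illustrative names (not the real ldlm/osc entry points):

    struct lock_stub { int handle; };

    /* placeholders for the real reply/flush/cancel paths */
    extern void reply_to_blocking_ast(struct lock_stub *lock);          /* small, fast */
    extern int  flush_dirty_pages_under_lock(struct lock_stub *lock);   /* bulk writes */
    extern void send_lock_cancel(struct lock_stub *lock);               /* the cancel RPC */

    static int client_handle_blocking_ast(struct lock_stub *lock)
    {
            /* 1. Acknowledge the AST right away -- this is the only reply
             *    the server sees until the cancel finally arrives. */
            reply_to_blocking_ast(lock);

            /* 2. Flush the dirty data covered by the lock.  These are
             *    stand-alone RPCs that queue behind everyone else's work
             *    on a busy server. */
            if (flush_dirty_pages_under_lock(lock) != 0)
                    return -1;

            /* 3. Only then may the lock be cancelled. */
            send_lock_cancel(lock);
            return 0;
    }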
Ah yes. So monitoring progress is the only thing we can do, and with SNS you will be able to get that information long before the request is being handled.

Peter

On 6/5/08 8:38 PM, "Oleg Drokin" <Oleg.Drokin at Sun.COM> wrote:
> One of the solutions here is the request scheduler, or some stand-alone
> part of it that could peek early into RPCs as they arrive, so that when
> the decision is being made about client eviction, we can quickly see what
> is in the queue from that client and perhaps, based on this data, postpone
> the eviction.
> [...]
Andreas Dilger
2008-Jun-06 04:41 UTC
[Lustre-devel] hiding non-fatal communications errors
On Jun 05, 2008 20:40 -0700, Peter J. Braam wrote:
> Ah yes. So monitoring progress is the only thing we can do, and with SNS
> you will be able to get that information long before the request is being
> handled.

You mean NRS, instead of SNS, right?

> On 6/5/08 8:38 PM, "Oleg Drokin" <Oleg.Drokin at Sun.COM> wrote:
>> Andreas said that AT already looks into incoming RPCs before processing,
>> to get an idea of expected service times; perhaps it would not be too
>> hard to add some logic that would link requests to the exports they came
>> from, for further analysis if the need arises.

I think hooking the requests into the exports at arrival time is fairly straightforward, and is an easy first step toward implementing the NRS.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
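A minimal illustration of that hook, with hypothetical names and a stub list implementation standing in for the kernel's:

    struct list_node { struct list_node *next, *prev; };

    struct export_stub {
            struct list_node queued_reqs;   /* all pending requests from this client */
    };

    struct request_stub {
            struct export_stub *exp;        /* who sent it */
            struct list_node    exp_link;   /* membership in exp->queued_reqs */
    };

    /* call on exp->queued_reqs when the export is created */
    static void list_init(struct list_node *head)
    {
            head->next = head;
            head->prev = head;
    }

    static void list_add_tail_node(struct list_node *node, struct list_node *head)
    {
            node->prev = head->prev;
            node->next = head;
            head->prev->next = node;
            head->prev = node;
    }

    /* called from the equivalent of ptlrpc_server_handle_req_in(), which
     * already peeks at the request header for adaptive timeouts */
    static void hook_req_into_export(struct request_stub *req)
    {
            list_add_tail_node(&req->exp_link, &req->exp->queued_reqs);
    }

The eviction code can then walk a short per-client list instead of scanning a service queue that is tens of thousands of entries long.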
Oleg's comments about congestion and the ORNL discussions I've been involved in are effectively presenting arguments for allowing expedited communications. This is possible but comes at a cost.

The "proper" implementation effectively holds an uncongested network in reserve for expedited communications. That's a high price to pay because it pretty well means doubling up all the LNET state - twice the number of queues/sockets/queuepairs/connections. That's unavoidable since we're using these structures for backpressure, and once they're "full" you can only bypass them with an additional connection.

You can go a long way towards the ideal by allowing prioritised traffic to "overtake" everywhere apart from the wire - i.e. all packets serialise once they have been passed to the comms APIs below the LNDs, but take priority within the LNDs, LNET (including routers) and ptlrpc.

It's hard to say without further thought and/or experiment whether either of these alternatives actually solves the problem in all envisaged use cases and doesn't just shift it elsewhere. For example, even the "proper" implementation could end up with a logjam on both the low- and high-priority networks in pathological use cases. And I'm not ready to believe that increasing the number of priority levels can add anything fundamental to the argument.

I think our best bet is to find a way to keep congestion to a minimum in the first place, so that peer ping latency in a single-priority network can be bounded and kept relatively short (seconds, not minutes). Unfortunately, the current algorithms for exploiting network and disk bandwidth are unbelievably simplistic and invite congestion. Increasing the number of service threads until performance levels off completely ignores the issue of service latency. Allowing a single client to post sufficient traffic to max out the network is fine when it's the only one, but mad when it's one of 100000. We're tuning systems to the point of instability, so of course timeouts have to become unmanageably long.

Scheduling can be a subtle problem where "obvious" solutions can have non-obvious consequences. But it might be a start to give servers more dynamic control over the number of concurrent requests individual clients are allowed to submit, so that when many clients are active, individual clients submit only one RPC at a time, and when few clients are active, concurrency on these clients can increase.

Cheers,
    Eric
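As a strawman for that last paragraph, the per-client credit could simply be derived from the number of currently active clients; the constants below are assumptions for illustration only, not tuned values:

    #define SERVER_TOTAL_SLOTS 512   /* assumed total request slots on the server */
    #define MAX_RPCS_IN_FLIGHT   8   /* per-client ceiling when the server is idle */

    static int per_client_credit(int active_clients)
    {
            int credit;

            if (active_clients <= 0)
                    return MAX_RPCS_IN_FLIGHT;

            credit = SERVER_TOTAL_SLOTS / active_clients;
            if (credit > MAX_RPCS_IN_FLIGHT)
                    credit = MAX_RPCS_IN_FLIGHT;
            if (credit < 1)
                    credit = 1;       /* 100000 active clients -> one RPC at a time */

            return credit;
    }

The server would hand the current credit back to each client (e.g. piggy-backed on replies) so clients throttle themselves before the network congests.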
Sorry yes, network request scheduling - which is, btw, the most basic instance of a secondary resource management protocol, as Eric described in his post.

Peter

On 6/5/08 10:41 PM, "Andreas Dilger" <adilger at sun.com> wrote:
> You mean NRS, instead of SNS, right?
> [...]
> I think hooking the requests into the exports at arrival time is fairly
> straightforward, and is an easy first step toward implementing the NRS.
Nathaniel Rutman
2008-Jun-19 20:24 UTC
[Lustre-devel] hiding non-fatal communications errors
Eric Barton wrote:
> Oleg's comments about congestion and the ORNL discussions I've been
> involved in are effectively presenting arguments for allowing expedited
> communications. This is possible but comes at a cost.
>
> The "proper" implementation effectively holds an uncongested network in
> reserve for expedited communications. That's a high price to pay because
> it pretty well means doubling up all the LNET state - twice the number of
> queues/sockets/queuepairs/connections.
> [...]

That's assuming network congestion is the cause of the lock timeout. What if the server disk is busy doing who knows what, and the client's cache-flush RPCs are all sitting on the server in the request queue, just waiting for some disk time? Furthermore, assume that a bunch of other clients are all doing the same thing, so that we can't simply prioritize this client's RPCs over everybody else's.

I think the method suggested by Oleg has the most potential in this case: "sniff" the incoming RPCs to see if they are cache flushes, and do not decide to evict those clients until after those RPCs have been processed. As mentioned, we already sniff the incoming requests to check adaptive timeout deadlines (ptlrpc_server_handle_req_in).

One further thing I would like to do is respond to "easy" RPCs immediately (in a reserved thread). "Easy" would certainly include pings, and maybe others that have no disk access. This would allow us to free up LNET buffers and other resources, prevent us from evicting clients "we haven't heard from in X seconds" (although I just realized we could fix that right now in ptlrpc_server_handle_req_in), and more quickly determine network and server loading remotely.
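A toy sketch of that "easy RPC" fast path; the opcode value and helper names are made up for illustration and this is not the real ptlrpc code:

    enum { OPC_PING = 400 };            /* illustrative opcode value only */

    struct req_stub { int opc; };

    /* placeholders for the real reply/enqueue paths */
    extern void send_reply_now(struct req_stub *req);              /* no disk access needed */
    extern void enqueue_for_service_threads(struct req_stub *req);

    static int req_is_easy(const struct req_stub *req)
    {
            return req->opc == OPC_PING;    /* could grow to other diskless ops */
    }

    /* run in a reserved thread as each request arrives */
    static void handle_req_in(struct req_stub *req)
    {
            if (req_is_easy(req)) {
                    /* frees the LNET buffer early and refreshes the
                     * "last heard from this client" timestamp */
                    send_reply_now(req);
                    return;
            }
            enqueue_for_service_threads(req);
    }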