Jeremy Filizetti
2010-Apr-30 02:59 UTC
[Lustre-devel] question about ldlm_server_glimpse_ast
In our Lustre WAN environment, a few times we've had a link drop for an extended period of time, which causes problems on systems accessing data in the same directory as the remote system that becomes unavailable. Our OSSes seem to be stuck in a loop of ptlrpc_queue_wait called from ldlm_server_glimpse_ast. The remote site is accessed through an LNet router which is still available. The OSS successfully resends requests to the router every 7 seconds, but each attempt subsequently times out, which causes it to loop in ptlrpc_queue_wait.

Looking over the ldlm_server_blocking_ast and ldlm_server_completion_ast functions I see they set rq_no_resend = 1, but ldlm_server_glimpse_ast does not. I'm not familiar with the locking in Lustre; is there a reason that ldlm_server_glimpse_ast doesn't set rq_no_resend = 1? This would get rid of the loop ptlrpc_queue_wait is stuck in until the client comes back, but I'm not sure whether it would have other unexpected consequences.

Jeremy
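To make the failure mode concrete, here is a toy model of the loop described above. Everything in it is illustrative rather than actual Lustre code; only the rq_no_resend semantics and the 7-second resend cadence come from this thread. With resends allowed, the caller stays blocked for as long as the peer is down; with no_resend set, it fails fast after a single timeout:

#include <stdbool.h>
#include <stdio.h>

struct toy_request {
        bool no_resend;         /* models ptlrpc_request.rq_no_resend */
        int  resend_interval;   /* seconds between attempts */
};

/* Models ptlrpc_queue_wait: retry until the peer answers or, if
 * no_resend is set, give up after the first timed-out attempt. */
static int toy_queue_wait(struct toy_request *req, int peer_down_secs)
{
        int waited = 0;

        for (;;) {
                waited += req->resend_interval;
                if (waited >= peer_down_secs)
                        return 0;       /* peer came back and answered */
                if (req->no_resend)
                        return -1;      /* timed out, no retry */
                /* otherwise: resend and keep the caller blocked here */
        }
}

int main(void)
{
        struct toy_request resend   = { .no_resend = false, .resend_interval = 7 };
        struct toy_request one_shot = { .no_resend = true,  .resend_interval = 7 };

        /* Peer unreachable for 600 (simulated) seconds: the resending
         * request spins for the whole outage, the one-shot fails fast. */
        printf("resend:   rc = %d\n", toy_queue_wait(&resend, 600));
        printf("one-shot: rc = %d\n", toy_queue_wait(&one_shot, 600));
        return 0;
}

The trade-off, as the follow-ups below show, is what the server does with that fast failure: treating it as a dead client leads straight to eviction.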
On 04/29/2010 09:59 PM, Jeremy Filizetti wrote:
> In our Lustre WAN environment, a few times we've had a link drop for an
> extended period of time, which causes problems on systems accessing
> data in the same directory as the remote system that becomes
> unavailable. Our OSSes seem to be stuck in a loop of ptlrpc_queue_wait
> called from ldlm_server_glimpse_ast. The remote site is accessed
> through an LNet router which is still available. The OSS successfully
> resends requests to the router every 7 seconds, but each attempt
> subsequently times out, which causes it to loop in ptlrpc_queue_wait.
>
> Looking over the ldlm_server_blocking_ast and ldlm_server_completion_ast
> functions I see they set rq_no_resend = 1, but ldlm_server_glimpse_ast
> does not. I'm not familiar with the locking in Lustre; is there a
> reason that ldlm_server_glimpse_ast doesn't set rq_no_resend = 1? This
> would get rid of the loop ptlrpc_queue_wait is stuck in until the
> client comes back, but I'm not sure whether it would have other
> unexpected consequences.

We have the same issue at TACC, and there is a bugzilla entry:

https://bugzilla.lustre.org/show_bug.cgi?id=21937

I tested a patch which set rq_no_resend = 1 for glimpses, and found that clients only had about 6 seconds to reply before eviction. Since eviction creates the possibility of data loss, a 6 second timeout was deemed too short for production. (With the patch applied, it was easy for me to create cases where data was indeed lost.) I was also able to observe some file consistency issues which lasted for a few seconds after eviction, as well as a failure of the file operations on the evicted client to return an error.

See also: https://bugzilla.lustre.org/show_bug.cgi?id=22360

-John

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu (512) 471-9304
Hello!

On Apr 30, 2010, at 9:00 AM, John Hammond wrote:
> I tested a patch which set rq_no_resend = 1 for glimpses, and found that
> clients only had about 6 seconds to reply before eviction. Since
> eviction creates the possibility of data loss, a 6 second timeout was
> deemed too short for production. (With the patch applied, it was easy
> for me to create cases where data was indeed lost.) I was also able to

Please note that the 6 second timeout is in fact the common ldlm_timeout, and it's not just glimpses that are bound by this value. Any ldlm callbacks are required to reply within this time, so if your network can have delays of more than this, you need to consider increasing the ldlm_timeout value (/proc/sys/lustre/ldlm_timeout).

On the other hand, if you have a packet loss issue, then even if resending of glimpse ASTs were present, we don't currently resend other ASTs, so the situation still has the potential for evictions with subsequent possible data loss.

Bye,
Oleg
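For anyone scripting the tuning Oleg describes, here is a trivial user-space sketch, equivalent to echoing a value into the proc file he names. The default of 40 below is an arbitrary example, not a recommendation; pick something larger than your worst-case round-trip delay:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        /* Proc path from Oleg's mail; run as root on the server. */
        const char *path = "/proc/sys/lustre/ldlm_timeout";
        const char *val  = argc > 1 ? argv[1] : "40";   /* example value */
        FILE *f = fopen(path, "w");

        if (f == NULL) {
                perror(path);
                return EXIT_FAILURE;
        }
        fprintf(f, "%s\n", val);
        if (fclose(f) != 0) {
                perror(path);
                return EXIT_FAILURE;
        }
        printf("ldlm_timeout set to %s seconds\n", val);
        return EXIT_SUCCESS;
}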
Hello.

Increasing ldlm_timeout has no effect whatsoever if adaptive timeouts are enabled. See bug 22569. I suggest that you tune up at_min instead.

Thanks,
-Cory

Oleg Drokin wrote:
> Hello!
>
> On Apr 30, 2010, at 9:00 AM, John Hammond wrote:
>> I tested a patch which set rq_no_resend = 1 for glimpses, and found that
>> clients only had about 6 seconds to reply before eviction. Since
>> eviction creates the possibility of data loss, a 6 second timeout was
>> deemed too short for production. (With the patch applied, it was easy
>> for me to create cases where data was indeed lost.) I was also able to
>
> Please note that the 6 second timeout is in fact the common ldlm_timeout,
> and it's not just glimpses that are bound by this value. Any ldlm
> callbacks are required to reply within this time, so if your network
> can have delays of more than this, you need to consider increasing the
> ldlm_timeout value (/proc/sys/lustre/ldlm_timeout).
> On the other hand, if you have a packet loss issue, then even if
> resending of glimpse ASTs were present, we don't currently resend other
> ASTs, so the situation still has the potential for evictions with
> subsequent possible data loss.
>
> Bye,
> Oleg
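To make Cory's point concrete, a small sketch that checks which knob is actually in effect before tuning. The at_min advice comes from his mail; the assumptions that the AT tunables live alongside ldlm_timeout as /proc/sys/lustre/at_max and /proc/sys/lustre/at_min, and that at_max == 0 means adaptive timeouts are disabled, are mine and worth verifying against your version:

#include <stdio.h>

/* Read a single unsigned integer from a proc tunable. */
static int read_tunable(const char *path, unsigned int *val)
{
        FILE *f = fopen(path, "r");
        int rc;

        if (f == NULL)
                return -1;
        rc = (fscanf(f, "%u", val) == 1) ? 0 : -1;
        fclose(f);
        return rc;
}

int main(void)
{
        unsigned int at_max;

        if (read_tunable("/proc/sys/lustre/at_max", &at_max) != 0) {
                perror("/proc/sys/lustre/at_max");
                return 1;
        }
        if (at_max == 0)
                printf("AT disabled: /proc/sys/lustre/ldlm_timeout is in effect\n");
        else
                printf("AT enabled: tune /proc/sys/lustre/at_min instead (bug 22569)\n");
        return 0;
}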
On 04/30/2010 01:44 PM, Oleg Drokin wrote:
> Hello!
>
> On Apr 30, 2010, at 9:00 AM, John Hammond wrote:
>> I tested a patch which set rq_no_resend = 1 for glimpses, and found that
>> clients only had about 6 seconds to reply before eviction. Since
>> eviction creates the possibility of data loss, a 6 second timeout was
>> deemed too short for production. (With the patch applied, it was easy
>> for me to create cases where data was indeed lost.) I was also able to
>
> Please note that the 6 second timeout is in fact the common ldlm_timeout,
> and it's not just glimpses that are bound by this value. Any ldlm
> callbacks are required to reply within this time, so if your network
> can have delays of more than this, you need to consider increasing the
> ldlm_timeout value (/proc/sys/lustre/ldlm_timeout).
> On the other hand, if you have a packet loss issue, then even if
> resending of glimpse ASTs were present, we don't currently resend other
> ASTs, so the situation still has the potential for evictions with
> subsequent possible data loss.

Are there any nonobvious ramifications of changing ldlm_timeout? I noticed that it was set to 20 seconds (except for MDSes?) in 1.8.2. Also, there is some suspect-looking logic in obd_config.c and elsewhere to keep it from being set too high relative to obd_timeout:

        if (ldlm_timeout >= obd_timeout)
                ldlm_timeout = max(obd_timeout / 3, 1U);

Does this mean that ldlm_timeout should not exceed 1/3 of obd_timeout?

Thanks,
-John

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu (512) 471-9304
Hello!

On Apr 30, 2010, at 5:07 PM, John Hammond wrote:
> Are there any nonobvious ramifications of changing ldlm_timeout? I
> noticed that it was set to 20 seconds (except for MDSes?) in 1.8.2.
> Also, there is some suspect-looking logic in obd_config.c and elsewhere
> to keep it from being set too high relative to obd_timeout:
>
>         if (ldlm_timeout >= obd_timeout)
>                 ldlm_timeout = max(obd_timeout / 3, 1U);
>
> Does this mean that ldlm_timeout should not exceed 1/3 of obd_timeout?

ldlm_timeout should not be set too high, because if a client that holds a lock dies, that is how long nobody will be able to get a conflicting lock. Of course, if your network might delay packets (round trip) for more than ldlm_timeout, then you need to lift the limit.

The 1/3 is there so that, if your network delay is potentially that big (and you do not use AT), there is still enough time to do some processing and then send a reply to the client (obd_timeout is what the client uses to determine when the reply should come) before the client times out that request.

Also see the comment from Cory: if you use AT, it is all now controlled by the at_min setting instead, and is then dynamically adjusted as the system detects your network latency.

Bye,
Oleg
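One detail worth spelling out from the snippet John quoted: the clamp only fires when ldlm_timeout >= obd_timeout, so a value between obd_timeout/3 and obd_timeout passes through untouched; the reset to 1/3 is a fallback for out-of-range settings rather than a hard ceiling. A self-contained restatement of the quoted logic, with the kernel max() spelled out by hand and purely illustrative values (obd_timeout = 100 is just an example):

#include <stdio.h>

/* The check John quoted from obd_config.c, restated verbatim in logic. */
static unsigned int clamp_ldlm_timeout(unsigned int ldlm_timeout,
                                       unsigned int obd_timeout)
{
        if (ldlm_timeout >= obd_timeout)
                ldlm_timeout = (obd_timeout / 3 > 1U) ? obd_timeout / 3 : 1U;
        return ldlm_timeout;
}

int main(void)
{
        printf("%u\n", clamp_ldlm_timeout(20, 100));    /* 20: untouched      */
        printf("%u\n", clamp_ldlm_timeout(90, 100));    /* 90: also untouched */
        printf("%u\n", clamp_ldlm_timeout(150, 100));   /* 33: reset to 1/3   */
        printf("%u\n", clamp_ldlm_timeout(5, 2));       /* 1:  max(2/3, 1)    */
        return 0;
}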