nic@cray.com
2007-Jan-09 10:43 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

This is a bug that has likely hit twice in 24 hours. The complaint from the site is that Lustre isn't cleaning up and evicting nids, which prevents new jobs from starting. This state has persisted for more than 2 hours in the current "hit" of the problem.

As for the issue itself: we seem to have trouble evicting a job that queues up lots of FLK (flock) locks on a common resource and then dies. We come in with llrd and try to evict the nids, and somehow get into a loop that keeps sending completion ASTs to the (now dead) liblustre clients to clean up the FLK locks. These ASTs are obviously timing out, at a rate of one every 2 seconds. The current instance is a 6000-node job, so if left alone it would take 6000 * 2 sec == 200 min, or just under 3.5 hours, to complete the lock timeouts.

The messages from ldlm_server_completion_ast() also indicate that these locks have been waiting for an extremely long time:

  ldlm_server_completion_ast()) ### enqueue wait took 9401808080us from 1168351505 ns

In the Lustre logs that will be attached, the python pid 4501 is llrd. I have a few -1 debug logs that show this processing loop quite nicely.

Is this just another example of bug 11330, where the processing code should take the obd_timeout into account when processing and cleaning up locks?
nic@cray.com
2007-Jan-09 10:44 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

Created an attachment (id=9301) --> https://bugzilla.lustre.org/attachment.cgi?id=9301&action=view
First -1 debug log from the MDS during flock processing.
nic@cray.com
2007-Jan-09 10:45 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

Created an attachment (id=9302) --> https://bugzilla.lustre.org/attachment.cgi?id=9302&action=view
Second -1 debug log from the MDS.
nic@cray.com
2007-Jan-10 12:05 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

I don't have a crystal clear reproducer -- other than to say an application (obviously) gets a bunch of flock locks and then dies. It seems the common thread is that they are all flock'ing the same resource on the MDS -- so probably only one client gets a granted lock, and the rest are waiting. Once the application is dead, we come in with llrd to clean these nids up and do the evictions. I am sure we are only going to see more of these. It should be quite easy to write an MPI test app that does a bunch of flock enqueues on a single resource and then falls over dead (segfault, etc.) -- a sketch of the pattern follows below.

It does seem that we are killing a node with the lock held, which gets the completion AST sent to the client (which seems silly, given that we _know_ one of the clients is dead), and then when that AST times out, we release that lock and reprocess the queue of pending locks for that resource.

I understand there isn't much we can do, given that llrd only gives us a single nid at a time. We *could* utilize the evict-nid-by-list changes that are floating around somewhere in Bugzilla and update llrd to use them. I do not know if there is a limit to the number of nids we can write into this proc file -- but we certainly need to know. This would give Lustre a single look at all the nids we are trying to kill. If Lustre could then mark each as "ADMIN_EVICTION_IN_PROGRESS" before it started cleaning up granted locks, etc., the various paths that would send RPCs to these clients could be prevented from taking too much time.

Also -- it should be possible to look at the time spent waiting for the flock locks and, if it was > obd_timeout (from request sent to being actually granted), dump the request as old. I believe this is similar to the approach for bug 11330.
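A minimal shell sketch of the enqueue-then-die pattern described above, assuming util-linux flock(1), a hypothetical /mnt/lustre client mount, and a client mounted with the flock option. It only mimics the lock pattern from a regular kernel client, so it will not by itself reproduce the liblustre/llrd eviction behaviour; the real reproducer would be the MPI app described in the comment.

  #!/bin/sh
  # Sketch only: many processes enqueue blocking flocks on one shared
  # file, then the whole "job" is killed while most of them are still
  # waiting in the enqueue.  /mnt/lustre is a hypothetical client mount.

  F=/mnt/lustre/flock-test-file
  touch "$F"

  pids=""
  for i in $(seq 1 32); do
      # Only one of these is granted the exclusive lock; the other 31
      # block inside flock() in the waiting queue for the resource.
      flock -x "$F" sleep 3600 &
      pids="$pids $!"
  done

  sleep 5                      # let the enqueues reach the server
  kill -9 $pids                # the "application" falls over dead
  pkill -9 -f "sleep 3600"     # also kill the child holding the granted lock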
green@clusterfs.com
2007-Jan-10 12:28 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

(In reply to comment #10)

> I don't have a crystal clear reproducer -- other than to say an application
> (obviously) gets a bunch of flock locks and then dies. It seems the common

Note that only one app can get a lock if all the locks are conflicting.

> thread is that they are all flock'ing the same resource on the MDS -- so
> probably only one client gets a granted lock, and the rest are waiting. Once the
> application is dead, we come in with llrd to clean these nids up and do the
> evictions. I am sure we are only going to see more of these. It should be quite

Yes, this sounds possible, though it is quite a stupid thing to do on something like an XT3. If you have a 6000-node job and 5999 nodes just wait until one node releases a lock on a file (in who knows how much time), that is quite an unproductive use of resources.

> easy to write an MPI test app that does a bunch of flock enqueues on a
> single resource and then falls over dead (segfault, etc.)

Does a single node exiting mean the rest of the nodes would be forcefully killed too?

> It does seem that we are killing a node with the lock held, which gets the
> completion AST sent to the client (which seems silly, given that we _know_ one
> of the clients is dead), and then when that AST times out, we release that lock

This is not silly, because we are killing ONE client and we are granting the lock to ANOTHER that is not killed yet.

> and reprocess the queue of pending locks for that resource.

Yes, because we killed one lock and now we need to see if something was waiting for it to go away and needs to be granted. If you kill all the processes that do not have locks granted first, this won't happen, of course.

> I understand there isn't much we can do, given that llrd only gives us a single
> nid at a time. We *could* utilize the evict-nid-by-list changes that are floating
> around somewhere in Bugzilla and update llrd to use them. I do not know if there
> is a limit to the number of nids we can write into this proc file -- but we
> certainly need to know. This would give Lustre a single look at all the nids we
> are trying to kill. If Lustre could then mark each as
> "ADMIN_EVICTION_IN_PROGRESS" before it started cleaning up granted locks, etc.,
> the various paths that would send RPCs to these clients could be prevented from
> taking too much time.

This would make such an eviction a two-stage process, I think: first go and mark all of them as eviction pending, then go and evict everybody. Twice as much work done for an obscure case.

> Also -- it should be possible to look at the time spent waiting for the flock
> locks and, if it was > obd_timeout (from request sent to being actually granted),
> dump the request as old. I believe this is similar to the approach for bug 11330.

This won't work. There is absolutely no limit on the amount of time a flock lock can be held. So with what you propose, if one node gets a lock and another node waits for the conflicting lock, and the first node holds the lock for, say, obd_timeout+1, then the second node won't get its lock at all because the timeout expired?
Canon, Richard Shane
2007-Jan-11 09:13 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
The situation that Nic is describing is not obscure. While the application doing this type of access may not be ideal, I can imagine several cases where this might happen. Also, even if we correct it in this app, another user will have the same problem down the road.

Just to make it clear: typically, when one task dies in an MPI job, the entire application stops and all tasks exit.

This type of scenario is a typical example of what I think of when we talk about scalable recovery. This situation is actually an easier case, because LLRD can provide you a list of all the nodes/nids that should be cleaned up. A two-stage process (which would probably require seconds) is fine given the alternative of it taking hours.

I'm still unclear as to why we are seeing this now. Nic: is this a "new" application or one we have run a good bit? Has anything changed in Lustre that could have caused this to become an issue?

--Shane
Oral, H. Sarp
2007-Jan-11 09:21 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
<snip> "I''m still unclear as to why we are seeing this now. Nic: is this a "new" application or one we have run a good bit? Has anything changed in Lustre that could have caused this to become an issue?" <snip> I wonder this myself, since what has been described is not such an obscure case. Sarp -------------------- Sarp Oral, Ph.D. National Center for Computational Sciences (NCCS) Oak Ridge National Lab, Oak Ridge, Tennessee 37831 865-574-2173, oralhs@ornl.gov -----Original Message----- From: lustre-devel-bounces@clusterfs.com [mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of Canon, Richard Shane Sent: Thursday, January 11, 2007 11:13 AM To: lustre-devel@clusterfs.com Subject: RE: [Lustre-devel] [Bug 11511] can''t evict nodes; stuck in flock astprocessing loop The situation that Nic is describing is not obscure. While the application doing this type of access may not be ideal, I can image several cases where this might happen. Also, even if we correct it in this app, another user will have the same problem down the road. Just to make it clear, typically when one task dies in an MPI job, then entire application stops and all tasks exit. This type of scenario is a typical example of what I think of when we talk about scalable recovery. This situation is actually an easier case, because LLRD can provide you a list of all the nodes/nids that should be cleaned up. A two stage process (which would probably require seconds) is fine given the alternative of it taking hours. I''m still unclear as to why we are seeing this now. Nic: is this a "new" application or one we have run a good bit? Has anything changed in Lustre that could have caused this to become an issue? --Shane -----Original Message----- From: lustre-devel-bounces@clusterfs.com [mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of green@clusterfs.com Sent: Wednesday, January 10, 2007 2:29 PM To: lustre-devel@clusterfs.com Subject: [Lustre-devel] [Bug 11511] can''t evict nodes; stuck in flock ast processing loop Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511 (In reply to comment #10)> I don''t have a crystal clear reproducer -- other than to say anapplication> (obviously) gets a bunch of flock locks and then dies. It seems thecommon Note that only one app can get a lock if all the locks are conflicting.> threads is that they are all flock''ing the same resource on the MDS -so> probably only one client gets a granted lock, the rest are waiting.Once the> application is dead, we come in with llrd to clean these nids up anddo the> evictions. I am sure we are only going to see more of these. It shouldbe quite Yes. this sounds possible, though quite stupid thing to do on something like xt3. 
Nicholas Henke
2007-Jan-11 10:37 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Oral, H. Sarp wrote:
> <snip>
>
> "I'm still unclear as to why we are seeing this now. Nic: is this a
> "new" application or one we have run a good bit? Has anything changed
> in Lustre that could have caused this to become an issue?"
>
> <snip>
>
> I wonder this myself, since what has been described is not such an
> obscure case.

Good darn question, folks. I really can't explain this, other than to say it might be a factor of scale in a specific job? I think we need to have the application killed at just the *wrong* time for this to happen -- I'm guessing that usually the nodes process this flock'd file quite quickly and never get killed while the locks are held.

Here is a possible solution we are thinking about. Warning -- this isn't fully tested yet. It does look good on paper though :)

The basic issue is that with Portals, liblustre & Catamount, one cannot do anything but set a timer when sending an RPC to a client node. Once the application is dead, this timer will run to completion, because QK looks at the destination pid, sees that nobody exists to receive the message, and drops it on the floor. Now, there exist some paths in Lustre node eviction that will result in RPC traffic to nodes -- and given that we evict nodes one by one (evicting the whole list at once is problematic for a host of reasons), we can get into the situation where we are sending RPCs to a node that llrd knows is dead, but we've not yet gotten that information into Lustre. I will grant that these cases are probably due to some varying level of application Evil Quotient -- but in the end, the system needs to protect against this.

Consider the case:

- nids 1-10 all do an flock on MDS inode 123456
- nid 1 is granted the lock, nids 2-10 are put into the pending list
- the job up & dies (how rude!)
- llrd evicts nid 1, causing Lustre to delete its lock and reprocess the list of pending locks
- Lustre sends a completion AST to nid 2 informing it that it now has the lock -- this times out after 2s
- Lustre repeats this process for nids 3-10
- total time spent waiting for nids 2-10 == 9 * 2s, or 18s

The following is a snippet of an idea that we have to deal with this problem:

> Eric B. came up with the idea that we can use lctl --net ptl del_peer
> <nid> for every nid we are evicting to delete the LNET-level information
> for that nid -- in effect preventing any future communication with that
> node. This should cause these RPC requests to fail immediately
> (something I'll be testing to verify) -- preventing the long and arduous
> serial 2-second cleanup for hours and hours.

Note that this only works because LNET on the servers will *not* try to reconnect to a libclient. Deleting the peer has the effect of failing the dead-node-bound RPC immediately on the server rather than after a 2s timeout.

Nic
nic@cray.com
2007-Jan-11 11:40 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

Eric, Oleg & I talked about possible solutions to the bug 11511 issue that is plaguing ORNL. I explored the possibility of something on the Cray Portals side that would immediately NAK an LNET message on a node where no application was active, but nothing was going to work or pass muster.

Eric then came up with the idea that we can use lctl --net ptl del_peer <nid> for every nid we are evicting, to delete the LNET-level information for that nid -- in effect preventing any future communication with that node. This should cause these RPC requests to fail immediately (something I'll be testing to verify), preventing the long and arduous serial 2-second cleanup that goes on for hours and hours.

I know this needs to be done for certain on the MDS, but is there any benefit to doing it on the OSTs as well? Can it only help? I think we need to explore this before looking at code changes, given the nature of the flock.
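A rough sketch of how the proposed workflow might look on the MDS. The lctl --net ptl del_peer syntax is the one quoted above; the nid values, the evict_client proc path, and the exact value written to it are assumptions for illustration only (llrd's actual eviction mechanism may differ).

  #!/bin/sh
  # Sketch only: for each nid of the dead job, drop the LNET peer entry
  # first so any AST already queued for that nid fails immediately
  # instead of waiting out its 2-second timeout, then do the usual
  # admin eviction.  NIDS and EVICT are hypothetical placeholders.

  NIDS="100@ptl 101@ptl 102@ptl"
  EVICT=/proc/fs/lustre/mds/lustre-MDT0000/evict_client

  for nid in $NIDS; do
      # Delete the LNET-level peer information for the dead node.
      lctl --net ptl del_peer "$nid"

      # Evict the nid on the MDS (however llrd normally triggers this).
      echo "$nid" > "$EVICT"
  done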