nic@cray.com
2007-Jan-09 10:43 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

This is a bug that has likely hit twice in 24 hours. The complaint from the site is that Lustre isn't cleaning up and evicting nids, which prevents new jobs from starting. This state has persisted for more than 2 hours in the current "hit" of the problem.

As for the issue itself: we seem to have trouble evicting a job that queues up lots of FLK (flock) locks on a common resource and then dies. We come in with llrd and try to evict the nids, and somehow get into a loop that keeps sending completion ASTs to the (now dead) liblustre clients to clean up the FLK locks. These ASTs are obviously timing out, at a rate of one every 2 seconds. The current instance is a 6000-node job, so if left alone it would take 6000 * 2 sec == 200 min, or just under 3.5 hours, to complete the lock timeouts.

The messages from ldlm_server_completion_ast() also indicate that these locks have been waiting for an extremely long time:

  ldlm_server_completion_ast()) ### enqueue wait took 9401808080us from 1168351505 ns

In the Lustre logs that will be attached, the python pid 4501 is llrd. I have a few -1 debug logs that show this processing loop quite nicely.

Is this just another example of bug 11330, where the processing code should take the obd_timeout into account when processing and cleaning up locks?
nic@cray.com
2007-Jan-09 10:44 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

Created an attachment (id=9301) --> https://bugzilla.lustre.org/attachment.cgi?id=9301&action=view
First -1 debug log from the MDS during flock processing.
nic@cray.com
2007-Jan-09 10:45 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

Created an attachment (id=9302) --> https://bugzilla.lustre.org/attachment.cgi?id=9302&action=view
Second -1 debug log from the MDS.
nic@cray.com
2007-Jan-10 12:05 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

I don't have a crystal clear reproducer -- other than to say an application (obviously) gets a bunch of flock locks and then dies. It seems the common thread is that they are all flock'ing the same resource on the MDS -- so probably only one client gets a granted lock, and the rest are waiting. Once the application is dead, we come in with llrd to clean these nids up and do the evictions. I am sure we are only going to see more of these. It should be quite easy to write an MPI test app that does a bunch of flock enqueues on a single resource and then falls over dead (segfault, etc.) -- a sketch of the pattern follows below.

It does seem that we are killing a node with the lock held, which gets the completion AST sent to the client (which seems silly, given that we _know_ one of the clients is dead), and then when that AST times out, we release that lock and reprocess the queue of pending locks for that resource.

I understand there isn't much we can do, given that llrd only gives us a single nid at a time. We *could* utilize the evict-nid-by-list changes that are floating around somewhere in Bugzilla and update llrd to use them. I do not know if there is a limit to the number of nids we can write into this proc file -- but we certainly need to know. This would give Lustre a single look at all the nids we are trying to kill. If Lustre could then mark each as "ADMIN_EVICTION_IN_PROGRESS" before it started cleaning up granted locks, etc., the various paths that would send RPCs to these clients could be prevented from taking too much time.

Also -- it should be possible to look at the time spent waiting for the flock locks and, if it was > obd_timeout (from request sent to being actually granted), dump the request as old. I believe this is similar to the approach for bug 11330.
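A minimal shell sketch of the enqueue-then-die pattern described above, assuming util-linux flock(1), a hypothetical /mnt/lustre client mount, and a client mounted with the flock option. It only mimics the lock pattern from a regular kernel client, so it will not by itself reproduce the liblustre/llrd eviction behaviour; the real reproducer would be the MPI app described in the comment.

  #!/bin/sh
  # Sketch only: many processes enqueue blocking flocks on one shared
  # file, then the whole "job" is killed while most of them are still
  # waiting in the enqueue.  /mnt/lustre is a hypothetical client mount.

  F=/mnt/lustre/flock-test-file
  touch "$F"

  pids=""
  for i in $(seq 1 32); do
      # Only one of these is granted the exclusive lock; the other 31
      # block inside flock() in the waiting queue for the resource.
      flock -x "$F" sleep 3600 &
      pids="$pids $!"
  done

  sleep 5                      # let the enqueues reach the server
  kill -9 $pids                # the "application" falls over dead
  pkill -9 -f "sleep 3600"     # also kill the child holding the granted lock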
green@clusterfs.com
2007-Jan-10 12:28 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

(In reply to comment #10)

> I don't have a crystal clear reproducer -- other than to say an application
> (obviously) gets a bunch of flock locks and then dies. It seems the common

Note that only one app can get a lock if all the locks are conflicting.

> thread is that they are all flock'ing the same resource on the MDS -- so
> probably only one client gets a granted lock, and the rest are waiting. Once the
> application is dead, we come in with llrd to clean these nids up and do the
> evictions. I am sure we are only going to see more of these. It should be quite

Yes, this sounds possible, though it is quite a stupid thing to do on something like an XT3. If you have a 6000-node job and 5999 nodes just wait until one node releases a lock on a file (in who knows how much time), that is quite an unproductive use of resources.

> easy to write an MPI test app that does a bunch of flock enqueues on a
> single resource and then falls over dead (segfault, etc.)

Does a single node exiting mean the rest of the nodes would be forcefully killed too?

> It does seem that we are killing a node with the lock held, which gets the
> completion AST sent to the client (which seems silly, given that we _know_ one
> of the clients is dead), and then when that AST times out, we release that lock

This is not silly, because we are killing ONE client and we are granting the lock to ANOTHER that is not killed yet.

> and reprocess the queue of pending locks for that resource.

Yes, because we killed one lock and now we need to see if something was waiting for it to go away and needs to be granted. If you kill all the processes that do not have locks granted first, this won't happen, of course.

> I understand there isn't much we can do, given that llrd only gives us a single
> nid at a time. We *could* utilize the evict-nid-by-list changes that are floating
> around somewhere in Bugzilla and update llrd to use them. I do not know if there
> is a limit to the number of nids we can write into this proc file -- but we
> certainly need to know. This would give Lustre a single look at all the nids we
> are trying to kill. If Lustre could then mark each as
> "ADMIN_EVICTION_IN_PROGRESS" before it started cleaning up granted locks, etc.,
> the various paths that would send RPCs to these clients could be prevented from
> taking too much time.

This would make such an eviction a two-stage process, I think: first go and mark all of them as eviction pending, then go and evict everybody. Twice as much work done for an obscure case.

> Also -- it should be possible to look at the time spent waiting for the flock
> locks and, if it was > obd_timeout (from request sent to being actually granted),
> dump the request as old. I believe this is similar to the approach for bug 11330.

This won't work. There is absolutely no limit on the amount of time a flock lock can be held. So with what you propose, if one node gets a lock and another node waits for the conflicting lock, and the first node holds the lock for, say, obd_timeout+1, then the second node won't get its lock at all because the timeout expired?
Canon, Richard Shane
2007-Jan-11 09:13 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
The situation that Nic is describing is not obscure. While the application doing this type of access may not be ideal, I can imagine several cases where this might happen. Also, even if we correct it in this app, another user will have the same problem down the road.

Just to make it clear: typically, when one task dies in an MPI job, the entire application stops and all tasks exit.

This type of scenario is a typical example of what I think of when we talk about scalable recovery. This situation is actually an easier case, because LLRD can provide you a list of all the nodes/nids that should be cleaned up. A two-stage process (which would probably require seconds) is fine given the alternative of it taking hours.

I'm still unclear as to why we are seeing this now. Nic: is this a "new" application or one we have run a good bit? Has anything changed in Lustre that could have caused this to become an issue?

--Shane
Oral, H. Sarp
2007-Jan-11 09:21 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
<snip> "I''m still unclear as to why we are seeing this now. Nic: is this a "new" application or one we have run a good bit? Has anything changed in Lustre that could have caused this to become an issue?" <snip> I wonder this myself, since what has been described is not such an obscure case. Sarp -------------------- Sarp Oral, Ph.D. National Center for Computational Sciences (NCCS) Oak Ridge National Lab, Oak Ridge, Tennessee 37831 865-574-2173, oralhs@ornl.gov -----Original Message----- From: lustre-devel-bounces@clusterfs.com [mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of Canon, Richard Shane Sent: Thursday, January 11, 2007 11:13 AM To: lustre-devel@clusterfs.com Subject: RE: [Lustre-devel] [Bug 11511] can''t evict nodes; stuck in flock astprocessing loop The situation that Nic is describing is not obscure. While the application doing this type of access may not be ideal, I can image several cases where this might happen. Also, even if we correct it in this app, another user will have the same problem down the road. Just to make it clear, typically when one task dies in an MPI job, then entire application stops and all tasks exit. This type of scenario is a typical example of what I think of when we talk about scalable recovery. This situation is actually an easier case, because LLRD can provide you a list of all the nodes/nids that should be cleaned up. A two stage process (which would probably require seconds) is fine given the alternative of it taking hours. I''m still unclear as to why we are seeing this now. Nic: is this a "new" application or one we have run a good bit? Has anything changed in Lustre that could have caused this to become an issue? --Shane -----Original Message----- From: lustre-devel-bounces@clusterfs.com [mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of green@clusterfs.com Sent: Wednesday, January 10, 2007 2:29 PM To: lustre-devel@clusterfs.com Subject: [Lustre-devel] [Bug 11511] can''t evict nodes; stuck in flock ast processing loop Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511 (In reply to comment #10)> I don''t have a crystal clear reproducer -- other than to say anapplication> (obviously) gets a bunch of flock locks and then dies. It seems thecommon Note that only one app can get a lock if all the locks are conflicting.> threads is that they are all flock''ing the same resource on the MDS -so> probably only one client gets a granted lock, the rest are waiting.Once the> application is dead, we come in with llrd to clean these nids up anddo the> evictions. I am sure we are only going to see more of these. It shouldbe quite Yes. this sounds possible, though quite stupid thing to do on something like xt3. 
Nicholas Henke
2007-Jan-11 10:37 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Oral, H. Sarp wrote:
> <snip>
>
> "I'm still unclear as to why we are seeing this now. Nic: is this a
> "new" application or one we have run a good bit? Has anything changed
> in Lustre that could have caused this to become an issue?"
>
> <snip>
>
> I wonder this myself, since what has been described is not such an
> obscure case.

Good darn question, folks. I really can't explain this, other than to say it might be a factor of scale in a specific job? I think we need to have the application killed at just the *wrong* time for this to happen -- I'm guessing that usually the nodes process this flock'd file quite quickly and never get killed while the locks are held.

Here is a possible solution we are thinking about. Warning -- this isn't fully tested yet. It does look good on paper though :)

The basic issue is that with Portals, liblustre & Catamount, one cannot do anything but set a timer when sending an RPC to a client node. Once the application is dead, this timer will run to completion, because QK looks at the destination pid, sees that nobody exists to receive the message, and drops it on the floor. Now, there exist some paths in Lustre node eviction that will result in RPC traffic to nodes -- and given that we evict nodes one by one (evicting the whole list at once is problematic for a host of reasons), we can get into the situation where we are sending RPCs to a node that llrd knows is dead, but we've not yet gotten that information into Lustre. I will grant that these cases are probably due to some varying level of application Evil Quotient -- but in the end, the system needs to protect against this.

Consider the case:

- nids 1-10 all do an flock on MDS inode 123456
- nid 1 is granted the lock, nids 2-10 are put into the pending list
- the job up & dies (how rude!)
- llrd evicts nid 1, causing Lustre to delete its lock and reprocess the list of pending locks
- Lustre sends a completion AST to nid 2 informing it that it now has the lock -- this times out after 2s
- Lustre repeats this process for nids 3-10
- total time spent waiting for nids 2-10 == 9 * 2s, or 18s

The following is a snippet of an idea that we have to deal with this problem:

> Eric B. came up with the idea that we can use lctl --net ptl del_peer
> <nid> for every nid we are evicting to delete the LNET-level information
> for that nid -- in effect preventing any future communication with that
> node. This should cause these RPC requests to fail immediately
> (something I'll be testing to verify) -- preventing the long and arduous
> serial 2-second cleanup for hours and hours.

Note that this only works because LNET on the servers will *not* try to reconnect to a libclient. Deleting the peer has the effect of failing the dead-node-bound RPC immediately on the server rather than after a 2s timeout.

Nic
nic@cray.com
2007-Jan-11 11:40 UTC
[Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511

Eric, Oleg & I talked about possible solutions to the bug 11511 issue that is plaguing ORNL. I explored the possibility of something on the Cray Portals side that would immediately NAK an LNET message on a node where no application was active, but nothing was going to work or pass muster.

Eric then came up with the idea that we can use lctl --net ptl del_peer <nid> for every nid we are evicting, to delete the LNET-level information for that nid -- in effect preventing any future communication with that node. This should cause these RPC requests to fail immediately (something I'll be testing to verify), preventing the long and arduous serial 2-second cleanup that goes on for hours and hours.

I know this needs to be done for certain on the MDS, but is there any benefit to doing it on the OSTs as well? Can it only help? I think we need to explore this before looking at code changes, given the nature of the flock.
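A rough sketch of how the proposed workflow might look on the MDS. The lctl --net ptl del_peer syntax is the one quoted above; the nid values, the evict_client proc path, and the exact value written to it are assumptions for illustration only (llrd's actual eviction mechanism may differ).

  #!/bin/sh
  # Sketch only: for each nid of the dead job, drop the LNET peer entry
  # first so any AST already queued for that nid fails immediately
  # instead of waiting out its 2-second timeout, then do the usual
  # admin eviction.  NIDS and EVICT are hypothetical placeholders.

  NIDS="100@ptl 101@ptl 102@ptl"
  EVICT=/proc/fs/lustre/mds/lustre-MDT0000/evict_client

  for nid in $NIDS; do
      # Delete the LNET-level peer information for the dead node.
      lctl --net ptl del_peer "$nid"

      # Evict the nid on the MDS (however llrd normally triggers this).
      echo "$nid" > "$EVICT"
  done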