pbojanic@clusterfs.com
2007-Jan-18 05:37 UTC
[Lustre-devel] [Bug 10734] ptlrpc connect to non-existant node crashes
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=10734

Eric advises:

1. Create a global list of zombie imports and exports and a cleanup thread that consumes them.
2. Change class_import_put() to add the import to the zombie import list on removing the last reference, and run the rest of it from the cleanup thread.
3. Change __class_export_put() to add the export to the zombie export list on removing the last reference, and run the rest of it from the cleanup thread.

1 week to code and unit test? I wonder if there are existing fields in struct obd_export and struct obd_import that could be used for the queueing, but it's not a big deal. Someone with familiarity with this code should verify my suggestion.
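What Eric describes is a classic deferred-destruction pattern: the final put() only queues the object, and a dedicated thread does the actual teardown. Below is a minimal user-space sketch of that pattern in C with pthreads - it is only an illustration, not the Lustre code; struct import, import_put(), zombie_thread() and the list/lock names are invented stand-ins for struct obd_import, class_import_put() and the proposed cleanup thread, which would of course run in kernel context.

/* User-space analogue of the proposed zombie-list scheme (illustrative only;
 * none of these names are Lustre symbols).  Build with: cc -pthread demo.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct import {                       /* stand-in for struct obd_import */
        int            refcount;
        struct import *next;          /* queueing field for the zombie list */
        const char    *target;
};

static struct import  *zombie_list;   /* global list of zombie imports */
static pthread_mutex_t zombie_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  zombie_cond = PTHREAD_COND_INITIALIZER;
static int             zombies_pending;

/* class_import_put() analogue: on dropping the last reference, queue the
 * import for destruction instead of tearing it down in the caller's context. */
static void import_put(struct import *imp)
{
        if (--imp->refcount > 0)      /* a real version would use an atomic refcount */
                return;

        pthread_mutex_lock(&zombie_lock);
        imp->next = zombie_list;
        zombie_list = imp;
        zombies_pending++;
        pthread_cond_broadcast(&zombie_cond);
        pthread_mutex_unlock(&zombie_lock);
}

/* Cleanup thread: consumes the zombie list and does the expensive teardown
 * (in Lustre this is where network callbacks would be waited out). */
static void *zombie_thread(void *arg)
{
        (void)arg;
        for (;;) {
                struct import *imp;

                pthread_mutex_lock(&zombie_lock);
                while (zombie_list == NULL)
                        pthread_cond_wait(&zombie_cond, &zombie_lock);
                imp = zombie_list;
                zombie_list = imp->next;
                pthread_mutex_unlock(&zombie_lock);

                printf("destroying import to %s\n", imp->target);
                free(imp);

                pthread_mutex_lock(&zombie_lock);
                zombies_pending--;
                pthread_cond_broadcast(&zombie_cond);
                pthread_mutex_unlock(&zombie_lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;
        struct import *imp = calloc(1, sizeof(*imp));

        imp->refcount = 1;
        imp->target   = "unreachable-node";

        pthread_create(&tid, NULL, zombie_thread, NULL);
        import_put(imp);      /* drops the last ref: queued, not destroyed inline */
        sleep(1);             /* give the cleanup thread time to run */
        return 0;
}

The point of the pattern is that import_put() becomes cheap and safe to call from any context; all the slow or blocking work moves into the cleanup thread.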
eeb@clusterfs.com
2007-Jan-18 13:12 UTC
[Lustre-devel] [Bug 10734] ptlrpc connect to non-existant node crashes
Created an attachment (id=9372): patch against b1_5
(https://bugzilla.lustre.org/attachment.cgi?id=9372&action=view)

This patch changes put_{im,ex}port to schedule the {im,ex}port for destruction by the ptlrpc daemon. It's more of a DLD than an actual solution, and the following points must be considered.

1. It uses the ptlrpc daemon to do the actual {im,ex}port destruction. I used it for convenience; the extra cleanup work is negligible and won't affect performance. But this means the ptlrpc daemon must run everywhere, not just on clients (and the MDS?) as it has up till now.
2. This introduces an extra level of asynchrony into shutting down. Note that shutdown has never actually been synchronous, even though it may appear to be and lconf assumes it is - it has always been the case that network callbacks have to complete before everything can clean up, although normally this is quite fast. More work may be required to make shutdown/unload scripts block properly; one way to do that is sketched after this list.
3. I've not actually tested this code on b1_5.
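One way shutdown or module-unload paths could block on the deferred cleanup is a simple barrier over the count of queued zombies. This continues the hypothetical user-space sketch above (zombie_lock, zombie_cond and zombies_pending are names invented there, not Lustre symbols) and only illustrates the idea; it is not what the attached patch does.

/* Hypothetical barrier: shutdown/unload would call this to block until the
 * cleanup thread has destroyed every queued zombie. */
static void zombie_barrier(void)
{
        pthread_mutex_lock(&zombie_lock);
        while (zombies_pending > 0)
                pthread_cond_wait(&zombie_cond, &zombie_lock);
        pthread_mutex_unlock(&zombie_lock);
}

In the earlier sketch, calling zombie_barrier() in main() instead of sleep(1) would make the process exit only after the queue drains, which is roughly the guarantee a shutdown/unload script would want.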
nathan@clusterfs.com
2007-Jan-19 12:35 UTC
[Lustre-devel] [Bug 10734] ptlrpc connect to non-existant node crashes
What                         |Removed |Added
-----------------------------+--------+------
Attachment #9372 is obsolete |0       |1

Created an attachment (id=9388): compile fix and added regression test
(https://bugzilla.lustre.org/attachment.cgi?id=9388&action=view)
eeb@clusterfs.com
2007-Jan-20 04:14 UTC
[Lustre-devel] [Bug 10734] ptlrpc connect to non-existant node crashes
> Eric, is there any more work you want to do here before we land this?

I'm not familiar with 1.6 teardown / module unload procedures. Does that need to take account of possible delays and, if so, does it? Apart from that issue, I'm happy this can land.
nathan@clusterfs.com
2007-Jan-20 11:41 UTC
[Lustre-devel] [Bug 10734] ptlrpc connect to non-existant node crashes
What       |Removed |Added
-----------+--------+---------
Status     |NEW     |RESOLVED
Resolution |        |FIXED

(In reply to comment #13)
> > Eric, is there any more work you want to do here before we land this?
>
> I'm not familiar with 1.6 teardown / module unload procedures. Does that need
> to take account of possible delays and, if so, does it?

umount waits for the last disk reference to drop before returning, but isn't particularly concerned that all obds stop. As long as all obds do eventually stop (the mgc in this case), I don't have a problem with it. It will prevent immediate re-startup (probably EALREADY), but since it's trying to talk to a non-responding node anyhow, I don't think this will cause anyone pain.

> Apart from that issue, I'm happy this can land.

Landed on b1_5.