behlendorf1@llnl.gov
2007-Feb-05 14:59 UTC
[Lustre-devel] [Bug 11327] Assertion in target_handle_connect
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11327

Created an attachment (id=9511) --> (https://bugzilla.lustre.org/attachment.cgi?id=9511&action=view)
Kernel dk log
behlendorf1@llnl.gov
2007-Feb-05 19:09 UTC
[Lustre-devel] [Bug 11327] Assertion in target_handle_connect
Created an attachment (id=9514) --> (https://bugzilla.lustre.org/attachment.cgi?id=9514&action=view)
proposed patch against 1.4.8-5chaos

Proposed fix to resolve the connection/eviction race. Here's the basic race; full details can be seen in the attached kernel dk log.

    ping_evictor                       ll_ost_io_77
    ---------------------------------------------------------------------
    - ping_evictor_main()
      - timeout, evict client
      - exp->exp_failed = 1
                                       - target_handle_connect()
                                         - valid (but failed) export found
                                           in target->obd_exports
                                         - exp->exp_connecting = 1
    - class_disconnect destroys
      cookie
                                       - export = class_conn2export(&conn)
                                         lookup now fails: no cookie
                                       - LASSERT(export != NULL)

The proposed fix checks whether exp_failed is set on the export when it is found in target->obd_exports. If it is set, we bail out, because this export will soon be destroyed. The patch also ensures that if an export is successfully found and exp_failed is not set, then exp_connecting is set under the exp_lock spinlock. This can then be used to prevent class_fail_export() from setting exp_failed and destroying the export while the connect is in progress.
behlendorf1@llnl.gov
2007-Feb-05 19:16 UTC
[Lustre-devel] [Bug 11327] Assertion in target_handle_connect
               What    |Removed |Added
----------------------------------------------------------------------------
    Attachment #9514 is|0       |1
               obsolete|        |

Created an attachment (id=9515) --> (https://bugzilla.lustre.org/attachment.cgi?id=9515&action=view)
proposed patch against 1.4.8-5chaos

Drat, I knew as soon as I attached it to the bug I'd spot some little mistake. Here's a new version which ensures exp_failed doesn't get set in the connecting case.
adilger@clusterfs.com
2007-Feb-06 16:07 UTC
[Lustre-devel] [Bug 11327] Assertion in target_handle_connect
               What             |Removed                       |Added
----------------------------------------------------------------------------
    Attachment #9515 Flag       |review?(adilger@clusterfs.com)|review-

(From update of attachment 9515)

> void class_fail_export(struct obd_export *exp)
> {
>-        int rc, already_failed;
>+        int rc, already_failed, connecting;
>
>         spin_lock(&exp->exp_lock);
>         already_failed = exp->exp_failed;
>-        exp->exp_failed = 1;
>+        connecting = exp->exp_connecting;
>+        if (!connecting)
>+                exp->exp_failed = 1;
>         spin_unlock(&exp->exp_lock);

The only objection that I have here is that this appears to allow a reconnecting client to stifle the effort of the server to evict the client. I can understand that if the reconnection has already started and it is holding the locks then we can't really interrupt it. However, I wonder if we shouldn't mark the export failed anyways and handle this more gracefully in target_handle_connect() instead of LBUGging?

What about just returning -ENOTCONN where we currently LBUG if we can't look up the connection handle? If that lookup succeeds, but the export is failed after that, the current code at least has a refcount on the export and it will be destroyed after we (confusingly) return success for the connection and drop the reference held by req->rq_export.
We could also set rq_status == -ENOTCONN right at the end of target_handle_connect() if exp_failed is set, but then there is still a very small window after that check where the client will return success but the export is destroyed.

>@@ -627,22 +627,35 @@ int target_handle_connect(struct ptlrpc_
>+        spin_lock(&export->exp_lock);
>+        if (export->exp_failed) { /* bug 11327 evict/conn race */
>+                spin_unlock(&export->exp_lock);
>+                CWARN("%s: exp %p connect racing with "
>+                      "eviction\n", export->exp_obd->obd_name,
>+                      export);
>+                export = NULL;
>+                rc = -EALREADY;
>+                break;
>+        }

Specifically, this check could be moved to the very end of the "success" path. I'm also not sure if -EALREADY is right here, instead of -ENOTCONN.