tianzy@clusterfs.com
2006-Dec-21 02:24 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 Created an attachment (id=9198) Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: --> (https://bugzilla.lustre.org/attachment.cgi?id=9198&action=view) Following Adilger and Green''s sugguest, create this patch
green@clusterfs.com
2006-Dec-21 17:09 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9198|review?(green@clusterfs.com)|review+ Flag| | (From update of attachment 9198) Looks good. Few minor comments.>@@ -695,8 +698,21 @@ static int ost_brw_read(struct ptlrpc_re >+ /* fix the bug11330 */This comment is too cryptic. Better to say something like ''Check if getting the lock took more time than client was willing to wait''>+ do_gettimeofday(&present_time); >+ timediff = cfs_timeval_sub(&present_time, &req->rq_arrival_time,NULL); >+ if (timediff / 1000000 > (long)obd_timeout) { >+ no_reply = 1; >+ CERROR("Dropping timed-out opc %d request from %s" >+ ": %ld seconds old\n", req->rq_reqmsg->opc, >+ libcfs_id2str(req->rq_peer), >+ timediff / 1000000);The error message here should be a bit changed too, I think. Something like ''Dropping opc ... because taking the lock took too long'' (and print resource for the lock too, I think, now that we actully have it) Same in ost_brw_write. Also we probably need a test and FAIL_LOC injected to actually test this codepath (and recovert-small.sh test too?)
adilger@clusterfs.com
2006-Dec-27 15:22 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9198|review?(adilger@clusterfs.co|review- Flag|m) | (From update of attachment 9198)>@@ -695,8 +698,21 @@ static int ost_brw_read(struct ptlrpc_re > rc = ost_brw_lock_get(LCK_PR, req->rq_export, ioo, pp_rnb, &lockh); >+ if (rc != 0) { > GOTO(out_bulk, rc); >+ } else {Since this is an unconditional GOTO() you don''t need the "else" and indentation following it.>+ /* fix the bug11330 */Please have a more meaningful comment here, like "/* don''t process request if we were blocked longer than the timeout. b=11330 */".>+ do_gettimeofday(&present_time); >+ timediff = cfs_timeval_sub(&present_time, &req->rq_arrival_time,NULL); >+ if (timediff / 1000000 > (long)obd_timeout) { >+ no_reply = 1; >+ CERROR("Dropping timed-out opc %d request from %s" >+ ": %ld seconds old\n", req->rq_reqmsg->opc, >+ libcfs_id2str(req->rq_peer), >+ timediff / 1000000); >+ goto out_lock; >+ }[minor] Since this time difference is in the order of many seconds, and obd_timeout is also seconds, we don''t need millisecond accuracy. You can use cfs_time_current_sec() to just get the seconds here and avoid putting present_time and timediff on the stack (it is fine to recalculate timediff for the error message, since this should happen very rarely. [minor] Also, to make the error message more useful, please put something like "Dropping timed-out read request", instead of "opc 3 request".>@@ -922,8 +943,22 @@ static int ost_brw_write(struct ptlrpc_r > GOTO(out, rc = -ENOMEM); > > rc = ost_brw_lock_get(LCK_PW, req->rq_export, ioo, pp_rnb, &lockh); >- if (rc != 0) >+ if (rc != 0) { > GOTO(out_bulk, rc); >+ } else { >+ /* fix the bug11330 */ >+ do_gettimeofday(&present_time); >+ timediff = cfs_timeval_sub(&present_time, &req->rq_arrival_time,NULL); >+ if (timediff / 1000000 > (long)obd_timeout) { >+ no_reply = 1; >+ CERROR("Dropping timed-out opc %d request from %s" >+ ": %ld seconds old\n", req->rq_reqmsg->opc, >+ libcfs_id2str(req->rq_peer), >+ timediff / 1000000); >+ goto out_lock; >+ } >+ }[minor] All of the same changes here also.
tianzy@clusterfs.com
2006-Dec-29 01:23 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9198 is|0 |1 obsolete| | Created an attachment (id=9243) Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: --> (https://bugzilla.lustre.org/attachment.cgi?id=9243&action=view) updated patch updated issues: 1. inappropriate comments. 2. add OBD_FAIL test code.
adilger@clusterfs.com
2006-Dec-29 02:27 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9243|review?(adilger@clusterfs.co|review- Flag|m) | (From update of attachment 9243)>Index: include/linux/obd_support.h >==================================================================>RCS file: /cvsroot/cfs/lustre-core/include/linux/obd_support.h,v >retrieving revision 1.92.2.8.22.36 >diff -p -u -r1.92.2.8.22.36 obd_support.h >--- include/linux/obd_support.h 21 Dec 2006 00:35:46 -0000 1.92.2.8.22.36 >+++ include/linux/obd_support.h 29 Dec 2006 08:18:04 -0000 >@@ -132,6 +132,7 @@ extern int obd_race_state; >+#define OBD_FAIL_OST_DROP_REQ 0x21cNote that 0x21c is already in use on b1_5. Please use 0x21d.>@@ -698,6 +699,26 @@ static int ost_brw_read(struct ptlrpc_re > if (rc != 0) > GOTO(out_bulk, rc); > >+ /* >+ * If getting the lock took more time than >+ * client was willing to wait, drop it. b=11330 >+ */ >+ if (cfs_time_current_sec() > (time_t)obd_timeout ||Sorry, this doesn''t make sense. The current time (i.e. seconds since 1970) will ALWAYS be > obd_timeout. What you really want is: if ((cfs_time_current_sec() > req->rq_arrival_time.tv_sec + obd_timeout) || OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) { CERROR("Dropping timed-out read request from %s " "because taking the lock: "LPX64" took %lu seconds.\n", libcfs_id2str(req->rq_peer), lockh.cookie, cfs_time_current_sec() - req->rq_arrival_time.tv_sec); goto out_lock; }>+ * If getting the lock took more time than >+ * client was willing to wait, drop it. b=11330 >+ */ >+ if (cfs_time_current_sec() > (time_t)obd_timeout || >+ OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {Same here.
tianzy@clusterfs.com
2006-Dec-29 03:17 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9243 is|0 |1 obsolete| | Created an attachment (id=9244) Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: --> (https://bugzilla.lustre.org/attachment.cgi?id=9244&action=view) updated patch
tianzy@clusterfs.com
2006-Dec-31 02:03 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9244 is|0 |1 obsolete| | Created an attachment (id=9251) Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: --> (https://bugzilla.lustre.org/attachment.cgi?id=9251&action=view) updated patch(add test code)
tianzy@clusterfs.com
2006-Dec-31 02:08 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9251|review?(adilger@clusterfs.co|review?(green@clusterfs.com) Flag|m) | (From update of attachment 9251) This is the final version of this patch. Because Adilger leaves for a long time, I need your approval before landing this patch
green@clusterfs.com
2007-Jan-11 16:29 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9251|review?(green@clusterfs.com)|review+ Flag| | (From update of attachment 9251)>@@ -698,6 +699,21 @@ static int ost_brw_read(struct ptlrpc_re > if (rc != 0) > GOTO(out_bulk, rc); > >+ /* >+ * If getting the lock took more time than >+ * client was willing to wait, drop it. b=11330 >+ */ >+ if (cfs_time_current_sec() > req->rq_arrival_time.tv_sec + obd_timeout || >+ OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) { >+ no_reply = 1; >+ CERROR("Dropping timed-out read request from %s " >+ "because taking the lock: "LPX64" took %ld seconds.\n", >+ libcfs_id2str(req->rq_peer), >+ lockh.cookie, >+ cfs_time_current_sec() - req->rq_arrival_time.tv_sec);minor, but still pretty desirable: no point in printing lock cookie, it does not tell us anything useful (without having debug lock. Better to print object id ioo->ioo_id (so it will read like "... taking the lock on object NUMBER")>@@ -925,6 +944,21 @@ static int ost_brw_write(struct ptlrpc_r > if (rc != 0) > GOTO(out_bulk, rc); > >+ /* >+ * If getting the lock took more time than >+ * client was willing to wait, drop it. b=11330 >+ */ >+ if (cfs_time_current_sec() > req->rq_arrival_time.tv_sec + obd_timeout || >+ OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) { >+ no_reply = 1; >+ CERROR("Dropping timed-out write request from %s " >+ "because taking the lock: "LPX64" took %ld seconds.\n", >+ libcfs_id2str(req->rq_peer), >+ lockh.cookie, >+ cfs_time_current_sec() - req->rq_arrival_time.tv_sec);Same as above. Otherwise is good. Needs to be landed on b1_4 and b1_5