tianzy@clusterfs.com
2006-Dec-21 02:24 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11330 Created an attachment (id=9198) Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: --> (https://bugzilla.lustre.org/attachment.cgi?id=9198&action=view) Following Adilger and Green''s sugguest, create this patch
green@clusterfs.com
2006-Dec-21 17:09 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9198|review?(green@clusterfs.com)|review+
Flag| |
(From update of attachment 9198)
Looks good.
Few minor comments.
>@@ -695,8 +698,21 @@ static int ost_brw_read(struct ptlrpc_re
>+ /* fix the bug11330 */
This comment is too cryptic.
Better to say something like ''Check if getting the lock took more time
than
client was willing to wait''
>+ do_gettimeofday(&present_time);
>+ timediff = cfs_timeval_sub(&present_time,
&req->rq_arrival_time,NULL);
>+ if (timediff / 1000000 > (long)obd_timeout) {
>+ no_reply = 1;
>+ CERROR("Dropping timed-out opc %d request from
%s"
>+ ": %ld seconds old\n",
req->rq_reqmsg->opc,
>+ libcfs_id2str(req->rq_peer),
>+ timediff / 1000000);
The error message here should be a bit changed too, I think.
Something like ''Dropping opc ... because taking the lock took too
long'' (and
print resource for the lock too,
I think, now that we actully have it)
Same in ost_brw_write.
Also we probably need a test and FAIL_LOC injected to actually test this
codepath (and recovert-small.sh test too?)
adilger@clusterfs.com
2006-Dec-27 15:22 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9198|review?(adilger@clusterfs.co|review-
Flag|m) |
(From update of attachment 9198)
>@@ -695,8 +698,21 @@ static int ost_brw_read(struct ptlrpc_re
> rc = ost_brw_lock_get(LCK_PR, req->rq_export, ioo, pp_rnb,
&lockh);
>+ if (rc != 0) {
> GOTO(out_bulk, rc);
>+ } else {
Since this is an unconditional GOTO() you don''t need the
"else" and indentation
following it.
>+ /* fix the bug11330 */
Please have a more meaningful comment here, like "/* don''t process
request if
we were blocked longer than the timeout. b=11330 */".
>+ do_gettimeofday(&present_time);
>+ timediff = cfs_timeval_sub(&present_time,
&req->rq_arrival_time,NULL);
>+ if (timediff / 1000000 > (long)obd_timeout) {
>+ no_reply = 1;
>+ CERROR("Dropping timed-out opc %d request from
%s"
>+ ": %ld seconds old\n",
req->rq_reqmsg->opc,
>+ libcfs_id2str(req->rq_peer),
>+ timediff / 1000000);
>+ goto out_lock;
>+ }
[minor] Since this time difference is in the order of many seconds, and
obd_timeout is also seconds, we don''t need millisecond accuracy. You
can use
cfs_time_current_sec() to just get the seconds here and avoid putting
present_time and timediff on the stack (it is fine to recalculate timediff for
the error message, since this should happen very rarely.
[minor] Also, to make the error message more useful, please put something like
"Dropping timed-out read request", instead of "opc 3
request".
>@@ -922,8 +943,22 @@ static int ost_brw_write(struct ptlrpc_r
> GOTO(out, rc = -ENOMEM);
>
> rc = ost_brw_lock_get(LCK_PW, req->rq_export, ioo, pp_rnb,
&lockh);
>- if (rc != 0)
>+ if (rc != 0) {
> GOTO(out_bulk, rc);
>+ } else {
>+ /* fix the bug11330 */
>+ do_gettimeofday(&present_time);
>+ timediff = cfs_timeval_sub(&present_time,
&req->rq_arrival_time,NULL);
>+ if (timediff / 1000000 > (long)obd_timeout) {
>+ no_reply = 1;
>+ CERROR("Dropping timed-out opc %d request from
%s"
>+ ": %ld seconds old\n",
req->rq_reqmsg->opc,
>+ libcfs_id2str(req->rq_peer),
>+ timediff / 1000000);
>+ goto out_lock;
>+ }
>+ }
[minor] All of the same changes here also.
tianzy@clusterfs.com
2006-Dec-29 01:23 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9198 is|0 |1
obsolete| |
Created an attachment (id=9243)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
--> (https://bugzilla.lustre.org/attachment.cgi?id=9243&action=view)
updated patch
updated issues:
1. inappropriate comments.
2. add OBD_FAIL test code.
adilger@clusterfs.com
2006-Dec-29 02:27 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9243|review?(adilger@clusterfs.co|review-
Flag|m) |
(From update of attachment 9243)>Index: include/linux/obd_support.h
>==================================================================>RCS
file: /cvsroot/cfs/lustre-core/include/linux/obd_support.h,v
>retrieving revision 1.92.2.8.22.36
>diff -p -u -r1.92.2.8.22.36 obd_support.h
>--- include/linux/obd_support.h 21 Dec 2006 00:35:46 -0000 1.92.2.8.22.36
>+++ include/linux/obd_support.h 29 Dec 2006 08:18:04 -0000
>@@ -132,6 +132,7 @@ extern int obd_race_state;
>+#define OBD_FAIL_OST_DROP_REQ 0x21c
Note that 0x21c is already in use on b1_5. Please use 0x21d.
>@@ -698,6 +699,26 @@ static int ost_brw_read(struct ptlrpc_re
> if (rc != 0)
> GOTO(out_bulk, rc);
>
>+ /*
>+ * If getting the lock took more time than
>+ * client was willing to wait, drop it. b=11330
>+ */
>+ if (cfs_time_current_sec() > (time_t)obd_timeout ||
Sorry, this doesn''t make sense. The current time (i.e. seconds since
1970)
will ALWAYS be > obd_timeout. What you really want is:
if ((cfs_time_current_sec() > req->rq_arrival_time.tv_sec +
obd_timeout) ||
OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {
CERROR("Dropping timed-out read request from %s "
"because taking the lock: "LPX64" took %lu
seconds.\n",
libcfs_id2str(req->rq_peer),
lockh.cookie,
cfs_time_current_sec() - req->rq_arrival_time.tv_sec);
goto out_lock;
}
>+ * If getting the lock took more time than
>+ * client was willing to wait, drop it. b=11330
>+ */
>+ if (cfs_time_current_sec() > (time_t)obd_timeout ||
>+ OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {
Same here.
tianzy@clusterfs.com
2006-Dec-29 03:17 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9243 is|0 |1
obsolete| |
Created an attachment (id=9244)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
--> (https://bugzilla.lustre.org/attachment.cgi?id=9244&action=view)
updated patch
tianzy@clusterfs.com
2006-Dec-31 02:03 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9244 is|0 |1
obsolete| |
Created an attachment (id=9251)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
--> (https://bugzilla.lustre.org/attachment.cgi?id=9251&action=view)
updated patch(add test code)
tianzy@clusterfs.com
2006-Dec-31 02:08 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9251|review?(adilger@clusterfs.co|review?(green@clusterfs.com)
Flag|m) |
(From update of attachment 9251)
This is the final version of this patch.
Because Adilger leaves for a long time, I need your approval before landing
this patch
green@clusterfs.com
2007-Jan-11 16:29 UTC
[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9251|review?(green@clusterfs.com)|review+
Flag| |
(From update of attachment 9251)>@@ -698,6 +699,21 @@ static int ost_brw_read(struct ptlrpc_re
> if (rc != 0)
> GOTO(out_bulk, rc);
>
>+ /*
>+ * If getting the lock took more time than
>+ * client was willing to wait, drop it. b=11330
>+ */
>+ if (cfs_time_current_sec() > req->rq_arrival_time.tv_sec +
obd_timeout ||
>+ OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {
>+ no_reply = 1;
>+ CERROR("Dropping timed-out read request from %s "
>+ "because taking the lock: "LPX64"
took %ld seconds.\n",
>+ libcfs_id2str(req->rq_peer),
>+ lockh.cookie,
>+ cfs_time_current_sec() -
req->rq_arrival_time.tv_sec);
minor, but still pretty desirable: no point in printing lock cookie, it does
not tell us anything useful (without having debug lock.
Better to print object id ioo->ioo_id (so it will read like "... taking
the
lock on object NUMBER")
>@@ -925,6 +944,21 @@ static int ost_brw_write(struct ptlrpc_r
> if (rc != 0)
> GOTO(out_bulk, rc);
>
>+ /*
>+ * If getting the lock took more time than
>+ * client was willing to wait, drop it. b=11330
>+ */
>+ if (cfs_time_current_sec() > req->rq_arrival_time.tv_sec +
obd_timeout ||
>+ OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {
>+ no_reply = 1;
>+ CERROR("Dropping timed-out write request from %s
"
>+ "because taking the lock: "LPX64"
took %ld seconds.\n",
>+ libcfs_id2str(req->rq_peer),
>+ lockh.cookie,
>+ cfs_time_current_sec() -
req->rq_arrival_time.tv_sec);
Same as above.
Otherwise is good.
Needs to be landed on b1_4 and b1_5