thr3ads.net - Lustre devel - [Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST [Dec 2006]

If this information is useful, please help other people find it:
Share via:

tianzy@clusterfs.com

2006-Dec-21 02:24 UTC

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330



Created an attachment (id=9198)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9198&action=view)
Following Adilger and Green''s sugguest, create this patch

green@clusterfs.com

2006-Dec-21 17:09 UTC

head link

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #9198|review?(green@clusterfs.com)|review+
               Flag|                            |


(From update of attachment 9198)
Looks good.
Few minor comments.
>@@ -695,8 +698,21 @@ static int ost_brw_read(struct ptlrpc_re
>+                /* fix the bug11330 */
This comment is too cryptic.
Better to say something like ''Check if getting the lock took more time
than
client was willing to wait''
>+                do_gettimeofday(&present_time);                
>+                timediff = cfs_timeval_sub(&present_time,
&req->rq_arrival_time,NULL);
>+                if (timediff / 1000000 > (long)obd_timeout) {
>+                        no_reply = 1;
>+                        CERROR("Dropping timed-out opc %d request from
%s"
>+                               ": %ld seconds old\n",
req->rq_reqmsg->opc,
>+                               libcfs_id2str(req->rq_peer),
>+                               timediff / 1000000);
The error message here should be a bit changed too, I think.
Something like ''Dropping opc ... because taking the lock took too
long'' (and
print resource for the lock too,
I think, now that we actully have it)

Same in ost_brw_write.

Also we probably need a test and FAIL_LOC injected to actually test this
codepath (and recovert-small.sh test too?)

adilger@clusterfs.com

2006-Dec-27 15:22 UTC

head link

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #9198|review?(adilger@clusterfs.co|review-
               Flag|m)                          |


(From update of attachment 9198)
>@@ -695,8 +698,21 @@ static int ost_brw_read(struct ptlrpc_re
>         rc = ost_brw_lock_get(LCK_PR, req->rq_export, ioo, pp_rnb,
&lockh);
>+        if (rc != 0) {
>                 GOTO(out_bulk, rc);
>+        } else {
Since this is an unconditional GOTO() you don''t need the
"else" and indentation
following it.
>+                /* fix the bug11330 */
Please have a more meaningful comment here, like "/* don''t process
request if
we were blocked longer than the timeout.  b=11330 */".
>+                do_gettimeofday(&present_time);                
>+                timediff = cfs_timeval_sub(&present_time,
&req->rq_arrival_time,NULL);
>+                if (timediff / 1000000 > (long)obd_timeout) {
>+                        no_reply = 1;
>+                        CERROR("Dropping timed-out opc %d request from
%s"
>+                               ": %ld seconds old\n",
req->rq_reqmsg->opc,
>+                               libcfs_id2str(req->rq_peer),
>+                               timediff / 1000000);
>+                        goto out_lock;
>+                }
[minor] Since this time difference is in the order of many seconds, and
obd_timeout is also seconds, we don''t need millisecond accuracy.  You
can use
cfs_time_current_sec() to just get the seconds here and avoid putting
present_time and timediff on the stack (it is fine to recalculate timediff for
the error message, since this should happen very rarely.

[minor] Also, to make the error message more useful, please put something like
"Dropping timed-out read request", instead of "opc 3
request".
>@@ -922,8 +943,22 @@ static int ost_brw_write(struct ptlrpc_r
>                 GOTO(out, rc = -ENOMEM);
> 
>         rc = ost_brw_lock_get(LCK_PW, req->rq_export, ioo, pp_rnb,
&lockh);
>-        if (rc != 0)
>+        if (rc != 0) {
>                 GOTO(out_bulk, rc);
>+        } else {
>+                /* fix the bug11330 */
>+                do_gettimeofday(&present_time);                
>+                timediff = cfs_timeval_sub(&present_time,
&req->rq_arrival_time,NULL);
>+                if (timediff / 1000000 > (long)obd_timeout) {
>+                        no_reply = 1;
>+                        CERROR("Dropping timed-out opc %d request from
%s"
>+                               ": %ld seconds old\n",
req->rq_reqmsg->opc,
>+                               libcfs_id2str(req->rq_peer),
>+                               timediff / 1000000);
>+                        goto out_lock;
>+                }
>+        }
[minor] All of the same changes here also.

tianzy@clusterfs.com

2006-Dec-29 01:23 UTC

head link

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #9198 is|0                           |1
           obsolete|                            |


Created an attachment (id=9243)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9243&action=view)
updated patch

updated issues:
1. inappropriate comments.
2. add OBD_FAIL test code.

adilger@clusterfs.com

2006-Dec-29 02:27 UTC

head link

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #9243|review?(adilger@clusterfs.co|review-
               Flag|m)                          |


(From update of attachment 9243)>Index: include/linux/obd_support.h
>==================================================================>RCS
file: /cvsroot/cfs/lustre-core/include/linux/obd_support.h,v
>retrieving revision 1.92.2.8.22.36
>diff -p -u -r1.92.2.8.22.36 obd_support.h
>--- include/linux/obd_support.h	21 Dec 2006 00:35:46 -0000	1.92.2.8.22.36
>+++ include/linux/obd_support.h	29 Dec 2006 08:18:04 -0000
>@@ -132,6 +132,7 @@ extern int obd_race_state;
>+#define OBD_FAIL_OST_DROP_REQ            0x21c
Note that 0x21c is already in use on b1_5.  Please use 0x21d.
>@@ -698,6 +699,26 @@ static int ost_brw_read(struct ptlrpc_re
>         if (rc != 0)
>                 GOTO(out_bulk, rc);
> 
>+        /* 
>+         * If getting the lock took more time than
>+         * client was willing to wait, drop it. b=11330
>+         */
>+        if (cfs_time_current_sec() > (time_t)obd_timeout ||
Sorry, this doesn''t make sense.  The current time (i.e. seconds since
1970)
will ALWAYS be > obd_timeout.  What you really want is:

	  if ((cfs_time_current_sec() > req->rq_arrival_time.tv_sec +
obd_timeout) ||
	      OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {
		  CERROR("Dropping timed-out read request from %s "
			 "because taking the lock: "LPX64" took %lu
seconds.\n",
			 libcfs_id2str(req->rq_peer),
			 lockh.cookie,
			 cfs_time_current_sec() - req->rq_arrival_time.tv_sec);
		  goto out_lock;
	  }
>+         * If getting the lock took more time than
>+         * client was willing to wait, drop it. b=11330
>+         */
>+        if (cfs_time_current_sec() > (time_t)obd_timeout || 
>+            OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {
Same here.

tianzy@clusterfs.com

2006-Dec-29 03:17 UTC

head link

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #9243 is|0                           |1
           obsolete|                            |


Created an attachment (id=9244)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9244&action=view)
updated patch

tianzy@clusterfs.com

2006-Dec-31 02:03 UTC

head link

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #9244 is|0                           |1
           obsolete|                            |


Created an attachment (id=9251)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9251&action=view)
updated patch(add test code)

tianzy@clusterfs.com

2006-Dec-31 02:08 UTC

head link

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #9251|review?(adilger@clusterfs.co|review?(green@clusterfs.com)
               Flag|m)                          |


(From update of attachment 9251)
This is the final version of this patch.
Because Adilger leaves for a long time, I need your approval before landing
this patch

green@clusterfs.com

2007-Jan-11 16:29 UTC

head link

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11330

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #9251|review?(green@clusterfs.com)|review+
               Flag|                            |


(From update of attachment 9251)>@@ -698,6 +699,21 @@ static int ost_brw_read(struct ptlrpc_re
>         if (rc != 0)
>                 GOTO(out_bulk, rc);
> 
>+        /* 
>+         * If getting the lock took more time than
>+         * client was willing to wait, drop it. b=11330
>+         */
>+        if (cfs_time_current_sec() > req->rq_arrival_time.tv_sec +
obd_timeout ||
>+            OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {
>+                no_reply = 1;
>+                CERROR("Dropping timed-out read request from %s "
>+                       "because taking the lock: "LPX64"
took %ld seconds.\n",
>+                       libcfs_id2str(req->rq_peer),
>+                       lockh.cookie,
>+                       cfs_time_current_sec() -
req->rq_arrival_time.tv_sec);
minor, but still pretty desirable: no point in printing lock cookie, it does
not tell us anything useful (without having debug lock.
Better to print object id ioo->ioo_id (so it will read like "... taking
the
lock on object NUMBER")
>@@ -925,6 +944,21 @@ static int ost_brw_write(struct ptlrpc_r
>         if (rc != 0)
>                 GOTO(out_bulk, rc);
> 
>+        /* 
>+         * If getting the lock took more time than
>+         * client was willing to wait, drop it. b=11330
>+         */
>+        if (cfs_time_current_sec() > req->rq_arrival_time.tv_sec +
obd_timeout ||
>+            OBD_FAIL_CHECK(OBD_FAIL_OST_DROP_REQ)) {
>+                no_reply = 1;
>+                CERROR("Dropping timed-out write request from %s
"
>+                       "because taking the lock: "LPX64"
took %ld seconds.\n",
>+                       libcfs_id2str(req->rq_peer),
>+                       lockh.cookie,
>+                       cfs_time_current_sec() -
req->rq_arrival_time.tv_sec);
Same as above.

Otherwise is good.
Needs to be landed on b1_4 and b1_5

Lustre devel - Dec 2006 - [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST

[Lustre-devel] [Bug 11330] liblustre application death during I/O locks up OST