adilger@clusterfs.com
2007-Jan-17 16:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=9829

What                  | Removed | Added
----------------------|---------|----------------------------
Attachment #9086 Flag |         | review?(alex@clusterfs.com)

(From update of attachment 9086)
Alex, can you please have a look at this patch? It seems relatively straightforward to avoid adding an already-committed request to the replay list: the call to ptlrpc_free_committed() a few lines below would just remove it again, and having it on the list opens a window for this LASSERT to fail. Any change to support replayable bulk requests would remove the offending LASSERT(), and this change would remain valid after that time.
adilger@clusterfs.com
2007-Jan-17 16:48 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  | Removed | Added
----------------------|---------|-----------------------------
Attachment #9086 Flag |         | review?(green@clusterfs.com)

(From update of attachment 9086)
Oleg, is there someone on Beaver who could work on the regression test for this patch? You could re-assign the inspection to them in that case.
bobijam@clusterfs.com
2007-Jan-18 20:41 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9382)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9382&action=view)
regression test draft
bobijam@clusterfs.com
2007-Jan-18 20:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9383)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9383&action=view)
test output (dmesg) with the regression test, without the fixing patch

Running the regression test produces the attached dmesg output, which shows the "went back in time" message but doesn't LBUG/hang.
bobijam@clusterfs.com
2007-Jan-24 01:14 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                      | Removed | Added
--------------------------|---------|------
Attachment #9382 obsolete | 0       | 1

Created an attachment (id=9406)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9406&action=view)
fix typo (but generates similar test output to attachment 9383)
adilger@clusterfs.com
2007-Jan-26 01:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  | Removed                        | Added
----------------------|--------------------------------|--------
Attachment #9406 Flag | review?(adilger@clusterfs.com) | review-

(From update of attachment 9406)
The test still doesn't hit the failure case, so we are no further ahead in reproducing this problem.
bobijam@clusterfs.com
2007-Jan-28 23:16 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                      | Removed | Added
--------------------------|---------|------
Attachment #9406 obsolete | 0       | 1

Created an attachment (id=9435)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9435&action=view)
test patch tried

adilger, I find that OBD_FAIL_TIMEOUT doesn't wait the specified number of seconds before it wakes up again. I tried fixing this here by checking the cfs_schedule_timeout() return value, which is the remaining jiffies until the timeout, as the patch shows. I applied OBD_FAIL_TIMEOUT both before the "spin_lock(xxx)" and before "ptlrpc_free_committed(imp)" (as the patch to ptlrpc/client.c shows); neither case hits the failure. :( Any more suggestions on how to hit it?
adilger@clusterfs.com
2007-Jan-31 21:32 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
(From update of attachment 9435)

> #define cfs_schedule_timeout(s, t)                      \
>-do {                                                    \
>-        cfs_waitlink_t l;                               \
>-        cfs_waitq_timedwait(&l, s, t);                  \
>-} while (0)
>+({                                                      \
>+        cfs_duration_t _ret;                            \
>+        cfs_waitlink_t l;                               \
>+        _ret = cfs_waitq_timedwait(&l, s, t);           \
>+        _ret;                                           \
>+})

Strange, I haven't noticed such problems, but maybe I missed them? What other code uses cfs_schedule_timeout(), and should it be changed to do the same thing as OBD_FAIL_TIMEOUT? I think this change needs inspection from others with more knowledge of this area, maybe Nikita and/or Oleg? What kernel were you testing with?

> #define OBD_FAIL_TIMEOUT(id, secs)                                      \
> do {                                                                    \
>         if (OBD_FAIL_CHECK_ONCE(id)) {                                  \
>+                cfs_duration_t timeout = cfs_time_seconds(secs);        \
>                 CERROR("obd_fail_timeout id %x sleeping for %d secs\n", \
>                        (id), (secs));                                   \
>+                do {                                                    \
>+                        set_current_state(TASK_UNINTERRUPTIBLE);        \
>+                        timeout = cfs_schedule_timeout(CFS_TASK_UNINT,  \
>+                                                       timeout);        \
>+                        CERROR("cfs_schedule_timeout return %ld\n", timeout);\
>+                } while (timeout > 0);                                  \

This could all be done inside the cfs_schedule_timeout() macro also?

>@@ -638,6 +638,7 @@ static int after_reply(struct ptlrpc_req
>                 lustre_msg_set_transno(req->rq_reqmsg, req->rq_transno);
> 
>         if (req->rq_import->imp_replayable) {
>+                //OBD_FAIL_TIMEOUT(OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY, obd_timeout);

We may as well just make this a separate failure location and make two versions of test_8 (8a, 8b).
I think you should also increase the timeout to be slightly larger, like obd_timeout * 2, so that we definitely get into recovery and the OST finishes recovery before this times out.

>+test_8() {
>+    ost_facet=${ost1_svc}
>+    do_facet ost1 $LCTL --device %$ost_facet readonly
>+    # don't set notransno - we want transactions to commit that are "lost"
>+    dd if=/dev/zero of=$DIR/$tfile bs=4k count=1 || error "dd $tfile failed"
>+    # might need an OBD_FAIL_TIMEOUT in after_reply() so the request is still
>+    # waiting on replay list when transno goes back in time and recovery starts
>+#define OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY 0x507
>+    do_facet ost1 "sysctl -w lustre.fail_loc=0x80000507"
>+
>+    sync; sleep 2; sync
>+    fail ost1
>+
>+    dmesg | grep "went back in time" || error "didn't go back in time"
>+    # would LBUG/hang here without this fix
>+    do_facet ost1 "sysctl -w lustre.fail_loc=0"
>+}
>+run_test 8 "Fail OST testing transno goes back"

I can't think of any other way to hit this failure, and this resembles the customer failure case as best as I can tell. There were also reports that the "fix" patch in attachment 9086 caused acceptance-small.sh to crash. Did you have any similar problems when running acceptance-small.sh with the fix in place?
bobijam@clusterfs.com
2007-Feb-01 01:16 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Yes, it oopses quickly.

# sh acceptance-small.sh
<snip>...
+ '[' -e lov.sh ']'
+ sh lov.sh
+ '[' '!' -e lov.xml ']'
+ '[' '' '!=' no ']'
+ sh runtests
loading module: libcfs srcdir ./../utils/../../lnet devdir libcfs
<snip>...
NETWORK: NET_foobar_tcp NET_foobar_tcp_UUID tcp foobar
OSD: ost1 ost1_UUID obdfilter ./tmp/ost1-foobar 150000 ldiskfs no 0 256
OST mount options: errors=remount-ro
<snip>...
creating /mnt/lustre/runtest.3125
copying files from /etc to /mnt/lustre/runtest.3125/etc at Thu Feb 1 19:37:25 EST 2007
<snip>...
tar: Removing leading `/' from member names
etc/hotplug.d/default/default.hotplug
etc/modprobe.conf.BeforeVMwareToolsInstall
<snip>...
etc/iproute2/rt_tables
etc/group-
Disconnecting: Timeout, server not responding.

and the dmesg:

Process tar (pid: 3842, threadinfo=c8c5a000 task=cafd0030)
Stack: 00000001 c8c5bda4 00000000 00000246 00000050 c8c5bcf8 c014b0fc cbeef880
       00000050 c8cb7c74 c8c5bcf8 ccae7507 c8cb7c74 c8c74650 c8c5bd3c ccadb563
       c8c74650 00000001 c8c5bda4 00000000 ccb26b54 0000006b 00000000 c8c5a000
Call Trace:
 [<c01063b3>] show_stack+0x80/0x96
 [<c0106546>] show_registers+0x15d/0x1d6
 [<c0106742>] die+0xfa/0x1a2
 [<c011761a>] do_page_fault+0x463/0x648
 [<c031ee13>] error_code+0x2f/0x38
 [<cca5259d>] mdc_close+0x8f6/0xe56 [mdc]
 [<ccc8ffff>] ll_close_inode_openhandle+0x176/0xcd9 [llite]
 [<ccc90cab>] ll_mdc_real_close+0x149/0x51f [llite]
 [<ccc913a0>] ll_mdc_close+0x31f/0x5ca [llite]
 [<ccc91772>] ll_file_release+0x127/0x4ee [llite]
 [<c0162e6b>] __fput+0x110/0x13f
 [<c016157a>] filp_close+0x50/0x8e
 [<c0161623>] sys_close+0x6b/0x8b
 [<c031e3b3>] syscall_call+0x7/0xb
Code: ff e9 0a fe ff ff 55 89 e5 5d c3 55 89 e5 57 56 53 81 ec 94 00 00 00 8b 5d 08 8b 7d 10 8b 53 54 c7 45 f0 00 00 00 00 85 d2 74 07 <0f> b7 42 34 89 45 f0 8b 73 48 31 c9 85 f6 74 04 0f b7 4e 34 8b
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
bobijam@clusterfs.com
2007-Feb-05 01:23 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9503)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9503&action=view)
avoid adding requests to imp_replay_list that are already committed (revised)

Wang Di suggested this fix, which I've tested with acceptance-small.sh; it avoids the oops.
adilger@clusterfs.com
2007-Feb-05 21:31 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  | Removed                        | Added
----------------------|--------------------------------|--------
Attachment #9503 Flag | review?(adilger@clusterfs.com) | review+

(From update of attachment 9503)

>@@ -640,7 +640,11 @@ static int after_reply(struct ptlrpc_req
>         if (req->rq_import->imp_replayable) {
>                 spin_lock(&imp->imp_lock);
>-                if (req->rq_transno != 0)
>+                /* no point in adding already-committed requests to the replay
>+                 * list, we will just remove them immediately. b=9829 */
>+                if (req->rq_transno != 0 &&
>+                    (req->rq_transno > req->rq_repmsg->last_committed ||
>+                     req->rq_replay))
>                         ptlrpc_retain_replayable_request(req, imp);

Ah, good catch. Of course we need to save replayable requests regardless of the transno. Can you please fix the indenting to match the Lustre coding style (https://mail.clusterfs.com/wikis/lustre/CodingGuidelines):

if (req->rq_transno != 0 &&
    (req->rq_transno > req->rq_repmsg->last_committed ||
     req->rq_replay))

Can you please land this on b1_4 for 1.4.10 and on b1_5.
bobijam@clusterfs.com
2007-Feb-09 00:44 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                      | Removed | Added
--------------------------|---------|------
Attachment #9503 obsolete | 0       | 1

Created an attachment (id=9554)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9554&action=view)
updated patch for b1_4