adilger@clusterfs.com
2007-Jan-17 16:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=9829
What                  |Removed                     |Added
----------------------------------------------------------------------------
Attachment #9086 Flag |                            |review?(alex@clusterfs.com)
(From update of attachment 9086)
Alex, can you please have a look at this patch? It seems relatively
straightforward to avoid adding an already-committed request to the replay
list, since the call to ptlrpc_free_committed() a few lines below will just
remove it, and having it on the list opens a window for this LASSERT to fail.
Any future change to support replayable bulk requests would remove the
offending LASSERT(), and this change would remain valid after that point.
adilger@clusterfs.com
2007-Jan-17 16:48 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  |Removed                     |Added
----------------------------------------------------------------------------
Attachment #9086 Flag |                            |review?(green@clusterfs.com)
(From update of attachment 9086)
Oleg, is there someone on Beaver who could work on the regression test for
this patch? If so, you could re-assign the inspection to them.
bobijam@clusterfs.com
2007-Jan-18 20:41 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9382)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9382&action=view)
regression test draft
bobijam@clusterfs.com
2007-Jan-18 20:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9383)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9383&action=view)
test output (dmesg) with the regression test, without the fix patch
Without the fix patch applied, running the regression test produces dmesg
output showing the "went back in time" message, but it doesn't LBUG/hang.
bobijam@clusterfs.com
2007-Jan-24 01:14 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                         |Removed |Added
----------------------------------------------------------------------------
Attachment #9382 is obsolete |0       |1
Created an attachment (id=9406)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9406&action=view)
fix typo (but generates similar test output to attachment 9383)
adilger@clusterfs.com
2007-Jan-26 01:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  |Removed                       |Added
----------------------------------------------------------------------------
Attachment #9406 Flag |review?(adilger@clusterfs.com)|review-
(From update of attachment 9406)
Test still doesn't hit the failure case, so we are no further ahead in
reproducing this problem.
bobijam@clusterfs.com
2007-Jan-28 23:16 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                         |Removed |Added
----------------------------------------------------------------------------
Attachment #9406 is obsolete |0       |1
Created an attachment (id=9435)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9435&action=view)
test patch tried
adilger,
I find that OBD_FAIL_TIMEOUT doesn't wait the specified number of seconds
before it wakes up again. I tried to fix this by checking the
cfs_schedule_timeout() return value, which is the number of jiffies remaining
until the timeout, as the patch shows.
I applied OBD_FAIL_TIMEOUT both before the "spin_lock(xxx)" and before
"ptlrpc_free_committed(imp)" (as the patch to ptlrpc/client.c shows), but
neither case hits the failure. :(
Any more suggestions on how to hit it?
adilger@clusterfs.com
2007-Jan-31 21:32 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
(From update of attachment 9435)
> #define cfs_schedule_timeout(s, t)              \
>-do {                                            \
>-        cfs_waitlink_t l;                       \
>-        cfs_waitq_timedwait(&l, s, t);          \
>-} while (0)
>+({                                              \
>+        cfs_duration_t _ret;                    \
>+        cfs_waitlink_t l;                       \
>+        _ret = cfs_waitq_timedwait(&l, s, t);   \
>+        _ret;                                   \
>+})

Strange, I haven't noticed such problems, but maybe I missed them? What other
code uses cfs_schedule_timeout(), and should it be changed to do the same thing
as OBD_FAIL_TIMEOUT? I think this change needs inspection from others with more
knowledge of this area, maybe Nikita and/or Oleg? What kernel were you testing
with?

> #define OBD_FAIL_TIMEOUT(id, secs)                                       \
> do {                                                                     \
>         if (OBD_FAIL_CHECK_ONCE(id)) {                                   \
>+                cfs_duration_t timeout = cfs_time_seconds(secs);         \
>                 CERROR("obd_fail_timeout id %x sleeping for %d secs\n",  \
>                        (id), (secs));                                    \
>+                do {                                                     \
>+                        set_current_state(TASK_UNINTERRUPTIBLE);         \
>+                        timeout = cfs_schedule_timeout(CFS_TASK_UNINT,   \
>+                                                       timeout);         \
>+                        CERROR("cfs_schedule_timeout return %ld\n", timeout);\
>+                } while (timeout > 0);                                   \

This could all be done inside the cfs_schedule_timeout() macro also?

>@@ -638,6 +638,7 @@ static int after_reply(struct ptlrpc_req
>         lustre_msg_set_transno(req->rq_reqmsg, req->rq_transno);
>
>         if (req->rq_import->imp_replayable) {
>+                //OBD_FAIL_TIMEOUT(OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY, obd_timeout);

We may as well just make this a separate failure location and make 2 versions
of test_8 (8a, 8b). I think you should also increase the timeout to be slightly
larger, like obd_timeout * 2, so that we definitely get into recovery and the
OST finishes recovery before this times out.

>+test_8() {
>+    ost_facet=${ost1_svc}
>+    do_facet ost1 $LCTL --device %$ost_facet readonly
>+    # don't set notransno - we want transactions to commit that are "lost"
>+    dd if=/dev/zero of=$DIR/$tfile bs=4k count=1 || error "dd $tfile failed"
>+    # might need an OBD_FAIL_TIMEOUT in after_reply() so the request is still
>+    # waiting on replay list when transno goes back in time and recovery starts
>+#define OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY 0x507
>+    do_facet ost1 "sysctl -w lustre.fail_loc=0x80000507"
>+
>+    sync; sleep 2; sync
>+    fail ost1
>+
>+    dmesg | grep "went back in time" || error "didn't go back in time"
>+    # would LBUG/hang here without this fix
>+    do_facet ost1 "sysctl -w lustre.fail_loc=0"
>+}
>+run_test 8 "Fail OST testing transno goes back"

I can't think of any other way to hit this failure, and this resembles the
customer failure case as best as I can tell. There were also reports that the
"fix" patch in attachment 9086 caused acceptance-small.sh to crash. Did you
have any similar problems when running acceptance-small.sh with the fix in
place?
bobijam@clusterfs.com
2007-Feb-01 01:16 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Yes, it oopses quickly.
# sh acceptance-small.sh
<snip>...
+ '[' -e lov.sh ']'
+ sh lov.sh
+ '[' '!' -e lov.xml ']'
+ '[' '' '!=' no ']'
+ sh runtests
loading module: libcfs srcdir ./../utils/../../lnet devdir libcfs
<snip>...
NETWORK: NET_foobar_tcp NET_foobar_tcp_UUID tcp foobar
OSD: ost1 ost1_UUID obdfilter ./tmp/ost1-foobar 150000 ldiskfs no 0 256
OST mount options: errors=remount-ro
<snip>...
creating /mnt/lustre/runtest.3125
copying files from /etc to /mnt/lustre/runtest.3125/etc at Thu Feb 1 19:37:25 EST 2007
<snip>...
tar: Removing leading `/' from member names
etc/hotplug.d/default/default.hotplug
etc/modprobe.conf.BeforeVMwareToolsInstall
<snip>...
etc/iproute2/rt_tables
etc/group-
Disconnecting: Timeout, server not responding.
and the dmesg
Process tar (pid: 3842, threadinfo=c8c5a000 task=cafd0030)
Stack: 00000001 c8c5bda4 00000000 00000246 00000050 c8c5bcf8 c014b0fc cbeef880
00000050 c8cb7c74 c8c5bcf8 ccae7507 c8cb7c74 c8c74650 c8c5bd3c ccadb563
c8c74650 00000001 c8c5bda4 00000000 ccb26b54 0000006b 00000000 c8c5a000
Call Trace:
[<c01063b3>] show_stack+0x80/0x96
[<c0106546>] show_registers+0x15d/0x1d6
[<c0106742>] die+0xfa/0x1a2
[<c011761a>] do_page_fault+0x463/0x648
[<c031ee13>] error_code+0x2f/0x38
[<cca5259d>] mdc_close+0x8f6/0xe56 [mdc]
[<ccc8ffff>] ll_close_inode_openhandle+0x176/0xcd9 [llite]
[<ccc90cab>] ll_mdc_real_close+0x149/0x51f [llite]
[<ccc913a0>] ll_mdc_close+0x31f/0x5ca [llite]
[<ccc91772>] ll_file_release+0x127/0x4ee [llite]
[<c0162e6b>] __fput+0x110/0x13f
[<c016157a>] filp_close+0x50/0x8e
[<c0161623>] sys_close+0x6b/0x8b
[<c031e3b3>] syscall_call+0x7/0xb
Code: ff e9 0a fe ff ff 55 89 e5 5d c3 55 89 e5 57 56 53 81 ec 94 00 00 00 8b
5d 08 8b 7d 10 8b 53 54 c7 45 f0 00 00 00 00 85 d2 74 07 <0f> b7 42 34 89 45
f0 8b 73 48 31 c9 85 f6 74 04 0f b7 4e 34 8b
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
bobijam@clusterfs.com
2007-Feb-05 01:23 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9503)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9503&action=view)
avoid adding requests to imp_replay_list that are already committed (revised)
Wang Di suggested this fix, which I've tested with acceptance-small.sh; it
avoids the oops.
adilger@clusterfs.com
2007-Feb-05 21:31 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  |Removed                       |Added
----------------------------------------------------------------------------
Attachment #9503 Flag |review?(adilger@clusterfs.com)|review+
(From update of attachment 9503)
>@@ -640,7 +640,11 @@ static int after_reply(struct ptlrpc_req
>         if (req->rq_import->imp_replayable) {
>                 spin_lock(&imp->imp_lock);
>-                if (req->rq_transno != 0)
>+                /* no point in adding already-committed requests to the
>+                 * replay list, we will just remove them immediately. b=9829 */
>+                if (req->rq_transno != 0 &&
>+                    (req->rq_transno > req->rq_repmsg->last_committed ||
>+                     req->rq_replay))
>                         ptlrpc_retain_replayable_request(req, imp);
Ah, good catch. Of course we need to save replayable requests regardless of
the transno.
Can you please fix the indenting to match the Lustre coding style:
https://mail.clusterfs.com/wikis/lustre/CodingGuidelines
                if (req->rq_transno != 0 &&
                    (req->rq_transno > req->rq_repmsg->last_committed ||
                     req->rq_replay))
Can you please land this on b1_4 for 1.4.10, and on b1_5.
bobijam@clusterfs.com
2007-Feb-09 00:44 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                         |Removed |Added
----------------------------------------------------------------------------
Attachment #9503 is obsolete |0       |1
Created an attachment (id=9554)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9554&action=view)
updated patch for b1_4