adilger@clusterfs.com
2007-Jan-17 16:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=9829

What                  | Removed | Added
----------------------|---------|----------------------------
Attachment #9086 Flag |         | review?(alex@clusterfs.com)

(From update of attachment 9086)
Alex, can you please have a look at this patch? It seems relatively straightforward to avoid adding an already-committed request to the replay list: the call to ptlrpc_free_committed() a few lines below would just remove it again, and having it on the list opens a window for this LASSERT to fail. Any change to support replayable bulk requests would remove the offending LASSERT(), and this change would remain valid after that time.
adilger@clusterfs.com
2007-Jan-17 16:48 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  | Removed | Added
----------------------|---------|-----------------------------
Attachment #9086 Flag |         | review?(green@clusterfs.com)

(From update of attachment 9086)
Oleg, is there someone on Beaver who could work on the regression test for this patch? You could re-assign the inspection to them in that case.
bobijam@clusterfs.com
2007-Jan-18 20:41 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9382)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9382&action=view)
regression test draft
bobijam@clusterfs.com
2007-Jan-18 20:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9383)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9383&action=view)
test output (dmesg) with the regression test, without the fixing patch

Running the regression test produces the attached dmesg output, which shows the "went back in time" message but doesn't LBUG/hang.
bobijam@clusterfs.com
2007-Jan-24 01:14 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                      | Removed | Added
--------------------------|---------|------
Attachment #9382 obsolete | 0       | 1

Created an attachment (id=9406)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9406&action=view)
fix typo (but generates similar test output to attachment 9383)
adilger@clusterfs.com
2007-Jan-26 01:47 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  | Removed                        | Added
----------------------|--------------------------------|--------
Attachment #9406 Flag | review?(adilger@clusterfs.com) | review-

(From update of attachment 9406)
The test still doesn't hit the failure case, so we are no further ahead in reproducing this problem.
bobijam@clusterfs.com
2007-Jan-28 23:16 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                      | Removed | Added
--------------------------|---------|------
Attachment #9406 obsolete | 0       | 1

Created an attachment (id=9435)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9435&action=view)
test patch tried

adilger, I find that OBD_FAIL_TIMEOUT doesn't wait the specified number of seconds before it wakes up again. I tried fixing this here by checking the cfs_schedule_timeout() return value, which is the remaining jiffies until the timeout, as the patch shows. I applied OBD_FAIL_TIMEOUT both before the "spin_lock(xxx)" and before "ptlrpc_free_committed(imp)" (as the patch to ptlrpc/client.c shows); neither case hits the failure. :( Any more suggestions on how to hit it?
adilger@clusterfs.com
2007-Jan-31 21:32 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
(From update of attachment 9435)

> #define cfs_schedule_timeout(s, t)                      \
>-do {                                                    \
>-        cfs_waitlink_t l;                               \
>-        cfs_waitq_timedwait(&l, s, t);                  \
>-} while (0)
>+({                                                      \
>+        cfs_duration_t _ret;                            \
>+        cfs_waitlink_t l;                               \
>+        _ret = cfs_waitq_timedwait(&l, s, t);           \
>+        _ret;                                           \
>+})

Strange, I haven't noticed such problems, but maybe I missed them? What other code uses cfs_schedule_timeout(), and should it be changed to do the same thing as OBD_FAIL_TIMEOUT? I think this change needs inspection from others with more knowledge of this area, maybe Nikita and/or Oleg? What kernel were you testing with?

> #define OBD_FAIL_TIMEOUT(id, secs)                                      \
> do {                                                                    \
>         if (OBD_FAIL_CHECK_ONCE(id)) {                                  \
>+                cfs_duration_t timeout = cfs_time_seconds(secs);        \
>                 CERROR("obd_fail_timeout id %x sleeping for %d secs\n", \
>                        (id), (secs));                                   \
>+                do {                                                    \
>+                        set_current_state(TASK_UNINTERRUPTIBLE);        \
>+                        timeout = cfs_schedule_timeout(CFS_TASK_UNINT,  \
>+                                                       timeout);        \
>+                        CERROR("cfs_schedule_timeout return %ld\n", timeout);\
>+                } while (timeout > 0);                                  \

This could all be done inside the cfs_schedule_timeout() macro also?

>@@ -638,6 +638,7 @@ static int after_reply(struct ptlrpc_req
>                 lustre_msg_set_transno(req->rq_reqmsg, req->rq_transno);
> 
>         if (req->rq_import->imp_replayable) {
>+                //OBD_FAIL_TIMEOUT(OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY, obd_timeout);

We may as well just make this a separate failure location and make two versions of test_8 (8a, 8b).
I think you should also increase the timeout to be slightly larger, like obd_timeout * 2, so that we definitely get into recovery and the OST finishes recovery before this times out.

>+test_8() {
>+    ost_facet=${ost1_svc}
>+    do_facet ost1 $LCTL --device %$ost_facet readonly
>+    # don't set notransno - we want transactions to commit that are "lost"
>+    dd if=/dev/zero of=$DIR/$tfile bs=4k count=1 || error "dd $tfile failed"
>+    # might need an OBD_FAIL_TIMEOUT in after_reply() so the request is still
>+    # waiting on replay list when transno goes back in time and recovery starts
>+#define OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY 0x507
>+    do_facet ost1 "sysctl -w lustre.fail_loc=0x80000507"
>+
>+    sync; sleep 2; sync
>+    fail ost1
>+
>+    dmesg | grep "went back in time" || error "didn't go back in time"
>+    # would LBUG/hang here without this fix
>+    do_facet ost1 "sysctl -w lustre.fail_loc=0"
>+}
>+run_test 8 "Fail OST testing transno goes back"

I can't think of any other way to hit this failure, and this resembles the customer failure case as best as I can tell. There were also reports that the "fix" patch in attachment 9086 caused acceptance-small.sh to crash. Did you have any similar problems when running acceptance-small.sh with the fix in place?
bobijam@clusterfs.com
2007-Feb-01 01:16 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Yes, it oopses quickly.

# sh acceptance-small.sh
<snip>...
+ '[' -e lov.sh ']'
+ sh lov.sh
+ '[' '!' -e lov.xml ']'
+ '[' '' '!=' no ']'
+ sh runtests
loading module: libcfs srcdir ./../utils/../../lnet devdir libcfs
<snip>...
NETWORK: NET_foobar_tcp NET_foobar_tcp_UUID tcp foobar
OSD: ost1 ost1_UUID obdfilter ./tmp/ost1-foobar 150000 ldiskfs no 0 256
OST mount options: errors=remount-ro
<snip>...
creating /mnt/lustre/runtest.3125
copying files from /etc to /mnt/lustre/runtest.3125/etc at Thu Feb 1 19:37:25 EST 2007
<snip>...
tar: Removing leading `/' from member names
etc/hotplug.d/default/default.hotplug
etc/modprobe.conf.BeforeVMwareToolsInstall
<snip>...
etc/iproute2/rt_tables
etc/group-
Disconnecting: Timeout, server not responding.

and the dmesg:

Process tar (pid: 3842, threadinfo=c8c5a000 task=cafd0030)
Stack: 00000001 c8c5bda4 00000000 00000246 00000050 c8c5bcf8 c014b0fc cbeef880
       00000050 c8cb7c74 c8c5bcf8 ccae7507 c8cb7c74 c8c74650 c8c5bd3c ccadb563
       c8c74650 00000001 c8c5bda4 00000000 ccb26b54 0000006b 00000000 c8c5a000
Call Trace:
 [<c01063b3>] show_stack+0x80/0x96
 [<c0106546>] show_registers+0x15d/0x1d6
 [<c0106742>] die+0xfa/0x1a2
 [<c011761a>] do_page_fault+0x463/0x648
 [<c031ee13>] error_code+0x2f/0x38
 [<cca5259d>] mdc_close+0x8f6/0xe56 [mdc]
 [<ccc8ffff>] ll_close_inode_openhandle+0x176/0xcd9 [llite]
 [<ccc90cab>] ll_mdc_real_close+0x149/0x51f [llite]
 [<ccc913a0>] ll_mdc_close+0x31f/0x5ca [llite]
 [<ccc91772>] ll_file_release+0x127/0x4ee [llite]
 [<c0162e6b>] __fput+0x110/0x13f
 [<c016157a>] filp_close+0x50/0x8e
 [<c0161623>] sys_close+0x6b/0x8b
 [<c031e3b3>] syscall_call+0x7/0xb
Code: ff e9 0a fe ff ff 55 89 e5 5d c3 55 89 e5 57 56 53 81 ec 94 00 00 00 8b 5d 08 8b 7d 10 8b 53 54 c7 45 f0 00 00 00 00 85 d2 74 07 <0f> b7 42 34 89 45 f0 8b 73 48 31 c9 85 f6 74 04 0f b7 4e 34 8b
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
bobijam@clusterfs.com
2007-Feb-05 01:23 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
Created an attachment (id=9503)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9503&action=view)
avoid adding requests to imp_replay_list that are already committed (revised)

Wang Di suggested this fix, which I've tested with acceptance-small.sh; it avoids the oops.
adilger@clusterfs.com
2007-Feb-05 21:31 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                  | Removed                        | Added
----------------------|--------------------------------|--------
Attachment #9503 Flag | review?(adilger@clusterfs.com) | review+

(From update of attachment 9503)

>@@ -640,7 +640,11 @@ static int after_reply(struct ptlrpc_req
>         if (req->rq_import->imp_replayable) {
>                 spin_lock(&imp->imp_lock);
>-                if (req->rq_transno != 0)
>+                /* no point in adding already-committed requests to the replay
>+                 * list, we will just remove them immediately. b=9829 */
>+                if (req->rq_transno != 0 &&
>+                    (req->rq_transno > req->rq_repmsg->last_committed ||
>+                     req->rq_replay))
>                         ptlrpc_retain_replayable_request(req, imp);

Ah, good catch. Of course we need to save replayable requests regardless of the transno. Can you please fix the indenting to match the Lustre coding style (https://mail.clusterfs.com/wikis/lustre/CodingGuidelines):

if (req->rq_transno != 0 &&
    (req->rq_transno > req->rq_repmsg->last_committed ||
     req->rq_replay))

Can you please land this on b1_4 for 1.4.10 and on b1_5.
bobijam@clusterfs.com
2007-Feb-09 00:44 UTC
[Lustre-devel] [Bug 9829] ptlrpc_replay_req()) ASSERTION(req->rq_bulk == NULL) failed
What                      | Removed | Added
--------------------------|---------|------
Attachment #9503 obsolete | 0       | 1

Created an attachment (id=9554)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9554&action=view)
updated patch for b1_4