chas williams - CONTRACTOR
2011-Aug-08 16:03 UTC
[Lustre-discuss] [bug?] mdc_enter_request() problems
we have seen a few crashes that look like:

[250696.381575] RIP: 0010:[<ffffffffa0a1f9e4>] [<ffffffffa0a1f9e4>] mdc_exit_request+0x74/0xb0 [mdc]
...
[250696.381575] Call Trace:
[250696.381575] [<ffffffffa0a25042>] mdc_intent_getattr_async_interpret+0x82/0x500 [mdc]
[250696.381575] [<ffffffffa089efd0>] ptlrpc_check_set+0x200/0x1690 [ptlrpc]
[250696.381575] [<ffffffffa08d3140>] ptlrpcd_check+0x110/0x250 [ptlrpc]

and i sort of gather the problem arises from mdc_enter_request().
it allocates an mdc_cache_waiter on the stack and inserts it into the
wait list and then returns.

int mdc_enter_request(struct client_obd *cli)
...
        struct mdc_cache_waiter mcw;
...
                list_add_tail(&mcw.mcw_entry, &cli->cl_cache_waiters);
                init_waitqueue_head(&mcw.mcw_waitq);

later mdc_exit_request() finds this mcw by iterating the list.
seeing as mcw was allocated on the stack, i don't think you can do this.
mcw might have been reused by the time mdc_exit_request() gets around
to removing it.

void mdc_exit_request(struct client_obd *cli)
...
                mcw = list_entry(l, struct mdc_cache_waiter, mcw_entry);
On 2011-08-08, at 10:03 AM, chas williams - CONTRACTOR wrote:
> we have seen a few crashes that look like:
>
> [250696.381575] RIP: 0010:[<ffffffffa0a1f9e4>] [<ffffffffa0a1f9e4>] mdc_exit_request+0x74/0xb0 [mdc]
> ...
> [250696.381575] Call Trace:
> [250696.381575] [<ffffffffa0a25042>] mdc_intent_getattr_async_interpret+0x82/0x500 [mdc]
> [250696.381575] [<ffffffffa089efd0>] ptlrpc_check_set+0x200/0x1690 [ptlrpc]
> [250696.381575] [<ffffffffa08d3140>] ptlrpcd_check+0x110/0x250 [ptlrpc]
>
> and i sort of gather the problem arises from mdc_enter_request().
> it allocates an mdc_cache_waiter on the stack and inserts it into the
> wait list and then returns.
>
> int mdc_enter_request(struct client_obd *cli)
> ...
>         struct mdc_cache_waiter mcw;
> ...
>                 list_add_tail(&mcw.mcw_entry, &cli->cl_cache_waiters);
>                 init_waitqueue_head(&mcw.mcw_waitq);
>
> later mdc_exit_request() finds this mcw by iterating the list.
> seeing as mcw was allocated on the stack, i dont think you can do this.
> mcw might have been reused by the time mdc_exit_request() gets around
> to removing it.

What version of Lustre is this?

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
chas williams - CONTRACTOR
2011-Aug-08 18:10 UTC
[Lustre-discuss] [bug?] mdc_enter_request() problems
On Mon, 08 Aug 2011 12:03:25 -0400
chas williams - CONTRACTOR <chas at cmf.nrl.navy.mil> wrote:

> later mdc_exit_request() finds this mcw by iterating the list.
> seeing as mcw was allocated on the stack, i dont think you can do this.
> mcw might have been reused by the time mdc_exit_request() gets around
> to removing it.

nevermind. i see this has been fixed in later releases apparently (i
was looking at 1.8.5). if l_wait_event() returns "early" (like
from being interrupted) mdc_enter_request() does the cleanup itself now.
Hello!

I guess this is some sort of 1.8 due to the init_waitq_head call.
2.1 code is notably different in this case after LU-234 landed, namely
removing mcw_entry from the list on error.
The patch originates from bug 18213 and is claimed as a 1.8 port to 2.1,
but I don't see anything like this in the 1.8 patch.

Bye,
    Oleg

On Aug 8, 2011, at 2:07 PM, Andreas Dilger wrote:
> On 2011-08-08, at 10:03 AM, chas williams - CONTRACTOR wrote:
>> we have seen a few crashes that look like:
>>
>> [250696.381575] RIP: 0010:[<ffffffffa0a1f9e4>] [<ffffffffa0a1f9e4>] mdc_exit_request+0x74/0xb0 [mdc]
>> ...
>> [250696.381575] Call Trace:
>> [250696.381575] [<ffffffffa0a25042>] mdc_intent_getattr_async_interpret+0x82/0x500 [mdc]
>> [250696.381575] [<ffffffffa089efd0>] ptlrpc_check_set+0x200/0x1690 [ptlrpc]
>> [250696.381575] [<ffffffffa08d3140>] ptlrpcd_check+0x110/0x250 [ptlrpc]
>>
>> and i sort of gather the problem arises from mdc_enter_request().
>> it allocates an mdc_cache_waiter on the stack and inserts it into the
>> wait list and then returns.
>>
>> int mdc_enter_request(struct client_obd *cli)
>> ...
>>         struct mdc_cache_waiter mcw;
>> ...
>>                 list_add_tail(&mcw.mcw_entry, &cli->cl_cache_waiters);
>>                 init_waitqueue_head(&mcw.mcw_waitq);
>>
>> later mdc_exit_request() finds this mcw by iterating the list.
>> seeing as mcw was allocated on the stack, i dont think you can do this.
>> mcw might have been reused by the time mdc_exit_request() gets around
>> to removing it.
>
> What version of Lustre is this?
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer
> Whamcloud, Inc.

--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.
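The fix described in the two messages above (the waiter unlinks itself from the list when the wait ends early) can be sketched in plain C. This is only a single-threaded illustration, not the Lustre code: every name below (list_node, waiter, client, enter_request, exit_request, the wait_* callbacks) is made up for the example, the callback stands in for l_wait_event(), and the locking that the real code needs around the list is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_node {
    struct list_node *prev, *next;
};

static void list_init(struct list_node *h) { h->prev = h->next = h; }

static bool list_is_empty(const struct list_node *h) { return h->next == h; }

static void list_push_tail(struct list_node *n, struct list_node *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Unlink n and point it at itself, so a second removal is a no-op. */
static void list_remove(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

/* Hypothetical, simplified counterparts of mdc_cache_waiter and client_obd. */
struct waiter { struct list_node entry; };

struct client {
    struct list_node waiters;
    int in_flight;
    int max_in_flight;
};

void client_init(struct client *cli, int max_in_flight)
{
    list_init(&cli->waiters);
    cli->in_flight = 0;
    cli->max_in_flight = max_in_flight;
}

/* Release a slot: hand it to the first waiter if there is one. */
void exit_request(struct client *cli)
{
    if (!list_is_empty(&cli->waiters)) {
        list_remove(cli->waiters.next); /* slot passes to that waiter */
        return;
    }
    cli->in_flight--;
}

/* Stand-in for the wait: 0 means woken by exit_request(), nonzero means
 * the wait ended early (for example, interrupted). */
typedef int (*wait_fn)(struct client *cli, struct waiter *w);

int wait_interrupted(struct client *cli, struct waiter *w)
{
    (void)cli; (void)w;
    return -4; /* assumed error code, chosen to resemble -EINTR */
}

int wait_woken(struct client *cli, struct waiter *w)
{
    (void)w;
    exit_request(cli); /* simulate another request finishing */
    return 0;
}

int enter_request(struct client *cli, wait_fn wait)
{
    struct waiter w; /* lives on this stack frame only */
    int rc;

    if (cli->in_flight < cli->max_in_flight) {
        cli->in_flight++;
        return 0;
    }

    list_push_tail(&w.entry, &cli->waiters);
    rc = wait(cli, &w);

    /* The key step: make sure w is off the list before the frame dies.
     * After a normal wake-up it is already unlinked and this does nothing;
     * after an early return this is what prevents a dangling entry. */
    list_remove(&w.entry);
    return rc;
}
```

With the unconditional list_remove() before returning, the shared list can never hold a pointer into a stack frame that no longer exists, whichever way the wait ends. Without it, an early return would leave the entry on the list for a later exit_request() to walk into.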
Kevin Van Maren
2011-Aug-09 16:29 UTC
[Lustre-discuss] [bug?] mdc_enter_request() problems
chas williams - CONTRACTOR wrote:
> On Mon, 08 Aug 2011 12:03:25 -0400
> chas williams - CONTRACTOR <chas at cmf.nrl.navy.mil> wrote:
>
>> later mdc_exit_request() finds this mcw by iterating the list.
>> seeing as mcw was allocated on the stack, i dont think you can do this.
>> mcw might have been reused by the time mdc_exit_request() gets around
>> to removing it.
>
> nevermind. i see this has been fixed in later releases apparently (i
> was looking at 1.8.5). if l_wait_event() returns "early" (like
> from being interrupted) mdc_enter_request() does the cleanup itself now.

That code is unchanged in 1.8.6.

Kevin
chas williams - CONTRACTOR
2011-Aug-09 16:56 UTC
[Lustre-discuss] [bug?] mdc_enter_request() problems
On Tue, 09 Aug 2011 10:29:43 -0600
Kevin Van Maren <kevin.van.maren at oracle.com> wrote:

> chas williams - CONTRACTOR wrote:
> > nevermind. i see this has been fixed in later releases apparently (i
> > was looking at 1.8.5). if l_wait_event() returns "early" (like
> > from being interrupted) mdc_enter_request() does the cleanup itself now.
>
> That code is unchanged in 1.8.6.

it appears to have been fixed in the 2.x releases. i think this is the
relevant change:

http://review.whamcloud.com/#change,506
Johann Lombardi
2011-Aug-10 06:33 UTC
[Lustre-discuss] [bug?] mdc_enter_request() problems
On Tue, Aug 09, 2011 at 10:29:43AM -0600, Kevin Van Maren wrote:
> That code is unchanged in 1.8.6.

The two relevant patches for 1.8 are the following:

http://review.whamcloud.com/#change,457
http://review.whamcloud.com/#change,506

Both patches are included in 1.8.6-wc1 and waiting for landing approval
on Oracle's side (see bugzilla 24508).

Cheers, Johann
--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com