th@llnl.gov
2006-Dec-27 13:27 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11277

This struck again:

2006-12-27 08:39:27 LustreError: 2947:0:(handler.c:1422:mds_handle()) operation 400 on unconnected MDS from 12345-173@elan1
2006-12-27 08:39:27 LustreError: 2947:0:(handler.c:1422:mds_handle()) Skipped 2 previous similar messages
2006-12-27 08:39:27 LustreError: 2947:0:(ldlm_lib.c:1275:target_send_reply_msg()) @@@ processing error (-107) req@000001002245dc00 x14975018/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret: /0/0 rc -107/0
2006-12-27 08:39:27 LustreError: 2947:0:(ldlm_lib.c:1275:target_send_reply_msg()) @@@ processing error (-107) req@000001003017ec00 x21346808/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret: /0/0 rc -107/0
2006-12-27 08:39:27 LustreError: 3156:0:(ldlm_lockd.c:1043:ldlm_handle_cancel()) received cancel for unknown lock cookie 0xc2df8baa28cc4cb0 from client adbfd3d6-1ca9-3c5e-e258-15872df55fcc id 12345-951@elan3
2006-12-27 08:39:27 LustreError: 2942:0:(handler.c:910:mds_getattr_lock()) ASSERTION(granted_lock != NULL) failed:1782474520/2811824055 lockh 0xc2df8baa28cc4be5
2006-12-27 08:39:27 LustreError: 2942:0:(linux-debug.c:130:lbug_with_loc()) LBUG
2006-12-27 08:39:27 ll_mdt_02 R running task 0 2942 1 2943 2941 (L-TLB)
2006-12-27 08:39:27 0000010220eee0b0 ffffffff8010e5f9 ffffffff8047de51 0000000000000246
2006-12-27 08:39:27 ffffffff8030187c ffffffff802ed4de 0000000000000000 0000000000000000
2006-12-27 08:39:27 000001021a85d798 00000000fffffffe
2006-12-27 08:39:27 Call Trace:<ffffffff80134275>{vprintk+515} <0>LustreError: 2945:0:(handler.c:910:mds_getattr_lock()) ASSERTION(granted_lock != NULL) failed:1782474520/2811824055 lockh 0xc2df8baa28cc4c24
2006-12-27 08:39:27 LustreError: 2945:0:(linux-debug.c:130:lbug_with_loc()) LBUG
2006-12-27 08:39:27 ll_mdt_05 R running task 0 2945 1 2946 2944 (L-TLB)
2006-12-27 08:39:27 0000010220eee0b0 ffffffff8010e5f9 ffffffff8047de51 0000000000000246
2006-12-27 08:39:27 ffffffff8030187c ffffffff802ed4de 0000000000000000 0000000000000000
2006-12-27 08:39:27 000001021a8d1798 00000000fffffffe
2006-12-27 08:39:27 Call Trace:<ffffffff80134275>{vprintk+515}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141} <ffffffff80134275>{vprintk+515} <ffffffff8013430e>{printk+141}
2006-12-27 08:39:27 <ffffffff80134275>{vprintk+515}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141} <ffffffff80145177>{__kernel_text_address+26} <ffffffff8011071c>{show_trace+375}
2006-12-27 08:39:27 <ffffffff80110858>{show_stack+241} <ffffffffa01d49f6>{:libcfs:lbug_with_loc+72}
2006-12-27 08:39:27 <ffffffffa03fc702>{:mds:mds_getattr_lock+1877} <ffffffffa0408c9e>{:mds:mds_intent_policy+2177}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141}
2006-12-27 08:39:27 <ffffffff8012f9d3>{__wake_up+54} <ffffffffa02a568a>{:ptlrpc:ldlm_lock_enqueue+209}
2006-12-27 08:39:27 <ffffffffa02c8d88>{:ptlrpc:ldlm_handle_enqueue+7012}
2006-12-27 08:39:27 <ffffffffa02c53c8>{:ptlrpc:ldlm_server_blocking_ast+0}
2006-12-27 08:39:27 <ffffffffa02c5fa6>{:ptlrpc:ldlm_server_completion_ast+0}
2006-12-27 08:39:27 <ffffffffa040346a>{:mds:mds_handle+26229} <ffffffff80145177>{__kernel_text_address+26} <ffffffff8011071c>{show_trace+375}
2006-12-27 08:39:27 <ffffffff80110858>{show_stack+241} <ffffffffa01d49f6>{:libcfs:lbug_with_loc+72}
2006-12-27 08:39:27 <ffffffffa03fc702>{:mds:mds_getattr_lock+1877} <ffffffffa0408c9e>{:mds:mds_intent_policy+2177}
2006-12-27 08:39:27 <ffffffff8012f9d3>{__wake_up+54} <ffffffffa02a568a>{:ptlrpc:ldlm_lock_enqueue+209}
2006-12-27 08:39:27 <ffffffffa02c8d88>{:ptlrpc:ldlm_handle_enqueue+7012}
2006-12-27 08:39:27 <ffffffffa02c53c8>{:ptlrpc:ldlm_server_blocking_ast+0}
2006-12-27 08:39:27 <ffffffffa02c5fa6>{:ptlrpc:ldlm_server_completion_ast+0}
2006-12-27 08:39:27 <ffffffffa040346a>{:mds:mds_handle+26229} <ffffffff801e2d53>{vsnprintf+1406}
2006-12-27 08:39:27 <ffffffff801e2d53>{vsnprintf+1406}
2006-12-27 08:39:27 <ffffffff801e2b25>{vsnprintf+848} <ffffffff801e2b25>{vsnprintf+848} <ffffffff801e2e36>{snprintf+131}
2006-12-27 08:39:27 <ffffffffa02eb45b>{:ptlrpc:ptlrpc_server_handle_request+2700}
2006-12-27 08:39:27 <ffffffffa02eb469>{:ptlrpc:ptlrpc_server_handle_request+2714}
2006-12-27 08:39:27 <ffffffff8013bb5c>{__mod_timer+293} <ffffffffa02ec5f3>{:ptlrpc:ptlrpc_main+2127}
2006-12-27 08:39:27 <ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0}
2006-12-27 08:39:27 <ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff8010ff3f>{child_rip+8}
2006-12-27 08:39:27 <ffffffffa02ebda4>{:ptlrpc:ptlrpc_main+0} <ffffffff8010ff37>{child_rip+0}
2006-12-27 08:39:27
2006-12-27 08:39:27 LustreError: dumping log to /var/tmp/lustre-log-pigs-mds1.1167237567.2942
2006-12-27 08:39:27 <ffffffff801e2e36>{snprintf+131}
2006-12-27 08:39:27 <ffffffffa02eb45b>{:ptlrpc:ptlrpc_server_handle_request+2700}
2006-12-27 08:39:27 <ffffffffa02eb469>{:ptlrpc:ptlrpc_server_handle_request+2714}
th@llnl.gov
2006-Dec-27 15:08 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
The dk log from the 12/27 event appears to have been lost; this MDS node is dropping dk logs frequently (but has only LBUGed once).
behlendorf1@llnl.gov
2007-Jan-03 11:26 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
This bug struck again on 1/3/07, taking the MDS for lscratcha (bigs-mds1) out of service. We dumped the thread state and crash dumped the node to get things back in service. Unfortunately, the crash dump didn't succeed. We do have some logs dumped by the LBUGs for analysis.
behlendorf1@llnl.gov
2007-Jan-03 14:08 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What     |Removed     |Added
---------|------------|-----------
Severity |S3 (normal) |S2 (major)
Priority |P3          |P2

Updated to a P2/S2 bug; this is causing extended FS downtimes. Full dk logs from the most recent failures have been uploaded to the FTP site as 'bigs-mds1.log.sorted.gz'. This log represents 277 individual log dumps spliced together with a 'sort -k4 -t: <files>', so there may be many multi-minute gaps.
behlendorf1@llnl.gov
2007-Jan-03 14:15 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9259)
--> https://bugzilla.lustre.org/attachment.cgi?id=9259&action=view
Console log

Console logs corresponding to the debug logs uploaded to the FTP site.
behlendorf1@llnl.gov
2007-Jan-03 18:31 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
-----|--------|------------
CC   |        |th@llnl.gov

Added Terry to the CC list.
green@clusterfs.com
2007-Jan-04 06:53 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
-----|--------|---------------------
CC   |        |shadow@clusterfs.com

Here is what happens:

1. The client performs an operation that takes a long time on the server (waiting to take a lock).
2. The client times out and resends the request.
3. The resent request arrives at the server; we look through the list of locks, find the lock originally requested in (1), take that lock and its cookie, decide this is the lock we already sent to the client, and start the resend preparations (it passes the lustre_handle_is_used() check in mds_getattr_lock()).
4. At this point the thread serving the request from (1) finally gets the lock it was waiting for and decides to return it to the client: it rewrites the remote handle and returns ELDLM_LOCK_REPLACED from mds_intent_policy(). This makes ldlm_lock_enqueue() drop the existing lock from (1) (and its cookie).
5. The thread serving the resent request then tries to get a lock from the (now invalid) cookie, fails, and trips the assertion.

The easy fix I can think of is to add an extra check in fixup_handle_for_resent_req() that the lock we are about to choose has actually been granted (req_mode == granted_mode). The drawback is that we will obviously redo the entire lock grant for the second request, but that should be enough for a quick fix; a sketch of the idea follows below.

Shadow: please start with a recovery-small.sh test to replicate the issue (you will need to strategically place OBD_FAIL/OBD_RACE in mds/handler.c), then try the solution described above.
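A minimal userspace sketch of the suggested check, assuming the resend path matches locks by their remote cookie. The struct and function names (model_lock, find_resent_lock) are invented for illustration only; the real change would go into fixup_handle_for_resent_req() in mds/handler.c.

/*
 * Simplified userspace model of the proposed check, NOT the real Lustre
 * code.  The idea: when a resent getattr request matches an existing lock
 * by remote cookie, only reuse that lock if it has actually been granted
 * (l_req_mode == l_granted_mode); otherwise fall through and enqueue a
 * fresh lock instead of racing with the thread still completing the
 * original enqueue.
 */
#include <stdio.h>

enum ldlm_mode { LCK_NONE = 0, LCK_CR, LCK_PR, LCK_PW, LCK_EX };

struct model_lock {
        unsigned long long l_remote_handle;   /* cookie sent to the client */
        enum ldlm_mode     l_req_mode;        /* mode the client asked for */
        enum ldlm_mode     l_granted_mode;    /* mode actually granted     */
};

/* Return the matched lock only when it is safe to hand back to a resend. */
static struct model_lock *find_resent_lock(struct model_lock *locks, int n,
                                           unsigned long long cookie)
{
        for (int i = 0; i < n; i++) {
                if (locks[i].l_remote_handle != cookie)
                        continue;
                /* Proposed extra check: skip locks that are still being granted. */
                if (locks[i].l_req_mode != locks[i].l_granted_mode)
                        return NULL;
                return &locks[i];
        }
        return NULL;
}

int main(void)
{
        struct model_lock locks[] = {
                /* the original request's lock, enqueue still in progress */
                { 0xc2df8baa28cc4be5ULL, LCK_PR, LCK_NONE },
                /* an unrelated, fully granted lock */
                { 0xc2df8baa28cc4c24ULL, LCK_PR, LCK_PR },
        };

        /* Resend matching the in-flight lock: must NOT be reused. */
        printf("in-flight cookie reused? %s\n",
               find_resent_lock(locks, 2, 0xc2df8baa28cc4be5ULL) ?
               "yes (bug)" : "no (redo enqueue)");
        /* Resend matching a granted lock: safe to reuse. */
        printf("granted cookie reused?  %s\n",
               find_resent_lock(locks, 2, 0xc2df8baa28cc4c24ULL) ?
               "yes" : "no");
        return 0;
}

Built as an ordinary C program, the first lookup refuses to reuse the still-ungranted lock while the second reuses the granted one, which is the behaviour the extra req_mode == granted_mode check is meant to enforce.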
pbojanic@clusterfs.com
2007-Jan-04 08:43 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What              |Removed            |Added
------------------|-------------------|---------------------------
CC                |                   |green@clusterfs.com
AssignedTo        |bugs@clusterfs.com |shadow@clusterfs.com
Status Whiteboard |                   |2007-01-04: initial analysis has been completed; engineer assigned to replicate the issue locally to test hypothesis

Comment Tags: Status Change

2007-01-04: initial analysis has been completed; engineer assigned to replicate the issue locally to test the hypothesis.
shadow@clusterfs.com
2007-Jan-05 11:52 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9278)
--> https://bugzilla.lustre.org/attachment.cgi?id=9278&action=view
Fix for returning a lock in the wrong state.

This patch blocks returning a lock that is in the wrong state (only locks with req_mode == granted_mode are returned). It also fixes a bug in OBD_RACE where a thread does not wake up after the second call to OBD_RACE; a simplified model of the intended rendezvous is sketched below.
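For reference, a minimal pthread model of the rendezvous OBD_RACE is expected to provide. This is not Lustre's fail_loc implementation; it only illustrates the behaviour the patch restores, namely that the second thread to reach the race point must wake the parked first thread.

/*
 * Userspace model of an OBD_RACE-style synchronisation point using plain
 * pthreads.  The first thread to reach the race point parks; the second
 * thread to reach it wakes the sleeper and both proceed.  The reported bug
 * was that the sleeper was never woken on the second call.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t race_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  race_cv  = PTHREAD_COND_INITIALIZER;
static int waiters;                     /* threads parked at the race point */

static void race_point(const char *who)
{
        pthread_mutex_lock(&race_mtx);
        if (waiters == 0) {
                /* First arrival: park until the partner shows up. */
                waiters = 1;
                printf("%s: parked at race point\n", who);
                while (waiters)
                        pthread_cond_wait(&race_cv, &race_mtx);
        } else {
                /* Second arrival: release the sleeper and continue. */
                waiters = 0;
                printf("%s: waking the parked thread\n", who);
                pthread_cond_broadcast(&race_cv);
        }
        pthread_mutex_unlock(&race_mtx);
}

static void *worker(void *arg)
{
        race_point((const char *)arg);
        printf("%s: past the race point\n", (const char *)arg);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, worker, "thread-A");
        pthread_create(&b, NULL, worker, "thread-B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}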
green@clusterfs.com
2007-Jan-05 17:00 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What                  |Removed                      |Added
----------------------|-----------------------------|----------------------------------------
Attachment #9278 Flag |review?(green@clusterfs.com) |review+, review?(adilger@clusterfs.com)

(From update of attachment 9278)
This patch should ease the pain of this specific problem at LLNL for now. But, as we recently learned, a different measure for this entire class of similar problems was developed for the CMD2 project, and what we should do is adopt that solution (do not allow a client to reconnect while some of its requests are still being processed). The OBD_RACE change will affect some of our existing regression tests, but should not affect anything during normal operation.

Andreas, why was it decided not to wake up the other thread in OBD_RACE, and instead only wake it up when the second thread finishes its job completely?
green@clusterfs.com
2007-Jan-08 15:47 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
I found that the more correct approach is in fact implemented in bug 1451. The patch needs to be updated for b1_4 and b1_5 and landed.
shadow@clusterfs.com
2007-Jan-11 04:15 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9318)
--> https://bugzilla.lustre.org/attachment.cgi?id=9318&action=view
Refuse connects from busy exports.

This is a port of the bug 1451 patch to b1_4. If a client has an incomplete request and tries to reconnect, its connect request is refused until the old request has completed. A simplified model of the behaviour is sketched below.
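A rough userspace model of the idea, assuming the export tracks how many of its requests are still in flight. The names model_export, exp_rpcs_in_flight and handle_connect are invented for illustration; the actual patch works on the real export structures in the b1_4 tree.

/*
 * Simplified model of "refuse connect from busy exports" -- not the actual
 * b1_4 patch.  While an export still has RPCs in flight from the previous
 * incarnation of the client, a reconnect attempt is rejected (here with
 * -EBUSY) so the server never processes a resent request concurrently with
 * the original one.
 */
#include <errno.h>
#include <stdio.h>

struct model_export {
        int exp_rpcs_in_flight; /* unfinished requests from this client */
        int exp_connected;
};

static int handle_connect(struct model_export *exp)
{
        if (exp->exp_rpcs_in_flight > 0) {
                /* Old requests are still being served: refuse the reconnect. */
                return -EBUSY;
        }
        exp->exp_connected = 1;
        return 0;
}

int main(void)
{
        struct model_export exp = { .exp_rpcs_in_flight = 1, .exp_connected = 0 };

        /* Reconnect while a request from the old connection is in flight. */
        printf("reconnect while busy: %d\n", handle_connect(&exp));

        /* Once the outstanding request completes, the reconnect succeeds. */
        exp.exp_rpcs_in_flight = 0;
        printf("reconnect when idle:  %d\n", handle_connect(&exp));
        return 0;
}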
green@clusterfs.com
2007-Jan-11 15:42 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What                  |Removed                      |Added
----------------------|-----------------------------|--------
Attachment #9318 Flag |review?(green@clusterfs.com) |review+

(From update of attachment 9318)
Looks good. This would also need to be landed on b1_5.