th@llnl.gov
2006-Dec-27 13:27 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11277
This struck again:
2006-12-27 08:39:27 LustreError: 2947:0:(handler.c:1422:mds_handle()) operation 400 on unconnected MDS from 12345-173@elan1
2006-12-27 08:39:27 LustreError: 2947:0:(handler.c:1422:mds_handle()) Skipped 2 previous similar messages
2006-12-27 08:39:27 LustreError: 2947:0:(ldlm_lib.c:1275:target_send_reply_msg()) @@@ processing error (-107) req@000001002245dc00 x14975018/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret:/0/0 rc -107/0
2006-12-27 08:39:27 LustreError: 2947:0:(ldlm_lib.c:1275:target_send_reply_msg()) @@@ processing error (-107) req@000001003017ec00 x21346808/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret:/0/0 rc -107/0
2006-12-27 08:39:27 LustreError: 3156:0:(ldlm_lockd.c:1043:ldlm_handle_cancel()) received cancel for unknown lock cookie 0xc2df8baa28cc4cb0 from client adbfd3d6-1ca9-3c5e-e258-15872df55fcc id 12345-951@elan3
2006-12-27 08:39:27 LustreError: 2942:0:(handler.c:910:mds_getattr_lock()) ASSERTION(granted_lock != NULL) failed:1782474520/2811824055 lockh 0xc2df8baa28cc4be5
2006-12-27 08:39:27 LustreError: 2942:0:(linux-debug.c:130:lbug_with_loc()) LBUG
2006-12-27 08:39:27 ll_mdt_02 R running task 0 2942 1 2943 2941 (L-TLB)
2006-12-27 08:39:27 0000010220eee0b0 ffffffff8010e5f9 ffffffff8047de51 0000000000000246
2006-12-27 08:39:27 ffffffff8030187c ffffffff802ed4de 0000000000000000 0000000000000000
2006-12-27 08:39:27 000001021a85d798 00000000fffffffe
2006-12-27 08:39:27 Call Trace:<ffffffff80134275>{vprintk+515}
<0>LustreError: 2945:0:(handler.c:910:mds_getattr_lock()) ASSERTION(granted_lock != NULL) failed:1782474520/2811824055 lockh 0xc2df8baa28cc4c24
2006-12-27 08:39:27 LustreError: 2945:0:(linux-debug.c:130:lbug_with_loc()) LBUG
2006-12-27 08:39:27 ll_mdt_05 R running task 0 2945 1 2946 2944 (L-TLB)
2006-12-27 08:39:27 0000010220eee0b0 ffffffff8010e5f9 ffffffff8047de51 0000000000000246
2006-12-27 08:39:27 ffffffff8030187c ffffffff802ed4de 0000000000000000 0000000000000000
2006-12-27 08:39:27 000001021a8d1798 00000000fffffffe
2006-12-27 08:39:27 Call Trace:<ffffffff80134275>{vprintk+515}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141}
<ffffffff80134275>{vprintk+515} <ffffffff8013430e>{printk+141}
2006-12-27 08:39:27 <ffffffff80134275>{vprintk+515}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141}
<ffffffff80145177>{__kernel_text_address+26}
<ffffffff8011071c>{show_trace+375}
2006-12-27 08:39:27 <ffffffff80110858>{show_stack+241}
<ffffffffa01d49f6>{:libcfs:lbug_with_loc+72}
2006-12-27 08:39:27 <ffffffffa03fc702>{:mds:mds_getattr_lock+1877}
<ffffffffa0408c9e>{:mds:mds_intent_policy+2177}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141}
2006-12-27 08:39:27 <ffffffff8012f9d3>{__wake_up+54}
<ffffffffa02a568a>{:ptlrpc:ldlm_lock_enqueue+209}
2006-12-27 08:39:27
<ffffffffa02c8d88>{:ptlrpc:ldlm_handle_enqueue+7012}
2006-12-27 08:39:27
<ffffffffa02c53c8>{:ptlrpc:ldlm_server_blocking_ast+0}
2006-12-27 08:39:27
<ffffffffa02c5fa6>{:ptlrpc:ldlm_server_completion_ast+0}
2006-12-27 08:39:27 <ffffffffa040346a>{:mds:mds_handle+26229}
<ffffffff80145177>{__kernel_text_address+26}
<ffffffff8011071c>{show_trace+375}
2006-12-27 08:39:27 <ffffffff80110858>{show_stack+241}
<ffffffffa01d49f6>{:libcfs:lbug_with_loc+72}
2006-12-27 08:39:27 <ffffffffa03fc702>{:mds:mds_getattr_lock+1877}
<ffffffffa0408c9e>{:mds:mds_intent_policy+2177}
2006-12-27 08:39:27 <ffffffff8012f9d3>{__wake_up+54}
<ffffffffa02a568a>{:ptlrpc:ldlm_lock_enqueue+209}
2006-12-27 08:39:27
<ffffffffa02c8d88>{:ptlrpc:ldlm_handle_enqueue+7012}
2006-12-27 08:39:27
<ffffffffa02c53c8>{:ptlrpc:ldlm_server_blocking_ast+0}
2006-12-27 08:39:27
<ffffffffa02c5fa6>{:ptlrpc:ldlm_server_completion_ast+0}
2006-12-27 08:39:27 <ffffffffa040346a>{:mds:mds_handle+26229}
<ffffffff801e2d53>{vsnprintf+1406}
2006-12-27 08:39:27 <ffffffff801e2d53>{vsnprintf+1406}
2006-12-27 08:39:27 <ffffffff801e2b25>{vsnprintf+848}
<ffffffff801e2b25>{vsnprintf+848} <ffffffff801e2e36>{snprintf+131}
2006-12-27 08:39:27
<ffffffffa02eb45b>{:ptlrpc:ptlrpc_server_handle_request+2700}
2006-12-27 08:39:27
<ffffffffa02eb469>{:ptlrpc:ptlrpc_server_handle_request+2714}
2006-12-27 08:39:27 <ffffffff8013bb5c>{__mod_timer+293}
<ffffffffa02ec5f3>{:ptlrpc:ptlrpc_main+2127}
2006-12-27 08:39:27
<ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0}
<ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0}
2006-12-27 08:39:27
<ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0}
<ffffffff8010ff3f>{child_rip+8}
2006-12-27 08:39:27 <ffffffffa02ebda4>{:ptlrpc:ptlrpc_main+0}
<ffffffff8010ff37>{child_rip+0}
2006-12-27 08:39:27
2006-12-27 08:39:27 LustreError: dumping log to
/var/tmp/lustre-log-pigs-mds1.1167237567.2942
2006-12-27 08:39:27 <ffffffff801e2e36>{snprintf+131}
2006-12-27 08:39:27
<ffffffffa02eb45b>{:ptlrpc:ptlrpc_server_handle_request+2700}
2006-12-27 08:39:27
<ffffffffa02eb469>{:ptlrpc:ptlrpc_server_handle_request+2714}
th@llnl.gov
2006-Dec-27 15:08 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
The dk log from the 12/27 event appears to have been lost; this MDS node is dropping dk logs frequently (but has only LBUGed once).
behlendorf1@llnl.gov
2007-Jan-03 11:26 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
This bug struck again on 1/3/07, taking the MDS for lscratcha (bigs-mds1) out of service. We dumped the thread state and crash dumped the node to get things back in service. Unfortunately, the crash dump didn't succeed. We do have some logs dumped by the LBUGs for analysis.
behlendorf1@llnl.gov
2007-Jan-03 14:08 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
----------------------------------------------------------------------------
Severity|S3 (normal) |S2 (major)
Priority|P3 |P2
Updated to a P2/S2 bug; this is causing extended FS downtimes. Full dk logs from
the most recent failures have been uploaded to the FTP site as
'bigs-mds1.log.sorted.gz'. This log represents 277 individual log dumps
spliced together with 'sort -k4 -t: <files>', so there may be many
multi-minute gaps.
behlendorf1@llnl.gov
2007-Jan-03 14:15 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9259)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9259&action=view)
Console log
Console logs for the debug log uploaded to the FTP site.
behlendorf1@llnl.gov
2007-Jan-03 18:31 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
----------------------------------------------------------------------------
CC| |th@llnl.gov
Added Terry to CC list
green@clusterfs.com
2007-Jan-04 06:53 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
----------------------------------------------------------------------------
CC| |shadow@clusterfs.com
Here is what happens:
1. The client does some operation that takes a long time on the server (e.g. waiting to take a lock).
2. The client times out and resends the request.
3. The new request arrives at the server; we look in the list of locks, find the lock originally requested in (1), take that lock and its cookie, decide it is the lock we already sent to the client, and start the resend preparations (it passes the lustre_handle_is_used() check in mds_getattr_lock()).
4. At this point the thread serving the request from (1) finally gets the lock it was waiting for and decides to return it to the client; it rewrites the remote handle and returns ELDLM_LOCK_REPLACED from mds_intent_policy(). This makes ldlm_lock_enqueue() drop the existing lock from (1) (and its cookie).
5. The thread serving the resent request then tries to look up the lock from the (now invalid) cookie, fails, and trips the assertion.
An easy fix would be to add an extra check in fixup_handle_for_resent_req()
that the lock we are about to choose is actually granted (req_mode ==
granted_mode). The drawback is that we will obviously redo the entire lock
grant for the second request, but that should be enough for a quick fix.
A minimal sketch of the intended check follows.
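To make the proposed check concrete, here is a minimal user-space sketch. The types and the reuse_for_resent_req() helper are simplified stand-ins, not the real Lustre structures or the actual fixup_handle_for_resent_req() code; only the req_mode == granted_mode test reflects the fix described above, and the modes/cookies are illustrative.

/* Illustrative sketch only: simplified stand-ins for the Lustre types. */
#include <stdbool.h>
#include <stdio.h>

enum ldlm_mode { LCK_NONE = 0, LCK_CR, LCK_PR, LCK_PW, LCK_EX };

struct fake_lock {
        unsigned long long cookie;       /* handle sent to the client    */
        enum ldlm_mode     req_mode;     /* mode the client asked for    */
        enum ldlm_mode     granted_mode; /* mode actually granted so far */
};

/* Only reuse a lock found while handling a resent request if it has
 * really been granted; otherwise fall through and redo the enqueue. */
static bool reuse_for_resent_req(const struct fake_lock *lock)
{
        return lock->granted_mode == lock->req_mode;
}

int main(void)
{
        struct fake_lock waiting = { 0xc2df8baa28cc4be5ULL, LCK_PR, LCK_NONE };
        struct fake_lock granted = { 0xc2df8baa28cc4c24ULL, LCK_PR, LCK_PR };

        printf("still-waiting lock reusable: %d\n", reuse_for_resent_req(&waiting));
        printf("granted lock reusable:       %d\n", reuse_for_resent_req(&granted));
        return 0;
}

A lock that is still waiting to be granted (the situation in step 4 above) fails the test, so the resent request falls back to a fresh enqueue instead of holding on to a cookie that the original thread may invalidate.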
Shadow: please start with a recovery-small.sh test to replicate the issue (you
will need to strategically place OBD_FAIL/OBD_RACE in mds/handler.c), then try
the solution I described above.
pbojanic@clusterfs.com
2007-Jan-04 08:43 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
----------------------------------------------------------------------------
CC| |green@clusterfs.com
AssignedTo|bugs@clusterfs.com |shadow@clusterfs.com
Status Whiteboard| |2007-01-04: initial analysis has been completed; engineer assigned to replicate the issue locally to test hypothesis
Comment Tags: Status Change
2007-01-04: initial analysis has been completed; engineer assigned to replicate
the issue locally to test hypothesis
shadow@clusterfs.com
2007-Jan-05 11:52 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9278)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9278&action=view)
fix return of lock in wrong state
This patch blocks returning a lock in the wrong state (requires req_mode == granted_mode). It also fixes a bug in OBD_RACE where a thread was not woken up after the second call to OBD_RACE.
green@clusterfs.com
2007-Jan-05 17:00 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9278 Flag|review?(green@clusterfs.com)|review+, review?(adilger@clusterfs.com)
(From update of attachment 9278)
This patch should ease the pain of this specific problem at LLNL for now.
But as we recently learned, a different measure for this entire class of
similar problems was developed for the CMD2 project, and what we should do
is adopt that solution (do not allow a client to reconnect while some of
its requests are still being processed).
The OBD_RACE change will affect some of our existing regression tests, but
should not affect anything during normal operation.
Andreas, why was it decided not to wake up the other thread in OBD_RACE
immediately, and instead only wake it up when the second thread finishes
its job completely?
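For context on that question, below is a purely illustrative pthreads sketch, not the real OBD_RACE macro or its fail_loc plumbing. It contrasts the two wake-up policies for a two-thread race point: waking the parked thread as soon as the second thread arrives (wake_late == false) versus waking it only after the second thread has finished its own work (wake_late == true, the behaviour being asked about).

/* Illustrative only: a generic two-thread rendezvous, not OBD_RACE. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t race_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  race_cond  = PTHREAD_COND_INITIALIZER;
static bool first_arrived;
static bool released;

/* Both racing threads call this at the race point. */
static void race_point(bool wake_late, void (*work_fn)(void))
{
        pthread_mutex_lock(&race_mutex);
        if (!first_arrived) {
                /* First thread to arrive: park until released. */
                first_arrived = true;
                while (!released)
                        pthread_cond_wait(&race_cond, &race_mutex);
                pthread_mutex_unlock(&race_mutex);
                return;
        }
        pthread_mutex_unlock(&race_mutex);

        if (!wake_late) {                 /* policy A: wake immediately   */
                pthread_mutex_lock(&race_mutex);
                released = true;
                pthread_cond_signal(&race_cond);
                pthread_mutex_unlock(&race_mutex);
        }

        work_fn();                        /* second thread's own work     */

        if (wake_late) {                  /* policy B: wake only when done */
                pthread_mutex_lock(&race_mutex);
                released = true;
                pthread_cond_signal(&race_cond);
                pthread_mutex_unlock(&race_mutex);
        }
}

static void work(void)
{
        puts("second thread doing its work");
}

static void *racer(void *arg)
{
        (void)arg;
        race_point(true, work);           /* policy B in this demo */
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, racer, NULL);
        pthread_create(&t2, NULL, racer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

In the wake-late variant of this sketch the first thread cannot resume until the second thread's work is finished, which serializes the two paths; the wake-early variant lets both threads run concurrently after the rendezvous.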
green@clusterfs.com
2007-Jan-08 15:47 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
I found that the more correct approach is in fact implemented in bug 1451. The patch needs to be updated to b1_4 & b1_5 and landed.
shadow@clusterfs.com
2007-Jan-11 04:15 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9318)
--> (https://bugzilla.lustre.org/attachment.cgi?id=9318&action=view)
refuse connect from busy exports
Port of bug 1451 to b1_4: if a client has an incomplete request and tries to reconnect, its connect request is refused until the old request completes. A rough sketch of the idea follows.
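As a rough illustration of that policy (hypothetical names such as fake_export, rpcs_in_flight and handle_reconnect; not the actual attachment 9318 or bug 1451 code), the connect handler simply refuses a reconnect while the export still has a request in flight:

/* Illustrative sketch only: hypothetical names, not the real patch. */
#include <errno.h>
#include <stdio.h>

struct fake_export {
        int rpcs_in_flight;     /* hypothetical in-flight RPC counter */
};

/* Called while handling a client (re)connect request. */
static int handle_reconnect(const struct fake_export *exp)
{
        if (exp->rpcs_in_flight > 0)
                return -EBUSY;  /* refuse: old request not complete yet */
        return 0;               /* safe to accept the reconnect         */
}

int main(void)
{
        struct fake_export busy = { .rpcs_in_flight = 1 };
        struct fake_export idle = { .rpcs_in_flight = 0 };

        printf("busy export: %d, idle export: %d\n",
               handle_reconnect(&busy), handle_reconnect(&idle));
        return 0;
}

Refusing the connect means a resent request can never race against the original request that is still being processed, which is the class of problem behind this assertion.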
green@clusterfs.com
2007-Jan-11 15:42 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
----------------------------------------------------------------------------
Attachment #9318 Flag|review?(green@clusterfs.com)|review+
(From update of attachment 9318)
Looks good. It would need to be landed on b1_5 too.