th@llnl.gov
2006-Dec-27 13:27 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11277

This struck again:

2006-12-27 08:39:27 LustreError: 2947:0:(handler.c:1422:mds_handle()) operation 400 on unconnected MDS from 12345-173@elan1
2006-12-27 08:39:27 LustreError: 2947:0:(handler.c:1422:mds_handle()) Skipped 2 previous similar messages
2006-12-27 08:39:27 LustreError: 2947:0:(ldlm_lib.c:1275:target_send_reply_msg()) @@@ processing error (-107) req@000001002245dc00 x14975018/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret: /0/0 rc -107/0
2006-12-27 08:39:27 LustreError: 2947:0:(ldlm_lib.c:1275:target_send_reply_msg()) @@@ processing error (-107) req@000001003017ec00 x21346808/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret: /0/0 rc -107/0
2006-12-27 08:39:27 LustreError: 3156:0:(ldlm_lockd.c:1043:ldlm_handle_cancel()) received cancel for unknown lock cookie 0xc2df8baa28cc4cb0 from client adbfd3d6-1ca9-3c5e-e258-15872df55fcc id 12345-951@elan3
2006-12-27 08:39:27 LustreError: 2942:0:(handler.c:910:mds_getattr_lock()) ASSERTION(granted_lock != NULL) failed:1782474520/2811824055 lockh 0xc2df8baa28cc4be5
2006-12-27 08:39:27 LustreError: 2942:0:(linux-debug.c:130:lbug_with_loc()) LBUG
2006-12-27 08:39:27 ll_mdt_02 R running task 0 2942 1 2943 2941 (L-TLB)
2006-12-27 08:39:27 0000010220eee0b0 ffffffff8010e5f9 ffffffff8047de51 0000000000000246
2006-12-27 08:39:27 ffffffff8030187c ffffffff802ed4de 0000000000000000 0000000000000000
2006-12-27 08:39:27 000001021a85d798 00000000fffffffe
2006-12-27 08:39:27 Call Trace:<ffffffff80134275>{vprintk+515} <0>LustreError: 2945:0:(handler.c:910:mds_getattr_lock()) ASSERTION(granted_lock != NULL) failed:1782474520/2811824055 lockh 0xc2df8baa28cc4c24
2006-12-27 08:39:27 LustreError: 2945:0:(linux-debug.c:130:lbug_with_loc()) LBUG
2006-12-27 08:39:27 ll_mdt_05 R running task 0 2945 1 2946 2944 (L-TLB)
2006-12-27 08:39:27 0000010220eee0b0 ffffffff8010e5f9 ffffffff8047de51 0000000000000246
2006-12-27 08:39:27 ffffffff8030187c ffffffff802ed4de 0000000000000000 0000000000000000
2006-12-27 08:39:27 000001021a8d1798 00000000fffffffe
2006-12-27 08:39:27 Call Trace:<ffffffff80134275>{vprintk+515}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141} <ffffffff80134275>{vprintk+515} <ffffffff8013430e>{printk+141}
2006-12-27 08:39:27 <ffffffff80134275>{vprintk+515}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141} <ffffffff80145177>{__kernel_text_address+26} <ffffffff8011071c>{show_trace+375}
2006-12-27 08:39:27 <ffffffff80110858>{show_stack+241} <ffffffffa01d49f6>{:libcfs:lbug_with_loc+72}
2006-12-27 08:39:27 <ffffffffa03fc702>{:mds:mds_getattr_lock+1877} <ffffffffa0408c9e>{:mds:mds_intent_policy+2177}
2006-12-27 08:39:27 <ffffffff8013430e>{printk+141}
2006-12-27 08:39:27 <ffffffff8012f9d3>{__wake_up+54} <ffffffffa02a568a>{:ptlrpc:ldlm_lock_enqueue+209}
2006-12-27 08:39:27 <ffffffffa02c8d88>{:ptlrpc:ldlm_handle_enqueue+7012}
2006-12-27 08:39:27 <ffffffffa02c53c8>{:ptlrpc:ldlm_server_blocking_ast+0}
2006-12-27 08:39:27 <ffffffffa02c5fa6>{:ptlrpc:ldlm_server_completion_ast+0}
2006-12-27 08:39:27 <ffffffffa040346a>{:mds:mds_handle+26229} <ffffffff80145177>{__kernel_text_address+26} <ffffffff8011071c>{show_trace+375}
2006-12-27 08:39:27 <ffffffff80110858>{show_stack+241} <ffffffffa01d49f6>{:libcfs:lbug_with_loc+72}
2006-12-27 08:39:27 <ffffffffa03fc702>{:mds:mds_getattr_lock+1877} <ffffffffa0408c9e>{:mds:mds_intent_policy+2177}
2006-12-27 08:39:27 <ffffffff8012f9d3>{__wake_up+54} <ffffffffa02a568a>{:ptlrpc:ldlm_lock_enqueue+209}
2006-12-27 08:39:27 <ffffffffa02c8d88>{:ptlrpc:ldlm_handle_enqueue+7012}
2006-12-27 08:39:27 <ffffffffa02c53c8>{:ptlrpc:ldlm_server_blocking_ast+0}
2006-12-27 08:39:27 <ffffffffa02c5fa6>{:ptlrpc:ldlm_server_completion_ast+0}
2006-12-27 08:39:27 <ffffffffa040346a>{:mds:mds_handle+26229} <ffffffff801e2d53>{vsnprintf+1406}
2006-12-27 08:39:27 <ffffffff801e2d53>{vsnprintf+1406}
2006-12-27 08:39:27 <ffffffff801e2b25>{vsnprintf+848} <ffffffff801e2b25>{vsnprintf+848} <ffffffff801e2e36>{snprintf+131}
2006-12-27 08:39:27 <ffffffffa02eb45b>{:ptlrpc:ptlrpc_server_handle_request+2700}
2006-12-27 08:39:27 <ffffffffa02eb469>{:ptlrpc:ptlrpc_server_handle_request+2714}
2006-12-27 08:39:27 <ffffffff8013bb5c>{__mod_timer+293} <ffffffffa02ec5f3>{:ptlrpc:ptlrpc_main+2127}
2006-12-27 08:39:27 <ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0}
2006-12-27 08:39:27 <ffffffffa02ebd97>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff8010ff3f>{child_rip+8}
2006-12-27 08:39:27 <ffffffffa02ebda4>{:ptlrpc:ptlrpc_main+0} <ffffffff8010ff37>{child_rip+0}
2006-12-27 08:39:27
2006-12-27 08:39:27 LustreError: dumping log to /var/tmp/lustre-log-pigs-mds1.1167237567.2942
2006-12-27 08:39:27 <ffffffff801e2e36>{snprintf+131}
2006-12-27 08:39:27 <ffffffffa02eb45b>{:ptlrpc:ptlrpc_server_handle_request+2700}
2006-12-27 08:39:27 <ffffffffa02eb469>{:ptlrpc:ptlrpc_server_handle_request+2714}
th@llnl.gov
2006-Dec-27 15:08 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
The dk log from the 12/27 event appears to have been lost; this MDS node is dropping dk logs frequently (but has only LBUGed once).
behlendorf1@llnl.gov
2007-Jan-03 11:26 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
This bug struck again on 1/3/07, taking the MDS for lscratcha (bigs-mds1) out of service. We dumped the thread state and crash dumped the node to get things back in service. Unfortunately, the crash dump didn't succeed. We do have some logs dumped by the LBUGs for analysis.
behlendorf1@llnl.gov
2007-Jan-03 14:08 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What     |Removed     |Added
---------|------------|-----------
Severity |S3 (normal) |S2 (major)
Priority |P3          |P2

Updated to a P2/S2 bug; this is causing extended FS downtimes. Full dk logs from the most recent failures have been uploaded to the FTP site as 'bigs-mds1.log.sorted.gz'. This log represents 277 individual log dumps spliced together with a 'sort -k4 -t: <files>', so there may be many multi-minute gaps.
behlendorf1@llnl.gov
2007-Jan-03 14:15 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9259)
--> https://bugzilla.lustre.org/attachment.cgi?id=9259&action=view
Console log

Console logs corresponding to the debug logs uploaded to the FTP site.
behlendorf1@llnl.gov
2007-Jan-03 18:31 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
-----|--------|------------
CC   |        |th@llnl.gov

Added Terry to the CC list.
green@clusterfs.com
2007-Jan-04 06:53 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What |Removed |Added
-----|--------|---------------------
CC   |        |shadow@clusterfs.com

Here is what happens:

1. The client performs an operation that takes a long time on the server (waiting to take a lock).
2. The client times out and resends the request.
3. The resent request arrives at the server; we look through the list of locks, find the lock originally requested in (1), take that lock and its cookie, decide this is the lock we already sent to the client, and start the resend preparations (it passes the lustre_handle_is_used() check in mds_getattr_lock()).
4. At this point the thread serving the request from (1) finally gets the lock it was waiting for and decides to return it to the client: it rewrites the remote handle and returns ELDLM_LOCK_REPLACED from mds_intent_policy(). This makes ldlm_lock_enqueue() drop the existing lock from (1) (and its cookie).
5. The thread serving the resent request then tries to get a lock from the (now invalid) cookie, fails, and trips the assertion.

The easy fix I can think of is to add an extra check in fixup_handle_for_resent_req() that the lock we are about to choose has actually been granted (req_mode == granted_mode). The drawback is that we will obviously redo the entire lock grant for the second request, but that should be enough for a quick fix; a sketch of the idea follows below.

Shadow: please start with a recovery-small.sh test to replicate the issue (you will need to strategically place OBD_FAIL/OBD_RACE in mds/handler.c), then try the solution described above.
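A minimal userspace sketch of the suggested check, assuming the resend path matches locks by their remote cookie. The struct and function names (model_lock, find_resent_lock) are invented for illustration only; the real change would go into fixup_handle_for_resent_req() in mds/handler.c.

/*
 * Simplified userspace model of the proposed check, NOT the real Lustre
 * code.  The idea: when a resent getattr request matches an existing lock
 * by remote cookie, only reuse that lock if it has actually been granted
 * (l_req_mode == l_granted_mode); otherwise fall through and enqueue a
 * fresh lock instead of racing with the thread still completing the
 * original enqueue.
 */
#include <stdio.h>

enum ldlm_mode { LCK_NONE = 0, LCK_CR, LCK_PR, LCK_PW, LCK_EX };

struct model_lock {
        unsigned long long l_remote_handle;   /* cookie sent to the client */
        enum ldlm_mode     l_req_mode;        /* mode the client asked for */
        enum ldlm_mode     l_granted_mode;    /* mode actually granted     */
};

/* Return the matched lock only when it is safe to hand back to a resend. */
static struct model_lock *find_resent_lock(struct model_lock *locks, int n,
                                           unsigned long long cookie)
{
        for (int i = 0; i < n; i++) {
                if (locks[i].l_remote_handle != cookie)
                        continue;
                /* Proposed extra check: skip locks that are still being granted. */
                if (locks[i].l_req_mode != locks[i].l_granted_mode)
                        return NULL;
                return &locks[i];
        }
        return NULL;
}

int main(void)
{
        struct model_lock locks[] = {
                /* the original request's lock, enqueue still in progress */
                { 0xc2df8baa28cc4be5ULL, LCK_PR, LCK_NONE },
                /* an unrelated, fully granted lock */
                { 0xc2df8baa28cc4c24ULL, LCK_PR, LCK_PR },
        };

        /* Resend matching the in-flight lock: must NOT be reused. */
        printf("in-flight cookie reused? %s\n",
               find_resent_lock(locks, 2, 0xc2df8baa28cc4be5ULL) ?
               "yes (bug)" : "no (redo enqueue)");
        /* Resend matching a granted lock: safe to reuse. */
        printf("granted cookie reused?  %s\n",
               find_resent_lock(locks, 2, 0xc2df8baa28cc4c24ULL) ?
               "yes" : "no");
        return 0;
}

Built as an ordinary C program, the first lookup refuses to reuse the still-ungranted lock while the second reuses the granted one, which is the behaviour the extra req_mode == granted_mode check is meant to enforce.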
pbojanic@clusterfs.com
2007-Jan-04 08:43 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What              |Removed            |Added
------------------|-------------------|---------------------------
CC                |                   |green@clusterfs.com
AssignedTo        |bugs@clusterfs.com |shadow@clusterfs.com
Status Whiteboard |                   |2007-01-04: initial analysis has been completed; engineer assigned to replicate the issue locally to test hypothesis

Comment Tags: Status Change

2007-01-04: initial analysis has been completed; engineer assigned to replicate the issue locally to test the hypothesis.
shadow@clusterfs.com
2007-Jan-05 11:52 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9278)
--> https://bugzilla.lustre.org/attachment.cgi?id=9278&action=view
Fix for returning a lock in the wrong state.

This patch blocks returning a lock that is in the wrong state (only locks with req_mode == granted_mode are returned). It also fixes a bug in OBD_RACE where a thread does not wake up after the second call to OBD_RACE; a simplified model of the intended rendezvous is sketched below.
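For reference, a minimal pthread model of the rendezvous OBD_RACE is expected to provide. This is not Lustre's fail_loc implementation; it only illustrates the behaviour the patch restores, namely that the second thread to reach the race point must wake the parked first thread.

/*
 * Userspace model of an OBD_RACE-style synchronisation point using plain
 * pthreads.  The first thread to reach the race point parks; the second
 * thread to reach it wakes the sleeper and both proceed.  The reported bug
 * was that the sleeper was never woken on the second call.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t race_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  race_cv  = PTHREAD_COND_INITIALIZER;
static int waiters;                     /* threads parked at the race point */

static void race_point(const char *who)
{
        pthread_mutex_lock(&race_mtx);
        if (waiters == 0) {
                /* First arrival: park until the partner shows up. */
                waiters = 1;
                printf("%s: parked at race point\n", who);
                while (waiters)
                        pthread_cond_wait(&race_cv, &race_mtx);
        } else {
                /* Second arrival: release the sleeper and continue. */
                waiters = 0;
                printf("%s: waking the parked thread\n", who);
                pthread_cond_broadcast(&race_cv);
        }
        pthread_mutex_unlock(&race_mtx);
}

static void *worker(void *arg)
{
        race_point((const char *)arg);
        printf("%s: past the race point\n", (const char *)arg);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, worker, "thread-A");
        pthread_create(&b, NULL, worker, "thread-B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}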
green@clusterfs.com
2007-Jan-05 17:00 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What                  |Removed                      |Added
----------------------|-----------------------------|----------------------------------------
Attachment #9278 Flag |review?(green@clusterfs.com) |review+, review?(adilger@clusterfs.com)

(From update of attachment 9278)
This patch should ease the pain of this specific problem at LLNL for now. But, as we recently learned, a different measure for this entire class of similar problems was developed for the CMD2 project, and what we should do is adopt that solution (do not allow a client to reconnect while some of its requests are still being processed). The OBD_RACE change will affect some of our existing regression tests, but should not affect anything during normal operation.

Andreas, why was it decided not to wake up the other thread in OBD_RACE, and instead only wake it up when the second thread finishes its job completely?
green@clusterfs.com
2007-Jan-08 15:47 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
I found that the more correct approach is in fact implemented in bug 1451. The patch needs to be updated for b1_4 and b1_5 and landed.
shadow@clusterfs.com
2007-Jan-11 04:15 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
Created an attachment (id=9318)
--> https://bugzilla.lustre.org/attachment.cgi?id=9318&action=view
Refuse connects from busy exports.

This is a port of the bug 1451 patch to b1_4. If a client has an incomplete request and tries to reconnect, its connect request is refused until the old request has completed. A simplified model of the behaviour is sketched below.
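A rough userspace model of the idea, assuming the export tracks how many of its requests are still in flight. The names model_export, exp_rpcs_in_flight and handle_connect are invented for illustration; the actual patch works on the real export structures in the b1_4 tree.

/*
 * Simplified model of "refuse connect from busy exports" -- not the actual
 * b1_4 patch.  While an export still has RPCs in flight from the previous
 * incarnation of the client, a reconnect attempt is rejected (here with
 * -EBUSY) so the server never processes a resent request concurrently with
 * the original one.
 */
#include <errno.h>
#include <stdio.h>

struct model_export {
        int exp_rpcs_in_flight; /* unfinished requests from this client */
        int exp_connected;
};

static int handle_connect(struct model_export *exp)
{
        if (exp->exp_rpcs_in_flight > 0) {
                /* Old requests are still being served: refuse the reconnect. */
                return -EBUSY;
        }
        exp->exp_connected = 1;
        return 0;
}

int main(void)
{
        struct model_export exp = { .exp_rpcs_in_flight = 1, .exp_connected = 0 };

        /* Reconnect while a request from the old connection is in flight. */
        printf("reconnect while busy: %d\n", handle_connect(&exp));

        /* Once the outstanding request completes, the reconnect succeeds. */
        exp.exp_rpcs_in_flight = 0;
        printf("reconnect when idle:  %d\n", handle_connect(&exp));
        return 0;
}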
green@clusterfs.com
2007-Jan-11 15:42 UTC
[Lustre-devel] [Bug 11277] ASSERTION(granted_lock != NULL)
What                  |Removed                      |Added
----------------------|-----------------------------|--------
Attachment #9318 Flag |review?(green@clusterfs.com) |review+

(From update of attachment 9318)
Looks good. This would also need to be landed on b1_5.