Hi,
I'm having a strange issue and would like to understand it better.
With Lustre 1.6.2 over o2ib, some cluster nodes had processes hanging on
Lustre I/O, so I rebooted them. No LBUGs were seen, only RDMA failures.
Only the client nodes were rebooted.
After re-mounting the lustre filesystem, "ls" hangs (traceback is below).
But when the lustre FS is unmounted with "umount -f", ls returns the
correct output.
Any idea on what could be wrong? I noticed that on the buggy clients
   cat /proc/fs/lustre/ldlm/namespaces/*/lock_count
shows something very different from the output of the "good" clients.
Only the MGC* lock_count is 1; the others are zero. Is there a way to fix
this?
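
To compare a "buggy" client against a "good" one, the lock_count check
above can be scripted. This is only a rough sketch in Python: it assumes
that each lock_count file under /proc/fs/lustre/ldlm/namespaces/ holds a
single integer, and the function names are made up for illustration. It
reports the symptom; it does not fix anything.

```python
import glob
import os

# Path taken from the "cat" command above.
DEFAULT_PATTERN = "/proc/fs/lustre/ldlm/namespaces/*/lock_count"


def read_lock_counts(pattern=DEFAULT_PATTERN):
    """Return {namespace_name: lock_count}; None if a file is unreadable."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        # The namespace name is the directory that contains lock_count.
        namespace = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                # Assumption: the file holds one integer.
                counts[namespace] = int(f.read().strip())
        except (OSError, ValueError):
            counts[namespace] = None
    return counts


def zero_lock_namespaces(counts):
    """List the namespaces whose lock_count is exactly zero."""
    return [name for name, n in counts.items() if n == 0]


if __name__ == "__main__":
    counts = read_lock_counts()
    for name, n in counts.items():
        print(f"{name}: {n}")
    zero = zero_lock_namespaces(counts)
    if zero:
        print("namespaces with no locks:", ", ".join(zero))
```

Running it on both kinds of client and diffing the output shows which
namespaces hold no locks on the affected node.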
Best regards,
Erich
===== traceback of hanging "ls" command =====
ls            S 00000000ffffc29c     0  4296   4092                     (NOTLB)
000001007c049958 0000000000000002 000001007e2f3030 ffffffff00000074
       00000101376bc808 0000000039288440 0000010080051000 00000001a02f67d3
       000001007e004800 000000000000205a
Call Trace:<ffffffff8013f4a4>{__mod_timer+293}
<ffffffff80320c33>{schedule_timeout+367}
       <ffffffff8013fed4>{process_timeout+0}
<ffffffffa030b474>{:ptlrpc:ptlrpc_set_wait+932}
       <ffffffff801335c2>{default_wake_function+0}
<ffffffffa03096b0>{:ptlrpc:ptlrpc_expired_set+0}
       <ffffffffa0307710>{:ptlrpc:ptlrpc_interrupted_set+0}
       <ffffffffa03096b0>{:ptlrpc:ptlrpc_expired_set+0}
<ffffffffa0307710>{:ptlrpc:ptlrpc_interrupted_set+0}
       <ffffffffa042637d>{:lustre:ll_glimpse_size+1613}
<ffffffffa02e0e6a>{:ptlrpc:__ldlm_handle2lock+794}
       <ffffffffa02dc035>{:ptlrpc:lock_res_and_lock+53}
<ffffffffa02dc035>{:ptlrpc:lock_res_and_lock+53}
       <ffffffffa02dc06f>{:ptlrpc:unlock_res_and_lock+31}
       <ffffffffa02e0aaa>{:ptlrpc:ldlm_lock_decref_internal+746}
       <ffffffffa0424b40>{:lustre:ll_extent_lock_callback+0}
       <ffffffffa02f2960>{:ptlrpc:ldlm_completion_ast+0}
<ffffffffa0424ee0>{:lustre:ll_glimpse_callback+0}
       <ffffffffa0416c4f>{:lustre:ll_intent_drop_lock+143}
       <ffffffffa0431318>{:lustre:ll_inode_revalidate_it+1528}
       <ffffffffa044f360>{:lustre:ll_mdc_blocking_ast+0}
<ffffffff8018e106>{dput+55}
       <ffffffff801859eb>{__link_path_walk+3928}
<ffffffff80185b75>{link_path_walk+179}
       <ffffffffa04313c4>{:lustre:ll_getattr_it+36}
<ffffffffa04314f5>{:lustre:ll_getattr+53}
       <ffffffff80180257>{vfs_getattr64_it+146}
<ffffffff80180532>{vfs_lstat64+100}
       <ffffffff8016838d>{handle_mm_fault+354}
<ffffffff801ea149>{__up_read+16}
       <ffffffff80123991>{do_page_fault+577}
<ffffffffa0416c70>{:lustre:ll_intent_release+0}
       <ffffffff80180891>{sys_newlstat+17}
<ffffffff8018a01c>{vfs_readdir+176}
       <ffffffff8018a428>{sys_getdents64+166}
<ffffffff80110c2d>{error_exit+0}
       <ffffffff8011022a>{system_call+126}