[reposted as I omitted the attachment]

dear all,
I'm evaluating ocfs2 as the fs of choice for my setup at our university lab.
All the PCs access an Infortrend A16F-R1211 storage system over an FC SAN based on QLogic boards.

Today was the first day of heavy use in our cluster, and we hit a problem. I'll try to tell the story as I'm reconstructing it.

Main question: how can I recover from this situation? I can't umount the partition, as umount freezes... I'm trying a shutdown of the involved hosts followed by fsck.ocfs2.

All PCs mount 4 ocfs2 partitions:

LABEL=disk1 /storage/disk1 ocfs2 noauto,_netdev 0 0
LABEL=disk2 /storage/disk2 ocfs2 noauto,_netdev 0 0
LABEL=disk3 /storage/disk3 ocfs2 noauto,_netdev 0 0
LABEL=disk4 /storage/disk4 ocfs2 noauto,_netdev 0 0

theboss is the front-end PC, rack[1-8] are calculation PCs.

rack1 and rack2 were running jobs which required I/O of big data sets to/from /storage/disk1.
During the day I experienced slowdowns on theboss, probably related to the heavy I/O done by rack[12].
rack3...rack8 did not use the ocfs2 partitions.

Then we had some problems on rack9 and/or rack10, with a subsequent reset of both of them, maybe unrelated to ocfs2.

I collected data from a bunch of PCs via the command:

  egrep 'Jun 26.*(dlm|o2|ocfs).*' /var/log/messages

The results are in the attachment, together with the cluster.conf. Particularly informative is bugreport/ocfs_theboss.ape.log:

Jun 26 18:20:19 theboss kernel: (17499,2):dlm_send_remote_convert_request:393 ERROR: status = -107
Jun 26 18:20:19 theboss kernel: (17499,2):dlm_wait_for_node_death:285 5AFE69831DFC414A90CEA2B8718644C4: waiting 5000ms for notification of death of node 10
Jun 26 18:20:23 theboss kernel: (2458,0):o2net_set_nn_state:415 accepted connection from node rack9 (num 10) at 10.0.2.29:7777
Jun 26 18:20:24 theboss kernel: (17499,2):dlm_send_remote_convert_request:393 ERROR: status = -92
Jun 26 18:20:24 theboss kernel: (17499,2):dlm_wait_for_node_death:285 5AFE69831DFC414A90CEA2B8718644C4: waiting 5000ms for notification of death of node 10
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:377 Nodes in my domain ("5AFE69831DFC414A90CEA2B8718644C4"):
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 1
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 2
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 3
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 4
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 5
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 6
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 7
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 8
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 9
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 10
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 11
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_assert_master_handler:1599 ERROR: assert_master from 9, but current owner is 10! (S000000000000000000000200000000)
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_assert_master_handler:1691 ERROR: Bad message received from another node. Dumping state and killing the other node now! This node is OK and can continue.
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_dump_lock_resources:125 struct dlm_ctxt: 5AFE69831DFC414A90CEA2B8718644C4, node=3, key=821947029
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_print_one_lock_resource:52 lockres: M00000000000000379e03e2b2592ded, owner=3, state=0

A tgz with the logs and the cluster.conf is at
http://apegate.roma1.infn.it/~rossetti/apeNEXT/bugreport.tgz

--
davide.rossetti at gmail.com
ICQ:290677265 SKYPE:d.rossetti
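PS: for the record, here is the recovery sequence I plan to try. It is only a sketch, assuming every node that has disk1 mounted can first be cleanly shut down (or power-cycled) so nothing is left holding DLM locks, and /dev/sdb1 below is just a placeholder for whatever device carries LABEL=disk1 on the node doing the repair:

  # check from one node that no other node still has the volume mounted
  mounted.ocfs2 -f /dev/sdb1

  # make sure the o2cb cluster stack is running on the repairing node
  /etc/init.d/o2cb status

  # read-only pass first, so nothing gets modified
  fsck.ocfs2 -n /dev/sdb1

  # actual repair, answering yes to the proposed fixes
  fsck.ocfs2 -y /dev/sdb1

  # then remount on all nodes
  mount /storage/disk1

Corrections are welcome if this is not the right procedure.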