Mark Rife
2006-Jul-19 16:49 UTC
[Ocfs2-users] dlm_get_lock_resource/dlm_query_join_handler errors
Hello,

I am running a 4-node ocfs2 cluster. Our servers are running Red Hat AS4, kernel 2.6.9-34.0.1.ELhugemem. Our ocfs2 package versions are:

# rpm -qa | grep ocfs2
ocfs2-tools-debuginfo-1.2.1-1
ocfs2-tools-1.2.1-1
ocfs2-2.6.9-34.0.1.ELhugemem-1.2.2-1
ocfs2console-1.2.1-1

One of the nodes (#3) crashed. We've rebooted node 3, but now it hangs as it tries to rejoin the cluster. On two of the nodes that are up (0 and 1), I am getting repeated messages in /var/log/messages that look like this:

Jul 19 09:39:40 radon6 kernel: (3994,2):dlm_query_join_handler:614 node 3 trying to join, but recovery is ongoing.
Jul 19 09:39:50 radon6 last message repeated 25 times
Jul 19 09:39:51 radon6 kernel: (27704,1):dlm_get_lock_resource:895 46A341FD43114DE4A10E7D63C5099461:M0000000000000000667f6c991b8fc9: at least one node (3) torecover before lock mastery can begin
Jul 19 09:39:51 radon6 kernel: (3994,2):dlm_query_join_handler:614 node 3 trying to join, but recovery is ongoing.
Jul 19 09:39:51 radon6 kernel: (10183,1):dlm_get_lock_resource:895 46A341FD43114DE4A10E7D63C5099461:M00000000000000000081e17e89ae74: at least one node (3) torecover before lock mastery can begin
Jul 19 09:39:51 radon6 kernel: (3994,2):dlm_query_join_handler:614 node 3 trying to join, but recovery is ongoing.

This appears to be an infinite loop, and node 3 never starts. I'm not seeing the messages on node 2.

The cluster is up and running on 3 of the 4 servers, but I need to get all 4 nodes running again. Can anyone provide any insight on what is going on or how this should be handled?

Thanks!

Mark Rife
Oracle Applications DBA
markrife at hotmail.com