Anyone aware of this problem, and if so, is there a fix available?

I have two nodes, alice and bob. On both I have a shared ocfs2 mount at /ocfs2. The FS appears to mount and work perfectly fine.

Now on alice I take out an exclusive lock on /dlm/foo/bar and block the process forever. Next I start a loop on bob that tries to take out the same lock (trylock, exclusive mode) once each second, which fails as expected.

Now I unplug alice completely, so the machine is off. The trylock process on bob now hangs permanently. After ten seconds pass, the following appears on my console for bob:

(0,0):o2net_idle_timer:1293 connection to node kano (num 0) at 10.10.0.2:7777 has been idle for 10 seconds, shutting it down.
(0,0):o2net_idle_timer:1304 here are some times that might help debug the situation: (tmr 1144294337.323052 now 1144294347.317365 dr 1144294337.323045 adv 1144294337.323053:1144294337.323053 func (7b10fddd:505) 1144294324.934836:1144294324.934838)
(2179,0):o2net_set_nn_state:409 no longer connected to node kano (num 0) at 10.10.0.2:7777
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -112
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107

The status = -107 message now prints approximately once every 100 ms, forever, and a few seconds after this all starts scrolling I get:

(2493,0):ocfs2_replay_journal:1180 Recovering node 0 from slot 0 on device (8,2)
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
kjournald starting. Commit interval 5 seconds

in the middle of all the scrolling. The trylock process on bob is permanently hung and the -107 message continues to scroll.

I have tried the subversion ocfs2/trunk modules under 2.6.16 (changed to use mutexes), the modules that come with mainline 2.6.16, and those in mainline 2.6.16.1. All of these seem to act the same.
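As a reading aid for the console output above: the status values appear to be negated Linux errno codes. The following is a minimal sketch that pulls them out of pasted log text and names them. The table covers only the two codes seen in this report and assumes the generic Linux errno numbering (some architectures number them differently); the helper name decode_status_lines is hypothetical, made up for illustration.

```python
import re

# Generic Linux errno numbering (assumption: not an architecture that
# renumbers these). The kernel log prints them negated.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}

# Matches e.g. "(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -112"
STATUS_RE = re.compile(
    r"\):(?P<func>\w+):(?P<line>\d+) ERROR: status = (?P<status>-?\d+)"
)


def decode_status_lines(log_text):
    """Return (function, status, errno name, description) per status line."""
    decoded = []
    for match in STATUS_RE.finditer(log_text):
        status = int(match.group("status"))
        name, description = LINUX_ERRNO.get(
            -status, ("UNKNOWN", "not in this table")
        )
        decoded.append((match.group("func"), status, name, description))
    return decoded


SAMPLE = """\
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -112
(2492,0):dlm_send_remote_lock_request:264 ERROR: status = -107
"""

if __name__ == "__main__":
    for func, status, name, description in decode_status_lines(SAMPLE):
        print(f"{func}: status {status} is -{name} ({description})")
```

Read this way, the first failure says the peer host is down and the repeating one says the connection to it no longer exists, which fits the o2net messages printed just before them.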
OCFS2 Node Manager, DLM, DLMFS all v 1.3.3
OCFS2-Tools v 1.2.0

The bug reports I've found related to this problem say I need to upgrade to -Tools ver 1.0.3, which I think I'm a little past. (Could be wrong)

Thanks,
Jonathan Steinert
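For anyone wanting to reproduce the report, here is a minimal sketch of the once-per-second trylock loop run on bob. It is not necessarily the program used above. It assumes the dlmfs file interface as described in the kernel's dlmfs documentation: opening a file under the dlmfs mount with O_RDWR requests an exclusive lock, O_NONBLOCK turns the request into a trylock, and a busy lock makes open fail with ETXTBSY. It also assumes the domain directory (/dlm/foo here) already exists. The function names are hypothetical.

```python
import errno
import os
import time


def try_exclusive_lock(path):
    """Attempt a non-blocking exclusive lock on `path`.

    Returns an open file descriptor on success, or None if the lock is busy.
    Assumed dlmfs convention: O_RDWR = exclusive, O_NONBLOCK = trylock,
    ETXTBSY = lock currently held elsewhere.
    """
    try:
        return os.open(path, os.O_CREAT | os.O_RDWR | os.O_NONBLOCK, 0o600)
    except OSError as exc:
        if exc.errno == errno.ETXTBSY:
            return None
        raise  # anything else is a real error, not just "lock busy"


def trylock_loop(path, attempts, interval=1.0):
    """Try the lock `attempts` times; return one bool per attempt."""
    results = []
    for i in range(attempts):
        fd = try_exclusive_lock(path)
        results.append(fd is not None)
        if fd is not None:
            os.close(fd)  # closing the descriptor releases the lock
        if i + 1 < attempts:
            time.sleep(interval)
    return results


# Example, on a node with dlmfs mounted at /dlm and the domain created:
#   trylock_loop("/dlm/foo/bar", attempts=5)
```

While alice holds the lock, every attempt should come back False within the interval. The behaviour reported above is that, once alice is powered off, the open call never returns at all.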
Please file a bug on oss.oracle.com/bugzilla. We've made many fixes in mastery/recovery since 1.2.0. We can add a test to check this issue too.