Silviu Marin-Caea
2006-Apr-13 09:43 UTC
[Ocfs2-users] Panic on umount OCFS2 1.1.8 SLES9 kernel 2.6.5-7.252
We have a backup script on node1 that does this:

# stop App Serv & DB
ssh node2-cluster1 "rcoracle_as stop; rcoracle_db stop"
ssh node1-cluster1 "rcoracle_as stop; rcoracle_db stop"
sleep 10

# stop Clusterware
ssh node2-cluster1 "/etc/init.d/init.crs stop; sleep 2m"
ssh node1-cluster1 "/etc/init.d/init.crs stop; sleep 2m"

# umount ocfs2 from node2 and remount on node1
ssh node2-cluster1 "sync; rcocfs2 stop; sleep 30"    <---- panic on node1
ssh node1-cluster1 "sync; rcocfs2 stop; sleep 30; rcocfs2 start; sleep 30"

tar cf /dev/st0 /srv/database /opt/oracle

... then the reverse operations, after tar has finished.

Two or three times a week we get a kernel panic on OCFS2 unmount: while OCFS2 is unmounted on node2, node1 panics.

Why do I umount OCFS2 on node2, you might ask? Because if I don't, the tape backup hangs on /opt/oracle at random stages.

--- node1 ---
Apr 12 21:36:59 node1-cluster1 logger: Oracle CSSD graceful shutdown
Apr 12 21:39:01 node1-cluster1 kernel: (6628,2):__dlm_print_nodes:384 Nodes in my domain ("896135A6F0CD432CB496D54A96ECDDBB"):
Apr 12 21:39:01 node1-cluster1 kernel: (6628,2):__dlm_print_nodes:388 node 0
Apr 12 21:39:02 node1-cluster1 kernel: (6628,2):__dlm_print_nodes:384 Nodes in my domain ("9145F4AEDB7348BC9DE95747E0E4ECE2"):
Apr 12 21:39:02 node1-cluster1 kernel: (6628,2):__dlm_print_nodes:388 node 0
Apr 12 21:39:21 node1-cluster1 kernel: (6628,6):__dlm_print_nodes:384 Nodes in my domain ("E4B55CA57DAC49BA97E4C65F6CA4A72F"):
Apr 12 21:39:21 node1-cluster1 kernel: (6628,6):__dlm_print_nodes:388 node 0
Apr 12 21:39:31 node1-cluster1 kernel: (6628,6):o2net_idle_timer:1306 connection to node node2-cluster1 (num 1) at 10.0.0.2:7777 has been idle for 10 seconds, shutting it down.
Apr 12 21:39:31 node1-cluster1 kernel: (6628,6):o2net_idle_timer:1317 here are some times that might help debug the situation: (tmr 1144867161.785725 now 1144867171.784420 dr 1144867166.785023 adv 1144867161.785723:1144867161.785723 func (5bed52dc:513) 1144867161.785725:1144867161.785599)
Apr 12 21:39:34 node1-cluster1 kernel: (25146,7):o2net_set_nn_state:407 no longer connected to node node2-cluster1 (num 1) at 10.0.0.2:7777
Apr 12 21:40:01 node1-cluster1 /USR/SBIN/CRON[21616]: (root) CMD (/usr/local/bin/ora_backup)
Apr 12 21:40:02 node1-cluster1 kernel: (25,7):o2hb_write_timeout:165 ERROR: Heartbeat write timeout to device sdd1 after 30000 milliseconds
Apr 12 21:40:02 node1-cluster1 kernel: (25,7):o2hb_stop_all_regions:1728 ERROR: stopping heartbeat on all active regions.
Apr 12 21:40:02 node1-cluster1 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing

--- node2 ---
Apr 12 21:39:01 node2-cluster1 kernel: ocfs2: Unmounting device (8,17) on (node 1)
Apr 12 21:39:03 node2-cluster1 kernel: ocfs2: Unmounting device (8,33) on (node 1)
Apr 12 21:39:31 node2-cluster1 kernel: (0,2):o2net_idle_timer:1306 connection to node node1-cluster1 (num 0) at 10.0.0.1:7777 has been idle for 10 seconds, shutting it down.
Apr 12 21:39:31 node2-cluster1 kernel: (0,2):o2net_idle_timer:1317 here are some times that might help debug the situation: (tmr 1144867161.784993 now 1144867171.783852 dr 1144867161.784983 adv 1144867161.785004:1144867161.785005 func (5bed52dc:502) 1144867161.784994:1144867161.784999)
Apr 12 21:39:31 node2-cluster1 kernel: (6595,2):o2net_set_nn_state:407 no longer connected to node node1-cluster1 (num 0) at 10.0.0.1:7777
Apr 12 21:39:31 node2-cluster1 kernel: (23910,3):dlm_leave_domain:472 Error -112 sending domain exit message to node 0
Apr 12 21:39:33 node2-cluster1 kernel: ocfs2: Unmounting device (8,49) on (node 1)
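
FWIW, the panic path in node1's log is o2hb_write_timeout -> o2hb_stop_all_regions -> fence-by-panic, i.e. node1's heartbeat write to sdd1 did not complete within 30 seconds while node2 was tearing down its cluster stack. One thing I am considering as a mitigation is raising the O2CB heartbeat dead threshold, to give the shared disk more headroom before a node gets fenced. A minimal sketch, assuming the 1.2-style /etc/sysconfig/o2cb interface; I have not verified that our 1.1.8 build honors this setting, so treat the variable name and the rco2cb restart below as assumptions:

# /etc/sysconfig/o2cb (assumption: 1.2-style o2cb configuration)
# O2CB_HEARTBEAT_THRESHOLD is counted in ~2-second heartbeat iterations;
# a node is declared dead after (threshold - 1) * 2 seconds, so 31 gives
# roughly 60 seconds before fencing.
O2CB_HEARTBEAT_THRESHOLD=31

# then restart the cluster stack on both nodes, with all ocfs2
# volumes unmounted:
ssh node1-cluster1 "rco2cb restart"
ssh node2-cluster1 "rco2cb restart"

That would only widen the window, of course; it would not explain why the heartbeat write to sdd1 stalls during node2's unmount in the first place.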