On a two-node cluster I got a reboot (core dump) on the first node. /var/log/messages doesn't show anything wrong, but running crash on the core dump shows that ocfs2 panicked the system. The nodes are SUSE SLES 10 SP1 systems. I believe node 1 had a backup routine (Veritas) running on it at the time. Any ideas on what happened?

thank you,
charlie

node 1 system info:
----------------------------
Jun 4 11:48:03 bustech-bu kernel: OCFS2 Node Manager 1.2.5-SLES-r2997 Tue Mar 27 16:33:19 EDT 2007 (build sles)
Jun 4 11:48:03 bustech-bu kernel: OCFS2 DLM 1.2.5-SLES-r2997 Tue Mar 27 16:33:19 EDT 2007 (build sles)
Jun 4 11:48:03 bustech-bu kernel: OCFS2 DLMFS 1.2.5-SLES-r2997 Tue Mar 27 16:33:19 EDT 2007 (build sles)

node 1 boot.msg file. It looks like the reboot was on June 4, 2008 at 11:40.
-----------------------------------------------------------------------------
INIT:
Boot logging started on /dev/tty1(/dev/console) at Wed Jun 4 11:40:23 2008

Master Resource Control: previous runlevel: N, switching to runlevel: 1
Starting irqbalance    unused
Saving 1979 MB crash dump to /var/log/dump/2008-06-04-11:40 ...
Entering runlevel: 1

node 1 /var/log/messages
-------------------------------------
Jun 4 08:37:21 bustech-bu syslog-ng[3986]: STATS: dropped 0
Jun 4 09:15:01 bustech-bu run-crons[12180]: time.cron returned 1
Jun 4 09:37:21 bustech-bu syslog-ng[3986]: STATS: dropped 0
Jun 4 10:15:01 bustech-bu run-crons[14123]: time.cron returned 1
Jun 4 10:37:21 bustech-bu syslog-ng[3986]: STATS: dropped 0
Jun 4 11:15:01 bustech-bu run-crons[16066]: time.cron returned 1
Jun 4 11:37:21 bustech-bu syslog-ng[3986]: STATS: dropped 0
**reboot here. note no previous errors**
Jun 4 11:46:22 bustech-bu syslog-ng[4018]: syslog-ng version 1.6.8 starting
Jun 4 11:46:22 bustech-bu ifup: lo

node 1 crash info
------------------------
crash> bt
PID: 13     TASK: dff1f670  CPU: 3   COMMAND: "events/3"
 #0 [dff21f08] crash_kexec at c013bb1a
 #1 [dff21f4c] panic at c0120172
 #2 [dff21f68] o2quo_fence_self at fb8cc399
 #3 [dff21f70] run_workqueue at c012de27
 #4 [dff21f8c] worker_thread at c012e754
 #5 [dff21fcc] kthread at c0130e77
 #6 [dff21fe8] kernel_thread_helper at c0102003
crash>
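Since node 1's /var/log/messages stops cleanly at 11:37 with no errors, whatever the kernel printed right before it fenced is probably only in the ring buffer captured inside the dump. A quick way to pull it out of the same core with crash (the vmlinux/debuginfo path and the dump file name below are assumptions, adjust for your kernel and dump mechanism):

  # open the saved dump against a debug-info kernel image (paths assumed)
  crash /usr/lib/debug/boot/vmlinux-$(uname -r).debug /var/log/dump/2008-06-04-11:40/vmcore

  crash> log      # kernel ring buffer from the core, including any last o2net/o2quo messages
  crash> bt -a    # backtraces for every CPU, not just the events/3 worker

The "log" output usually shows the o2net/quorum messages that never made it to the syslog on disk before the panic.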
Node 2 /var/log/messages. It looks like this node saw node 1 go away and come back.
------------------------------------------------------------------------------------
Jun 4 10:30:01 CN2 run-crons[1587]: time.cron returned 1
Jun 4 11:07:27 CN2 syslog-ng[4054]: STATS: dropped 0
Jun 4 11:30:01 CN2 run-crons[3546]: time.cron returned 1
Jun 4 11:41:33 CN2 kernel: o2net: connection to node bustech-bu (num 0) at 192.168.200.10:7777 has been idle for 120.0 seconds, shutting it down.
Jun 4 11:41:33 CN2 kernel: (0,1):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1212604773.841590 now 1212604893.850342 dr 1212604773.841583 adv 1212604773.841590:1212604773.841591 func (04f07b3d:505) 1212595634.502257:1212595634.502260)
Jun 4 11:41:33 CN2 kernel: o2net: no longer connected to node bustech-bu (num 0) at 192.168.200.10:7777
Jun 4 11:41:38 CN2 kernel: (5717,1):dlm_get_lock_resource:920 5CA2BC69EF1C446B97521FEB7175EF1C:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:38 CN2 kernel: (5717,1):dlm_get_lock_resource:954 5CA2BC69EF1C446B97521FEB7175EF1C: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:38 CN2 kernel: (5728,3):dlm_get_lock_resource:920 8057F00ED41A4507A24B6A4EF0211F1D:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:38 CN2 kernel: (5728,3):dlm_get_lock_resource:954 8057F00ED41A4507A24B6A4EF0211F1D: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:39 CN2 kernel: (6117,3):dlm_get_lock_resource:920 mas:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:39 CN2 kernel: (6117,3):dlm_get_lock_resource:954 mas: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:40 CN2 kernel: (5706,1):dlm_get_lock_resource:920 6FFB00A1F4F94113B6748BC33CA47F83:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:40 CN2 kernel: (5706,1):dlm_get_lock_resource:954 6FFB00A1F4F94113B6748BC33CA47F83: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:40 CN2 kernel: (5770,3):dlm_get_lock_resource:920 E2A008B35C664DDC9FF850F59B0E122F:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:40 CN2 kernel: (5770,3):dlm_get_lock_resource:954 E2A008B35C664DDC9FF850F59B0E122F: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:41 CN2 kernel: (5759,3):dlm_get_lock_resource:920 DD202255EE9C419781F4E61DE6E33CFE:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:41 CN2 kernel: (5759,3):dlm_get_lock_resource:954 DD202255EE9C419781F4E61DE6E33CFE: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:48:08 CN2 kernel: o2net: connected to node bustech-bu (num 0) at 192.168.200.10:7777
Jun 4 11:48:11 CN2 kernel: ocfs2_dlm: Node 0 joins domain 6FFB00A1F4F94113B6748BC33CA47F83
Jun 4 11:48:11 CN2 kernel: ocfs2_dlm: Nodes in domain ("6FFB00A1F4F94113B6748BC33CA47F83"): 0 1
Jun 4 11:48:15 CN2 kernel: ocfs2_dlm: Node 0 joins domain 5CA2BC69EF1C446B97521FEB7175EF1C
Jun 4 11:48:15 CN2 kernel: ocfs2_dlm: Nodes in domain ("5CA2BC69EF1C446B97521FEB7175EF1C"): 0 1
Jun 4 11:48:20 CN2 kernel: ocfs2_dlm: Node 0 joins domain 8057F00ED41A4507A24B6A4EF0211F1D
Jun 4 11:48:20 CN2 kernel: ocfs2_dlm: Nodes in domain ("8057F00ED41A4507A24B6A4EF0211F1D"): 0 1
Jun 4 11:48:24 CN2 kernel: ocfs2_dlm: Node 0 joins domain DD202255EE9C419781F4E61DE6E33CFE
Jun 4 11:48:24 CN2 kernel: ocfs2_dlm: Nodes in domain ("DD202255EE9C419781F4E61DE6E33CFE"): 0 1
Jun 4 11:48:28 CN2 kernel: ocfs2_dlm: Node 0 joins domain E2A008B35C664DDC9FF850F59B0E122F
Jun 4 11:48:28 CN2 kernel: ocfs2_dlm: Nodes in domain ("E2A008B35C664DDC9FF850F59B0E122F"): 0 1
Jun 4 11:49:07 CN2 kernel: ocfs2_dlm: Node 0 joins domain mas
Jun 4 11:49:07 CN2 kernel: ocfs2_dlm: Nodes in domain ("mas"): 0 1
Jun 4 12:07:28 CN2 syslog-ng[4054]: STATS: dropped 0
Jun 4 12:30:01 CN2 run-crons[5552]: time.cron returned 1
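For what it's worth, node 2's log shows the expected sequence when a peer drops off: o2net times out the connection after the 120 second network idle timeout, the DLM masters the $RECOVERY locks and recovers node 0, and node 0 rejoins all its domains after the reboot. The o2quo_fence_self frame in node 1's backtrace is the quorum code doing the matching thing on the other side: once it loses the connection it deliberately panics the node rather than keep writing to the shared disk without quorum. A quick way to see which cluster timeouts a node is actually running with (the sysconfig variable names and the configfs path are version dependent; on 1.2.5 most of these are built-in defaults and an assumption here):

  /etc/init.d/o2cb status                                        # stack state and heartbeat status
  grep -iE 'THRESHOLD|TIMEOUT' /etc/sysconfig/o2cb               # configured thresholds/timeouts, if present
  cat /sys/kernel/config/cluster/*/idle_timeout_ms 2>/dev/null   # exposed via configfs on newer o2cb only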
------------------------------------------------------------------------

http://oss.oracle.com/bugzilla/show_bug.cgi?id=919

Fixed in 1.2.9-1. SUSE has the fix checked into their tree. It should be out soon with sles10 sp1.

Charlie Sharkey wrote:
> On a two-node cluster I got a reboot (core dump) on the first node.
> /var/log/messages doesn't show anything wrong, but running
> crash on the core dump shows that ocfs2 panicked the system.
> The nodes are SUSE SLES 10 SP1 systems. I believe node 1 had a
> backup routine (Veritas) running on it at the time.
> Any ideas on what happened?
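For anyone checking whether a node already carries the fixed module, the running version shows up in the "OCFS2 Node Manager ..." lines at module load time; it can also be queried directly (the package names are SLES-specific assumptions, and the version field may be absent on some builds):

  modinfo ocfs2          # module info; look for the version/srcversion fields
  rpm -qa | grep -i ocfs2   # ocfs2-tools / ocfs2console packages, if installed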