On a two-node cluster I got a reboot (core dump) on the first node. /var/log/messages doesn't show anything wrong, but running crash on the core dump shows that ocfs2 panicked the system. The nodes are SUSE SLES 10 SP1 systems. I believe node 1 had a backup routine (Veritas) running on it at the time. Any ideas on what happened?

thank you,
charlie

node 1 system info:
----------------------------
Jun 4 11:48:03 bustech-bu kernel: OCFS2 Node Manager 1.2.5-SLES-r2997 Tue Mar 27 16:33:19 EDT 2007 (build sles)
Jun 4 11:48:03 bustech-bu kernel: OCFS2 DLM 1.2.5-SLES-r2997 Tue Mar 27 16:33:19 EDT 2007 (build sles)
Jun 4 11:48:03 bustech-bu kernel: OCFS2 DLMFS 1.2.5-SLES-r2997 Tue Mar 27 16:33:19 EDT 2007 (build sles)

node 1 boot.msg file. It looks like the reboot was on June 4, 2008 at 11:40.
-----------------------------------------------------------------------------
INIT:
Boot logging started on /dev/tty1(/dev/console) at Wed Jun 4 11:40:23 2008

Master Resource Control: previous runlevel: N, switching to runlevel: 1
Starting irqbalance    unused
Saving 1979 MB crash dump to /var/log/dump/2008-06-04-11:40 ...
Entering runlevel: 1

node 1 /var/log/messages
-------------------------------------
Jun 4 08:37:21 bustech-bu syslog-ng[3986]: STATS: dropped 0
Jun 4 09:15:01 bustech-bu run-crons[12180]: time.cron returned 1
Jun 4 09:37:21 bustech-bu syslog-ng[3986]: STATS: dropped 0
Jun 4 10:15:01 bustech-bu run-crons[14123]: time.cron returned 1
Jun 4 10:37:21 bustech-bu syslog-ng[3986]: STATS: dropped 0
Jun 4 11:15:01 bustech-bu run-crons[16066]: time.cron returned 1
Jun 4 11:37:21 bustech-bu syslog-ng[3986]: STATS: dropped 0
**reboot here. note no previous errors**
Jun 4 11:46:22 bustech-bu syslog-ng[4018]: syslog-ng version 1.6.8 starting
Jun 4 11:46:22 bustech-bu ifup: lo

node 1 crash info
------------------------
crash> bt
PID: 13     TASK: dff1f670  CPU: 3   COMMAND: "events/3"
 #0 [dff21f08] crash_kexec at c013bb1a
 #1 [dff21f4c] panic at c0120172
 #2 [dff21f68] o2quo_fence_self at fb8cc399
 #3 [dff21f70] run_workqueue at c012de27
 #4 [dff21f8c] worker_thread at c012e754
 #5 [dff21fcc] kthread at c0130e77
 #6 [dff21fe8] kernel_thread_helper at c0102003
crash>
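Since node 1's /var/log/messages stops cleanly at 11:37 with no errors, whatever the kernel printed right before it fenced is probably only in the ring buffer captured inside the dump. A quick way to pull it out of the same core with crash (the vmlinux/debuginfo path and the dump file name below are assumptions, adjust for your kernel and dump mechanism):

  # open the saved dump against a debug-info kernel image (paths assumed)
  crash /usr/lib/debug/boot/vmlinux-$(uname -r).debug /var/log/dump/2008-06-04-11:40/vmcore

  crash> log      # kernel ring buffer from the core, including any last o2net/o2quo messages
  crash> bt -a    # backtraces for every CPU, not just the events/3 worker

The "log" output usually shows the o2net/quorum messages that never made it to the syslog on disk before the panic.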
Node 2 /var/log/messages. It looks like this node saw node 1 go away and come back.
------------------------------------------------------------------------------------
Jun 4 10:30:01 CN2 run-crons[1587]: time.cron returned 1
Jun 4 11:07:27 CN2 syslog-ng[4054]: STATS: dropped 0
Jun 4 11:30:01 CN2 run-crons[3546]: time.cron returned 1
Jun 4 11:41:33 CN2 kernel: o2net: connection to node bustech-bu (num 0) at 192.168.200.10:7777 has been idle for 120.0 seconds, shutting it down.
Jun 4 11:41:33 CN2 kernel: (0,1):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1212604773.841590 now 1212604893.850342 dr 1212604773.841583 adv 1212604773.841590:1212604773.841591 func (04f07b3d:505) 1212595634.502257:1212595634.502260)
Jun 4 11:41:33 CN2 kernel: o2net: no longer connected to node bustech-bu (num 0) at 192.168.200.10:7777
Jun 4 11:41:38 CN2 kernel: (5717,1):dlm_get_lock_resource:920 5CA2BC69EF1C446B97521FEB7175EF1C:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:38 CN2 kernel: (5717,1):dlm_get_lock_resource:954 5CA2BC69EF1C446B97521FEB7175EF1C: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:38 CN2 kernel: (5728,3):dlm_get_lock_resource:920 8057F00ED41A4507A24B6A4EF0211F1D:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:38 CN2 kernel: (5728,3):dlm_get_lock_resource:954 8057F00ED41A4507A24B6A4EF0211F1D: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:39 CN2 kernel: (6117,3):dlm_get_lock_resource:920 mas:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:39 CN2 kernel: (6117,3):dlm_get_lock_resource:954 mas: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:40 CN2 kernel: (5706,1):dlm_get_lock_resource:920 6FFB00A1F4F94113B6748BC33CA47F83:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:40 CN2 kernel: (5706,1):dlm_get_lock_resource:954 6FFB00A1F4F94113B6748BC33CA47F83: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:40 CN2 kernel: (5770,3):dlm_get_lock_resource:920 E2A008B35C664DDC9FF850F59B0E122F:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:40 CN2 kernel: (5770,3):dlm_get_lock_resource:954 E2A008B35C664DDC9FF850F59B0E122F: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:41:41 CN2 kernel: (5759,3):dlm_get_lock_resource:920 DD202255EE9C419781F4E61DE6E33CFE:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jun 4 11:41:41 CN2 kernel: (5759,3):dlm_get_lock_resource:954 DD202255EE9C419781F4E61DE6E33CFE: recovery map is not empty, but must master $RECOVERY lock now
Jun 4 11:48:08 CN2 kernel: o2net: connected to node bustech-bu (num 0) at 192.168.200.10:7777
Jun 4 11:48:11 CN2 kernel: ocfs2_dlm: Node 0 joins domain 6FFB00A1F4F94113B6748BC33CA47F83
Jun 4 11:48:11 CN2 kernel: ocfs2_dlm: Nodes in domain ("6FFB00A1F4F94113B6748BC33CA47F83"): 0 1
Jun 4 11:48:15 CN2 kernel: ocfs2_dlm: Node 0 joins domain 5CA2BC69EF1C446B97521FEB7175EF1C
Jun 4 11:48:15 CN2 kernel: ocfs2_dlm: Nodes in domain ("5CA2BC69EF1C446B97521FEB7175EF1C"): 0 1
Jun 4 11:48:20 CN2 kernel: ocfs2_dlm: Node 0 joins domain 8057F00ED41A4507A24B6A4EF0211F1D
Jun 4 11:48:20 CN2 kernel: ocfs2_dlm: Nodes in domain ("8057F00ED41A4507A24B6A4EF0211F1D"): 0 1
Jun 4 11:48:24 CN2 kernel: ocfs2_dlm: Node 0 joins domain DD202255EE9C419781F4E61DE6E33CFE
Jun 4 11:48:24 CN2 kernel: ocfs2_dlm: Nodes in domain ("DD202255EE9C419781F4E61DE6E33CFE"): 0 1
Jun 4 11:48:28 CN2 kernel: ocfs2_dlm: Node 0 joins domain E2A008B35C664DDC9FF850F59B0E122F
Jun 4 11:48:28 CN2 kernel: ocfs2_dlm: Nodes in domain ("E2A008B35C664DDC9FF850F59B0E122F"): 0 1
Jun 4 11:49:07 CN2 kernel: ocfs2_dlm: Node 0 joins domain mas
Jun 4 11:49:07 CN2 kernel: ocfs2_dlm: Nodes in domain ("mas"): 0 1
Jun 4 12:07:28 CN2 syslog-ng[4054]: STATS: dropped 0
Jun 4 12:30:01 CN2 run-crons[5552]: time.cron returned 1
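For what it's worth, node 2's log shows the expected sequence when a peer drops off: o2net times out the connection after the 120 second network idle timeout, the DLM masters the $RECOVERY locks and recovers node 0, and node 0 rejoins all its domains after the reboot. The o2quo_fence_self frame in node 1's backtrace is the quorum code doing the matching thing on the other side: once it loses the connection it deliberately panics the node rather than keep writing to the shared disk without quorum. A quick way to see which cluster timeouts a node is actually running with (the sysconfig variable names and the configfs path are version dependent; on 1.2.5 most of these are built-in defaults and an assumption here):

  /etc/init.d/o2cb status                                        # stack state and heartbeat status
  grep -iE 'THRESHOLD|TIMEOUT' /etc/sysconfig/o2cb               # configured thresholds/timeouts, if present
  cat /sys/kernel/config/cluster/*/idle_timeout_ms 2>/dev/null   # exposed via configfs on newer o2cb only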
------------------------------------------------------------------------

http://oss.oracle.com/bugzilla/show_bug.cgi?id=919

Fixed in 1.2.9-1. SUSE has the fix checked into their tree. It should be out soon with sles10 sp1.

Charlie Sharkey wrote:
> On a two-node cluster I got a reboot (core dump) on the first node.
> /var/log/messages doesn't show anything wrong, but running
> crash on the core dump shows that ocfs2 panicked the system.
> The nodes are SUSE SLES 10 SP1 systems. I believe node 1 had a
> backup routine (Veritas) running on it at the time.
> Any ideas on what happened?
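For anyone checking whether a node already carries the fixed module, the running version shows up in the "OCFS2 Node Manager ..." lines at module load time; it can also be queried directly (the package names are SLES-specific assumptions, and the version field may be absent on some builds):

  modinfo ocfs2          # module info; look for the version/srcversion fields
  rpm -qa | grep -i ocfs2   # ocfs2-tools / ocfs2console packages, if installed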