Hi.

I'm running RHEL4 on my test system: Adaptec FireWire controllers, a
Maxtor OneTouch III shared disk (details below), and a 100Mb/s dedicated
interconnect. It panics under no load roughly every 20 minutes (error
messages from netconsole attached).

Any clues?

Yegor

---
[root@rac1 ~]# cat /proc/fs/ocfs2/version
OCFS2 1.2.0 Tue Mar 7 15:51:20 PST 2006 (build db06cd9cd891710e73c5d89a6b4d8812)
[root@rac1 ~]#
---
[root@rac1 ~]# lspci
06:02.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61)
[root@rac1 ~]#
---
[root@rac1 ~]# cat /boot/grub/menu.lst
default=0
timeout=3
title Red Hat Enterprise Linux AS (2.6.9-34.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/VolGroup00/LogVol00 elevator=deadline vga=0xf05
        initrd /initrd-2.6.9-34.ELsmp.img
title Red Hat Enterprise Linux AS-up (2.6.9-34.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.EL ro root=/dev/VolGroup00/LogVol00 elevator=deadline vga=0xf05
        initrd /initrd-2.6.9-34.EL.img
[root@rac1 ~]#
---
[root@rac1 ~]# cat /etc/sysconfig/o2cb
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=16
[root@rac1 ~]#
---

Crash message:
---
Apr 18 15:54:43 rac1/rac1 (2858,1):o2net_set_nn_state:426 accepted connection from node rac2 (num 1) at 10.0.1.2:7777
Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:384 Nodes in my domain ("CA641AEC0417495BA7302FC14F6F99B7"):
Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:388 node 0
Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:388 node 1
Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:384 Nodes in my domain ("8BD4774D69C44FDC8FD8EC5E13EA9996"):
Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:388 node 0
Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:388 node 1
Apr 18 15:56:43 rac2/rac2 (3,0):o2hb_write_timeout:164 ERROR: Heartbeat write timeout to device sda5 after 30000 milliseconds
Apr 18 15:56:43 rac2/rac2 (3,0):o2hb_stop_all_regions:1727 ERROR: stopping heartbeat on all active regions.
Apr 18 15:56:43 rac2/rac2 Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
Apr 18 15:56:43 rac2/rac2
Apr 18 15:56:45 rac1/rac1 (2903,0):o2net_set_nn_state:411 no longer connected to node rac2 (num 1) at 10.0.1.2:7777
Apr 18 15:56:45 rac1/rac1 (2897,1):dlm_send_proxy_ast_msg:448 ERROR: status = -107
Apr 18 15:56:45 rac1/rac1 (2897,1):dlm_flush_asts:556 ERROR: status = -107
Apr 18 15:56:46 rac1/rac1 (19545,0):ocfs2_replay_journal:1172 Recovering node 1 from slot 1 on device (8,41)
Apr 18 15:56:46 rac1/rac1 (19544,0):ocfs2_replay_journal:1172 Recovering node 1 from slot 1 on device (8,37)
Apr 18 15:56:51 rac2/rac2 <0>Rebooting in 60 seconds..<5>(3,0):o2net_idle_timer:1310 connection to node rac1 (num 0) at 10.0.1.1:7777 has been idle for 10 seconds, shutting it down.
Apr 18 15:56:51 rac2/rac2 (3,0):o2net_idle_timer:1321 here are some times that might help debug the situation: (tmr 1145401001.986417 now 1145401011.984614 dr 1145401005.947636 adv 1145401001.986418:1145401001.986418 func (f7672ffa:505) 1145400976.990658:1145400976.990672)
---
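For reference on the timeout in the log above: with o2cb, the disk heartbeat
timeout is derived from O2CB_HEARTBEAT_THRESHOLD as (threshold - 1) * 2
seconds, so the configured value of 16 matches the "30000 milliseconds" in
the o2hb_write_timeout message. A minimal sketch of raising it to 60 seconds,
assuming the stock o2cb init script with all OCFS2 volumes unmounted first
(the value 31 is only an illustration, and every node must use the same
value):

---
[root@rac1 ~]# vi /etc/sysconfig/o2cb           # set O2CB_HEARTBEAT_THRESHOLD=31, i.e. (31-1)*2s = 60s
[root@rac1 ~]# /etc/init.d/o2cb offline ocfs2   # take the cluster offline (volumes already unmounted)
[root@rac1 ~]# /etc/init.d/o2cb unload
[root@rac1 ~]# /etc/init.d/o2cb load
[root@rac1 ~]# /etc/init.d/o2cb online ocfs2    # new threshold takes effect here
---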
Yegor Gorshkov wrote:
> It panics under no load roughly every 20 minutes (error messages from
> netconsole attached).
>
> Any clues?

There should be more messages on the netdump server.
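On a stock RHEL4 netdump-server setup those land under /var/crash on the
server, one timestamped directory per crashing client, with the full console
output in a file named "log". A sketch, assuming the default netdump-server
configuration (the directory name below is illustrative, not taken from this
thread):

---
[root@netdump ~]# ls /var/crash/
10.0.1.2-2006-04-18-15:56
[root@netdump ~]# less /var/crash/10.0.1.2-2006-04-18-15:56/log   # complete console log from rac2
---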