Hi all,

my test environment is composed of 2 servers with CentOS 4.4; the nodes export the storage with aoe6-43 + vblade-14.

kernel-2.6.9-42.0.3.EL
ocfs2-tools-1.2.2-1
ocfs2console-1.2.2-1
ocfs2-2.6.9-42.0.3.EL-1.2.3-1

/dev/etherd/e2.0 on /ocfs2 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/etherd/e3.0 on /ocfs2_nfs type ocfs2 (rw,_netdev,heartbeat=local)

Device              FS     Nodes
/dev/etherd/e2.0    ocfs2  ocfs2, becks
/dev/etherd/e3.0    ocfs2  ocfs2, becks

Device              FS     UUID                                  Label
/dev/etherd/e2.0    ocfs2  b24cc18d-af89-4980-a75e-a87530b1b878  test1
/dev/etherd/e3.0    ocfs2  101a92fd-b83b-4294-8bfc-fbaa069c3239  nfs4

O2CB_HEARTBEAT_THRESHOLD=31

When I run a stress test:

Index 4: took 0 ms to do checking slots
Index 5: took 2 ms to do waiting for write completion
Index 6: took 1995 ms to do msleep
Index 7: took 0 ms to do allocating bios for read
Index 8: took 0 ms to do bio alloc read
Index 9: took 0 ms to do bio add page read
Index 10: took 0 ms to do submit_bio for read
Index 11: took 2 ms to do waiting for read completion
Index 12: took 0 ms to do bio alloc write
Index 13: took 0 ms to do bio add page write
Index 14: took 0 ms to do submit_bio for write
Index 15: took 0 ms to do checking slots
Index 16: took 1 ms to do waiting for write completion
Index 17: took 1996 ms to do msleep
Index 18: took 0 ms to do allocating bios for read
Index 19: took 0 ms to do bio alloc read
Index 20: took 0 ms to do bio add page read
Index 21: took 0 ms to do submit_bio for read
Index 22: took 10001 ms to do waiting for read completion
(3,0):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing

<6>o2net: connection to node ocfs2 (num 2) at 10.1.7.107:777 has been idle for 10 seconds, shutting it down
(3,0):o2net_idle_timer:1309 here are some times that might help debug the situation:
(tmr: 1169487957.71650 now 1169487967.69569 dr 1169487962.88883 adv 1169487957.71671:1159487957.71674 func 83bce37b2:505) 1169487901.984644:1169487901.984676)

The kernel panic always occurs on the same node, and the other node keeps responding.

Thanks!
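For completeness, here is a hedged sketch of how AoE targets like e2.0 and e3.0 are typically exported with vblade; the backing devices and NIC name below are placeholders, not taken from the poster's setup:

    # on the server exporting the storage (vbladed is the daemonizing wrapper shipped with vblade)
    vbladed 2 0 eth0 /dev/sdb    # shelf 2, slot 0 -> /dev/etherd/e2.0 on the initiators
    vbladed 3 0 eth0 /dev/sdc    # shelf 3, slot 0 -> /dev/etherd/e3.0 on the initiators

    # on each initiator node, load the aoe driver; the exported targets are
    # discovered when the module comes up
    modprobe aoe
    ls /dev/etherd/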
The problem appears to be that IO is taking more time than the effective O2CB_HEARTBEAT_THRESHOLD; your configured value of "31" doesn't seem to be effective:

Index 6: took 1995 ms to do msleep
Index 17: took 1996 ms to do msleep
Index 22: took 10001 ms to do waiting for read completion

Can you please cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold and verify?

Thanks,
--Srini.
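A minimal sketch of the check being asked for, plus how the configured value normally becomes effective, assuming the standard ocfs2-tools 1.2.x layout on a RHEL/CentOS system (paths and the rule of thumb below come from the OCFS2 documentation; adjust to your installation):

    # value the kernel is actually using (run on both nodes)
    cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold

    # value the o2cb init script is configured to load at startup
    grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb

    # the disk-heartbeat dead time is roughly (threshold - 1) * 2 seconds,
    # so a threshold of 31 corresponds to about 60 seconds

    # a changed threshold only takes effect when the cluster stack is reloaded,
    # e.g. with all OCFS2 volumes unmounted on that node:
    umount /ocfs2 /ocfs2_nfs
    /etc/init.d/o2cb restart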
I can reproduce it every time under heavy IO. I have read this FAQ entry: "I encounter 'Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing' whenever I run a heavy io load", so I have appended the string "elevator=deadline" to the boot line on node "becks"; node "ocfs2" has the default RH IO scheduler.

This is my last panic on node becks:

Index 19: took 0 ms to do bio add page read
Index 20: took 0 ms to do submit_bio for read
Index 21: took 36 ms to do waiting for read completion
Index 22: took 0 ms to do bio alloc write
Index 23: took 0 ms to do bio add page write
Index 0: took 0 ms to do submit_bio for write
Index 1: took 0 ms to do checking slots
Index 2: took 1 ms to do waiting for write completion
Index 3: took 1962 ms to do msleep
Index 4: took 0 ms to do allocating bios for read
Index 5: took 0 ms to do bio alloc read
Index 6: took 0 ms to do bio add page read
Index 7: took 0 ms to do submit_bio for read
Index 8: took 9362 ms to do waiting for read completion
Index 9: took 0 ms to do bio alloc write
Index 10: took 0 ms to do bio add page write
Index 11: took 0 ms to do submit_bio for write
Index 12: took 0 ms to do checking slots
Index 13: took 48665 ms to do waiting for write completion
(3,0):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing

Other info:

[root@ocfs2 ~]# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
31
[root@becks ~]# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
31

[root@ocfs2 ~]# mount -t ocfs2
/dev/etherd/e2.0 on /ocfs2 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/etherd/e3.0 on /ocfs2_nfs type ocfs2 (rw,_netdev,heartbeat=local)

[root@becks ~]# mount -t ocfs2
/dev/etherd/e2.0 on /ocfs2 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/etherd/e3.0 on /ocfs2_nfs type ocfs2 (rw,_netdev,heartbeat=local)

[root@ocfs2 ~]# /etc/init.d/ocfs2 status
Active OCFS2 mountpoints: /ocfs2 /ocfs2_nfs

[root@becks ~]# /etc/init.d/ocfs2 status
Active OCFS2 mountpoints: /ocfs2 /ocfs2_nfs

[root@ocfs2 ~]# mounted.ocfs2 -f
Device              FS     Nodes
/dev/etherd/e2.0    ocfs2  ocfs2, becks
/dev/etherd/e3.0    ocfs2  ocfs2, becks

[root@becks ~]# mounted.ocfs2 -f
Device              FS     Nodes
/dev/etherd/e3.0    ocfs2  ocfs2, becks
/dev/etherd/e2.0    ocfs2  ocfs2, becks

[root@ocfs2 ~]# mounted.ocfs2 -d
Device              FS     UUID                                  Label
/dev/etherd/e2.0    ocfs2  b24cc18d-af89-4980-a75e-a87530b1b878  seceti
/dev/etherd/e3.0    ocfs2  101a92fd-b83b-4294-8bfc-fbaa069c3239  nfs4

[root@becks ~]# mounted.ocfs2 -d
Device              FS     UUID                                  Label
/dev/etherd/e3.0    ocfs2  101a92fd-b83b-4294-8bfc-fbaa069c3239  nfs4
/dev/etherd/e2.0    ocfs2  b24cc18d-af89-4980-a75e-a87530b1b878  seceti

I can also panic the nodes by detaching the network cable...

If you have any more debugging questions, feel free to ask me.

Thanks

-----Original Message-----
From: Srinivas Eeda [mailto:srinivas.eeda@oracle.com]
Sent: Monday, 22 January 2007, 18:30
To: Consulente3
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] kernel panic - not syncing
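For reference, a hedged illustration of what the "elevator=deadline" change on becks would look like in /boot/grub/grub.conf on a CentOS 4 box; the root device and initrd names below are placeholders, only the elevator parameter on the kernel line matters:

    title CentOS-4 (2.6.9-42.0.3.EL)
            root (hd0,0)
            kernel /vmlinuz-2.6.9-42.0.3.EL ro root=LABEL=/ elevator=deadline
            initrd /initrd-2.6.9-42.0.3.EL.img

After a reboot, the scheduler in effect can be spot-checked with something like cat /sys/block/<dev>/queue/scheduler, if the kernel exposes it there.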