Roger Trang
2006-May-19 17:36 UTC
[Ocfs-users] Oracle cluster panics after removing a device path
Hi,

Here is the configuration on both hosts:

Oracle: 10.2.0.1
Oracle home: OCFS2 shared
Oracle data files: OCFS2 shared

# cat redhat-release
Red Hat Enterprise Linux ES release 4 (Nahant Update 2)

# rpm -qa | grep -i device
device-mapper-1.01.04-1.0.RHEL4
device-mapper-1.01.04-1.0.RHEL4

# rpm -qa | grep -i ocfs
ocfs2-tools-1.2.0-1
ocfs2console-1.2.0-1
ocfs2-2.6.9-22.ELsmp-1.2.0-1

We are testing OCFS2 with Linux multipathing. When a path is removed, both cluster nodes panic or fence with a failure to receive a heartbeat event. After we remove the path, we see the I/O on the other path on the storage array; then, after a minute or so, the cluster fences and panics the nodes. We also changed the timeout threshold to 601, but the problem still persists. We tried the deadline I/O scheduler as well, and the problem persists.

Console message from the host
-------------------------------------------------
Host1
============
Kernel BUG at panic:74
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core ocfs2 (U) debugfs(U) ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs(U) sunrpc ds yenta_socket pcmcia_core dm_mirror dm_mod hw_random egenera_nmi(U) egenera_veth(U) sd_mod egenera_vscsi(U) scsi_mod egenera_vmdump(U) egenera_dumpdev(U) egenera_ipmi(U) egenera_base(U) egenera_virtual_bus(U) egenera_fs(U) ext3 jbd
Pid: 6, comm: events/0 Tainted: PF 2.6.9-22.ELsmp
RIP: 0010:[<ffffffff801368c2>] <ffffffff801368c2>{panic+211}
RSP: 0018:000001020fd81d88 EFLAGS: 00010282
RAX: 000000000000005a RBX: ffffffffa01d1778 RCX: 0000000000000246
RDX: 000000000000445b RSI: 0000000000000246 RDI: ffffffff803d7960
RBP: 000001020e6ffce0 R08: 0000000000000246 R09: ffffffffa01d1778
R10: 0000000000000046 R11: 0000000000000000 R12: 000001000c03ed40
R13: 0000000000000216 R14: 000001020e6ffc00 R15: ffffffffa01c6042
FS: 0000002a9589fb00(0000) GS:ffffffff804d3100(0000) knlGS:00000000f7fdf6c0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000007fbffff816 CR3: 0000000000101000 CR4: 00000000000006e0
Process events/0 (pid: 6, threadinfo 000001020fd80000, task 00000100efefd7f0)
Stack: 0000003000000008 000001020fd81e68 000001020fd81da8 0000000000000006
       0000000000000000 0000000000000246 ffffffffa01dd1b0 ffffffffa01dd160
       ffffffff803d7948 000001020e6ffcd8
Call Trace:<ffffffffa01c8c2a>{:ocfs2_nodemanager:o2hb_stop_all_regions+95}
       <ffffffffa01ca4f4>{:ocfs2_nodemanager:o2quo_disk_timeout+0}
       <ffffffff801464f2>{worker_thread+419}
       <ffffffff80132e8d>{default_wake_function+0}
       <ffffffff80132ede>{__wake_up_common+67}
       <ffffffff80132e8d>{default_wake_function+0}
       <ffffffff8014634f>{worker_thread+0}
       <ffffffff8014a167>{kthread+200}
       <ffffffff80110ca3>{child_rip+8}
       <ffffffff8014a09f>{kthread+0}
       <ffffffff80110c9b>{child_rip+0}
Code: 0f 0b 3a 71 31 80 ff ff ff ff 4a 00 31 ff e8 d7 c4 fe ff e8
RIP <ffffffff801368c2>{panic+211} RSP <000001020fd81d88>
Dumping to /dev/egenera_dump_dev_ifca...
Writing dump header ...
<6>dumpdev: file (/crash_dumps/ap7.1147734852.dmp) opened
Writing dump pages ................
Dump complete.
rebooting.

Host 2
===========
[root@eg09 ~]#
(6,0):o2hb_write_timeout:164 ERROR: Heartbeat write timeout to device sdc1 after 90000 milliseconds
(6,0):o2hb_stop_all_regions:1727 ERROR: stopping heartbeat on all active regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:74
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core ocfs2 (U) debugfs(U) ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs(U) sunrpc ds yenta_socket pcmcia_core dm_mirror dm_mod hw_random egenera_nmi(U) egenera_veth(U) sd_mod egenera_vscsi(U) scsi_mod egenera_vmdump(U) egenera_dumpdev(U) egenera_ipmi(U) egenera_base(U) egenera_virtual_bus(U) egenera_fs(U) ext3 jbd
Pid: 6, comm: events/0 Tainted: PF 2.6.9-22.ELsmp
RIP: 0010:[<ffffffff801368c2>] <ffffffff801368c2>{panic+211}
RSP: 0018:000001020fd81d88 EFLAGS: 00010282
RAX: 000000000000005a RBX: ffffffffa01d1778 RCX: 0000000000000246
RDX: 0000000000004345 RSI: 0000000000000246 RDI: ffffffff803d7960
RBP: 000001010c043ce0 R08: 0000000000000246 R09: ffffffffa01d1778
R10: 0000000000000046 R11: 0000000000000000 R12: 000001000c03ed40
R13: 0000000000000216 R14: 000001010c043c00 R15: ffffffffa01c6042
FS: 0000002a9589fb00(0000) GS:ffffffff804d3100(0000) knlGS:00000000f7fdf6c0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000332988ed20 CR3: 0000000000101000 CR4: 00000000000006e0
Process events/0 (pid: 6, threadinfo 000001020fd80000, task 00000100efefd7f0)
Stack: 0000003000000008 000001020fd81e68 000001020fd81da8 0000000000000006
       0000000000000000 0000000000000246 ffffffffa01dd1b0 ffffffffa01dd160
       ffffffff803d7948 000001010c043cd8
Call Trace:<ffffffffa01c8c2a>{:ocfs2_nodemanager:o2hb_stop_all_regions+95}
       <ffffffffa01ca4f4>{:ocfs2_nodemanager:o2quo_disk_timeout+0}
       <ffffffff801464f2>{worker_thread+419}
       <ffffffff80132e8d>{default_wake_function+0}
       <ffffffff80132ede>{__wake_up_common+67}
       <ffffffff80132e8d>{default_wake_function+0}
       <ffffffff8014634f>{worker_thread+0}
       <ffffffff8014a167>{kthread+200}
       <ffffffff80110ca3>{child_rip+8}
       <ffffffff8014a09f>{kthread+0}
       <ffffffff80110c9b>{child_rip+0}
Code: 0f 0b 3a 71 31 80 ff ff ff ff 4a 00 31 ff e8 d7 c4 fe ff e8
RIP <ffffffff801368c2>{panic+211} RSP <000001020fd81d88>
Dumping to /dev/egenera_dump_dev_ifca...
Writing dump header ...
<6>dumpdev: file (/crash_dumps/ap8.1147734852.dmp) opened
Writing dump pages .............
Dump complete.

Thanks in advance,
Roger
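
A note on the timeout figures quoted above: the following is a minimal sketch of how the o2cb heartbeat threshold is usually related to the write timeout printed by o2hb_write_timeout. It assumes the relation given in the OCFS2 documentation, timeout = (threshold - 1) * 2 seconds. Neither that formula nor the names below (O2HB_REGION_TIMEOUT_MS as used here, write_timeout_ms, threshold_for_timeout_ms) come from this post; they are illustrative only.

```python
# Assumption (not stated in the post): the o2hb write timeout follows
#   timeout_ms = 2000 * (threshold - 1)
# i.e. one heartbeat interval of 2 seconds per threshold step.
O2HB_REGION_TIMEOUT_MS = 2000  # assumed heartbeat interval, in milliseconds


def write_timeout_ms(threshold: int) -> int:
    """Write timeout implied by a given heartbeat threshold (hypothetical helper)."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return O2HB_REGION_TIMEOUT_MS * (threshold - 1)


def threshold_for_timeout_ms(timeout_ms: int) -> int:
    """Smallest threshold whose implied timeout is at least timeout_ms."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    # Ceiling division, then add 1 to invert the (threshold - 1) term.
    return -(-timeout_ms // O2HB_REGION_TIMEOUT_MS) + 1


if __name__ == "__main__":
    # Threshold mentioned in the post.
    print(write_timeout_ms(601))
    # Timeout reported in the Host 2 console log.
    print(threshold_for_timeout_ms(90000))
```

If this relation holds, a threshold of 601 would imply a timeout of 1200000 ms, whereas the 90000 ms in the Host 2 log would correspond to a threshold of 46. It may therefore be worth confirming which threshold was actually in effect on the nodes when the log was captured.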