Ulf Zimmermann
2013-Jun-21 13:17 UTC
[Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6
We have a production cluster of 6 nodes, currently running RHEL 5.8 with OCFS2 1.4.10. We snapclone these volumes to multiple destinations, one of which is a RHEL4 machine with OCFS2 1.2.9. Because of that, the volumes are set up so that we can read them there.

We are now trying to bring up a new server. This one has OEL 6.3 on it and comes with OCFS2 1.8.0 and tools 1.8.0-10. I can use tunefs.ocfs2 --cloned-volume to reset the UUID, but when I try to change the label I get:

[root@co-db03 ulf]# tunefs.ocfs2 -L /export/backuprecovery.AUCP /dev/mapper/aucp_data_bk_2_x
tunefs.ocfs2: Invalid name for a cluster while opening device "/dev/mapper/aucp_data_bk_2_x"

fsck.ocfs2 core dumps with the following. I also filed a bug on Bugzilla for that:

[root@co-db03 ulf]# fsck.ocfs2 /dev/mapper/aucp_data_bk_2_x
fsck.ocfs2 1.8.0
*** glibc detected *** fsck.ocfs2: double free or corruption (fasttop): 0x000000000197f320 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3656475366]
fsck.ocfs2[0x434c31]
fsck.ocfs2[0x403bc2]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x365641ecdd]
fsck.ocfs2[0x402879]
======= Memory map: ========
00400000-00450000 r-xp 00000000 fc:00 12489 /sbin/fsck.ocfs2
0064f000-00651000 rw-p 0004f000 fc:00 12489 /sbin/fsck.ocfs2
00651000-00652000 rw-p 00000000 00:00 0
00850000-00851000 rw-p 00050000 fc:00 12489 /sbin/fsck.ocfs2
0197e000-0199f000 rw-p 00000000 00:00 0 [heap]
3655c00000-3655c20000 r-xp 00000000 fc:00 8797 /lib64/ld-2.12.so
3655e1f000-3655e20000 r--p 0001f000 fc:00 8797 /lib64/ld-2.12.so
3655e20000-3655e21000 rw-p 00020000 fc:00 8797 /lib64/ld-2.12.so
3655e21000-3655e22000 rw-p 00000000 00:00 0
3656400000-3656589000 r-xp 00000000 fc:00 8798 /lib64/libc-2.12.so
3656589000-3656788000 ---p 00189000 fc:00 8798 /lib64/libc-2.12.so
3656788000-365678c000 r--p 00188000 fc:00 8798 /lib64/libc-2.12.so
365678c000-365678d000 rw-p 0018c000 fc:00 8798 /lib64/libc-2.12.so
365678d000-3656792000 rw-p 00000000 00:00 0
3659c00000-3659c16000 r-xp 00000000 fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3659c16000-3659e15000 ---p 00016000 fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3659e15000-3659e16000 rw-p 00015000 fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3d3e800000-3d3e817000 r-xp 00000000 fc:00 12028 /lib64/libpthread-2.12.so
3d3e817000-3d3ea17000 ---p 00017000 fc:00 12028 /lib64/libpthread-2.12.so
3d3ea17000-3d3ea18000 r--p 00017000 fc:00 12028 /lib64/libpthread-2.12.so
3d3ea18000-3d3ea19000 rw-p 00018000 fc:00 12028 /lib64/libpthread-2.12.so
3d3ea19000-3d3ea1d000 rw-p 00000000 00:00 0
3e26600000-3e26603000 r-xp 00000000 fc:00 426 /lib64/libcom_err.so.2.1
3e26603000-3e26802000 ---p 00003000 fc:00 426 /lib64/libcom_err.so.2.1
3e26802000-3e26803000 r--p 00002000 fc:00 426 /lib64/libcom_err.so.2.1
3e26803000-3e26804000 rw-p 00003000 fc:00 426 /lib64/libcom_err.so.2.1
7fb063711000-7fb063714000 rw-p 00000000 00:00 0
7fb06371d000-7fb063720000 rw-p 00000000 00:00 0
7fffd5b95000-7fffd5bb6000 rw-p 00000000 00:00 0 [stack]
7fffd5bc5000-7fffd5bc6000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Abort (core dumped)

I think one of the main questions is what "Invalid name for a cluster while trying to join the group" or "Invalid name for a cluster while opening device" means. I am pretty sure that /etc/sysconfig/o2cb and /etc/ocfs2/cluster.conf are correct.

Ulf.
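For anyone following up on the fsck.ocfs2 crash, the backtrace in the glibc abort message can be turned into something symbol-resolvable. The following is only a rough sketch, not part of the original report: the function names parse_backtrace and addresses_for are made up for illustration, and the sample text is copied from the backtrace above. It collects the frame addresses that fall inside fsck.ocfs2 and prints an addr2line command line for them. Whether addr2line can resolve them depends on having debug information that matches the installed binary, which is an assumption here.

```python
import re

# One backtrace frame as printed by glibc: module, optional (symbol+offset), [address].
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)(?:\((?P<symbol>[^)]*)\))?\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_backtrace(text):
    """Return a list of (module, symbol_or_None, address) tuples, in order."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                (match.group("module"), match.group("symbol"), int(match.group("addr"), 16))
            )
    return frames


def addresses_for(frames, module):
    """Hex addresses of all frames that belong to the given module."""
    return [hex(addr) for mod, _symbol, addr in frames if mod == module]


# Backtrace section copied from the report above.
SAMPLE = """\
======= Backtrace: =========
/lib64/libc.so.6[0x3656475366]
fsck.ocfs2[0x434c31]
fsck.ocfs2[0x403bc2]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x365641ecdd]
fsck.ocfs2[0x402879]
"""

frames = parse_backtrace(SAMPLE)
fsck_addrs = addresses_for(frames, "fsck.ocfs2")

# The binary path is taken from the memory map in the report; the command is
# only printed here, not executed.
print("addr2line -f -e /sbin/fsck.ocfs2 " + " ".join(fsck_addrs))
```

The memory map shows fsck.ocfs2 mapped at 00400000, so the addresses in the backtrace can probably be passed to addr2line unchanged; for a relocated binary they would first need the load offset subtracted.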
Sunil Mushran
2013-Jun-21 18:11 UTC
[Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6
Can you dump the following using the 1.8 binary?

debugfs.ocfs2 -R "stats" /dev/mapper/.....
Hi,

We saw the following errors this morning on 2 nodes of our 3-node OCFS2 (1.4.7-1) cluster running on Red Hat 5.5. Does anyone know what they mean?

Sep 12 07:02:54 ncom kernel:  [<ffffffff888144b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep 12 07:02:54 ncom kernel:  [<ffffffff888158d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
Sep 12 07:02:54 ncom kernel:  [<ffffffff8881429c>] :ocfs2:ocfs2_init_mask_waiter+0x24/0x3d
Sep 12 07:02:54 ncom kernel:  [<ffffffff888160b8>] :ocfs2:ocfs2_inode_lock_full+0x24e/0xfd9
Sep 12 07:02:54 ncom kernel:  [<ffffffff88834379>] :ocfs2:ocfs2_mknod+0xf4/0xa07
Sep 12 07:02:54 ncom kernel:  [<ffffffff88834e1b>] :ocfs2:ocfs2_create+0x91/0xff
Sep 12 07:02:54 ncom kernel:  [<ffffffff888144b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep 12 07:02:54 ncom kernel:  [<ffffffff888158d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
Sep 12 07:02:54 ncom kernel:  [<ffffffff888160b8>] :ocfs2:ocfs2_inode_lock_full+0x24e/0xfd9
Sep 12 07:02:54 ncom kernel:  [<ffffffff888128b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
Sep 12 07:02:54 ncom kernel:  [<ffffffff88824176>] :ocfs2:ocfs2_permission+0x77/0x1a4
Sep 12 07:02:54 ncom kernel:  [<ffffffff888157d1>] :ocfs2:ocfs2_cluster_lock+0x8a7/0x9d3
Sep 12 07:02:54 ncom kernel:  [<ffffffff888144b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep 12 07:02:54 ncom kernel:  [<ffffffff888158d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
Sep 12 07:02:54 ncom kernel:  [<ffffffff888160b8>] :ocfs2:ocfs2_inode_lock_full+0x24e/0xfd9
Sep 12 07:02:54 ncom kernel:  [<ffffffff888128b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
Sep 12 07:02:54 ncom kernel:  [<ffffffff88824176>] :ocfs2:ocfs2_permission+0x77/0x1a4
Sep 12 07:02:54 ncom kernel:  [<ffffffff8881429c>] :ocfs2:ocfs2_init_mask_waiter+0x24/0x3d
Sep 12 07:02:54 ncom kernel:  [<ffffffff888157d1>] :ocfs2:ocfs2_cluster_lock+0x8a7/0x9d3
Sep 12 07:02:54 ncom kernel:  [<ffffffff888144b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep 12 07:02:54 ncom kernel:  [<ffffffff888158d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
Sep 12 07:02:54 ncom kernel:  [<ffffffff888160b8>] :ocfs2:ocfs2_inode_lock_full+0x24e/0xfd9
Sep 12 07:02:54 ncom kernel:  [<ffffffff888128b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
Sep 12 07:02:54 ncom kernel:  [<ffffffff88824176>] :ocfs2:ocfs2_permission+0x77/0x1a4

We also saw the application get stuck, but we are not sure which was the cause:

Sep 12 07:02:54 ncom kernel: INFO: task ReplayServer:19626 blocked for more than 120 seconds.
Sep 12 07:02:54 ncom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 07:02:54 ncom kernel: ReplayServer  D ffff810b12997860     0 19626   9648         19627 19625 (NOTLB)
Sep 12 07:02:54 ncom kernel:  ffff810a7e513dc8 0000000000200086 ffff810c257feba0 ffffffff800099ae
Sep 12 07:02:54 ncom kernel:  0000000000002440 000000000000000a ffff810aafcbe820 ffff810b12997860
Sep 12 07:02:54 ncom kernel:  00018bd3e41e07b5 000000000000e79e ffff810aafcbea08 0000000127eda9c0
Sep 12 07:02:54 ncom kernel: Call Trace:
Sep 12 07:02:54 ncom kernel:  [<ffffffff800099ae>] __link_path_walk+0x173/0xf42
Sep 12 07:02:54 ncom kernel:  [<ffffffff8002cd2c>] mntput_no_expire+0x19/0x89
Sep 12 07:02:54 ncom kernel:  [<ffffffff8000ea75>] link_path_walk+0xa6/0xb2
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064c6f>] __mutex_lock_slowpath+0x60/0x9b
Sep 12 07:02:54 ncom kernel:  [<ffffffff800236d7>] __path_lookup_intent_open+0x56/0x97
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064cb9>] .text.lock.mutex+0xf/0x14
Sep 12 07:02:54 ncom kernel:  [<ffffffff8001afe7>] open_namei+0xea/0x6d5
Sep 12 07:02:54 ncom kernel:  [<ffffffff80067b88>] do_page_fault+0x4fe/0x874
Sep 12 07:02:54 ncom kernel:  [<ffffffff800274fb>] do_filp_open+0x1c/0x38
Sep 12 07:02:54 ncom kernel:  [<ffffffff80019e1e>] do_sys_open+0x44/0xbe
Sep 12 07:02:54 ncom kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Sep 12 07:02:54 ncom kernel:
Sep 12 07:02:54 ncom kernel: INFO: task ReplayServer:19634 blocked for more than 120 seconds.
Sep 12 07:02:54 ncom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 07:02:54 ncom kernel: ReplayServer  D ffff810b3ce150c0     0 19634   9648         19635 19633 (NOTLB)
Sep 12 07:02:54 ncom kernel:  ffff810aafa3bc58 0000000000200086 0000000052319eb3 00000000178c2c15
Sep 12 07:02:54 ncom kernel:  ffff810aafa3bba8 000000000000000a ffff810a914e3080 ffff810b3ce150c0
Sep 12 07:02:54 ncom kernel:  00018bd3ebab498a 0000000000018f23 ffff810a914e3268 00000001888140b3
Sep 12 07:02:54 ncom kernel: Call Trace:
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064c6f>] __mutex_lock_slowpath+0x60/0x9b
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064cb9>] .text.lock.mutex+0xf/0x14
Sep 12 07:02:54 ncom kernel:  [<ffffffff8000cef1>] do_lookup+0x90/0x1e6
Sep 12 07:02:54 ncom kernel:  [<ffffffff8000a23c>] __link_path_walk+0xa01/0xf42
Sep 12 07:02:54 ncom kernel:  [<ffffffff8000ea11>] link_path_walk+0x42/0xb2
Sep 12 07:02:54 ncom kernel:  [<ffffffff8000cce1>] do_path_lookup+0x275/0x2f1
Sep 12 07:02:54 ncom kernel:  [<ffffffff800236d7>] __path_lookup_intent_open+0x56/0x97
Sep 12 07:02:54 ncom kernel:  [<ffffffff8001af70>] open_namei+0x73/0x6d5
Sep 12 07:02:54 ncom kernel:  [<ffffffff80067b88>] do_page_fault+0x4fe/0x874
Sep 12 07:02:54 ncom kernel:  [<ffffffff800274fb>] do_filp_open+0x1c/0x38
Sep 12 07:02:54 ncom kernel:  [<ffffffff80019e1e>] do_sys_open+0x44/0xbe
Sep 12 07:02:54 ncom kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Sep 12 07:02:54 ncom kernel:
Sep 12 07:02:54 ncom kernel: INFO: task ReplayServer:19629 blocked for more than 120 seconds.
Sep 12 07:02:54 ncom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 07:02:54 ncom kernel: ReplayServer  D ffff81063d0e1420     0 19629   9648         19630 19628 (NOTLB)
Sep 12 07:02:54 ncom kernel:  ffff810b13007dc8 0000000000200082 ffff810c257feba0 ffffffff800099ae
Sep 12 07:02:54 ncom kernel:  0000000000002a97 000000000000000a ffff810b7723b7e0 ffff810628051080
Sep 12 07:02:54 ncom kernel:  00018bd41261c4c8 00000000000090fa ffff810b7723b9c8 0000000127eda9c0
Sep 12 07:02:54 ncom kernel: Call Trace:
Sep 12 07:02:54 ncom kernel:  [<ffffffff800099ae>] __link_path_walk+0x173/0xf42
Sep 12 07:02:54 ncom kernel:  [<ffffffff8002cd2c>] mntput_no_expire+0x19/0x89
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064c6f>] __mutex_lock_slowpath+0x60/0x9b
Sep 12 07:02:54 ncom kernel:  [<ffffffff800236d7>] __path_lookup_intent_open+0x56/0x97
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064cb9>] .text.lock.mutex+0xf/0x14
Sep 12 07:02:54 ncom kernel:  [<ffffffff8001afe7>] open_namei+0xea/0x6d5
Sep 12 07:02:54 ncom kernel:  [<ffffffff88277b9f>] :dm_mod:dm_put+0x2a/0x15d
Sep 12 07:02:54 ncom kernel:  [<ffffffff800274fb>] do_filp_open+0x1c/0x38
Sep 12 07:02:54 ncom kernel:  [<ffffffff80019e1e>] do_sys_open+0x44/0xbe
Sep 12 07:02:54 ncom kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Sep 12 07:02:54 ncom kernel:
Sep 12 07:02:54 ncom kernel: INFO: task ReplayServer:19639 blocked for more than 120 seconds.
Sep 12 07:02:54 ncom kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 07:02:54 ncom kernel: ReplayServer  D 0000000000000000     0 19639   9648         19640 19638 (NOTLB)
Sep 12 07:02:54 ncom kernel:  ffff810a916f3b98 0000000000200082 00000000000024f0 0000000014234000
Sep 12 07:02:54 ncom kernel:  ffff810c2427bba0 0000000000000009 ffff810b1768e0c0 ffff810a914e37e0
Sep 12 07:02:54 ncom kernel:  00018bd422e0df8e 00000000000006cc ffff810b1768e2a8 0000000180019bc9
Sep 12 07:02:54 ncom kernel: Call Trace:
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064167>] wait_for_completion+0x79/0xa2
Sep 12 07:02:54 ncom kernel:  [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Sep 12 07:02:54 ncom kernel:  [<ffffffff888144b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep 12 07:02:54 ncom kernel:  [<ffffffff888158d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
Sep 12 07:02:54 ncom kernel:  [<ffffffff8881429c>] :ocfs2:ocfs2_init_mask_waiter+0x24/0x3d
Sep 12 07:02:54 ncom kernel:  [<ffffffff888160b8>] :ocfs2:ocfs2_inode_lock_full+0x24e/0xfd9
Sep 12 07:02:54 ncom kernel:  [<ffffffff88834379>] :ocfs2:ocfs2_mknod+0xf4/0xa07
Sep 12 07:02:54 ncom kernel:  [<ffffffff88834e1b>] :ocfs2:ocfs2_create+0x91/0xff
Sep 12 07:02:54 ncom kernel:  [<ffffffff8003a8a2>] vfs_create+0xe6/0x158
Sep 12 07:02:54 ncom kernel:  [<ffffffff8001b09a>] open_namei+0x19d/0x6d5
Sep 12 07:02:54 ncom kernel:  [<ffffffff800274fb>] do_filp_open+0x1c/0x38
Sep 12 07:02:54 ncom kernel:  [<ffffffff80019e1e>] do_sys_open+0x44/0xbe
Sep 12 07:02:54 ncom kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83

Thanks,
William
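To help answer "which one was the cause", it can be useful to separate the blocked tasks that are actually waiting inside OCFS2 from those that are only queued on an ordinary kernel mutex. The following is a minimal sketch, assuming the syslog format shown in this message; the helper names summarize and blocked_in_module are invented for illustration, and the sample is a shortened excerpt of the log above (two of the four tasks, a few frames each).

```python
import re

# "INFO: task <name>:<pid> blocked for more than <n> seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# "[<address>] :module:function+0x..." (the ":module:" part is absent for core kernel code)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?(?P<func>[\w.]+)\+0x"
)


def summarize(log_text):
    """Group call-trace frames under the hung-task report that precedes them."""
    tasks = []
    current = None
    for line in log_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            current = {"name": hung.group("name"), "pid": int(hung.group("pid")), "frames": []}
            tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            # Frames without a module prefix are labelled "kernel" here.
            current["frames"].append((frame.group("module") or "kernel", frame.group("func")))
    return tasks


def blocked_in_module(tasks, module):
    """PIDs of tasks whose call trace passes through the given module."""
    return [t["pid"] for t in tasks if any(mod == module for mod, _func in t["frames"])]


SAMPLE_LOG = """\
Sep 12 07:02:54 ncom kernel: INFO: task ReplayServer:19634 blocked for more than 120 seconds.
Sep 12 07:02:54 ncom kernel: Call Trace:
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064c6f>] __mutex_lock_slowpath+0x60/0x9b
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064cb9>] .text.lock.mutex+0xf/0x14
Sep 12 07:02:54 ncom kernel:  [<ffffffff8000cef1>] do_lookup+0x90/0x1e6
Sep 12 07:02:54 ncom kernel: INFO: task ReplayServer:19639 blocked for more than 120 seconds.
Sep 12 07:02:54 ncom kernel: Call Trace:
Sep 12 07:02:54 ncom kernel:  [<ffffffff80064167>] wait_for_completion+0x79/0xa2
Sep 12 07:02:54 ncom kernel:  [<ffffffff888144b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep 12 07:02:54 ncom kernel:  [<ffffffff888158d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
"""

tasks = summarize(SAMPLE_LOG)
for task in tasks:
    print(task["name"], task["pid"], [func for _mod, func in task["frames"]])
print("waiting inside ocfs2:", blocked_in_module(tasks, "ocfs2"))
```

On the excerpt, only PID 19639 has OCFS2 frames (it is waiting in ocfs2_cluster_lock while creating a file), while 19634 is waiting on a mutex during path lookup. That would be consistent with the other ReplayServer tasks queuing behind the one waiting for the cluster lock, though the log alone does not prove it.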