Hi all,

I'm running OCFS2 on two systems with OpenSUSE 10.2, connected over fibre channel to shared storage (HP MSA1500 + HP PROLIANT MSA20).

The cluster has two nodes (web-ha1 and web-ha2). Once or twice a month, OCFS2 stops working on both systems. On the first node I get no errors in the log files; after a forced shutdown of the first node, I see on the second node the logs at the bottom of this message.

I saw that some other people have run into a similar problem (http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg01135.html), but that thread didn't help me...

Does anyone have any idea?

Thank you in advance.

Maurizio


web-ha1:~ # cat /etc/sysconfig/o2cb
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=451

web-ha1:~ # cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.255.1
        number = 0
        name = web-ha1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.255.2
        number = 1
        name = web-ha2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2


Nov 28 15:28:59 web-ha2 kernel: o2net: connection to node web-ha1 (num 0) at 192.168.255.1:7777 has been idle for 10 seconds, shutting it down.
Nov 28 15:28:59 web-ha2 kernel: (23432,0):o2net_idle_timer:1297 here are some times that might help debug the situation: (tmr 1196260129.36511 now 1196260139.34907 dr 1196260129.36503 adv 1196260129.36514:1196260129.36515 func (95bc84eb:504) 1196260129.36329:1196260129.36337)
Nov 28 15:28:59 web-ha2 kernel: o2net: no longer connected to node web-ha1 (num 0) at 192.168.255.1:7777
Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_do_master_request:1331 ERROR: link to 0 went down!
Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_get_lock_resource:915 ERROR: status = -112
Nov 28 15:29:18 web-ha2 sshd[23503]: pam_unix2(sshd:auth): conversation failed
Nov 28 15:29:18 web-ha2 sshd[23503]: error: ssh_msg_send: write
Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_do_master_request:1331 ERROR: link to 0 went down!
Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_get_lock_resource:915 ERROR: status = -107
Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_do_master_request:1331 ERROR: link to 0 went down!
Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_get_lock_resource:915 ERROR: status = -107
Nov 28 15:29:46 web-ha2 kernel: (23443,0):dlm_do_master_request:1331 ERROR: link to 0 went down!
ERROR: status = -107

[...]

Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17804,1):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M0000000000000009e7e54516dd16ec: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17730,1):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M0000000000000009e76bf516dd144d: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17634,0):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M000000000000000ac0d22b1f78e53c: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_wait_for_lock_mastery:1036 ERROR: status = -11

[...]

Nov 22 18:14:54 web-ha2 kernel: (17702,1):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M0000000000000007a6dab9ef6eacbd: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (17701,1):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M000000000000000a06a13716de553e: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:849 86472C5C33A54FF88030591B1210C560:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:876 86472C5C33A54FF88030591B1210C560: recovery map is not empty, but must master $RECOVERY lock now
Nov 22 18:14:54 web-ha2 kernel: (17893,0):ocfs2_replay_journal:1184 Recovering node 0 from slot 0 on device (8,17)
Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
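For what it's worth, two different timeouts are in play in the logs above. The first o2net message is the network idle timer firing: the cluster interconnect was silent for 10 seconds (the default in the OCFS2 1.2 era), and that timer is independent of the disk heartbeat controlled by O2CB_HEARTBEAT_THRESHOLD (the 451 above works out to (451 - 1) * 2 = 900 seconds). Below is a minimal sketch of raising the network idle timeout, assuming an ocfs2-tools release (roughly 1.2.5 or later) whose o2cb init script honors O2CB_IDLE_TIMEOUT_MS; the 30000 ms value is only illustrative, and every node in the cluster needs identical settings:

web-ha1:~ # cat /etc/sysconfig/o2cb
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=451
# assumption: only honored by ocfs2-tools releases that expose this knob
O2CB_IDLE_TIMEOUT_MS=30000

# restart the cluster stack so the new timeout takes effect
# (unmount all OCFS2 volumes on this node first)
web-ha1:~ # /etc/init.d/o2cb offline ocfs2
web-ha1:~ # /etc/init.d/o2cb unload
web-ha1:~ # /etc/init.d/o2cb load
web-ha1:~ # /etc/init.d/o2cb online ocfs2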
What's the kernel version?

inode wrote:
> Hi all,
>
> I'm running OCFS2 on two systems with OpenSUSE 10.2, connected over
> fibre channel to shared storage (HP MSA1500 + HP PROLIANT MSA20).
>
> The cluster has two nodes (web-ha1 and web-ha2). Once or twice a
> month, OCFS2 stops working on both systems. On the first node I get
> no errors in the log files; after a forced shutdown of the first
> node, I see on the second node the logs at the bottom of this message.
>
> [...]
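If it helps, the version information being asked for is usually collected with something like the following; uname and modinfo are standard, while the rpm query is just a sketch since the exact package names on SUSE may vary:

web-ha2:~ # uname -r
web-ha2:~ # modinfo ocfs2 | grep -i version
web-ha2:~ # rpm -qa | grep -i ocfs2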