Hello, I've set up multiple SLES 10 boxes for an OCFS2/Linux-HA solution,
but I'm having some difficulties with o2net. Only one pair of boxes
works with OCFS2/Linux-HA.
I really need some help with this, because I don't know what's different
about the one pair that works, except that its SLES 10 installs only got the
heartbeat* and ocfs2* updates, while the others got full updates.
---- network config ----
system1:
eth1 Link encap:Ethernet HWaddr 00:0C:29:1A:7C:6A
inet addr:192.168.100.1 Bcast:192.168.100.255 Mask:255.255.255.0
system2:
eth1 Link encap:Ethernet HWaddr 00:0C:29:17:E1:D3
inet addr:192.168.100.2 Bcast:192.168.100.255 Mask:255.255.255.0
system1:~ # ping 192.168.100.2
PING 192.168.100.2 (192.168.100.2) 56(84) bytes of data.
64 bytes from 192.168.100.2: icmp_seq=1 ttl=64 time=1.12 ms
64 bytes from 192.168.100.2: icmp_seq=2 ttl=64 time=0.199 ms
64 bytes from 192.168.100.2: icmp_seq=3 ttl=64 time=0.334 ms
system2:~ # ping 192.168.100.1
PING 192.168.100.1 (192.168.100.1) 56(84) bytes of data.
64 bytes from 192.168.100.1: icmp_seq=1 ttl=64 time=0.941 ms
64 bytes from 192.168.100.1: icmp_seq=2 ttl=64 time=0.209 ms
system1:~ # cat /etc/hosts
127.0.0.1 localhost system1.site.pt system1
192.168.229.131 system2.site.pt system2
system2:~ # cat /etc/hosts
127.0.0.1 localhost system2.site.pt system2
192.168.229.130 system1.site.pt system1
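Note that the names in /etc/hosts resolve to eth0 addresses (192.168.229.x), and each box maps its own name to 127.0.0.1, while the interconnect on eth1 uses 192.168.100.x. A rough sketch for listing such differences on each box follows. It assumes the first hosts line that lists a name is the one that wins, and the helper names `resolve_from_hosts` and `interconnect_mismatches` are hypothetical, not part of any OCFS2 or Linux-HA tool.

```python
def resolve_from_hosts(text):
    """Map each hostname/alias in hosts-file text to the first IP that lists it."""
    table = {}
    for raw in text.splitlines():
        tokens = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(tokens) < 2:
            continue  # blank line or IP without names
        ip, *names = tokens
        for name in names:
            table.setdefault(name, ip)  # assumption: first matching line wins
    return table


def interconnect_mismatches(hosts_text, interconnect):
    """Return {name: (ip_from_hosts, interconnect_ip)} where the two disagree."""
    table = resolve_from_hosts(hosts_text)
    return {
        name: (table.get(name), ip)
        for name, ip in interconnect.items()
        if table.get(name) != ip
    }
```

On system1, passing the contents of /etc/hosts and `{"system1": "192.168.100.1", "system2": "192.168.100.2"}` reports both names, because neither resolves to its eth1 address.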
system1:~ # route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
192.168.100.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
192.168.229.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0
127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
0.0.0.0 192.168.229.2 0.0.0.0 UG 0 0 0 eth0
system2:~ # route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
192.168.100.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
192.168.229.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0
127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
0.0.0.0 192.168.229.2 0.0.0.0 UG 0 0 0 eth0
---- o2cb config ----
system1:~ # cat /etc/sysconfig/o2cb
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=cluster
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_HEARTBEAT_MODE="user"
system2:~ # cat /etc/sysconfig/o2cb
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=cluster
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_HEARTBEAT_MODE="user"
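For reference, the OCFS2 FAQ relates the disk heartbeat threshold to a timeout as threshold = timeout / 2 + 1 (one heartbeat every 2 seconds), so O2CB_HEARTBEAT_THRESHOLD=31 corresponds to roughly 60 seconds. This is a separate setting from the 10-second o2net idle timeout that appears in the log, and with O2CB_HEARTBEAT_MODE="user" the heartbeat is expected to come from Linux-HA, so whether this threshold is consulted at all in that mode is an assumption not checked here. The function name below is hypothetical.

```python
def disk_heartbeat_timeout_s(threshold: int) -> int:
    """Seconds of missed disk heartbeats before a node is considered dead.

    Based on the OCFS2 FAQ relation threshold = timeout_s / 2 + 1;
    assumed to apply to disk heartbeat mode only.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * 2


print(disk_heartbeat_timeout_s(31))  # 60
```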
---- /etc/ocfs2/cluster.conf ----
system1:~ # cat /etc/ocfs2/cluster.conf
node:
ip_port = 7777
ip_address = 192.168.100.1
number = 0
name = system1
cluster = cluster
node:
ip_port = 7777
ip_address = 192.168.100.2
number = 1
name = system2
cluster = cluster
cluster:
node_count = 2
name = cluster
system1:~ # md5sum /etc/ocfs2/cluster.conf
7cb6fa81132051e8a1951832d02945fc /etc/ocfs2/cluster.conf
system2:~ # md5sum /etc/ocfs2/cluster.conf
7cb6fa81132051e8a1951832d02945fc /etc/ocfs2/cluster.conf
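The md5sums show the two files are identical, but not that the file is internally consistent. A minimal sketch of such a check is below. It assumes the stanza layout shown above (a header line such as `node:` or `cluster:` followed by indented `key = value` lines), and `parse_cluster_conf` / `check_cluster_conf` are hypothetical helpers, not part of ocfs2-tools.

```python
def parse_cluster_conf(text):
    """Parse cluster.conf-style text into a list of (stanza_name, {key: value})."""
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            stanzas.append((line[:-1], {}))  # new stanza header, e.g. "node:"
        elif "=" in line and stanzas:
            key, _, value = line.partition("=")
            stanzas[-1][1][key.strip()] = value.strip()
        else:
            raise ValueError(f"unexpected line: {raw!r}")
    return stanzas


def check_cluster_conf(text):
    """Return a list of problems found; an empty list means none were found."""
    stanzas = parse_cluster_conf(text)
    nodes = [kv for name, kv in stanzas if name == "node"]
    clusters = {kv.get("name"): kv for name, kv in stanzas if name == "cluster"}
    problems = []
    # Node numbers, names and addresses must not collide.
    for field in ("number", "name", "ip_address"):
        values = [n.get(field) for n in nodes]
        if len(set(values)) != len(values):
            problems.append(f"duplicate node {field}: {values}")
    # Every node must belong to a declared cluster.
    for n in nodes:
        if n.get("cluster") not in clusters:
            problems.append(f"node {n.get('name')} refers to unknown cluster {n.get('cluster')!r}")
    # node_count must match the number of node stanzas in that cluster.
    for cname, kv in clusters.items():
        members = [n for n in nodes if n.get("cluster") == cname]
        if int(kv.get("node_count", -1)) != len(members):
            problems.append(f"cluster {cname}: node_count={kv.get('node_count')} but {len(members)} node stanzas")
    return problems
```

Run against the file above, `check_cluster_conf(open("/etc/ocfs2/cluster.conf").read())` should return an empty list.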
---- /var/log/messages ----
Jan 13 21:31:12 system1 kernel: Node system1 is up in group 03AE9F3FE04A4E5DAAD052FC42AE50E2
Jan 13 21:31:12 system1 kernel: Node system2 is up in group 03AE9F3FE04A4E5DAAD052FC42AE50E2
Jan 13 21:31:13 system1 kernel: o2net: accepted connection from node system2 (num 1) at 192.168.100.2:7777
Jan 13 21:31:15 system1 kernel: ocfs2_dlm: Nodes in domain ("03AE9F3FE04A4E5DAAD052FC42AE50E2"): 0 1
Jan 13 21:31:15 system1 kernel: kjournald starting. Commit interval 5 seconds
Jan 13 21:31:15 system1 kernel: ocfs2: Mounting device (8,17) on (node 0, slot 1)
Jan 13 21:31:16 system1 kernel: o2net: no longer connected to node system2 (num 1) at 192.168.100.2:7777
Jan 13 21:31:23 system2 kernel: o2net: connected to node system1 (num 0) at 192.168.100.1:7777
Jan 13 21:31:23 system2 kernel: ocfs2_dlm: Nodes in domain ("03AE9F3FE04A4E5DAAD052FC42AE50E2"): 1
Jan 13 21:31:23 system2 kernel: (3161,0):ocfs2_find_slot:261 slot 0 is already allocated to this node!
Jan 13 21:31:23 system2 kernel: (3161,0):ocfs2_check_volume:1651 File system was not unmounted cleanly, recovering volume.
Jan 13 21:31:23 system2 kernel: (fs/jbd/recovery.c, 255): journal_recover: JBD: recovery, exit status 0, recovered transactions 18 to 19
Jan 13 21:31:23 system2 kernel: (fs/jbd/recovery.c, 257): journal_recover: JBD: Replayed 0 and revoked 0/0 blocks
Jan 13 21:31:23 system2 kernel: kjournald starting. Commit interval 5 seconds
Jan 13 21:31:24 system2 kernel: ocfs2: Mounting device (8,17) on (node 1, slot 0)
Jan 13 21:31:24 system2 kernel: (3169,0):ocfs2_replay_journal:1174 Recovering node 0 from slot 1 on device (8,17)
Jan 13 21:31:25 system2 kernel: (fs/jbd/recovery.c, 255): journal_recover: JBD: recovery, exit status 0, recovered transactions 11 to 12
Jan 13 21:31:25 system2 kernel: (fs/jbd/recovery.c, 257): journal_recover: JBD: Replayed 0 and revoked 0/0 blocks
Jan 13 21:31:25 system2 kernel: kjournald starting. Commit interval 5 seconds
Jan 13 21:31:31 system2 kernel: ocfs2_dlm: Node 0 joins domain 03AE9F3FE04A4E5DAAD052FC42AE50E2
Jan 13 21:31:31 system2 kernel: ocfs2_dlm: Nodes in domain ("03AE9F3FE04A4E5DAAD052FC42AE50E2"): 0 1
Jan 13 21:31:33 system2 kernel: o2net: connection to node system1 (num 0) at 192.168.100.1:7777 has been idle for 10 seconds, shutting it down.
Jan 13 21:31:33 system2 kernel: (3173,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1168723883.85116 now 1168723893.85809 dr 1168723892.403630 adv 1168723892.403647:1168723892.403647 func (ce961a9e:504) 1168723892.403117:1168723892.403161)
Jan 13 21:31:33 system2 kernel: o2net: no longer connected to node system1 (num 0) at 192.168.100.1:7777
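To see how long each o2net connection survived on each box, the connect and disconnect messages can be paired per (host, peer). The sketch below is a hypothetical helper, not an OCFS2 tool. Syslog lines carry no year, so it takes one as a parameter; 2007 is inferred from the `tmr 1168723883` epoch value above, which corresponds to 2007-01-13 21:31:23 UTC. It also assumes English month abbreviations (C locale).

```python
import re
from datetime import datetime

# Matches o2net connect/disconnect lines such as:
# "Jan 13 21:31:23 system2 kernel: o2net: connected to node system1 (num 0) at 192.168.100.1:7777"
O2NET_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: o2net: "
    r"(?P<event>accepted connection from|connected to|no longer connected to) "
    r"node (?P<peer>\S+) \(num (?P<num>\d+)\)"
)


def o2net_sessions(lines, year=2007):
    """Pair each o2net connect with the next disconnect for the same (host, peer).

    Returns (host, peer, seconds_connected) tuples in disconnect order.
    """
    open_since = {}
    sessions = []
    for line in lines:
        m = O2NET_RE.match(line)
        if not m:
            continue  # not an o2net connect/disconnect message
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["peer"])
        if m["event"] == "no longer connected to":
            start = open_since.pop(key, None)
            if start is not None:
                sessions.append((*key, (ts - start).total_seconds()))
        else:
            open_since[key] = ts  # "accepted connection from" or "connected to"
    return sessions
```

Fed the log above (for example `o2net_sessions(open("/var/log/messages"))`), this should pair system1's connection to system2 (about 3 seconds) and system2's connection to system1 (about 10 seconds, matching the idle-timer message).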