Hi,

I have two nodes (SLES 11 SP4 64bit), which are connected via FC to a SAN. On the SAN I created an OCFS2 volume.
One host (let's call it 20) mounts the OCFS2 volume automatically while booting. The other (let's call it 10) doesn't.
Here is my /etc/ocfs2/cluster.conf:

cluster:
        node_count = 2
        name = idg

node:
        ip_port = 7777
        ip_address = 192.168.100.10
        number = 1
        name = sunhb65277
        cluster = idg

node:
        ip_port = 7777
        ip_address = 192.168.100.20
        number = 2
        name = sunhb58820
        cluster = idg

192.168.100.10 is host 10, 192.168.100.20 is host 20. The file is identical on both nodes.

/etc/fstab:
/dev/disk/by-id/dm-uuid-mpath-3600c0ff00012824b04af7a5201000000 /images ocfs2 _netdev,defaults 0 0

This is the error message on 10:

Oct 24 19:16:27 sunhb65277 kernel: [ 46.302046] OCFS2 1.5.0
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296137] (kworker/u:0,5,3):o2net_connect_expired:1724 ERROR: no connection established with node 2 after 60.0 seconds, giving up and returning errors.
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296182] (mount.ocfs2,6555,0):dlm_request_join:1472 ERROR: Error -107 when sending message 510 (key 0x666c6172) to node 2
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296188] (mount.ocfs2,6555,0):dlm_try_to_join_domain:1648 ERROR: status = -107
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296193] (mount.ocfs2,6555,0):dlm_join_domain:1948 ERROR: status = -107
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296311] (mount.ocfs2,6555,0):dlm_register_domain:2214 ERROR: status = -107
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296330] (mount.ocfs2,6555,0):o2cb_cluster_connect:313 ERROR: status = -107
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296334] (mount.ocfs2,6555,0):ocfs2_dlm_init:2995 ERROR: status = -107
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296350] (mount.ocfs2,6555,0):ocfs2_mount_volume:1881 ERROR: status = -107
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296387] ocfs2: Unmounting device (252,5) on (node 0)
Oct 24 19:17:23 sunhb65277 kernel: [ 102.296395] (mount.ocfs2,6555,0):ocfs2_fill_super:1236 ERROR: status = -107

The error makes sense. In SLES, the firewall init script is the last one executed. I don't know why, but it seems to be normal for SuSE.
So port 7777 is not yet open when host 20 tries to connect to host 10. And by the time the port is opened, host 20 has already stopped trying to connect.
The host is stuck in the ocfs2 init script until the "Network idle timeout: 60000" has run out.
In the other direction (host 20 booting, regardless of whether host 10 is online or not), the ocfs2 init script starts the mount, waits a few seconds,
and the host continues to boot (and the ocfs2 volume is mounted).

What I have found out so far is that the node with the higher number (number 2, host 20) tries to connect to the node with the lower number (number 1, host 10)
(https://oss.oracle.com/pipermail/ocfs2-users/2009-June/003626.html).
I would have expected that the booting host is always the one that tries to connect to the other one(s).
Host 20 also mounts automatically when host 10 is offline.

Questions:

Why does host 20 mount automatically while host 10 does not?
How does host 20 know that host 10 is trying to mount the ocfs2 volume?
At exactly that moment host 20 tries to connect to host 10 on port 7777,
yet no packet from host 10 is visible beforehand.

Of course I could tinker with the init scripts or change their order, but I would prefer a solution that works out of the box.
And I would like to understand it.

Bernd

--
Bernd Lentes

System administration
Institute of Developmental Genetics
Building 35.34, Room 208
HelmholtzZentrum München
bernd.lentes at helmholtz-muenchen.de
phone: +49 (0)89 3187 1241
fax: +49 (0)89 3187 2294
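The connection direction described in the post can be made concrete with a small sketch. It assumes, as the linked ocfs2-users message states, that for each pair of nodes the one with the higher number opens the TCP connection to the one with the lower number. The function names `parse_stanzas` and `connection_plan` are made up for this illustration and are not part of ocfs2-tools; the parser only handles the simple stanza layout shown in the cluster.conf above.

```python
from itertools import combinations

# cluster.conf content as posted above (indentation is not significant here).
CLUSTER_CONF = """\
cluster:
    node_count = 2
    name = idg

node:
    ip_port = 7777
    ip_address = 192.168.100.10
    number = 1
    name = sunhb65277
    cluster = idg

node:
    ip_port = 7777
    ip_address = 192.168.100.20
    number = 2
    name = sunhb58820
    cluster = idg
"""


def parse_stanzas(text):
    """Split the config into (stanza_type, {key: value}) pairs."""
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A header such as "cluster:" or "node:" starts a new stanza.
            stanzas.append((line[:-1], {}))
        elif "=" in line and stanzas:
            key, value = (part.strip() for part in line.split("=", 1))
            stanzas[-1][1][key] = value
    return stanzas


def connection_plan(text):
    """List (initiator, listener, listener_ip, listener_port) per node pair.

    Assumption taken from the thread: the higher-numbered node connects
    to the lower-numbered one.
    """
    nodes = [attrs for kind, attrs in parse_stanzas(text) if kind == "node"]
    plan = []
    for a, b in combinations(nodes, 2):
        low, high = sorted((a, b), key=lambda n: int(n["number"]))
        plan.append(
            (high["name"], low["name"], low["ip_address"], int(low["ip_port"]))
        )
    return plan


if __name__ == "__main__":
    for initiator, listener, ip, port in connection_plan(CLUSTER_CONF):
        print(f"{initiator} connects to {listener} at {ip}:{port}")
```

Under that assumption, host 20 (sunhb58820) is the one that has to reach port 7777 on host 10 (sunhb65277), which is why a closed port on host 10 during boot matters and the reverse case does not.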
Hi Bernd,

The error message shows that the connection between node 1 and node 2 cannot be established.
So you should make sure that the network is up before mounting and that the peer can be reached on port 7777.

Thanks,
Joseph

On 2016/10/25 3:36, Lentes, Bernd wrote:
> [...]
> This is the error message on 10:
>
> Oct 24 19:16:27 sunhb65277 kernel: [ 46.302046] OCFS2 1.5.0
> Oct 24 19:17:23 sunhb65277 kernel: [ 102.296137] (kworker/u:0,5,3):o2net_connect_expired:1724 ERROR: no connection established with node 2 after 60.0 seconds, giving up and returning errors.
> [...]
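One way to check the reachability Joseph mentions is a plain TCP probe, sketched below. This is only an illustration: `wait_for_port` is a hypothetical helper, not something shipped with ocfs2-tools or SLES, and the probe does not speak the o2net protocol, so it only shows whether a TCP connection to the port can be opened from the machine it runs on. How o2net reacts to or logs such a probe is not covered in this thread.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connect to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            # Each attempt is capped so the overall deadline is respected.
            with socket.create_connection(
                (host, port), timeout=min(interval, remaining)
            ):
                return True
        except OSError:
            # Refused, filtered, or unreachable: pause briefly, then retry.
            time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
```

With the addresses from the cluster.conf in this thread, running `wait_for_port("192.168.100.10", 7777)` on host 20 would report whether host 10 accepts connections on the o2net port within 60 seconds.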