thr3ads.net - Gluster users - [Gluster-users] Problem setting up geo-replication: gsyncd using incorrect slave hostname [Sep 2014]

Darrell Budic
2014-Sep-23 20:05 UTC
[Gluster-users] Problem setting up geo-replication: gsyncd using incorrect slave hostname

I was having trouble setting up geo-replication, and I finally figured out why.
Gsyncd is trying to use the wrong (but valid) name for the slave server, and
it's resolving to an address it can't reach. It does this even though I tried to
set up the geo-replication to a specific IP address, and the initial create
succeeded.

The environment I'm working with has server1 & server3 with bricks for a
replication volume, and my slave server "spire" is located remotely. Because
there are multiple paths between the systems, I wanted to force geo-replication
to occur with a specific slave address, so I tried to set up replication with an
IP-based URL.

Setup/config of the geo-replication works, and the session appears to start,
but it is permanently in a faulty state:

[root@server1 geo-replication]# gluster volume geo-replication status
No active geo-replication sessions
[root@server1 geo-replication]# gluster volume geo-replication gvOvirt 10.78.4.1::geobk-Ovirt create push-pem
Creating geo-replication session between gvOvirt & 10.78.4.1::geobk-Ovirt has been successful
[root@server1 geo-replication]# gluster volume geo-replication status

MASTER NODE            MASTER VOL    MASTER BRICK             SLAVE                           STATUS         CHECKPOINT STATUS    CRAWL STATUS
----------------------------------------------------------------------------------------------------------------------------------------------
server1.xxx.xxx.xxx    gvOvirt       /v0/ha-engine/gbOvirt    ssh://10.78.4.1::geobk-Ovirt    Not Started    N/A                  N/A
server3.xxx.xxx.xxx    gvOvirt       /v0/ha-engine/gbOvirt    ssh://10.78.4.1::geobk-Ovirt    Not Started    N/A                  N/A
[root@server1 geo-replication]# gluster volume geo-replication gvOvirt 10.78.4.1::geobk-Ovirt start
Starting geo-replication session between gvOvirt & 10.78.4.1::geobk-Ovirt has been successful
[root@server1 geo-replication]# gluster volume geo-replication status

MASTER NODE            MASTER VOL    MASTER BRICK             SLAVE                           STATUS    CHECKPOINT STATUS    CRAWL STATUS
-----------------------------------------------------------------------------------------------------------------------------------------
server1.xxx.xxx.xxx    gvOvirt       /v0/ha-engine/gbOvirt    ssh://10.78.4.1::geobk-Ovirt    faulty    N/A                  N/A
server3.xxx.xxx.xxx    gvOvirt       /v0/ha-engine/gbOvirt    ssh://10.78.4.1::geobk-Ovirt    faulty    N/A                  N/A

Tracking it down in the logs, it's because it's trying to connect to
"root@spire" instead of "root@10.78.4.1", even though the setup seemed to use
10.78.4.1 just fine:

[2014-09-23 14:13:41.595123] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2014-09-23 14:13:41.764898] I [gsyncd(/v0/ha-engine/gbOvirt):532:main_i] <top>: syncing: gluster://localhost:gvOvirt -> ssh://root@spire:gluster://localhost:geobk-Ovirt
[2014-09-23 14:13:53.40934] E [syncdutils(/v0/ha-engine/gbOvirt):223:log_raise_exception] <top>: connection to peer is broken
[2014-09-23 14:13:53.41244] E [resource(/v0/ha-engine/gbOvirt):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-gaHXrp/e68410dcff276efa39a94dc5b33e0d8e.sock root@spire /nonexistent/gsyncd --session-owner c9a371cd-8644-4706-a4b7-f12bc2c37ac6 -N --listen --timeout 120 gluster://localhost:geobk-Ovirt" returned with 255, saying:
[2014-09-23 14:13:53.41356] E [resource(/v0/ha-engine/gbOvirt):207:logerr] Popen: ssh> ssh_exchange_identification: Connection closed by remote host
[2014-09-23 14:13:53.41569] I [syncdutils(/v0/ha-engine/gbOvirt):192:finalize] <top>: exiting.
[2014-09-23 14:13:54.44226] I [monitor(monitor):157:monitor] Monitor: worker(/v0/ha-engine/gbOvirt) died in startup phase
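
For anyone wanting to check for the same mismatch without reading the log by eye, something like the following might help. It is only a rough sketch: the regular expression is written against the "syncing:" line format shown in the log above, the function names (slave_hosts, unexpected_hosts) are made up for illustration, and other gsyncd versions may log this line differently. The sample line is copied from the log above; in practice you would feed it the lines of your own gsyncd master log.

```python
import re

# Matches the "syncing:" line gsyncd logged at worker start in this thread, e.g.
#   ... <top>: syncing: gluster://localhost:gvOvirt -> ssh://root@spire:gluster://localhost:geobk-Ovirt
# Assumption: the format is taken from this one log excerpt only.
SYNC_RE = re.compile(
    r"syncing:\s+\S+\s+->\s+ssh://(?:(?P<user>[^@\s]+)@)?(?P<host>[^:\s]+):"
)


def slave_hosts(log_lines):
    """Return the set of slave hosts named in 'syncing:' log lines."""
    hosts = set()
    for line in log_lines:
        match = SYNC_RE.search(line)
        if match:
            hosts.add(match.group("host"))
    return hosts


def unexpected_hosts(log_lines, configured):
    """Return hosts gsyncd is syncing to that differ from the configured slave."""
    return sorted(slave_hosts(log_lines) - {configured})


# Sample line from the log in this post.
sample = [
    "[2014-09-23 14:13:41.764898] I [gsyncd(/v0/ha-engine/gbOvirt):532:main_i] "
    "<top>: syncing: gluster://localhost:gvOvirt -> "
    "ssh://root@spire:gluster://localhost:geobk-Ovirt",
]

# The session was created against 10.78.4.1, so any other host is a mismatch.
print(unexpected_hosts(sample, "10.78.4.1"))
```

An empty list would mean gsyncd is connecting to the slave address that was given at create time; anything else is the kind of substitution described here.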

So where did it pick this up from? If anything, I'd have expected it to try the
full name of the slave server, but it didn't even try that. The slave's
hostname, btw, is spire.yyy.yyy.xxx, not even in the same domain as the master.

I worked around this by setting up a new hostname resolution for "spire" and
tweaking the search path so it resolved the way I wanted (to 10.78.4.1), but
that shouldn't be necessary. It also seems like this might cause security
issues (well, OK, probably not, since it's rsyncing over ssh, but traffic could
definitely wind up where you didn't expect it), because it's not trying to
connect to what I explicitly told it to. In this case it ran afoul of some sshd
root-login rules, but it could just as easily have been firewall rules. It's
also confusing, because the geo-replication status report is misleading: it's
trying to sync to an address that isn't reflected there.
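
To confirm that a name-resolution workaround like this actually sends the short name where you intend, a quick check from the master could look roughly like the sketch below. The helper names (resolved_addresses, reaches_intended) are invented for illustration, the check covers IPv4 only, and the 192.0.2.10 address in the demo is a documentation placeholder, not the address "spire" really resolved to here.

```python
import socket


def resolved_addresses(host, resolver=socket.gethostbyname_ex):
    """Return the IPv4 addresses `host` resolves to, or [] if lookup fails."""
    try:
        _name, _aliases, addrs = resolver(host)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return []
    return list(addrs)


def reaches_intended(host, intended_ip, resolver=socket.gethostbyname_ex):
    """True only if `host` resolves, and every address is the intended one."""
    addrs = resolved_addresses(host, resolver)
    return bool(addrs) and all(addr == intended_ip for addr in addrs)


# Demo with stub resolvers so the example does not depend on local DNS.
def before_workaround(host):
    return (host, [], ["192.0.2.10"])  # placeholder for "some other path"


def after_workaround(host):
    return (host, [], ["10.78.4.1"])


print(reaches_intended("spire", "10.78.4.1", resolver=before_workaround))
print(reaches_intended("spire", "10.78.4.1", resolver=after_workaround))
```

On a real master node you would drop the resolver argument and call reaches_intended("spire", "10.78.4.1") so the system resolver (hosts file, search path and DNS) is what gets tested.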

Seems like a bug to me, but I figured I'd check here first.

  -Darrell
