I have had multiple issues with geo-replication. It works fine initially: the replica catches up and stays in sync. But not long after (a couple of days or so), the replication goes into a Faulty state and never recovers. I have tried a few times now; on the last attempt I re-created the slave volume and set up the replication again, with the same symptoms. I am running Gluster 3.7.3; my setup and log messages are at the bottom of this email. Any idea what could cause this and how to fix it?

Thanks,
Thibault.

ps: my setup and log messages:

Master:

Volume Name: home
Type: Replicate
Volume ID: 2299a204-a1dc-449d-8556-bc65197373c7
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: server4.uberit.net:/gluster/home-brick-1
Brick2: server5.uberit.net:/gluster/home-brick-1
Options Reconfigured:
performance.readdir-ahead: on
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

Slave:

Volume Name: homegs
Type: Distribute
Volume ID: 746dfdc3-650d-4468-9fdd-d621dd215b94
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: remoteserver1.uberit.net:/gluster/homegs-brick-1/brick
Options Reconfigured:
performance.readdir-ahead: on

The geo-replication status and config (I believe I ended up with only default values) are:

# gluster volume geo-replication home ssh://remoteserver1::homegs status

MASTER NODE    MASTER VOL    MASTER BRICK             SLAVE USER    SLAVE                          SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------
server5        home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs    N/A           Faulty    N/A             N/A
server4        home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs    N/A           Faulty    N/A             N/A

# gluster volume geo-replication home ssh://remoteserver1::homegs config
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.socket
gluster_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: false
change_detector: changelog
gluster_command_dir: /usr/sbin/
georep_session_working_dir: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/
state_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.status
remote_gsyncd: /nonexistent/gsyncd
session_owner: 2299a204-a1dc-449d-8556-bc65197373c7
changelog_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-changes.log
socketdir: /var/run/gluster
working_dir: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs
state_detail_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/home_remoteserver_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.pid
log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.log
gluster_params: aux-gfid-mount acl
volume_id: 2299a204-a1dc-449d-8556-bc65197373c7

The logs on master server1 look like:

[2015-08-24 15:21:07.955600] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-08-24 15:21:07.955883] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-08-24 15:21:08.69528] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://root@vivlinuxinfra1.uberit.net:gluster://localhost:homegs
[2015-08-24 15:21:08.70938] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-08-24 15:21:11.255237] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-08-24 15:21:11.255532] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.256570] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-08-24 15:21:11.256726] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.257345] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-08-24 15:21:11.257534] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:13.333628] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40172.18.0.169%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
[2015-08-24 15:21:13.333870] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073
[2015-08-24 15:21:13.401132] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
[2015-08-24 15:21:13.412795] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-08-24 15:21:13.427340] I [master(/gluster/home-brick-1):1127:crawl] _GMaster: starting history crawl... turns: 1, stime: (1440411353, 0)
[2015-08-24 15:21:14.432327] I [master(/gluster/home-brick-1):1156:crawl] _GMaster: slave's time: (1440411353, 0)
[2015-08-24 15:21:14.890889] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 20960:140215190427392:1440426074.56 (entry_ops) failed on peer with OSError
[2015-08-24 15:21:14.891124] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 584, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1165, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1074, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 952, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 907, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error
[2015-08-24 15:21:14.892291] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
[2015-08-24 15:21:14.893665] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 15:21:14.893879] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-08-24 15:21:15.259360] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase

and on master server2:

[2015-08-24 15:21:07.650707] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-08-24 15:21:07.651144] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-08-24 15:21:07.764817] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://root@remoteserver1.uberit.net:gluster://localhost:homegs
[2015-08-24 15:21:07.768552] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-08-24 15:21:11.9820] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-08-24 15:21:11.10199] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.10946] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-08-24 15:21:11.11115] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.11744] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-08-24 15:21:11.11933] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:13.59192] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
[2015-08-24 15:21:13.59454] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073
[2015-08-24 15:21:13.113203] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
[2015-08-24 15:21:13.122018] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-08-24 15:21:23.209912] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 1561:140164806457088:1440426083.11 (keep_alive) failed on peer with OSError
[2015-08-24 15:21:23.210119] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 438, in keep_alive
    cls.slave.server.keep_alive(vi)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 22] Invalid argument
[2015-08-24 15:21:23.210975] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
[2015-08-24 15:21:23.212455] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 15:21:23.212707] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-08-24 15:21:24.23336] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase

and on the slave (in a different timezone, one hour behind):

[2015-08-24 14:22:02.923098] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2015-08-24 14:22:07.606774] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 731, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
[2015-08-24 14:22:07.652092] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 14:22:07.652364] I [syncdutils(slave):220:finalize] <top>: exiting.
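Since the slave-side traceback ends with lsetxattr() failing with EIO, one thing worth ruling out is the slave brick's backing filesystem refusing extended attributes. The small standalone helper below is my own sketch (the function name and probe attribute are made up, it is not part of Gluster); it just checks whether a user.* xattr round-trips on a scratch file under a given path:

```python
import os
import tempfile

def brick_supports_user_xattrs(path):
    """Set and read back a throwaway user.* extended attribute on a
    scratch file under `path`; returns True if it round-trips, False
    if the filesystem refuses (EOPNOTSUPP, EIO, ...)."""
    if not hasattr(os, "setxattr"):
        # os.setxattr/getxattr are only available on Linux builds of Python
        return False
    fd, name = tempfile.mkstemp(dir=path)
    try:
        os.setxattr(name, "user.georep-probe", b"1")
        return os.getxattr(name, "user.georep-probe") == b"1"
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(name)

# e.g. on the slave: brick_supports_user_xattrs("/gluster/homegs-brick-1/brick")
```

If this returned False when run against the slave brick directory, the EIO would point at the underlying filesystem or disk rather than at geo-replication itself.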