I have had multiple issues with geo-replication. It works fine initially: the replica catches up and stays in sync. But not long after (a couple of days or so), the replication goes into a Faulty state and never recovers. I have tried a few times now; on the last attempt I re-created the slave volume and set up the replication again, with the same symptoms. I am running Gluster 3.7.3; my setup and log messages are at the bottom of this email. Any idea what could cause this and how to fix it?

Thanks,
Thibault.

ps: my setup and log messages:

Master:

Volume Name: home
Type: Replicate
Volume ID: 2299a204-a1dc-449d-8556-bc65197373c7
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: server4.uberit.net:/gluster/home-brick-1
Brick2: server5.uberit.net:/gluster/home-brick-1
Options Reconfigured:
performance.readdir-ahead: on
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

Slave:

Volume Name: homegs
Type: Distribute
Volume ID: 746dfdc3-650d-4468-9fdd-d621dd215b94
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: remoteserver1.uberit.net:/gluster/homegs-brick-1/brick
Options Reconfigured:
performance.readdir-ahead: on

The geo-replication status and config (I believe I ended up with only default values) are:

# gluster volume geo-replication home ssh://remoteserver1::homegs status

MASTER NODE    MASTER VOL    MASTER BRICK             SLAVE USER    SLAVE                          SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------
server5        home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs    N/A           Faulty    N/A             N/A
server4        home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs    N/A           Faulty    N/A             N/A

# gluster volume geo-replication home ssh://remoteserver1::homegs config
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.socket
gluster_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: false
change_detector: changelog
gluster_command_dir: /usr/sbin/
georep_session_working_dir: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/
state_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.status
remote_gsyncd: /nonexistent/gsyncd
session_owner: 2299a204-a1dc-449d-8556-bc65197373c7
changelog_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-changes.log
socketdir: /var/run/gluster
working_dir: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs
state_detail_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/home_remoteserver_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.pid
log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.log
gluster_params: aux-gfid-mount acl
volume_id: 2299a204-a1dc-449d-8556-bc65197373c7

The logs on master server1 look like:

[2015-08-24 15:21:07.955600] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-08-24 15:21:07.955883] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-08-24 15:21:08.69528] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://root@vivlinuxinfra1.uberit.net:gluster://localhost:homegs
[2015-08-24 15:21:08.70938] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-08-24 15:21:11.255237] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-08-24 15:21:11.255532] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.256570] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-08-24 15:21:11.256726] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.257345] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-08-24 15:21:11.257534] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:13.333628] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40172.18.0.169%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
[2015-08-24 15:21:13.333870] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073
[2015-08-24 15:21:13.401132] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
[2015-08-24 15:21:13.412795] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-08-24 15:21:13.427340] I [master(/gluster/home-brick-1):1127:crawl] _GMaster: starting history crawl... turns: 1, stime: (1440411353, 0)
[2015-08-24 15:21:14.432327] I [master(/gluster/home-brick-1):1156:crawl] _GMaster: slave's time: (1440411353, 0)
[2015-08-24 15:21:14.890889] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 20960:140215190427392:1440426074.56 (entry_ops) failed on peer with OSError
[2015-08-24 15:21:14.891124] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 584, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1165, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1074, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 952, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 907, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error
[2015-08-24 15:21:14.892291] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
[2015-08-24 15:21:14.893665] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 15:21:14.893879] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-08-24 15:21:15.259360] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase

and on master server2:

[2015-08-24 15:21:07.650707] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-08-24 15:21:07.651144] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-08-24 15:21:07.764817] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://root@remoteserver1.uberit.net:gluster://localhost:homegs
[2015-08-24 15:21:07.768552] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-08-24 15:21:11.9820] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-08-24 15:21:11.10199] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.10946] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-08-24 15:21:11.11115] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:11.11744] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-08-24 15:21:11.11933] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-08-24 15:21:13.59192] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
[2015-08-24 15:21:13.59454] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073
[2015-08-24 15:21:13.113203] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
[2015-08-24 15:21:13.122018] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-08-24 15:21:23.209912] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 1561:140164806457088:1440426083.11 (keep_alive) failed on peer with OSError
[2015-08-24 15:21:23.210119] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 438, in keep_alive
    cls.slave.server.keep_alive(vi)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 22] Invalid argument
[2015-08-24 15:21:23.210975] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
[2015-08-24 15:21:23.212455] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 15:21:23.212707] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-08-24 15:21:24.23336] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase

and on the slave (in a different timezone, one hour behind):

[2015-08-24 14:22:02.923098] I [resource(slave):844:service_loop] GLUSTER: slave listening
[2015-08-24 14:22:07.606774] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 731, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
[2015-08-24 14:22:07.652092] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-08-24 14:22:07.652364] I [syncdutils(slave):220:finalize] <top>: exiting.
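Since the slave-side traceback ends with lsetxattr() failing with EIO, one thing worth ruling out is the slave brick's backing filesystem refusing extended attributes. The small standalone helper below is my own sketch (the function name and probe attribute are made up, it is not part of Gluster); it just checks whether a user.* xattr round-trips on a scratch file under a given path:

```python
import os
import tempfile

def brick_supports_user_xattrs(path):
    """Set and read back a throwaway user.* extended attribute on a
    scratch file under `path`; returns True if it round-trips, False
    if the filesystem refuses (EOPNOTSUPP, EIO, ...)."""
    if not hasattr(os, "setxattr"):
        # os.setxattr/getxattr are only available on Linux builds of Python
        return False
    fd, name = tempfile.mkstemp(dir=path)
    try:
        os.setxattr(name, "user.georep-probe", b"1")
        return os.getxattr(name, "user.georep-probe") == b"1"
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(name)

# e.g. on the slave: brick_supports_user_xattrs("/gluster/homegs-brick-1/brick")
```

If this returned False when run against the slave brick directory, the EIO would point at the underlying filesystem or disk rather than at geo-replication itself.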