Hi Marcus,

Can you please share mount log from slave (You can find it at
"/var/log/glusterfs/geo-replication-slaves/<mastervol>hostname<slavevol>/mnt____.log").

- Sunny
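A minimal sketch of pulling that log, assuming the default log location (the session directory name embeds the master volume, slave host and slave volume, and the exact mnt-*.log file name varies, so list the directory first):

    # on a slave node: find the session directory, then tail its mount log
    ls /var/log/glusterfs/geo-replication-slaves/
    # <session_dir> is a placeholder for whatever the previous command shows
    tail -n 200 /var/log/glusterfs/geo-replication-slaves/<session_dir>/mnt-*.log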
On Tue, Aug 14, 2018 at 12:48 AM Marcus Pedersén <marcus.pedersen at slu.se> wrote:
>
> Hi again,
>
> New changes in behaviour: both master nodes that are active toggle to faulty, and the logs repeat the same over and over again.
>
> Part of log, node1:
>
> [2018-08-13 18:24:44.701711] I [gsyncdstatus(worker /urd-gds/gluster):276:set_active] GeorepStatus: Worker Status Change status=Active
> [2018-08-13 18:24:44.704360] I [gsyncdstatus(worker /urd-gds/gluster):248:set_worker_crawl_status] GeorepStatus: Crawl Status Change status=History Crawl
> [2018-08-13 18:24:44.705162] I [master(worker /urd-gds/gluster):1448:crawl] _GMaster: starting history crawl turns=1 stime=(1523907056, 0) entry_stime=None etime=1534184684
> [2018-08-13 18:24:45.717072] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time stime=(1523907056, 0)
> [2018-08-13 18:24:45.904958] E [repce(worker /urd-gds/gluster):197:__call__] RepceClient: call failed call=5919:140339726538560:1534184685.88 method=entry_ops error=GsyncdError
> [2018-08-13 18:24:45.905111] E [syncdutils(worker /urd-gds/gluster):298:log_raise_exception] <top>: execution of "gluster" failed with ENOENT (No such file or directory)
> [2018-08-13 18:24:45.919265] I [repce(agent /urd-gds/gluster):80:service_loop] RepceServer: terminating on reaching EOF.
> [2018-08-13 18:24:46.553194] I [monitor(monitor):272:monitor] Monitor: worker died in startup phase brick=/urd-gds/gluster
> [2018-08-13 18:24:46.561784] I [gsyncdstatus(monitor):243:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
> [2018-08-13 18:24:56.581748] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker brick=/urd-gds/gluster slave_node=urd-gds-geo-000
> [2018-08-13 18:24:56.655164] I [gsyncd(worker /urd-gds/gluster):297:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf
> [2018-08-13 18:24:56.655193] I [gsyncd(agent /urd-gds/gluster):297:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf
> [2018-08-13 18:24:56.655889] I [changelogagent(agent /urd-gds/gluster):72:__init__] ChangelogAgent: Agent listining...
> [2018-08-13 18:24:56.664628] I [resource(worker /urd-gds/gluster):1348:connect_remote] SSH: Initializing SSH connection between master and slave...
> [2018-08-13 18:24:58.347415] I [resource(worker /urd-gds/gluster):1395:connect_remote] SSH: SSH connection between master and slave established. duration=1.6824
> [2018-08-13 18:24:58.348151] I [resource(worker /urd-gds/gluster):1067:connect] GLUSTER: Mounting gluster volume locally...
> [2018-08-13 18:24:59.463598] I [resource(worker /urd-gds/gluster):1090:connect] GLUSTER: Mounted gluster volume duration=1.1150
> [2018-08-13 18:24:59.464184] I [subcmds(worker /urd-gds/gluster):70:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
> [2018-08-13 18:25:01.549007] I [master(worker /urd-gds/gluster):1534:register] _GMaster: Working dir path=/var/lib/misc/gluster/gsyncd/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/urd-gds-gluster
> [2018-08-13 18:25:01.549606] I [resource(worker /urd-gds/gluster):1253:service_loop] GLUSTER: Register time time=1534184701
> [2018-08-13 18:25:01.593946] I [gsyncdstatus(worker /urd-gds/gluster):276:set_active] GeorepStatus: Worker Status Change status=Active
>
> Part of log, node2:
>
> [2018-08-13 18:25:14.554233] I [gsyncdstatus(monitor):243:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
> [2018-08-13 18:25:24.568727] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker brick=/urd-gds/gluster slave_node=urd-gds-geo-000
> [2018-08-13 18:25:24.609642] I [gsyncd(agent /urd-gds/gluster):297:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf
> [2018-08-13 18:25:24.609678] I [gsyncd(worker /urd-gds/gluster):297:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf
> [2018-08-13 18:25:24.610362] I [changelogagent(agent /urd-gds/gluster):72:__init__] ChangelogAgent: Agent listining...
> [2018-08-13 18:25:24.621551] I [resource(worker /urd-gds/gluster):1348:connect_remote] SSH: Initializing SSH connection between master and slave...
> [2018-08-13 18:25:26.164855] I [resource(worker /urd-gds/gluster):1395:connect_remote] SSH: SSH connection between master and slave established. duration=1.5431
> [2018-08-13 18:25:26.165124] I [resource(worker /urd-gds/gluster):1067:connect] GLUSTER: Mounting gluster volume locally...
> [2018-08-13 18:25:27.331969] I [resource(worker /urd-gds/gluster):1090:connect] GLUSTER: Mounted gluster volume duration=1.1667
> [2018-08-13 18:25:27.335560] I [subcmds(worker /urd-gds/gluster):70:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
> [2018-08-13 18:25:37.768867] I [master(worker /urd-gds/gluster):1534:register] _GMaster: Working dir path=/var/lib/misc/gluster/gsyncd/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/urd-gds-gluster
> [2018-08-13 18:25:37.769479] I [resource(worker /urd-gds/gluster):1253:service_loop] GLUSTER: Register time time=1534184737
> [2018-08-13 18:25:37.787317] I [gsyncdstatus(worker /urd-gds/gluster):276:set_active] GeorepStatus: Worker Status Change status=Active
> [2018-08-13 18:25:37.789822] I [gsyncdstatus(worker /urd-gds/gluster):248:set_worker_crawl_status] GeorepStatus: Crawl Status Change status=History Crawl
> [2018-08-13 18:25:37.790008] I [master(worker /urd-gds/gluster):1448:crawl] _GMaster: starting history crawl turns=1 stime=(1525290650, 0) entry_stime=(1525296245, 0) etime=1534184737
> [2018-08-13 18:25:37.791222] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time stime=(1525290650, 0)
> [2018-08-13 18:25:38.63499] I [master(worker /urd-gds/gluster):1301:process] _GMaster: Skipping already processed entry ops to_changelog=1525290651 num_changelogs=1 from_changelog=1525290651
> [2018-08-13 18:25:38.63621] I [master(worker /urd-gds/gluster):1315:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=0 duration=0.0000 UNL=0
> [2018-08-13 18:25:38.63678] I [master(worker /urd-gds/gluster):1325:process] _GMaster: Data/Metadata Time Taken SETA=1 SETX=0 meta_duration=0.0228 data_duration=0.2456 DATA=0 XATT=0
> [2018-08-13 18:25:38.63822] I [master(worker /urd-gds/gluster):1335:process] _GMaster: Batch Completed changelog_end=1525290651 entry_stime=(1525296245, 0) changelog_start=1525290651 stime=(1525290650, 0) duration=0.2723 num_changelogs=1 mode=history_changelog
> [2018-08-13 18:25:38.73400] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time stime=(1525290650, 0)
> [2018-08-13 18:25:38.480941] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.1327 num_files=3 job=3 return_code=23
> [2018-08-13 18:25:39.963423] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.1133 num_files=8 job=1 return_code=23
> [2018-08-13 18:25:39.980724] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.6315 num_files=47 job=2 return_code=23
>
> ...............
>
> [2018-08-13 18:26:04.534953] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.0988 num_files=18 job=2 return_code=23
> [2018-08-13 18:26:07.798583] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.2600 num_files=27 job=2 return_code=23
> [2018-08-13 18:26:08.708100] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.4090 num_files=67 job=2 return_code=23
> [2018-08-13 18:26:14.865883] E [repce(worker /urd-gds/gluster):197:__call__] RepceClient: call failed call=18662:140079998809920:1534184774.58 method=entry_ops error=GsyncdError
> [2018-08-13 18:26:14.866166] E [syncdutils(worker /urd-gds/gluster):298:log_raise_exception] <top>: execution of "gluster" failed with ENOENT (No such file or directory)
> [2018-08-13 18:26:14.991022] I [repce(agent /urd-gds/gluster):80:service_loop] RepceServer: terminating on reaching EOF.
> [2018-08-13 18:26:15.384844] I [monitor(monitor):272:monitor] Monitor: worker died in startup phase brick=/urd-gds/gluster
> [2018-08-13 18:26:15.397360] I [gsyncdstatus(monitor):243:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
>
> Help would be appreciated!
>
> Thanks!
>
> Regards
> Marcus Pedersén
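Two details stand out in these logs. return_code=23 is rsync's "partial transfer due to error" exit code, and each worker death follows an entry_ops failure where executing "gluster" hit ENOENT, i.e. gsyncd could not find the gluster binary on one side of the session. A sketch of what could be checked, assuming this 4.1 build supports the gluster-command-dir and slave-gluster-command-dir session options (verify with the bare config command first):

    # where does the gluster CLI live on the master and slave nodes?
    which gluster    # typically /usr/sbin/gluster on CentOS 7
    # inspect the session config for the command paths gsyncd will use
    gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume config | grep -i command
    # if wrong or unset, point the workers at the right directory
    gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume config gluster-command-dir /usr/sbin/
    gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume config slave-gluster-command-dir /usr/sbin/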
> ________________________________
> From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Marcus Pedersén <marcus.pedersen at slu.se>
> Sent: 12 August 2018 22:18
> To: khiremat at redhat.com
> Cc: gluster-users at gluster.org
> Subject: Re: [Gluster-users] Geo-replication stops after 4-5 hours
>
> Hi,
>
> As the geo-replication stopped after 4-5 hours, I added a cron job that stopped geo-replication, paused for 2 mins and started it again every 6 hours.
>
> The cron job has been running for 5 days and the changelogs have been catching up.
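A workaround cron entry of that shape might look like the sketch below (the stop/start syntax and session name are the ones used elsewhere in this thread; the 2-minute pause is a plain sleep):

    # /etc/cron.d/georep-restart (sketch): every 6 hours stop, wait 2 min, start
    0 */6 * * * root gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume stop && sleep 120 && gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume start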
> Now a different behavior has shown up.
>
> In one of the active master nodes I get a python error.
>
> The other active master node has started to toggle status between active and faulty.
>
> See parts of logs below.
>
> When I read Troubleshooting Geo-replication, there is a suggestion, when sync is not complete, to enforce a full sync of the data by erasing the index and restarting GlusterFS geo-replication.
>
> There is no explanation of how to erase the index.
>
> Should I enforce a full sync?
>
> How do I erase the index?
>
> Thanks a lot!
>
> Best regards
> Marcus Pedersén
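For reference, the 4.x CLI has a way to force a full resync without erasing anything by hand: deleting the session with reset-sync-time clears the stored sync time, so a recreated session starts from the beginning. A sketch only; this re-crawls and re-compares the entire volume, so check the release documentation before running it:

    gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume stop
    gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume delete reset-sync-time
    # recreate the session; push-pem assumes the passwordless SSH setup is still in place
    gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume create push-pem
    gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume start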
> Node with python error:
>
> [2018-08-12 16:02:05.304924] I [resource(worker /urd-gds/gluster):1348:connect_remote] SSH: Initializing SSH connection between master and slave...
> [2018-08-12 16:02:06.842832] I [resource(worker /urd-gds/gluster):1395:connect_remote] SSH: SSH connection between master and slave established. duration=1.5376
> [2018-08-12 16:02:06.843370] I [resource(worker /urd-gds/gluster):1067:connect] GLUSTER: Mounting gluster volume locally...
> [2018-08-12 16:02:07.930706] I [resource(worker /urd-gds/gluster):1090:connect] GLUSTER: Mounted gluster volume duration=1.0869
> [2018-08-12 16:02:07.931536] I [subcmds(worker /urd-gds/gluster):70:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
> [2018-08-12 16:02:20.759797] I [master(worker /urd-gds/gluster):1534:register] _GMaster: Working dir path=/var/lib/misc/gluster/gsyncd/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/urd-gds-gluster
> [2018-08-12 16:02:20.760411] I [resource(worker /urd-gds/gluster):1253:service_loop] GLUSTER: Register time time=1534089740
> [2018-08-12 16:02:20.831918] I [gsyncdstatus(worker /urd-gds/gluster):276:set_active] GeorepStatus: Worker Status Change status=Active
> [2018-08-12 16:02:20.835541] I [gsyncdstatus(worker /urd-gds/gluster):248:set_worker_crawl_status] GeorepStatus: Crawl Status Change status=History Crawl
> [2018-08-12 16:02:20.836832] I [master(worker /urd-gds/gluster):1448:crawl] _GMaster: starting history crawl turns=1 stime=(1523906126, 0) entry_stime=None etime=1534089740
> [2018-08-12 16:02:21.848570] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time stime=(1523906126, 0)
> [2018-08-12 16:02:21.950453] E [syncdutils(worker /urd-gds/gluster):330:log_raise_exception] <top>: FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 360, in twrap
>     tf(*aargs)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1880, in syncjob
>     po = self.sync_engine(pb, self.log_err)
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1413, in rsync
>     rconf.ssh_ctl_args + \
> AttributeError: 'NoneType' object has no attribute 'split'
> [2018-08-12 16:02:21.975228] I [repce(agent /urd-gds/gluster):80:service_loop] RepceServer: terminating on reaching EOF.
> [2018-08-12 16:02:22.947170] I [monitor(monitor):272:monitor] Monitor: worker died in startup phase brick=/urd-gds/gluster
> [2018-08-12 16:02:22.954096] I [gsyncdstatus(monitor):243:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
> [2018-08-12 16:02:32.973948] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker brick=/urd-gds/gluster slave_node=urd-gds-geo-000
> [2018-08-12 16:02:33.16155] I [gsyncd(agent /urd-gds/gluster):297:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf
> [2018-08-12 16:02:33.16882] I [changelogagent(agent /urd-gds/gluster):72:__init__] ChangelogAgent: Agent listining...
> [2018-08-12 16:02:33.17292] I [gsyncd(worker /urd-gds/gluster):297:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf
> [2018-08-12 16:02:33.26951] I [resource(worker /urd-gds/gluster):1348:connect_remote] SSH: Initializing SSH connection between master and slave...
> [2018-08-12 16:02:34.642838] I [resource(worker /urd-gds/gluster):1395:connect_remote] SSH: SSH connection between master and slave established. duration=1.6156
> [2018-08-12 16:02:34.643369] I [resource(worker /urd-gds/gluster):1067:connect] GLUSTER: Mounting gluster volume locally...
>
> Node that toggles status between active and faulty:
>
> [2018-08-12 19:33:03.475833] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.2757 num_files=27 job=2 return_code=23
> [2018-08-12 19:33:04.818854] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.3767 num_files=67 job=1 return_code=23
> [2018-08-12 19:33:09.926820] E [repce(worker /urd-gds/gluster):197:__call__] RepceClient: call failed call=14853:139697829693248:1534102389.64 method=entry_ops error=GsyncdError
> [2018-08-12 19:33:09.927042] E [syncdutils(worker /urd-gds/gluster):298:log_raise_exception] <top>: execution of "gluster" failed with ENOENT (No such file or directory)
> [2018-08-12 19:33:09.942267] I [repce(agent /urd-gds/gluster):80:service_loop] RepceServer: terminating on reaching EOF.
> [2018-08-12 19:33:10.349848] I [monitor(monitor):272:monitor] Monitor: worker died in startup phase brick=/urd-gds/gluster
> [2018-08-12 19:33:10.363173] I [gsyncdstatus(monitor):243:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
> [2018-08-12 19:33:20.386089] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker brick=/urd-gds/gluster slave_node=urd-gds-geo-000
> [2018-08-12 19:33:20.456687] I [gsyncd(agent /urd-gds/gluster):297:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf
> [2018-08-12 19:33:20.456686] I [gsyncd(worker /urd-gds/gluster):297:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.conf
> [2018-08-12 19:33:20.457559] I [changelogagent(agent /urd-gds/gluster):72:__init__] ChangelogAgent: Agent listining...
> [2018-08-12 19:33:20.511825] I [resource(worker /urd-gds/gluster):1348:connect_remote] SSH: Initializing SSH connection between master and slave...
> [2018-08-12 19:33:22.88713] I [resource(worker /urd-gds/gluster):1395:connect_remote] SSH: SSH connection between master and slave established. duration=1.5766
> [2018-08-12 19:33:22.89272] I [resource(worker /urd-gds/gluster):1067:connect] GLUSTER: Mounting gluster volume locally...
> [2018-08-12 19:33:23.179249] I [resource(worker /urd-gds/gluster):1090:connect] GLUSTER: Mounted gluster volume duration=1.0896
> [2018-08-12 19:33:23.179805] I [subcmds(worker /urd-gds/gluster):70:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
> [2018-08-12 19:33:35.245277] I [master(worker /urd-gds/gluster):1534:register] _GMaster: Working dir path=/var/lib/misc/gluster/gsyncd/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/urd-gds-gluster
> [2018-08-12 19:33:35.246495] I [resource(worker /urd-gds/gluster):1253:service_loop] GLUSTER: Register time time=1534102415
> [2018-08-12 19:33:35.321988] I [gsyncdstatus(worker /urd-gds/gluster):276:set_active] GeorepStatus: Worker Status Change status=Active
> [2018-08-12 19:33:35.324270] I [gsyncdstatus(worker /urd-gds/gluster):248:set_worker_crawl_status] GeorepStatus: Crawl Status Change status=History Crawl
> [2018-08-12 19:33:35.324902] I [master(worker /urd-gds/gluster):1448:crawl] _GMaster: starting history crawl turns=1 stime=(1525290650, 0) entry_stime=(1525296245, 0) etime=1534102415
> [2018-08-12 19:33:35.328735] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time stime=(1525290650, 0)
> [2018-08-12 19:33:35.574338] I [master(worker /urd-gds/gluster):1301:process] _GMaster: Skipping already processed entry ops to_changelog=1525290651 num_changelogs=1 from_changelog=1525290651
> [2018-08-12 19:33:35.574448] I [master(worker /urd-gds/gluster):1315:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=0 duration=0.0000 UNL=0
> [2018-08-12 19:33:35.574507] I [master(worker /urd-gds/gluster):1325:process] _GMaster: Data/Metadata Time Taken SETA=1 SETX=0 meta_duration=0.0249 data_duration=0.2156 DATA=0 XATT=0
> [2018-08-12 19:33:35.574723] I [master(worker /urd-gds/gluster):1335:process] _GMaster: Batch Completed changelog_end=1525290651 entry_stime=(1525296245, 0) changelog_start=1525290651 stime=(1525290650, 0) duration=0.2455 num_changelogs=1 mode=history_changelog
> [2018-08-12 19:33:35.582545] I [master(worker /urd-gds/gluster):1477:crawl] _GMaster: slave's time stime=(1525290650, 0)
> [2018-08-12 19:33:35.780823] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.0847 num_files=3 job=2 return_code=23
> [2018-08-12 19:33:37.362822] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.0807 num_files=4 job=2 return_code=23
> [2018-08-12 19:33:37.818542] I [master(worker /urd-gds/gluster):1885:syncjob] Syncer: Sync Time Taken duration=0.1098 num_files=11 job=1 return_code=23
>
> ________________________________
> From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Marcus Pedersén <marcus.pedersen at slu.se>
> Sent: 6 August 2018 13:28
> To: khiremat at redhat.com
> Cc: gluster-users at gluster.org
> Subject: Re: [Gluster-users] Geo-replication stops after 4-5 hours
>
> Hi,
>
> Is there a way to resolve the problem with rsync and hanging processes?
>
> Do I need to kill all the processes and hope that it starts again, or stop/start geo-replication?
>
> If I stop/start geo-replication it will start again, I have tried it before.
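A quick way to look for hung rsync/gsyncd processes before killing anything (a sketch; the strace step is the same one suggested further down in this thread):

    # list geo-replication worker processes on master and slave nodes
    ps aux | grep -E '[r]sync|[g]syncd'
    # STAT column: D (uninterruptible sleep) usually points at a stuck mount or network
    ps -o pid,stat,wchan=,cmd -C rsync
    # attach to a suspect PID; a long silence with no syscalls suggests a hang
    strace -f -p <PID>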
> Regards
> Marcus
>
> ________________________________
> From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Marcus Pedersén <marcus.pedersen at slu.se>
> Sent: 2 August 2018 10:04
> To: Kotresh Hiremath Ravishankar
> Cc: gluster-users at gluster.org
> Subject: Re: [Gluster-users] Geo-replication stops after 4-5 hours
>
> Hi Kotresh,
>
> I get the following and then it hangs:
>
> strace: Process 5921 attached
> write(2, "rsync: link_stat \"/tmp/gsyncd-au"..., 12811
>
> When sync is running I can see rsync with geouser on the slave node.
>
> Regards
> Marcus
>
> On 2 Aug 2018 09:31, Kotresh Hiremath Ravishankar <khiremat at redhat.com> wrote:
>
> Cool, just check whether they are hung by any chance with the following command.
>
> # strace -f -p 5921
>
> On Thu, Aug 2, 2018 at 12:25 PM, Marcus Pedersén <marcus.pedersen at slu.se> wrote:
>
> On both active master nodes there is an rsync process. As in:
>
> root 5921 0.0 0.0 115424 1176 ? S Aug01 0:00 rsync -aR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs --xattrs --acls . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-stuphs/bf60c68f1a195dad59573a8dbaa309f2.sock geouser at urd-gds-geo-001:/proc/13077/cwd
>
> There are also ssh tunnels to slave nodes and gsyncd.py processes.
>
> Regards
> Marcus
>
> On 2 Aug 2018 08:07, Kotresh Hiremath Ravishankar <khiremat at redhat.com> wrote:
>
> Could you look for any rsync processes hung in master or slave?
>
> On Thu, Aug 2, 2018 at 11:18 AM, Marcus Pedersén <marcus.pedersen at slu.se> wrote:
>
> Hi Kotresh,
>
> rsync version 3.1.2 protocol version 31
> All nodes run CentOS 7, updated the last couple of days.
>
> Thanks
> Marcus
>
> On 2 Aug 2018 06:13, Kotresh Hiremath Ravishankar <khiremat at redhat.com> wrote:
>
> Hi Marcus,
>
> What's the rsync version being used?
>
> Thanks,
> Kotresh HR
>
> On Thu, Aug 2, 2018 at 1:48 AM, Marcus Pedersén <marcus.pedersen at slu.se> wrote:
>
> Hi all!
>
> I upgraded from 3.12.9 to 4.1.1 and had problems with geo-replication.
>
> With help from the list with some symlinks and so on (handled in another thread) I got the geo-replication running.
>
> It ran for 4-5 hours and then stopped; I stopped and started geo-replication and it ran for another 4-5 hours.
>
> 4.1.2 was released and I updated, hoping this would solve the problem.
>
> I still have the same problem: at start it runs for 4-5 hours and then it stops.
>
> After that nothing happens; I have waited for days but still nothing happens.
>
> I have looked through logs but cannot find anything obvious.
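One way to scan the worker logs instead of reading them end to end (a sketch; the master-side log path is assumed to mirror the session directory name visible in the logs above):

    LOG=/var/log/glusterfs/geo-replication/urd-gds-volume_urd-gds-geo-001_urd-gds-volume/gsyncd.log
    # last error-level lines
    grep ' E \[' "$LOG" | tail -n 20
    # distribution of rsync return codes (23 = partial transfer due to error)
    grep -o 'return_code=[0-9]*' "$LOG" | sort | uniq -c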
> Status for geo-replication is active for the two same nodes all the time:
>
> MASTER NODE    MASTER VOL        MASTER BRICK         SLAVE USER    SLAVE                                       SLAVE NODE         STATUS     CRAWL STATUS     LAST_SYNCED            ENTRY    DATA     META    FAILURES    CHECKPOINT TIME        CHECKPOINT COMPLETED    CHECKPOINT COMPLETION TIME
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> urd-gds-001    urd-gds-volume    /urd-gds/gluster     geouser       geouser at urd-gds-geo-001::urd-gds-volume    urd-gds-geo-000    Active     History Crawl    2018-04-16 20:32:09    0        14205    0       0           2018-07-27 21:12:44    No                      N/A
> urd-gds-002    urd-gds-volume    /urd-gds/gluster     geouser       geouser at urd-gds-geo-001::urd-gds-volume    urd-gds-geo-002    Passive    N/A              N/A                    N/A      N/A      N/A     N/A         N/A                    N/A                     N/A
> urd-gds-004    urd-gds-volume    /urd-gds/gluster     geouser       geouser at urd-gds-geo-001::urd-gds-volume    urd-gds-geo-002    Passive    N/A              N/A                    N/A      N/A      N/A     N/A         N/A                    N/A                     N/A
> urd-gds-003    urd-gds-volume    /urd-gds/gluster     geouser       geouser at urd-gds-geo-001::urd-gds-volume    urd-gds-geo-000    Active     History Crawl    2018-05-01 20:58:14    285      4552     0       0           2018-07-27 21:12:44    No                      N/A
> urd-gds-000    urd-gds-volume    /urd-gds/gluster1    geouser       geouser at urd-gds-geo-001::urd-gds-volume    urd-gds-geo-001    Passive    N/A              N/A                    N/A      N/A      N/A     N/A         N/A                    N/A                     N/A
> urd-gds-000    urd-gds-volume    /urd-gds/gluster2    geouser       geouser at urd-gds-geo-001::urd-gds-volume    urd-gds-geo-001    Passive    N/A              N/A                    N/A      N/A      N/A     N/A         N/A                    N/A                     N/A
>
> Master cluster is Distribute-Replicate
> 2 x (2 + 1)
> Used space 30TB
>
> Slave cluster is Replicate
> 1 x (2 + 1)
> Used space 9TB
>
> Parts from gsyncd.logs are enclosed.
>
> Thanks a lot!
>
> Best regards
> Marcus Pedersén
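Whether anything is moving at all can be watched from the status counters; status detail adds per-brick entry/data/meta queue numbers to the table above (same session name as in the status output):

    watch -n 60 'gluster volume geo-replication urd-gds-volume geouser@urd-gds-geo-001::urd-gds-volume status detail'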
>
> --
> Thanks and Regards,
> Kotresh H R
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
Marcus Pedersén
2018-Aug-13 20:45 UTC
[Gluster-users] Geo-replication stops after 4-5 hours
Hi Sunny,

Please find the enclosed mount logs for the two active master nodes.
I cut them down to today's logs.

Thanks!

Marcus
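Since gsyncd log lines begin with a UTC date stamp, a day's worth can be cut out of a mount log with a plain grep; a sketch, with the source file name inferred from the attachment names at the end of this message:

    grep '^\[2018-08-13' mnt-urd-gds-001-urd-gds-gluster.log | gzip > today_mnt-urd-gds-001-urd-gds-gluster.log.gz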
________________________________________
From: Sunny Kumar <sunkumar at redhat.com>
Sent: 13 August 2018 21:49
To: Marcus Pedersén
Cc: Kotresh Hiremath Ravishankar; gluster-users at gluster.org
Subject: Re: [Gluster-users] Geo-replication stops after 4-5 hours

Hi Marcus,

Can you please share mount log from slave (You can find it at
"/var/log/glusterfs/geo-replication-slaves/<mastervol>hostname<slavevol>/mnt____.log").

- Sunny
---
E-mailing SLU will result in SLU processing your personal data. For more information on how this is done, click here <https://www.slu.se/en/about-slu/contact-slu/personal-data/>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: today_mnt-urd-gds-001-urd-gds-gluster.log.gz
Type: application/gzip
Size: 1154878 bytes
Desc: today_mnt-urd-gds-001-urd-gds-gluster.log.gz
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20180813/c38ab5de/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: today_mnt-urd-gds-003-urd-gds-gluster.log.gz
Type: application/gzip
Size: 533166 bytes
Desc: today_mnt-urd-gds-003-urd-gds-gluster.log.gz
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20180813/c38ab5de/attachment-0003.bin>