Hey Rob,

same issue for our third volume. Have a look at the logs from just now (below).

Question: you removed the htime files and the old changelogs. Did you just rm the files, or is there anything else to pay attention to before removing the changelog files and the htime file?

Regards,

Felix

[2020-06-25 07:51:53.795430] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH: SSH connection between master and slave established.    duration=1.2341
[2020-06-25 07:51:53.795639] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
[2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase    brick=/gluster/vg01/dispersed_fuse1024/brick
[2020-06-25 07:51:54.535809] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change    status=Faulty
[2020-06-25 07:51:54.882143] I [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER: Mounted gluster volume    duration=1.0864
[2020-06-25 07:51:54.882388] I [subcmds(worker /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2020-06-25 07:51:56.911412] E [repce(agent /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2020-06-25 07:51:56.912056] E [repce(worker /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient: call failed    call=75086:140098349655872:1593071514.91    method=register    error=ChangelogException
[2020-06-25 07:51:56.912396] E [resource(worker /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop] GLUSTER: Changelog register failed    error=[Errno 2] No such file or directory
[2020-06-25 07:51:56.928031] I [repce(agent /gluster/vg00/dispersed_fuse1024/brick):96:service_loop] RepceServer: terminating on reaching EOF.
[2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase    brick=/gluster/vg00/dispersed_fuse1024/brick
[2020-06-25 07:51:57.895920] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change    status=Faulty
[2020-06-25 07:51:58.607405] I [gsyncdstatus(worker /gluster/vg00/dispersed_fuse1024/brick):287:set_passive] GeorepStatus: Worker Status Change    status=Passive
[2020-06-25 07:51:58.607768] I [gsyncdstatus(worker /gluster/vg01/dispersed_fuse1024/brick):287:set_passive] GeorepStatus: Worker Status Change    status=Passive
[2020-06-25 07:51:58.608004] I [gsyncdstatus(worker /gluster/vg00/dispersed_fuse1024/brick):281:set_active] GeorepStatus: Worker Status Change    status=Active

On 25/06/2020 09:15, Rob.Quagliozzi at rabobank.com wrote:
>
> Hi All,
>
> We've got two six node RHEL 7.8 clusters and geo-replication would
> appear to be completely broken between them. I've deleted the session,
> removed & recreated pem files, old changelogs/htime (after removing
> relevant options from volume) and completely set up geo-rep from
> scratch, but the new session comes up as Initializing, then goes
> faulty, and starts looping. Volume (on both sides) is a 4 x 2
> disperse, running Gluster v6 (RH latest).
> Gsyncd reports:
>
> [2020-06-25 07:07:14.701423] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change    status=Initializing...
> [2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor] Monitor: starting gsyncd worker    brick=/rhgs/brick20/brick    slave_node=bxts470194.eu.rabonet.com
> [2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor] Monitor: Worker would mount volume privately
> [2020-06-25 07:07:14.757181] I [gsyncd(agent /rhgs/brick20/brick):318:main] <top>: Using session config file    path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
> [2020-06-25 07:07:14.758126] D [subcmds(agent /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD    rpc_fd='5,12,11,10'
> [2020-06-25 07:07:14.758627] I [changelogagent(agent /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...
> [2020-06-25 07:07:14.764234] I [gsyncd(worker /rhgs/brick20/brick):318:main] <top>: Using session config file    path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
> [2020-06-25 07:07:14.779409] I [resource(worker /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH connection between master and slave...
> [2020-06-25 07:07:14.841793] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__() ...
> [2020-06-25 07:07:16.148725] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0
> [2020-06-25 07:07:16.148911] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068836.15 version() ...
> [2020-06-25 07:07:16.149574] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068836.15 version -> 1.0
> [2020-06-25 07:07:16.149735] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068836.15 pid() ...
> [2020-06-25 07:07:16.150588] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068836.15 pid -> 30703
> [2020-06-25 07:07:16.150747] I [resource(worker /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection between master and slave established.    duration=1.3712
> [2020-06-25 07:07:16.150819] I [resource(worker /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
> [2020-06-25 07:07:16.265860] D [resource(worker /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary glusterfs mount in place
> [2020-06-25 07:07:17.272511] D [resource(worker /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary glusterfs mount prepared
> [2020-06-25 07:07:17.272708] I [resource(worker /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster volume    duration=1.1218
> [2020-06-25 07:07:17.272794] I [subcmds(worker /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
> [2020-06-25 07:07:17.272973] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode    mode=xsync
> [2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor] Monitor: worker(/rhgs/brick20/brick) connected
> [2020-06-25 07:07:17.273678] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode    mode=changelog
> [2020-06-25 07:07:17.274224] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode    mode=changeloghistory
> [2020-06-25 07:07:17.276484] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 version() ...
> [2020-06-25 07:07:17.276916] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 version -> 1.0
> [2020-06-25 07:07:17.277009] D [master(worker /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
> [2020-06-25 07:07:17.277098] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 init() ...
> [2020-06-25 07:07:17.292944] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 init -> None
> [2020-06-25 07:07:17.293097] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.29 register('/rhgs/brick20/brick', '/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick', '/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log', 8, 5) ...
> [2020-06-25 07:07:19.296294] E [repce(agent /rhgs/brick20/brick):121:worker] <top>: call failed:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
>     res = getattr(self.obj, rmeth)(*in_data[2:])
>   File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
>     return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
>     cls.raise_changelog_err()
>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
>     raise ChangelogException(errn, os.strerror(errn))
> ChangelogException: [Errno 2] No such file or directory
> [2020-06-25 07:07:19.297161] E [repce(worker /rhgs/brick20/brick):213:__call__] RepceClient: call failed    call=6799:140380783982400:1593068837.29    method=register    error=ChangelogException
> [2020-06-25 07:07:19.297338] E [resource(worker /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog register failed    error=[Errno 2] No such file or directory
> [2020-06-25 07:07:19.315074] I [repce(agent /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on reaching EOF.
> [2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase    brick=/rhgs/brick20/brick
> [2020-06-25 07:07:20.277383] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change    status=Faulty
>
> We've done everything we can think of, including an 'strace -f' on the
> pid, and we can't really find anything. I'm about to lose the last of
> my hair over this, so does anyone have any ideas at all? We've even
> removed the entire slave vol and rebuilt it.
>
> Thanks
>
> Rob
>
> *Rob Quagliozzi*
>
> *Specialised Application Support*
>
> ------------------------------------------------------------------------
> This email (including any attachments to it) is confidential, legally
> privileged, subject to copyright and is sent for the personal
> attention of the intended recipient only. If you have received this
> email in error, please advise us immediately and delete it. You are
> notified that disclosing, copying, distributing or taking any action
> in reliance on the contents of this information is strictly
> prohibited. Although we have taken reasonable precautions to ensure no
> viruses are present in this email, we cannot accept responsibility for
> any loss or damage arising from the viruses in this email or
> attachments. We exclude any liability for the content of this email,
> or for the consequences of any actions taken on the basis of the
> information provided in this email or its attachments, unless that
> information is subsequently confirmed in writing.
> ------------------------------------------------------------------------

________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
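Felix's question (is a bare rm enough?) usually comes down to ordering: nothing should be producing or consuming changelogs while they are deleted, and the changelog files and the htime index should go together, since a dangling htime reference is one plausible way to end up with exactly this [Errno 2] at register time. A rough sketch of the usual sequence, with placeholder volume/slave/brick names that are assumptions, not values taken from this thread:

```shell
# Placeholder names -- substitute your own volume, slave URL and brick path.
MASTERVOL=mastervol
SLAVE=slavehost::slavevol
BRICK=/gluster/vg00/dispersed_fuse1024/brick

# 1. Stop the geo-rep session so no gsyncd worker is reading changelogs.
gluster volume geo-replication "$MASTERVOL" "$SLAVE" stop

# 2. Turn off changelog generation on the master volume.
gluster volume set "$MASTERVOL" changelog.changelog off

# 3. Only now remove the changelog files AND the htime index together;
#    removing one but not the other leaves a stale reference behind.
rm -f "$BRICK"/.glusterfs/changelogs/CHANGELOG.*
rm -f "$BRICK"/.glusterfs/changelogs/htime/HTIME.*

# 4. Re-enable changelogs and restart the session.
gluster volume set "$MASTERVOL" changelog.changelog on
gluster volume geo-replication "$MASTERVOL" "$SLAVE" start
```

This is a sketch of the conventional procedure, not a verified fix for the fault in this thread; repeat step 3 on every brick of the volume.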
Hi Rob and Felix,

Please share the *-changes.log files and brick logs, which will help in analysis of the issue.

Regards,
Shwetha

On Thu, Jun 25, 2020 at 1:26 PM Felix Kölzow <felix.koelzow at gmx.de> wrote:
> [quoted message and logs snipped]
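For what it's worth, the ChangelogException in both tracebacks is just an errno dressed up by the syncdaemon: libgfchangelog.py wraps whatever errno the changelog library returned, so "[Errno 2]" is a plain ENOENT from a path the changelog translator tried to open. A minimal sketch of that wrapping pattern (the ChangelogException class here is a simplified stand-in; the real one lives in /usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py):

```python
import errno
import os

class ChangelogException(OSError):
    """Simplified stand-in for the exception defined in libgfchangelog.py."""

def raise_changelog_err(errn):
    # Same pattern as libgfchangelog's raise_changelog_err: pair the errno
    # with its human-readable message from the C library.
    raise ChangelogException(errn, os.strerror(errn))

try:
    raise_changelog_err(errno.ENOENT)  # errno 2, the value seen in the logs
except ChangelogException as exc:
    # OSError's __str__ produces the "[Errno 2] No such file or directory"
    # text that appears verbatim in the gsyncd log.
    print(exc)
```

So the worker is not hitting a Python bug; something on disk that cl_register expects (plausibly the htime file or the changelog directory under the brick's .glusterfs/changelogs) is missing at register time.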
Dear Users, the geo-replication is still broken. This is not really a comfortable situation. Does any user has had the same experience and is able to share a possible workaround? We are actually running gluster v6.0 Regards, Felix On 25/06/2020 10:04, Shwetha Acharya wrote:> Hi Rob and Felix, > > Please share the *-changes.log files and brick logs, which will help > in analysis of the issue. > > Regards, > Shwetha > > On Thu, Jun 25, 2020 at 1:26 PM Felix K?lzow <felix.koelzow at gmx.de > <mailto:felix.koelzow at gmx.de>> wrote: > > Hey Rob, > > > same issue for our third volume. Have a look at the logs just from > right now (below). > > Question: You removed the htime files and the old changelogs. Just > rm the files or is there something to pay more attention > > before removing the changelog files and the htime file. > > Regards, > > Felix > > [2020-06-25 07:51:53.795430] I [resource(worker > /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH: > SSH connection between master and slave established.??? > duration=1.2341 > [2020-06-25 07:51:53.795639] I [resource(worker > /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER: > Mounting gluster volume locally... > [2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor] > Monitor: worker died in startup phase > brick=/gluster/vg01/dispersed_fuse1024/brick > [2020-06-25 07:51:54.535809] I > [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker > Status Change??? status=Faulty > [2020-06-25 07:51:54.882143] I [resource(worker > /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER: > Mounted gluster volume??? duration=1.0864 > [2020-06-25 07:51:54.882388] I [subcmds(worker > /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>: > Worker spawn successful. Acknowledging back to monitor > [2020-06-25 07:51:56.911412] E [repce(agent > /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call > failed: > Traceback (most recent call last): > ? 
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line > 117, in worker > ??? res = getattr(self.obj, rmeth)(*in_data[2:]) > ? File > "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line > 40, in register > ??? return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, > retries) > ? File > "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line > 46, in cl_register > ??? cls.raise_changelog_err() > ? File > "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line > 30, in raise_changelog_err > ??? raise ChangelogException(errn, os.strerror(errn)) > ChangelogException: [Errno 2] No such file or directory > [2020-06-25 07:51:56.912056] E [repce(worker > /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient: > call failed call=75086:140098349655872:1593071514.91 > method=register??? error=ChangelogException > [2020-06-25 07:51:56.912396] E [resource(worker > /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop] > GLUSTER: Changelog register failed??? error=[Errno 2] No such file > or directory > [2020-06-25 07:51:56.928031] I [repce(agent > /gluster/vg00/dispersed_fuse1024/brick):96:service_loop] > RepceServer: terminating on reaching EOF. > [2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor] > Monitor: worker died in startup phase > brick=/gluster/vg00/dispersed_fuse1024/brick > [2020-06-25 07:51:57.895920] I > [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker > Status Change??? status=Faulty > [2020-06-25 07:51:58.607405] I [gsyncdstatus(worker > /gluster/vg00/dispersed_fuse1024/brick):287:set_passive] > GeorepStatus: Worker Status Change??? status=Passive > [2020-06-25 07:51:58.607768] I [gsyncdstatus(worker > /gluster/vg01/dispersed_fuse1024/brick):287:set_passive] > GeorepStatus: Worker Status Change??? status=Passive > [2020-06-25 07:51:58.608004] I [gsyncdstatus(worker > /gluster/vg00/dispersed_fuse1024/brick):281:set_active] > GeorepStatus: Worker Status Change??? 
status=Active > > > On 25/06/2020 09:15, Rob.Quagliozzi at rabobank.com > <mailto:Rob.Quagliozzi at rabobank.com> wrote: >> >> Hi All, >> >> We?ve got two six node RHEL 7.8 clusters and geo-replication >> would appear to be completely broken between them. I?ve deleted >> the session, removed & recreated pem files, old changlogs/htime >> (after removing relevant options from volume) and completely set >> up geo-rep from scratch, but the new session comes up as >> Initializing, then goes faulty, and starts looping. Volume (on >> both sides) is a 4 x 2 disperse, running Gluster v6 (RH latest).? >> Gsyncd reports: >> >> [2020-06-25 07:07:14.701423] I >> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: >> Worker Status Change status=Initializing... >> >> [2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor] >> Monitor: starting gsyncd worker?? brick=/rhgs/brick20/brick >> slave_node=bxts470194.eu.rabonet.com >> <http://bxts470194.eu.rabonet.com> >> >> [2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor] >> Monitor: Worker would mount volume privately >> >> [2020-06-25 07:07:14.757181] I [gsyncd(agent >> /rhgs/brick20/brick):318:main] <top>: Using session config file >> path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf >> >> [2020-06-25 07:07:14.758126] D [subcmds(agent >> /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD????? >> rpc_fd='5,12,11,10' >> >> [2020-06-25 07:07:14.758627] I [changelogagent(agent >> /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining... >> >> [2020-06-25 07:07:14.764234] I [gsyncd(worker >> /rhgs/brick20/brick):318:main] <top>: Using session config file >> path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf >> >> [2020-06-25 07:07:14.779409] I [resource(worker >> /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH >> connection between master and slave... 
>> >> [2020-06-25 07:07:14.841793] D [repce(worker >> /rhgs/brick20/brick):195:push] RepceClient: call >> 6799:140380783982400:1593068834.84 __repce_version__() ... >> >> [2020-06-25 07:07:16.148725] D [repce(worker >> /rhgs/brick20/brick):215:__call__] RepceClient: call >> 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0 >> >> [2020-06-25 07:07:16.148911] D [repce(worker >> /rhgs/brick20/brick):195:push] RepceClient: call >> 6799:140380783982400:1593068836.15 version() ... >> >> [2020-06-25 07:07:16.149574] D [repce(worker >> /rhgs/brick20/brick):215:__call__] RepceClient: call >> 6799:140380783982400:1593068836.15 version -> 1.0 >> >> [2020-06-25 07:07:16.149735] D [repce(worker >> /rhgs/brick20/brick):195:push] RepceClient: call >> 6799:140380783982400:1593068836.15 pid() ... >> >> [2020-06-25 07:07:16.150588] D [repce(worker >> /rhgs/brick20/brick):215:__call__] RepceClient: call >> 6799:140380783982400:1593068836.15 pid -> 30703 >> >> [2020-06-25 07:07:16.150747] I [resource(worker >> /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection >> between master and slave established. duration=1.3712 >> >> [2020-06-25 07:07:16.150819] I [resource(worker >> /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster >> volume locally... >> >> [2020-06-25 07:07:16.265860] D [resource(worker >> /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary >> glusterfs mount in place >> >> [2020-06-25 07:07:17.272511] D [resource(worker >> /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary >> glusterfs mount prepared >> >> [2020-06-25 07:07:17.272708] I [resource(worker >> /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster >> volume????? duration=1.1218 >> >> [2020-06-25 07:07:17.272794] I [subcmds(worker >> /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn >> successful. 
>> Acknowledging back to monitor
>>
>> [2020-06-25 07:07:17.272973] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=xsync
>>
>> [2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor] Monitor: worker(/rhgs/brick20/brick) connected
>>
>> [2020-06-25 07:07:17.273678] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=changelog
>>
>> [2020-06-25 07:07:17.274224] D [master(worker /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection mode mode=changeloghistory
>>
>> [2020-06-25 07:07:17.276484] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 version() ...
>>
>> [2020-06-25 07:07:17.276916] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 version -> 1.0
>>
>> [2020-06-25 07:07:17.277009] D [master(worker /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
>>
>> [2020-06-25 07:07:17.277098] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.28 init() ...
>>
>> [2020-06-25 07:07:17.292944] D [repce(worker /rhgs/brick20/brick):215:__call__] RepceClient: call 6799:140380783982400:1593068837.28 init -> None
>>
>> [2020-06-25 07:07:17.293097] D [repce(worker /rhgs/brick20/brick):195:push] RepceClient: call 6799:140380783982400:1593068837.29 register('/rhgs/brick20/brick', '/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick', '/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log', 8, 5) ...
>> [2020-06-25 07:07:19.296294] E [repce(agent /rhgs/brick20/brick):121:worker] <top>: call failed:
>>
>> Traceback (most recent call last):
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
>>     res = getattr(self.obj, rmeth)(*in_data[2:])
>>   File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, in register
>>     return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, in cl_register
>>     cls.raise_changelog_err()
>>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, in raise_changelog_err
>>     raise ChangelogException(errn, os.strerror(errn))
>> ChangelogException: [Errno 2] No such file or directory
>>
>> [2020-06-25 07:07:19.297161] E [repce(worker /rhgs/brick20/brick):213:__call__] RepceClient: call failed call=6799:140380783982400:1593068837.29 method=register error=ChangelogException
>>
>> [2020-06-25 07:07:19.297338] E [resource(worker /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog register failed error=[Errno 2] No such file or directory
>>
>> [2020-06-25 07:07:19.315074] I [repce(agent /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on reaching EOF.
>>
>> [2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor] Monitor: worker died in startup phase brick=/rhgs/brick20/brick
>>
>> [2020-06-25 07:07:20.277383] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
>>
>> We've done everything we can think of, including an 'strace -f' on the pid, and we can't really find anything. I'm about to lose the last of my hair over this, so does anyone have any ideas at all? We've even removed the entire slave vol and rebuilt it.
>>
>> Thanks
>>
>> Rob
>>
>> *Rob Quagliozzi*
>> *Specialised Application Support*
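To Felix's question about whether removing the changelog and htime files needs more care than a plain rm: the precaution is mostly in the ordering, not the rm itself. A minimal sketch of the usual sequence, with volume, slave, and brick names below as placeholders (substitute your own) and assuming the default changelog location under the brick's .glusterfs directory:

```shell
# Sketch only -- "mastervol", "slavehost::slavevol", and the brick path
# are placeholders, not values from this thread.

# 1. Stop the geo-rep session so no worker/agent holds the changelog open.
gluster volume geo-replication mastervol slavehost::slavevol stop

# 2. Disable the changelog translator on the master volume
#    (the "relevant options" Rob mentions removing).
gluster volume set mastervol changelog.changelog off

# 3. On each master node, remove the old changelogs AND the htime index
#    together; an htime file left behind that still references deleted
#    changelog files is one way to hit "[Errno 2] No such file or
#    directory" at changelog register time.
rm -rf /gluster/vg00/dispersed_fuse1024/brick/.glusterfs/changelogs/htime
rm -f  /gluster/vg00/dispersed_fuse1024/brick/.glusterfs/changelogs/CHANGELOG.*

# 4. Re-enable the changelog and restart the session.
gluster volume set mastervol changelog.changelog on
gluster volume geo-replication mastervol slavehost::slavevol start
```

The point is that the htime file and the changelogs it indexes go away in the same step, while changelog.changelog is off, so nothing is writing to or reading from that directory during the removal.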
>> ________
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://bluejeans.com/441850968
>>
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users