Aravinda
2024-Feb-06 13:43 UTC
[Gluster-users] Geo-replication status is getting Faulty after few seconds
Geo-rep delete and create will recreate the session and continue syncing from where it stopped last time. The last sync time is remembered in an xattr on the brick root of the Primary volume (`trusted.glusterfs.<primary-volume-id>.<secondary-volume-id>.stime`). If the reset-sync-time option is used with delete, it removes the above-mentioned xattr, so when the Geo-rep session is created again, it starts as a new session. A new session starts with Hybrid crawl, then switches to History crawl, and then to Changelog crawl. History crawl does a file system crawl and identifies all the files as to be synced. Geo-rep uses rsync to sync the files (change detection is done externally and rsync only gets the list of files to be synced), and rsync will only sync the difference, if any. The only issue I see with using this option is that it may take a lot of time to reach the current time and start syncing the recent files.

> I tried to remove the secret.pem.pub from master1 and create it again. Then, I ran georep create push-pem, which copied the missing entry to the drtier1data (slave) node. However, I am still getting the same error. I have also checked the logs on drtier1data (slave) node and found the following logs.

Please share the errors you see in the Primary geo-rep logs.

--
Thanks and Regards
Aravinda
Kadalu Technologies

---- On Tue, 06 Feb 2024 18:50:37 +0530 Anant Saraswat <anant.saraswat at techblue.co.uk> wrote ---

Hi All,

Does anyone know what will happen if I delete the geo-replication with reset-sync-time and create it again? Will it copy the whole data again, or will it just check the files and then copy the new remaining files?

Thanks,
Anant

From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 04 February 2024 12:44 PM
To: gluster-users at gluster.org
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Hello Everyone,

I'm running out of options; I've already tried rebooting the physical servers, without any luck. Now I am thinking, should I try to delete the geo-replication with reset-sync-time? Will that resolve the issue? We have 6TB of data, so will running delete with reset-sync-time copy the whole 6TB of data again, or will it just check the files and then copy the new remaining files?

I have checked the documentation, and it's not clear what will happen to the files already present on the slave volume. Will they be recopied, or just checked and skipped, in case of deleting the session and creating it again? Does anyone have any experience with this situation? Any help is appreciated.

Many thanks,
Anant

From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 01 February 2024 6:30 PM
To: Aravinda <aravinda at kadalu.tech>
Cc: gluster-users at gluster.org
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Hi Aravinda,

Thanks for opening the gitlab issue. I tried to remove the secret.pem.pub from master1 and create it again. Then, I ran georep create push-pem, which copied the missing entry to the drtier1data (slave) node. However, I am still getting the same error.
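For anyone retracing these steps, a quick sanity check that push-pem registered everything on the secondary (assuming the default root-user session and the paths used in this thread) is to count the forced-command entries there; each count should equal the number of primary nodes, 3 in this setup:

# on drtier1data, as root
grep -c 'command="/usr/libexec/glusterfs/gsyncd"' /root/.ssh/authorized_keys
grep -c 'command="tar ' /root/.ssh/authorized_keys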
I have also checked the logs on the drtier1data (slave) node and found the following logs.

==> /var/log/glusterfs/geo-replication-slaves/tier1data_drtier1data_drtier1data/gsyncd.log <==
[2024-02-01 18:00:02.93139] I [resource(slave master1/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0831}]
[2024-02-01 18:00:02.129343] I [resource(slave master1/opt/tier1data2019/brick):1166:service_loop] GLUSTER: slave listening
[2024-02-01 18:00:06.603335] I [repce(slave master1/opt/tier1data2019/brick):96:service_loop] RepceServer: terminating on reaching EOF.

==> /var/log/glusterfs/geo-replication-slaves/tier1data_drtier1data_drtier1data/mnt-master1-opt-tier1data2019-brick.log <==
[2024-02-01 18:00:06.632144 +0000] I [fuse-bridge.c:6233:fuse_thread_proc] 0-fuse: initiating unmount of /tmp/gsyncd-aux-mount-j35hwrxj
[2024-02-01 18:00:06.633483 +0000] W [glusterfsd.c:1429:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x817a) [0x7f3797b7717a] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x5646034c5bfd] -->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x5646034c5a48] ) 0-: received signum (15), shutting down
[2024-02-01 18:00:06.633589 +0000] I [fuse-bridge.c:7063:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-j35hwrxj'.
[2024-02-01 18:00:06.633622 +0000] I [fuse-bridge.c:7068:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-j35hwrxj'.

Thanks,
Anant

From: Aravinda <aravinda at kadalu.tech>
Sent: 01 February 2024 8:58 AM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>
Cc: gluster-users at gluster.org
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

> I think we can enhance Geo-rep to accept a configuration option to select a worker as Active manually. I will think about it and update here if possible.

Opened a Github issue for the same: https://github.com/gluster/glusterfs/issues/4304

--
Aravinda
Kadalu Technologies

---- On Thu, 01 Feb 2024 07:43:40 +0530 Aravinda <aravinda at kadalu.tech> wrote ---

Hi Anant,

> Still same thing happening. One thing I have noticed is that every time it is master1 trying to be the primary node; how is it selected which node will be the primary node in the geo-replication?

The gsyncd worker on the first node of the Replica bricks becomes Active and participates in syncing. If the Active worker's node goes down, then the other worker becomes Active (but not when it is merely Faulty). Geo-rep can use a Gluster meta volume to hold a lock and switch automatically when an Active worker is Faulty (refer to the Meta Volume config: https://docs.gluster.org/en/main/Administrator-Guide/Geo-Replication/#configurable-options). I think we can enhance Geo-rep to accept a configuration option to select a worker as Active manually. I will think about it and update here if possible.
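For reference, per the Meta Volume section of the linked admin guide, the automatic switchover mentioned above needs the shared-storage meta volume enabled for the session; a minimal sketch using the volume names from this thread (this is the documented config, not something already applied in this setup):

# once on the primary cluster: create/enable the shared storage volume
gluster volume set all cluster.enable-shared-storage enable
# then point the geo-rep session at the meta volume
gluster volume geo-replication tier1data drtier1data::drtier1data config use_meta_volume true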
> Second, I have noticed in the ".ssh/authorized_keys" file on the secondary node that master2 and master3 have 2 entries (command="/usr/libexec/glusterfs/gsyncd" AND command="tar ${SSH_ORIGINAL_COMMAND#* }"), but the master1 node has only one entry, with command="tar ${SSH_ORIGINAL_COMMAND#* }". This indicates that there is a missing entry for master1 with command="/usr/libexec/glusterfs/gsyncd". Does it make any sense?

The missing entry is the problem, as you observed. Delete the /var/lib/glusterd/geo-replication/secret.pem.pub file on master1 and then run gsec_create and georep create push-pem again.

--
Aravinda
Kadalu Technologies

---- On Thu, 01 Feb 2024 00:27:34 +0530 Anant Saraswat <anant.saraswat at techblue.co.uk> wrote ---

Hi Aravinda,

As advised, I have removed all the master server-related entries from the ".ssh/authorized_keys" file on the secondary node. Then, I ran the "ssh-copy-id root at drtier1data" command on all the masters for passwordless SSH and verified that I can access the drtier1data server from all the master nodes. After that, I ran the following commands.

gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status

[root at master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS             CRAWL STATUS    LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A

[root at master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS     CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A              N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Active     History Crawl    N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A              N/A

[root at master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS     CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A              N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Faulty     N/A              N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A              N/A

Still the same thing is happening. One thing I have noticed is that every time it is master1 trying to be the primary node; how is it selected which node will be the primary node in the geo-replication?

Second, I have noticed in the ".ssh/authorized_keys" file on the secondary node that master2 and master3 have 2 entries (command="/usr/libexec/glusterfs/gsyncd" AND command="tar ${SSH_ORIGINAL_COMMAND#* }"), but the master1 node has only one entry, with command="tar ${SSH_ORIGINAL_COMMAND#* }". This indicates that there is a missing entry for master1 with command="/usr/libexec/glusterfs/gsyncd". Does it make any sense?

Really appreciate your help on this.

Thanks,
Anant

From: Aravinda <aravinda at kadalu.tech>
Sent: 31 January 2024 5:27 PM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>
Cc: gluster-users at gluster.org; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

In one of the threads, I saw that the SSH public key was manually copied to the secondary node's authorized_keys file. Check whether that entry starts with "command=" or not. If it does not, delete the entry in that file (or delete all Geo-rep related entries in that file) and run Geo-rep create push-pem force again.

--
Aravinda
Kadalu Technologies

---- On Wed, 31 Jan 2024 17:19:45 +0530 Anant Saraswat <anant.saraswat at techblue.co.uk> wrote ---

Hi Aravinda,

I used the exact same commands when I added master3 to the primary geo-replication. However, the issue is that the master3 node is in a passive state, and master1 is stuck in a loop of Initializing -> Active -> Faulty. It never considers master2 or master3 as the primary master for geo-replication.

If master1 can connect to the secondary (drtier1data) server, and I see the following message in the master1 logs which says "SSH connection between master and slave established.", do you still think it could be related to key issues? I am willing to rerun the commands from master1 if you advise.

[2024-01-30 23:33:14.274611] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-30 23:33:15.960004] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.6852}]
[2024-01-30 23:33:15.960300] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-30 23:33:16.995715] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0353}]
[2024-01-30 23:33:16.995905] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-30 23:33:19.154376] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-30 23:33:19.154759] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706657599}]
[2024-01-30 23:33:19.191343] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-30 23:33:19.191940] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-30 23:33:19.192105] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706657599}, {entry_stime=(1705935991, 0)}]
[2024-01-30 23:33:20.269529] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-30 23:33:20.385018] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-30 23:33:21.674] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-30 23:33:21.11514] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]

Many thanks,
Anant

From: Aravinda <aravinda at kadalu.tech>
Sent: 31 January 2024 11:14 AM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>
Cc: gluster-users at gluster.org; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Hi Anant,

You need to run the gsec_create command when a new node is added to the Primary or Secondary:

gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status

Or use the Geo-rep setup tool to fix the key-related issues and re-setup (https://github.com/aravindavk/gluster-georep-tools):

gluster-georep-setup tier1data drtier1data::drtier1data

--
Aravinda
Kadalu Technologies

---- On Wed, 31 Jan 2024 02:49:08 +0530 Anant Saraswat <anant.saraswat at techblue.co.uk> wrote ---

Hi All,

As per the documentation, if we use `delete` only, it will resume the replication from the point where it was left before deleting the session, so I tried that, without any luck.

gluster volume geo-replication tier1data drtier1data::drtier1data delete
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status
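For contrast with the plain delete used above: the variant that also discards the saved sync point, i.e. the reset-sync-time option discussed at the top of this thread, is the same command with the option appended (documented syntax, not something that was run here):

gluster volume geo-replication tier1data drtier1data::drtier1data delete reset-sync-time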
I have tried to check the drtier1data logs as well, and all I can see is that master1 connects to drtier1data and sends a disconnect after 5 seconds; please check the following logs from drtier1data.

[2024-01-30 21:04:03.016805 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-drtier1data-client-0: Connected, attached to remote volume [{conn-name=drtier1data-client-0}, {remote_subvol=/opt/tier1data2019/brick}]
[2024-01-30 21:04:03.020148 +0000] I [fuse-bridge.c:5296:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.33
[2024-01-30 21:04:03.020197 +0000] I [fuse-bridge.c:5924:fuse_graph_sync] 0-fuse: switched to graph 0
[2024-01-30 21:04:08.573873 +0000] I [fuse-bridge.c:6233:fuse_thread_proc] 0-fuse: initiating unmount of /tmp/gsyncd-aux-mount-c8c41k2k
[2024-01-30 21:04:08.575131 +0000] W [glusterfsd.c:1429:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x817a) [0x7fb907e2e17a] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x55f97b17dbfd] -->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x55f97b17da48] ) 0-: received signum (15), shutting down
[2024-01-30 21:04:08.575227 +0000] I [fuse-bridge.c:7063:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-c8c41k2k'.
[2024-01-30 21:04:08.575256 +0000] I [fuse-bridge.c:7068:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-c8c41k2k'.

Can anyone suggest how I can find the reason for these disconnect requests from master1, or what I should check next?

Many thanks,
A

From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 30 January 2024 2:14 PM
To: gluster-users at gluster.org; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Hello Everyone,

I am looking for some help. Can anyone please suggest if it's possible to promote a master node to be the primary in the geo-replication session? We have three master nodes and one secondary node. We are facing issues where geo-replication is consistently failing from the primary master node, and we want to check if it works fine from another master node. Any guidance or assistance would be highly appreciated.

Many thanks,
Anant

From: Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 29 January 2024 3:55 PM
To: gluster-users at gluster.org; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Hi Strahil,

We have been running this geo-replication for more than 5 years, and it was working fine till last week, so I don't think it is something that was missed in the initial setup; but I am unable to understand why it's not working now.

I have enabled SSH debug on the secondary node (drtier1data), and I can see this in the logs.
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: server_input_channel_req: channel 0 request exec reply 1
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: session_by_channel: session 0 channel 0
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: session_input_channel_req: session 0 req exec
Jan 29 14:25:52 drtier1data sshd[1268110]: Starting session: command for root from XX.236.28.58 port 53082 id 0
Jan 29 14:25:52 drtier1data sshd[1268095]: debug1: session_new: session 0
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: Received SIGCHLD.
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_by_pid: pid 1268111
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_exit_message: session 0 channel 0 pid 1268111
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_exit_message: release channel 0
Jan 29 14:25:58 drtier1data sshd[1268110]: Received disconnect from XX.236.28.58 port 53082:11: disconnected by user
Jan 29 14:25:58 drtier1data sshd[1268110]: Disconnected from user root XX.236.28.58 port 53082
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: do_cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: do_cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: closing session
Jan 29 14:25:58 drtier1data sshd[1268095]: pam_unix(sshd:session): session closed for user root
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: deleting credentials

As per the above logs, the drtier1data node receives a SIGCHLD and then a disconnect from master1 ("Received disconnect from XX.236.28.58 port 53082:11: disconnected by user").

Also, I have checked the gsyncd.log on master1, which says "SSH: SSH connection between master and slave established. [{duration=1.7277}]", which means passwordless ssh is working fine. As per my understanding, master1 can connect to the drtier1data server, the geo-replication status changes to Active -> History Crawl, and then something happens on master1 which triggers the SSH disconnect.

Is it possible to change the master node in geo-replication, so that we can mark master2 as primary instead of master1?

I am really struggling to fix this issue. Please help; any pointer is appreciated!

Many thanks,
Anant

From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 29 January 2024 12:20 AM
To: gluster-users at gluster.org; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Hi Strahil,

As mentioned in my last email, I have copied the gluster public key from master3 to the secondary server, and I can now ssh from all master nodes to the secondary server, but I am still getting the same error.

[root at master1 geo-replication]# ssh root at drtier1data -i /var/lib/glusterd/geo-replication/secret.pem
Last login: Mon Jan 29 00:14:32 2024 from
[root at drtier1data ~]#

[root at master2 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root at drtier1data
Last login: Mon Jan 29 00:02:34 2024 from
[root at drtier1data ~]#

[root at master3 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root at drtier1data
Last login: Mon Jan 29 00:14:41 2024 from
[root at drtier1data ~]#

Thanks,
Anant

From: Strahil Nikolov <hunter86_bg at yahoo.com>
Sent: 28 January 2024 10:07 PM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>; gluster-users at gluster.org
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Gluster doesn't use the ssh key in /root/.ssh, thus you need to exchange the public key that corresponds to /var/lib/glusterd/geo-replication/secret.pem. If you don't know the pub key, google how to obtain it from the private key. Ensure that all hosts can ssh to the secondary before proceeding with the troubleshooting.

Best Regards,
Strahil Nikolov

On Sun, Jan 28, 2024 at 15:58, Anant Saraswat <anant.saraswat at techblue.co.uk> wrote:

Hi All,

I have now copied /var/lib/glusterd/geo-replication/secret.pem.pub (the public key) from master3 to /root/.ssh/authorized_keys on drtier1data, and now I can ssh from master node3 to drtier1data using the georep key (/var/lib/glusterd/geo-replication/secret.pem). But I am still getting the same error, and geo-replication is getting faulty again and again.

[2024-01-28 13:46:38.897683] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449598}]
[2024-01-28 13:46:38.922491] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:46:38.923127] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:46:38.923313] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449598}, {entry_stime=(1705935991, 0)}]
[2024-01-28 13:46:39.973584] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:46:40.98970] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:46:40.757691] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:46:40.766860] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:46:50.793311] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:46:50.793469] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:46:50.874474] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:46:52.659114] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7844}]
[2024-01-28 13:46:52.659461] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:46:53.698769] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0392}]
[2024-01-28 13:46:53.698984] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:46:55.831999] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:46:55.832354] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449615}]
[2024-01-28 13:46:55.854684] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:46:55.855251] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:46:55.855419] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449615}, {entry_stime=(1705935991, 0)}]
[2024-01-28 13:46:56.905496] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:46:57.38262] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:46:57.704128] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:46:57.706743] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:07.741438] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:07.741582] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:07.821284] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:47:09.573661] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7521}]
[2024-01-28 13:47:09.573955] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:47:10.612173] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0381}]
[2024-01-28 13:47:10.612359] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:47:12.751856] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:12.752237] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449632}]
[2024-01-28 13:47:12.759138] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:47:12.759690] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:47:12.759868] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449632}, {entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:13.810321] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:47:13.924068] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:47:14.617663] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:14.620035] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:24.646013] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:24.646157] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:24.725510] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:47:26.491939] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7662}]
[2024-01-28 13:47:26.492235] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:47:27.530852] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0385}]
[2024-01-28 13:47:27.531036] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:47:29.670099] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:29.670640] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449649}]
[2024-01-28 13:47:29.696144] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:47:29.696709] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:47:29.696899] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449649}, {entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:30.751127] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:47:30.885824] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:47:31.535252] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:31.538450] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:41.564276] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:41.564426] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:41.645110] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:47:43.435830] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7904}]
[2024-01-28 13:47:43.436285] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:47:44.475671] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0393}]
[2024-01-28 13:47:44.475865] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:47:46.630478] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:46.630924] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449666}]
[2024-01-28 13:47:46.655069] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:47:46.655752] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:47:46.655926] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449666}, {entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:47.706875] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:47:47.834996] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:47:48.480822] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:48.491306] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:58.518263] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:58.518412] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:58.601096] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:48:00.355000] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7537}]
[2024-01-28 13:48:00.355345] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:48:01.395025] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0396}]
[2024-01-28 13:48:01.395212] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:48:03.541059] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:48:03.541481] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449683}]
[2024-01-28 13:48:03.567552] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:48:03.568172] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:48:03.568376] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449683}, {entry_stime=(1705935991, 0)}]
[2024-01-28 13:48:04.621488] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:48:04.742268] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:48:04.919335] I [master(worker /opt/tier1data2019/brick):2013:syncjob] Syncer: Sync Time Taken [{job=3}, {num_files=10}, {return_code=3}, {duration=0.0180}]
[2024-01-28 13:48:04.919919] E [syncdutils(worker /opt/tier1data2019/brick):847:errlog] Popen: command returned error [{cmd=rsync -aR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs --existing --xattrs --acls --ignore-missing-args . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-zo_ev6yu/75785990b3233f5dbbab9f43cc3ed895.sock drtier1data:/proc/799165/cwd}, {error=3}]
[2024-01-28 13:48:05.399226] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:48:05.403931] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:48:15.430175] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:48:15.430308] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:48:15.510770] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:48:17.240311] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7294}]
[2024-01-28 13:48:17.240509] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:48:18.279007] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0384}]
[2024-01-28 13:48:18.279195] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:48:20.455937] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:48:20.456274] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449700}]
[2024-01-28 13:48:20.464288] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:48:20.464807] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:48:20.464970] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449700}, {entry_stime=(1705935991, 0)}]
[2024-01-28 13:48:21.514201] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:48:21.644609] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:48:22.284920] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:48:22.286189] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:48:32.312378] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:48:32.312526] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:48:32.393484] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:48:34.91825] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.6981}]
[2024-01-28 13:48:34.92130] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...

Thanks,
Anant

From: Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 28 January 2024 1:33 AM
To: Strahil Nikolov <hunter86_bg at yahoo.com>; gluster-users at gluster.org
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Hi Strahil,

I have checked the ssh connection from all the master servers, and I can ssh drtier1data from the master1 and master2 servers (the old master servers), but I am unable to ssh drtier1data from master3 (the new node).

[root at master3 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root at drtier1data
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 325, in <module>
    main()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 259, in main
    if args.subcmd in ("worker"):
TypeError: 'in <string>' requires string as left operand, not NoneType
Connection to drtier1data closed.

But I am able to ssh drtier1data from master3 without using the georep key.

[root at master3 ~]# ssh root at drtier1data
Last login: Sun Jan 28 01:16:25 2024 from 87.246.74.32
[root at drtier1data ~]#
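A note on the TypeError above: it is consistent with the forced-command entries discussed earlier in the thread. When the matching authorized_keys line on the secondary carries command="/usr/libexec/glusterfs/gsyncd", an interactive ssh with the georep key runs gsyncd (with no subcommand) instead of opening a shell, so a gsyncd traceback rather than a prompt is actually the "key accepted" case. A way to see which forced command, if any, a key is bound to (standard paths assumed):

# on master3: print the key blob of the georep public key
awk '{print $2}' /var/lib/glusterd/geo-replication/secret.pem.pub
# on drtier1data: find that blob to see the command= prefix on its entry
grep -n '<paste key blob here>' /root/.ssh/authorized_keys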
Also, today I restarted the gluster server on master1, as geo-replication is trying to be active from the master1 server, and sometimes I am getting the following error in gsyncd.log:

[2024-01-28 01:27:24.722663] E [syncdutils(worker /opt/tier1data2019/brick):847:errlog] Popen: command returned error [{cmd=rsync -aR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs --existing --xattrs --acls --ignore-missing-args . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-0exuoeg7/75785990b3233f5dbbab9f43cc3ed895.sock drtier1data:/proc/553418/cwd}, {error=3}]

Many thanks,
Anant

From: Strahil Nikolov <hunter86_bg at yahoo.com>
Sent: 27 January 2024 5:25 AM
To: gluster-users at gluster.org; Anant Saraswat <anant.saraswat at techblue.co.uk>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Don't forget to test with the georep key. I think it was /var/lib/glusterd/geo-replication/secret.pem.

Best Regards,
Strahil Nikolov

On Saturday, 27 January 2024 at 07:24:07 GMT+2, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:

Hi Anant,

I would first start by checking whether you can do ssh from all masters to the slave node. If you haven't set up a dedicated user for the session, then gluster is using root.

Best Regards,
Strahil Nikolov

On Friday, 26 January 2024 at 18:07:59 GMT+2, Anant Saraswat <anant.saraswat at techblue.co.uk> wrote:

Hi All,

I have run the following commands on master3, and that has added master3 to geo-replication.

gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data stop
gluster volume geo-replication tier1data drtier1data::drtier1data start

Now I am able to start the geo-replication, but I am getting the same error.

[2024-01-24 19:51:24.80892] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-24 19:51:24.81020] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-24 19:51:24.158021] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-24 19:51:25.951998] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7938}]
[2024-01-24 19:51:25.952292] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-24 19:51:26.986974] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0346}]
[2024-01-24 19:51:26.987137] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-24 19:51:29.139131] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-24 19:51:29.139531] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706125889}]
[2024-01-24 19:51:29.173877] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-24 19:51:29.174407] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-24 19:51:29.174558] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706125889}, {entry_stime=(1705935991, 0)}]
[2024-01-24 19:51:30.251965] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-24 19:51:30.376715] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-24 19:51:30.991856] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-24 19:51:30.993608] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]

Any idea why it's stuck in this loop?

Thanks,
Anant

________________________________

From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 22 January 2024 9:00 PM
To: gluster-users at gluster.org
Subject: [Gluster-users] Geo-replication status is getting Faulty after few seconds

Hi There,

We have a Gluster setup with three master nodes in replicated mode and one slave node with geo-replication.

# gluster volume info
Volume Name: tier1data
Type: Replicate
Volume ID: 93c45c14-f700-4d50-962b-7653be471e27
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: master1:/opt/tier1data2019/brick
Brick2: master2:/opt/tier1data2019/brick
Brick3: master3:/opt/tier1data2019/brick

master1 |
master2 | ------------------------------geo-replication----------------------------- | drtier1data
master3 |

We added the master3 node a few months back; the initial setup consisted of 2 master nodes and one geo-replicated slave (drtier1data). Our geo-replication was functioning well with the initial two master nodes (master1 and master2), where master1 was active and master2 was in passive mode. However, today we started experiencing issues where geo-replication suddenly stopped and became stuck in a loop of Initializing... -> Active -> Faulty on master1, while master2 remained in passive mode.

Upon checking the gsyncd.log on the master1 node, we observed the following error (please refer to the attached logs for more details):

E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]

# gluster volume geo-replication tier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                             SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------------------------
master1        tier1data     /opt/tier1data2019/brick    root          ssh://drtier1data::drtier1data    N/A           Faulty     N/A             N/A
master2        tier1data     /opt/tier1data2019/brick    root          ssh://drtier1data::drtier1data                  Passive    N/A             N/A

Suspecting an issue on drtier1data (the slave), I attempted to restart Gluster on the slave node, and also tried to restart the drtier1data server, without any luck. After that, I tried the following command to get the Primary log file for geo-replication on master1, and got the following error.

# gluster volume geo-replication tier1data drtier1data::drtier1data config log-file
Staging failed on master3. Error: Geo-replication session between tier1data and drtier1data::drtier1data does not exist.
geo-replication command failed

Master3 was the new node added a few months back, but geo-replication was working until today, and we never added this node under geo-replication.

After that, I forcefully stopped the geo-replication, thinking that restarting geo-replication might fix the issue. However, now the geo-replication is not starting and is giving the same error.

# gluster volume geo-replication tier1data drtier1data::drtier1data start force
Staging failed on master3. Error: Geo-replication session between tier1data and drtier1data::drtier1data does not exist.
geo-replication command failed

Can anyone please suggest what I should do next to resolve this issue? As there is 5TB of data in this volume, I don't want to resync the entire data to drtier1data. Instead, I want to resume the sync from where it last stopped.

Thanks in advance for any guidance/help.

Kind regards,
Anant
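On the "resume from where it last stopped" point: per Aravinda's explanation at the top of this thread, the resume position lives in the stime xattr on each primary brick root, so one way to confirm a saved sync point still exists is to dump that xattr (run as root on a master node; the exact attribute name embeds the two volume UUIDs, hence the grep):

getfattr -d -m . -e hex /opt/tier1data2019/brick | grep stime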
If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation. ________ Community Meeting Calendar: Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC Bridge: https://urldefense.com/v3/__https://meet.google.com/cpu-eiue-hvk__;!!I_DbfM1H!F7q10L7Ll47-D3yZGbN2O2r5kpvRMP5-jO46OzkZu67d3vzF5AozNKd3umkbfJ7_29i2g8TKkhoGfcySbARDGIvy66WGiu4$ Gluster-users mailing list mailto:Gluster-users at gluster.org https://urldefense.com/v3/__https://lists.gluster.org/mailman/listinfo/gluster-users__;!!I_DbfM1H!F7q10L7Ll47-D3yZGbN2O2r5kpvRMP5-jO46OzkZu67d3vzF5AozNKd3umkbfJ7_29i2g8TKkhoGfcySbARDGIvyRXX72tM$ DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation. DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation. ________ Community Meeting Calendar: Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC Bridge: https://urldefense.com/v3/__https://meet.google.com/cpu-eiue-hvk__;!!I_DbfM1H!BhqOtEyWh2TghCsLjQAu6kNSYqtFoRKfU8nSfUnVX4ZZfmlMRVka_BTot4C1XsR1fBFnoBMgNlYUi-xIXH9J5jo8klnopO0$ Gluster-users mailing list mailto:Gluster-users at gluster.org https://urldefense.com/v3/__https://lists.gluster.org/mailman/listinfo/gluster-users__;!!I_DbfM1H!BhqOtEyWh2TghCsLjQAu6kNSYqtFoRKfU8nSfUnVX4ZZfmlMRVka_BTot4C1XsR1fBFnoBMgNlYUi-xIXH9J5jo8xixa_Vs$ DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system. 
If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation. DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation. DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation. ________ Community Meeting Calendar: Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC Bridge: https://meet.google.com/cpu-eiue-hvk Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20240206/bd16bd87/attachment-0001.html>
Anant Saraswat
2024-Feb-06 13:49 UTC
[Gluster-users] __Geo-replication status is getting Faulty after few seconds
Thanks @Aravinda, Please check the logs from the primary master node.
[2024-02-04 20:15:10.150540] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-02-04 20:15:10.150680] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-02-04 20:15:10.225295] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-02-04 20:15:11.915725] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.6902}]
[2024-02-04 20:15:11.916033] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-02-04 20:15:12.950741] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0346}]
[2024-02-04 20:15:12.950921] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-02-04 20:15:15.109059] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-02-04 20:15:15.109421] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1707077715}]
[2024-02-04 20:15:15.115965] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-02-04 20:15:15.116454] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-02-04 20:15:15.116628] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1707077715},
{entry_stime=(1705935991, 0)}]
[2024-02-04 20:15:16.194210] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-02-04 20:15:16.323585] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-02-04 20:15:16.955697] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-02-04 20:15:16.957722] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-02-04 20:15:26.993059] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
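The ENOTCONN above is raised when the worker's local auxiliary GlusterFS mount dies, not by the SSH leg. A sketch of what to check next on master1 (the primary-side log directory name is an assumption, mirrored from the slave-side paths quoted later in this thread):

# Primary-side log of the gsyncd auxiliary mount
tail -100 /var/log/glusterfs/geo-replication/tier1data_drtier1data_drtier1data/mnt-*.log
# And confirm all bricks of the volume are online, since an unhealthy client graph can drop the mount
gluster volume status tier1data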
________________________________
From: Aravinda <aravinda at kadalu.tech>
Sent: 06 February 2024 1:43 PM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>
Cc: gluster-users at gluster.org <gluster-users at gluster.org>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few seconds
Geo-rep delete and create will recreate the session and continue syncing from
where it stopped last time. Last sync time is remembered in the xattrs of Brick
root of Primary volume
(`trusted.glusterfs.<primary-volume-id>.<secondary-volume-id>.stime`).
If reset-sync-time option used with delete, then it will remove the above
mentioned xattr. When Geo-rep session is created again, it will start a new
session.
New session starts with Hybrid crawl then switches to History crawl and then
switches to Changelog crawl. History crawl does file system crawl and all the
files will be identified as to be synced. Geo-rep uses rsync to sync the files
(Change detection is done externally and rsync only gets the list of files to be
synced), rsync will only sync the difference if any.
The only issue I see with using this option is that it may take a lot of time to reach the current time and start syncing the recent files.
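If it helps to verify the checkpoint before and after a delete, the stime xattr can be read straight off the brick root; a minimal sketch (the xattr name embeds the two volume UUIDs, so it is matched with a pattern here rather than typed out):

# On a primary brick root; prints trusted.glusterfs.<primary-vol-id>.<secondary-vol-id>.stime if present
getfattr -d -m 'trusted.glusterfs.*stime' -e hex /opt/tier1data2019/brick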
> I tried to remove the secret.pem.pub from master1 and create it again.
Then, I ran georep create push-pem, which copied the missing entry to the
drtier1data (slave) node. However, I am still getting the same error. I have
also checked the logs on the drtier1data (slave) node and found the following logs.
Please share the errors you see from Primary geo-rep logs.
--
Thanks and Regards
Aravinda
Kadalu Technologies
---- On Tue, 06 Feb 2024 18:50:37 +0530 Anant Saraswat <anant.saraswat at
techblue.co.uk> wrote ---
Hi All,
Does anyone know what will happen if I delete the geo-replication with
reset-sync-time and create it again, Will it copy the whole data again, or will
it just check the files and then copy the new remaining files?
Thanks,
Anant
________________________________
From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 04 February 2024 12:44 PM
To: gluster-users at gluster.org <gluster-users at gluster.org>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Hello Everyone,
I'm running out of options, I've already tried rebooting the physical
servers without any luck.
Now, I am thinking, should I try to delete the geo-replication with
reset-sync-time? Will that resolve the issue? We have 6TB of data, so will
running delete with reset-sync-time copy the whole 6TB of data again, or will it
just check the files and then copy the new remaining files?
I have checked the documentation, and it's not clear what will happen with
the files already present on the slave volume. Will they be recopied, or just
checked and skipped in case of deleting the session and creating it again?
Anyone have any experience around this situation? Any help is appreciated.
Many thanks,
Anant
________________________________
From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 01 February 2024 6:30 PM
To: Aravinda <aravinda at kadalu.tech>
Cc: gluster-users at gluster.org <gluster-users at gluster.org>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Hi @Aravinda,
Thanks for opening the gitlab issue.
I tried to remove the secret.pem.pub from master1 and create it again. Then, I
ran georep create push-pem, which copied the missing entry to the drtier1data
(slave) node. However, I am still getting the same error. I have also checked
the logs on the drtier1data (slave) node and found the following logs.
==> /var/log/glusterfs/geo-replication-slaves/tier1data_drtier1data_drtier1data/gsyncd.log <==
[2024-02-01 18:00:02.93139] I [resource(slave
master1/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0831}]
[2024-02-01 18:00:02.129343] I [resource(slave
master1/opt/tier1data2019/brick):1166:service_loop] GLUSTER: slave listening
[2024-02-01 18:00:06.603335] I [repce(slave
master1/opt/tier1data2019/brick):96:service_loop] RepceServer: terminating on
reaching EOF.
==> /var/log/glusterfs/geo-replication-slaves/tier1data_drtier1data_drtier1data/mnt-master1-opt-tier1data2019-brick.log <==
[2024-02-01 18:00:06.632144 +0000] I [fuse-bridge.c:6233:fuse_thread_proc]
0-fuse: initiating unmount of /tmp/gsyncd-aux-mount-j35hwrxj
[2024-02-01 18:00:06.633483 +0000] W [glusterfsd.c:1429:cleanup_and_exit]
(-->/lib64/libpthread.so.0(+0x817a) [0x7f3797b7717a]
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x5646034c5bfd]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x5646034c5a48] ) 0-: received
signum (15), shutting down
[2024-02-01 18:00:06.633589 +0000] I [fuse-bridge.c:7063:fini] 0-fuse:
Unmounting '/tmp/gsyncd-aux-mount-j35hwrxj'.
[2024-02-01 18:00:06.633622 +0000] I [fuse-bridge.c:7068:fini] 0-fuse: Closing
fuse connection to '/tmp/gsyncd-aux-mount-j35hwrxj'.
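Those two slave-side logs tell the same story from the other end: the RepceServer hits EOF and the aux mount is torn down, i.e. the slave is reacting to the primary worker dying rather than failing on its own. Watching both files while the primary worker restarts makes that ordering visible; a sketch using the paths quoted above:

tail -f /var/log/glusterfs/geo-replication-slaves/tier1data_drtier1data_drtier1data/gsyncd.log /var/log/glusterfs/geo-replication-slaves/tier1data_drtier1data_drtier1data/mnt-master1-opt-tier1data2019-brick.log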
Thanks,
Anant
________________________________
From: Aravinda <aravinda at kadalu.tech>
Sent: 01 February 2024 8:58 AM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>
Cc: gluster-users at gluster.org <gluster-users at gluster.org>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
> I think we can enhance Geo-rep to accept a configuration option to select a
worker as Active manually. I will think about it and update here if possible.
Opened a Github issue for the same.
https://github.com/gluster/glusterfs/issues/4304
--
Aravinda
Kadalu Technologies
---- On Thu, 01 Feb 2024 07:43:40 +0530 Aravinda <aravinda at kadalu.tech> wrote ---
Hi Anant,
> The same thing is still happening. One thing I have noticed is that every time
it is master1 that tries to become the primary node. How is it decided which
node becomes the primary node in the geo-replication?
The gsyncd worker from the first node of the Replica bricks will become Active and
participate in syncing. If the Active worker node goes down, then the other
worker becomes Active (not when it is merely Faulty). Geo-rep can use the Gluster meta
volume to hold a lock and automatically switch when an Active worker is Faulty
(refer to the Meta Volume config:
https://docs.gluster.org/en/main/Administrator-Guide/Geo-Replication/#configurable-options).
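For reference, wiring up the meta volume is a two-step change; a sketch, assuming the shared-storage volume is not yet enabled (both settings are the ones described in the Admin Guide linked above):

# Creates and mounts the gluster_shared_storage volume used for the lock files
gluster volume set all cluster.enable-shared-storage enable
# Point this geo-rep session at it
gluster volume geo-replication tier1data drtier1data::drtier1data config use_meta_volume true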
I think we can enhance Geo-rep to accept a configuration option to select a worker
as Active manually. I will think about it and update here if possible.
> Second, I have noticed in the ".ssh/authorized_keys" file on the
secondary node that master2 and master3 have 2 entries each
(command="/usr/libexec/glusterfs/gsyncd" AND command="tar
${SSH_ORIGINAL_COMMAND#* }"), but the master1 node has only one entry, with
command="tar ${SSH_ORIGINAL_COMMAND#* }". This indicates that there
is a missing entry for master1 with
command="/usr/libexec/glusterfs/gsyncd". Does it make any sense?
The missing entry is the problem, as you observed. Delete the
/var/lib/glusterd/geo-replication/secret.pem.pub file on master1 and then run
gsec_create and georep create push-pem again.
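Spelled out as commands, the sequence would look like this (a sketch; the create ... force form is the one used elsewhere in this thread, and the mv is simply a non-destructive way to "delete" the old pub key):

# On master1
mv /var/lib/glusterd/geo-replication/secret.pem.pub /var/lib/glusterd/geo-replication/secret.pem.pub.bak
gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force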
--
Aravinda
Kadalu Technologies
---- On Thu, 01 Feb 2024 00:27:34 +0530 Anant Saraswat <anant.saraswat at techblue.co.uk> wrote ---
Hi @Aravinda,
As advised, I have removed all the master server-related entries from the
".ssh/authorized_keys" file on the secondary node. Then, I ran the
"ssh-copy-id root at drtier1data" command on all the masters for
passwordless SSH and verified that I can access the drtier1data server from all
the master nodes.
After that, I ran the following commands.
gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create
push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status
[root at master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS             CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A

[root at master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS     CRAWL STATUS     LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A              N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Active     History Crawl    N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A              N/A

[root at master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A             N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Faulty     N/A             N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A             N/A
The same thing is still happening. One thing I have noticed is that every time it
is master1 that tries to become the primary node. How is it decided which node
becomes the primary node in the geo-replication?
Second, I have noticed in the ".ssh/authorized_keys" file on the
secondary node that master2 and master3 have 2 entries each
(command="/usr/libexec/glusterfs/gsyncd" AND command="tar
${SSH_ORIGINAL_COMMAND#* }"), but the master1 node has only one entry, with
command="tar ${SSH_ORIGINAL_COMMAND#* }". This indicates that there
is a missing entry for master1 with
command="/usr/libexec/glusterfs/gsyncd". Does it make any sense?
Really appreciate your help on this.
Thanks,
Anant
________________________________
From: Aravinda <aravinda at kadalu.tech>
Sent: 31 January 2024 5:27 PM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>
Cc: gluster-users at gluster.org <gluster-users at gluster.org>; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
In one of the earlier messages, I saw that the SSH public key was manually copied to the
secondary node's authorized_keys file. Check whether that entry starts with
"command=" or not. If not, delete the entry in that file (or
delete all Georep-related entries in that file) and run Georep create push-pem
force again.
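A quick way to audit that on the secondary (a sketch; geo-rep normally pushes one gsyncd entry and one tar entry per master, each command-restricted):

# On drtier1data: every geo-rep entry should begin with command=
grep -n 'command=' /root/.ssh/authorized_keys
grep -c 'gsyncd' /root/.ssh/authorized_keys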
--
Aravinda
Kadalu Technologies
---- On Wed, 31 Jan 2024 17:19:45 +0530 Anant Saraswat <anant.saraswat at techblue.co.uk> wrote ---
Hi @Aravinda,
I used the exact same commands when I added master3 to the primary
geo-replication. However, the issue is that the master3 node is in a passive
state, and master1 is stuck in a loop of Initializing -> Active -> Faulty.
It never considers master2 or master3 as the primary master for geo-replication.
If master1 can connect to the secondary (drtier1data) server, and I see the
following message in the master1 logs, which says "SSH connection between
master and slave established.", do you still think it could be related to
key issues? I am willing to rerun the commands from master1 if you advise.
[2024-01-30 23:33:14.274611] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-30 23:33:15.960004] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.6852}]
[2024-01-30 23:33:15.960300] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-01-30 23:33:16.995715] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0353}]
[2024-01-30 23:33:16.995905] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-01-30 23:33:19.154376] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-30 23:33:19.154759] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706657599}]
[2024-01-30 23:33:19.191343] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-30 23:33:19.191940] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-30 23:33:19.192105] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706657599},
{entry_stime=(1705935991, 0)}]
[2024-01-30 23:33:20.269529] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-30 23:33:20.385018] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-30 23:33:21.674] I [monitor(monitor):228:monitor] Monitor: worker died
in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-30 23:33:21.11514] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
Many thanks,
Anant
________________________________
From: Aravinda <aravinda at kadalu.tech>
Sent: 31 January 2024 11:14 AM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>
Cc: gluster-users at gluster.org <gluster-users at gluster.org>; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Hi Anant,
You need to run the gsec_create command when a new node is added to the Primary or
the Secondary volume:
gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create
push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status
Or use the Geo-rep setup tool to fix the key-related issues and re-setup
(https://github.com/aravindavk/gluster-georep-tools):
gluster-georep-setup tier1data drtier1data::drtier1data
--
Aravinda
Kadalu Technologies
---- On Wed, 31 Jan 2024 02:49:08 +0530 Anant Saraswat <anant.saraswat at techblue.co.uk> wrote ---
Hi All,
As per the documentation, if we use `delete` alone, replication should resume
from where it left off before the session was deleted, so I tried that,
without any luck.
gluster volume geo-replication tier1data drtier1data::drtier1data delete
gluster volume geo-replication tier1data drtier1data::drtier1data create
push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status
I have checked the drtier1data logs as well, and all I can see is that master1
connects to drtier1data and sends a disconnect after 5 seconds. Please check the
following logs from drtier1data.
[2024-01-30 21:04:03.016805 +0000] I [MSGID: 114046]
[client-handshake.c:857:client_setvolume_cbk] 0-drtier1data-client-0: Connected,
attached to remote volume [{conn-name=drtier1data-client-0},
{remote_subvol=/opt/tier1data2019/brick}]
[2024-01-30 21:04:03.020148 +0000] I [fuse-bridge.c:5296:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.33
[2024-01-30 21:04:03.020197 +0000] I [fuse-bridge.c:5924:fuse_graph_sync]
0-fuse: switched to graph 0
[2024-01-30 21:04:08.573873 +0000] I [fuse-bridge.c:6233:fuse_thread_proc]
0-fuse: initiating unmount of /tmp/gsyncd-aux-mount-c8c41k2k
[2024-01-30 21:04:08.575131 +0000] W [glusterfsd.c:1429:cleanup_and_exit]
(-->/lib64/libpthread.so.0(+0x817a) [0x7fb907e2e17a]
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x55f97b17dbfd]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x55f97b17da48] ) 0-: received
signum (15), shutting down
[2024-01-30 21:04:08.575227 +0000] I [fuse-bridge.c:7063:fini] 0-fuse:
Unmounting '/tmp/gsyncd-aux-mount-c8c41k2k'.
[2024-01-30 21:04:08.575256 +0000] I [fuse-bridge.c:7068:fini] 0-fuse: Closing
fuse connection to '/tmp/gsyncd-aux-mount-c8c41k2k'.
Can anyone suggest how I can find the reason for these disconnect
requests from master1, or what I should check next?
Many thanks,
A
________________________________
From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 30 January 2024 2:14 PM
To: gluster-users at gluster.org <gluster-users at gluster.org>; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Hello Everyone,
I am looking for some help. Can anyone please suggest if it's possible to
promote a master node to be the primary in the geo-replication session?
We have three master nodes and one secondary node. We are facing issues where
geo-replication is consistently failing from the primary master node. We want to
check if it works fine from another master node.
Any guidance or assistance would be highly appreciated.
Many thanks,
Anant
________________________________
From: Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 29 January 2024 3:55 PM
To: gluster-users at gluster.org <gluster-users at gluster.org>; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Hi @Strahil Nikolov,
We have been running this geo-replication for more than 5 years and it was
working fine till last week, so I don't think it's something that was
missed in the initial setup, but I am unable to understand why it's not
working now.
I have enabled SSH debug on the secondary node (drtier1data), and I can see this
in the logs.
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: server_input_channel_req:
channel 0 request exec reply 1
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: session_by_channel: session 0
channel 0
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: session_input_channel_req:
session 0 req exec
Jan 29 14:25:52 drtier1data sshd[1268110]: Starting session: command for root
from XX.236.28.58 port 53082 id 0
Jan 29 14:25:52 drtier1data sshd[1268095]: debug1: session_new: session 0
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: Received SIGCHLD.
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_by_pid: pid 1268111
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_exit_message: session
0 channel 0 pid 1268111
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_exit_message: release
channel 0
Jan 29 14:25:58 drtier1data sshd[1268110]: Received disconnect from XX.236.28.58
port 53082:11: disconnected by user
Jan 29 14:25:58 drtier1data sshd[1268110]: Disconnected from user root
XX.236.28.58 port 53082
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: do_cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: do_cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: closing session
Jan 29 14:25:58 drtier1data sshd[1268095]: pam_unix(sshd:session): session
closed for user root
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: deleting credentials
As per the above logs, the sshd on drtier1data receives a clean disconnect from
master1 ("Received disconnect from XX.236.28.58 port 53082:11: disconnected by user").
Also, I have checked the gsyncd.log on master1, which says "SSH: SSH
connection between master and slave established. [{duration=1.7277}]",
so passwordless ssh is working fine.
As per my understanding, master1 can connect to the drtier1data server, the
geo-replication status changes to Active --> History Crawl, and then
something happens on master1 which triggers the SSH disconnect.
Is it possible to change the master node in geo-replication so that we can mark
master2 as primary instead of master1?
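One way to test that theory is to reproduce the worker's auxiliary mount by hand on master1 and see whether it stays connected (a sketch; the mount point name is arbitrary):

mkdir -p /mnt/georep-test
glusterfs --volfile-server=localhost --volfile-id=tier1data /mnt/georep-test
# If this mount also drops with ENOTCONN, the problem is the local client, not SSH
ls /mnt/georep-test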
I am really struggling to fix this issue. Please help; any pointer is
appreciated!
Many thanks,
Anant
________________________________
From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 29 January 2024 12:20 AM
To: gluster-users at gluster.org <gluster-users at gluster.org>; Strahil Nikolov <hunter86_bg at yahoo.com>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Hi Strahil,
As mentioned in my last email, I have copied the gluster public key from master3
to the secondary server, and I can now ssh from all master nodes to the secondary
server, but I am still getting the same error.
[root at master1 geo-replication]# ssh root at drtier1data -i
/var/lib/glusterd/geo-replication/secret.pem
Last login: Mon Jan 29 00:14:32 2024 from
[root at drtier1data ~]#
[root at master2 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root at
drtier1data
Last login: Mon Jan 29 00:02:34 2024 from
[root at drtier1data ~]#
[root at master3 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root at
drtier1data
Last login: Mon Jan 29 00:14:41 2024 from
[root at drtier1data ~]#
Thanks,
Anant
________________________________
From: Strahil Nikolov <hunter86_bg at yahoo.com>
Sent: 28 January 2024 10:07 PM
To: Anant Saraswat <anant.saraswat at techblue.co.uk>; gluster-users at gluster.org <gluster-users at gluster.org>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Gluster doesn't use the ssh key in /root/.ssh, thus you need to exchange the
public key that corresponds to /var/lib/glusterd/geo-replication/secret.pem . If
you don't know the pub key, google how to obtain it from the private key.
Ensure that all hosts can ssh to the secondary before proceeding with the
troubleshooting.
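Deriving the pub key from the private key is a one-liner with ssh-keygen; a sketch:

# Prints the public key that corresponds to the geo-rep private key
ssh-keygen -y -f /var/lib/glusterd/geo-replication/secret.pem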
Best Regards,
Strahil Nikolov
On Sun, Jan 28, 2024 at 15:58, Anant Saraswat <anant.saraswat at techblue.co.uk> wrote:
Hi All,
I have now copied /var/lib/glusterd/geo-replication/secret.pem.pub (public
key) from master3 to /root/.ssh/authorized_keys on drtier1data, and now I can ssh
from master3 to drtier1data using the georep key
(/var/lib/glusterd/geo-replication/secret.pem).
But I am still getting the same error, and geo-replication is going faulty
again and again.
[2024-01-28 13:46:38.897683] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706449598}]
[2024-01-28 13:46:38.922491] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-28 13:46:38.923127] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-28 13:46:38.923313] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706449598},
{entry_stime=(1705935991, 0)}]
[2024-01-28 13:46:39.973584] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-28 13:46:40.98970] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-28 13:46:40.757691] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:46:40.766860] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:46:50.793311] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:46:50.793469] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:46:50.874474] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-28 13:46:52.659114] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.7844}]
[2024-01-28 13:46:52.659461] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-01-28 13:46:53.698769] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0392}]
[2024-01-28 13:46:53.698984] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-01-28 13:46:55.831999] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:46:55.832354] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706449615}]
[2024-01-28 13:46:55.854684] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-28 13:46:55.855251] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-28 13:46:55.855419] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706449615},
{entry_stime=(1705935991, 0)}]
[2024-01-28 13:46:56.905496] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-28 13:46:57.38262] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-28 13:46:57.704128] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:46:57.706743] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:07.741438] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:07.741582] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:07.821284] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-28 13:47:09.573661] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.7521}]
[2024-01-28 13:47:09.573955] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-01-28 13:47:10.612173] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0381}]
[2024-01-28 13:47:10.612359] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-01-28 13:47:12.751856] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:12.752237] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706449632}]
[2024-01-28 13:47:12.759138] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-28 13:47:12.759690] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-28 13:47:12.759868] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706449632},
{entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:13.810321] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-28 13:47:13.924068] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-28 13:47:14.617663] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:14.620035] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:24.646013] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:24.646157] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:24.725510] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-28 13:47:26.491939] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.7662}]
[2024-01-28 13:47:26.492235] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-01-28 13:47:27.530852] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0385}]
[2024-01-28 13:47:27.531036] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-01-28 13:47:29.670099] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:29.670640] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706449649}]
[2024-01-28 13:47:29.696144] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-28 13:47:29.696709] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-28 13:47:29.696899] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706449649},
{entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:30.751127] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-28 13:47:30.885824] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-28 13:47:31.535252] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:31.538450] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:41.564276] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:41.564426] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:41.645110] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-28 13:47:43.435830] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.7904}]
[2024-01-28 13:47:43.436285] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-01-28 13:47:44.475671] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0393}]
[2024-01-28 13:47:44.475865] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-01-28 13:47:46.630478] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:46.630924] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706449666}]
[2024-01-28 13:47:46.655069] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-28 13:47:46.655752] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-28 13:47:46.655926] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706449666},
{entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:47.706875] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-28 13:47:47.834996] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-28 13:47:48.480822] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:48.491306] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:58.518263] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:58.518412] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:58.601096] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-28 13:48:00.355000] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.7537}]
[2024-01-28 13:48:00.355345] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-01-28 13:48:01.395025] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0396}]
[2024-01-28 13:48:01.395212] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-01-28 13:48:03.541059] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:48:03.541481] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706449683}]
[2024-01-28 13:48:03.567552] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-28 13:48:03.568172] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-28 13:48:03.568376] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706449683},
{entry_stime=(1705935991, 0)}]
[2024-01-28 13:48:04.621488] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-28 13:48:04.742268] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-28 13:48:04.919335] I [master(worker
/opt/tier1data2019/brick):2013:syncjob] Syncer: Sync Time Taken [{job=3},
{num_files=10}, {return_code=3}, {duration=0.0180}]
[2024-01-28 13:48:04.919919] E [syncdutils(worker
/opt/tier1data2019/brick):847:errlog] Popen: command returned error [{cmd=rsync
-aR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs
--existing --xattrs --acls --ignore-missing-args . -e ssh
-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
/var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S
/tmp/gsyncd-aux-ssh-zo_ev6yu/75785990b3233f5dbbab9f43cc3ed895.sock
drtier1data:/proc/799165/cwd}, {error=3}]
[2024-01-28 13:48:05.399226] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:48:05.403931] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:48:15.430175] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:48:15.430308] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:48:15.510770] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-28 13:48:17.240311] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.7294}]
[2024-01-28 13:48:17.240509] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-01-28 13:48:18.279007] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0384}]
[2024-01-28 13:48:18.279195] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-01-28 13:48:20.455937] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:48:20.456274] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706449700}]
[2024-01-28 13:48:20.464288] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-28 13:48:20.464807] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-28 13:48:20.464970] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706449700},
{entry_stime=(1705935991, 0)}]
[2024-01-28 13:48:21.514201] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-28 13:48:21.644609] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-28 13:48:22.284920] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:48:22.286189] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:48:32.312378] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:48:32.312526] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:48:32.393484] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-28 13:48:34.91825] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.6981}]
[2024-01-28 13:48:34.92130] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
Thanks,
Anant
________________________________
From: Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 28 January 2024 1:33 AM
To: Strahil Nikolov <hunter86_bg at yahoo.com>; gluster-users at gluster.org <gluster-users at gluster.org>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Hi @Strahil Nikolov,
I have checked the ssh connection from all the master servers; I can ssh to
drtier1data from master1 and master2 (the old master servers), but I am unable
to ssh to drtier1data from master3 (the new node).
[root at master3 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root at
drtier1data
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 325,
in <module>
main()
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 259,
in main
if args.subcmd in ("worker"):
TypeError: 'in <string>' requires string as left operand, not
NoneType
Connection to drtier1data closed.
But I am able to ssh drtier1data from master3 without using the georep key.
[root at master3 ~]# ssh root at drtier1data
Last login: Sun Jan 28 01:16:25 2024 from 87.246.74.32
[root at drtier1data ~]#
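Worth noting: the traceback above most likely comes from the forced command on that key rather than from SSH itself. Because the authorized_keys entry for secret.pem is restricted with command="/usr/libexec/glusterfs/gsyncd", an interactive login executes gsyncd with no subcommand, which then crashes with the TypeError; the SSH transport itself worked. To see the restriction (a sketch, run on drtier1data):

grep 'command=' /root/.ssh/authorized_keys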
Also, today I restarted the gluster server on master1, as geo-replication keeps
trying to become active from master1, and sometimes I get the
following error in gsyncd.log:
[2024-01-28 01:27:24.722663] E [syncdutils(worker
/opt/tier1data2019/brick):847:errlog] Popen: command returned error [{cmd=rsync
-aR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs
--existing --xattrs --acls --ignore-missing-args . -e ssh
-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
/var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S
/tmp/gsyncd-aux-ssh-0exuoeg7/75785990b3233f5dbbab9f43cc3ed895.sock
drtier1data:/proc/553418/cwd}, {error=3}]
Many thanks,
Anant
________________________________
From: Strahil Nikolov <hunter86_bg at yahoo.com>
Sent: 27 January 2024 5:25 AM
To: gluster-users at gluster.org <gluster-users at gluster.org>; Anant Saraswat <anant.saraswat at techblue.co.uk>
Subject: Re: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Don't forget to test with the georep key. I think it was
/var/lib/glusterd/geo-replication/secret.pem
Best Regards,
Strahil Nikolov
On Saturday, 27 January 2024 at 07:24:07 GMT+2, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
Hi Anant,
I would first start by checking if you can do ssh from all masters to the slave
node. If you haven't set up a dedicated user for the session, then gluster is
using root.
Best Regards,
Strahil Nikolov
On Friday, 26 January 2024 at 18:07:59 GMT+2, Anant Saraswat <anant.saraswat at techblue.co.uk> wrote:
Hi All,
I have run the following commands on master3, and that has added master3 to
geo-replication.
gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create
push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data stop
gluster volume geo-replication tier1data drtier1data::drtier1data start
Now I am able to start the geo-replication, but I am getting the same error.
[2024-01-24 19:51:24.80892] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-24 19:51:24.81020] I [monitor(monitor):160:monitor] Monitor: starting
gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-24 19:51:24.158021] I [resource(worker
/opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection
between master and slave...
[2024-01-24 19:51:25.951998] I [resource(worker
/opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between
master and slave established. [{duration=1.7938}]
[2024-01-24 19:51:25.952292] I [resource(worker
/opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume
locally...
[2024-01-24 19:51:26.986974] I [resource(worker
/opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume
[{duration=1.0346}]
[2024-01-24 19:51:26.987137] I [subcmds(worker
/opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn
successful. Acknowledging back to monitor
[2024-01-24 19:51:29.139131] I [master(worker
/opt/tier1data2019/brick):1662:register] _GMaster: Working dir
[{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-24 19:51:29.139531] I [resource(worker
/opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time
[{time=1706125889}]
[2024-01-24 19:51:29.173877] I [gsyncdstatus(worker
/opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change
[{status=Active}]
[2024-01-24 19:51:29.174407] I [gsyncdstatus(worker
/opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl
Status Change [{status=History Crawl}]
[2024-01-24 19:51:29.174558] I [master(worker
/opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl
[{turns=1}, {stime=(1705935991, 0)}, {etime=1706125889},
{entry_stime=(1705935991, 0)}]
[2024-01-24 19:51:30.251965] I [master(worker
/opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time
[{stime=(1705935991, 0)}]
[2024-01-24 19:51:30.376715] E [syncdutils(worker
/opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount
process exited [{error=ENOTCONN}]
[2024-01-24 19:51:30.991856] I [monitor(monitor):228:monitor] Monitor: worker
died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-24 19:51:30.993608] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change [{status=Faulty}]
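The ENOTCONN comes from the worker's auxiliary gluster mount dying rather than
from SSH, so the next things to check are the geo-replication mount log and
whether the volume still mounts by hand (a sketch; the mount-log filename is an
assumption derived from the working dir shown in the log above):
# tail -100 /var/log/glusterfs/geo-replication/tier1data_drtier1data_drtier1data/mnt-opt-tier1data2019-brick.log
# mount -t glusterfs localhost:/tier1data /mnt/georep-test
If the manual mount also drops with ENOTCONN, the problem is in the volume or its
bricks rather than in geo-replication itself.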
Any idea why it's stuck in this loop?
Thanks,
Anant
________________________________
From: Gluster-users <gluster-users-bounces at gluster.org> on behalf of
Anant Saraswat <anant.saraswat at techblue.co.uk>
Sent: 22 January 2024 9:00 PM
To: gluster-users at gluster.org
Subject: [Gluster-users] Geo-replication status is getting Faulty after few
seconds
Hi There,
We have a Gluster setup with three master nodes in replicated mode and one slave
node with geo-replication.
# gluster volume info
Volume Name: tier1data
Type: Replicate
Volume ID: 93c45c14-f700-4d50-962b-7653be471e27
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: master1:/opt/tier1data2019/brick
Brick2: master2:/opt/tier1data2019/brick
Brick3: master3:/opt/tier1data2019/brick
master1 |
master2 | ------------------------------ geo-replication ------------------------------> drtier1data
master3 |
We added the master3 node a few months back; the initial setup consisted of two
master nodes and one geo-replicated slave (drtier1data).
Geo-replication had been functioning well with the initial two master nodes
(master1 and master2), where master1 was active and master2 was passive. However,
today geo-replication suddenly stopped and became stuck in a loop of
Initializing... / Active / Faulty on master1, while master2 remained passive.
Upon checking the gsyncd.log on the master1 node, we observed the following
error (please refer to the attached logs for more details):
E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception]
<top>: Gluster Mount process exited [{error=ENOTCONN}]
# gluster volume geo-replication tier1data status
MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                             SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
-------------------------------------------------------------------------------------------------------------------------------------------------------------
master1        tier1data     /opt/tier1data2019/brick    root          ssh://drtier1data::drtier1data    N/A           Faulty     N/A             N/A
master2        tier1data     /opt/tier1data2019/brick    root          ssh://drtier1data::drtier1data    N/A           Passive    N/A             N/A
Suspecting an issue on drtier1data (the slave), I attempted to restart Gluster on
the slave node, and also tried rebooting the drtier1data server, without any luck.
After that, I ran the following command to get the primary log file for
geo-replication on master1, and got the following error:
# gluster volume geo-replication tier1data drtier1data::drtier1data config log-file
Staging failed on master3. Error: Geo-replication session between tier1data and
drtier1data::drtier1data does not exist.
geo-replication command failed
master3 was the new node added a few months back; geo-replication had been working
until today, and we never added this node to the geo-replication session.
After that, I forcefully stopped the geo-replication, thinking that restarting
geo-replication might fix the issue. However, now the geo-replication is not
starting and is giving the same error.
# gluster volume geo-replication tier1data drtier1data::drtier1data start force
Staging failed on master3. Error: Geo-replication session between tier1data and
drtier1data::drtier1data does not exist.
geo-replication command failed
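"Session does not exist" on a single node usually means that node is missing the
session config directory under /var/lib/glusterd/geo-replication/. A quick
comparison across the masters can confirm it (a sketch; the session directory
name is an assumption based on the naming used in the worker logs):
# for h in master1 master2 master3; do echo "== $h"; ssh $h ls /var/lib/glusterd/geo-replication/; done
If master3 lacks the tier1data_drtier1data_drtier1data directory, recreating the
session with create push-pem force (as was later done in this thread) regenerates it.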
Can anyone please suggest what I should do next to resolve this issue? As there
is 5TB of data in this volume, I don't want to resync the entire dataset to
drtier1data; I want to resume the sync from where it last stopped.
Thanks in advance for any guidance/help.
Kind regards,
Anant
________
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users