Thomas Bätzler
2015-Nov-03 14:07 UTC
[Gluster-users] Replacing a node in a 4x2 distributed/replicated setup
Hi,

Atin Mukherjee wrote:
> This could very well be related to op-version. Could you look at the
> faulty node's glusterd log and see the error log entries, that would
> give us the exact reason of failure.

op-version is 1 across all the nodes.

I've made some progress: by persistently wiping /var/lib/glusterd except for glusterd.info and restarting glusterd on the new node (roughly the sequence sketched further below), I've progressed to a state where all nodes agree that my replacement node is part of the cluster:

root@glucfshead2:~# for i in `seq 2 9`; do echo "glucfshead$i:"; ssh glucfshead$i "gluster peer status" | grep -A2 glucfshead9 ; done
glucfshead2:
Hostname: glucfshead9
Uuid: 040e61dd-fd02-4957-8833-cf5708b837f0
State: Peer in Cluster (Connected)
glucfshead3:
Hostname: glucfshead9
Uuid: 040e61dd-fd02-4957-8833-cf5708b837f0
State: Peer in Cluster (Connected)
glucfshead4:
Hostname: glucfshead9
Uuid: 040e61dd-fd02-4957-8833-cf5708b837f0
State: Peer in Cluster (Connected)
glucfshead5:
Hostname: glucfshead9
Uuid: 040e61dd-fd02-4957-8833-cf5708b837f0
State: Peer in Cluster (Connected)
glucfshead6:
Hostname: glucfshead9
Uuid: 040e61dd-fd02-4957-8833-cf5708b837f0
State: Peer in Cluster (Connected)
glucfshead7:
Hostname: glucfshead9
Uuid: 040e61dd-fd02-4957-8833-cf5708b837f0
State: Peer in Cluster (Connected)
glucfshead8:
Hostname: glucfshead9
Uuid: 040e61dd-fd02-4957-8833-cf5708b837f0
State: Peer in Cluster (Connected)

The new node sees all of the other nodes:

root@glucfshead9:~# gluster peer status
Number of Peers: 7

Hostname: glucfshead4.bo.rz.pixum.net
Uuid: 8547dadd-96bf-45fe-b49d-bab8f995c928
State: Peer in Cluster (Connected)

Hostname: glucfshead2
Uuid: 73596f88-13ae-47d7-ba05-da7c347f6141
State: Peer in Cluster (Connected)

Hostname: glucfshead3
Uuid: a17ae95d-4598-4cd7-9ae7-808af10fedb5
State: Peer in Cluster (Connected)

Hostname: glucfshead5.bo.rz.pixum.net
Uuid: 249da8ea-fda6-47ff-98e0-dbff99dcb3f2
State: Peer in Cluster (Connected)

Hostname: glucfshead6
Uuid: a0229511-978c-4904-87ae-7e1b32ac2c72
State: Peer in Cluster (Connected)

Hostname: glucfshead7
Uuid: 548ec75a-0131-4c92-aaa9-7c6ee7b47a63
State: Peer in Cluster (Connected)

Hostname: glucfshead8
Uuid: 5e54cbc1-482c-460b-ac38-00c4b71c50b9
State: Peer in Cluster (Connected)

The old nodes all agree that the to-be-replaced node is rejected and disconnected:

root@glucfshead2:~# for i in `seq 2 9`; do echo "glucfshead$i:"; ssh glucfshead$i "gluster peer status" | grep -B2 Rej ; done
glucfshead2:
Hostname: glucfshead1
Uuid: 09ed9a29-c923-4dc5-957a-e0d3e8032daf
State: Peer Rejected (Disconnected)
glucfshead3:
Hostname: glucfshead1
Uuid: 09ed9a29-c923-4dc5-957a-e0d3e8032daf
State: Peer Rejected (Disconnected)
glucfshead4:
Hostname: glucfshead1
Uuid: 09ed9a29-c923-4dc5-957a-e0d3e8032daf
State: Peer Rejected (Disconnected)
glucfshead5:
Hostname: glucfshead1
Uuid: 09ed9a29-c923-4dc5-957a-e0d3e8032daf
State: Peer Rejected (Disconnected)
glucfshead6:
Hostname: glucfshead1
Uuid: 09ed9a29-c923-4dc5-957a-e0d3e8032daf
State: Peer Rejected (Disconnected)
glucfshead7:
Hostname: glucfshead1
Uuid: 09ed9a29-c923-4dc5-957a-e0d3e8032daf
State: Peer Rejected (Disconnected)
glucfshead8:
Hostname: glucfshead1
Uuid: 09ed9a29-c923-4dc5-957a-e0d3e8032daf
State: Peer Rejected (Disconnected)
glucfshead9:

If I try to replace the downed brick with my new brick, it says it's successful:

root@glucfshead2:~# gluster volume replace-brick archive glucfshead1:/data/glusterfs/archive/brick1 glucfshead9:/data/glusterfs/archive/brick1/brick commit force
volume replace-brick: success: replace-brick commit successful
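For completeness, the wipe-and-restart sequence I mean above boils down to roughly the following on the new node. The /var/lib/glusterd layout and glusterd.info location are standard, but treat the exact commands (init-script name, the re-probe from glucfshead2) as a sketch of the idea rather than a verbatim transcript:

root@glucfshead9:~# service glusterd stop
# keep glusterd.info (it holds this node's UUID and operating-version),
# drop the rest of glusterd's on-disk state
root@glucfshead9:~# cd /var/lib/glusterd
root@glucfshead9:/var/lib/glusterd# find . -mindepth 1 ! -name glusterd.info -delete
root@glucfshead9:/var/lib/glusterd# service glusterd start

# then re-probe the new node from one of the existing peers:
root@glucfshead2:~# gluster peer probe glucfshead9
root@glucfshead2:~# gluster peer status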
However, on checking, the volume info still lists the broken brick:

root@glucfshead2:~# gluster volume info

Volume Name: archive
Type: Distributed-Replicate
Volume ID: d888b302-2a35-4559-9bb0-4e182f49f9c6
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: glucfshead1:/data/glusterfs/archive/brick1
Brick2: glucfshead5:/data/glusterfs/archive/brick1
Brick3: glucfshead2:/data/glusterfs/archive/brick1
Brick4: glucfshead6:/data/glusterfs/archive/brick1
Brick5: glucfshead3:/data/glusterfs/archive/brick1
Brick6: glucfshead7:/data/glusterfs/archive/brick1
Brick7: glucfshead4:/data/glusterfs/archive/brick1
Brick8: glucfshead8:/data/glusterfs/archive/brick1
Options Reconfigured:
cluster.data-self-heal: off
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
features.lock-heal: on
cluster.readdir-optimize: on
auth.allow: 172.16.15.*
performance.flush-behind: off
performance.io-thread-count: 16
features.quota: off
performance.quick-read: on
performance.stat-prefetch: off
performance.io-cache: on
performance.cache-refresh-timeout: 1
nfs.disable: on
performance.cache-max-file-size: 200kb
performance.cache-size: 2GB
performance.write-behind-window-size: 4MB
performance.read-ahead: off
storage.linux-aio: off
diagnostics.brick-sys-log-level: INFO
server.statedump-path: /var/tmp
cluster.self-heal-daemon: off

All of the old nodes complain loudly that they can't connect to glucfshead1:

[2015-11-03 13:54:59.422135] I [MSGID: 106004] [glusterd-handler.c:4398:__glusterd_peer_rpc_notify] 0-management: Peer 09ed9a29-c923-4dc5-957a-e0d3e8032daf, in Peer Rejected state, has disconnected from glusterd.
[2015-11-03 13:56:24.996215] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req
[2015-11-03 13:56:24.996283] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick commit-force request
[2015-11-03 13:56:25.016345] E [glusterd-rpc-ops.c:1087:__glusterd_stage_op_cbk] 0-management: Received stage RJT from uuid: 040e61dd-fd02-4957-8833-cf5708b837f0

The new server only logs "Stage failed":

[2015-11-03 13:56:25.015942] E [glusterd-op-sm.c:4585:glusterd_op_ac_stage_op] 0-management: Stage failed on operation 'Volume Replace brick', Status : -1

I tried to detach glucfshead1 since it's no longer online, but I only get a message that I can't do it since that server is still part of a volume.

Any further ideas that I could try?

TIA,
Thomas
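For what it's worth, two checks that might narrow this down (assuming the default glusterd log location and the standard /var/lib/glusterd layout; I have not run these yet, so take them as a sketch): the stage RJT above came from uuid 040e61dd-..., i.e. the new node glucfshead9, so its glusterd log around that timestamp should name the concrete reason for the rejection, and comparing the volume's configuration checksum across the peers shows whether they disagree about the volume definition, which is a common cause of "Peer Rejected" and staging failures.

# look for the error glucfshead9's glusterd logged when it rejected the stage op
root@glucfshead9:~# grep "2015-11-03 13:56" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | grep " E "

# the volume config checksum should be identical on every node
root@glucfshead2:~# for i in `seq 2 9`; do echo -n "glucfshead$i: "; ssh glucfshead$i "cat /var/lib/glusterd/vols/archive/cksum"; done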
Thomas Bätzler
2015-Nov-05 16:24 UTC
[Gluster-users] Replacing a node in a 4x2 distributed/replicated setup
Hi,

A small update: since nothing else worked, I broke down and changed the replacement system's IP and hostname to those of the broken system, replaced its UUID with that of the downed machine, and probed it back into the gluster cluster. I had to restart glusterd several times to make the other systems pick up the change. I then added the volume-id xattr to the new bricks as suggested on https://joejulian.name/blog/replacing-a-brick-on-glusterfs-340/. After that I was able to trigger a manual heal (roughly the sequence sketched below).

By tomorrow I may have some kind of estimate of how long the repair is going to take.

Bye,
Thomas
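For the archives, the identity swap and brick re-tagging amount to roughly the following. The UUID and volume ID are the ones shown earlier in this thread; the exact commands (init-script name, re-enabling the self-heal daemon) are a sketch along the lines of Joe Julian's post rather than a literal transcript of what I typed:

# on the replacement machine, after giving it glucfshead1's IP address and
# hostname (distribution-specific, not shown):
root@glucfshead1:~# service glusterd stop

# take over the dead peer's UUID (09ed9a29-... from the peer status output above)
root@glucfshead1:~# sed -i 's/^UUID=.*/UUID=09ed9a29-c923-4dc5-957a-e0d3e8032daf/' /var/lib/glusterd/glusterd.info

# tag the empty brick directory with the volume's ID
# (d888b302-2a35-4559-9bb0-4e182f49f9c6 from "gluster volume info", dashes removed)
root@glucfshead1:~# setfattr -n trusted.glusterfs.volume-id -v 0xd888b3022a3545599bb04e182f49f9c6 /data/glusterfs/archive/brick1

root@glucfshead1:~# service glusterd start

# cluster.self-heal-daemon is off in the volume options, so it presumably has
# to be switched back on before a full heal can run:
root@glucfshead1:~# gluster volume set archive cluster.self-heal-daemon on
root@glucfshead1:~# gluster volume heal archive full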