On 05/06/2021 14:36, Zenon Panoussis wrote:

>> What I'm really asking is: can I physically move a brick
>> from one server to another such as

> I can now answer my own question: yes, replica bricks are
> identical and can be physically moved or copied from one
> server to another. I have now done it a few times without
> any problems, though I made sure no healing was pending
> before the moves.

Well, if it's officially supported, that could be a really interesting
option to quickly scale big storage systems.

I'm thinking about our scenario: 3 servers, 36 12TB disks each. When
adding a new server (or another pair of servers, to keep an odd number)
it will require quite a lot of time to rebalance, with heavy
implications both on the IB network and on latency for the users. If we
could simply swap some disks around, it could be a lot faster.

Have you documented the procedure you followed?

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
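(As a rough sketch of the conventional path being weighed above: adding
the new server's bricks and then rebalancing. The volume name and brick
paths below are hypothetical, not taken from the thread, and a
distributed-replicate volume would need bricks added in multiples of
its replica count.)

# gluster peer probe node04
# gluster volume add-brick bigvol node04:/bricks/disk01/brick node04:/bricks/disk02/brick
# gluster volume rebalance bigvol start
# gluster volume rebalance bigvol status

It is the rebalance step that does the bulk data movement over the
network, which is exactly the cost the brick-move approach tries to
avoid.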
> it will require quite a lot of time to *rebalance*...

(my emphasis on "rebalance"). Just to avoid any misunderstandings, I am
talking about pure replica: no distributed replica and no arbitrated
replica. I guess that moving bricks would also work on a distributed
replica within, but not outside, each replica, but that's only a guess.

> Have you documented the procedure you followed?

I did several different things. I moved a brick from one path to
another on the same server, and I also moved a brick from one server to
another. The procedure in both cases is the same.

# gluster volume heal gv0 statistics heal-count

If all heal-count "Number of entries" values are 0,

# for n in node01 node02 node03; do ssh root@$n "systemctl stop glusterd"; done

(This is to prevent any writing to any node while the copy/move
operations are ongoing. It's not necessary if you have unmounted all
the clients.)

# ssh root@node04
# rsync -vvaz --progress node01:/gfsroot/gv0 /gfsroot/

node04 in the above example is the new node. It could also be a new
brick on an existing node, like

# mount /dev/sdnewdisk1 /gfsnewroot
# rsync -vva --progress /gfsroot/gv0 /gfsnewroot/

Once you have a full copy of the old brick in the new location, you can
just

# for n in node01 node02 node03 node04; do ssh root@$n "systemctl start glusterd"; done
# gluster volume add-brick gv0 replica 4 node04:/gfsroot
# gluster vol status
# gluster volume remove-brick gv0 replica 3 node01:/gfsroot

In this example I use add-brick first, before remove-brick, so as to
avoid the theoretical risk of split-brain on a 3-brick volume if it is
momentarily left with only two bricks. In real life you will either
have many more bricks than three, or you will have kicked out all
clients before this procedure, so the order of add and remove won't
matter.
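(A minimal sketch of how the "all heal counts are 0" precondition could
be scripted before touching any bricks. It assumes the heal-count
output prints one "Number of entries: N" line per brick, as recent
gluster versions do; check the exact format on your version.)

# gluster volume heal gv0 statistics heal-count \
    | awk '/Number of entries/ {sum += $NF} END {exit (sum > 0)}' \
    && echo "no pending heals, safe to copy bricks" \
    || echo "heals pending, do NOT move bricks yet"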
> Have you documented the procedure you followed?

There was a serious error in my previous reply to you:

rsync -vvaz --progress node01:/gfsroot/gv0 /gfsroot/

That should have been 'rsync -vvazH', and the "H" is very important.
Gluster uses hard links to map file UUIDs to file names, but rsync
without -H ignores hard links and copies the hardlinked data again into
a new, unrelated file, which breaks gluster's coupling of data to
metadata.

*

I have now also tried copying raw data on a three-brick replica cluster
(one brick per server) in a different way (do note the hostname of the
prompts below):

[root@node01 ~]# gluster volume status gv0
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node01:/vol/gfs/gv0                   49152     0          Y       35409
Brick node02:/vol/gfs/gv0                   49152     0          Y       6814
Brick node03:/vol/gfs/gv0                   49155     0          Y       21457

[root@node01 ~]# gluster volume heal gv0 statistics heal-count
(all 0)

[root@node02 ~]# umount 127.0.0.1:gv0
[root@node03 ~]# umount 127.0.0.1:gv0

[root@node01 ~]# gluster volume remove-brick gv0 replica 2 node03:/vol/gfs/gv0 force
[root@node01 ~]# gluster volume remove-brick gv0 replica 1 node02:/vol/gfs/gv0 force

You see here that, from node01 and with glusterd running on all three
nodes, I remove the other two nodes' bricks. This leaves volume gv0
with one single brick and imposes a quorum of 1 (thank you Strahil for
this idea, albeit differently implemented here).

Now, left with a volume of only one single brick, I copy the data to it
on node01:

[root@node01 ~]# rsync -vva /datasource/blah 127.0.0.1:gv0/

This is fast. It is almost as fast as copying from one partition to
another on the same disk, because within gluster there is no network
overhead of nodes having to exchange multiple system calls with each
other before they can write a file. And there is no latency. System-call
latency of ~200 ms, back and forth multiple times per file, is what is
killing me (because of ADSL and 4,000 km between my node01 and the
other two), so this eliminates that problem.

In the next step I copied the raw gluster volume data to the other two
nodes. This is where 'rsync -H' is important:

[root@node02 ~]# rsync -vvazH node01:/vol/gfs/gv0 /vol/gfs/
[root@node03 ~]# rsync -vvazH node02:/vol/gfs/gv0 /vol/gfs/

This is also fast; it copies raw data from A to B without any
communication needing to travel back and forth between every node and
every other node. Hence, no stonewalling from latency multiplying
across nodes.

Finally, when all the raw data is in place on all three nodes,

[root@node01 www]# gluster volume add-brick gv0 replica 2 node02:/vol/gfs/gv0 force
[root@node01 www]# gluster volume add-brick gv0 replica 3 node03:/vol/gfs/gv0 force

For comparison: copying a mail store of about 1.1 million small and
very small files, ~80 GB in total, to this same gluster volume the
normal way took me from the first days of January to early May. Four
months! Copying about 200,000 mostly small files yesterday, ~38 GB in
total, with the somewhat unorthodox method above took 12 hours from
start to finish, including the transfer over ADSL.
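(A quick sanity check for the -H issue, as a sketch: on a healthy brick
every regular data file should carry at least two hard links, its own
name plus its entry under the brick's .glusterfs directory. So after
the rsync, any data file with a link count of 1 suggests the copy was
made without -H. The brick path below matches the example above; adapt
it to your layout.)

[root@node02 ~]# find /vol/gfs/gv0 -path /vol/gfs/gv0/.glusterfs -prune \
    -o -type f -links 1 -print | head

If this prints nothing, the hard-link structure survived the copy; if
it lists most of your files, redo the rsync with -H.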
Confirmed for gluster 7.9 on a distributed-replicate and a pure
replicate volume.

One of my 3 nodes died :( I removed all bricks from the dead node and
added them to a new node. I then started to add an arbiter, as the
distributed-replicate volume is configured with 2 replicas and 1
arbiter. I made sure to use the exact same mount point and path, and
double/triple checked that the bricks had exactly the same file content
in any given directory as the running bricks they were about to be
paired with again. Then I used the replace-brick command to replace
dead-node:brick0 with new-node:brick0, and did this one by one for all
bricks.

It took a while to get the replacement node up and running, so the
cluster was still operational and in use. When all bricks were finally
moved, the self-heal daemon started healing several files. Everything
worked out perfectly and with no downtime. Finally I detached the dead
node. Done.

A.

On Wednesday, 09.06.2021 at 15:17 +0200, Diego Zuccato wrote:
> On 05/06/2021 14:36, Zenon Panoussis wrote:
>
> > > What I'm really asking is: can I physically move a brick
> > > from one server to another such as
>
> > I can now answer my own question: yes, replica bricks are
> > identical and can be physically moved or copied from one
> > server to another. I have now done it a few times without
> > any problems, though I made sure no healing was pending
> > before the moves.
>
> Well, if it's officially supported, that could be a really interesting
> option to quickly scale big storage systems.
> I'm thinking about our scenario: 3 servers, 36 12TB disks each. When
> adding a new server (or another pair of servers, to keep an odd number)
> it will require quite a lot of time to rebalance, with heavy
> implications both on the IB network and on latency for the users. If we
> could simply swap some disks around, it could be a lot faster.
> Have you documented the procedure you followed?
>
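(As a rough sketch of the replace-and-detach sequence described above;
the volume name, hostnames and brick paths are placeholders, not taken
from the post.)

# gluster volume replace-brick gv0 dead-node:/bricks/brick0 new-node:/bricks/brick0 commit force
(repeat for each brick, then let the self-heal daemon finish)
# gluster volume heal gv0 statistics heal-count
# gluster peer detach dead-node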