On 05/06/2021 14:36, Zenon Panoussis wrote:

>> What I'm really asking is: can I physically move a brick
>> from one server to another such as

> I can now answer my own question: yes, replica bricks are
> identical and can be physically moved or copied from one
> server to another. I have now done it a few times without
> any problems, though I made sure no healing was pending
> before the moves.

Well, if it's officially supported, that could be a really interesting
option to quickly scale big storage systems.

I'm thinking about our scenario: 3 servers, 36 12TB disks each. When
adding a new server (or another pair of servers, to keep an odd number)
it will require quite a lot of time to rebalance, with heavy
implications both on the IB network and on latency for the users. If we
could simply swap some disks around, it could be a lot faster.

Have you documented the procedure you followed?

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
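(As a rough sketch of the conventional path being weighed above: adding
the new server's bricks and then rebalancing. The volume name and brick
paths below are hypothetical, not taken from the thread, and a
distributed-replicate volume would need bricks added in multiples of
its replica count.)

# gluster peer probe node04
# gluster volume add-brick bigvol node04:/bricks/disk01/brick node04:/bricks/disk02/brick
# gluster volume rebalance bigvol start
# gluster volume rebalance bigvol status

It is the rebalance step that does the bulk data movement over the
network, which is exactly the cost the brick-move approach tries to
avoid.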
> it will require quite a lot of time to *rebalance*...

(my emphasis on "rebalance"). Just to avoid any misunderstandings, I am
talking about pure replica: no distributed replica and no arbitrated
replica. I guess that moving bricks would also work on a distributed
replica within, but not outside, each replica, but that's only a guess.

> Have you documented the procedure you followed?

I did several different things. I moved a brick from one path to
another on the same server, and I also moved a brick from one server to
another. The procedure in both cases is the same.

# gluster volume heal gv0 statistics heal-count

If all heal-count "Number of entries" values are 0,

# for n in node01 node02 node03; do ssh root@$n "systemctl stop glusterd"; done

(This is to prevent any writing to any node while the copy/move
operations are ongoing. It's not necessary if you have unmounted all
the clients.)

# ssh root@node04
# rsync -vvaz --progress node01:/gfsroot/gv0 /gfsroot/

node04 in the above example is the new node. It could also be a new
brick on an existing node, like

# mount /dev/sdnewdisk1 /gfsnewroot
# rsync -vva --progress /gfsroot/gv0 /gfsnewroot/

Once you have a full copy of the old brick in the new location, you can
just

# for n in node01 node02 node03 node04; do ssh root@$n "systemctl start glusterd"; done
# gluster volume add-brick gv0 replica 4 node04:/gfsroot
# gluster vol status
# gluster volume remove-brick gv0 replica 3 node01:/gfsroot

In this example I use add-brick first, before remove-brick, so as to
avoid the theoretical risk of split-brain on a 3-brick volume if it is
momentarily left with only two bricks. In real life you will either
have many more bricks than three, or you will have kicked out all
clients before this procedure, so the order of add and remove won't
matter.
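(A minimal sketch of how the "all heal counts are 0" precondition could
be scripted before touching any bricks. It assumes the heal-count
output prints one "Number of entries: N" line per brick, as recent
gluster versions do; check the exact format on your version.)

# gluster volume heal gv0 statistics heal-count \
    | awk '/Number of entries/ {sum += $NF} END {exit (sum > 0)}' \
    && echo "no pending heals, safe to copy bricks" \
    || echo "heals pending, do NOT move bricks yet"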
> Have you documented the procedure you followed?

There was a serious error in my previous reply to you:

rsync -vvaz --progress node01:/gfsroot/gv0 /gfsroot/

That should have been 'rsync -vvazH', and the "H" is very important.
Gluster uses hard links to map file UUIDs to file names, but rsync
without -H ignores hard links and copies the hardlinked data again into
a new, unrelated file, which breaks gluster's coupling of data to
metadata.

*

I have now also tried copying raw data on a three-brick replica cluster
(one brick per server) in a different way (do note the hostname of the
prompts below):

[root@node01 ~]# gluster volume status gv0
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node01:/vol/gfs/gv0                   49152     0          Y       35409
Brick node02:/vol/gfs/gv0                   49152     0          Y       6814
Brick node03:/vol/gfs/gv0                   49155     0          Y       21457

[root@node01 ~]# gluster volume heal gv0 statistics heal-count
(all 0)

[root@node02 ~]# umount 127.0.0.1:gv0
[root@node03 ~]# umount 127.0.0.1:gv0

[root@node01 ~]# gluster volume remove-brick gv0 replica 2 node03:/vol/gfs/gv0 force
[root@node01 ~]# gluster volume remove-brick gv0 replica 1 node02:/vol/gfs/gv0 force

You see here that, from node01 and with glusterd running on all three
nodes, I remove the other two nodes' bricks. This leaves volume gv0
with one single brick and imposes a quorum of 1 (thank you Strahil for
this idea, albeit differently implemented here).

Now, left with a volume of only one single brick, I copy the data to it
on node01:

[root@node01 ~]# rsync -vva /datasource/blah 127.0.0.1:gv0/

This is fast. It is almost as fast as copying from one partition to
another on the same disk, because within gluster there is no network
overhead of nodes having to exchange multiple system calls with each
other before they can write a file. And there is no latency. System-call
latency of ~200 ms, back and forth multiple times per file, is what is
killing me (because of ADSL and 4,000 km between my node01 and the
other two), so this eliminates that problem.

In the next step I copied the raw gluster volume data to the other two
nodes. This is where 'rsync -H' is important:

[root@node02 ~]# rsync -vvazH node01:/vol/gfs/gv0 /vol/gfs/
[root@node03 ~]# rsync -vvazH node02:/vol/gfs/gv0 /vol/gfs/

This is also fast; it copies raw data from A to B without any
communication needing to travel back and forth between every node and
every other node. Hence, no stonewalling from latency multiplying
across nodes.

Finally, when all the raw data is in place on all three nodes,

[root@node01 www]# gluster volume add-brick gv0 replica 2 node02:/vol/gfs/gv0 force
[root@node01 www]# gluster volume add-brick gv0 replica 3 node03:/vol/gfs/gv0 force

For comparison: copying a mail store of about 1.1 million small and
very small files, ~80 GB in total, to this same gluster volume the
normal way took me from the first days of January to early May. Four
months! Copying about 200,000 mostly small files yesterday, ~38 GB in
total, with the somewhat unorthodox method above took 12 hours from
start to finish, including the transfer over ADSL.
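(A quick sanity check for the -H issue, as a sketch: on a healthy brick
every regular data file should carry at least two hard links, its own
name plus its entry under the brick's .glusterfs directory. So after
the rsync, any data file with a link count of 1 suggests the copy was
made without -H. The brick path below matches the example above; adapt
it to your layout.)

[root@node02 ~]# find /vol/gfs/gv0 -path /vol/gfs/gv0/.glusterfs -prune \
    -o -type f -links 1 -print | head

If this prints nothing, the hard-link structure survived the copy; if
it lists most of your files, redo the rsync with -H.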
Confirmed for gluster 7.9 on a distributed-replicate and a pure
replicate volume.

One of my 3 nodes died :( I removed all bricks from the dead node and
added them to a new node. I then started to add an arbiter, as the
distributed-replicate volume is configured with 2 replicas and 1
arbiter. I made sure to use the exact same mount point and path, and
double/triple checked that the bricks had exactly the same file content
in any given directory as the running bricks they were about to be
paired with again. Then I used the replace-brick command to replace
dead-node:brick0 with new-node:brick0, and did this one by one for all
bricks.

It took a while to get the replacement node up and running, so the
cluster was still operational and in use. When all bricks were finally
moved, the self-heal daemon started healing several files. Everything
worked out perfectly and with no downtime. Finally I detached the dead
node. Done.

A.

On Wednesday, 09.06.2021 at 15:17 +0200, Diego Zuccato wrote:
> On 05/06/2021 14:36, Zenon Panoussis wrote:
>
> > > What I'm really asking is: can I physically move a brick
> > > from one server to another such as
>
> > I can now answer my own question: yes, replica bricks are
> > identical and can be physically moved or copied from one
> > server to another. I have now done it a few times without
> > any problems, though I made sure no healing was pending
> > before the moves.
>
> Well, if it's officially supported, that could be a really interesting
> option to quickly scale big storage systems.
> I'm thinking about our scenario: 3 servers, 36 12TB disks each. When
> adding a new server (or another pair of servers, to keep an odd number)
> it will require quite a lot of time to rebalance, with heavy
> implications both on the IB network and on latency for the users. If we
> could simply swap some disks around, it could be a lot faster.
> Have you documented the procedure you followed?
>
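(As a rough sketch of the replace-and-detach sequence described above;
the volume name, hostnames and brick paths are placeholders, not taken
from the post.)

# gluster volume replace-brick gv0 dead-node:/bricks/brick0 new-node:/bricks/brick0 commit force
(repeat for each brick, then let the self-heal daemon finish)
# gluster volume heal gv0 statistics heal-count
# gluster peer detach dead-node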