Erik Jacobson
2021-Mar-23 14:43 UTC
[Gluster-users] Gluster usage scenarios in HPC cluster management
> I still have to grasp the "leader node" concept.
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> mentioned in the fstab entry like
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> while the peer list includes l1,l2,l3 and a bunch of other nodes?

Right, it's a list of 24 peers. The 24 peers are split into a 3-way
replicated, distributed setup for the volumes. They also have entries
for themselves as clients in /etc/fstab. I'll dump some volume info at
the end of this.
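To make that concrete, a client-side mount entry of that shape would
look roughly like the sketch below (the hostnames and mount point are
made up for illustration, not copied from our configs; the
backup-volfile-servers option only matters while the client fetches the
volfile at mount time, after which the native client talks to all
bricks directly):

    # /etc/fstab sketch: fetch the volfile from leader1, fall back to
    # leader2/leader3 if leader1 is down at mount time
    leader1:/cm_shared  /mnt/cm_shared  glusterfs  defaults,_netdev,backup-volfile-servers=leader2:leader3  0 0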
> > So we would have 24 leader nodes, each leader would have a disk serving
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > one is for logs, and one is heavily optimized for non-object expanded
> > tree NFS). The term "disk" is loose.
> That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
> 36 bricks per node).

I have one dedicated "disk" (could be a disk, a RAID LUN, or a single
SSD) and 4 directories for volumes ("bricks"). Of course, the "ctdb"
volume is just for the lock and has a single file.

> > > Specs of a leader node at a customer site:
> > > * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)

I'm not an expert in memory pools or how they would be impacted by more
peers. I had to do a little research, and I think what you're after is
whether I can run

    gluster volume status cm_shared mem

on a real cluster that has a decent node count. I will see if I can do
that.

TEST ENV INFO for those who care
--------------------------------
Here is some info on my own test environment, which you can skip.

I have the environment duplicated on my desktop using virtual machines
and it runs fine (slow, but fine). It's a 3x1. I take out my giant 8GB
cache from the optimized volumes, but other than that it is fine. In my
development environment, the gluster disk is a 40G qcow2 image. Cache
sizes are changed from 8G to 100M to fit in the VM.

XML snips for memory, cpus:

  <domain type='kvm' id='24'>
    <name>cm-leader1</name>
    <uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
    <memory unit='KiB'>3268608</memory>
    <currentMemory unit='KiB'>3268608</currentMemory>
    <vcpu placement='static'>2</vcpu>
    <resource>
  ......

I have 1 admin (head) node VM, 3 leader node VMs like the above, and
one test compute node for my development environment.

My desktop where I test this cluster stack is a beefy but not brand new
desktop:

  Architecture:        x86_64
  CPU op-mode(s):      32-bit, 64-bit
  Byte Order:          Little Endian
  Address sizes:       46 bits physical, 48 bits virtual
  CPU(s):              16
  On-line CPU(s) list: 0-15
  Thread(s) per core:  2
  Core(s) per socket:  8
  Socket(s):           1
  NUMA node(s):        1
  Vendor ID:           GenuineIntel
  CPU family:          6
  Model:               79
  Model name:          Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  Stepping:            1
  CPU MHz:             2594.333
  CPU max MHz:         3000.0000
  CPU min MHz:         1200.0000
  BogoMIPS:            4190.22
  Virtualization:      VT-x
  L1d cache:           32K
  L1i cache:           32K
  L2 cache:            256K
  L3 cache:            20480K
  NUMA node0 CPU(s):   0-15
  <SNIP>

(Not that it matters, but this is an HP Z640 Workstation.) It has 128G
of memory (good for a desktop, I know, but I think 64G would work since
I also run a Windows 10 VM for unrelated reasons).

I was able to find a MegaRAID in the lab a few years ago, so I have 4
drives in a MegaRAID and carve off a separate volume for the VM disk
images. It has a cache, so that's also more beefy than a normal
desktop. (On the other hand, I have no SSDs. I may experiment with that
some day, but things work so well now that I'm tempted to leave it
alone until something croaks. :)

I keep all VMs for the test cluster in "unsafe" cache mode, since there
is no real data to worry about and it makes the test cases faster.

So I am able to test a complete cluster management stack (three gluster
leader servers, an admin node, and a compute node) all on my desktop
using virtual machines and shared networks within libvirt/qemu. It is
so much easier to do development when you don't have to reserve scarce
test clusters and compete with people. I can do 90% of my cluster
development work this way. Things fall over when I need to care about
BMCs/iLOs or need to do performance testing, of course. Then I move to
real hardware and play the hunger-games-of-internal-test-resources :) :)

I mention all this just to show that beefy servers are not needed and
that the memory usage is not high. I'm not continually swapping or
anything like that.

Configuration Info from Real Machine
------------------------------------
Some info on an active 3x3 cluster with 2738 compute nodes. The most
active volume here is "cm_obj_sharded". It is where the image objects
live, and this cluster uses image objects for compute node root
filesystems. I changed the IP addresses by hand (in case I made an
error doing that).

Memory status for volume cm_obj_sharded, from "gluster volume status
cm_obj_sharded mem" (mallinfo per brick; every brick path is
/data/brick_cm_obj_sharded):

  Brick       Arena     Ordblks  Smblks  Hblks  Hblkhd    Usmblks  Fsmblks  Uordblks  Fordblks  Keepcost
  10.1.0.5    20676608  2077     518     17     17350656  0        53728    5223376   15453232  127616
  10.1.0.6    21409792  2424     604     17     17350656  0        62304    5468096   15941696  127616
  10.1.0.7    24240128  2471     563     17     17350656  0        58832    5565360   18674768  127616
  10.1.0.8    22454272  2575     528     17     17350656  0        53920    5583712   16870560  127616
  10.1.0.9    22835200  2493     570     17     17350656  0        59728    5424992   17410208  127616
  10.1.0.10   23085056  2717     697     17     17350656  0        74016    5631520   17453536  127616
  10.1.0.11   26537984  3044     985     17     17350656  0        103056   5702592   20835392  127616
  10.1.0.12   23556096  2658     735     17     17350656  0        78720    5568736   17987360  127616
  10.1.0.13   26050560  3064     926     17     17350656  0        96816    5807312   20243248  127616

Volume configuration details for this one:

  Volume Name: cm_obj_sharded
  Type: Distributed-Replicate
  Volume ID: 76c30b65-7194-4af2-80f7-bf876f426e5a
  Status: Started
  Snapshot Count: 0
  Number of Bricks: 3 x 3 = 9
  Transport-type: tcp
  Bricks:
  Brick1: 10.1.0.5:/data/brick_cm_obj_sharded
  Brick2: 10.1.0.6:/data/brick_cm_obj_sharded
  Brick3: 10.1.0.7:/data/brick_cm_obj_sharded
  Brick4: 10.1.0.8:/data/brick_cm_obj_sharded
  Brick5: 10.1.0.9:/data/brick_cm_obj_sharded
  Brick6: 10.1.0.10:/data/brick_cm_obj_sharded
  Brick7: 10.1.0.11:/data/brick_cm_obj_sharded
  Brick8: 10.1.0.12:/data/brick_cm_obj_sharded
  Brick9: 10.1.0.13:/data/brick_cm_obj_sharded
  Options Reconfigured:
  nfs.rpc-auth-allow: 10.1.*
  auth.allow: 10.1.*
  performance.client-io-threads: on
  nfs.disable: off
  storage.fips-mode-rchecksum: on
  transport.address-family: inet
  performance.cache-size: 8GB
  performance.flush-behind: on
  performance.cache-refresh-timeout: 60
  performance.nfs.io-cache: on
  nfs.nlm: off
  nfs.export-volumes: on
  nfs.export-dirs: on
  nfs.exports-auth-enable: on
  transport.listen-backlog: 16384
  nfs.mount-rmtab: /-
  performance.io-thread-count: 32
  server.event-threads: 32
  nfs.auth-refresh-interval-sec: 360
  nfs.auth-cache-ttl-sec: 360
  features.shard: on
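For anyone wanting to reproduce a similar layout, a volume of this
shape could be created and tuned roughly as follows. This is only a
sketch, not the exact commands we used; the full option list above is
the authoritative reference:

    # Create a 3 x 3 distributed-replicated volume across the nine leader
    # bricks. With replica 3, gluster groups the bricks into replica sets
    # of three in the order they are listed.
    gluster volume create cm_obj_sharded replica 3 \
        10.1.0.5:/data/brick_cm_obj_sharded 10.1.0.6:/data/brick_cm_obj_sharded \
        10.1.0.7:/data/brick_cm_obj_sharded 10.1.0.8:/data/brick_cm_obj_sharded \
        10.1.0.9:/data/brick_cm_obj_sharded 10.1.0.10:/data/brick_cm_obj_sharded \
        10.1.0.11:/data/brick_cm_obj_sharded 10.1.0.12:/data/brick_cm_obj_sharded \
        10.1.0.13:/data/brick_cm_obj_sharded

    # Enable sharding and the large client-side cache, then start it.
    gluster volume set cm_obj_sharded features.shard on
    gluster volume set cm_obj_sharded performance.cache-size 8GB
    gluster volume start cm_obj_sharded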
There are 3 other volumes (this is the only sharded one). I can provide
more info if desired.

Typical boot time for 3k nodes and 9 leaders, ignoring BIOS setup time,
is 2-5 minutes. The power of the image objects is what makes that fast.
An expanded-tree (traditional) NFS export, where the whole directory
tree is exported and used file by file, would be more like 9-12
minutes.

Erik
Ewen Chan
2021-Mar-24 01:32 UTC
[Gluster-users] Gluster usage scenarios in HPC cluster management
Erik:

I just want to say that I really appreciate you sharing this
information with us.

I don't think my personal home-lab micro cluster will ever get
complicated enough to need a virtualized testing/Gluster development
setup like yours, but on the other hand, as I mentioned before, I am
running 100 Gbps InfiniBand, so what I am trying to use Gluster for is
quite different from what and how most people deploy Gluster for
production systems.

If I wanted to splurge, I'd get a second set of IB cables so that the
high-speed interconnect could be split, with jobs running on one layer
of the InfiniBand fabric while storage/Gluster runs on another. But
that would mean revamping my entire micro cluster, so there are no
plans to do that just yet.

Thank you.

Sincerely,
Ewen