Ewen Chan
2021-Mar-19 16:04 UTC
[Gluster-users] Gluster usage scenarios in HPC cluster management
Erik:

Thank you for sharing your insights into how Gluster is used in a
professional, production environment (or at least how HPE is using it,
internally and/or for your clients). I really appreciated reading this.

I am just a home lab user and I have a very tiny 4-node micro cluster
for HPC/CAE applications, and where I run into issues with storage is
that NAND flash based SSDs have finite write endurance limits, which,
for a home lab user, gets expensive.

But I've also tested using tmpfs (allocating half of the RAM per compute node)
and exporting that as a distributed striped GlusterFS volume over NFS over
RDMA to the 100 Gbps IB network so that the "ramdrives" can be used as a high
speed "scratch disk space" that doesn't have the write endurance limits that
NAND based flash memory SSDs have.
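Roughly, that setup looks like the following (node names, sizes, and
brick paths are illustrative, and I show a plain distributed volume
here since striped volumes are deprecated in recent Gluster releases):

    # On each of the four compute nodes: carve a brick out of RAM
    mount -t tmpfs -o size=64g tmpfs /bricks/ram
    mkdir -p /bricks/ram/scratch

    # From one node: a distributed volume over TCP and RDMA
    gluster volume create scratch transport tcp,rdma \
        n1:/bricks/ram/scratch n2:/bricks/ram/scratch \
        n3:/bricks/ram/scratch n4:/bricks/ram/scratch
    gluster volume set scratch nfs.disable off   # keep the gnfs export
    gluster volume start scratch

    # Clients then mount the NFS export (gnfs is NFSv3); over IB this
    # can ride IPoIB, or NFSoRDMA where the NFS stack supports it
    mount -t nfs -o vers=3 n1:/scratch /mnt/scratch

Since the bricks live in RAM, the volume has to be rebuilt (or the
bricks re-seeded) after any power cycle.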
Yes, it isn't as reliable, and it certainly isn't highly available (if
the power goes down and the battery backup is exhausted, the data is
lost because it sat in RAM), but it's meant to solve three problems at
once: mechanically rotating hard drives are too slow, NAND flash based
SSDs have finite write endurance limits, and RAM drives, whilst faster
in theory, are also the most expensive on a $/GB basis compared to the
other storage solutions.

It's rather unfortunate that you have these different "tiers" of
storage, and there's really nothing else in between that can help
address all of these issues simultaneously.

Thank you for sharing your thoughts.

Sincerely,

Ewen Chan

________________________________
From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Erik Jacobson <erik.jacobson at hpe.com>
Sent: March 19, 2021 11:03 AM
To: gluster-users at gluster.org <gluster-users at gluster.org>
Subject: [Gluster-users] Gluster usage scenarios in HPC cluster management

A while back I was asked to make a blog or something similar to discuss
the use cases the team I work on (HPCM cluster management) at HPE.

If you are not interested in reading about what I'm up to, just delete
this and move on.

I really don't have a public blogging mechanism so I'll just describe
what we're up to here. Some of this was posted in some form in the past.
Since this contains the raw materials, I could make a wiki-ized version
if there were a public place to put it.

We currently use gluster in two parts of cluster management.

In fact, gluster in our management node infrastructure is helping us to
provide scaling and consistency to some of the largest clusters in the
world, clusters in the TOP100 list. While I can get in to trouble by
sharing too much, I will just say that trends are continuing and the
future may have some exciting announcements on where on TOP100 certain
new giant systems may end up in the coming 1-2 years.

At HPE, HPCM is the "traditional cluster manager." There is another team
that develops a more cloud-like solution and I am not discussing that
solution here.


Use Case #1: Leader Nodes and Scale Out
------------------------------------------------------------------------------
- Why?
  * Scale out
  * Redundancy (combined with CTDB, any leader can fail)
  * Consistency (all servers and compute nodes agree on what the
    content is)

- Cluster manager has an admin or head node and zero or more leader
  nodes

- Leader nodes are provisioned in groups of 3 to use distributed
  replica-3 volumes (although at least one customer has interest in
  replica-5)

- We configure a few different volumes for different use cases

- We still use Gluster NFS because, over a year ago, Ganesha was not
  working with our workload and we haven't had time to re-test and
  engage with the community. No blame - we would also owe making sure
  our settings are right.

- We use CTDB for a measure of HA and IP alias management. We use this
  instead of pacemaker to reduce complexity.

- The volume use cases are (a creation sketch follows this list):
  * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
    -> Normally squashfs image files for speed/efficiency, exported
       over NFS
    -> Expanded ("chrootable") traditional NFS trees for people who
       prefer that, but they don't scale as well and are slower to boot
    -> Squashfs images sit on a sharded volume while a traditional
       volume is used for the expanded trees
  * TFTP/HTTP for network boot/PXE including miniroot
    -> Spread across leaders too, so one node is not saturated with
       PXE/DHCP requests
    -> Miniroot is a "fatter initrd" that has our CM toolchain
  * Logs/consoles
    -> For traditional logs and consoles (HPCM also uses
       elasticsearch/kafka/friends but we don't put that in gluster)
    -> Separate volume to have more non-cached friendly settings
  * 4 total volumes used (one sharded, one heavily optimized for
    caching, one for the ctdb lock, and one traditional for logging/etc)
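To give a flavor of what creating these looks like, a rough sketch
(hostnames, brick paths, and the shard size are illustrative, not our
exact settings):

    # Sharded replica-3 volume for the squashfs image objects
    gluster volume create images replica 3 \
        leader1:/data/brick/images leader2:/data/brick/images \
        leader3:/data/brick/images
    gluster volume set images features.shard on
    gluster volume set images features.shard-block-size 64MB
    gluster volume set images nfs.disable off   # gnfs export
    gluster volume start images

With more leader groups, the same create command simply takes more
bricks (in multiples of 3) and becomes a distributed-replicate volume.
The traditional volumes for expanded trees and logging are created the
same way, minus the shard options.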
- Leader Setup
  * Admin node installs the leaders like any other compute nodes
  * A setup tool then configures the gluster volumes and CTDB
  * When ready, an admin/head node can be engaged with the leaders
  * At that point, certain paths on the admin become gluster fuse mounts
    and bind mounts to gluster fuse mounts

- How images are deployed (squashfs mode; a sketch follows)
  * User creates an image using image creation tools that make a
    chrootable tree style image on the admin/head node
  * mksquashfs generates a squashfs image file on to a shared storage
    gluster mount
  * Nodes mount the filesystem with the squashfs images and then loop
    mount the squashfs as part of the boot process
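The moving parts, stripped down (paths and names are illustrative; the
real tooling wraps all of this):

    # Admin node: squash the chrootable tree into an image object
    # that lands on the shared gluster storage
    mksquashfs /opt/images/rhel8.3 /gluster/images/rhel8.3.squashfs

    # Compute node, from the miniroot during boot:
    mount -t nfs -o ro,vers=3 leader-alias:/images /images
    mount -o loop,ro /images/rhel8.3.squashfs /rootfs.ro
    # a tmpfs overlay is then layered on /rootfs.ro to give the node
    # a writable (non-persistent) root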
- How are compute nodes tied to leaders
  * We simply have a variable in our database where human or automated
    discovery tools can assign a given node to a given IP alias. This
    works better for us than trying to play routing tricks or load
    balancing tricks
  * When compute nodes PXE, the DHCP response includes next-server and
    the node uses the leader IP alias for the tftp/http that fetches
    the boot loader. DHCP config files are on shared storage to
    facilitate future scaling of DHCP services.
  * ipxe or grub2 network config files then fetch the kernel and initrd
  * initrd has a small update to load a miniroot (install environment)
    which has more tooling
  * Node is installed (for nodes with root disks) or does a network
    boot cycle

- Gluster sizing
  * We typically state compute nodes per leader but this is not for
    gluster per se. Squashfs image objects are very efficient and would
    probably be fine with 2k nodes per leader. Leader nodes provide
    other services including console logs, system logs, and monitoring
    services.
  * Our biggest deployment at a customer site right now has 24 leader
    nodes. Bigger systems are coming.

- Startup scripts - Getting all the gluster mounts and many bind mounts
  used in the solution, as well as ensuring gluster mounts and the ctdb
  lock are available before ctdb starts, was too painful for my brain.
  So we have systemd startup scripts that sanity test and start things
  gracefully.

- Future: The team is starting to test what a 96-leader (96 gluster
  servers) setup might look like for future exascale systems.

- Future: Some customers have interest in replica-5 instead of
  replica-3. We want to test the performance implications.

- Future: Move to Ganesha, working with the community if needed

- Future: Work with the Gluster community to make gluster fuse mounts
  efficient enough to use instead of NFS (may be easier with image
  objects than it was the old way with fully expanded trees for
  images!)

- Challenges:
  * Every month someone tells me gluster is going to die because of Red
    Hat vs IBM and I have to justify things. It is getting harder.
  * Giant squashfs images fail - mksquashfs reports an error - at
    around 90GB on sles15sp2 and sles15sp3. rhel8 does not suffer. We
    don't have the bandwidth to dig in right now, but one upset
    customer. Workarounds were provided to move the development
    environment for that customer out of the operating system image.
  * Since we have our own build and special use cases, we're on our own
    for support (by "on our own" I mean no paid support, community help
    only). Our complex situations can produce some cases that you guys
    don't see, and debugging them can take a month or more with the
    volunteer nature of the community. Paying for support is harder,
    even if it were viable politically, since we support 6 distros and
    3 distro providers. Of course, paying for support is never the
    preference of management. It might be an interesting thing to
    investigate.
  * Any gluster update is terror. We don't update much because finding
    a gluster version that is stable for all our use cases PLUS being
    able to test at scale, which means thousands of nodes, is hard. We
    did some internal improvements here where we emulate a 2500-node
    cluster using virtual machines on a much smaller cluster. However,
    it isn't ideal. So we start lagging the community over time until
    some problem forces us to refresh. Then we tip-toe in to the
    update. We most recently updated to gluster79 and it solved several
    problems related to use case #2 below.
  * Due to lack of HW, testing the SU Leader solution is hard because
    of the limited number of internal clusters. However, I recently
    moved my primary development to my beefy desktop where I have a
    full cluster stack including 3 leader nodes with gluster running in
    virtual machines. So we have eliminated an excuse preventing
    internal people from playing with the solution stack.
  * Growing volumes, replacing bricks, and replacing servers work.
    However, the process is very involved and quirky for us. I have
    complicated code that has to do more than I'd really like just to
    wheel in a complete leader replacement for a failed one. Even with
    our tools, we often end up with some glusterd's not starting right
    and have to restart a few times to get a node or brick replacement
    going. I wish the process were as simple as running a single
    command or set of commands and having it do the right thing:
    -> Finding two leaders to get a complete set of peer files
    -> Wedging in the replacement node with the UUID the node in that
       position had before
    -> Don't accidentally give a gluster peer its own peer file
    -> Then go through an involved replace-brick procedure
    -> "replace" vs "reset"
    -> A complicated dance with XATTRs that I still don't understand
    -> Ensuring indices and .glusterfs pre-exist with the right
       permissions
    My hope is I just misunderstand, and I will bring this up in a
    future discussion.


Use Case #2: Admin/Head Node High Availability
------------------------------------------------------------------------------
- We used to have an admin node HA solution that contained two servers
  and an external storage device. A VM, provided by the two servers,
  was used as the "real admin node".

- This solution was expensive due to the external storage

- This solution was less optimal due to not having true quorum

- Building on our gluster experience, we removed the external storage
  and added a 3rd server.
  (Due to some previous experience with DRBD we elected to stay with
  gluster here)

- We build a gluster volume shared across the 3 servers, sharded, which
  primarily holds a virtual machine image file used by the admin node
  VM

- The physical nodes use bonding for network redundancy and bridges to
  feed the networks in to the VM.

- We use pacemaker in this case to manage the setup (see the sketch
  after this list)

- It's pretty simple - pacemaker rules manage a VirtualDomain instance
  and a simple ssh monitor makes sure we can get in to it.

- We typically have a single 3-12T image sitting on the sharded gluster
  shared storage, used by the virtual machine that forms the true admin
  node.

- We set this up with manual instructions, but tooling is coming soon
  to aid in automated setup of this solution

- This solution is in active use in at least 4 very large
  supercomputers.

- I am impressed by the speed of the solution on the sharded volume.
  The VM image creation speed using libvirt to talk to the image file
  hosted on a sharded gluster volume works slick.
  (We use the fuse mount because we don't want to build our own
  qemu/libvirt, which would be needed at least for SLES and maybe RHEL
  too since we have our own gluster build)
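The shape of it, roughly (volume, resource, and file names are
illustrative, not our actual configuration):

    # Sharded replica-3 volume holding the admin VM image file
    gluster volume create adminvm replica 3 \
        host1:/data/brick/adminvm host2:/data/brick/adminvm \
        host3:/data/brick/adminvm
    gluster volume set adminvm features.shard on
    gluster volume start adminvm

    # Pacemaker drives the VM through the VirtualDomain resource
    # agent; the domain's disk points at the fuse-mounted image file
    pcs resource create admin_vm ocf:heartbeat:VirtualDomain \
        hypervisor="qemu:///system" \
        config="/adminvm/config/admin_vm.xml" \
        op monitor interval=30s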
- Challenges:
  * Not being able to boot all of a sudden was a terror for us (the VM
    would only see the disk size as the size of a shard at random
    times).
    -> Thankfully the community helped guide us to gluster79 and that
       resolved it
  * People keep asking to make a 2-node version but I push back.
    Especially with gluster, but honestly with other solutions too,
    don't cheap out on arbitration is what I try to tell people.


Network Utilization
------------------------------------------------------------------------------
- We typically have 2x 10G bonds on leaders
- For simplicity, we co-mingle gluster and compute node traffic
- It is very rare that we approach full 20G utilization even in very
  intensive operations
- Newer solutions may increase the speed of the bonds, but it isn't due
  to a current pressing need
- Locality Challenge:
  * For future exascale systems, we may need to become concerned about
    locality
  * Certain compute nodes are far closer to some gluster servers than
    others
  * And gluster servers themselves need to talk among themselves but
    could be stretched across the topology
  * We have tools to monitor switch link utilization and so far have
    not hit a scary zone
  * Somewhat complicated by fault tolerance. It would be sad to design
    the leaders such that one bad PDU loses quorum because all 3
    leaders in the same replica-3 were on the same PDU
  * But spreading them out may have locality implications
  * This is a future area of study for us. We have various ideas in
    mind if a problem strikes
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
Erik Jacobson
2021-Mar-19 16:24 UTC
[Gluster-users] Gluster usage scenarios in HPC cluster management
> But I've also tested using tmpfs (allocating half of the RAM per compute node)
> and exporting that as a distributed striped GlusterFS volume over NFS over
> RDMA to the 100 Gbps IB network so that the "ramdrives" can be used as a high
> speed "scratch disk space" that doesn't have the write endurance limits that
> NAND based flash memory SSDs have.

In my world, we leave the high speed networks to jobs, so I don't have
much to offer there. In our test SU Leader setup, where we may not have
disks, we do carve gluster bricks out of tmpfs mounts. However, in that
test case, which is designed to test the tooling and not the workload,
I use iscsi to emulate disks to test the true solution.

I will just mention that the cluster manager's use of squashfs image
objects sitting on NFS mounts is very fast even on top of a 20G (2x10G)
mgmt infrastructure.

If you combine it with a tmpfs overlay, which is our default, you will
have a writable area in tmpfs that doesn't persist, and you will have
low memory usage. For a 4-node cluster, you probably don't need to
bother with squashfs and can just mount the directory tree for the
image at the right time. By using a tmpfs overlay and some post-boot
configuration, you can perhaps avoid the memory usage of what you are
doing. As long as you don't need to beat the crap out of root, an NFS
root is fine, and using gluster backed disks is fine.
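The overlay trick at boot time amounts to roughly this (paths are
illustrative; our tooling does the equivalent from the initrd):

    # Read-only root from NFS (or a loop-mounted squashfs image)
    mount -t nfs -o ro,vers=3 leader:/images/rhel8.3 /rootfs.ro

    # Writable tmpfs layered on top; only actual writes consume RAM
    mount -t tmpfs tmpfs /overlay
    mkdir -p /overlay/upper /overlay/work
    mount -t overlay overlay \
        -o lowerdir=/rootfs.ro,upperdir=/overlay/upper,workdir=/overlay/work \
        /newroot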
Note that if you use exported trees with gnfs instead of image objects,
there are lots of volume tweaks you can make to push efficiency up
(examples below). For squashfs, I used a sharded volume.
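By "tweaks" I mean settings along these lines for a read-mostly
exported tree (the volume name and values are illustrative, not our
exact profile):

    gluster volume set trees performance.cache-size 8GB
    gluster volume set trees performance.io-thread-count 32
    gluster volume set trees performance.parallel-readdir on
    gluster volume set trees performance.stat-prefetch on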
It's easy for me to write this since we have the install environment.
While nothing in there is "hard", it's a bunch of code developed over
time. That said, if you wanted to experiment, I can share some pieces
of what we do. I just fear it's too complicated.

I will note that some customers advocate for a tiny root - say 1.5G -
that could fit in tmpfs easily, and then attach in workloads (other
filesystems with development environments over the network, or
container environments, etc). That would be another way to keep memory
use low for a diskless cluster.

(We use gnfs because we're not ready to switch to ganesha yet. It's on
our list to move if we can get it working for our load.)