Erik Jacobson
2020-Apr-08 19:15 UTC
[Gluster-users] Impressive boot times for big clusters: NFS, Image Objects, and Sharding
I wanted to share some positive news with the group here.

Summary: Using sharding and squashfs image files, instead of expanded directory trees, for RO NFS OS images has led to impressive boot times for 2,000-node diskless clusters using 12 servers for gluster+tftp+etc.

Details:

As you may have seen in some of my other posts, we have been using gluster to boot giant clusters, some of which are in the top500 list of HPC resources. The compute nodes are diskless.

Up until now, we have done this by pushing an operating system from our head node to the storage cluster, which is made up of one or more 3-server/(3-brick) subvolumes in a distributed/replicate configuration. The servers are also PXE-boot and tftpboot servers and also serve the "miniroot" (basically a fat initrd with a cluster manager toolchain). We also locate other management functions there unrelated to boot and root.

This copy of the operating system is a simple directory tree representing the whole operating system image. You could 'chroot' into it, for example.

This operating system is a read-only NFS mount point that all compute nodes use as the base for their root filesystem.

This has been working well, getting us boot times (not including BIOS startup) of between 10 and 15 minutes for a 2,000-node cluster. Typically a cluster like this would have 12 gluster/NFS servers in 3 subvolumes. On simple RHEL8 images without much customization, I tend to get 10 minutes.

We have observed some slow-downs for customers with very metadata-intensive job launch workloads. The metadata load of such an operation is very heavy, with giant loads observed on the gluster servers.

We recently started supporting RW NFS, as opposed to TMPFS, for the writable components of root; our customers tend to prefer to keep every byte of memory for jobs. We came up with a solution of hosting sparse image files, with XFS filesystems on top, in a writable gluster area exported over NFS. This makes the RW NFS solution very fast because it reduces per-node RW NFS metadata. Boot times didn't go up significantly (but our first attempt, using a plain directory tree, was a slow disaster, hitting the worst-case workload of lots of small file writes plus lots of metadata work). So we solved that problem with XFS filesystem images on RW NFS (sketched below).

Building on that idea, we have in our development branch a version of the solution that changes the RO NFS image to a squashfs file on a sharding volume. That is, instead of each operating system being many thousands of files that are (slowly) synced to the gluster servers, the head node makes a squashfs file out of the image and pushes that. Then all the compute nodes mount the squashfs image from the NFS mount (mount the RO NFS export, then loop-mount the squashfs image; a rough sketch of both pieces follows below).

On a 2,000-node cluster I had access to for a time, our prototype got us boot times of 5 minutes -- including RO NFS with squashfs and RW NFS for writable areas like /etc and /var (on an XFS image file).
 * We also tried RW NFS with OVERLAY and had no problem there.

I expect that, for people who prefer the squashfs non-expanded format, we can reduce the leader-per-compute density (fewer gluster/NFS leader servers needed per compute node).

Now, not all customers will want squashfs. Some want to be able to edit a file and see it instantly on all nodes. However, customers looking for fast boot times, or who are suffering slowness on metadata-intensive job launch workloads, will have a new fast option.

Therefore, it's very important we still solve the bug we're working on in another thread.
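For anyone curious what the new flow looks like in practice, here is a minimal sketch of the two halves: building and pushing the squashfs object on the head node, and the mount sequence a compute node performs from its miniroot. This is not our actual tooling; the paths, mount points, compression options, and function names below are invented for illustration, and a real deployment would add error handling and site-specific options.

#!/usr/bin/env python3
# Sketch only: illustrative paths and names, not the real cluster-manager tooling.
import subprocess

def run(cmd):
    # All commands below are standard Linux tools; fail loudly if any step breaks.
    subprocess.run(cmd, check=True)

# Head node: collapse the expanded OS image tree into a single squashfs file
# and copy it onto the sharded gluster volume. One big file instead of
# thousands of small ones is what lets sharding spread the I/O across bricks.
def build_and_push(image_tree, staging_file, gluster_mount):
    run(["mksquashfs", image_tree, staging_file, "-comp", "xz", "-noappend"])
    run(["cp", staging_file, f"{gluster_mount}/"])

# Compute node (from the miniroot/initrd): mount the read-only NFS export
# served by the gluster/NFS leaders, then loop-mount the squashfs image to
# get the read-only base of the root filesystem.
def mount_ro_root(nfs_server, export, image_name):
    run(["mkdir", "-p", "/mnt/images", "/mnt/root-ro"])
    run(["mount", "-t", "nfs", "-o", "ro,nolock",
         f"{nfs_server}:{export}", "/mnt/images"])
    run(["mount", "-t", "squashfs", "-o", "loop,ro",
         f"/mnt/images/{image_name}", "/mnt/root-ro"])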
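And the writable side, which is what keeps the RW NFS metadata load low: each node gets one sparse image file on the RW NFS export, formatted XFS and loop-mounted locally, so what would have been many small NFS file operations become block writes inside a single file. Again a sketch under assumed names and sizes, not the shipped implementation.

# Sketch only: per-node writable area as a sparse XFS image on RW NFS.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def make_rw_area(rw_nfs_mount, node_name, size="16G"):
    img = f"{rw_nfs_mount}/{node_name}.img"
    run(["truncate", "-s", size, img])     # sparse: consumes space only as written
    run(["mkfs.xfs", "-f", "-q", img])     # XFS inside the image file
    run(["mkdir", "-p", "/mnt/root-rw"])
    run(["mount", "-o", "loop", img, "/mnt/root-rw"])  # node's private /etc, /var, ... live here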
But I wanted to share something positive. So now I've said something positive instead of only asking for help :) :)

Erik
Strahil Nikolov
2020-Apr-09 05:25 UTC
[Gluster-users] Impressive boot times for big clusters: NFS, Image Objects, and Sharding
On April 8, 2020 10:15:27 PM GMT+03:00, Erik Jacobson <erik.jacobson at hpe.com> wrote:
> I wanted to share some positive news with the group here.
> [...]

Good Job Erik!

Best Regards,
Strahil Nikolov