Strahil
2019-Nov-05 03:05 UTC
[Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
Sure,

Here is what the setup was:

[root@ovirt1 ~]# systemctl cat var-run-gluster-shared_storage.mount --no-pager
# /run/systemd/generator/var-run-gluster-shared_storage.mount
# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)

[Mount]
What=gluster1:/gluster_shared_storage
Where=/var/run/gluster/shared_storage
Type=glusterfs
Options=defaults,x-systemd.requires=glusterd.service,x-systemd.automount

[root@ovirt1 ~]# systemctl cat var-run-gluster-shared_storage.automount --no-pager
# /run/systemd/generator/var-run-gluster-shared_storage.automount
# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=remote-fs.target
After=glusterd.service
Requires=glusterd.service

[Automount]
Where=/var/run/gluster/shared_storage

[root@ovirt1 ~]# systemctl cat glusterd --no-pager
# /etc/systemd/system/glusterd.service
[Unit]
Description=GlusterFS, a clustered file-system server
Requires=rpcbind.service gluster_bricks-engine.mount gluster_bricks-data.mount gluster_bricks-isos.mount
After=network.target rpcbind.service gluster_bricks-engine.mount gluster_bricks-data.mount gluster_bricks-isos.mount
Before=network-online.target

[Service]
Type=forking
PIDFile=/var/run/glusterd.pid
LimitNOFILE=65536
Environment="LOG_LEVEL=INFO"
EnvironmentFile=-/etc/sysconfig/glusterd
ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS
KillMode=process
SuccessExitStatus=15

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/glusterd.service.d/99-cpu.conf
[Service]
CPUAccounting=yes
Slice=glusterfs.slice

[root@ovirt1 ~]# systemctl cat ctdb --no-pager
# /etc/systemd/system/ctdb.service
[Unit]
Description=CTDB
Documentation=man:ctdbd(1) man:ctdb(7)
After=network-online.target time-sync.target glusterd.service var-run-gluster-shared_storage.automount
Conflicts=var-lib-nfs-rpc_pipefs.mount

[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
Type=forking
LimitCORE=infinity
PIDFile=/run/ctdb/ctdbd.pid
ExecStartPre=/bin/bash -c "sleep 2; if [ -f /sys/fs/cgroup/cpu/system.slice/cpu.rt_runtime_us ]; then echo 10000 > /sys/fs/cgroup/cpu/system.slice/cpu.rt_runtime_us; fi"
ExecStartPre=/bin/bash -c 'if [[ $(find /var/log/log.ctdb -type f -size +20971520c 2>/dev/null) ]]; then truncate -s 0 /var/log/log.ctdb;fi'
ExecStartPre=/bin/bash -c 'if [ -d "/var/run/gluster/shared_storage/lock" ] ;then exit 4; fi'
ExecStart=/usr/sbin/ctdbd_wrapper /run/ctdb/ctdbd.pid start
ExecStop=/usr/sbin/ctdbd_wrapper /run/ctdb/ctdbd.pid stop
KillMode=control-group
Restart=no

[Install]
WantedBy=multi-user.target

[root@ovirt1 ~]# systemctl cat nfs-ganesha --no-pager
# /usr/lib/systemd/system/nfs-ganesha.service
# This file is part of nfs-ganesha.
#
# There can only be one NFS-server active on a system. When NFS-Ganesha is
# started, the kernel NFS-server should have been stopped. This is achieved by
# the 'Conflicts' directive in this unit.
#
# The Network Locking Manager (rpc.statd) is provided by the nfs-utils package.
# NFS-Ganesha comes with its own nfs-ganesha-lock.service to resolve potential
# conflicts in starting multiple rpc.statd processes. See the comments in the
# nfs-ganesha-lock.service for more details.
#

[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=rpcbind.service nfs-ganesha-lock.service
Wants=rpcbind.service nfs-ganesha-lock.service
Conflicts=nfs.target
After=nfs-ganesha-config.service
Wants=nfs-ganesha-config.service

[Service]
Type=forking
Environment="NOFILE=1048576"
EnvironmentFile=-/run/sysconfig/ganesha
ExecStart=/bin/bash -c "${NUMACTL} ${NUMAOPTS} /usr/bin/ganesha.nfsd ${OPTIONS} ${EPOCH}"
ExecStartPost=-/bin/bash -c "prlimit --pid $MAINPID --nofile=$NOFILE:$NOFILE"
ExecStartPost=-/bin/bash -c "/usr/bin/sleep 2 && /bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.init_fds_limit"
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.shutdown

[Install]
WantedBy=multi-user.target
Also=nfs-ganesha-lock.service

I can't guarantee that it will work 100% in your setup, but I remember
I had only a few hiccups after a powerdown+powerup of all nodes.

P.S.: I still prefer corosync/pacemaker, but in my setup I cannot have
fencing, and in a hyperconverged setup it gets even more complex.
If your cluster is gluster only, consider pacemaker for that task.

Best Regards,
Strahil Nikolov

On Nov 4, 2019 15:57, Erik Jacobson <erik.jacobson at hpe.com> wrote:
>
> Thank you! I am very interested. I hadn't considered the automounter
> idea.
>
> Also, your fstab has a different dependency approach than mine otherwise
> as well.
>
> If you happen to have the examples handy, I'll give them a shot here.
>
> I'm looking forward to emerging from this dark place of dependencies not
> working!!
>
> Thank you so much for writing back,
>
> Erik
>
> On Mon, Nov 04, 2019 at 06:59:10AM +0200, Strahil wrote:
> > Hi Erik,
> >
> > I took another approach.
> >
> > 1. I got a systemd mount unit for my ctdb lock volume's brick:
> > [root@ovirt1 system]# grep var /etc/fstab
> > gluster1:/gluster_shared_storage /var/run/gluster/shared_storage/ glusterfs defaults,x-systemd.requires=glusterd.service,x-systemd.automount 0 0
> >
> > As you can see, it is an automounter, because sometimes it fails to
> > mount on time.
> >
> > 2. I got custom systemd services for glusterd, ctdb and vdo, as I
> > need to 'put' dependencies for each of those.
> >
> > Now, I'm no longer using ctdb & NFS Ganesha (as my version of ctdb
> > cannot use hostnames and my environment is a little bit crazy), but
> > I can still provide hints on how I did it.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Nov 3, 2019 22:46, Erik Jacobson <erik.jacobson at hpe.com> wrote:
> > >
> > > So, I have a solution I have written about in the past that is based on
> > > gluster with CTDB for IP and a level of redundancy.
> > >
> > > It's been working fine except for a few quirks I need to work out on
> > > giant clusters when I get access.
> > >
> > > I have 3x9 gluster volume, each are also NFS servers, using gluster
> > > NFS (ganesha isn't reliable for my workload yet). There are 9 IP
> > > aliases spread across 9 servers.
> > >
> > > I also have many bind mounts that point to the shared storage as a
> > > source, and the /gluster/lock volume ("ctdb") of course.
> > >
> > > glusterfs 4.1.6 (rhel8 today, but I use rhel7, rhel8, sles12, and
> > > sles15)
> > >
> > > Things work well when everything is up and running. IP failover works
> > > well when one of the servers goes down. My issue is when that server
> > > comes back up. Despite my best efforts with systemd fstab dependencies,
> > > the shared storage areas including the gluster lock for CTDB do not
> > > always get mounted before CTDB starts. This causes trouble for CTDB
> > > correctly joining the collective. I also have problems where my
> > > bind mounts can happen before the shared storage is mounted, despite my
> > > attempts at preventing this with dependencies in fstab.
> > >
> > > I decided a better approach would be to use a gluster hook and just
> > > mount everything I need as I need it, and start up ctdb when I know and
> > > verify that /gluster/lock is really gluster and not a local disk.
> > >
> > > I started down a road of doing this with a start host hook and after
> > > spending a while at it, I realized my logic error. This will only fire
> > > when the volume is *started*, not when a server that was down re-joins.
> > >
> > > I took a look at the code, glusterd-hooks.c, and found that support
> > > for "brick start" is not in place for a hook script but it's nearly
> > > there:
> > >
> > >         [GD_OP_START_BRICK]             = EMPTY,
> > > ...
> > >
> > > and no entry in glusterd_hooks_add_op_args() yet.
> > >
> > >
> > > Before I make a patch for my own use, I wanted to do a sanity check and
> > > find out if others have solved this better than the road I'm heading
> > > down.
> > >
> > > What I was thinking of doing is enabling a brick start hook, and
> > > do my processing for volumes being mounted from there. However, I
> > > suppose brick start is a bad choice for the case of simply stopping and
> > > starting the volume, because my processing would try to complete before
> > > the gluster volume was fully started. It would probably work for a brick
> > > "coming back and joining" but not "stop volume/start volume".
> > >
> > > Any suggestions?
> > >
> > > My end goal is:
> > >  - mount shared storage every boot
> > >  - only attempt to mount when gluster is available (_netdev doesn't seem
> > >    to be enough)
> > >  - never start ctdb unless /gluster/lock is a shared storage and not a
> > >    directory.
> > >  - only do my bind mounts from shared storage in to the rest of the
> > >    layout when we are sure the shared storage is mounted (don't
> > >    bind-mount using an empty directory as a source by accident!)
> > >
> > > Thanks so much for reading my question,
> > >
> > > Erik
>
> Erik Jacobson
> Software Engineer
>
> erik.jacobson at hpe.com
> +1 612 851 0550 Office
>
> Eagan, MN
> hpe.com
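
One of the goals quoted in the message above is to never start ctdb unless the lock path is really shared storage rather than a plain local directory. A minimal sketch of such a check follows, assuming the volume is FUSE-mounted (type fuse.glusterfs in /proc/mounts) and that the mount point contains no whitespace. The function name is_gluster_mount is hypothetical and not part of gluster or ctdb.

```shell
#!/bin/bash
# Hypothetical helper (not part of gluster or ctdb).
# is_gluster_mount PATH [MOUNTS_FILE]
# Succeeds only if PATH is a mount point of type fuse.glusterfs.
# MOUNTS_FILE defaults to /proc/mounts; the override exists for testing.
is_gluster_mount() {
    local target mounts_file
    # Resolve symlinks first: on many distros /var/run points to /run,
    # and /proc/mounts lists the resolved path.
    target=$(readlink -f -- "$1") || return 1
    mounts_file=${2:-/proc/mounts}
    # Fields in /proc/mounts: device mountpoint fstype options dump pass.
    # With x-systemd.automount an autofs entry exists for the same mount
    # point, so scan every line instead of stopping at the first match.
    awk -v mp="$target" '
        $2 == mp && $3 == "fuse.glusterfs" { found = 1 }
        END { exit !found }
    ' "$mounts_file"
}
```

It could be used as a guard, for example `is_gluster_mount /var/run/gluster/shared_storage || exit 1`, before starting ctdb or before doing bind mounts that use the shared storage as a source.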
Erik Jacobson
2019-Nov-05 15:39 UTC
[Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
On Tue, Nov 05, 2019 at 05:05:08AM +0200, Strahil wrote:
> Sure,
>
> Here is what was the setup :

Thank you! You're very kind to send me this. I will verify it with my
setup soon. Hoping to rid myself of these dep problems. Thank you !!!

Erik
Erik Jacobson
2019-Nov-09 18:15 UTC
[Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
> Here is what was the setup :

I thought I'd share an update in case it helps others. Your ideas
inspired me to try a different approach.

We support 4 main distros (and 2 variants of some). We try not to
provide our own versions of distro-supported packages like CTDB where
possible. So a concern for me in modifying services is that they could
be replaced in package updates. There are ways to mitigate that, but
that thought combined with your ideas led me to try this:

- Be sure the ctdb service is disabled
- Added a systemd service of my own, oneshot, that runs a helper script
- The helper script first ensures the gluster volumes show up
  (I use localhost in my case and besides, in our environment, we don't
  want CTDB to have a public IP anyway until NFS can be served so this
  helps there too)
- Even with the gluster volume showing good, during init startup, first
  attempts to mount gluster volumes fail. So the helper script keeps
  looping until they work. It seems they work on the 2nd try (after a 3s
  sleep at failure).
- Once the mounts are confirmed working and mounted, then my helper
  starts the ctdb service.
- Awkward CTDB problems (where the lock check sometimes fails to detect
  a lock problem) are avoided since we won't start CTDB until we're 100%
  sure the gluster lock is mounted and pointing at gluster.

The above is working in prototype form so I'm going to start adding my
bind mounts to the equation.

I think I have a solution that will work now and I thank you so much for
the ideas.

I'm taking things from prototype form now on to something we can provide
people.

With regards to pacemaker. There are a few pacemaker solutions that I've
touched, and one I even helped implement. Now, it could be that I'm not
an expert at writing rules, but pacemaker seems to have often given us
more trouble than the problem it solves. I believe this is due to the
complexity of the software and the power of it. I am not knocking
pacemaker. However, a person really has to be a pacemaker expert to not
make a mistake that could cause a down time. So I have attempted to
avoid pacemaker in the new solution. I know there are down sides --
fencing is there for a reason -- but as far as I can tell the decision
has been right for us. CTDB is less complicated even if it does not
provide 100% true full HA abilities. That said, in the solution, I've
been careful to future-proof a move to pacemaker. For example, on the
gluster servers/NFS servers, I bring up IP aliases (interfaces) on the
network where the BMCs reside so we're seamlessly able to switch to
pacemaker with IPMI/BMC/redfish fencing later if needed without causing
too much pain in the field with deployed servers.

I do realize there are tools to help configure pacemaker for you. Some
that I've tried have given me mixed results, perhaps due to the
complexity of networking setup in the solutions we have.

As we start to deploy this to more locations, I'll gain a feel for
whether a move to pacemaker is right or not. I just share this in the
interest of learning. I'm always willing to learn and improve if I've
overlooked something.

Erik
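
The helper flow described in the 2019-Nov-09 message (wait for the volume, retry the mount with a 3 s sleep, verify the lock mount, then start ctdb) might look roughly like the sketch below. It is not the script from the thread. The function names, the retry limit, and the reliance on an fstab entry for the mount point are assumptions; the volume name "ctdb" and the path /gluster/lock come from the earlier message.

```shell
#!/bin/bash
# Hypothetical sketch of the helper flow; not the script from the thread.
VOLUME=${VOLUME:-ctdb}                   # lock volume name, per the thread
LOCK_MOUNT=${LOCK_MOUNT:-/gluster/lock}  # assumed to have an fstab entry

# retry CMD [ARGS...]: rerun CMD until it succeeds, sleeping RETRY_SLEEP
# seconds after each failure; give up after RETRY_MAX attempts.
retry() {
    local attempt=1
    until "$@"; do
        if (( attempt >= ${RETRY_MAX:-20} )); then
            return 1
        fi
        sleep "${RETRY_SLEEP:-3}"
        attempt=$((attempt + 1))
    done
}

# True only when LOCK_MOUNT is a mount point backed by gluster's FUSE
# client, so a plain local directory never passes.
lock_is_gluster() {
    findmnt -n -o FSTYPE -M "$LOCK_MOUNT" | grep -qx 'fuse.glusterfs'
}

# Mount the lock volume unless it is already mounted from gluster.
mount_lock() {
    lock_is_gluster || mount "$LOCK_MOUNT"
}

start_ctdb_when_ready() {
    # Assumption: this exits non-zero until the volume is available.
    retry gluster volume status "$VOLUME" >/dev/null || return 1
    retry mount_lock || return 1
    lock_is_gluster || return 1
    systemctl start ctdb
}
```

A oneshot unit would call start_ctdb_when_ready as its last step. Keeping the ctdb unit itself unmodified and disabled, as described above, means package updates cannot overwrite the ordering logic.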