Strahil
2019-Nov-05 03:05 UTC
[Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
Sure,
Here is what the setup was:
[root@ovirt1 ~]# systemctl cat var-run-gluster-shared_storage.mount --no-pager
# /run/systemd/generator/var-run-gluster-shared_storage.mount
# Automatically generated by systemd-fstab-generator
[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
[Mount]
What=gluster1:/gluster_shared_storage
Where=/var/run/gluster/shared_storage
Type=glusterfs
Options=defaults,x-systemd.requires=glusterd.service,x-systemd.automount
[root@ovirt1 ~]# systemctl cat var-run-gluster-shared_storage.automount --no-pager
# /run/systemd/generator/var-run-gluster-shared_storage.automount
# Automatically generated by systemd-fstab-generator
[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=remote-fs.target
After=glusterd.service
Requires=glusterd.service
[Automount]
Where=/var/run/gluster/shared_storage
[root@ovirt1 ~]# systemctl cat glusterd --no-pager
# /etc/systemd/system/glusterd.service
[Unit]
Description=GlusterFS, a clustered file-system server
Requires=rpcbind.service gluster_bricks-engine.mount gluster_bricks-data.mount gluster_bricks-isos.mount
After=network.target rpcbind.service gluster_bricks-engine.mount gluster_bricks-data.mount gluster_bricks-isos.mount
Before=network-online.target
[Service]
Type=forking
PIDFile=/var/run/glusterd.pid
LimitNOFILE=65536
Environment="LOG_LEVEL=INFO"
EnvironmentFile=-/etc/sysconfig/glusterd
ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS
KillMode=process
SuccessExitStatus=15
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/glusterd.service.d/99-cpu.conf
[Service]
CPUAccounting=yes
Slice=glusterfs.slice
[root@ovirt1 ~]# systemctl cat ctdb --no-pager
# /etc/systemd/system/ctdb.service
[Unit]
Description=CTDB
Documentation=man:ctdbd(1) man:ctdb(7)
After=network-online.target time-sync.target glusterd.service var-run-gluster-shared_storage.automount
Conflicts=var-lib-nfs-rpc_pipefs.mount
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
Type=forking
LimitCORE=infinity
PIDFile=/run/ctdb/ctdbd.pid
ExecStartPre=/bin/bash -c "sleep 2; if [ -f /sys/fs/cgroup/cpu/system.slice/cpu.rt_runtime_us ]; then echo 10000 > /sys/fs/cgroup/cpu/system.slice/cpu.rt_runtime_us; fi"
ExecStartPre=/bin/bash -c 'if [[ $(find /var/log/log.ctdb -type f -size +20971520c 2>/dev/null) ]]; then truncate -s 0 /var/log/log.ctdb;fi'
ExecStartPre=/bin/bash -c 'if [ -d "/var/run/gluster/shared_storage/lock" ] ;then exit 4; fi'
ExecStart=/usr/sbin/ctdbd_wrapper /run/ctdb/ctdbd.pid start
ExecStop=/usr/sbin/ctdbd_wrapper /run/ctdb/ctdbd.pid stop
KillMode=control-group
Restart=no
[Install]
WantedBy=multi-user.target
[root@ovirt1 ~]# systemctl cat nfs-ganesha --no-pager
# /usr/lib/systemd/system/nfs-ganesha.service
# This file is part of nfs-ganesha.
#
# There can only be one NFS-server active on a system. When NFS-Ganesha is
# started, the kernel NFS-server should have been stopped. This is achieved by
# the 'Conflicts' directive in this unit.
#
# The Network Locking Manager (rpc.statd) is provided by the nfs-utils package.
# NFS-Ganesha comes with its own nfs-ganesha-lock.service to resolve potential
# conflicts in starting multiple rpc.statd processes. See the comments in the
# nfs-ganesha-lock.service for more details.
#
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=rpcbind.service nfs-ganesha-lock.service
Wants=rpcbind.service nfs-ganesha-lock.service
Conflicts=nfs.target
After=nfs-ganesha-config.service
Wants=nfs-ganesha-config.service
[Service]
Type=forking
Environment="NOFILE=1048576"
EnvironmentFile=-/run/sysconfig/ganesha
ExecStart=/bin/bash -c "${NUMACTL} ${NUMAOPTS} /usr/bin/ganesha.nfsd ${OPTIONS} ${EPOCH}"
ExecStartPost=-/bin/bash -c "prlimit --pid $MAINPID --nofile=$NOFILE:$NOFILE"
ExecStartPost=-/bin/bash -c "/usr/bin/sleep 2 && /bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.init_fds_limit"
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.shutdown
[Install]
WantedBy=multi-user.target
Also=nfs-ganesha-lock.service
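With an automount unit in play, the mount point is, as far as I understand, an autofs placeholder until something first accesses it, after which the gluster mount is stacked on top. A pre-start guard can therefore check which filesystem type is currently topmost at that path. The following is only a minimal sketch under that assumption: the function names `top_fstype` and `shared_storage_ready` are hypothetical and not part of the setup above, and it assumes the usual `/proc/mounts` layout (device, mount point, type, options), where later lines shadow earlier ones.

```shell
#!/bin/bash
# Print the filesystem type of the topmost mount at path $1, reading the
# mount table from $2 (default /proc/mounts). Later lines win, so an autofs
# placeholder followed by a fuse.glusterfs entry reports fuse.glusterfs.
top_fstype() {
    local target="$1" mounts="${2:-/proc/mounts}"
    local dev mnt fstype rest found=""
    while read -r dev mnt fstype rest; do
        if [[ "$mnt" == "$target" ]]; then found="$fstype"; fi
    done < "$mounts"
    [[ -n "$found" ]] && printf '%s\n' "$found"
}

# Touch the path (which should trigger a pending automount), then require
# that gluster is what ends up mounted there.
shared_storage_ready() {
    local path mounts="${2:-/proc/mounts}"
    # /var/run is typically a symlink to /run, and the mount table lists the
    # resolved path, so resolve symlinks before comparing.
    path="$(readlink -f -- "${1:-/var/run/gluster/shared_storage}")" || return 1
    ls "$path" >/dev/null 2>&1
    [[ "$(top_fstype "$path" "$mounts")" == "fuse.glusterfs" ]]
}
```

A guard of this kind could be called from an `ExecStartPre=` line, for example `shared_storage_ready || exit 4`, so that the service refuses to start while only the placeholder is present.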
I can't guarantee that it will work 100% in your setup, but I remember I
had only a few hiccups after powering all nodes down and back up.
P.S.: I still prefer corosync/pacemaker, but in my setup I cannot have fencing,
and in a hyperconverged setup it gets even more complex. If your cluster is
gluster-only, consider pacemaker for that task.
Best Regards,
Strahil Nikolov
On Nov 4, 2019 15:57, Erik Jacobson <erik.jacobson at hpe.com> wrote:
>
> Thank you! I am very interested. I hadn't considered the automounter
> idea.
>
> Also, your fstab has a different dependency approach than mine otherwise
> as well.
>
> If you happen to have the examples handy, I'll give them a shot here.
>
> I'm looking forward to emerging from this dark place of dependencies not
> working!!
>
> Thank you so much for writing back,
>
> Erik
>
> On Mon, Nov 04, 2019 at 06:59:10AM +0200, Strahil wrote:
> > Hi Erik,
> >
> > I took another approach.
> >
> > 1. I got a systemd mount unit for my ctdb lock volume's brick:
> > [root@ovirt1 system]# grep var /etc/fstab
> > gluster1:/gluster_shared_storage /var/run/gluster/shared_storage/ glusterfs defaults,x-systemd.requires=glusterd.service,x-systemd.automount 0 0
> >
> > As you can see, it is an automounter, because sometimes it fails to
> > mount on time.
> >
> > 2. I got custom systemd services for glusterd, ctdb and vdo, as I
> > need to 'put' dependencies for each of those.
> >
> > Now, I'm no longer using ctdb & NFS Ganesha (as my version of ctdb
> > cannot use hostnames and my environment is a little bit crazy), but I
> > can still provide hints how I did it.
> >
> > Best Regards,
> > Strahil Nikolov
> > On Nov 3, 2019 22:46, Erik Jacobson <erik.jacobson at hpe.com> wrote:
> > >
> > > So, I have a solution I have written about in the past that is based on
> > > gluster with CTDB for IP and a level of redundancy.
> > >
> > > It's been working fine except for a few quirks I need to work out on
> > > giant clusters when I get access.
> > >
> > > I have a 3x9 gluster volume, and the servers are each also NFS servers,
> > > using gluster NFS (ganesha isn't reliable for my workload yet). There
> > > are 9 IP aliases spread across 9 servers.
> > >
> > > I also have many bind mounts that point to the shared storage as a
> > > source, and the /gluster/lock volume ("ctdb") of course.
> > >
> > > glusterfs 4.1.6 (rhel8 today, but I use rhel7, rhel8, sles12, and
> > > sles15)
> > >
> > > Things work well when everything is up and running. IP failover works
> > > well when one of the servers goes down. My issue is when that server
> > > comes back up. Despite my best efforts with systemd fstab dependencies,
> > > the shared storage areas including the gluster lock for CTDB do not
> > > always get mounted before CTDB starts. This causes trouble for CTDB
> > > correctly joining the collective. I also have problems where my
> > > bind mounts can happen before the shared storage is mounted, despite my
> > > attempts at preventing this with dependencies in fstab.
> > >
> > > I decided a better approach would be to use a gluster hook and just
> > > mount everything I need as I need it, and start up ctdb when I know and
> > > verify that /gluster/lock is really gluster and not a local disk.
> > >
> > > I started down a road of doing this with a start post hook and after
> > > spending a while at it, I realized my logic error. This will only fire
> > > when the volume is *started*, not when a server that was down re-joins.
> > >
> > > I took a look at the code, glusterd-hooks.c, and found that support
> > > for "brick start" is not in place for a hook script but it's nearly
> > > there:
> > >
> > >         [GD_OP_START_BRICK]             = EMPTY,
> > > ...
> > >
> > > and no entry in glusterd_hooks_add_op_args() yet.
> > >
> > >
> > > Before I make a patch for my own use, I wanted to do a sanity check and
> > > find out if others have solved this better than the road I'm heading
> > > down.
> > >
> > > What I was thinking of doing is enabling a brick start hook, and
> > > do my processing for volumes being mounted from there. However, I
> > > suppose brick start is a bad choice for the case of simply stopping and
> > > starting the volume, because my processing would try to complete before
> > > the gluster volume was fully started. It would probably work for a brick
> > > "coming back and joining" but not "stop volume/start volume".
> > >
> > > Any suggestions?
> > >
> > > My end goal is:
> > > - mount shared storage every boot
> > > - only attempt to mount when gluster is available (_netdev doesn't seem
> > >   to be enough)
> > > - never start ctdb unless /gluster/lock is a shared storage and not a
> > >   directory.
> > > - only do my bind mounts from shared storage in to the rest of the
> > >   layout when we are sure the shared storage is mounted (don't
> > >   bind-mount using an empty directory as a source by accident!)
> > >
> > > Thanks so much for reading my question,
> > >
> > > Erik
>
>
> Erik Jacobson
> Software Engineer
>
> erik.jacobson at hpe.com
> +1 612 851 0550 Office
>
> Eagan, MN
> hpe.com
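One of the goals in the quoted message is to never bind-mount from an empty local directory when the shared storage is not actually mounted. A guard for that could look roughly like the sketch below. The function name `bind_from_shared` and all paths in the usage note are hypothetical; the check assumes that gluster FUSE mounts show up as type `fuse.glusterfs` in `/proc/mounts`.

```shell
#!/bin/bash
# Bind-mount $2 (a directory inside the shared storage mounted at $1) onto
# $3, but only if $1 is really gluster-backed. The mount table is read from
# $4 (default /proc/mounts).
bind_from_shared() {
    local shared="${1%/}" src="$2" dst="$3" mounts="${4:-/proc/mounts}"
    local dev mnt fstype rest backed=no
    while read -r dev mnt fstype rest; do
        if [[ "$mnt" == "$shared" && "$fstype" == "fuse.glusterfs" ]]; then
            backed=yes
        fi
    done < "$mounts"
    if [[ "$backed" != yes ]]; then
        echo "refusing bind mount: $shared is not a glusterfs mount" >&2
        return 1
    fi
    # Also refuse sources that live outside the shared storage tree.
    case "$src" in
        "$shared"/*) ;;
        *) echo "refusing bind mount: $src is not under $shared" >&2
           return 1 ;;
    esac
    mount --bind "$src" "$dst"
}
```

For example, `bind_from_shared /gluster/shared /gluster/shared/home /home` would refuse to act while `/gluster/shared` is still a plain local directory.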
Erik Jacobson
2019-Nov-05 15:39 UTC
[Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
On Tue, Nov 05, 2019 at 05:05:08AM +0200, Strahil wrote:
> Sure,
>
> Here is what the setup was:

Thank you! You're very kind to send me this. I will verify it with my
setup soon. Hoping to rid myself of these dep problems.

Thank you!!!

Erik
Erik Jacobson
2019-Nov-09 18:15 UTC
[Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
> Here is what the setup was:

I thought I'd share an update in case it helps others. Your ideas
inspired me to try a different approach.

We support 4 main distros (and 2 variants of some). We try not to
provide our own versions of distro-supported packages like CTDB where
possible. So a concern for me in modifying services is that they could
be replaced in package updates. There are ways to mitigate that, but that
thought combined with your ideas led me to try this:

- Be sure the ctdb service is disabled.
- Added a systemd service of my own, oneshot, that runs a helper script.
- The helper script first ensures the gluster volumes show up (I use
  localhost in my case and besides, in our environment, we don't want
  CTDB to have a public IP anyway until NFS can be served, so this helps
  there too).
- Even with the gluster volume showing good, during init startup, first
  attempts to mount gluster volumes fail. So the helper script keeps
  looping until they work. It seems they work on the 2nd try (after a 3s
  sleep at failure).
- Once the mounts are confirmed working and mounted, my helper starts
  the ctdb service.
- Awkward CTDB problems (where the lock check sometimes fails to detect
  a lock problem) are avoided since we won't start CTDB until we're 100%
  sure the gluster lock is mounted and pointing at gluster.

The above is working in prototype form so I'm going to start adding my
bind mounts to the equation. I think I have a solution that will work
now and I thank you so much for the ideas. I'm taking things from
prototype form now on to something we can provide people.

With regards to pacemaker: there are a few pacemaker solutions that I've
touched, and one I even helped implement. Now, it could be that I'm not
an expert at writing rules, but pacemaker seems to have often given us
more trouble than the problem it solves. I believe this is due to the
complexity of the software and the power of it. I am not knocking
pacemaker. However, a person really has to be a pacemaker expert to not
make a mistake that could cause downtime. So I have attempted to avoid
pacemaker in the new solution. I know there are downsides -- fencing is
there for a reason -- but as far as I can tell the decision has been
right for us. CTDB is less complicated even if it does not provide 100%
true full HA abilities.

That said, in the solution, I've been careful to future-proof a move to
pacemaker. For example, on the gluster servers/NFS servers, I bring up
IP aliases (interfaces) on the network where the BMCs reside so we're
seamlessly able to switch to pacemaker with IPMI/BMC/redfish fencing
later if needed without causing too much pain in the field with deployed
servers.

I do realize there are tools to help configure pacemaker for you. Some
that I've tried have given me mixed results, perhaps due to the
complexity of networking setup in the solutions we have.

As we start to deploy this to more locations, I'll gain a feel for
whether a move to pacemaker is right or not. I just share this in the
interest of learning. I'm always willing to learn and improve if I've
overlooked something.

Erik
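The helper flow described in this message (wait for the volume, retry the mount, verify it is really gluster, then start ctdb) could be sketched along the following lines. This is not the actual helper script, only an illustration under stated assumptions: the volume name `ctdb`, the mount point `/gluster/lock`, the retry counts, and all function names are placeholders. Only the 3-second sleep between attempts and the use of localhost come from the description.

```shell
#!/bin/bash
# Placeholders; override from the environment as needed.
VOLNAME="${VOLNAME:-ctdb}"
LOCK_MOUNT="${LOCK_MOUNT:-/gluster/lock}"
RETRY_SLEEP="${RETRY_SLEEP:-3}"

# Succeed if $1 appears as a fuse.glusterfs mount point in the mount table
# $2 (default /proc/mounts).
is_gluster_mount() {
    local target="$1" mounts="${2:-/proc/mounts}"
    local dev mnt fstype rest
    if [[ "$target" != "/" ]]; then target="${target%/}"; fi
    while read -r dev mnt fstype rest; do
        [[ "$mnt" == "$target" && "$fstype" == "fuse.glusterfs" ]] && return 0
    done < "$mounts"
    return 1
}

# Run "$@" up to $1 times, sleeping RETRY_SLEEP seconds between failures.
retry() {
    local attempts="$1" i
    shift
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        if ((i < attempts)); then sleep "$RETRY_SLEEP"; fi
    done
    return 1
}

# Ask the local glusterd whether the volume is known yet.
volume_visible() {
    gluster volume info "$VOLNAME" >/dev/null 2>&1
}

mount_lock() {
    is_gluster_mount "$LOCK_MOUNT" && return 0
    mount -t glusterfs "localhost:/$VOLNAME" "$LOCK_MOUNT" || return 1
    # Trust the mount table, not the exit code alone.
    is_gluster_mount "$LOCK_MOUNT"
}

main() {
    retry 60 volume_visible || { echo "volume $VOLNAME not visible" >&2; return 1; }
    retry 20 mount_lock || { echo "$LOCK_MOUNT is not gluster-backed" >&2; return 1; }
    systemctl start ctdb.service
}
```

A oneshot unit ordered after glusterd.service would run a script like this and call `main` at the end. Because ctdb.service stays disabled and untouched, a package update that replaces the distro unit file does not undo the ordering logic.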