thr3ads.net - samba - [Samba] Intermittent Event Script Timeouts on CTDB Cluster Nodes [Dec 2014]


Howard, Stewart Jameson

2014-Dec-12 19:39 UTC

[Samba] Intermittent Event Script Timeouts on CTDB Cluster Nodes

Hi All,

I've got a CTDB cluster, managing NFSv3 and Samba, sitting in front of a
GPFS storage cluster.  The NFSv3 piece is carrying some pretty heavy traffic at
peak load.  About once every three to four days, CTDB has been exhibiting
behaviors that result in IP-failover between two nodes for reasons that are
currently unknown.  The exact chain of events has been a little different each
time this has happened, so a comprehensive summary is difficult.  However, I
will attempt to present the highlights below:

1)  CTDB begins to flap on an affected cluster node.  Clients connected to that
node see the NFS server not responding.

2)  After the fail-over, the CTDB log on the affected node is full of complaints
about event scripts timing out.  The first script to time out *seems* always to
be `61.nfstickle` or rpcinfo itself (perhaps rpcinfo running under the authority
of the CTDB event scripts), followed by timing out of event scripts related to
releasing IPs for takeover.  Additionally, around the event, we see the
IP-receiving peer logging a lot of errors about CTDB control traffic timing out.

3)  Fail-back is attended by similar difficulties.  During the last fail-back
procedure (12/09), clients experienced instability (mounts not responding,
bizarre permissions errors) while the CTDB hosts continued to complain in their
logs about timed-out event scripts related to taking or releasing IPs.  Finally,
CTDB seems to get fed up and restarts statd and nfsd, and all goes back to normal.

4)  During two of these failure events, CTDB on one of the nodes has actually
*died* and had to be restarted for fail-back to occur.

So, in broad strokes, those are the kind of events that I've been seeing in
this cluster.  My theory about the cause of this had *previously* centered
around load-induced conditions.  While this is still a possibility, digging in
the logs and config files has led me to develop another theory: namely, that
a misconfiguration of statd is causing monitored clients not to appear in shared
storage, which in turn causes fatal confusion during some failover events.
This theory would postulate that the problematic failovers follow the reboot of
some client on the network, while the successful failovers happen after the
connections of all rebooted clients have been reset.  The
specific configuration option that makes me think statd is misconfigured is this
one from /etc/sysconfig/nfs:

"""
STATD_HOSTNAME="$NFS_HOSTNAME -H /etc/ctdb/statd-callout -p 97"
"""

...I notice that the -P parameter is missing from this string, which is
described in `man rpc.statd` as follows:

"""
       -P, --state-directory-path pathname
              Specifies the pathname of the parent directory where NSM state
              information resides.  If this option is not specified, rpc.statd
              uses /var/lib/nfs/statd by default.
"""

...Also, I know that this parameter string is getting passed to the actual statd
invocation, along with an extraneous port specifier, because these messages also
appear in /var/log/log.ctdb:

"""
ERROR: STATD is not responding. Trying to restart it. [rpc.statd  -n
myservice.tld -H /etc/ctdb/statd-callout -p 97 -p 595 -o 596]
"""
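
For anyone wanting to check their own logs for the same pattern, here is a minimal sketch that tokenizes the logged rpc.statd command line and reports flags given more than once, plus whether a state directory option is present.  The function name `audit_statd_args` is made up for illustration, and the sketch says nothing about which of the repeated `-p` values rpc.statd actually honors.

```python
import shlex
from collections import Counter


def audit_statd_args(cmdline: str) -> dict:
    """Report repeated flags and whether -P/--state-directory-path is present.

    Hypothetical helper: it only inspects the text of the command line and
    assumes option values never begin with a dash.
    """
    tokens = shlex.split(cmdline)
    # Treat anything starting with "-" as a flag; strip "=value" from long options.
    flags = [t.split("=", 1)[0] for t in tokens if t.startswith("-") and len(t) > 1]
    counts = Counter(flags)
    return {
        "repeated_flags": sorted(f for f, n in counts.items() if n > 1),
        "has_state_dir": any(f in ("-P", "--state-directory-path") for f in flags),
    }


# Command line copied from the log.ctdb message quoted above.
logged = "rpc.statd  -n myservice.tld -H /etc/ctdb/statd-callout -p 97 -p 595 -o 596"
print(audit_statd_args(logged))
```

On the logged command line this flags `-p` as repeated and shows no state directory option, which matches what can be read off the message by eye.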

...Looking in /var/lib/nfs/statd, I do see some clients listed in that
directory.  However, /etc/sysconfig/nfs also has the following variable
definition:

"""
STATD_SHARED_DIRECTORY=/gs/var/nfs/rfs_shared
"""

So, I'm now wondering if statd is looking in different places at different
times for clients to monitor and, in some cases, IP-receiving peers are not able
to update their lists of monitored nodes.
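
One way to test this would be to compare the client entries in the two locations on each node.  The sketch below is only a rough illustration: it assumes each monitored client appears as one file per directory, that the local entries live in the `sm` subdirectory of /var/lib/nfs/statd, and that the shared directory is flat.  The real layout under STATD_SHARED_DIRECTORY may well differ (for example, with per-address subdirectories), and the function names are hypothetical.

```python
from pathlib import Path


def monitored_clients(directory) -> set:
    """Return the names of regular files in `directory` (assumed: one per client)."""
    d = Path(directory)
    if not d.is_dir():
        # A missing directory is treated as "no clients recorded here".
        return set()
    return {p.name for p in d.iterdir() if p.is_file()}


def compare_monitor_lists(local_dir, shared_dir) -> dict:
    """List clients recorded in only one of the two directories."""
    local = monitored_clients(local_dir)
    shared = monitored_clients(shared_dir)
    return {
        "only_local": sorted(local - shared),
        "only_shared": sorted(shared - local),
    }


# Paths taken from the post; the "sm" subdirectory and the flat layout of the
# shared directory are assumptions, not confirmed details of this cluster.
print(compare_monitor_lists("/var/lib/nfs/statd/sm", "/gs/var/nfs/rfs_shared"))
```

Clients that show up under `only_local` on a node would be consistent with the theory that the IP-receiving peer never learns about them.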

Additionally, I'm wondering if anyone on this list has had a similar
experience with CTDB.  Also, I'm wondering what the list makes of my current
theory regarding the cause of these problems, or if anyone would like to advance
an alternate theory if my own is not sound.

Thank you so much for all of your help!

Stewart Howard
