I'm working the kinks out of a new director-based setup for the eventual
migration away from courier. At this point, with everything basically
working, I'm trying to ensure that things are properly monitored, and I've
run into an issue. There doesn't appear to be a way, apart from the logs,
to get dovecot to tell you whether it is connected and properly synced
with the other director servers in the ring. This seems like an important
piece of information: without it, it isn't apparent how you would tell
that your director servers have lost track of each other.
I'm also curious what people are doing to health check their director
servers when they are running load balancing upstream of them. It doesn't
seem like a good idea to let the load balancers check all the way through
to the real servers, since a failure on the target real server could lead
to a director being dropped from the pool (and if so, the other directors
would most likely be dropped as well). Otherwise, the health check failure
tolerance at the load balancer must be greater than the director's
tolerance for failure of the real servers, which means a dead director
could stay in the pool for longer than desired, or at any rate long enough
to be sure that it isn't a transient failure on the real server behind it.
A better method would seem to be for the load balancers to query the
director for the number of active back-end servers and, so long as it is
over a given threshold, to assume that the director is otherwise able to
do its job, relying on external monitoring to pick up internal failures
where dovecot isn't able to successfully proxy the connection to one of
the real servers.
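To make the threshold idea concrete, here is a rough sketch of what I have in mind, not something from a working deployment. It counts the back ends listed by `doveadm director status` that have a nonzero vhost count and compares that count against a minimum. The column layout (IP, vhosts, users), the function names, and the threshold of 2 are all assumptions for illustration; the output format may differ between dovecot versions.

```python
import subprocess


def count_active_backends(status_text, vhosts_col=1):
    """Count rows whose vhosts column is a positive integer.

    Assumes each data row looks like "<ip> <vhosts> <users>"; the header
    line and anything else that does not parse is skipped.
    """
    active = 0
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) <= vhosts_col:
            continue
        try:
            vhosts = int(fields[vhosts_col])
        except ValueError:
            continue  # header line or unexpected content
        if vhosts > 0:
            active += 1
    return active


def director_is_healthy(status_text, min_backends=2):
    """True if at least min_backends back ends are active (threshold is arbitrary)."""
    return count_active_backends(status_text) >= min_backends


def check_local_director(min_backends=2):
    """Run doveadm on this director and apply the threshold.

    Any failure to run the command is treated as unhealthy.
    """
    try:
        result = subprocess.run(
            ["doveadm", "director", "status"],
            capture_output=True, text=True, timeout=5, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return director_is_healthy(result.stdout, min_backends)
```

The load balancer would then need some way to reach the result of `check_local_director()`, for example a small wrapper script whose exit status it polls. This only says that the director believes it has enough back ends; it says nothing about whether proxying to them actually works, which is the part left to external monitoring.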
So, how are people doing this in the real world?
--
Kelsey Cummings - kgc at corp.sonic.net sonic.net, inc.
System Architect 2260 Apollo Way
707.522.1000 Santa Rosa, CA 95407