thr3ads.net - dovecot - poolmon improvements [Aug 2014]

Timo Sirainen

2014-Aug-11 16:59 UTC

poolmon improvements

I've been planning to improve poolmon failure checking for a long time
already, but I still haven't managed to get to it. Maybe somebody else has
more time, so here's a feature request for anyone to implement:

poolmon currently gives up immediately if the first check to any service fails.
It really should be trying multiple times over multiple seconds before giving
up. I think ideally it would be:

 - Individual check timeout could still be the default 5 seconds
 - Add a full check time setting, which could be e.g. 15 seconds. If all checks
fail during this time, disable the host.
 - If a check fails because the connection gets rejected, retry quickly, e.g.
after 0.1 seconds
 - If a check fails because of protocol errors, wait longer before retrying,
e.g. 1 second
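
As a rough illustration of the retry logic in the list above, here is a minimal
sketch in Python. It is not poolmon code: the names host_is_alive, CheckResult
and check are hypothetical, and check() is assumed to enforce the individual
5 second timeout itself. Treating every failure other than a rejected
connection (including a timeout) as a slow retry is also an assumption.

```python
import time
from enum import Enum


class CheckResult(Enum):
    """Hypothetical outcomes of one service check (not poolmon's own types)."""
    OK = "ok"
    REJECTED = "rejected"            # connection was refused
    PROTOCOL_ERROR = "protocol"      # connected, but the reply was bad or late


def host_is_alive(check, full_check_time=15.0, reject_delay=0.1,
                  error_delay=1.0, clock=time.monotonic, sleep=time.sleep):
    """Retry check() until it succeeds or the full check time runs out.

    check() is assumed to apply its own per-check timeout (e.g. 5 seconds)
    and to return a CheckResult. Returns True if any check succeeded,
    False if the host should be disabled.
    """
    deadline = clock() + full_check_time
    while True:
        result = check()
        if result is CheckResult.OK:
            return True
        # Rejected connections are cheap to retry; other errors back off more.
        delay = reject_delay if result is CheckResult.REJECTED else error_delay
        if clock() + delay >= deadline:
            return False  # no time left for another attempt
        sleep(delay)
```

With these defaults, a backend that refuses connections for a moment during a
restart gets retried every 0.1 seconds, while one that keeps returning protocol
errors is checked about once a second until the 15 seconds are used up.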

So this would avoid a backend being removed in situations where it really
shouldn't be removed:

 - Dovecot restarts
 - Dovecot reloads
 - load spikes and other random issues that cause temporary problems

Load spikes in particular are an annoying issue, which my plan doesn't even
fully solve. The solution to fix a heavily overloaded cluster isn't really
to start removing all of its backends that are busy working.
