I've been planning to improve poolmon's failure checking for a long time, but I still haven't managed to get to it. Maybe somebody else has more time, so here's a feature request for anyone to implement:

poolmon currently gives up immediately if the first check to any service fails. It really should retry multiple times over several seconds before giving up. I think ideally it would be:

- The individual check timeout could still be the default 5 seconds.
- Add a full check time setting, which could be e.g. 15 seconds. If all checks fail during this time, disable the host.
- If a request fails because the connection gets rejected, retry quickly, e.g. after 0.1 seconds.
- If a check fails because of protocol errors, wait longer before retrying, e.g. 1 second.

This would avoid a backend being removed in situations where it really shouldn't be removed:

- Dovecot restarts
- Dovecot reloads
- load spikes and other random issues that cause temporary problems

The load spike is an especially annoying issue, and my plan doesn't even fully solve it. The solution to a heavily overloaded cluster isn't really to start removing all of its backends that are busy working.
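To make the proposed retry behavior concrete, here is a rough sketch in Python. It is not poolmon code: the names `host_is_up`, `ProtocolError`, and all parameter names are made up for illustration, and `check` stands in for whatever performs a single service check. It also assumes that timeouts and other socket errors get the same longer delay as protocol errors, which the proposal does not specify.

```python
import time


class ProtocolError(Exception):
    """Hypothetical error: the service answered, but not as expected."""


def host_is_up(check, total_time=15.0, check_timeout=5.0,
               refused_delay=0.1, protocol_delay=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Return True if any check succeeds within total_time seconds.

    check(timeout) must return normally on success and raise
    ConnectionRefusedError, ProtocolError, or another OSError on failure.
    """
    deadline = clock() + total_time
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            # Every check during the full check time failed: disable the host.
            return False
        try:
            # A single check never runs past the overall deadline.
            check(min(check_timeout, remaining))
            return True
        except ConnectionRefusedError:
            # Connection rejected (e.g. during a restart): retry quickly.
            delay = refused_delay
        except (ProtocolError, OSError):
            # Assumption: timeouts and other socket errors are treated
            # like protocol errors and get the longer delay.
            delay = protocol_delay
        # Sleep before the next attempt, but not beyond the deadline.
        sleep(max(0.0, min(delay, deadline - clock())))
```

The `clock` and `sleep` parameters exist only so the loop can be tested without real waiting. A caller would disable the host only when `host_is_up(...)` returns False.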