Hello List

I have a strange problem here which I am trying to analyse, but I'm stuck.
Maybe someone has a hint?

What happened:
A few weeks ago one of the LDAPS servers, which is not maintained by us,
crashed. From that moment on, users could still log in to check their
email, but they were not able to send any email through Postfix (which
uses smtpd_sasl_type = dovecot).

What I do not understand is why Dovecot did not switch to the second
configured LDAPS server. It looks like it kept trying forever to reconnect
to the crashed LDAP server.

From the moment of the crash we see a lot of Errors like these in our
logfiles:
Nov 30 16:51:53 servername dovecot: [ID 583609 mail.error] auth: Error:
ldap(userone,USERS_IP1,<WKiTeBUJQQBUSAnE>): Connection appears to be
hanging, reconnecting
AND
Nov 30 16:51:59 servername dovecot: [ID 583609 mail.error] auth: Error:
plain(usertwo,USERS_IP2,<QgJvcBUJqABTTWrJ>): Request 1982.83548 timed
out after 151 secs, state=1
The Dovecot version in use is 2.2.13, running on a Solaris 10 system, and
the configuration for passdb and userdb is:
passdb {
  args = /etc/dovecot-ldap.conf
  default_fields =
  deny = no
  driver = ldap
  master = no
  name =
  override_fields =
  pass = no
  result_failure = continue
  result_internalfail = continue
  result_success = return-ok
  skip = never
}
userdb {
  args = /etc/dovecot-ldap.conf
  default_fields =
  driver = ldap
  name =
  override_fields =
  result_failure = continue
  result_internalfail = continue
  result_success = return-ok
  skip = never
}
And the dovecot-ldap.conf contains (obfuscated):
uris = ldaps://server2.tld ldaps://server1.tld ldaps://server4.tld ldaps://server3.tld
dn = ...
dnpass = ...
ldap_version = 3
auth_bind = yes
base = ...
scope = onelevel
user_attrs = homeDirectory=home,uidNumber=uid,gidNumber=gid
user_filter = ...
pass_attrs = uid=user
pass_filter = ...

The strange thing is that with the very same binaries and configuration
(okay, some minimal modifications were made to bind to the correct
interfaces...) a test on our test system works as it should.

When we shut down slapd, Dovecot recognizes it and connects to the
alternate LDAPS server. When we shut down slapd and start a netcat (just to
have something listening without responding)... you guessed it: Dovecot
recognizes that too and switches over to the alternate test system.

So on our test system everything worked as it should, but the production
system did not. And since the LDAPS servers are not maintained by us, it is
somewhat hard to reproduce anything.

At least I got the logfiles from server2.tld and server1.tld, but they
only show what I already knew: our server was connected to server2.tld
until the crash happened, and server1.tld never got any connection.

Does anyone have an idea what I could try, to find out why Dovecot did not
switch to server1.tld?

Best regards
Matthias Egger
--
Matthias Egger
ETH Zurich
Department of Information Technology and Electrical Engineering
IT Support Group (ISG.EE), ETL/F/24.1
Physikstrasse 3, CH-8092 Zurich
maegger at ee.ethz.ch
Phone +41 (0)44 632 03 90
Fax +41 (0)44 632 11 95

On 16/12/14 16:30, Matthias Egger wrote:
> What happened:
> A few weeks ago one of the LDAPS Servers which is not maintained by us
> has crashed. From that moment on, users could still login to check their
> emails, but they were not able to send any email through postfix (which
> uses smtpd_sasl_type = dovecot)
>
> What i do not understand, is why did dovecot not switch to the second
> configured LDAPS Server? It looks like it retried for ever to reconnect
> to the crashed LDAP Server.

This is speculation, but what has happened to us in the past is that the
LDAP server stopped responding to queries, but the TCP socket was still
open for connections. A new TCP connection would be established, but the
daemon would not be notified of it.

So, depending on precisely how the first LDAP server crashed, it may not
be the same test as killing the process, but closer to sending it 'kill
-STOP' (and then 'kill -CONT' afterwards, obviously).

Simon.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
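
For anyone who wants to see the condition Simon describes without touching a real slapd, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names `make_hung_listener` and `classify` are made up for this example, and a listening socket that never calls accept() is used as a stand-in for a process stopped with 'kill -STOP'. The kernel still completes the TCP handshake, so the client connects fine but never gets an answer.

```python
import socket


def make_hung_listener():
    """Bind a listening socket that never calls accept().

    The kernel still completes TCP handshakes for it, which is roughly
    what a client sees when the server process has been stopped.
    (Hypothetical helper, for illustration only.)
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(5)
    return srv


def classify(host, port, timeout=2.0):
    """Return 'refused', 'hung' or 'alive' for a TCP endpoint."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "refused"  # nothing listening, or host unreachable
    with conn:
        try:
            conn.sendall(b"\x00")  # arbitrary byte; only a reaction matters
            conn.recv(1)           # data or EOF both mean a process reacted
        except socket.timeout:
            return "hung"          # connected, but nobody answers
        except OSError:
            pass                   # reset by peer: a process reacted
        return "alive"
```

A plain "is the port open?" check reports such a server as healthy, which may be why the shut-down-slapd test and the stopped-process case can behave so differently.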

Hello Simon

On 12/16/2014 05:38 PM, Simon Fraser wrote:
> This is speculation, but what has happened to us in the past is that the
> LDAP server stopped responding to queries, but the TCP socket was still
> open for connections. A new TCP connection would be established, but the
> daemon would not be notified of it.
>
> So, depending on precisely how the first LDAP server crashed, it may not
> be the same test as killing the process, but closer to sending it 'kill
> -STOP' (and then 'kill -CONT' afterwards, obviously)

Thank you very much for that hint. You were right. When I send SIGSTOP to
slapd, Dovecot behaves much as it did a few weeks ago.

So do you (or anyone else) have a hint on how I could work around such a
situation?

I found a statement from Timo Sirainen from June 2011:
http://www.dovecot.org/pipermail/dovecot/2011-June/059905.html

"...Fallbacking to another LDAP server is done by OpenLDAP internally..."

So I thought there should be a way to "tweak" the ldap.conf. I then found
a German post:
https://listen.jpberlin.de/pipermail/dovecot/2014-June/000506.html
where someone mentioned some ldap.conf settings:

BIND_POLICY soft
TIMELIMIT 5
NETWORK_TIMEOUT 5
TIMEOUT 8

and a link to:
http://www.linuxquestions.org/questions/linux-enterprise-47/ldap-failover-timeout-client-setting-847718/
which also uses these two settings:

BIND_TIMELIMIT 10
IDLE_TIMELIMIT 10

I gave them a try, but the result was still the same: Dovecot (or rather
OpenLDAP) does not switch to another LDAP server.

Best regards
Matthias
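
One possible way to at least detect this situation from outside is a small probe that walks the same URI list and treats a server as usable only if it completes a TLS handshake within a deadline. The following Python sketch is an assumption-laden illustration, not something Dovecot or OpenLDAP provides: `tls_responds` and `first_responsive` are hypothetical names, port 636 is assumed as the LDAPS default, and a certificate that fails verification is counted as "not responsive" for simplicity.

```python
import socket
import ssl
from urllib.parse import urlparse


def tls_responds(host, port, timeout=5.0):
    """True if a TLS handshake with host:port completes.

    Each blocking step (connect, handshake) is bounded by `timeout`.
    A stopped server accepts the TCP connection but never answers the
    ClientHello, so the handshake times out and this returns False.
    """
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # Covers refused connections, timeouts and TLS errors alike
        # (including certificate verification failures).
        return False


def first_responsive(uris, timeout=5.0):
    """Return the first ldaps:// URI whose server completes a handshake."""
    for uri in uris:
        parsed = urlparse(uri)
        # Assumption: 636 is used when the URI carries no explicit port.
        if tls_responds(parsed.hostname, parsed.port or 636, timeout):
            return uri
    return None
```

Called with the list from dovecot-ldap.conf, for example `first_responsive(["ldaps://server2.tld", "ldaps://server1.tld"])`, it would skip a server that is hanging in the way described above. What to do with the result is left open here.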