Kelsey Cummings
2012-Sep-03  18:26 UTC
[Dovecot] TIMO HELP! director ring wont stay connected
I've had 2x director ring up and running with production load on 2.1.8 
with around 10,000 active connections for two weeks and everything has 
been working great - until this morning.
There isn't anything obvious in the logs beyond the fact that the 
director connections started bouncing.  It was not resolved by reloads 
or restarts or an upgrade to 2.1.9 (only the directors.)
I've dropped one of the servers out of the ring to prevent suffering but 
this is a less than ideal situation.
Any idea what is going on?  Load today is consistent with low weekend 
load since it is a major US holiday so this wouldn't appear to be a load 
related issue.
auth_master_user_separator = *
auth_username_format = %Ln
auth_verbose = yes
default_client_limit = 10000
director_mail_servers = 1.1.1.1 1.1.1.2 1.1.1.3 1.1.1.4
director_servers = a.director.foo b.director.foo
disable_plaintext_auth = no
doveadm_proxy_port = 1842
login_trusted_networks = 10.0.0.1
mbox_write_locks = fcntl
passdb {
   args = /etc/dovecot/master-users
   driver = passwd-file
   master = yes
   pass = yes
}
passdb {
   args = proxy=y nopassword=y
   driver = static
}
service anvil {
   client_limit = 20103
}
service auth {
   client_limit = 41704
}
service director {
   fifo_listener login/proxy-notify {
     mode = 0666
   }
   inet_listener {
     port = 9321
   }
   unix_listener login/director {
     mode = 0666
   }
}
service imap-login {
   executable = imap-login director
   process_limit = 20000
   process_min_avail = 32
}
service imap {
   process_limit = 20480
}
service pop3-login {
   executable = pop3-login director
   process_limit = 20000
   process_min_avail = 32
}
ssl_ca = </etc/dovecot/ssl/gd_bundle.crt
ssl_cert = </etc/dovecot/ssl/imap.sonic.net.crt
ssl_key = </etc/dovecot/ssl/imap.sonic.net.key
ssl_parameters_regenerate = 1 days
syslog_facility = local0
protocol imap {
   ssl_ca = </etc/dovecot/ssl/gd_bundle.crt
   ssl_cert = </etc/dovecot/ssl/imap.sonic.net.crt
   ssl_key = </etc/dovecot/ssl/imap.sonic.net.key
}
protocol pop3 {
   ssl_ca = </etc/dovecot/ssl/pop.sonic.net.ca-bundle
   ssl_cert = </etc/dovecot/ssl/pop.sonic.net.crt
   ssl_key = </etc/dovecot/ssl/pop.sonic.net.key
}
 >Sep  3 09:22:42 b.director. b dovecot: director: Error: Director 
10.10.10.71:9321/right disconnected
 >Sep  3 09:22:45 a.director. a dovecot: director: Error: Director 
10.10.10.37:9321/left disconnected
 >Sep  3 09:22:49 b.director. b dovecot: director: Error: Director 
10.10.10.71:9321/left disconnected
 >Sep  3 09:22:53 b.director. b dovecot: director: Error: Director 
10.10.10.71:9321/left disconnected
 >Sep  3 09:22:54 a.director. a dovecot: director: Error: Director 
10.10.10.37:9321/left disconnected
 >Sep  3 09:22:59 b.director. b dovecot: director: Error: Director 
10.10.10.71:9321/left disconnected
 >Sep  3 09:23:02 a.director. a dovecot: director: Error: Director 
10.10.10.37:9321/right disconnected
 >Sep  3 09:23:02 a.director. a dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director left
 >Sep  3 09:23:02 b.director. b dovecot: director: Error: Director 
10.10.10.71:9321/right disconnected
 >Sep  3 09:23:02 b.director. b dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director left
 >Sep  3 09:23:32 a.director. a dovecot: director: Error: director: User 
foo host lookup failed: Timeout - queued for 47 secs (Ring synced for 30 
secs, weak user, user refreshed 47 secs ago)
 >Sep  3 09:23:32 a.director. a dovecot: director: Error: director: User 
bar host lookup failed: Timeout - queued for 38 secs (Ring synced for 30 
secs, weak user, user refreshed 38 secs ago)
 >Sep  3 09:23:32 a.director. a dovecot: director: Error: director: User 
bla host lookup failed: Timeout - queued for 30 secs (Ring synced for 30 
secs)
 >Sep  3 09:23:32 a.director. a dovecot: director: Error: director: User 
bla2 host lookup failed: Timeout - queued for 30 secs (Ring synced for 
30 secs)
 >Sep  3 09:23:32 a.director. a dovecot: director: Error: Director 
10.10.10.37:9321/right disconnected
 >Sep  3 09:23:32 a.director. a dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director left
 >Sep  3 09:23:32 a.director. a dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director left
 >Sep  3 09:23:32 b.director. b dovecot: director: Error: Director 
10.10.10.71:9321/out disconnected before handshake finished
 >Sep  3 09:23:32 b.director. b dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director left
 >Sep  3 09:24:02 b.director. b dovecot: director: Error: Director 
10.10.10.71:9321/out disconnected before handshake finished
 >Sep  3 09:24:02 b.director. b dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director left
 >Sep  3 09:24:05 a.director. a dovecot: director: Warning: Delaying new 
user requests until ring is synced
 >Sep  3 09:24:32 a.director. a dovecot: director: Warning: Ring is 
synced, continuing delayed requests
 >Sep  3 09:24:41 b.director. b dovecot: director: Error: Director 
10.10.10.71:9321/right disconnected
 >Sep  3 09:24:41 b.director. b dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director left
 >Sep  3 09:24:41 a.director. a dovecot: director: Error: Director 
10.10.10.37:9321/right disconnected
 >Sep  3 09:24:41 a.director. a dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director left
 >Sep  3 09:25:11 b.director. b dovecot: director: Error: User hash 
2285697953 is being redirected to two hosts: 10.10.10.39 and 10.10.10.76 
(old_ts=1346689481,handshaking,recv_ts=1346689467)
 >Sep  3 09:25:12 b.director. b dovecot: director: Error: User hash 
623192092 is being redirected to two hosts: 10.10.10.76 and 10.10.10.39 
(old_ts=1346689481,handshaking,recv_ts=1346689468)
 >Sep  3 09:25:12 b.director. b dovecot: director: Error: User hash 
1683990717 is being redirected to two hosts: 10.10.10.43 and 10.10.10.76 
(old_ts=1346689481,handshaking,recv_ts=1346689468)
 >Sep  3 09:25:12 a.director. a dovecot: director: Error: Director 
10.10.10.37:9321/right disconnected
 >Sep  3 09:25:12 a.director. a dovecot: director: Warning: director: 
Couldn't connect to right side, we must be the only director lef
-- 
Kelsey Cummings - kgc at corp.sonic.net      sonic.net, inc.
System Architect                          2260 Apollo Way
707.522.1000                              Santa Rosa, CA 95407
Robert Schetterer
2012-Sep-03  18:32 UTC
[Dovecot] TIMO HELP! director ring wont stay connected
Am 03.09.2012 20:26, schrieb Kelsey Cummings:> I've had 2x director ring up and running with production load on 2.1.8 > with around 10,000 active connections for two weeks and everything has > been working great - until this morning.meanwhile try upgrade to 2.1.9 i cant see big changes there for director which should brake something http://hg.dovecot.org/dovecot-2.1/rev/bc86680293d2 however its better debug with latest patchlevel -- Best Regards MfG Robert Schetterer
Timo Sirainen
2012-Sep-03  19:06 UTC
[Dovecot] TIMO HELP! director ring wont stay connected
On 3.9.2012, at 21.26, Kelsey Cummings wrote:> I've had 2x director ring up and running with production load on 2.1.8 with around 10,000 active connections for two weeks and everything has been working great - until this morning. > > There isn't anything obvious in the logs beyond the fact that the director connections started bouncing. It was not resolved by reloads or restarts or an upgrade to 2.1.9 (only the directors.)Did you try stopping both and then starting them again? That clears up all the state they have.> Any idea what is going on? Load today is consistent with low weekend load since it is a major US holiday so this wouldn't appear to be a load related issue.Directors themselves think they're having trouble connecting to each others.. Annoyingly it doesn't give specific error messages about what happened. I should improve the logging.. If the state clearing doesn't help, maybe this has something to do with the OS or the network is really having some issues.
Timo Sirainen
2012-Sep-05  00:58 UTC
[Dovecot] TIMO HELP! director ring wont stay connected
On 3.9.2012, at 21.26, Kelsey Cummings wrote:> passdb { > args = proxy=y nopassword=y > driver = static > }I wonder if someone was doing a ton of logins for different usernames? This kind of setup where director doesn't verify the username can be attacked that way.
Timo Sirainen
2012-Sep-11  21:29 UTC
[Dovecot] TIMO HELP! director ring wont stay connected
On 3.9.2012, at 21.26, Kelsey Cummings wrote:> >Sep 3 09:22:42 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected > >Sep 3 09:22:45 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected > >Sep 3 09:22:49 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected > >Sep 3 09:22:53 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected > >Sep 3 09:22:54 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected > >Sep 3 09:22:59 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected > >Sep 3 09:23:02 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnectedAll of these connections had finished handshaking. They simply just seemed to disconnect the connection for no reason. I found one reason for why that could happen, which could explain all of this: http://hg.dovecot.org/dovecot-2.1/rev/24e791bbcf69 The user weakness is remembered until all directors are shut down or all directors have acknowledged the weakness.> >Sep 3 09:23:02 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director leftThe user weakness should have been removed at this point, but looks like there's code missing for it for 1-director-only setup. I'll fix that soon..> >Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User foo host lookup failed: Timeout - queued for 47 secs (Ring synced for 30 secs, weak user, user refreshed 47 secs ago) > >Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User bar host lookup failed: Timeout - queued for 38 secs (Ring synced for 30 secs, weak user, user refreshed 38 secs ago)These are the weak users causing the trouble. This improves logging: http://hg.dovecot.org/dovecot-2.1/rev/27d3289e1f5c