Sebastian Marske
2022-Feb-10 15:15 UTC
Hanging doveadm-server processes with dsync replication
Dear Dovecot mailing list,

after updating Dovecot from 2.3.16 to 2.3.18 on a dsync-replicated server, hanging doveadm-server processes pile up over time, eventually resulting in some of the affected users being shown as out of sync.

Our setup is based on FreeBSD 13, ZFS and Dovecot (we use custom packages built with poudriere) with master/master replication using dsync. However, we use a shared IP (CARP), so only one server is actually active. Please see the output of "doveconf -n" at the end for our config.

Starting from an in-sync state, I updated Dovecot on the inactive server. Occasionally, Dovecot logs messages like:

Feb 8 15:02:15 myhost dovecot[99800]: doveadm(someuser)<2090><ZLayIal3AmIqCAAADKIhQg>: Error: write(<local>) failed: Timed out after 60 seconds

These occur for an increasing number of users (maybe 30 after two days), but not for every user (there are >4800 users on that server), and only once for every affected user.

Here's some more information about the process/user from the log entry:

# doveadm replicator dsync-status
username  type         status
someuser  incremental  Waiting for dsync to finish

(the type is "incremental" for most users, but "normal" and "full" show up as well)

# doveadm replicator status someuser
username  priority  fast sync  full sync  success sync  failed
someuser  low       00:01:51   21:56:06   69:56:10      y

(again, not all are in failed state)

# top -abp 2090
  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 2090 sysdov        1  20    0    26M    15M kqread  12   0:01   0.00% doveadm-server: [<local>] (doveadm-server)

(didn't change for >1d)

# top -m io -abp 2090
  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
 2090 sysdov        363     58      0    271      2    273   0.00% doveadm-server: [<local>] (doveadm-server)

(didn't change for >1d)

# gdb -p 2090
... (gdb stuff; gdb complaining about missing debug symbols)
(gdb) bt full
#0  0x00000000416a94ca in _kevent () from /lib/libc.so.7
No symbol table info available.
#1  0x00000000419b94f3 in ?? () from /lib/libthr.so.3
No symbol table info available.
#2  0x000000004152a645 in io_loop_handler_run_internal () from /usr/local/lib/dovecot/libdovecot.so.0
No symbol table info available.
#3  0x00000000415282fa in io_loop_handler_run () from /usr/local/lib/dovecot/libdovecot.so.0
No symbol table info available.
#4  0x0000000041528138 in io_loop_run () from /usr/local/lib/dovecot/libdovecot.so.0
No symbol table info available.
#5  0x000000004148ac58 in master_service_run () from /usr/local/lib/dovecot/libdovecot.so.0
No symbol table info available.
#6  0x0000000001086431 in main ()
No symbol table info available.
(gdb)

So I guess it's waiting for something that
* doesn't happen on my system
* or that it didn't wait for in 2.3.16

From what I've seen, mails from the active server (still on 2.3.16) are replicated to this server. For non-affected users, mails are also replicated from this server to the active one. I can't tell about "outgoing" replication for affected users yet.

After downgrading back to 2.3.16, things are fine again. Most affected users jump back to being successfully synced within a couple of minutes. If not, starting replication via doveadm gets them there.

Testing 2.3.18 again, it seems that the same users are affected again.

I also tested 2.3.17 when it came out and had the same issue, paired with the ioloop issue [1], which was fixed in 2.3.18 and which I don't see anymore. The hanging doveadm processes remain, though.

Do you have any suggestions on how to resolve this?

[1] https://dovecot.org/pipermail/dovecot/2022-January/123907.html

Best regards
Sebastian

# doveconf -n
# 2.3.18 (9dd8408c18): /usr/local/etc/dovecot/dovecot.conf
# Pigeonhole version 0.5.18 (0bc28b32)
# OS: FreeBSD 13.0-RELEASE-p6 amd64
# Hostname: myhost...
auth_cache_ttl = 0
auth_username_chars = abcdefghijklmnopqrstuvwxyz01234567890@.-
auth_username_format = %n
default_client_limit = 126000
default_process_limit = 50000
default_vsz_limit = 512 M
doveadm_password = # hidden, use -P to show it
first_valid_gid = 20
first_valid_uid = 20
imap_client_workarounds = tb-extra-mailbox-sep
imap_logout_format = in=%i out=%o delflag=%{deleted} deleted=%{expunged} trashed=%{trashed} session=<%{session}>
login_trusted_networks = # imap proxy ips...
mail_gid = sysdov
mail_location = maildir:~/maildir:INDEX=/addons/index/%u:CONTROL=~/control:LAYOUT=fs
mail_plugins = acl notify replication
mail_uid = sysdov
managesieve_notify_capability = mailto
managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date index ihave duplicate mime foreverypart extracttext editheader
namespace fremdeordner {
  list = yes
  location = maildir:%%h/maildir:INDEX=/addons/index/%u/FremdeOrdner/%%u:LAYOUT=fs
  prefix = FremdeOrdner/%%u/
  separator = /
  subscriptions = no
  type = shared
}
namespace inbox {
  inbox = yes
  list = yes
  location =
  mailbox Archive {
    auto = no
    special_use = \Archive
  }
  mailbox Archives {
    special_use = \Archive
  }
  mailbox AutoCleanSpam {
    auto = subscribe
  }
  mailbox "Deleted Items" {
    special_use = \Trash
  }
  mailbox "Deleted Messages" {
    special_use = \Trash
  }
  mailbox Drafts {
    auto = subscribe
    special_use = \Drafts
  }
  mailbox Entwürfe {
    special_use = \Drafts
  }
  mailbox "Gelöschte Elemente" {
    special_use = \Trash
  }
  mailbox "Gesendete Elemente" {
    special_use = \Sent
  }
  mailbox Junk {
    special_use = \Junk
  }
  mailbox Sent {
    auto = subscribe
    special_use = \Sent
  }
  mailbox "Sent Items" {
    special_use = \Sent
  }
  mailbox "Sent Messages" {
    special_use = \Sent
  }
  mailbox Trash {
    auto = subscribe
    special_use = \Trash
  }
  mailbox name {
    special_use = \Drafts \Junk \Sent \Trash \Archive
  }
  prefix =
  separator = /
  subscriptions = yes
  type = private
}
passdb {
  args = /usr/local/etc/dovecot/deny-users
  deny = yes
  driver = passwd-file
}
passdb {
  args = failure_show_msg=yes dovecot
  driver = pam
}
plugin {
  acl = vfile
  acl_shared_dict = file:/addons/acl/shared-folder
  mail_replica = tcp:myreplica...:12345
  sieve = /addons/sieve/%u.sieve
  sieve_dir =
  sieve_extensions = +imap4flags +editheader
  sieve_vacation_dont_check_recipient = yes
}
protocols = imap lmtp
replication_dsync_parameters = -d -n 'inbox' -l 30 -U
replication_max_conns = 150
service aggregator {
  fifo_listener replication-notify-fifo {
    user = sysdov
  }
  unix_listener replication-notify {
    user = sysdov
  }
}
service anvil {
  client_limit = 60003
  unix_listener anvil-auth-penalty {
    mode = 00
  }
  unix_listener anvil {
    group = nagios
    mode = 0660
  }
}
service auth {
  client_limit = 126000
  unix_listener auth-userdb {
    mode = 0644
    user = sysdov
  }
}
service config {
  unix_listener config {
    user = sysdov
  }
}
service doveadm {
  inet_listener {
    port = 12345
  }
  user = sysdov
  vsz_limit = 1 G
}
service imap-login {
  client_limit = 20000
  process_limit = 10000
  process_min_avail = 138
  service_count = 0
}
service imap {
  process_limit = 80000
  process_min_avail = 10
  vsz_limit = 512 M
}
service lmtp {
  client_limit = 1
  executable = lmtp -L
  process_min_avail = 20
  unix_listener /var/spool/postfix/private/dovecot-lmtp {
    group = postfix
    mode = 0660
    user = sysdov
  }
}
service replicator {
  process_min_avail = 1
  unix_listener replicator-doveadm {
    mode = 0600
    user = sysdov
  }
}
ssl_cert = </usr/local/etc/letsencrypt/live/myhost/fullchain.pem
ssl_key = # hidden, use -P to show it
ssl_prefer_server_ciphers = yes
userdb {
  default_fields = mail_replica=tcp:myreplica...:12345
  driver = passwd
  override_fields = uid=29 gid=29 blocking=yes
}
verbose_proctitle = yes
protocol imap {
  login_trusted_networks = # imap proxy ips...
  mail_max_userip_connections = 25
  mail_plugins = acl notify replication acl imap_acl
}
protocol lmtp {
  mail_plugins = acl notify replication sieve
  postmaster_address = postmaster@...
  sendmail_path = /usr/local/sbin/sendmail
}
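
For anyone trying to track the same symptom over time: the list of stuck users can be pulled out of the "doveadm replicator dsync-status" output quoted earlier in this mail. Below is a minimal sketch, assuming the three-column layout (username, type, status) shown there and that stuck users carry exactly the status "Waiting for dsync to finish". The function name stuck_users and the second sample row are made up for illustration, and the output format may differ between Dovecot versions.

```python
# Sketch: list users whose dsync is stuck, from the text printed by
# "doveadm replicator dsync-status". Assumes whitespace-separated
# columns: username, type, then the status as the rest of the line.

WAITING = "Waiting for dsync to finish"

# First data row is taken from the report; the second row is a
# hypothetical example of a non-stuck entry.
SAMPLE = """\
username  type         status
someuser  incremental  Waiting for dsync to finish
otheruser full         Done
"""


def stuck_users(dsync_status_text):
    """Return (username, sync_type) pairs whose status equals WAITING."""
    stuck = []
    for line in dsync_status_text.splitlines():
        # Split into at most three fields so the status keeps its spaces.
        fields = line.split(None, 2)
        if len(fields) < 3 or fields[0] == "username":
            continue  # skip blank or short lines and the header row
        user, sync_type, status = fields
        if status.strip() == WAITING:
            stuck.append((user, sync_type))
    return stuck


if __name__ == "__main__":
    for user, sync_type in stuck_users(SAMPLE):
        print(f"{user}\t{sync_type}")
```

To use it against a live server, replace SAMPLE with the captured command output (for example read from a file or from standard input). Running it periodically and comparing the lists would show whether the same users get stuck again after each upgrade.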