Reuben Farrelly
2020-Jun-13 02:38 UTC
Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
Hi, I've been seeing errors logged for some time with replication processes, whereby replication sessions seem to be timing out periodically. This is with dovecot version 2.3.10.1 (a3d0e1171) and both are Gentoo x86_64. After some investigation I've determined that these timeouts are only ever occurring with tcps as the replication connection type. These errors never occur if non-encrypted tcp is configured. I've been able to validate this by changing only the replica_type on both ends of the replication configuration to tcp, and with no other changes and after a few days of operation there is not a single error logged. mail_replica = tcps:lightning.reub.net:4813 <<< periodic timeouts mail_replica = tcp:lightning.reub.net:4814 <<< works perfectly Example of the error: Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]: dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=handshake) Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]: dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: Timeout during state=recv_mailbox_tree doveadm: Error: Timeout during state=slave_recv_mailbox: 6 Time(s) doveadm: Error: Timeout during state=sync_mails (send=mail_requests recv=attributes): 31 Time(s) doveadm: Error: dsync(thunderstorm.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox): 31 Time(s) doveadm: Error: dsync(thunderstorm.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=mailbox_delete): 6 Time(s) It is seen on both sides of the replication setup. The replica is offsite but only a few ms of latency away and there is no packet loss. The replication is happening over IPv6, and the local firewall is logging that sessions are always permitted, and only ever finishing due to tcp-fin or tcp-rst-from-client . SSL appears to be correctly configured, and it seems that the replication itself is for the most part working. Clients are able to use imaps just fine so I don't think there's anything much wrong from an SSL perspective else I'd be seeing complete replication failure and/or client devices unable to connect. Can anyone suggest how we can further debug this problem? Thanks, Reuben -------------- next part -------------- # 2.3.10.1 (a3d0e1171): /etc/dovecot/dovecot.conf # Pigeonhole version 0.5.10 (bf8ef1c2) # OS: Linux 5.7.2-gentoo x86_64 Gentoo Base System release 2.7 # Hostname: thunderstorm.reub.net auth_mechanisms = plain login auth_username_format = %Ln disable_plaintext_auth = no doveadm_password = # hidden, use -P to show it first_valid_uid = 1000 imap_client_workarounds = tb-lsub-flags tb-extra-mailbox-sep last_valid_uid = 1099 login_log_format_elements = user=<%u> auth-method=%m remote=%r local=%l %c %k login_trusted_networks = 192.168.0.0/16 2403:5800:7100:0900::/56 180.150.17.229 mail_attribute_dict = file:%h/Maildir/dovecot-attributes mail_location = maildir:~/Maildir mail_plugins = notify replication fts fts_lucene managesieve_notify_capability = mailto managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date index ihave duplicate mime foreverypart extracttext namespace inbox { inbox = yes location = mailbox Drafts { special_use = \Drafts } mailbox Junk { special_use = \Junk } mailbox Sent { special_use = \Sent } mailbox "Sent Messages" { special_use = \Sent } mailbox Trash { special_use = \Trash } prefix = } passdb { args = failure_show_msg=yes %s driver = pam } plugin { fts = lucene fts_autoindex = yes fts_languages = en fts_lucene = whitespace_chars=@. mail_replica = tcps:lightning.reub.net:4813 replication_full_sync_interval = 2 hours sieve = file:~/sieve;active=~/.dovecot.sieve } postmaster_address = postmaster at reub.net protocols = imap lmtp sieve submission sieve recipient_delimiter = - service aggregator { fifo_listener replication-notify-fifo { mode = 0666 user = root } unix_listener replication-notify { mode = 0666 user = root } } service auth { inet_listener { port = 45347 } unix_listener /var/spool/postfix/private/auth { group = postfix mode = 0666 user = postfix } unix_listener auth-userdb { mode = 0777 } } service doveadm { inet_listener { address = 2403:5800:7100:0910::23 port = 4813 ssl = yes } inet_listener { address = 2403:5800:7100:0910::23 port = 4814 ssl = no } inet_listener { address = 192.168.10.23 port = 4813 ssl = yes } inet_listener { address = 192.168.10.23 port = 4814 ssl = no } user = root } service imap-login { inet_listener imap { address = 192.168.10.23 2403:5800:7100:0910::23 } inet_listener imaps { address = 192.168.10.23 2403:5800:7100:0910::23 ssl = yes } } service lmtp { inet_listener lmtp { address = ::1 port = 24 } unix_listener /var/spool/postfix/private/dovecot-lmtp { group = postfix mode = 0660 user = postfix } } service managesieve-login { inet_listener sieve { address = 127.0.0.1 2403:5800:7100:0910::23 port = 4190 } } service replicator { process_min_avail = 1 unix_listener replicator-doveadm { mode = 0666 } } service submission-login { inet_listener submission { address = 192.168.10.23 2403:5800:7100:0910::23 } } ssl_cert = </etc/letsencrypt/live/reub.net/fullchain.pem ssl_client_ca_dir = /etc/ssl/certs ssl_dh = # hidden, use -P to show it ssl_key = # hidden, use -P to show it ssl_min_protocol = TLSv1.2 submission_client_workarounds = whitespace-before-path submission_relay_host = inside-mail.reub.net submission_relay_trusted = yes userdb { driver = passwd result_success = continue-ok } userdb { args = /etc/dovecot/passwd.extra driver = passwd-file skip = notfound } protocol lmtp { mail_plugins = notify replication fts fts_lucene sieve } protocol lda { mail_plugins = notify replication fts fts_lucene sieve } protocol imap { imap_metadata = yes mail_max_userip_connections = 25 } local_name imap.reub.net { ssl_cert = </etc/letsencrypt/live/reub.net/fullchain.pem ssl_key = # hidden, use -P to show it } local_name imap.htperham.name { ssl_cert = </etc/letsencrypt/live/imap.htperham.name/fullchain.pem ssl_key = # hidden, use -P to show it }
Aakash Patel
2020-Nov-18 19:37 UTC
Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
Hello, I have two mail servers and am also experiencing sporadic replication errors over tcps, similar to Reuben. Each server is running Dovecot 2.3.11.3 (502c39af9) on Debian 10.6. *Log entries from MX1* Nov 18 00:39:26 mx1 dovecot: dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state) Nov 18 00:39:26 mx1 dovecot: dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox) Nov 18 06:39:32 mx1 dovecot: dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state) Nov 18 06:39:32 mx1 dovecot: dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox) *End* *Log entries from MX2* Nov 18 00:29:55 mx2 dovecot: dsync-local(user at example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't lock /var/vmail/user at example.com/.dovecot-sync.lock: fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 628) Nov 18 00:34:56 mx2 dovecot: dsync-local(user at example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't lock /var/vmail/user at example.com/.dovecot-sync.lock: fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 628) Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox) Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox) *End* I have configured "replication_full_sync_interval = 1 hours", which explains why some of the sync errors occur at the same increment on the hour (if the error does occur). I've tested replication over tcps using either IPv6 or IPv4 -- this did not appear to make a difference. Changing replication to occur over tcp solves the issue (with "ssl = yes" commented out, as well). IMAP clients are primarily connecting to MX1 using SSL, which works well (SSL connections to MX2 also work). These are very low traffic machines at the moment (just 1 user as I continue testing). I've attached the output of "dovecot -n" from each server. Are there known bugs with replication using SSL? I'd appreciate any guidance. Thank you, AP -------------- next part -------------- # 2.3.11.3 (502c39af9): /etc/dovecot/dovecot.conf # OS: Linux 4.19.0-12-amd64 x86_64 Debian 10.6 # Hostname: mx1.example.com doveadm_password = # hidden, use -P to show it doveadm_port = 12345 mail_location = maildir:~/Maildir mail_plugins = " notify replication" namespace inbox { inbox = yes location mailbox Archive { special_use = \Archive } mailbox "Deleted Messages" { special_use = \Trash } mailbox Drafts { special_use = \Drafts } mailbox Junk { special_use = \Junk } mailbox Sent { special_use = \Sent } mailbox "Sent Messages" { special_use = \Sent } mailbox Trash { special_use = \Trash } prefix } passdb { args = scheme=sha512-crypt /usr/local/etc/creds driver = passwd-file } plugin { mail_replica = tcps:mx2.example.com:12345 } protocols = " imap" replication_full_sync_interval = 1 hours service aggregator { fifo_listener replication-notify-fifo { user = vmail } unix_listener replication-notify { user = vmail } } service doveadm { inet_listener { port = 12345 ssl = yes } } service replicator { process_min_avail = 1 unix_listener replicator-doveadm { mode = 0600 user = vmail } } ssl_cert = </etc/letsencrypt/live/mx1.example.com/fullchain.pem ssl_client_ca_dir = /etc/ssl/certs ssl_key = # hidden, use -P to show it userdb { args = username_format=%u /usr/local/etc/creds default_fields = uid=vmail gid=vmail home=/var/vmail/%u driver = passwd-file } -------------- next part -------------- # 2.3.11.3 (502c39af9): /etc/dovecot/dovecot.conf # OS: Linux 4.19.0-12-amd64 x86_64 Debian 10.6 # Hostname: mx2.example.com doveadm_password = # hidden, use -P to show it doveadm_port = 12345 mail_location = maildir:~/Maildir mail_plugins = " notify replication" namespace inbox { inbox = yes location mailbox Archive { special_use = \Archive } mailbox "Deleted Messages" { special_use = \Trash } mailbox Drafts { special_use = \Drafts } mailbox Junk { special_use = \Junk } mailbox Sent { special_use = \Sent } mailbox "Sent Messages" { special_use = \Sent } mailbox Trash { special_use = \Trash } prefix } passdb { args = scheme=sha512-crypt /usr/local/etc/creds driver = passwd-file } plugin { mail_replica = tcps:mx1.example.com:12345 } protocols = " imap" replication_full_sync_interval = 1 hours service aggregator { fifo_listener replication-notify-fifo { user = vmail } unix_listener replication-notify { user = vmail } } service doveadm { inet_listener { port = 12345 ssl = yes } } service replicator { process_min_avail = 1 unix_listener replicator-doveadm { mode = 0600 user = vmail } } ssl_cert = </etc/letsencrypt/live/mx2.example.com/fullchain.pem ssl_client_ca_dir = /etc/ssl/certs ssl_key = # hidden, use -P to show it userdb { args = username_format=%u /usr/local/etc/creds default_fields = uid=vmail gid=vmail home=/var/vmail/%u driver = passwd-file }
James Pattinson
2020-Nov-19 08:30 UTC
Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
On 18/11/2020 19:37, Aakash Patel wrote:> Hello, > > I have two mail servers and am also experiencing sporadic replication > errors over tcps, similar to Reuben. Each server is running Dovecot > 2.3.11.3 (502c39af9) on Debian 10.6. > > *Log entries from MX1* > Nov 18 00:39:26 mx1 dovecot: > dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: > dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds > (last sent=mailbox, last recv=mailbox_state) > Nov 18 00:39:26 mx1 dovecot: > dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout > during state=sync_mails (send=mailbox recv=mailbox) > Nov 18 06:39:32 mx1 dovecot: > dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: > dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds > (last sent=mailbox, last recv=mailbox_state) > Nov 18 06:39:32 mx1 dovecot: > dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout > during state=sync_mails (send=mailbox recv=mailbox) > *End* > > *Log entries from MX2* > Nov 18 00:29:55 mx2 dovecot: > dsync-local(user at example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't > lock /var/vmail/user at example.com/.dovecot-sync.lock: > fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock, > F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held > by pid 628) > Nov 18 00:34:56 mx2 dovecot: > dsync-local(user at example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't > lock /var/vmail/user at example.com/.dovecot-sync.lock: > fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock, > F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held > by pid 628) > Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): > I/O has stalled, no activity for 600 seconds (last sent=mail_change > (EOL), last recv=mailbox) > Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): > I/O has stalled, no activity for 600 seconds (last sent=mail_change > (EOL), last recv=mailbox) > *End* > > I have configured "replication_full_sync_interval = 1 hours", which > explains why some of the sync errors occur at the same increment on > the hour (if the error does occur). > > I've tested replication over tcps using either IPv6 or IPv4 -- this > did not appear to make a difference. > > Changing replication to occur over tcp solves the issue (with "ssl = > yes" commented out, as well). > > IMAP clients are primarily connecting to MX1 using SSL, which works > well (SSL connections to MX2 also work). These are very low traffic > machines at the moment (just 1 user as I continue testing). > > I've attached the output of "dovecot -n" from each server. > > Are there known bugs with replication using SSL? I'd appreciate any > guidance. > > Thank you, > AP >For what it's worth, I had the same issue when setting this up a few weeks ago. I switched to using SSH based transport and it's been great ever since. Is that an option for you? dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u mail_replica = remote:root at xx.xx.xx.xx Cheers James
Apparently Analagous Threads
- Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
- dovecot replication - new and cur folders on mx1 and mx2
- dovecot replication - new and cur folders on mx1 and mx2
- 2.3.1 Replication is throwing scary errors
- 2.3.1 Replication is throwing scary errors