Reuben Farrelly
2020-Jun-13 02:38 UTC
Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
Hi,
I've been seeing errors logged for some time with replication processes,
whereby replication sessions seem to be timing out periodically.
This is with dovecot version 2.3.10.1 (a3d0e1171) and both are Gentoo
x86_64.
After some investigation I've determined that these timeouts are only
ever occurring with tcps as the replication connection type. These
errors never occur if non-encrypted tcp is configured. I've been able
to validate this by changing only the replica_type on both ends of the
replication configuration to tcp, and with no other changes and after a
few days of operation there is not a single error logged.
mail_replica = tcps:lightning.reub.net:4813 <<< periodic timeouts
mail_replica = tcp:lightning.reub.net:4814 <<< works perfectly
Example of the error:
Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]:
dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error:
dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds
(last sent=mailbox_delete, last recv=handshake)
Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]:
dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: Timeout during
state=recv_mailbox_tree
doveadm: Error: Timeout during state=slave_recv_mailbox: 6 Time(s)
doveadm: Error: Timeout during state=sync_mails (send=mail_requests
recv=attributes): 31 Time(s)
doveadm: Error: dsync(thunderstorm.reub.net): I/O has stalled, no
activity for 600 seconds (last sent=mail_change (EOL), last
recv=mailbox): 31 Time(s)
doveadm: Error: dsync(thunderstorm.reub.net): I/O has stalled, no
activity for 600 seconds (last sent=mailbox_delete, last
recv=mailbox_delete): 6 Time(s)
It is seen on both sides of the replication setup. The replica is
offsite but only a few ms of latency away and there is no packet loss.
The replication is happening over IPv6, and the local firewall is
logging that sessions are always permitted, and only ever finishing due
to tcp-fin or tcp-rst-from-client .
SSL appears to be correctly configured, and it seems that the
replication itself is for the most part working. Clients are able to
use imaps just fine so I don't think there's anything much wrong from an
SSL perspective else I'd be seeing complete replication failure and/or
client devices unable to connect.
Can anyone suggest how we can further debug this problem?
Thanks,
Reuben
-------------- next part --------------
# 2.3.10.1 (a3d0e1171): /etc/dovecot/dovecot.conf
# Pigeonhole version 0.5.10 (bf8ef1c2)
# OS: Linux 5.7.2-gentoo x86_64 Gentoo Base System release 2.7
# Hostname: thunderstorm.reub.net
auth_mechanisms = plain login
auth_username_format = %Ln
disable_plaintext_auth = no
doveadm_password = # hidden, use -P to show it
first_valid_uid = 1000
imap_client_workarounds = tb-lsub-flags tb-extra-mailbox-sep
last_valid_uid = 1099
login_log_format_elements = user=<%u> auth-method=%m remote=%r local=%l %c
%k
login_trusted_networks = 192.168.0.0/16 2403:5800:7100:0900::/56 180.150.17.229
mail_attribute_dict = file:%h/Maildir/dovecot-attributes
mail_location = maildir:~/Maildir
mail_plugins = notify replication fts fts_lucene
managesieve_notify_capability = mailto
managesieve_sieve_capability = fileinto reject envelope encoded-character
vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy
include variables body enotify environment mailbox date index ihave duplicate
mime foreverypart extracttext
namespace inbox {
inbox = yes
location =
mailbox Drafts {
special_use = \Drafts
}
mailbox Junk {
special_use = \Junk
}
mailbox Sent {
special_use = \Sent
}
mailbox "Sent Messages" {
special_use = \Sent
}
mailbox Trash {
special_use = \Trash
}
prefix =
}
passdb {
args = failure_show_msg=yes %s
driver = pam
}
plugin {
fts = lucene
fts_autoindex = yes
fts_languages = en
fts_lucene = whitespace_chars=@.
mail_replica = tcps:lightning.reub.net:4813
replication_full_sync_interval = 2 hours
sieve = file:~/sieve;active=~/.dovecot.sieve
}
postmaster_address = postmaster at reub.net
protocols = imap lmtp sieve submission sieve
recipient_delimiter = -
service aggregator {
fifo_listener replication-notify-fifo {
mode = 0666
user = root
}
unix_listener replication-notify {
mode = 0666
user = root
}
}
service auth {
inet_listener {
port = 45347
}
unix_listener /var/spool/postfix/private/auth {
group = postfix
mode = 0666
user = postfix
}
unix_listener auth-userdb {
mode = 0777
}
}
service doveadm {
inet_listener {
address = 2403:5800:7100:0910::23
port = 4813
ssl = yes
}
inet_listener {
address = 2403:5800:7100:0910::23
port = 4814
ssl = no
}
inet_listener {
address = 192.168.10.23
port = 4813
ssl = yes
}
inet_listener {
address = 192.168.10.23
port = 4814
ssl = no
}
user = root
}
service imap-login {
inet_listener imap {
address = 192.168.10.23 2403:5800:7100:0910::23
}
inet_listener imaps {
address = 192.168.10.23 2403:5800:7100:0910::23
ssl = yes
}
}
service lmtp {
inet_listener lmtp {
address = ::1
port = 24
}
unix_listener /var/spool/postfix/private/dovecot-lmtp {
group = postfix
mode = 0660
user = postfix
}
}
service managesieve-login {
inet_listener sieve {
address = 127.0.0.1 2403:5800:7100:0910::23
port = 4190
}
}
service replicator {
process_min_avail = 1
unix_listener replicator-doveadm {
mode = 0666
}
}
service submission-login {
inet_listener submission {
address = 192.168.10.23 2403:5800:7100:0910::23
}
}
ssl_cert = </etc/letsencrypt/live/reub.net/fullchain.pem
ssl_client_ca_dir = /etc/ssl/certs
ssl_dh = # hidden, use -P to show it
ssl_key = # hidden, use -P to show it
ssl_min_protocol = TLSv1.2
submission_client_workarounds = whitespace-before-path
submission_relay_host = inside-mail.reub.net
submission_relay_trusted = yes
userdb {
driver = passwd
result_success = continue-ok
}
userdb {
args = /etc/dovecot/passwd.extra
driver = passwd-file
skip = notfound
}
protocol lmtp {
mail_plugins = notify replication fts fts_lucene sieve
}
protocol lda {
mail_plugins = notify replication fts fts_lucene sieve
}
protocol imap {
imap_metadata = yes
mail_max_userip_connections = 25
}
local_name imap.reub.net {
ssl_cert = </etc/letsencrypt/live/reub.net/fullchain.pem
ssl_key = # hidden, use -P to show it
}
local_name imap.htperham.name {
ssl_cert = </etc/letsencrypt/live/imap.htperham.name/fullchain.pem
ssl_key = # hidden, use -P to show it
}
Aakash Patel
2020-Nov-18 19:37 UTC
Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
Hello,
I have two mail servers and am also experiencing sporadic replication
errors over tcps, similar to Reuben. Each server is running Dovecot
2.3.11.3 (502c39af9) on Debian 10.6.
*Log entries from MX1*
Nov 18 00:39:26 mx1 dovecot:
dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error:
dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds
(last sent=mailbox, last recv=mailbox_state)
Nov 18 00:39:26 mx1 dovecot:
dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout
during state=sync_mails (send=mailbox recv=mailbox)
Nov 18 06:39:32 mx1 dovecot:
dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error:
dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds
(last sent=mailbox, last recv=mailbox_state)
Nov 18 06:39:32 mx1 dovecot:
dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout
during state=sync_mails (send=mailbox recv=mailbox)
*End*
*Log entries from MX2*
Nov 18 00:29:55 mx2 dovecot:
dsync-local(user at example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error:
Couldn't
lock /var/vmail/user at example.com/.dovecot-sync.lock:
fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock,
F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
pid 628)
Nov 18 00:34:56 mx2 dovecot:
dsync-local(user at example.com)<9IKaB2KytF92AgAA5XpYKg>: Error:
Couldn't
lock /var/vmail/user at example.com/.dovecot-sync.lock:
fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock,
F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
pid 628)
Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O
has stalled, no activity for 600 seconds (last sent=mail_change (EOL),
last recv=mailbox)
Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O
has stalled, no activity for 600 seconds (last sent=mail_change (EOL),
last recv=mailbox)
*End*
I have configured "replication_full_sync_interval = 1 hours", which
explains why some of the sync errors occur at the same increment on the
hour (if the error does occur).
I've tested replication over tcps using either IPv6 or IPv4 -- this did
not appear to make a difference.
Changing replication to occur over tcp solves the issue (with "ssl =
yes" commented out, as well).
IMAP clients are primarily connecting to MX1 using SSL, which works well
(SSL connections to MX2 also work). These are very low traffic machines
at the moment (just 1 user as I continue testing).
I've attached the output of "dovecot -n" from each server.
Are there known bugs with replication using SSL? I'd appreciate any
guidance.
Thank you,
AP
-------------- next part --------------
# 2.3.11.3 (502c39af9): /etc/dovecot/dovecot.conf
# OS: Linux 4.19.0-12-amd64 x86_64 Debian 10.6
# Hostname: mx1.example.com
doveadm_password = # hidden, use -P to show it
doveadm_port = 12345
mail_location = maildir:~/Maildir
mail_plugins = " notify replication"
namespace inbox {
inbox = yes
location mailbox Archive {
special_use = \Archive
}
mailbox "Deleted Messages" {
special_use = \Trash
}
mailbox Drafts {
special_use = \Drafts
}
mailbox Junk {
special_use = \Junk
}
mailbox Sent {
special_use = \Sent
}
mailbox "Sent Messages" {
special_use = \Sent
}
mailbox Trash {
special_use = \Trash
}
prefix }
passdb {
args = scheme=sha512-crypt /usr/local/etc/creds
driver = passwd-file
}
plugin {
mail_replica = tcps:mx2.example.com:12345
}
protocols = " imap"
replication_full_sync_interval = 1 hours
service aggregator {
fifo_listener replication-notify-fifo {
user = vmail
}
unix_listener replication-notify {
user = vmail
}
}
service doveadm {
inet_listener {
port = 12345
ssl = yes
}
}
service replicator {
process_min_avail = 1
unix_listener replicator-doveadm {
mode = 0600
user = vmail
}
}
ssl_cert = </etc/letsencrypt/live/mx1.example.com/fullchain.pem
ssl_client_ca_dir = /etc/ssl/certs
ssl_key = # hidden, use -P to show it
userdb {
args = username_format=%u /usr/local/etc/creds
default_fields = uid=vmail gid=vmail home=/var/vmail/%u
driver = passwd-file
}
-------------- next part --------------
# 2.3.11.3 (502c39af9): /etc/dovecot/dovecot.conf
# OS: Linux 4.19.0-12-amd64 x86_64 Debian 10.6
# Hostname: mx2.example.com
doveadm_password = # hidden, use -P to show it
doveadm_port = 12345
mail_location = maildir:~/Maildir
mail_plugins = " notify replication"
namespace inbox {
inbox = yes
location mailbox Archive {
special_use = \Archive
}
mailbox "Deleted Messages" {
special_use = \Trash
}
mailbox Drafts {
special_use = \Drafts
}
mailbox Junk {
special_use = \Junk
}
mailbox Sent {
special_use = \Sent
}
mailbox "Sent Messages" {
special_use = \Sent
}
mailbox Trash {
special_use = \Trash
}
prefix }
passdb {
args = scheme=sha512-crypt /usr/local/etc/creds
driver = passwd-file
}
plugin {
mail_replica = tcps:mx1.example.com:12345
}
protocols = " imap"
replication_full_sync_interval = 1 hours
service aggregator {
fifo_listener replication-notify-fifo {
user = vmail
}
unix_listener replication-notify {
user = vmail
}
}
service doveadm {
inet_listener {
port = 12345
ssl = yes
}
}
service replicator {
process_min_avail = 1
unix_listener replicator-doveadm {
mode = 0600
user = vmail
}
}
ssl_cert = </etc/letsencrypt/live/mx2.example.com/fullchain.pem
ssl_client_ca_dir = /etc/ssl/certs
ssl_key = # hidden, use -P to show it
userdb {
args = username_format=%u /usr/local/etc/creds
default_fields = uid=vmail gid=vmail home=/var/vmail/%u
driver = passwd-file
}
James Pattinson
2020-Nov-19 08:30 UTC
Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
On 18/11/2020 19:37, Aakash Patel wrote:> Hello, > > I have two mail servers and am also experiencing sporadic replication > errors over tcps, similar to Reuben. Each server is running Dovecot > 2.3.11.3 (502c39af9) on Debian 10.6. > > *Log entries from MX1* > Nov 18 00:39:26 mx1 dovecot: > dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: > dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds > (last sent=mailbox, last recv=mailbox_state) > Nov 18 00:39:26 mx1 dovecot: > dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout > during state=sync_mails (send=mailbox recv=mailbox) > Nov 18 06:39:32 mx1 dovecot: > dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: > dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds > (last sent=mailbox, last recv=mailbox_state) > Nov 18 06:39:32 mx1 dovecot: > dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout > during state=sync_mails (send=mailbox recv=mailbox) > *End* > > *Log entries from MX2* > Nov 18 00:29:55 mx2 dovecot: > dsync-local(user at example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't > lock /var/vmail/user at example.com/.dovecot-sync.lock: > fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock, > F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held > by pid 628) > Nov 18 00:34:56 mx2 dovecot: > dsync-local(user at example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't > lock /var/vmail/user at example.com/.dovecot-sync.lock: > fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock, > F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held > by pid 628) > Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): > I/O has stalled, no activity for 600 seconds (last sent=mail_change > (EOL), last recv=mailbox) > Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): > I/O has stalled, no activity for 600 seconds (last sent=mail_change > (EOL), last recv=mailbox) > *End* > > I have configured "replication_full_sync_interval = 1 hours", which > explains why some of the sync errors occur at the same increment on > the hour (if the error does occur). > > I've tested replication over tcps using either IPv6 or IPv4 -- this > did not appear to make a difference. > > Changing replication to occur over tcp solves the issue (with "ssl = > yes" commented out, as well). > > IMAP clients are primarily connecting to MX1 using SSL, which works > well (SSL connections to MX2 also work). These are very low traffic > machines at the moment (just 1 user as I continue testing). > > I've attached the output of "dovecot -n" from each server. > > Are there known bugs with replication using SSL? I'd appreciate any > guidance. > > Thank you, > AP >For what it's worth, I had the same issue when setting this up a few weeks ago. I switched to using SSH based transport and it's been great ever since. Is that an option for you? dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u mail_replica = remote:root at xx.xx.xx.xx Cheers James
Maybe Matching Threads
- Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
- dovecot replication - new and cur folders on mx1 and mx2
- dovecot replication - new and cur folders on mx1 and mx2
- 2.3.1 Replication is throwing scary errors
- 2.3.1 Replication is throwing scary errors