thr3ads.net - dovecot - 2.3.1 Replication is throwing scary errors [May 2018]


Andy Weal

2018-May-06 03:21 UTC

2.3.1 Replication is throwing scary errors

Hi all,

New to the mailing lists but have joined up because of above 2.3.1
Replication is throwing scary errors


Brief system configuration
    MX1 - Main
        Freebsd 11.1-Release-p9
        Hosted on a Vultr VM in Sydney AU
        MTA = Postfix 3.4-20180401
        Dovecot = 2.3.1
        File system = ufs
    MX2 - Backup
        Freebsd 11.1-Release-p9
        Running on bare metal - no VM or jails
        MTA = Postfix 3.4-20180401
        Dovecot = 2.3.1
        File system = ufs (on SSD)

Brief sequence of events

 1. Approx. 10 days back I upgraded both mx1 and mx2 to dovecot 2.3.1_2
    from 2.3.0 (service dovecot stop, portmaster upgrade, service
    dovecot start).
 2. Both systems ran OK with no errors for 10 days.
 3. Last night I shut down mx2 and restarted it a few hours later.
 4. Within minutes I was getting the following types of errors on mx2:

    May 06 12:56:29 doveadm: Error: Couldn't lock
    /var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
    fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
    F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
    pid 1960)
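
For anyone wanting to check which process is sitting on the lock, here is a minimal sketch in Python. It assumes the error arrives as a single unwrapped log line in the format shown above; the function names `parse_lock_error` and `pid_alive` are made up for illustration and are not part of Dovecot. It only pulls the lock path, timeout and holder pid out of the message and asks the OS whether that pid still exists.

```python
import os
import re

# Pattern for the lock-timeout message quoted in this thread. Only the message
# layout comes from the log line; the group names are illustrative.
LOCK_ERROR = re.compile(
    r"Couldn't lock (?P<path>\S+): .*?"
    r"Timed out after (?P<seconds>\d+) seconds "
    r"\((?P<mode>\w+) lock held by pid (?P<pid>\d+)\)"
)


def parse_lock_error(line):
    """Return a dict with path, seconds, mode and pid, or None if no match."""
    match = LOCK_ERROR.search(line)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "seconds": int(match.group("seconds")),
        "mode": match.group("mode"),
        "pid": int(match.group("pid")),
    }


def pid_alive(pid):
    """Check whether a process exists by sending signal 0 (nothing is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    sample = (
        "May 06 12:56:29 doveadm: Error: Couldn't lock "
        "/var/mail/vhosts/example.net/user1/.dovecot-sync.lock: "
        "fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, "
        "write-lock, F_SETLKW) locking failed: Timed out after 30 seconds "
        "(WRITE lock held by pid 1960)"
    )
    info = parse_lock_error(sample)
    print(info)
    print("holder still running:", pid_alive(info["pid"]))
```

Run on the affected host, this would only tell you whether the reported holder is still around; it does not say why the process is holding the lock.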

Before I venture down the rabbit hole of fault finding and excess
coffee consumption, I was wondering if any of you had any updates on the
problems discussed below.


Cheers for now,
Andy



Hi,

[Formatting is a bit rough, replying from a trimmed digest email]

> Message: 1
> Date: Fri, 6 Apr 2018 15:04:35 +0200
> From: Michael Grimm <trashcan at ellael.org>
> To: Dovecot Mailing List <dovecot at dovecot.org>
> Subject: Re: 2.3.1 Replication is throwing scary errors
> Message-ID: <E7E7A927-68F8-443F-BA59-E66CED8FE878 at ellael.org>
> Content-Type: text/plain; charset=utf-8
>
> Reuben Farrelly wrote:
>> From: Michael Grimm <trashcan at ellael.org>
>
>>> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at
>>> distinct servers.]
>>> I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by
>>> error messages at server1 (and vice versa at server2) as follows:
>>> | Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
>>> no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL))
>>> | Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
>>> (send=changes recv=mail_requests)
>>> [...]
>>> | Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
>>> no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
>>> | Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
>>> (send=mails recv=recv_last_common)
>>> I cannot see in my personal account any missing replications, *but* I
>>> haven't tested this thoroughly enough. I do have customers being
>>> serviced at these productive servers, *thus* I'm back to 2.2.35 until I
>>> do understand or have learned what is going on.
>
> In my reply to this statement of mine I mentioned that I have seen those
> timeouts quite some times during the past year. Thus, I upgraded to
> 2.3.1 again, and boom: after some hours I ended up in hanging processes
> [1] like (see Remko's mail in addition) ...
>
> doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server)
>
> ... at server2 paired with a file like ...
>
> -rw------- 1 vmail dovecot uarch 0 Apr 3 16:52 /home/to/USER1/.dovecot-sync.lock
>
> Corresponding logfile entries at server2 are like ...
>
> Apr 3 17:10:49 <mail.err> server2.lan dovecot: doveadm: Error: Couldn't lock /home/to/USER1/.dovecot-sync.lock: \
> fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds \
> (WRITE lock held by pid 51110)
>
> [1] Even stopping dovecot will not end those processes. One has to
> manually kill those before restarting dovecot.
>
> After one day of testing 2.3.1 with a couple of those episodes of
> locking/timeout, and now missing mails depending on which server your MUA
> will connect to, I went back to 2.2.35. After two days at that version I
> never had such an episode again.
>
>> It's not just you. This issue hit me recently, and it was impacting
>> replication noticeably. I am following git master-2.3.
> [...]
>> There is also a second issue of a long standing race with replication
>> occurring somewhere whereby if a mail comes in, is written to disk, is
>> replicated and then deleted in short succession, it will reappear
>> again to the MUA. I suspect the mail is being replicated back from
>> the remote. A few people have reported it over the years but it's not
>> reliable or consistent, so it has never been fixed.
>> And lastly there has been an ongoing but seemingly minor issue
>> relating to locking timing out after 30s particularly on the remote
>> host that is being replicated to. I rarely see the problem on my
>> local disk where almost all of the mail comes in, it's almost always
>> occurring on the replicate/remote system.
>
> It might be time to describe our setups in order to possibly find common
> grounds that might trigger this issue you describe and Remko and myself
> ran into as well.
>
> Servers: Cloud Instances (both identical), around 25ms latency apart.
> Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
> Both servers are connected via IPsec/racoon tunnels
> OS: FreeBSD 11.1-STABLE (both servers)
> Filesystem: ZFS
> MTA: postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
> IMAP: dovecot running in FreeBSD jails (issues with 2.3.1, fine with 2.2.35)
> Replication: unsecured tcp / master-master
> MUA: mainly iOS or macOS mail.app, rarely roundcube

For me:

Servers:	Main: VM on a VMWare ESXi local system (light
		load), local SSD disks (no NFS)
		Redundant: Linode VM in Singapore, around 250ms away
		Also no NFS.  Linode use SSDs for IO.
		There is native IPv6 connectivity between both VMs.
		As I am using TCPs I don't have a VPN between them -
		just raw IPv6 end to end.
OS:		Gentoo Linux x86_64 kept well up to date
Filesystem:	EXT4 for both
MTA:		Postfix 3.4-x (same as you)
IMAP:		Dovecot running natively on the machine (no jail/chroot)
Replication:	Secured TCP / master-master (but only one master is
		actively used for SMTP and MUA access)
MUA:		Thunderbird, Maildroid, MS Mail, Gmail client,
		Roundcube, LG Mail client on handset etc

> I believe it is worthwhile to mention here that I run a poor man's
> fail-over approach (round-robin DNS) as follows:
>
> DNS: mail.server.tld resolves to one IP4 and one IP6 address of each
> server, thus 4 IP addresses in total

For me, I have a simple pair of servers that replicate mail to each
other.  Clients connect to the main server only, which is where mail
comes in.  The second one, which experiences the problem with locking, is
only a redundant server acting as a secondary MX that gets occasional
SMTP deliveries.  So the amount of replication back from the "secondary"
is very minimal; the vast bulk of the replication is the primary
replicating changes out.

Reuben


Michael Grimm

2018-May-06 18:48 UTC


2.3.1 Replication is throwing scary errors

Hi Andy

Andy Weal <andy at bizemail.com.au> wrote
> Hi all,
> 
> New to the mailing lists but have joined up because of above 2.3.1
> Replication is throwing scary errors
> 
> 
> Brief system configuration
>     MX1 - Main 
>         Freebsd 11.1-Release-p9
>         Hosted on a Vultr VM in Sydney AU 
>         MTA = Postfix 3.4-20180401
>         Dovecot = 2.3.1
>         File system = ufs
>     MX2 - Backup 
>         Freebsd 11.1-Release-p9
>          Running on bare metal - no VM or jails
>         MTA = Postfix 3.4-20180401
>         Dovecot = 2.3.1
>         File system = ufs ( on SSD)
> 
>    
> Brief sequence of events
>     1. apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2 from
>        2.3.0 (service dovecot stop, portmaster upgrade, service dovecot start)
>     2. both systems ran ok with no errors for 10 days.
>     3. Last night I shutdown mx2 and restarted it a few hours later
>     4. within minutes i was getting the following types of errors on mx2
>        May 06 12:56:29 doveadm: Error: Couldn't lock
>        /var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
>        fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
>        F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
>        pid 1960)
>
> Before i venture down the rabbit hole of fault finding and excess coffee
> consumption I was wondering if any of you had any updates on the problems
> discussed below.

As Reuben already stated: nothing has been "solved" regarding this
issue with replication and 2.3.1 dovecot, yet.

I am aware of about 10 reports of this kind, here and on the German dovecot
list. The affected dovecot setups differ in every respect, such as OS or virtual
versus bare metal servers, thus I am convinced that the cause lies solely in
dovecot code that changed between 2.2.35 (or 2.3.0) and 2.3.1.

I hope the developers will soon recognise this issue as a showstopper for
upgrading from 2.2 to 2.3.

As you are using FreeBSD, you have the dovecot22 and dovecot-pigeonhole04
ports at hand, so you can avoid upgrading to the faulty 2.3 version for the time
being. (Thanks to the port maintainer who is following this ML!)

With kind regards,
Michael
