Hey all,

I've been affected by these replication issues too and finally downgraded
back to 2.2.35, since some newly created virtual domains/mailboxes
weren't replicated *at all* due to the bug(s).

My setup is more like a master-slave, where I only have a rather small
virtual machine as the slave host, which is also only MX 20.
The idea was to replicate all mails through dovecot and perform
individual (independent) backups on each host.

The clients use a CNAME with a low TTL of 60s, so in case my "master"
(a physical dedicated machine) goes down for a longer period I can simply
switch to the slave.

For this concept to work, the replication has to work without
any issues. Otherwise clients might notice missing mails, or it might
even result in conflicts when the master comes back online if the
slave was out of sync beforehand.

On 06.05.18 - 21:34, Michael Grimm wrote:
> And please have a look for processes like:
> doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)
>
> These processes will "survive" a dovecot reboot ...

This is indeed the case. Once the replication processes
(doveadm-server) get stuck, I had to resort to `kill -9` to get rid of
them. Something is really wrong there.

As stated multiple times in the #dovecot IRC channel, I'm happy to test
any patches for the 2.3 series in my setup and provide further details
if required.

Thanks to all who are participating in this thread so that these
issues finally get some attention :)

Cheers,
Thore

--
Thore Bödecker

GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226 A864 D622 431A F8DB 80F3
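[A minimal sketch of how such stuck processes could be cleaned up a little more gently than an immediate `kill -9`: try SIGTERM first and escalate only for survivors. The match pattern and the grace period are assumptions, not something confirmed in this thread; the `[d]` bracket trick just keeps `pgrep -f` from matching the script's own command line.]

```shell
#!/bin/sh
# Find doveadm-server processes that survived a dovecot restart and
# terminate them: SIGTERM first, SIGKILL only for those still alive.
PATTERN='[d]oveadm-server'   # assumed process title; adjust for your system

PIDS=$(pgrep -f "$PATTERN")
if [ -z "$PIDS" ]; then
    echo "no stuck doveadm-server processes found"
    exit 0
fi

for pid in $PIDS; do
    echo "sending SIGTERM to $pid"
    kill "$pid" 2>/dev/null
done

sleep 2    # assumed grace period before escalating

for pid in $PIDS; do
    # kill -0 only checks whether the process still exists
    if kill -0 "$pid" 2>/dev/null; then
        echo "sending SIGKILL to $pid"
        kill -9 "$pid" 2>/dev/null
    fi
done
```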
Hi,

Checking in - this is still an issue with 2.3-master as of today
(2.3.devel (3a6537d59)).

I haven't been able to narrow the problem down to a specific commit.
The best I have been able to establish is that this commit is
relatively good (not perfect, but good enough):

d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)

whereas this commit:

6418419ec282c887b67469dbe3f541fc4873f7f0 (from Mar 12, 2018)

is pretty bad. Somewhere in between, some commit has caused the problem
(which may have been introduced earlier) to get much worse.

There seem to be a handful of us with broken systems who are prepared
to assist in debugging this and put in our own time to patch, test and
get to the bottom of it, but it is starting to look like we're
basically on our own.

What sort of debugging, short of bisecting the 100+ commits between
the two above, can we do to progress this?

Reuben

On 7/05/2018 5:54 am, Thore Bödecker wrote:
> Hey all,
>
> I've been affected by these replication issues too and finally downgraded
> back to 2.2.35 since some newly created virtual domains/mailboxes
> weren't replicated *at all* due to the bug(s).
>
> My setup is more like a master-slave, where I only have a rather small
> virtual machine as the slave host, which is also only MX 20.
> The idea was to replicate all mails through dovecot and perform
> individual (independent) backups on each host.
>
> The clients use a CNAME with a low TTL of 60s so in case my "master"
> (physical dedicated machine) goes down for a longer period I can simply
> switch to the slave.
>
> In order for this concept to work, the replication has to work without
> any issue. Otherwise clients might notice missing mails or it might
> even result in conflicts when the master comes back online if the
> slave was out of sync beforehand.
>
>
> On 06.05.18 - 21:34, Michael Grimm wrote:
>> And please have a look for processes like:
>> doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)
>>
>> These processes will "survive" a dovecot reboot ...
>
> This is indeed the case. Once the replication processes
> (doveadm-server) get stuck I had to resort to `kill -9` to get rid of
> them. Something is really wrong there.
>
> As stated multiple times in the #dovecot irc channel I'm happy to test
> any patches for the 2.3 series in my setup and provide further details
> if required.
>
> Thanks to all who are participating in this thread and finally these
> issues get some attention :)
>
>
> Cheers,
> Thore
>
Reuben Farrelly <reuben-dovecot at reub.net> wrote:
> Checking in - this is still an issue with 2.3-master as of today
> (2.3.devel (3a6537d59)).

That doesn't sound good, because I had hoped that someone was working
on this issue ...

> I haven't been able to narrow the problem down to a specific commit.
> The best I have been able to get to is that this commit is relatively
> good (not perfect but good enough):
>
> d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)
>
> whereas this commit:
>
> 6418419ec282c887b67469dbe3f541fc4873f7f0 (from Mar 12, 2018)
>
> is pretty bad. Somewhere in between some commit has caused the problem
> (which may have been introduced earlier) to get much worse.

Thanks for the info.

> There seem to be a handful of us with broken systems who are prepared
> to assist in debugging this and put in our own time to patch, test and
> get to the bottom of it, but it is starting to look like we're
> basically on our own.

I wonder if there is anyone running a 2.3 master-master replication
scheme *without* running into this issue? Please let us know: yes, 2.3
master-master replication does run as rock-stable as in 2.2. Anyone?

I would love to get some feedback from the developers regarding:

#) are commercial customers of yours running 2.3 master-master
   replication without the issues reported in this thread?
#) do you get reports about these issues outside this ML as well?
#) and ...

> What sort of debugging, short of bisecting 100+ patches between the
> commits above, can we do to progress this?

... what kind of debugging do you suggest?

Regards,
Michael
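[For what it's worth, the 100+ commits between the two hashes Reuben identified could be narrowed down semi-automatically with `git bisect` rather than building and testing each one by hand; bisection only needs about log2(N) test rounds. A rough sketch against a checkout of the dovecot core repository - the build/test step in the middle is site-specific and only described in comments here:]

```shell
# In a clone of the dovecot core repository:
git bisect start
git bisect bad  6418419ec282c887b67469dbe3f541fc4873f7f0  # known-bad  (Mar 12, 2018)
git bisect good d9a1a7cbec19f4c6a47add47688351f8c3a0e372  # known-good (Feb 19, 2018)

# git now checks out a commit roughly halfway between the two.
# At each step: build and install that revision, run your replication
# reproduction, then record the result with one of:
#   git bisect good    # replication worked at this commit
#   git bisect bad     # replication broke at this commit
# git halves the remaining range each time and finally prints
# "<hash> is the first bad commit". Afterwards, restore the tree:
git bisect reset
```

If the reproduction can be scripted (exit 0 on success, non-zero on failure), `git bisect run ./your-test-script.sh` automates the whole loop; the script name here is of course just a placeholder.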