Hey all,

I've been affected by these replication issues too and finally downgraded
back to 2.2.35, since some newly created virtual domains/mailboxes
weren't replicated *at all* due to the bug(s).

My setup is more like a master-slave, where I only have a rather small
virtual machine as the slave host, which is also only MX 20.
The idea was to replicate all mails through dovecot and perform
individual (independent) backups on each host.

The clients use a CNAME with a low TTL of 60s, so in case my "master"
(a physical dedicated machine) goes down for a longer period I can simply
switch to the slave.

For this concept to work, the replication has to work without
any issues. Otherwise clients might notice missing mails, or it might
even result in conflicts when the master comes back online if the
slave was out of sync beforehand.

On 06.05.18 - 21:34, Michael Grimm wrote:
> And please have a look for processes like:
> doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)
>
> These processes will "survive" a dovecot reboot ...

This is indeed the case. Once the replication processes
(doveadm-server) get stuck, I had to resort to `kill -9` to get rid of
them. Something is really wrong there.

As stated multiple times in the #dovecot IRC channel, I'm happy to test
any patches for the 2.3 series in my setup and provide further details
if required.

Thanks to all who are participating in this thread so that these
issues finally get some attention :)

Cheers,
Thore

--
Thore Bödecker

GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226 A864 D622 431A F8DB 80F3
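[A minimal sketch of how such stuck processes could be cleaned up a little more gently than an immediate `kill -9`: try SIGTERM first and escalate only for survivors. The match pattern and the grace period are assumptions, not something confirmed in this thread; the `[d]` bracket trick just keeps `pgrep -f` from matching the script's own command line.]

```shell
#!/bin/sh
# Find doveadm-server processes that survived a dovecot restart and
# terminate them: SIGTERM first, SIGKILL only for those still alive.
PATTERN='[d]oveadm-server'   # assumed process title; adjust for your system

PIDS=$(pgrep -f "$PATTERN")
if [ -z "$PIDS" ]; then
    echo "no stuck doveadm-server processes found"
    exit 0
fi

for pid in $PIDS; do
    echo "sending SIGTERM to $pid"
    kill "$pid" 2>/dev/null
done

sleep 2    # assumed grace period before escalating

for pid in $PIDS; do
    # kill -0 only checks whether the process still exists
    if kill -0 "$pid" 2>/dev/null; then
        echo "sending SIGKILL to $pid"
        kill -9 "$pid" 2>/dev/null
    fi
done
```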
Hi,

Checking in - this is still an issue with 2.3-master as of today
(2.3.devel (3a6537d59)).

I haven't been able to narrow the problem down to a specific commit.
The best I have been able to establish is that this commit is
relatively good (not perfect, but good enough):

d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)

whereas this commit:

6418419ec282c887b67469dbe3f541fc4873f7f0 (from Mar 12, 2018)

is pretty bad. Somewhere in between, some commit has caused the problem
(which may have been introduced earlier) to get much worse.

There seem to be a handful of us with broken systems who are prepared
to assist in debugging this and put in our own time to patch, test and
get to the bottom of it, but it is starting to look like we're
basically on our own.

What sort of debugging, short of bisecting the 100+ commits between
the two above, can we do to progress this?

Reuben

On 7/05/2018 5:54 am, Thore Bödecker wrote:
> Hey all,
>
> I've been affected by these replication issues too and finally downgraded
> back to 2.2.35 since some newly created virtual domains/mailboxes
> weren't replicated *at all* due to the bug(s).
>
> My setup is more like a master-slave, where I only have a rather small
> virtual machine as the slave host, which is also only MX 20.
> The idea was to replicate all mails through dovecot and perform
> individual (independent) backups on each host.
>
> The clients use a CNAME with a low TTL of 60s so in case my "master"
> (physical dedicated machine) goes down for a longer period I can simply
> switch to the slave.
>
> In order for this concept to work, the replication has to work without
> any issue. Otherwise clients might notice missing mails or it might
> even result in conflicts when the master comes back online if the
> slave was out of sync beforehand.
>
>
> On 06.05.18 - 21:34, Michael Grimm wrote:
>> And please have a look for processes like:
>> doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)
>>
>> These processes will "survive" a dovecot reboot ...
>
> This is indeed the case. Once the replication processes
> (doveadm-server) get stuck I had to resort to `kill -9` to get rid of
> them. Something is really wrong there.
>
> As stated multiple times in the #dovecot irc channel I'm happy to test
> any patches for the 2.3 series in my setup and provide further details
> if required.
>
> Thanks to all who are participating in this thread and finally these
> issues get some attention :)
>
>
> Cheers,
> Thore
>
Reuben Farrelly <reuben-dovecot at reub.net> wrote:
> Checking in - this is still an issue with 2.3-master as of today
> (2.3.devel (3a6537d59)).

That doesn't sound good, because I had hoped that someone was working
on this issue ...

> I haven't been able to narrow the problem down to a specific commit.
> The best I have been able to get to is that this commit is relatively
> good (not perfect but good enough):
>
> d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)
>
> whereas this commit:
>
> 6418419ec282c887b67469dbe3f541fc4873f7f0 (from Mar 12, 2018)
>
> is pretty bad. Somewhere in between some commit has caused the problem
> (which may have been introduced earlier) to get much worse.

Thanks for the info.

> There seem to be a handful of us with broken systems who are prepared
> to assist in debugging this and put in our own time to patch, test and
> get to the bottom of it, but it is starting to look like we're
> basically on our own.

I wonder if there is anyone running a 2.3 master-master replication
scheme *without* running into this issue? Please let us know: yes, 2.3
master-master replication does run as rock-stable as in 2.2. Anyone?

I would love to get some feedback from the developers regarding:

#) are commercial customers of yours running 2.3 master-master
   replication without the issues reported in this thread?
#) do you get reports about these issues outside this ML as well?
#) and ...

> What sort of debugging, short of bisecting 100+ patches between the
> commits above, can we do to progress this?

... what kind of debugging do you suggest?

Regards,
Michael
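[For what it's worth, the 100+ commits between the two hashes Reuben identified could be narrowed down semi-automatically with `git bisect` rather than building and testing each one by hand; bisection only needs about log2(N) test rounds. A rough sketch against a checkout of the dovecot core repository - the build/test step in the middle is site-specific and only described in comments here:]

```shell
# In a clone of the dovecot core repository:
git bisect start
git bisect bad  6418419ec282c887b67469dbe3f541fc4873f7f0  # known-bad  (Mar 12, 2018)
git bisect good d9a1a7cbec19f4c6a47add47688351f8c3a0e372  # known-good (Feb 19, 2018)

# git now checks out a commit roughly halfway between the two.
# At each step: build and install that revision, run your replication
# reproduction, then record the result with one of:
#   git bisect good    # replication worked at this commit
#   git bisect bad     # replication broke at this commit
# git halves the remaining range each time and finally prints
# "<hash> is the first bad commit". Afterwards, restore the tree:
git bisect reset
```

If the reproduction can be scripted (exit 0 on success, non-zero on failure), `git bisect run ./your-test-script.sh` automates the whole loop; the script name here is of course just a placeholder.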