thr3ads.net - dovecot - [2.3.8] possible replication issue [Dec 2019]

Timo Sirainen

2019-Dec-05 12:15 UTC

[2.3.8] possible replication issue

I think there's a good chance that upgrading both will fix it. The bug
already existed in old versions, it just wasn't normally triggered. Since
v2.3.8 this situation is triggered on one dsync side, so the v2.3.9 fix needs to
be on the other side.
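
The rule described above can be written down as a small check. This is only a sketch of one reading of the explanation: a sync stalls when the side that triggers the situation (v2.3.8 or later) talks to a side that lacks the fix (older than v2.3.9). The names `parse_version` and `pair_is_affected` are hypothetical and not part of Dovecot, and the version strings are assumed to be collected by hand from each host.

```python
def parse_version(text):
    """Turn a dotted version such as 'v2.3.7.2' into a tuple of ints.

    Only the first whitespace-separated token is used, so trailing text
    after the version number is ignored (an assumption about the input).
    """
    token = text.strip().split()[0].lstrip("v")
    return tuple(int(part) for part in token.split("."))


# Thresholds taken from the thread, not from Dovecot's source code.
TRIGGER_SINCE = parse_version("2.3.8")  # first release reported to trigger the stall
FIXED_SINCE = parse_version("2.3.9")    # release said to contain the fix


def pair_is_affected(version_a, version_b):
    """Return True if a replication pair is expected to hit the stall."""
    a, b = parse_version(version_a), parse_version(version_b)

    def stalls(trigger_side, other_side):
        # One side triggers the situation, the other side still has the bug.
        return trigger_side >= TRIGGER_SINCE and other_side < FIXED_SINCE

    return stalls(a, b) or stalls(b, a)
```

Under this reading, a 2.3.9 / 2.3.7.2 pair and a 2.3.8 / 2.3.8 pair are both flagged, while 2.3.7.2 on both sides or 2.3.9 on both sides are not, which matches the reports in this thread.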
> On 5. Dec 2019, at 8.34, Piper Andreas via dovecot <dovecot at dovecot.org> wrote:
> 
> Hello,
> 
> upgrading to 2.3.9 unfortunately does *not* solve this issue:
> 
> I upgraded one of my replicators from 2.3.7.2 to 2.3.9 and after some
> seconds replication stopped. The other replicator remained with 2.3.7.2. After
> downgrading to 2.3.7.2 replication is again working fine.
> 
> I have not tried upgrading both replicators so far, as this is a live
> production system. Is there a chance that upgrading both replicators will solve
> the problem?
> 
> The machines are running Ubuntu 18.04
> 
> Any help is appreciated.
> 
> Thanks,
> Andreas
> 
> On 18.10.19 at 13:52, Carsten Rosenberg via dovecot wrote:
>> Hi,
>> some of our customers have discovered a replication issue after
>> upgrading from 2.3.7.2 to 2.3.8.
>> Running 2.3.8, several replication connections hang until the defined
>> timeout. So after some seconds there are $replication_max_conns hanging
>> connections.
>> Other replications are running fast and successfully.
>> Also, running a doveadm sync tcp:... works fine for all users.
>> I can't tell exactly, but I haven't seen mailboxes timing out again and
>> again. So I would assume it's not related to the mailbox.
>> From the logs:
>> server1:
>> Oct 16 08:29:25 server1 dovecot[5715]: dsync-local(username1 at domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds (version not received)
>> Oct 16 08:29:25 server1 dovecot[5715]: dsync-local(username1 at domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: Timeout during state=master_recv_handshake
>> server2:
>> Oct 16 08:29:25 server2 dovecot[8113]: doveadm: Error: read(server1) failed: EOF (last sent=handshake, last recv=handshake)
>> There aren't any additional logs regarding the replication.
>> I have tried increasing vsz_limit or reducing replication_max_conns.
>> Nothing changed.
>> --
>> Both customers have 10k+ users. So far I couldn't reproduce this on
>> smaller test systems.
>> Both installations were downgraded to 2.3.7.2 to fix the issue for now.
>> --
>> I've attached a tcpdump showing that the client stops
>> sending any data after the mailbox_guid table headers.
>> Any idea what could be wrong here or how to debug this issue?
>> Thanks.
>> Carsten Rosenberg
> 
> 
> -- 
> ________________________________________________________________________
> Dr. Andreas Piper, Hochschulrechenzentrum der Philipps-Univ. Marburg
>          Hans-Meerwein-Straße 6, 35032 Marburg, Germany
> Phone: +49 6421 28-23521  Fax: -26994  E-Mail: piper at HRZ.Uni-Marburg.DE

Piper Andreas

2019-Dec-06 06:09 UTC

[2.3.8] possible replication issue

Hello Timo,

upgrading both replicators did the job! Both replicators now run v2.3.9
and replication works fine; all sync jobs that queued up during the
upgrade have been processed successfully.

Thanks for the reassurance and all your great work with dovecot,

Andreas



KOCIK Fabien (Acoss)

2019-Dec-06 07:25 UTC

[2.3.8] possible replication issue

Hello all,

Just tested this morning: I can confirm that the issue seems to be resolved for me
after upgrading both servers from 2.3.7.2 to 2.3.9.

Refs:
  * https://dovecot.org/pipermail/dovecot/2019-October/117353.html
  * https://dovecot.org/pipermail/dovecot/2019-November/117467.html

No more "I/O has stalled" error messages, and replication works fine now.
Thanks very much to the Dovecot team.

Have a nice day.
Fabien
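
For anyone who wants to verify the same thing on their own servers, the absence of "I/O has stalled" errors can be checked with a short script. The sketch below is not a Dovecot tool: the function name `count_stalls` is hypothetical, and the pattern is modelled only on the log lines quoted in this thread, so it may need adjusting for other log formats.

```python
import re
from collections import Counter

# Pattern modelled on the dsync-local error lines quoted in this thread.
# The user part is matched loosely, since real logs contain "@" while the
# list archive rewrites it as " at ".
STALL_RE = re.compile(
    r"dsync-local\((?P<user>[^)]*)\)<[^>]*>: Error: "
    r"dsync\((?P<peer>[^)]*)\): I/O has stalled"
)


def count_stalls(lines):
    """Return a Counter of stalled dsync errors keyed by (user, peer)."""
    counts = Counter()
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            counts[(match.group("user"), match.group("peer"))] += 1
    return counts
```

Passing an open log file to `count_stalls` (any iterable of lines works) gives one entry per user and peer address. An empty result after the upgrade would be consistent with the reports above, though it only shows that this particular message no longer appears.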

