Hello, I am working on replicating a server (and adding compression on the other side) and since I had "Error: dsync I/O has stalled, no activity for 600 seconds (version not received)" errors I upgraded both source and destination server with the latest 2.3 version (2.3.18). While before the upgrade all the 15 replication connections were busy after upgrading dovecot replicator dsync-status shows that most of the time nothing is being replicated at all. I can see some brief replications that last, but 99,9% of the time nothing is happening at all. I have a replication_full_sync_interval of 12 hours but I have thousands of users with their last full sync over 90 hours ago. "doveadm replicator status" also shows that i have over 35,000 queued full resync requests, but no sync, high or low queued requests so why aren't the full requests occuring? There are no errors in the logs. Thanks, Arnaud -- Arnaud Ab?lard Responsable p?le Syst?me et Stockage Service Infrastructures DSIN Universit? de Nantes - -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4186 bytes Desc: S/MIME Cryptographic Signature URL: <https://dovecot.org/pipermail/dovecot/attachments/20220424/5ad01143/attachment.bin>
Paul Kudla (SCOM.CA Internet Services Inc.)
2022-Apr-24 17:36 UTC
no full syncs after upgrading to dovecot 2.3.18
Question having similiar replication issues pls read everything below and advise the folder counts on the non-replicated users? i find the total number of folders / account seems to be a factor and NOT the size of the mail box ie i have customers with 40G of emails no problem over 40 or so folders and it works ok 300+ folders seems to be the issue i have been going through the replication code no errors being logged i am assuming that the replication --> dhclient --> other server is timing out or not reading the folder lists correctly (ie dies after X folders read) thus i am going through the code patching for log entries etc to find the issues. see [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot # ll total 86 drwxr-xr-x 2 root wheel uarch 4B Apr 24 11:11 . drwxr-xr-x 4 root wheel uarch 4B Mar 8 2021 .. -rw-r--r-- 1 root wheel uarch 73B Apr 24 11:11 instances -rw-r--r-- 1 root wheel uarch 160K Apr 24 13:33 replicator.db [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot # replicator.db seems to get updated ok but never processed properly. # sync.users nick at elirpa.com high 00:09:41 463:47:01 - y keith at elirpa.com high 00:09:23 463:45:43 - y paul at scom.ca high 00:09:41 463:46:51 - y ed at scom.ca high 00:09:43 463:47:01 - y ed.hanna at dssmgmt.com high 00:09:42 463:46:58 - y paul at paulkudla.net high 00:09:44 463:47:03 580:35:07 y so .... two things : first to get the production stuff to work i had to write a script that whould find the bad sync's and the force a dsync between the servers i run this every five minutes or each server. in crontab */10 * * * * root /usr/bin/nohup /programs/common/sync.recover > /dev/null python script to sort things out # cat /programs/common/sync.recover #!/usr/local/bin/python3 #Force sync between servers that are reporting bad? import os,sys,django,socket from optparse import OptionParser from lib import * #Sample Re-Index MB #doveadm -D force-resync -u paul at scom.ca -f INBOX* USAGE_TEXT = '''\ usage: %%prog %s[options] ''' parser = OptionParser(usage=USAGE_TEXT % '', version='0.4') parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To") parser.add_option("-e", "--email", dest="email_box", help="Box to Index") parser.add_option("-d", "--detail",action='store_true', dest="detail",default =False, help="Detailed report") parser.add_option("-i", "--index",action='store_true', dest="index",default =False, help="Index") options, args = parser.parse_args() print (options.email_box) print (options.send_to) print (options.detail) #sys.exit() print ('Getting Current User Sync Status') command = commands("/usr/local/bin/doveadm replicator status '*'") #print command sync_user_status = command.output.split('\n') #print sync_user_status synced = [] for n in range(1,len(sync_user_status)) : user = sync_user_status[n] print ('Processing User : %s' %user.split(' ')[0]) if user.split(' ')[0] != options.email_box : if options.email_box != None : continue if options.index == True : command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' %user.split(' ')[0] command = commands(command) command = command.output #print user for nn in range (len(user)-1,0,-1) : #print nn #print user[nn] if user[nn] == '-' : #print 'skipping ... %s' %user.split(' ')[0] break if user[nn] == 'y': #Found a Bad Mailbox print ('syncing ... %s' %user.split(' ')[0]) if options.detail == True : command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0] print (command) command = commands(command) command = command.output.split('\n') print (command) print ('Processed Mailbox for ... %s' %user.split(' ')[0] ) synced.append('Processed Mailbox for ... %s' %user.split(' ')[0]) for nnn in range(len(command)): synced.append(command[nnn] + '\n') break if options.detail == False : #command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0] #print (command) #command = os.system(command) command = subprocess.Popen( ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" %user.split(' ')[0] ], \ shell = True, stdin=None, stdout=None, stderr=None, close_fds=True) print ( 'Processed Mailbox for ... %s' %user.split(' ')[0] ) synced.append('Processed Mailbox for ... %s' %user.split(' ')[0]) #sys.exit() break if len(synced) != 0 : #send email showing bad synced boxes ? if options.send_to != None : send_from = 'monitor at scom.ca' send_to = ['%s' %options.send_to] send_subject = 'Dovecot Bad Sync Report for : %s' %(socket.gethostname()) send_text = '\n\n' for n in range (len(synced)) : send_text = send_text + synced[n] + '\n' send_files = [] sendmail (send_from, send_to, send_subject, send_text, send_files) sys.exit() second : i posted this a month ago - no response please appreciate that i am trying to help .... after much testing i can now reporduce the replication issues at hand I am running on freebsd 12 & 13 stable (both test and production servers) sdram drives etc ... Basically replication works fine until reaching a folder quantity of ~ 256 or more to reproduce using doveadm i created folders like INBOX/folder-0 INBOX/folder-1 INBOX/folder-2 INBOX/folder-3 and so forth ...... I created 200 folders and they replicated ok on both servers I created another 200 (400 total) and the replicator got stuck and would not update the mbox on the alternate server anymore and is still updating 4 days later ? basically replicator goes so far and either hangs or more likely bails on an error that is not reported to the debug reporting ? however dsync will sync the two servers but only when run manually (ie all the folders will sync) I have two test servers avaliable if you need any kind of access - again here to help. [07:28:42] mail18.scom.ca [root:0] ~ # sync.status Queued 'sync' requests 0 Queued 'high' requests 6 Queued 'low' requests 0 Queued 'failed' requests 0 Queued 'full resync' requests 0 Waiting 'failed' requests 0 Total number of known users 255 username type status paul at scom.ca normal Waiting for dsync to finish keith at elirpa.com incremental Waiting for dsync to finish ed.hanna at dssmgmt.com incremental Waiting for dsync to finish ed at scom.ca incremental Waiting for dsync to finish nick at elirpa.com incremental Waiting for dsync to finish paul at paulkudla.net incremental Waiting for dsync to finish i have been going through the c code and it seems the replication gets requested ok replicator.db does get updated ok with the replicated request for the mbox in question. however i am still looking for the actual replicator function in the lib's that do the actual replication requests the number of folders & subfolders is defanately the issue - not the mbox pyhsical size as thought origionally. if someone can point me in the right direction, it seems either the replicator is not picking up on the number of folders to replicat properly or it has a hard set limit like 256 / 512 / 65535 etc and stops the replication request thereafter. I am mainly a machine code programmer from the 80's and have concentrated on python as of late, 'c' i am starting to go through just to give you a background on my talents. It took 2 months to finger this out. this issue also seems to be indirectly causing the duplicate messages supression not to work as well. python programming to reproduce issue (loops are for last run started @ 200 - fyi) : # cat mbox.gen #!/usr/local/bin/python2 import os,sys from lib import * user = 'paul at paulkudla.net' """ for count in range (0,600) : box = 'INBOX/folder-%s' %count print count command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box) print command a = commands.getoutput(command) print a """ for count in range (0,600) : box = 'INBOX/folder-0/sub-%' %count print count command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box) print command a = commands.getoutput(command) print a #sys.exit() Happy Sunday !!! Thanks - paul Paul Kudla Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3 Toronto 416.642.7266 Main?1.866.411.7266 Fax?1.888.892.7266 On 4/24/2022 10:22 AM, Arnaud Ab?lard wrote:> Hello, > > I am working on replicating a server (and adding compression on the > other side) and since I had "Error: dsync I/O has stalled, no activity > for 600 seconds (version not received)" errors I upgraded both source > and destination server with the latest 2.3 version (2.3.18). While > before the upgrade all the 15 replication connections were busy after > upgrading dovecot replicator dsync-status shows that most of the time > nothing is being replicated at all. I can see some brief replications > that last, but 99,9% of the time nothing is happening at all. > > I have a replication_full_sync_interval of 12 hours but I have thousands > of users with their last full sync over 90 hours ago. > > "doveadm replicator status" also shows that i have over 35,000 queued > full resync requests, but no sync, high or low queued requests so why > aren't the full requests occuring? > > There are no errors in the logs. > > Thanks, > > Arnaud > > > > >