In case anyone is interested in reading (and maybe helping!) with a dsync redesign that's intended to fix all of its current problems, here are some possibly incoherent ramblings about it:

http://dovecot.org/tmp/dsync-redesign.txt

and even if you don't understand that, here's another document disguising as an algorithm class problem :) If anyone has thoughts on how to solve it, would be great:

http://dovecot.org/tmp/dsync-redesign-problem.txt

It only deals with saving new messages, not expunges/flag changes/etc, but those should be much simpler.
On 03/23/12 22:25, Timo Sirainen wrote:
> In case anyone is interested in reading (and maybe helping!) with a dsync redesign that's intended to fix all of its current problems, here are some possibly incoherent ramblings about it:
>
> http://dovecot.org/tmp/dsync-redesign.txt
>
> and even if you don't understand that, here's another document disguising as an algorithm class problem :) If anyone has thoughts on how to solve it, would be great:
>
> http://dovecot.org/tmp/dsync-redesign-problem.txt
>
> It only deals with saving new messages, not expunges/flag changes/etc, but those should be much simpler.

Well, dsync is a very useful tool, but with continuous replication it tries to solve a problem that should be handled, at least partially, elsewhere. Storing mail in plain file systems and duplicating it to another one just doesn't scale.

I personally think that Dovecot could gain much more if the work going into fixing or improving dsync went into making Dovecot able to use a high-scale, distributed storage backend. I know it's much harder, because there are several major differences compared to the "low latency" and consistency-problem-free local file systems, but its fruits are also sweeter in the long term. :) It would put Dovecot into a class of open source mail servers where there are currently no contenders.

BTW, regarding the previous question in this topic (are there any NoSQL DBs supporting application-level conflict resolution?): there are such solutions (like CouchDB, but having some experience with it, I wouldn't recommend it for massive mail storage, at least not the plain CouchDB product), but I guess you would be better off designing a schema which doesn't need it in the first place. For example, messages are immutable, so you won't face this issue in this area.
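To illustrate why immutable messages sidestep conflict resolution, here is a minimal sketch. It assumes a content-addressed layout (key = SHA-256 of the raw message), which is my assumption and not something Dovecot or the redesign document specifies; the class and method names are hypothetical.

```python
import hashlib


class ImmutableMessageStore:
    """Toy content-addressed store: a message's key is the SHA-256 of its bytes."""

    def __init__(self):
        self._blobs = {}  # key -> raw message bytes

    def put(self, raw_message: bytes) -> str:
        key = hashlib.sha256(raw_message).hexdigest()
        # The same key always maps to the same bytes, so two replicas
        # can never disagree about what a key contains.
        self._blobs.setdefault(key, raw_message)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def keys(self) -> set:
        return set(self._blobs)

    def merge_from(self, other: "ImmutableMessageStore") -> None:
        """Merging replicas is a plain union; nothing needs to be resolved."""
        for key, blob in other._blobs.items():
            self._blobs.setdefault(key, blob)
```

Since nothing is ever overwritten, the order in which replicas merge does not matter for message bodies; only the mutable metadata needs more care.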
And for metadata, maybe the solution is not to store "digested" snapshots of the current metadata (folders, flags, message links for folders etc.), but to store the changes happening in the user's mailbox and occasionally aggregate them into a last known good and consistent state.

Also, there are other interesting ideas, maybe a real single instance store (splitting MIME parts? storing attachments in plain binary form?). This always brings up the question of whether the mail server should modify the mails, which can be pretty bad for encrypted/signed stuff. And of course there is always the problem of designing a good, consistent method which is also efficient.
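The idea of storing metadata changes and occasionally aggregating them into a consistent state could look roughly like the sketch below. The record layout (`Change`, `Snapshot`, the `seq` counter and the three change kinds) is hypothetical and only meant to show the fold; it is not taken from Dovecot's index format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    seq: int        # hypothetical, monotonically increasing log position
    kind: str       # "add_flag", "remove_flag" or "expunge"
    uid: int
    flag: str = ""


@dataclass
class Snapshot:
    last_seq: int = 0                           # last change folded in
    flags: dict = field(default_factory=dict)   # uid -> set of flags


def aggregate(snapshot: Snapshot, log: list) -> Snapshot:
    """Fold every change newer than the snapshot into a new snapshot."""
    flags = {uid: set(f) for uid, f in snapshot.flags.items()}
    last_seq = snapshot.last_seq
    for change in sorted(log, key=lambda c: c.seq):
        if change.seq <= snapshot.last_seq:
            continue  # already part of the snapshot
        if change.kind == "add_flag":
            flags.setdefault(change.uid, set()).add(change.flag)
        elif change.kind == "remove_flag":
            flags.setdefault(change.uid, set()).discard(change.flag)
        elif change.kind == "expunge":
            flags.pop(change.uid, None)
        last_seq = change.seq
    return Snapshot(last_seq=last_seq, flags=flags)
```

Because changes at or below `last_seq` are skipped, re-running the aggregation over the same log is harmless, which is the property that would make a "last known good" state cheap to rebuild.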
Hello Timo,

Thank you very much for planning a redesign of dsync and for opening this discussion.

As far as I can see from the replies so far, everybody misses the main point of IMAP: IMAP has been designed to work as a disconnected, high-latency data store. To make this clearer: once an IMAP client finishes the synchronization with the server, both client and server have a consistent state of the mailbox. After this, both the "client" and the "server" act as master for their own local copy (on the "server" new emails get created etc.; on the "client" existing emails get changed (flags) and moved, and new emails appear (sent items)). So the protocol was originally designed to handle master-master replication. As such, a worldwide deployment makes sense, where servers work independently and from time to time "merge" the changes.

This being said and acknowledged, here are my 2 cents:

I think that the current '1 brain / 2 workers' seems to be the correct model. The "client" connects to the "server" and pushes the local changes, and afterwards retrieves the updated/new items from the "server". "The brain" considers the first server as the "local storage" and the second server as the "server storage".

For the split design, "come to the same conclusion of the state" is very prone to race conditions.

As long as the algorithm is kept as you described it in the original document, the backups should really be incremental (because you only transfer the changes since the last sync).

As most changes are "metadata-only", the sync can be pretty fast by merging indexes.

Thank you,
Andrei
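As a rough illustration of the '1 brain / 2 workers' model with incremental syncs, here is a minimal sketch. It only covers saving new messages, and everything in it (the `Worker` and `Brain` classes, the per-worker save log, matching messages by a GUID) is an assumption made for the example, not dsync's actual design.

```python
class Worker:
    """One side of the sync: holds messages keyed by a globally unique ID."""

    def __init__(self, name):
        self.name = name
        self.messages = {}   # guid -> body
        self.log = []        # guids in the order they were saved here

    def save(self, guid, body):
        if guid not in self.messages:   # saving the same GUID twice is a no-op
            self.messages[guid] = body
            self.log.append(guid)

    def changes_since(self, position):
        return self.log[position:]


class Brain:
    """Drives both workers and remembers how far each log has been synced."""

    def __init__(self, local, remote):
        self.local, self.remote = local, remote
        self.local_pos = 0
        self.remote_pos = 0

    def sync(self):
        # Look only at what changed on each side since the last sync.
        new_local = self.local.changes_since(self.local_pos)
        new_remote = self.remote.changes_since(self.remote_pos)
        # Push the local changes first, then pull the remote ones.
        for guid in new_local:
            self.remote.save(guid, self.local.messages[guid])
        for guid in new_remote:
            self.local.save(guid, self.remote.messages[guid])
        # Everything in both logs is now known to both sides.
        self.local_pos = len(self.local.log)
        self.remote_pos = len(self.remote.log)
        return len(new_local), len(new_remote)
```

The sketch is single-threaded, so it says nothing about the race conditions of the split design; it only shows why a sync that starts from the last known positions stays proportional to the number of changes.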
Timo Sirainen <tss at iki.fi> writes:
> In case anyone is interested in reading (and maybe helping!) with a dsync redesign that's intended to fix all of its current problems, here are some possibly incoherent ramblings about it:

thank you for opening this discussion about dsync! Besides the problems I've encountered with dsync, there are a couple of things that I think would be great to build into the new vision of the protocol.

One would be the ability to perform *intelligent* incremental/rotated backups. I can do this now by running a dsync backup operation and then manually hardlinking or moving the backup directories (daily.1, daily.2, weekly.1, monthly.1, etc.), but it would be more intelligent if this were baked into the backup process.

Secondly, being able to filter out mailboxes could result in much more efficient syncing. There is currently the capability to operate on only specific mailboxes, but this doesn't scale well when I am trying to back up thousands of users and I want to omit the Spam and Trash folders from the sync. I would have to get a mailbox list for each user, and then iterate over each mailbox for each user, skipping the Spam and Trash folders, forking a new 'dsync backup' for each of their mailboxes, for each user.

Lastly, there isn't a good method for restoring backups. I can reverse the backup process onto the user's "live" mailbox, but that brings the user into an undesirable state (e.g. their mailbox state one day ago). It would be better if the backup could be restored in such a way that the user can resolve the missing pieces manually, as they know best.

thanks again for your work on this. From my position dovecot is an amazing piece of software; the only part that seems to have some issues is dsync, and I applaud the effort to redesign it to fix things!

micah
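The mailbox filtering described above amounts to a small exclusion step before the sync. A minimal sketch follows, assuming "/" as the hierarchy separator and assuming that children of an excluded folder should be skipped too; the function name and defaults are hypothetical and this is not an existing dsync option.

```python
def mailboxes_to_sync(all_mailboxes, excluded=("Spam", "Trash"), separator="/"):
    """Return the mailboxes to sync, dropping excluded folders and their children."""
    keep = []
    for name in all_mailboxes:
        skip = any(
            name == ex or name.startswith(ex + separator) for ex in excluded
        )
        if not skip:
            keep.append(name)
    return keep
```

Matching on the full name or on the name plus separator avoids dropping unrelated folders that merely share a prefix, such as a folder called "Trashcan".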
On 23.3.2012, at 23.25, Timo Sirainen wrote:
> and even if you don't understand that, here's another document disguising as an algorithm class problem :) If anyone has thoughts on how to solve it, would be great:
>
> http://dovecot.org/tmp/dsync-redesign-problem.txt
>
> It only deals with saving new messages, not expunges/flag changes/etc, but those should be much simpler.

Step #3 was more difficult than I first realized. I spent the last two days figuring out a way to make it work, and it looks like I finally did. I didn't update the document yet, but I wrote a test program: http://dovecot.org/tmp/test-dsync.c

Step #2 should be easy enough.

Step #4 I think I'll forget about and just implement a per-mailbox dsync lock. The main reason I wanted to get rid of locks was that a per-user lock can't work with shared mailboxes. But a per-mailbox lock is okay enough. Note that #3 allows the two dsyncs to run in parallel and send duplicate changes, just not modify the same mailbox at the same time (which would duplicate mails due to two transactions adding the same mails).
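A per-mailbox lock could be sketched as below. This is not how Dovecot implements locking; it is a Unix-only illustration using `flock()` on one lock file per mailbox, and the lock directory, the file naming and the use of a mailbox GUID as the key are all assumptions made for the example.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def mailbox_lock(lock_dir, mailbox_guid):
    """Hold an exclusive lock for one mailbox; raise BlockingIOError if it is taken."""
    # Hypothetical naming scheme: one lock file per mailbox GUID.
    path = os.path.join(lock_dir, f"{mailbox_guid}.dsync.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Non-blocking: a second dsync on the same mailbox fails immediately.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield path
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Keying the lock on the mailbox rather than the user means two dsyncs can still run in parallel on different mailboxes, including a shared mailbox reached through different users, while the same mailbox is never modified by both at once.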