Several companies have been interested in getting replication support for
Dovecot. It looks like I could begin implementing it in a few months, after
some other changes. So here's how I'm planning on doing it.

What needs to be replicated:

- Saving new mails (IMAP, deliver)
- Copying existing mails
- Expunges
- Flag and keyword changes
- Mailbox creations, deletions and renames
- Subscription list
- IMAP extension changes, such as ACLs

I'll first talk about only the master/slave configuration, and later about
master/multi-slave and multi-master.

Comments?

Basic design
------------

Since the whole point of master-slave replication is to get a reliable
service, I want to make the replication as reliable as possible. It would be
easier to implement a much simpler master-slave replication, but in error
conditions that would almost guarantee that some messages get lost. I want
to avoid that.

Both slave and master would be running a dovecot-replication process. They
would talk to each other over a TCP connection. The master would feed
changes (transactions) to the slave, and the slave would keep replying how
far it has committed the changes. So there would be an increasing sequence
number sent with each transaction.

The master keeps all the changes in memory until the slave has replied that
it has committed them. If the memory buffer gets too large (1MB?) because
the slave is handling the input too slowly, or because it's completely dead,
the master starts writing the buffer to a file. Once the slave is responding
again, the changes are read from the file and finally the file gets deleted.

If the file gets too large (10MB?), it's deleted and the slave will require
a resync. The master always keeps a "user/mailbox -> last transaction
sequence" mapping in memory. When the slave comes back up and tells the
master its last committed sequence, this allows the master to resync only
those mailboxes that had changed. If this mapping doesn't exist (the master
was restarted), all the users' mailboxes need to be resynced. (A small
sketch of this buffering scheme follows below.)

The resyncing will also have to deal with new mailboxes, ACL changes and
such. I'm not completely sure yet how the resyncing will work. Probably the
only important thing is that it should resolve any message UID conflicts
(even though they shouldn't happen, see below) by comparing the messages'
MD5 contents; with maildir it would be enough to just compare the maildir
filenames.

It's possible that the slave doesn't die completely but only some
transactions fail. This shouldn't normally happen, so the easiest way to
handle it would be to try a full resync, and if that still fails, stop the
whole slave. Another way would be to just mark that one user or mailbox as
dirty and try to resync it once in a while.

imap, pop3 and deliver would connect to dovecot-replication using either
UNIX sockets or a POSIX message queue. The queue could be faster, but
sockets would allow better security checks. Or I suppose it would be
possible to send some kind of a small authentication token in each message
with queues. The communication protocol would be binary, and the existing
dovecot.index.log format would be used as much as possible (making it simple
for the slave to append changes to mailboxes' dovecot.index.log).

The dovecot-replication process would need read/write access to all users'
mailboxes. So either it would run as root, or it would need to have at least
group-permission rights to all mailboxes. A more complex solution would be
to use multiple processes, each running with its own UID, but I think I
won't implement that yet.
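To make the buffering concrete, here's a minimal sketch; the thresholds are
the ones guessed at above, and every name (repl_buffer etc.) is made up for
illustration, not existing Dovecot code:

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define MEM_LIMIT  (1024 * 1024)       /* spill to a file past 1MB */
#define FILE_LIMIT (10 * 1024 * 1024)  /* force a resync past 10MB */

struct repl_buffer {
    uint64_t next_seq;    /* sequence number sent with each transaction */
    uint64_t acked_seq;   /* last sequence the slave has committed */
    size_t mem_used;      /* bytes of unacked changes held in memory */
    size_t file_used;     /* bytes spilled to the overflow file */
    bool need_resync;     /* set when the overflow file grew too large */
};

/* Queue one transaction; returns the sequence number assigned to it. */
uint64_t repl_buffer_add(struct repl_buffer *buf, size_t size)
{
    if (buf->mem_used + size <= MEM_LIMIT)
        buf->mem_used += size;        /* slave is keeping up */
    else if (buf->file_used + size <= FILE_LIMIT)
        buf->file_used += size;       /* slave slow or dead: spill to file */
    else
        buf->need_resync = true;      /* too far behind: full resync */
    return ++buf->next_seq;
}

/* The slave reported that it has committed everything up to 'seq'. */
void repl_buffer_ack(struct repl_buffer *buf, uint64_t seq)
{
    if (seq > buf->acked_seq) {
        buf->acked_seq = seq;
        /* real code would free the acked transactions individually and
           truncate/delete the overflow file; the sketch just resets */
        buf->mem_used = 0;
        buf->file_used = 0;
    }
}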
The replication would require support for shared mailboxes. That way the
mailboxes could be accessed using the configured shared namespaces, for
example "shared/<username>/INBOX". (An LMTP server could be implemented in a
similar way, BTW.)

Configuration
-------------

I'm not yet sure how this should work. The simplest possibility would be to
configure just a single slave and replicate everything there.

But it should be possible to split users across multiple slaves (still one
slave per user). The most configurable way to do this would be to have
userdb return the slave host. Hmm. Actually, that could work pretty well if
the userdb returned the slave's name and each slave had a separate
socket/queue that dovecot-replication listens on. I'm not sure if they could
all still be handled by a single dovecot-replication process, or if there
should be a separate one for each of them. The configuration would then be
something like:

# slave for usernames beginning with letters A..K
replication users_a_k {
  host = slave1.example.com
  # other options?
}
# L..Z and others
replication users_l_z_rest {
  host = slave2.example.com
}

Saving
------

This is the most important thing to get right, and also the most complex
one. Besides replicating mails that are saved via Dovecot, I think
externally saved mails should also be replicated when they're first seen.
This is somewhat related to doing an initial sync to a slave.

The biggest problem with saving is how to robustly handle master crashes. If
you're just pushing changes from master to slave and the master dies, it's
entirely possible that some of the new messages that were already saved on
the master didn't get through to the slave. So when you switch to using the
slave, those messages are gone. The biggest problem with this is that newly
saved mails will be assigned UIDs that were already used by the original
master. If a locally caching client had already seen the UID from the
original master, it'll think the message is the original one instead of the
new one. So the newly added message pretty much gets lost, and the user
could expunge the message (thinking it's the original one) without ever even
seeing it.

This UID conflict could be avoided by making the master tell the slave
whenever it's saving new mails. The latency of the master-slave
communication could be avoided with a bit of a trick (sketched in code after
this list):

- When beginning to save/copy a mail, tell the slave to increase its UID
  counter by one. The slave replies with the new highest-seen-UID counter.
- UID allocation is done after all the mails within a transaction have been
  saved/copied. At this stage, wait until the slave has replied that it has
  reserved UIDs at least as high as we're currently allocating (other
  processes could have caused it to get even higher).
- If the save/copy is aborted, tell the slave to decrease the UID counter by
  the number of aborted messages.
- A UID counter desync shouldn't happen, but if it does, that needs to be
  handled somehow too.
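A sketch of how that counter handshake could look; the actual message
sending is stubbed out and all the uid_counter_* names are hypothetical:

#include <stdint.h>
#include <stdbool.h>

struct uid_counter {
    uint32_t local_next_uid;  /* master's next UID to allocate */
    uint32_t slave_reserved;  /* highest UID the slave has reserved */
};

/* When a save/copy begins: ask the slave to increase its counter (one
   per mail). The reply arrives asynchronously via uid_counter_reply(). */
void uid_counter_reserve(struct uid_counter *c, unsigned int count)
{
    (void)c;
    (void)count;
    /* ... send "reserve <count>" to the slave here ... */
}

/* The slave replied with its new highest-seen-UID counter. */
void uid_counter_reply(struct uid_counter *c, uint32_t highest_seen)
{
    if (highest_seen > c->slave_reserved)
        c->slave_reserved = highest_seen;
}

/* At commit time: UIDs may be handed out only once the slave has
   reserved at least as far as we're about to allocate. */
bool uid_counter_can_allocate(const struct uid_counter *c,
                              unsigned int count)
{
    return c->local_next_uid + count - 1 <= c->slave_reserved;
}

/* On abort: tell the slave to decrease its counter again. */
void uid_counter_release(struct uid_counter *c, unsigned int count)
{
    (void)c;
    (void)count;
    /* ... send "release <count>" to the slave here ... */
}

The point of the split is that the reserve request goes out when the save
begins, so by allocation time the reply has usually already arrived.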
The UID allocation latency then depends on how long it takes to write the
message, and on where the message comes from. With IMAP APPEND there's
probably no wait required, unless the APPEND command and the whole message
arrive in one IP packet. With deliver and copying it might have to wait a
while.

So the UID allocation should work with the minimal possible latency. It
could use a separate TCP connection to the slave, and maybe a separate
high-priority process on the slave that handles only UID allocations.

The above makes sure that conflicting UIDs aren't used, but it doesn't make
sure that the messages themselves aren't lost. If the master's hard disk
fails completely and it can't be resynced anymore, the messages saved there
are lost. This could be avoided by making the save wait until the message is
saved to the slave:

- Save the mail to disk, and at the same time also send it to the slave.
- Allocate the UID(s) and tell the slave what they were; wait for a "mail
  saved" reply from the slave before committing the messages to the mailbox
  permanently.

This would however add more latency than the simpler UID allocation counter.
I don't know if it should be optional which way is used. Perhaps I'll
implement the full save-sync first, and if it seems too slow I'll optionally
implement the UID allocation counter.

Save messages from imap/deliver would contain at least (a struct sketch
follows at the end of this section):

1. user name (unless using UNIX sockets, where it's already known)
2. mailbox name
3. UIDVALIDITY
4a) UID
4b) global UID, filename, offset and size of the message

4a) would be used with the UID allocation counter behavior. The message
would be sent after the message was saved and UIDs were allocated.

4b) would be used with the wait-slave-to-save behavior. The message would be
sent immediately whenever new data was received for the message, so possibly
multiple times for each message. The last block would then contain size=0.
After saving all the mails in a transaction, there would be a separate
"allocate UIDs x..y for global UIDs a, b, c, d, etc." message, which the
slave replies to and the master forwards the reply to imap/deliver.

If the slave is down, or if a timeout (a few seconds?) occurs while waiting
for the reply from the slave, imap/deliver could just go on saving the
message. If this is done, it's possible that there will be UID conflicts if
the master also dies soon. But since this should be a rare error condition,
it's unlikely that the master will die soon after a misbehaving slave.
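As a rough illustration, the two save-message variants could be laid out
like this; the field names and fixed sizes are assumptions, not a defined
wire format:

#include <stdint.h>

#define REPL_MAX_USERNAME 256
#define REPL_MAX_MAILBOX  256

/* 4a) UID-counter behavior: sent once, after UIDs were allocated. */
struct repl_save_uid {
    char user[REPL_MAX_USERNAME];  /* may be implicit with UNIX sockets */
    char mailbox[REPL_MAX_MAILBOX];
    uint32_t uid_validity;
    uint32_t uid;
};

/* 4b) wait-for-slave behavior: sent repeatedly as data arrives for the
   message; a block with size=0 marks the end of the message. */
struct repl_save_block {
    char user[REPL_MAX_USERNAME];
    char mailbox[REPL_MAX_MAILBOX];
    uint32_t uid_validity;
    uint64_t global_uid;     /* stable ID until the real UID exists */
    char filename[256];
    uint64_t offset;
    uint32_t size;           /* 0 = last block of this message */
};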
Copying
-------

Copying has UID conflict problems similar to saving. The main difference is
that there's no need to transfer the message contents, which makes it faster
to handle copies reliably.

Copying could be handled with the UID allocation counter described above.
Since the message already exists on the server, there's no way for it to get
lost. So it would be possible to allow the master and slave to go into
desync, because the user would still find the message in the original
mailbox.

A move operation is done with a "copy + mark \Deleted + expunge"
combination. This could be problematic if the master had managed to do all
of that before the slave saw the COPY at all. Since the message isn't yet
copied to the new mailbox on the slave, clients wouldn't see it there, but
locally caching clients might not see the message in the original mailbox
either, because they had already received an EXPUNGE notification for it. So
this should be avoided in some way.

Perhaps it would be better to just treat copies the same way as saves:
require a notification from the slave before replying to the client with an
OK message. Alternatively the wait could be done at the EXPUNGE command, as
described below.

Expunges
--------

Losing expunge transactions wouldn't be too bad, because the worst that can
happen is that the user needs to expunge the message again. Except that
having an expunged message suddenly reappear in the middle of the mailbox
violates the IMAP protocol. I don't know if or how that will break clients.

The solution here would again be that before EXPUNGE notifications are sent
to the client, we wait for a reply from the slave that it has also processed
the expunge.

Flag and keyword changes
------------------------

It most likely doesn't matter if some of these are lost. If someone really
cares about them, it would be possible to again do the slave-waiting, for
either all the changes or just when some specific flags/keywords change.

Other changes
-------------

Mailbox creations, deletions, renames and subscription changes could all be
handled in a similar way to flag/keyword changes.

Handling changes implemented by IMAP extensions such as ACL is another
matter. It's probably not worth the trouble to implement a special
replication handler for each plugin. These changes will most likely happen
somewhat rarely though, so they could be implemented by sending the whole
IMAP command to the slave and having it execute the command using a regular
imap process. The commands could be marked with some "replication" flag so
Dovecot knows they need to be sent to the slave.

Although instead of executing a new imap process for each user (probably for
each command), it could be possible to use one imap process and just change
the mailbox name in the command. So instead of e.g. "setacl mailbox
(whatever)", the imap process would execute "setacl shared/username/mailbox
(whatever)". That would require telling Dovecot which parameter contains the
mailbox name. It should be pretty easy to implement, unless some IMAP
extensions want multiple mailbox names or the position isn't static. I can't
think of any such extensions right now though.

Master/multi-slave
------------------

Once master/slave is working, support for multiple slaves could be added.
The main thing to be done here is to have the slaves select which one of
them becomes the new master; then they'll have to sync the mailboxes with
each other, because each of them might have seen some messages that the
others hadn't. This assumes that save/etc. operations waiting for a reply
from a slave would wait for only one reply, from the first slave that
happens to respond. The number of slave replies to wait for could also be
made configurable, in case you're worried about a simultaneous master+slave
crash (a sketch follows below).
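The configurable wait could be as simple as counting replies; a tiny sketch
with hypothetical names:

#include <stdbool.h>

struct slave_wait {
    unsigned int replies_needed;  /* 1 = first slave to respond is enough */
    unsigned int replies_got;
};

/* Called for each slave's "committed" reply; returns true once enough
   slaves have confirmed and the client may be sent the tagged OK. */
bool slave_wait_reply(struct slave_wait *w)
{
    return ++w->replies_got >= w->replies_needed;
}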
Multi-master
------------

After master/multi-slave is working, we're nearly ready for full
multi-master operation. The master farm will need global locking. Perhaps
there is a usable existing implementation. In any case there needs to be a
way to keep track of which master owns which locks.

The easiest way to implement multi-master is to assume that a single user
isn't normally accessed from different servers at the same time. This
wouldn't be a requirement, but it would make the performance better. For
most users you could handle this by making sure the load balancer directs
all connections from the same IP to the same server. Or you could use a
Dovecot proxy that transfers the connections.

So, once the user is accessed from only one server, the operation becomes
pretty much like master/multi-slave. You'll just need to be able to
dynamically change the current master server for the user. This can be done
by grabbing a global per-user lock before modifying the mailbox. The lock
wouldn't have to be released immediately though. If no other server needs
the lock, there's no point in locking/unlocking constantly. Instead, the
lock server could ask the current lock owner to release the lock whenever
another server wants it. (A sketch of this follows below.)
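A sketch of such a lazily-released per-user lock; the lock-server API is
assumed, not an existing one:

#include <stdbool.h>
#include <string.h>

struct user_master_lock {
    char user[256];
    bool held;               /* we are currently this user's master */
    bool release_requested;  /* the lock server asked us to give it up */
};

/* Grab the global lock before the first modification, then keep holding
   it; this sketch assumes the lock server request succeeds. */
bool user_lock_acquire(struct user_master_lock *l, const char *user)
{
    if (l->held)
        return true;  /* avoid constant lock/unlock traffic */
    /* ... ask the lock server for the per-user lock here ... */
    strncpy(l->user, user, sizeof(l->user) - 1);
    l->user[sizeof(l->user) - 1] = '\0';
    l->held = true;
    l->release_requested = false;
    return true;
}

/* The lock server calls this back only when another server actually
   wants to become the user's master. */
void user_lock_release_requested(struct user_master_lock *l)
{
    l->release_requested = true;
    /* finish the current transaction, then actually drop the lock */
    l->held = false;
}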
Hi Timo,

MySQL gets around the problem of multiple masters allocating the same
primary key by giving each server its own address range (e.g. the first
machine uses 1, 5, 9, 13, the next one uses 2, 6, 10, 14, ...). Would this
work for UIDs?

Jonathan.
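For concreteness, the MySQL-style scheme would hand out interleaved UID
ranges roughly like this (a toy sketch; Timo's reply below explains why it
doesn't fit IMAP UIDs):

#include <stdint.h>
#include <stdio.h>

/* Smallest uid > last_uid with uid % num_servers == server_id. */
static uint32_t next_uid(uint32_t last_uid, uint32_t server_id,
                         uint32_t num_servers)
{
    uint32_t uid = last_uid + 1;

    uid += (server_id + num_servers - uid % num_servers) % num_servers;
    return uid;
}

int main(void)
{
    /* server 1 of 4 allocates 1, 5, 9, 13; server 2 of 4 would
       allocate 2, 6, 10, 14, and so on */
    uint32_t uid = 0;

    for (int i = 0; i < 4; i++)
        printf("%u ", uid = next_uid(uid, 1, 4));
    printf("\n");
    return 0;
}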
My first reaction is that I've already got replication by running Dovecot on
AFS ;) But that's currently not *really* replicated.

The real question, I guess, is why not use a cluster/distributed/SAN
filesystem like AFS, GFS, Lustre or GPFS to handle the actual data, and
specify that replication only works for maildir or other
one-file-per-message mailbox formats. This puts about half the replication
problem on the filesystem, and I would hope keeps the Dovecot code a lot
simpler. The hard part for Dovecot is understanding the locking semantics of
various filesystems, and doing locking in a way that doesn't totally trash
performance.

I also tend to have IMAP clients open on multiple machines, so the
assumption that a user's mailbox will only be accessed from one IP is
probably a bad one.

I'd suggest that multi-master replication be implemented on a per-mailbox
basis, so that for any mailbox, the first Dovecot instance that gets a
connection for that mailbox checks with a 'dovelock' process that can either
backend to a shared filesystem or implement its own TCP-based lock manager.
It would then get a master lock on that mailbox. If a second Dovecot gets a
connection to a mailbox with a master lock, it would act as a proxy and
redirect all writes to the master. In the case of a shared filesystem, all
reads can still come straight from the local filesystem. In the case of AFS,
this might hit the directory lookups kind of hard, but I think it would
still be a big win because of local caching of all the maildir files, which
are for the most part read-only.

In the event that a master crashes or stops responding, the dovelock process
would need a mechanism to revoke a master lock. This might require some
filesystem-specific glue to make sure that once a lock is revoked, the
Dovecot process that had its lock revoked can't write to the filesystem
anymore. I can think of at least one way to do this with AFS, by having the
AFS server invalidate authentication tokens for the process/server that the
lock was revoked from. Once the lock is revoked, the remaining process can
grab it and go on happily with life. Once the dead/crashed process comes
back, it just has to have its filesystem go out and update to the latest
files.

I like the idea of Dovecot having a built-in proxy/redirect ability so a
cluster can be done with plain old round-robin DNS. I also think that in
most cases, if there's a dead server in the round-robin DNS pool, most
clients will retry automatically, or the user will get an error, click OK,
and try again. Having designated proxies or redirectors in the middle just
makes things complicated and hard to debug.
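A rough sketch of the dovelock idea, with entirely hypothetical names and
the lock backend reduced to a lookup result:

#include <string.h>

enum mailbox_role {
    ROLE_MASTER,  /* we hold the master lock: read and write locally */
    ROLE_PROXY    /* someone else does: read locally, proxy writes */
};

struct dovelock {
    const char *mailbox;
    const char *owner_host;  /* current holder of the master lock */
};

/* First connection wins the master lock; later ones proxy their writes.
   'current_owner' is what the lock backend (shared filesystem or TCP
   lock manager) reports, NULL if nobody holds the lock yet. */
enum mailbox_role dovelock_open(struct dovelock *dl, const char *mailbox,
                                const char *my_host,
                                const char *current_owner)
{
    dl->mailbox = mailbox;
    if (current_owner == NULL || strcmp(current_owner, my_host) == 0) {
        dl->owner_host = my_host;  /* we took (or already had) the lock */
        return ROLE_MASTER;
    }
    dl->owner_host = current_owner;
    return ROLE_PROXY;
}

/* Revocation: the backend must also fence the old master so a dead or
   partitioned process can no longer write, e.g. by invalidating its AFS
   tokens. */
void dovelock_revoke(struct dovelock *dl, const char *new_owner)
{
    dl->owner_host = new_owner;
}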
On Thu, May 17, 2007 at 04:45:21PM +0300, Timo Sirainen wrote:
> [Timo's original proposal quoted in full; trimmed]

--
Troy Benjegerdes <hozer at hozed.org>
Jonathan wrote:
> MySQL gets around the problem of multiple masters allocating the
> same primary key, by giving each server its own address range
> (e.g. first machine uses 1,5,9,13 next one uses 2,6,10,14,...).
> Would this work for UIDs?

UIDs have to be sequential. Causes some problems.
On Fri, May 18, 2007 1:42 am, Troy Benjegerdes <hozer at hozed.org> said:
> I'm going to throw out a warning that it's my feeling that replication
> has ended many otherwise worthwhile projects. Once you go down that
> rabbit hole, you end up finding out the hard way that you just can't
> avoid the stability, performance, complexity, and whatever problems.
> [...]
> I've found myself pretty much in the same "all roads lead to the
> filesystem" situation. I don't want to replicate just imap, I want to
> replicate the build directory with my source code, my email, and my MP3
> files.

One of the problems with the clustered filesystem approach seems to be that
accessing Dovecot's index, cache and control files is slow over the network.
For speed, you ideally want your index, cache and control files on local
disk... but still replicated to another server.

So what about tackling this replication problem from a different angle...
Make it Dovecot's job to replicate the index and control files between
servers, and make it the filesystem's job to replicate just the mail data.
This would require further disconnecting the index and control files from
the mail data, so that there is less syncing required; i.e. remove the need
to check directory mtimes and to compare directory listings against the
index, and instead assume that the indexes are always correct. Periodically
you could still check whether a sync is needed, but you'd do this much less
frequently.

I agree that there are already great solutions available for replicated
storage, so this would allow us to take advantage of these solutions for the
bulk of our storage without impacting the speed of IMAP.

Bill
On Fri, May 18, 2007 1:10 pm, Timo Sirainen <tss at iki.fi> said:
> On Fri, 2007-05-18 at 12:20 -0400, Bill Boebel wrote:
>> So what about tackling this replication problem from a different
>> angle... Make it Dovecot's job to replicate the index and control
>> files between servers, and make it the file system's job to replicate
>> just the mail data. [...]
>
> This would practically mean that you want either cydir or dbox storage.
>
> This kind of a hybrid replication / clustered filesystem implementation
> would simplify the replication a bit, but all the difficult things
> related to UID conflicts etc. will still be there. So I wouldn't mind
> implementing this, but I think implementing the message content sending
> via TCP socket wouldn't add much more code anymore.
>
> The clustered filesystem could probably be used to simplify some things
> though, such as UID allocation could be done by renaming a "uid-<next
> uid>" file. If the rename() succeeded, you allocated the UID, otherwise
> someone else did and you'll have to find the new filename and try again.
> But I'm not sure if this kind of a special-case handling would be good.
> Unless of course I decide to use the same thing for non-replicated
> cydir/dbox.

Dbox definitely sounds promising. I'd avoid putting this uid-<nextuid> file
in the same location as the mail storage though, because if you can truly
keep the mail storage isolated and infrequently accessed, then you can do
cool things like store your mail data remotely on Amazon S3 or equivalent.
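The rename()-based allocation Timo describes above could look roughly like
this; the file layout and error handling are simplified, and none of this
is actual Dovecot code:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <dirent.h>

/* Find the current "uid-<n>" marker file in the mailbox directory. */
static int64_t uid_file_scan(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    int64_t uid = -1;

    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL) {
        if (strncmp(de->d_name, "uid-", 4) == 0)
            uid = strtoll(de->d_name + 4, NULL, 10);
    }
    closedir(d);
    return uid;
}

/* Winning the rename of "uid-<n>" to "uid-<n+1>" allocates UID <n>;
   if someone else renamed it first, rescan and retry. */
int64_t uid_allocate(const char *dir)
{
    char oldpath[512], newpath[512];
    int64_t uid;

    for (;;) {
        uid = uid_file_scan(dir);
        if (uid < 0)
            return -1;
        snprintf(oldpath, sizeof(oldpath), "%s/uid-%lld",
                 dir, (long long)uid);
        snprintf(newpath, sizeof(newpath), "%s/uid-%lld",
                 dir, (long long)(uid + 1));
        if (rename(oldpath, newpath) == 0)
            return uid;             /* we won: UID <uid> is ours */
        if (errno != ENOENT)
            return -1;
        /* lost the race: another process allocated it; try again */
    }
}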
Timo Sirainen writes:
> Master keeps all the changes in memory until slave has replied that it
> has committed the changes. If the memory buffer gets too large (1MB?)

Does this mean that in case of a crash all of that would be lost? I think
the cache should be smaller.

> because the slave is handling the input too slowly or because it's
> completely dead, the master starts writing the buffer to a file. Once
> the slave is again responding the changes are read from the file and
> finally the file gets deleted.

Good.

> If the file gets too large (10MB?) it's deleted and slave will require a
> resync.

Don't agree. A large mailstore with gigabytes worth of mail would benefit
from having 10MB synced, instead of re-starting from scratch.

> Master always keeps track of "user/mailbox -> last transaction
> sequence" in memory. When the slave comes back up and tells the master
> its last committed sequence, this allows the master to resync only those
> mailboxes that had changed.

I think a user-configurable option to decide how large the sync files can
grow would be most flexible.

> the whole slave. Another way would be to just mark that one user or
> mailbox as dirty and try to resync it once in a while.

That sounds better. A full resync can be very time consuming with a large
and busy mailstore. Not only does the full amount of data need to be
synced, but new changes too.

> queues. The communication protocol would be binary

Because? Performance? Wouldn't that make debugging more difficult?

> dovecot-replication process would need read/write access to all users'
> mailboxes. So either it would run as root or it would need to have at
> least group-permission rights to all mailboxes. A bit more complex
> solution would be to use multiple processes each running with their own
> UIDs, but I think I won't implement this yet.

For now, pick the easiest approach to get this first version out. This will
allow testers to have something to stress test. Better to have some basics
out and get feedback than to go after a more complex approach; unless you
believe the complex approach is the ultimate long-term best method.

> But it should be possible to split users into multiple slaves (still one
> slave/user). The most configurable way to do this would be to have
> userdb return the slave host.

Why not just have one slave process per slave machine?

> This is the most important thing to get right, and also the most complex
> one. Besides replicating mails that are being saved via Dovecot, I think
> also externally saved mails should be replicated when they're first
> seen. This is somewhat related to doing an initial sync to a slave.

Why not go with a pure log replication scheme? This way you basically have
three processes:

1. The normal, currently existing programs, with logging added.
2. A master replication process which listens for clients requesting info.
3. The slave processes that request information and write it to the slave
   machines.

With this approach you can break it down into logical units of code which
can be tested and debugged separately. It also helps when you need to worry
about security and the level at which each component needs to work.

> The biggest problem with saving is how to robustly handle master
> crashes. If you're just pushing changes from master to slave and the
> master dies, it's entirely possible that some of the new messages that
> were already saved in master didn't get through to slave.

With my suggested method that can, in theory, never happen. A message
doesn't get accepted unless the log gets written (if replication is on). If
the master dies, when it gets restarted it should be able to continue.

> - If save/copy is aborted, tell the slave to decrease the UID counter
>   by the number of aborted messages.

Are you planning to have a single slave, or did you plan to allow multiple
slaves? If allowing multiple slaves, you will need to keep track of the
point in the log each slave has reached. An easier approach is a time-based
setting for how long the master keeps its logs.

> Solution here would again be that before EXPUNGE notifications are sent
> to client we'll wait for reply from slave that it had also processed the
> expunge.

From all your descriptions it sounds as if you are trying to do synchronous
replication. What I suggested is basically asynchronous replication (a
sketch follows at the end of this message). I think synchronous replication
is not only much more difficult, but also much more difficult to debug and
keep in working order across changes.

> Master/multi-slave
> ------------------
>
> Once the master/slave is working, support for multiple slaves could be
> added.

With the log-shipping method I suggested, multi-slave should not be much
more coding. In theory you could put more of the burden on the slaves to
ask for everything after their last transaction ID, so the master would not
need to know anything extra to handle multiple slaves.

> After master/multi-slave is working, we're nearly ready for a full
> multi-master operation

I think it will be clearer to see what needs to be done after you have
master-slave working. I have never tried to implement a replication system,
but I think that the only way to have a reliable multi-master system is to
have synchronous replication across ALL nodes. This increases communication
and locking significantly. The locking alone will likely be a choke point.
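A sketch of that log-shipping model; the record layout and names are
assumptions:

#include <stdio.h>
#include <stdint.h>

struct log_record {
    uint64_t txn_id;  /* monotonically increasing transaction ID */
    uint32_t size;    /* size of the transaction data that follows */
};

/* Master side: a change is accepted only after the log append succeeds,
   so a restarted master can always continue from its own log. */
int repl_log_append(FILE *log, uint64_t txn_id,
                    const void *data, uint32_t size)
{
    struct log_record rec = { txn_id, size };

    if (fwrite(&rec, sizeof(rec), 1, log) != 1)
        return -1;
    if (size > 0 && fwrite(data, size, 1, log) != 1)
        return -1;
    return fflush(log) == 0 ? 0 : -1;  /* the commit point */
}

/* Slave side: apply every record newer than what this slave has already
   applied; the master keeps no per-slave state at all. */
int repl_log_replay(FILE *log, uint64_t last_applied,
                    void (*apply)(const struct log_record *rec,
                                  const char *data))
{
    struct log_record rec;
    char data[4096];

    while (fread(&rec, sizeof(rec), 1, log) == 1) {
        if (rec.size > sizeof(data))
            return -1;  /* sketch only: fixed-size buffer */
        if (rec.size > 0 && fread(data, rec.size, 1, log) != 1)
            return -1;
        if (rec.txn_id > last_applied)
            apply(&rec, data);
    }
    return 0;
}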
On Thu, 17 May 2007, Timo Sirainen wrote:

Hello,

OpenLDAP uses another strategy, which is more robust, aka needs less fragile
interaction between the servers. OpenLDAP stores any transaction into a
replication log file after it has been processed locally. The repl file is
frequently read by another daemon (slurpd) and forwarded to the slaves. If
the forward to one particular slave fails, the transaction is placed into a
host-specific rejection log file.

OpenLDAP uses the feature that any modification (update, add new, delete)
can be expressed in "command" syntax; hence, the "slave" speaks the same
protocol as the master.

The biggest advantage is that the transaction has already succeeded on the
master and is replayed to the slaves. So when pushing the message to the
slave, you need not fiddle with decreasing UIDs, for instance, in order to
perform a partial sync of a known-good-state mailbox. And the transaction is
saved in the replay log file, in case the master process/host crashes.

I think, if the replication log is replayed quickly - e.g. by "tailing" the
file - you can effectively separate the problem of non-reacting slaves and
re-replay for slaves that come up later, and still have quasi-immediate
updates of the slaves. Also, because one replay agent per slave can be used,
all interaction with a slave is sequential.

You wrote something about avoiding files; what about making the repl log
file a socket? Then the frontend is dealing with the IMAP client and
forwards the request to the replayer and is, therefore, not affected by
possibly bad network issues to the slaves.

You cannot have OpenLDAP's advantage of using the same IMAP protocol for the
slaves, because of some restrictions: you want to have a 100% replica, as I
understand it, hence the UIDs et al. need to be equal. So you will probably
need to extend the IMAP protocol with something like "APPEND/STORE with
UID". The message would then be spooled with the same UID on the slave.

As you wrote, it SHOULD NOT happen that the slave fails, but if the
operation is impossible due to some wicked out-of-sync state, the slave
reports back and requests a full resync. The replay agent would then drop
any pending changes for the specific host and mailbox and sync the whole
mailbox with the slave, probably using something like rsync? BTW: it would
be good if resyncs could also be initiated on admin request ;-)

For the dial-up situation you've mentioned (laptop with its own server), the
replay agent would store any changes until the slave comes up, probably by
contacting the master Dovecot process with something like SMTP's ETRN.

When the probability is low that the same mailbox is accessed on different
hosts (for shared folders, multiple accesses are likely), this method should
even work well in multi-master situations. You'll have to run replay agents
on all the servers then.

To get the issues with the UIDs correct when one mailbox is in use on
different hosts, you thought about locks. But is this necessary? If only the
UIDs are the problem, then with a method to "mark" a UID as taken throughout
multiple masters, all masters will have the same UID level, not necessarily
with the message data already associated. Meaning: if master A is to APPEND
a new message to mailbox M, it sends all other masters the info "want to
take UID U". If the UID is already taken by another master B, B replies "UID
taken"; then the mailboxes are out of sync and need a full resync. If a
master B receives a request for UID U that it has claimed for itself,
masters A and B are ranked, e.g. by IP address, so master B replies either
"you may take it" or "I want to take it". In the first case, master B
re-issues its request for another UID U2 and marks UID U as taken; otherwise
master B marks UID U as taken in mailbox M. If master A got the "OK for UID
U", it allocates it finally, accepts the message from the IMAP/SMTP client
and places the message into the replay log file. When master B later gets a
transaction "STORE message as UID U" for a UID it has marked as taken but
has no message for yet, it accepts the transaction. (A sketch of this claim
protocol follows at the end of this message.)

> doesn't make sure that messages themselves aren't lost. If the master's
> hard disk failed completely and it can't be resynced anymore, the
> messages saved there are lost. This could be avoided by making the
> saving wait until the message is saved to slave:
>
> - save mail to disk, and at the same time also send it to slave
> - allocate UID(s) and tell to slave what they were, wait for "mail
>   saved" reply from slave before committing the messages to mailbox
>   permanently

Well, this assumes that everything is functioning hyper-well. Surviving a
hard disk failure should not be Dovecot's problem, but the underlying
filesystem's, IMHO (aka RAID, SAN). If you want to wait for all slaves to
give their OK for each transaction, you'll have problems with the "slave
server on laptop" scenario; then you'll need to perform a full sync each
time.

BTW: There is something called DLM (Distributed Lock Manager); I don't know
if this is what you are looking for.

Bye,

--
Steffen Kaiser
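A sketch of the claim protocol Steffen outlines above; the message types and
the rank-by-IP tiebreak are illustrative assumptions:

#include <stdint.h>

enum claim_reply {
    CLAIM_OK,         /* "you may take it" */
    CLAIM_TAKEN,      /* already allocated: mailboxes need a full resync */
    CLAIM_CONTESTED   /* "I want to take it": rank decides, peer retries */
};

struct uid_claim {
    uint32_t uid;      /* the UID we are trying to take ourselves */
    uint32_t my_rank;  /* e.g. derived from our IP address */
};

/* Handle a peer's "want to take UID U" announcement while we may have a
   claim of our own outstanding for the same mailbox. */
enum claim_reply uid_claim_incoming(const struct uid_claim *mine,
                                    uint32_t peer_uid, uint32_t peer_rank)
{
    if (peer_uid != mine->uid)
        return CLAIM_OK;        /* no conflict with our own claim */
    if (peer_rank > mine->my_rank)
        return CLAIM_OK;        /* peer wins; we retry with uid + 1 */
    return CLAIM_CONTESTED;     /* we win; the peer retries instead */
}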