Timo Sirainen
2006-Oct-24 18:31 UTC
[Dovecot] Clustering (replication and proxying) plans for the future
I wrote this mostly so that I won't forget my thoughts. Most likely this won't be implemented anytime soon, and I haven't thought through all the locking issues. It would require a distributed lock manager in any case. There are already several existing ones, so one of them could probably be used for Dovecot.

Each mail server would consult a master server to get the namespace configuration and the location of shared/public mailboxes. If the data is stored in some central SQL server, there could just as well be multiple master servers (or each server could even be its own master).

Whenever a shared mailbox that doesn't exist on the current server is opened, the imap process would automatically create a new connection to the destination server and start proxying the data. The proxy would be a fairly dumb one, which just forwards the client's commands and doesn't even try to understand the replies from the remote server. So the admin will have to be careful to use only servers which have all the capabilities that were announced by the original server.

So, then the replication. Each namespace could be configured to be available either by proxying or by replication:

namespace public {
  prefix = example/
  location = proxy:imap.example.com
}

namespace public {
  prefix = example/
  replicated_to = host1.example.com host2.example.com
  # If multi-master, we'll need to use global locking. If not, the
  # replicated_to servers contain only slaves.
  multi_master = yes
  location = maildir:~/Maildir
}

The way this would work is that each imap process connects via a UNIX socket to a replication process, which gets input from all the imap processes. The replication process keeps track of how much has been sent to each other server and also of the acks it has received from them. So if the network dies temporarily, it'll start writing the changes to disk until it reaches some limit, after which a complete resync will be necessary.
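A minimal sketch (my own, not Dovecot code) of the bookkeeping described above: the replication process remembers how far each peer has acked, buffers changes while a peer is behind or unreachable, and flags a full resync once the backlog limit is exceeded. All names and the list-based "log" are illustrative assumptions.

```python
class ReplicationQueue:
    """Tracks per-peer replication progress for one replication process."""

    def __init__(self, peers, buffer_limit):
        self.log = []                        # appended change records
        self.acked = {p: 0 for p in peers}   # per-peer acked log offset
        self.buffer_limit = buffer_limit     # max unacked backlog per peer
        self.needs_resync = set()            # peers that fell too far behind

    def append(self, change):
        self.log.append(change)
        for peer, offset in self.acked.items():
            # Past this limit, only a complete resync can catch the peer up.
            if len(self.log) - offset > self.buffer_limit:
                self.needs_resync.add(peer)

    def ack(self, peer, offset):
        self.acked[peer] = max(self.acked[peer], offset)

    def pending(self, peer):
        # Changes not yet confirmed by this peer.
        return self.log[self.acked[peer]:]
```

In a real implementation the overflow would spill to disk rather than a list, but the acked-offset accounting would be the same.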
Dovecot already writes all the changes to a mailbox first to a transaction log file, and only afterwards does it update the actual mailbox. The contents of new mails aren't stored there, though. The replication could work simply by sending the transaction logs' contents to the replication process, which passes them on to the other servers, which finally sync their local mailboxes based on that data. Since Dovecot is already able to sync mailboxes based on the transaction log's contents, this should be pretty easy to implement.

Of course the new mails' contents also have to be sent. This could be prioritized lower than the transaction traffic, so that each server always has a very up-to-date view of the mailbox metadata, but not necessarily the contents of all the mails.

If a server finds itself in a situation where it doesn't have some specific mail, it'll send a request to the replication process to fetch it ASAP from another server. The reply will then take the highest priority in the queue.

Normally when opening or reading a mailbox there's really no need to do any global locking. The server will show the mailbox's state to be whatever it last received from the replication process. Whenever the mailbox is modified, though, a global lock is required. There's also the problem that if the network is down between two servers, they may both think they got the global lock. If they made any conflicting changes, those will have to be resolved when the servers see each other again. Dovecot's transactions in general aren't order-dependent, so most conflicts can be resolved by just having the replication master decide in which order the transactions were done and then sending the transactions back to all of the servers in that order.

Adding new messages is a problem, however. UIDs must always be growing and they must be unique. I think Cyrus solved this by having 64bit global UIDs which really uniquely identify the messages (i.e. the first 32 bits contain a server-specific UID), and then the 32bit IMAP UIDs. If there are any conflicts with the IMAP UIDs, the messages are given new UIDs (which invalidates clients' cache for those messages, but that can't be helped).

Resyncing works basically by requesting a list of all the mailboxes, opening each one of them and requesting new transactions from the log files with a <UID validity>, <log file sequence>, <log file offset> triplet. If the UID validity has changed or the log file no longer exists, a full resync needs to be done. Also, if the log file contains a lot of changes, it's probably faster to do a full resync. A full resync would then be just sending the whole dovecot.index file.

Conflict resolution could also require a full resync if transaction logs were lost. That would again work by the master receiving everyone's dovecot.index files, deciding what to do with them and sending the modified version back to everyone.

Each server has different transaction logs, so the log file sequences can't be used directly between servers. So the replication process will probably have to keep track of every server's log seq/offset. Also, when full dovecot.index files are received, they'll have to be updated in some way.

Then there's the problem of how to implement the receiving side. If the user already has a process connected to the replication process, it could be instructed to update the mailboxes. But since this doesn't happen all that often, there also needs to be some replication-writer process which writes all the received transactions to the mailboxes.

Now, when exactly should these replication-writer processes be created? If all users have the same UNIX user ID, there could be a pool of processes always running and doing the writing. Although this worries me a bit, since the same process would be handling different users' data. A more secure way, and the only way if users have different UNIX UIDs, is to launch a separate writer process for each user.
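The resync decision described above can be sketched roughly as follows. This is a hypothetical illustration, not Dovecot code: the function names and the "too many changes" threshold are my own assumptions.

```python
FULL_RESYNC_THRESHOLD = 10000  # assumed: pending log bytes beyond which a
                               # full dovecot.index transfer is cheaper

def plan_sync(local, peer_validity, peer_seq, peer_offset):
    """Decide incremental vs. full resync from a peer's
    <UID validity, log file sequence, log file offset> triplet.

    local: dict with 'uid_validity' and 'logs' mapping seq -> log length.
    """
    if peer_validity != local["uid_validity"]:
        return "full"                         # mailbox was recreated
    if peer_seq not in local["logs"]:
        return "full"                         # log file already rotated away
    pending = local["logs"][peer_seq] - peer_offset
    if pending > FULL_RESYNC_THRESHOLD:
        return "full"                         # too many changes to replay
    return ("incremental", peer_seq, peer_offset)
```

Note that since each server has its own transaction logs, the seq/offset here would have to be the values the replication process tracked for that particular peer, as described above.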
It probably isn't a great idea to do that immediately when data is received for a user, but rather to queue the data for a while and then, once enough has been received or some timeout has occurred, execute the writer process. There should probably be a shared in-memory cache with a memory limit of some megabytes, after which data is written to disk up to a configurable amount. Only after that is full, or the timeout for the user is reached, is the writer process executed and the data sent to it.

All this replication stuff could also be used to implement a remote mail storage (i.e. smarter proxying). The received indexes/transactions could simply be kept in memory, and all the mails that aren't already cached in memory could just be requested from the master whenever needed. I'm not sure if there's much point in doing this over dumb proxying, though.
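The batching policy for the replication-writer could look roughly like this sketch (my own names and limits, not Dovecot's): per-user data is queued in memory, overflow is counted against a disk spill limit, and the writer process is launched when the spill area fills or the user's timeout fires.

```python
import time

MEM_LIMIT = 4 * 1024 * 1024     # assumed in-memory cache limit (4 MB)
DISK_LIMIT = 16 * 1024 * 1024   # assumed configurable on-disk spill limit
TIMEOUT = 30.0                  # assumed flush timeout in seconds

class UserQueue:
    """Queued replication data for one user, pending a writer process."""

    def __init__(self):
        self.mem = []
        self.mem_size = 0
        self.disk_size = 0
        self.first_seen = None   # when the oldest queued data arrived

    def add(self, data, now=None):
        now = time.monotonic() if now is None else now
        if self.first_seen is None:
            self.first_seen = now
        if self.mem_size + len(data) <= MEM_LIMIT:
            self.mem.append(data)
            self.mem_size += len(data)
        else:
            self.disk_size += len(data)   # real code would write to disk

    def should_flush(self, now=None):
        now = time.monotonic() if now is None else now
        if self.disk_size >= DISK_LIMIT:
            return True                   # spill area full: launch writer now
        return (self.first_seen is not None
                and now - self.first_seen >= TIMEOUT)
```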
Timo Sirainen
2006-Oct-24 18:48 UTC
[Dovecot] Clustering (replication and proxying) plans for the future
Some updates I thought of while reading through this mail:

On Tue, 2006-10-24 at 21:31 +0300, Timo Sirainen wrote:
> Of course the new mails' contents also have to be sent. This could be
> prioritized lower than the transaction traffic, so that each server
> always has very up-to-date view of the mailbox metadata, but not
> necessarily the contents of all the mails.

I'm not sure which is the best way to do this split of transaction data and mail contents. There are two possibilities:

1) Use multiple connections: high priority (requested mail contents), normal priority (transactions) and low priority (non-requested mail contents). This won't give any priority guarantees though, unless different ports are used and the admin sets up QoS. Although this won't really matter as long as there's enough bandwidth available.

2) Use one connection and send all data in smaller blocks. Dynamically adjust the block size so that it gives high enough throughput, but also small enough latency that higher priority blocks can be inserted into the queue and received quickly.

> If the server finds itself in a situation that it doesn't have some
> specific mail, it'll send a request to the replication process to fetch
> it ASAP from another server. The reply will then take the highest
> priority in the queue.

If the other server is gone, there's no way to get the mail. In this case the mail needs to be treated as expunged, and in the next sync it'll get a new UID.

> Now, when exactly should these replication-writer processes be created
> then? If all users have the same UNIX user ID, then there could be a
> spool of processes always running and doing the writing. Although this
> worries me a bit since the same process is handling different users'
> data.
>
> More secure way, and the only way if users have different UNIX UIDs, is
> to launch a separate writer process for each user.

Note that this exact same problem comes up if an LMTP server is implemented.
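Option 2 above can be sketched like this (a hypothetical illustration, with my own names): one connection, where large low-priority payloads are chopped into blocks so that a high-priority block can jump the queue between any two chunks.

```python
import heapq
import itertools

HIGH, NORMAL, LOW = 0, 1, 2
BLOCK_SIZE = 4096  # would be adjusted dynamically from observed latency

class BlockQueue:
    """Single-connection send queue with priority-ordered blocks."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # preserves FIFO order within a priority

    def send(self, priority, payload):
        # Split the payload so no single item monopolizes the connection;
        # a later high-priority block can be popped between chunks.
        for i in range(0, len(payload), BLOCK_SIZE):
            block = payload[i:i + BLOCK_SIZE]
            heapq.heappush(self._heap, (priority, next(self._seq), block))

    def next_block(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

The dynamic block-size adjustment itself isn't shown; the key property is that the latency of a high-priority block is bounded by at most one in-flight block of `BLOCK_SIZE` bytes.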
Ethan Sommer
2006-Oct-27 12:35 UTC
[Dovecot] Clustering (replication and proxying) plans for the future
Timo Sirainen wrote:
> Whenever modifying the mailbox a global lock is required. Here's also
> the problem that if the network is down between two servers, they both
> think that they got the global lock. If they did any conflicting
> changes, they'll have to be resolved when both of them again see each
> others.

Some clustering systems (e.g. GFS) get around this by requiring that you have >= 3 servers (preferably an odd number) and holding elections which require a majority (quorum) to proceed. The third server could just be a tie-breaker machine and doesn't actually have to run Dovecot (kind of like the Vice President in the Senate). It makes the system somewhat less redundant (e.g. you can't have a "remote office" server and expect it to function if the link is down), but it is "safer" and "simpler" in many ways.

Thus you could have 3 servers: if one goes down, the other two would behave normally (so you can do maintenance, etc.). If two servers go down, you would queue up mail and wait for a quorum to be regained. With 5 servers, 2 servers could go down without requiring the other 3 to stop functioning, etc.

Another option, if you wanted to take on something smaller first, would be to allow only master/slave clustering, somewhat like LDAP, and use some form of reliable heartbeat (e.g. a serial cable) to detect whether the master is down and the slave should take over the master role. You could still try to split the read requests between the two servers for scalability. I know that we would _love_ to have a system like this, even if it is less flexible.

> All this replication stuff could also be used to implement a remote mail
> storage (ie. smarter proxying). The received indexes/transactions could
> simply be kept in memory, and all the mails that aren't already cached
> in memory could just be requested from the master whenever needed. I'm
> not sure if there's much point in doing this over dummy proxying though.

I would imagine that having a smarter proxy might not help much for a full-fledged mail client (e.g. Thunderbird) but would have a significant impact on webmail applications, which are constantly asking for the same thing (e.g. SELECT INBOX) over and over again.

Ethan Sommer
UNIX Systems Administrator
Gustavus Adolphus College
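The quorum rule described here comes down to a strict-majority check (a toy illustration of my own, not from GFS): at most one network partition can hold a strict majority of the voters, so two partitions can never both believe they own the lock.

```python
def has_quorum(votes, cluster_size):
    """A partition may proceed only if it holds a strict majority of voters.

    Two disjoint partitions cannot both satisfy this, which is what rules
    out the split-brain double-lock scenario. Odd cluster sizes avoid ties.
    """
    return votes > cluster_size // 2
```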
Bill Boebel
2006-Nov-02 04:55 UTC
[Dovecot] Clustering (replication and proxying) plans for the future
On Tue, October 24, 2006 2:31 pm, Timo Sirainen <tss at iki.fi> said:
> The replication could work simply by sending the transaction logs'
> contents to the replication process which passes it onto other servers,
> which finally sync their local mailboxes based on that data. Since
> Dovecot already is able to sync mailboxes based on the transaction log's
> contents this should be pretty easy to implement.
>
> Of course the new mails' contents also have to be sent. This could be
> prioritized lower than the transaction traffic, so that each server
> always has very up-to-date view of the mailbox metadata, but not
> necessarily the contents of all the mails.
>
> If the server finds itself in a situation that it doesn't have some
> specific mail, it'll send a request to the replication process to fetch
> it ASAP from another server. The reply will then take the highest
> priority in the queue.

Is your primary goal with this replication to add redundancy, or to distribute load such as for shared mailboxes? I've thought about the redundancy side of this a lot, but not so much about load distribution. It sounds like you're going for load distribution, but...

If the goal is redundancy, I'd suggest that mailbox state is less important than the mail data. Index files can be recreated from the data, and would not even be needed on the secondary server(s) unless the primary fails. So I'd put a higher priority on getting the mail content to the secondary servers. If the goal is to distribute load for shared mailboxes, then your prioritization makes sense.

Also, if the goal is redundancy, this design can be simplified a lot by having one master and one or more slaves. All modify operations would go to the master, so you would not need global locks. Different sets of users would belong to different namespaces, so that you can have multiple replication paths in your cluster.
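The master/slave simplification can be sketched like this (hypothetical names, my own illustration): modify operations are always routed to a namespace's master, so no global locking is needed, while reads may be spread across all replicas.

```python
READ_OPS = {"FETCH", "SEARCH", "STATUS"}  # assumed read-only command set

class Namespace:
    """Routes IMAP operations for one replicated namespace."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._rr = 0  # round-robin index across all replicas

    def route(self, op):
        if op not in READ_OPS:
            return self.master            # writes: master only, no global lock
        replicas = [self.master] + self.slaves
        self._rr = (self._rr + 1) % len(replicas)
        return replicas[self._rr]         # reads: spread for scalability
```

With different user sets mapped to namespaces with different masters, each namespace forms its own replication path, as suggested above.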
You'd also want an API for the replication process, so that third-party applications that modify mailboxes can log those changes for replication. In that case, the replication process could be made generic enough to replicate any set of files, with or without Dovecot.

Bill