Timo Sirainen
2006-Oct-24 18:31 UTC
[Dovecot] Clustering (replication and proxying) plans for the future
I wrote this mostly so that I won't forget my thoughts. Most likely this won't be implemented anytime soon, and I haven't thought through all the locking issues. It would require a distributed lock manager in any case. There are already several existing ones, so one of them could probably be used for Dovecot.

Each mail server would consult a master server to get the namespace configuration and the location of shared/public mailboxes. If the data is stored in some central SQL server, there could just as well be multiple master servers (or each server could even be its own master).

Whenever a shared mailbox that doesn't exist on the current server is opened, the imap process would automatically create a new connection to the destination server and start proxying the data. The proxy would be a fairly dumb one, which just forwards the client's commands and doesn't even try to understand the replies from the remote server. So the admin will have to be careful to use only servers which have all the capabilities that were announced by the original server.

So, then the replication. Each namespace could be configured to be available either by proxying or by replication:

namespace public {
  prefix = example/
  location = proxy:imap.example.com
}

namespace public {
  prefix = example/
  replicated_to = host1.example.com host2.example.com
  # If multi-master, we'll need to use global locking. If not, the
  # replicated_to servers contain only slaves.
  multi_master = yes
  location = maildir:~/Maildir
}

The way this would work is that each imap process connects via a UNIX socket to a replication process, which gets input from all the imap processes. The replication process keeps track of how much has been sent to each other server and also of the acks it has received from them. So if the network dies temporarily, it'll start writing the changes to disk until it reaches some limit, after which a complete resync will be necessary.
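A minimal sketch (my own, not Dovecot code) of the bookkeeping described above: the replication process remembers how far each peer has acked, buffers changes while a peer is behind or unreachable, and flags a full resync once the backlog limit is exceeded. All names and the list-based "log" are illustrative assumptions.

```python
class ReplicationQueue:
    """Tracks per-peer replication progress for one replication process."""

    def __init__(self, peers, buffer_limit):
        self.log = []                        # appended change records
        self.acked = {p: 0 for p in peers}   # per-peer acked log offset
        self.buffer_limit = buffer_limit     # max unacked backlog per peer
        self.needs_resync = set()            # peers that fell too far behind

    def append(self, change):
        self.log.append(change)
        for peer, offset in self.acked.items():
            # Past this limit, only a complete resync can catch the peer up.
            if len(self.log) - offset > self.buffer_limit:
                self.needs_resync.add(peer)

    def ack(self, peer, offset):
        self.acked[peer] = max(self.acked[peer], offset)

    def pending(self, peer):
        # Changes not yet confirmed by this peer.
        return self.log[self.acked[peer]:]
```

In a real implementation the overflow would spill to disk rather than a list, but the acked-offset accounting would be the same.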
Dovecot already writes all the changes to a mailbox first to a transaction log file, and only afterwards does it update the actual mailbox. The contents of new mails aren't stored there, though. The replication could work simply by sending the transaction logs' contents to the replication process, which passes them on to the other servers, which finally sync their local mailboxes based on that data. Since Dovecot is already able to sync mailboxes based on the transaction log's contents, this should be pretty easy to implement.

Of course the new mails' contents also have to be sent. This could be prioritized lower than the transaction traffic, so that each server always has a very up-to-date view of the mailbox metadata, but not necessarily the contents of all the mails.

If a server finds itself in a situation where it doesn't have some specific mail, it'll send a request to the replication process to fetch it ASAP from another server. The reply will then take the highest priority in the queue.

Normally when opening or reading a mailbox there's really no need to do any global locking. The server will show the mailbox's state to be whatever it last received from the replication process. Whenever the mailbox is modified, though, a global lock is required. There's also the problem that if the network is down between two servers, they may both think they got the global lock. If they made any conflicting changes, those will have to be resolved when the servers see each other again. Dovecot's transactions in general aren't order-dependent, so most conflicts can be resolved by just having the replication master decide in which order the transactions were done and then sending the transactions back to all of the servers in that order.

Adding new messages is a problem, however. UIDs must always be growing and they must be unique. I think Cyrus solved this by having 64bit global UIDs which really uniquely identify the messages (i.e. the first 32 bits contain a server-specific UID), and then the 32bit IMAP UIDs. If there are any conflicts with the IMAP UIDs, the messages are given new UIDs (which invalidates clients' cache for those messages, but that can't be helped).

Resyncing works basically by requesting a list of all the mailboxes, opening each one of them and requesting new transactions from the log files with a <UID validity>, <log file sequence>, <log file offset> triplet. If the UID validity has changed or the log file no longer exists, a full resync needs to be done. Also, if the log file contains a lot of changes, it's probably faster to do a full resync. A full resync would then be just sending the whole dovecot.index file.

Conflict resolution could also require a full resync if transaction logs were lost. That would again work by the master receiving everyone's dovecot.index files, deciding what to do with them and sending the modified version back to everyone.

Each server has different transaction logs, so the log file sequences can't be used directly between servers. So the replication process will probably have to keep track of every server's log seq/offset. Also, when full dovecot.index files are received, they'll have to be updated in some way.

Then there's the problem of how to implement the receiving side. If the user already has a process connected to the replication process, it could be instructed to update the mailboxes. But since this doesn't happen all that often, there also needs to be some replication-writer process which writes all the received transactions to the mailboxes.

Now, when exactly should these replication-writer processes be created? If all users have the same UNIX user ID, there could be a pool of processes always running and doing the writing. Although this worries me a bit, since the same process would be handling different users' data. A more secure way, and the only way if users have different UNIX UIDs, is to launch a separate writer process for each user.
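The resync decision described above can be sketched roughly as follows. This is a hypothetical illustration, not Dovecot code: the function names and the "too many changes" threshold are my own assumptions.

```python
FULL_RESYNC_THRESHOLD = 10000  # assumed: pending log bytes beyond which a
                               # full dovecot.index transfer is cheaper

def plan_sync(local, peer_validity, peer_seq, peer_offset):
    """Decide incremental vs. full resync from a peer's
    <UID validity, log file sequence, log file offset> triplet.

    local: dict with 'uid_validity' and 'logs' mapping seq -> log length.
    """
    if peer_validity != local["uid_validity"]:
        return "full"                         # mailbox was recreated
    if peer_seq not in local["logs"]:
        return "full"                         # log file already rotated away
    pending = local["logs"][peer_seq] - peer_offset
    if pending > FULL_RESYNC_THRESHOLD:
        return "full"                         # too many changes to replay
    return ("incremental", peer_seq, peer_offset)
```

Note that since each server has its own transaction logs, the seq/offset here would have to be the values the replication process tracked for that particular peer, as described above.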
It probably isn't a great idea to do that immediately when data is received for a user, but rather to queue the data for a while and then, once enough has been received or some timeout has occurred, execute the writer process. There should probably be a shared in-memory cache with a memory limit of some megabytes, after which data is written to disk up to a configurable amount. Only after that is full, or the timeout for the user is reached, is the writer process executed and the data sent to it.

All this replication stuff could also be used to implement a remote mail storage (i.e. smarter proxying). The received indexes/transactions could simply be kept in memory, and all the mails that aren't already cached in memory could just be requested from the master whenever needed. I'm not sure if there's much point in doing this over dumb proxying, though.
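The batching policy for the replication-writer could look roughly like this sketch (my own names and limits, not Dovecot's): per-user data is queued in memory, overflow is counted against a disk spill limit, and the writer process is launched when the spill area fills or the user's timeout fires.

```python
import time

MEM_LIMIT = 4 * 1024 * 1024     # assumed in-memory cache limit (4 MB)
DISK_LIMIT = 16 * 1024 * 1024   # assumed configurable on-disk spill limit
TIMEOUT = 30.0                  # assumed flush timeout in seconds

class UserQueue:
    """Queued replication data for one user, pending a writer process."""

    def __init__(self):
        self.mem = []
        self.mem_size = 0
        self.disk_size = 0
        self.first_seen = None   # when the oldest queued data arrived

    def add(self, data, now=None):
        now = time.monotonic() if now is None else now
        if self.first_seen is None:
            self.first_seen = now
        if self.mem_size + len(data) <= MEM_LIMIT:
            self.mem.append(data)
            self.mem_size += len(data)
        else:
            self.disk_size += len(data)   # real code would write to disk

    def should_flush(self, now=None):
        now = time.monotonic() if now is None else now
        if self.disk_size >= DISK_LIMIT:
            return True                   # spill area full: launch writer now
        return (self.first_seen is not None
                and now - self.first_seen >= TIMEOUT)
```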
Timo Sirainen
2006-Oct-24 18:48 UTC
[Dovecot] Clustering (replication and proxying) plans for the future
Some updates I thought of while reading through this mail:

On Tue, 2006-10-24 at 21:31 +0300, Timo Sirainen wrote:
> Of course the new mails' contents also have to be sent. This could be
> prioritized lower than the transaction traffic, so that each server
> always has very up-to-date view of the mailbox metadata, but not
> necessarily the contents of all the mails.

I'm not sure which is the best way to do this split of transaction data and mail contents. There are two possibilities:

1) Use multiple connections: high priority (requested mail contents), normal priority (transactions) and low priority (non-requested mail contents). This won't give any priority guarantees though, unless different ports are used and the admin sets up QoS. Although this won't really matter as long as there's enough bandwidth available.

2) Use one connection and send all data in smaller blocks. Dynamically adjust the block size so that it gives high enough throughput, but also small enough latency that higher priority blocks can be inserted into the queue and received quickly.

> If the server finds itself in a situation that it doesn't have some
> specific mail, it'll send a request to the replication process to fetch
> it ASAP from another server. The reply will then take the highest
> priority in the queue.

If the other server is gone, there's no way to get the mail. In this case the mail needs to be treated as expunged, and in the next sync it'll get a new UID.

> Now, when exactly should these replication-writer processes be created
> then? If all users have the same UNIX user ID, then there could be a
> spool of processes always running and doing the writing. Although this
> worries me a bit since the same process is handling different users'
> data.
>
> More secure way, and the only way if users have different UNIX UIDs, is
> to launch a separate writer process for each user.

Note that this exact same problem comes up if an LMTP server is implemented.
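Option 2 above can be sketched like this (a hypothetical illustration, with my own names): one connection, where large low-priority payloads are chopped into blocks so that a high-priority block can jump the queue between any two chunks.

```python
import heapq
import itertools

HIGH, NORMAL, LOW = 0, 1, 2
BLOCK_SIZE = 4096  # would be adjusted dynamically from observed latency

class BlockQueue:
    """Single-connection send queue with priority-ordered blocks."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # preserves FIFO order within a priority

    def send(self, priority, payload):
        # Split the payload so no single item monopolizes the connection;
        # a later high-priority block can be popped between chunks.
        for i in range(0, len(payload), BLOCK_SIZE):
            block = payload[i:i + BLOCK_SIZE]
            heapq.heappush(self._heap, (priority, next(self._seq), block))

    def next_block(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

The dynamic block-size adjustment itself isn't shown; the key property is that the latency of a high-priority block is bounded by at most one in-flight block of `BLOCK_SIZE` bytes.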
Ethan Sommer
2006-Oct-27 12:35 UTC
[Dovecot] Clustering (replication and proxying) plans for the future
Timo Sirainen wrote:
> Whenever modifying the mailbox a global lock is required. Here's also
> the problem that if the network is down between two servers, they both
> think that they got the global lock. If they did any conflicting
> changes, they'll have to be resolved when both of them again see each
> others.

Some clustering systems (e.g. GFS) get around this by requiring that you have >= 3 servers (preferably an odd number) and holding elections which require a majority (quorum) to proceed. The third server could just be a tie-breaker machine and doesn't actually have to run Dovecot (kind of like the Vice President in the Senate). It makes the system somewhat less redundant (e.g. you can't have a "remote office" server and expect it to function if the link is down), but it is "safer" and "simpler" in many ways.

Thus you could have 3 servers: if one goes down, the other two would behave normally (so you can do maintenance, etc.). If two servers go down, you would queue up mail and wait for a quorum to be regained. With 5 servers, 2 servers could go down without requiring the other 3 to stop functioning, etc.

Another option, if you wanted to take on something smaller first, would be to allow only master/slave clustering, somewhat like LDAP, and use some form of reliable heartbeat (e.g. a serial cable) to detect whether the master is down and the slave should take over the master role. You could still try to split the read requests between the two servers for scalability. I know that we would _love_ to have a system like this, even if it is less flexible.

> All this replication stuff could also be used to implement a remote mail
> storage (ie. smarter proxying). The received indexes/transactions could
> simply be kept in memory, and all the mails that aren't already cached
> in memory could just be requested from the master whenever needed. I'm
> not sure if there's much point in doing this over dummy proxying though.

I would imagine that having a smarter proxy might not help much for a full-fledged mail client (e.g. Thunderbird) but would have a significant impact on webmail applications, which are constantly asking for the same thing (e.g. SELECT INBOX) over and over again.

Ethan Sommer
UNIX Systems Administrator
Gustavus Adolphus College
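The quorum rule described here comes down to a strict-majority check (a toy illustration of my own, not from GFS): at most one network partition can hold a strict majority of the voters, so two partitions can never both believe they own the lock.

```python
def has_quorum(votes, cluster_size):
    """A partition may proceed only if it holds a strict majority of voters.

    Two disjoint partitions cannot both satisfy this, which is what rules
    out the split-brain double-lock scenario. Odd cluster sizes avoid ties.
    """
    return votes > cluster_size // 2
```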
Bill Boebel
2006-Nov-02 04:55 UTC
[Dovecot] Clustering (replication and proxying) plans for the future
On Tue, October 24, 2006 2:31 pm, Timo Sirainen <tss at iki.fi> said:
> The replication could work simply by sending the transaction logs'
> contents to the replication process which passes it onto other servers,
> which finally sync their local mailboxes based on that data. Since
> Dovecot already is able to sync mailboxes based on the transaction log's
> contents this should be pretty easy to implement.
>
> Of course the new mails' contents also have to be sent. This could be
> prioritized lower than the transaction traffic, so that each server
> always has very up-to-date view of the mailbox metadata, but not
> necessarily the contents of all the mails.
>
> If the server finds itself in a situation that it doesn't have some
> specific mail, it'll send a request to the replication process to fetch
> it ASAP from another server. The reply will then take the highest
> priority in the queue.

Is your primary goal with this replication to add redundancy, or to distribute load such as for shared mailboxes? I've thought about the redundancy side of this a lot, but not so much about load distribution. It sounds like you're going for load distribution, but...

If the goal is redundancy, I'd suggest that mailbox state is less important than the mail data. Index files can be recreated from the data, and would not even be needed on the secondary server(s) unless the primary fails. So I'd put a higher priority on getting the mail content to the secondary servers. If the goal is to distribute load for shared mailboxes, then your prioritization makes sense.

Also, if the goal is redundancy, this design can be simplified a lot by having one master and one or more slaves. All modify operations would go to the master, so you would not need global locks. Different sets of users would belong to different namespaces, so that you can have multiple replication paths in your cluster.
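The master/slave simplification can be sketched like this (hypothetical names, my own illustration): modify operations are always routed to a namespace's master, so no global locking is needed, while reads may be spread across all replicas.

```python
READ_OPS = {"FETCH", "SEARCH", "STATUS"}  # assumed read-only command set

class Namespace:
    """Routes IMAP operations for one replicated namespace."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._rr = 0  # round-robin index across all replicas

    def route(self, op):
        if op not in READ_OPS:
            return self.master            # writes: master only, no global lock
        replicas = [self.master] + self.slaves
        self._rr = (self._rr + 1) % len(replicas)
        return replicas[self._rr]         # reads: spread for scalability
```

With different user sets mapped to namespaces with different masters, each namespace forms its own replication path, as suggested above.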
You'd also want an API for the replication process, so that third-party applications that modify mailboxes can log those changes for replication. In that case, the replication process could be made generic enough to replicate any set of files, with or without Dovecot.

Bill