Timo Sirainen
2009-Aug-10 17:01 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
This is something I figured out a few months ago, mainly because this one guy at work (hi, Stu) kept telling me my multi-master replication plan sucked and we should use some existing scalable database. (I guess it didn't go exactly like that, but that's the result anyway.)

So, my current plan is based on a couple of observations:

* Index files are really more like memory dumps. They're already in an optimal format for keeping them in memory, so they can be just mmap()ed and used. Doing some kind of translation to another format would just make it more complex and slower.

* I can change all indexing and dbox code to not require any locks or overwriting files. I just need very few filesystem operations, primarily the ability to atomically append to a file.

* Index and mail data are very different. Index data is accessed constantly and it must be very low latency or performance will be horrible. It should practically be in memory on the local machine, and there normally shouldn't be any network lookups when accessing it.

* Mail data, on the other hand, is written once and usually read maybe once or a couple of times. Caching mail data in memory probably doesn't help all that much. Latency isn't such a horrible issue as long as multiple mails can be fetched at once / in parallel, so there's only a single latency wait.

So the high level plan is:

1. Change the index/cache/log file formats in a way that allows lockless writes.

2. Abstract out filesystem access in the index and dbox code and implement regular POSIX filesystem support. (A sketch of what such an abstraction might look like follows below.)

3. Make lib-storage able to access mails in parallel and send multiple "get mail" requests in advance.

(3.5. Implement an async I/O filesystem backend.)

4. Implement a multi-master filesystem backend for index files. The idea is that all servers accessing the same mailbox must be talking to each other via the network, and every time something is changed, push the change to the other servers. This is actually very similar to my previous multi-master plan. One of the servers accessing the mailbox would still act as a master and handle conflict resolution and writing indexes to disk more or less often.

5. Implement a filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra. This is the part I've thought the least about, but it's also the part I hope to (mostly) outsource to someone else. I'm not going to write a distributed database from scratch..

This actually should solve several issues:

* Scalability, of course. It'll be as scalable as the distributed database being used to store mails.

* NFS reliability! Even if you don't care about any of these alternative databases, this still solves NFS caching problems. You'd keep using the regular POSIX FS API (or async FS API) but with the in-memory index "cache", so only a single server is writing to mailbox indexes at the same time.

* Shared mailboxes. The filesystem API is abstracted, so it should be possible to easily add another layer to handle accessing other users' mails from both local and remote servers. This should finally make it possible to easily support shared mailboxes with system users.
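To make point 2 above a bit more concrete, here is a minimal sketch in C of what a pluggable filesystem layer could look like. This is purely illustrative: the struct name, function names and the POSIX backend shown are assumptions for the sake of the example, not Dovecot's actual internal API.

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical filesystem abstraction: each backend (plain POSIX, async
   I/O, distributed database, ...) fills in this vtable and the index/dbox
   code only ever calls through it. */
struct mail_fs_ops {
	int (*open)(const char *path, int flags);
	ssize_t (*pread)(int fd, void *buf, size_t count, off_t offset);
	/* Atomic append: the one write primitive the lockless index/log
	   formats would rely on. */
	ssize_t (*append)(int fd, const void *buf, size_t count);
	int (*close)(int fd);
};

/* Plain POSIX backend. O_APPEND makes each write() an append; on a local
   filesystem a single reasonably sized write is effectively atomic,
   although NFS notoriously does not guarantee this. */
static int posix_fs_open(const char *path, int flags)
{
	return open(path, flags | O_APPEND, 0600);
}

static ssize_t posix_fs_append(int fd, const void *buf, size_t count)
{
	return write(fd, buf, count);
}

static const struct mail_fs_ops posix_fs_ops = {
	.open = posix_fs_open,
	.pread = pread,
	.append = posix_fs_append,
	.close = close,
};

A multi-master index backend or a Cassandra-style dbox backend would then just be another mail_fs_ops instance, which is the whole point of the abstraction.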
Seth Mattinen
2009-Aug-10 21:33 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
Timo Sirainen wrote:

> This is something I figured out a few months ago, mainly because this
> one guy at work (hi, Stu) kept telling me my multi-master replication
> plan sucked and we should use some existing scalable database. (I guess
> it didn't go exactly like that, but that's the result anyway.)

Ick, some people (myself included) hate the idea of storing mail in a database versus the simple and almost impossible to screw up plain text files of maildir. Cyrus already does the whole mail-in-database thing.

~Seth
Steffen Kaiser
2009-Aug-11 14:32 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
On Mon, 10 Aug 2009, Timo Sirainen wrote:

> 4. Implement a multi-master filesystem backend for index files. The idea
> would be that all servers accessing the same mailbox must be talking to
> each other via network and every time something is changed, push the
> change to other servers. This is actually very similar to my previous
> multi-master plan. One of the servers accessing the mailbox would still
> act as a master and handle conflict resolution and writing indexes to
> disk more or less often.

What I don't understand here is: _one_ server is the master, which owns the indexes locally? Oh, 5. means that this particular server is initiating the write, right?

You spoke about thousands of servers. If one of them opens a mailbox, it would need to query all (thousands - 1) other servers to find out which of them is currently the master of this mailbox. I suppose you need a "home location" server, which the other servers connect to in order to find the server currently locking (aka acting as master for) this mailbox. GSM has a home location register pointing to the base station currently managing the user info, because the GSM device is in its reach.

There is also another point I'm wondering about: index files are "really more like memory dumps", you wrote. So if you cluster thousands of servers together you'll most probably have different server architectures, say 32-bit vs. 64-bit, CISC vs. RISC, big vs. little endian, ASCII vs. EBCDIC :). Sharing these memory dumps without another abstraction layer wouldn't work. (A small illustration of the problem follows below.)

> 5. Implement filesystem backend for dbox and permanent index storage
> using some scalable distributed database, such as maybe Cassandra. This

Although I like the "eventually consistent" part, I wonder about the Java-based stuff of Cassandra.

> is the part I've thought the least about, but it's also the part I hope
> to (mostly) outsource to someone else. I'm not going to write a
> distributed database from scratch..

I wonder if the index backends in 4. and 5. shouldn't be the same.

==

How much work is it to handle the data in the index files? What if every server forwards changes to the master and receives changes from the master to sync its local read-only cache? Then you needn't handle conflicts (except when the network was down) and writes are consistent, originating from this single master server. The actual mail data is accessed via another API. When the current master no longer needs to access the mailbox, it could hand over the "master" stick to another server currently accessing the mailbox.

Bye,

-- 
Steffen Kaiser
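To make the byte-order point concrete, here is a minimal C sketch (not Dovecot code; the struct and field names are invented) of the difference between dumping a struct verbatim and serializing it in a fixed byte order:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl() */

/* Hypothetical index record, roughly how a "memory dump" format stores it. */
struct index_record {
	uint32_t uid;
	uint32_t flags;
};

/* Verbatim dump: only readable by machines with the same endianness
   (and struct layout) as the writer. */
static void record_dump(const struct index_record *rec, unsigned char *buf)
{
	memcpy(buf, rec, sizeof(*rec));
}

/* Portable serialization: fixed (network) byte order, readable on any
   architecture -- at the cost of a translation step on every access,
   which is exactly the overhead the memory-dump design tries to avoid. */
static void record_serialize(const struct index_record *rec, unsigned char *buf)
{
	uint32_t uid = htonl(rec->uid);
	uint32_t flags = htonl(rec->flags);

	memcpy(buf, &uid, sizeof(uid));
	memcpy(buf + sizeof(uid), &flags, sizeof(flags));
}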
Ed W
2009-Aug-12 15:26 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
Hi

> * Mail data on the other hand is just written once and usually read
> maybe once or a couple of times. Caching mail data in memory probably
> doesn't help all that much. Latency isn't such a horrible issue as long
> as multiple mails can be fetched at once / in parallel, so there's only
> a single latency wait.

This logically seems correct. Couple of questions then:

1) Since latency requirements are low, why did performance drop so much previously when you implemented a very simple MySQL storage backend? I glanced at the code a few weeks ago and whilst it's surprisingly complicated right now to implement a backend, I was also surprised that a database storage engine "sucked", as I think you phrased it? Possibly that code also placed the indexes in the DB? Certainly that could very well kill performance? (Note I'm not arguing SQL storage is a good thing, I just want to understand the latency requirements of the backend.)

2) I would think that with some care, even very high latency storage would be workable, e.g. S3/Gluster/MogileFS? I would love to see a backend using S3 - if nothing else I think it would quickly highlight all the bottlenecks in any design...

> 4. Implement a multi-master filesystem backend for index files. The idea
> would be that all servers accessing the same mailbox must be talking to
> each other via network and every time something is changed, push the
> change to other servers. This is actually very similar to my previous
> multi-master plan. One of the servers accessing the mailbox would still
> act as a master and handle conflict resolution and writing indexes to
> disk more or less often.

Take a look at MogileFS for some ideas here. I doubt it's a great fit, but they certainly need to solve a lot of the same problems.

> 5. Implement filesystem backend for dbox and permanent index storage
> using some scalable distributed database, such as maybe Cassandra.

CouchDB? It is just the Lotus Notes database after all, and personally I have built some *amazing* applications using that as the backend. (I just love the concept of Notes - the GUI is another matter...) Note that CouchDB is interesting in that it is multi-master with "eventual" synchronisation. This potentially has some interesting issues/benefits for offline use.

For the filesystem backend, have you looked at the various log-structured filesystems appearing? Whenever I watch the debate between Maildir vs. Mailbox I always think that a hybrid is the best solution, because you are optimising for a write-once, read-many situation where you have an increased probability of good cache locality on any given read. To me this ends up looking like log-structured storage... (which feels like a hybrid between maildir/mailbox).

> * Scalability, of course. It'll be as scalable as the distributed
> database being used to store mails.

I would be very interested to see a kind of "where the time goes" benchmark of Dovecot. Have you measured and found that latency of this part accounts for x% of the response time, and being CPU bound here is another y%, etc? E.g. if you deliberately introduce X ms of latency in the index lookups, what does that do to the response time of the system once the cache warms up? What about if the response time of the storage backend changes? I would have thought this would help you determine how to scale this thing. (A rough sketch of a latency-injection wrapper follows at the end of this message.)

All in all it sounds very interesting. However, a couple of thoughts:

- What is the goal?

- If the goal is performance by allowing a scale-out in quantity of servers, then I guess you need to measure it carefully to make sure it actually works? I haven't had the fortune to develop something that big, but the general advice is that scaling out is hard to get right, so assume you made a mistake in your design somewhere... Measure, measure.

- If the goal is reliability, then I guess it's prudent to assume that somehow all servers will get out of sync (eventually). It's definitely nice if they are self-repairing as a design goal, e.g. the difference between a full sync and shipping logs. (I ship logs to have a master-master MySQL setup, but if we have a crash then I use a sync program (Maatkit) to check that the two servers are in sync and avoid recreating one of the servers from fresh.)

- If the goal is increased storage capacity on commodity hardware, then it needs a useful bunch of tools to manage the replication and make sure there is redundancy and it's easy to find the required storage. I guess look at MogileFS; if you think you can do better, then at least remember it was quite hard work to get to that stage, so doing it again is likely to be non-trivial?

- If the goal were making it simpler to build a backend storage engine, then this would be excellent - I find myself wanting to benchmark ideas like S3 or sticking things in a database, but I looked at the API recently and it's going to require a bit of investment to get started - certainly more than a couple of evenings poking around... Hopefully others would write interesting backends; regardless of whether it's sensible to use them in high performance setups, some folks simply want/need to do unusual things...

- Finally, I am a bit sad that offline distributed multi-master isn't in the roadmap anymore... :-( My situation is that we have a lot of boats boating around with intermittent, expensive satellite connections, and the users are fluid and need to get access to their data from land and from different vessels. Currently we build software in-house to make this possible, but it would be fantastic to see more features enabling this on the server side (CouchDB / Lotus Notes is cool...).

Good luck - sounds fun implementing all this anyway!

Ed W
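As a rough illustration of the latency-injection experiment suggested above, here is a minimal, self-contained C sketch. It is not Dovecot code; the function names are made up, and the idea is simply to wrap the index-read path with an artificial delay so the effect on IMAP response times can be measured.

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>

/* Artificial latency to inject before every index read, in milliseconds.
   Vary this and measure how response times change once caches are warm. */
static unsigned int injected_latency_ms = 10;

static void inject_latency(void)
{
	struct timespec ts = {
		.tv_sec = injected_latency_ms / 1000,
		.tv_nsec = (injected_latency_ms % 1000) * 1000000L,
	};
	nanosleep(&ts, NULL);
}

/* Drop-in replacement for pread() on index files: identical semantics
   plus the configured delay, so nothing else in the code has to change. */
static ssize_t delayed_pread(int fd, void *buf, size_t count, off_t offset)
{
	inject_latency();
	return pread(fd, buf, count, offset);
}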
Daniel L. Miller
2009-Aug-12 22:54 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
Ha! Fooled you! I'm going to reply to the original question instead of SIS!

Timo Sirainen wrote:

> * Index files are really more like memory dumps. They're already in an
> optimal format for keeping them in memory, so they can be just mmap()ed
> and used. Doing some kind of translation to another format would just
> make it more complex and slower.
>
> * Index and mail data is very different. Index data is accessed
> constantly and it must be very low latency or performance will be
> horrible. It practically should be in memory in local machine and there
> shouldn't normally be any network lookups when accessing it.

Ok, I lied. I'm going to start something new. Do the indexes contain any of the header information? In particular, since I know nothing of the communication between IMAP clients & servers in general, is the information that is shown in typical client mail lists (subject, sender, date, etc.) stored in the indexes? I guess I'm asking whether any planned changes will have an impact on retrieving message lists in any way.

-- 
Daniel
Timo Sirainen
2010-Mar-10 21:19 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
On 10.8.2009, at 20.01, Timo Sirainen wrote:

> (3.5. Implement async I/O filesystem backend.)

You know what I found out today? Linux doesn't support async I/O for regular buffered files. I had heard there were issues, but I thought it was mainly about some annoying APIs and such. Anyone know if some project has successfully figured out a usable way to do async disk I/O? The possibilities seem to be:

a) Use Linux's native AIO, which requires direct I/O for files. This *might* not be horribly bad for mail files. After all, the same mail is rarely read multiple times. Except when parsing its headers first and then its body. Maybe the process could do some internal buffering?..

I guess no one ever tried my posix_fadvise() patch? The idea was that it would tell the kernel, after closing a mail file, that it's no longer needed in memory, so the kernel could remove it from the page cache (a tiny sketch of the idea follows below). I never heard any positive or negative comments about how it affected performance..

http://dovecot.org/patches/1.1/fadvise.diff

b) Use threads, either via some library or implement it yourself. Each thread of course uses some extra memory. Also, enabling threads causes glibc to start using a thread-safe version of malloc() (I think?), which slows things down (unless that can be avoided, maybe by using clone() directly instead of pthreads?).

c) I read someone's idea about using posix_fadvise() and fincore() functions to somehow make it "kind of work, usually, maybe". I'm not sure if there's a practical way to make them work though. And of course I don't think fincore() has even been accepted by Linus yet.
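For reference, this is roughly the kind of hint the fadvise patch describes; the actual patch is linked above and may differ, so treat this as a sketch of the approach rather than the implementation. POSIX_FADV_DONTNEED is purely advisory, so the call is cheap and failure is harmless.

#include <fcntl.h>
#include <unistd.h>

/* Close a mail file and hint the kernel that its page-cache pages can be
   dropped. With offset = 0 and len = 0 the advice applies to the whole
   file. Hypothetical helper name; not from the Dovecot source. */
static int mail_file_close_dontneed(int fd)
{
	(void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	return close(fd);
}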