Timo Sirainen
2009-Aug-10 17:01 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
This is something I figured out a few months ago, mainly because this one guy at work (hi, Stu) kept telling me my multi-master replication plan sucked and we should use some existing scalable database. (I guess it didn't go exactly like that, but that's the result anyway.)

So, my current plan is based on a couple of observations:

* Index files are really more like memory dumps. They're already in an optimal format for keeping them in memory, so they can be just mmap()ed and used. Doing some kind of translation to another format would just make it more complex and slower.

* I can change all indexing and dbox code to not require any locks or overwriting files. I just need very few filesystem operations, primarily the ability to atomically append to a file.

* Index and mail data are very different. Index data is accessed constantly and it must be very low latency or performance will be horrible. It should practically be in memory on the local machine, and there normally shouldn't be any network lookups when accessing it.

* Mail data, on the other hand, is written once and usually read maybe once or a couple of times. Caching mail data in memory probably doesn't help all that much. Latency isn't such a horrible issue as long as multiple mails can be fetched at once / in parallel, so there's only a single latency wait.

So the high level plan is:

1. Change the index/cache/log file formats in a way that allows lockless writes.

2. Abstract out filesystem access in the index and dbox code and implement regular POSIX filesystem support. (A sketch of what such an abstraction might look like follows below.)

3. Make lib-storage able to access mails in parallel and send multiple "get mail" requests in advance.

(3.5. Implement an async I/O filesystem backend.)

4. Implement a multi-master filesystem backend for index files. The idea is that all servers accessing the same mailbox must be talking to each other via the network, and every time something is changed, push the change to the other servers. This is actually very similar to my previous multi-master plan. One of the servers accessing the mailbox would still act as a master and handle conflict resolution and writing indexes to disk more or less often.

5. Implement a filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra. This is the part I've thought the least about, but it's also the part I hope to (mostly) outsource to someone else. I'm not going to write a distributed database from scratch..

This actually should solve several issues:

* Scalability, of course. It'll be as scalable as the distributed database being used to store mails.

* NFS reliability! Even if you don't care about any of these alternative databases, this still solves NFS caching problems. You'd keep using the regular POSIX FS API (or async FS API) but with the in-memory index "cache", so only a single server is writing to mailbox indexes at the same time.

* Shared mailboxes. The filesystem API is abstracted, so it should be possible to easily add another layer to handle accessing other users' mails from both local and remote servers. This should finally make it possible to easily support shared mailboxes with system users.
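To make point 2 above a bit more concrete, here is a minimal sketch in C of what a pluggable filesystem layer could look like. This is purely illustrative: the struct name, function names and the POSIX backend shown are assumptions for the sake of the example, not Dovecot's actual internal API.

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical filesystem abstraction: each backend (plain POSIX, async
   I/O, distributed database, ...) fills in this vtable and the index/dbox
   code only ever calls through it. */
struct mail_fs_ops {
	int (*open)(const char *path, int flags);
	ssize_t (*pread)(int fd, void *buf, size_t count, off_t offset);
	/* Atomic append: the one write primitive the lockless index/log
	   formats would rely on. */
	ssize_t (*append)(int fd, const void *buf, size_t count);
	int (*close)(int fd);
};

/* Plain POSIX backend. O_APPEND makes each write() an append; on a local
   filesystem a single reasonably sized write is effectively atomic,
   although NFS notoriously does not guarantee this. */
static int posix_fs_open(const char *path, int flags)
{
	return open(path, flags | O_APPEND, 0600);
}

static ssize_t posix_fs_append(int fd, const void *buf, size_t count)
{
	return write(fd, buf, count);
}

static const struct mail_fs_ops posix_fs_ops = {
	.open = posix_fs_open,
	.pread = pread,
	.append = posix_fs_append,
	.close = close,
};

A multi-master index backend or a Cassandra-style dbox backend would then just be another mail_fs_ops instance, which is the whole point of the abstraction.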
Seth Mattinen
2009-Aug-10 21:33 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
Timo Sirainen wrote:

> This is something I figured out a few months ago, mainly because this
> one guy at work (hi, Stu) kept telling me my multi-master replication
> plan sucked and we should use some existing scalable database. (I guess
> it didn't go exactly like that, but that's the result anyway.)

Ick, some people (myself included) hate the idea of storing mail in a database versus the simple and almost impossible to screw up plain text files of maildir. Cyrus already does the whole mail-in-database thing.

~Seth
Steffen Kaiser
2009-Aug-11 14:32 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
On Mon, 10 Aug 2009, Timo Sirainen wrote:

> 4. Implement a multi-master filesystem backend for index files. The idea
> would be that all servers accessing the same mailbox must be talking to
> each other via network and every time something is changed, push the
> change to other servers. This is actually very similar to my previous
> multi-master plan. One of the servers accessing the mailbox would still
> act as a master and handle conflict resolution and writing indexes to
> disk more or less often.

What I don't understand here is: _one_ server is the master, which owns the indexes locally? Oh, 5. means that this particular server is initiating the write, right?

You spoke about thousands of servers. If one of them opens a mailbox, it would need to query all (thousands - 1) other servers to find out which of them is currently the master of this mailbox. I suppose you need a "home location" server, which the other servers connect to in order to find the server currently locking (aka acting as master for) this mailbox. GSM has a home location register pointing to the base station currently managing the user info, because the GSM device is in its reach.

There is also another point I'm wondering about: index files are "really more like memory dumps", you wrote. So if you cluster thousands of servers together you'll most probably have different server architectures, say 32-bit vs. 64-bit, CISC vs. RISC, big vs. little endian, ASCII vs. EBCDIC :). Sharing these memory dumps without another abstraction layer wouldn't work. (A small illustration of the problem follows below.)

> 5. Implement filesystem backend for dbox and permanent index storage
> using some scalable distributed database, such as maybe Cassandra. This

Although I like the "eventually consistent" part, I wonder about the Java-based stuff of Cassandra.

> is the part I've thought the least about, but it's also the part I hope
> to (mostly) outsource to someone else. I'm not going to write a
> distributed database from scratch..

I wonder if the index backends in 4. and 5. shouldn't be the same.

==

How much work is it to handle the data in the index files? What if every server forwards changes to the master and receives changes from the master to sync its local read-only cache? Then you needn't handle conflicts (except when the network was down) and writes are consistent, originating from this single master server. The actual mail data is accessed via another API. When the current master no longer needs to access the mailbox, it could hand over the "master" stick to another server currently accessing the mailbox.

Bye,

-- 
Steffen Kaiser
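To make the byte-order point concrete, here is a minimal C sketch (not Dovecot code; the struct and field names are invented) of the difference between dumping a struct verbatim and serializing it in a fixed byte order:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl() */

/* Hypothetical index record, roughly how a "memory dump" format stores it. */
struct index_record {
	uint32_t uid;
	uint32_t flags;
};

/* Verbatim dump: only readable by machines with the same endianness
   (and struct layout) as the writer. */
static void record_dump(const struct index_record *rec, unsigned char *buf)
{
	memcpy(buf, rec, sizeof(*rec));
}

/* Portable serialization: fixed (network) byte order, readable on any
   architecture -- at the cost of a translation step on every access,
   which is exactly the overhead the memory-dump design tries to avoid. */
static void record_serialize(const struct index_record *rec, unsigned char *buf)
{
	uint32_t uid = htonl(rec->uid);
	uint32_t flags = htonl(rec->flags);

	memcpy(buf, &uid, sizeof(uid));
	memcpy(buf + sizeof(uid), &flags, sizeof(flags));
}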
Ed W
2009-Aug-12 15:26 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
Hi

> * Mail data on the other hand is just written once and usually read
> maybe once or a couple of times. Caching mail data in memory probably
> doesn't help all that much. Latency isn't such a horrible issue as long
> as multiple mails can be fetched at once / in parallel, so there's only
> a single latency wait.

This logically seems correct. Couple of questions then:

1) Since latency requirements are low, why did performance drop so much previously when you implemented a very simple MySQL storage backend? I glanced at the code a few weeks ago and whilst it's surprisingly complicated right now to implement a backend, I was also surprised that a database storage engine "sucked", as I think you phrased it? Possibly that code also placed the indexes in the DB? Certainly that could very well kill performance? (Note I'm not arguing SQL storage is a good thing, I just want to understand the latency requirements of the backend.)

2) I would think that with some care, even very high latency storage would be workable, e.g. S3/Gluster/MogileFS? I would love to see a backend using S3 - if nothing else I think it would quickly highlight all the bottlenecks in any design...

> 4. Implement a multi-master filesystem backend for index files. The idea
> would be that all servers accessing the same mailbox must be talking to
> each other via network and every time something is changed, push the
> change to other servers. This is actually very similar to my previous
> multi-master plan. One of the servers accessing the mailbox would still
> act as a master and handle conflict resolution and writing indexes to
> disk more or less often.

Take a look at MogileFS for some ideas here. I doubt it's a great fit, but they certainly need to solve a lot of the same problems.

> 5. Implement filesystem backend for dbox and permanent index storage
> using some scalable distributed database, such as maybe Cassandra.

CouchDB? It is just the Lotus Notes database after all, and personally I have built some *amazing* applications using that as the backend. (I just love the concept of Notes - the GUI is another matter...) Note that CouchDB is interesting in that it is multi-master with "eventual" synchronisation. This potentially has some interesting issues/benefits for offline use.

For the filesystem backend, have you looked at the various log-structured filesystems appearing? Whenever I watch the debate between Maildir vs. Mailbox I always think that a hybrid is the best solution, because you are optimising for a write-once, read-many situation where you have an increased probability of good cache locality on any given read. To me this ends up looking like log-structured storage... (which feels like a hybrid between maildir/mailbox).

> * Scalability, of course. It'll be as scalable as the distributed
> database being used to store mails.

I would be very interested to see a kind of "where the time goes" benchmark of Dovecot. Have you measured and found that latency of this part accounts for x% of the response time, and being CPU bound here is another y%, etc? E.g. if you deliberately introduce X ms of latency in the index lookups, what does that do to the response time of the system once the cache warms up? What about if the response time of the storage backend changes? I would have thought this would help you determine how to scale this thing. (A rough sketch of a latency-injection wrapper follows at the end of this message.)

All in all it sounds very interesting. However, a couple of thoughts:

- What is the goal?

- If the goal is performance by allowing a scale-out in quantity of servers, then I guess you need to measure it carefully to make sure it actually works? I haven't had the fortune to develop something that big, but the general advice is that scaling out is hard to get right, so assume you made a mistake in your design somewhere... Measure, measure.

- If the goal is reliability, then I guess it's prudent to assume that somehow all servers will get out of sync (eventually). It's definitely nice if they are self-repairing as a design goal, e.g. the difference between a full sync and shipping logs. (I ship logs to have a master-master MySQL setup, but if we have a crash then I use a sync program (Maatkit) to check that the two servers are in sync and avoid recreating one of the servers from fresh.)

- If the goal is increased storage capacity on commodity hardware, then it needs a useful bunch of tools to manage the replication and make sure there is redundancy and it's easy to find the required storage. I guess look at MogileFS; if you think you can do better, then at least remember it was quite hard work to get to that stage, so doing it again is likely to be non-trivial?

- If the goal were making it simpler to build a backend storage engine, then this would be excellent - I find myself wanting to benchmark ideas like S3 or sticking things in a database, but I looked at the API recently and it's going to require a bit of investment to get started - certainly more than a couple of evenings poking around... Hopefully others would write interesting backends; regardless of whether it's sensible to use them in high performance setups, some folks simply want/need to do unusual things...

- Finally, I am a bit sad that offline distributed multi-master isn't in the roadmap anymore... :-( My situation is that we have a lot of boats boating around with intermittent, expensive satellite connections, and the users are fluid and need to get access to their data from land and from different vessels. Currently we build software in-house to make this possible, but it would be fantastic to see more features enabling this on the server side (CouchDB / Lotus Notes is cool...).

Good luck - sounds fun implementing all this anyway!

Ed W
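As a rough illustration of the latency-injection experiment suggested above, here is a minimal, self-contained C sketch. It is not Dovecot code; the function names are made up, and the idea is simply to wrap the index-read path with an artificial delay so the effect on IMAP response times can be measured.

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>

/* Artificial latency to inject before every index read, in milliseconds.
   Vary this and measure how response times change once caches are warm. */
static unsigned int injected_latency_ms = 10;

static void inject_latency(void)
{
	struct timespec ts = {
		.tv_sec = injected_latency_ms / 1000,
		.tv_nsec = (injected_latency_ms % 1000) * 1000000L,
	};
	nanosleep(&ts, NULL);
}

/* Drop-in replacement for pread() on index files: identical semantics
   plus the configured delay, so nothing else in the code has to change. */
static ssize_t delayed_pread(int fd, void *buf, size_t count, off_t offset)
{
	inject_latency();
	return pread(fd, buf, count, offset);
}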
Daniel L. Miller
2009-Aug-12 22:54 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
Ha! Fooled you! I'm going to reply to the original question instead of SIS!

Timo Sirainen wrote:

> * Index files are really more like memory dumps. They're already in an
> optimal format for keeping them in memory, so they can be just mmap()ed
> and used. Doing some kind of translation to another format would just
> make it more complex and slower.
>
> * Index and mail data is very different. Index data is accessed
> constantly and it must be very low latency or performance will be
> horrible. It practically should be in memory in local machine and there
> shouldn't normally be any network lookups when accessing it.

Ok, I lied. I'm going to start something new. Do the indexes contain any of the header information? In particular, since I know nothing of the communication between IMAP clients & servers in general, is the information that is shown in typical client mail lists (subject, sender, date, etc.) stored in the indexes? I guess I'm asking whether any planned changes will have an impact on retrieving message lists in any way.

-- 
Daniel
Timo Sirainen
2010-Mar-10 21:19 UTC
[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
On 10.8.2009, at 20.01, Timo Sirainen wrote:

> (3.5. Implement async I/O filesystem backend.)

You know what I found out today? Linux doesn't support async I/O for regular buffered files. I had heard there were issues, but I thought it was mainly about some annoying APIs and such. Anyone know if some project has successfully figured out a usable way to do async disk I/O? The possibilities seem to be:

a) Use Linux's native AIO, which requires direct I/O for files. This *might* not be horribly bad for mail files. After all, the same mail is rarely read multiple times. Except when parsing its headers first and then its body. Maybe the process could do some internal buffering?..

I guess no one ever tried my posix_fadvise() patch? The idea was that it would tell the kernel, after closing a mail file, that it's no longer needed in memory, so the kernel could remove it from the page cache (a tiny sketch of the idea follows below). I never heard any positive or negative comments about how it affected performance..

http://dovecot.org/patches/1.1/fadvise.diff

b) Use threads, either via some library or implement it yourself. Each thread of course uses some extra memory. Also, enabling threads causes glibc to start using a thread-safe version of malloc() (I think?), which slows things down (unless that can be avoided, maybe by using clone() directly instead of pthreads?).

c) I read someone's idea about using posix_fadvise() and fincore() functions to somehow make it "kind of work, usually, maybe". I'm not sure if there's a practical way to make them work though. And of course I don't think fincore() has even been accepted by Linus yet.
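For reference, this is roughly the kind of hint the fadvise patch describes; the actual patch is linked above and may differ, so treat this as a sketch of the approach rather than the implementation. POSIX_FADV_DONTNEED is purely advisory, so the call is cheap and failure is harmless.

#include <fcntl.h>
#include <unistd.h>

/* Close a mail file and hint the kernel that its page-cache pages can be
   dropped. With offset = 0 and len = 0 the advice applies to the whole
   file. Hypothetical helper name; not from the Dovecot source. */
static int mail_file_close_dontneed(int fd)
{
	(void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	return close(fd);
}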