thr3ads.net - dovecot - [Dovecot] RFC: Storing indexes in a (remote) database to increase dovecot scalability [Aug 2009]

If this information is useful, please help other people find it:
Share via:

Mark Zealey

2009-Aug-10 10:20 UTC

[Dovecot] RFC: Storing indexes in a (remote) database to increase dovecot scalability

Hi all,

This is just an idea I had over the weekend; I was wondering if instead
of storing the indexes/caches on disk, it would be possible to have an
option to store them in a remote database (eg mysql). There are several
issues that we currently have with scaling dovecot, and I think that if
it could store indexes in an external database we could alleviate most
of these issues. In no particular order:

1) One major issue we have with dovecot is the number of iops it
requires to keep its files in sync on our big disk arrays. If this were
a mysql database, we'd have better ways of keeping these in sync, for
example switching the speed at which it flushes the updates to disk,
growth using partitioning/sharding, read-only slave instances etc,
specific ssd storage for indexes/caches. I know that we could create a
separate ssd disk array and export that over nfs too, but see issue (2)
below. If indexes were stored in a mysql database we could choose to
have a different reliability for the indexes (ie if the database server
crashes we have 1 second of index changes lost, whereas for the email
itself we can ensure it is always kept in sync on our other disk
arrays).

2) nfs locking issues. No matter how hard we try, we always find there
are some issues with concurrency over nfs. It really doesn't seem
designed to handle these issues very well. I've not tried with nfsv4 and
I believe there are some improvements there, but at the same time nfs
wasn't really designed to store databases and have heavy concurrency. A
proper database server (eg mysql, postgres) on the other hand thrives in
this type of environment, especially when coupled with a transactional
storage backend such as innodb which includes support for deadlock
handling etc.

3) As we store all of our mail systems over nfs, effectively each export
is a single-threaded connection to the server. No matter how fast the
central nfs server you can realistically not go above 10k nfs ops/sec on
the fastest of local networks due to latency issues. With mysql you
could have multiple parallel connections to the central database store
(1 per process?) which would mean more nfs ops available for actually
serving the files.

4) Caching. Much more tuneable on a database server than in a
filesystem. Eg in mysql there are loads of options to tune the various
innodb/query/key caches, on nfs there are very few even on advanced
netapps etc

I know some of you store all your indexes on a single server locally &
then store the emails on a remote central filestore. Whilst this is a
good idea, it doesn't scale very well in that if the one server fails
then everything has to be reindexed as clients login which causes
incredible load as dovecot has to re-read most of the messages (in pop3
mode at least) in order to work out size counts etc. In my experience
this reindexing would take out our email system for the best part of a
day. I'm sure we could set something up with drbd mirroring or a central
san array with ssd's for the index data, but this intrinsically does not
scale with as much ease as a replicating database cluster could. There
are also failover issues which can be very easily solved with a database
system (eg mysql with an active/passive dual-master setup).

Also, I know that most of these points refer to nfs; and I know that
there are alternative methods (eg san + gfs or some other scalable
filesystem) however most of these are relatively new technologies
whereas nfs is pretty much the standard for remote file storage;
certainly I'd consider it a lot more reliable when compared to the likes
of gfs or lustre. Your typical large scale mail setup would be over nfs
and this is the sort of environment that I've run into these scaling
issues in. I've heard that gfs also doesn't work very well with 'hot
spots' which indexes tend to be, although I've no first hand experience
of this. So, although this post only really mentions nfs & mysql (the
two technologies that I'm most familiar with) I think this is a much
more general problem so I'm not advocating the use of either technology
apart from them being most people's first choices for ease of use.

What I'm really proposing is a way to separate the two different types
of data that dovecot uses. Indexes&caches contain small frequent
transactions; and whilst these are relatively important, dropping the
past few seconds of these transactions doesn't seriously matter. On the
other hand emails are larger, infrequently changed & it's important they
are kept fully in sync - if we accept a message it should be guaranteed
that we don't loose it. At the end of the day it seems to me that a
filesystem is the ideal place to store emails, whereas a database server
would be the ideal place to store indexes/caches. Obviously for speed of
setup, simply storing everything on a filesystem is easy so I'm not
suggesting that we abandon this route, but for truly large
infrastructures I think being able to use a database system would have
significant advantages in each of these four areas.

Any thoughts?

Mark

--
Mark Zealey -- Platform Architect
Product Development * Webfusion
123-reg.co.uk, webfusion.co.uk, donhost.co.uk, supanames.co.uk

This mail is subject to http://www.gxn.net/disclaimer

Timo Sirainen

2009-Aug-10 17:10 UTC

head link

[Dovecot] RFC: Storing indexes in a (remote) database to increase dovecot scalability

I just wrote a "Scalability plans" mail, which explains what I've
been
recently thinking about. I think that would solve most of your problems.

On Mon, 2009-08-10 at 11:20 +0100, Mark Zealey wrote:> 1) One major issue we have with dovecot is the number of iops it
> requires to keep its files in sync on our big disk arrays. If this were
> a mysql database, we'd have better ways of keeping these in sync, for
> example switching the speed at which it flushes the updates to disk,
> growth using partitioning/sharding, read-only slave instances etc,
> specific ssd storage for indexes/caches. I know that we could create a
> separate ssd disk array and export that over nfs too, but see issue (2)
> below. If indexes were stored in a mysql database we could choose to
> have a different reliability for the indexes (ie if the database server
> crashes we have 1 second of index changes lost, whereas for the email
> itself we can ensure it is always kept in sync on our other disk
> arrays).
Seems pretty complex to me, but it should be possible to implement MySQL
FS backend and keep indexes/mails there. They would anyway pretty much
have to be just blobs, but I guess that's not an issue for you?
> 2) nfs locking issues. No matter how hard we try, we always find there
> are some issues with concurrency over nfs. 
Solved again.
> 3) As we store all of our mail systems over nfs, effectively each export
> is a single-threaded connection to the server. No matter how fast the
> central nfs server you can realistically not go above 10k nfs ops/sec on
> the fastest of local networks due to latency issues. With mysql you
> could have multiple parallel connections to the central database store
> (1 per process?) which would mean more nfs ops available for actually
> serving the files.
lib-storage parallelization with async I/O backend should also solve
this. Even when using mysql the lib-storage parallelization would be
required to take advantage of multiple connections.
> 4) Caching. Much more tuneable on a database server than in a
> filesystem. Eg in mysql there are loads of options to tune the various
> innodb/query/key caches, on nfs there are very few even on advanced
> netapps etc
I'm not sure about this. I think the in-memory index cache would solve
the worst problem.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL:
<http://dovecot.org/pipermail/dovecot/attachments/20090810/d3211758/attachment-0002.bin>

Seemingly Similar Threads

Search for more possibly parallel threads

dovecot - Aug 2009 - RFC: Storing indexes in a (remote) database to increase dovecot scalability

[Dovecot] RFC: Storing indexes in a (remote) database to increase dovecot scalability

[Dovecot] RFC: Storing indexes in a (remote) database to increase dovecot scalability

Seemingly Similar Threads