Hello,
Currently Dovecot (and any other application that cares about e-mail
delivery) does at least one fsync per mail delivery. Given that hard
disk drives offer very limited IOPS, this effectively limits the
maximum mail delivery performance to a very low value, underutilizing
the available storage IO capacity.
Calculating with an average mail size of 50 kB and an average consumer
HDD with 120 IOPS, the theoretical mail delivery performance will be 50
kB * 120 IOPS = 6000 kB/s (about 5.86 MBps). But if we could write 500
kB with every transaction, the delivery speed would be nearly 10 times
as high.
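
The arithmetic above can be reproduced with a few lines of Python. This
is only a back-of-the-envelope sketch: it assumes exactly one IO
operation per synced transaction and ignores the drive's sequential
bandwidth limit, and the function name is made up for illustration.

```python
def delivery_throughput_kb_per_s(kb_per_sync, iops):
    """Upper bound on delivery speed if every sync costs one IO operation."""
    return kb_per_sync * iops


# One 50 kB mail per fsync on a 120 IOPS drive.
per_mail = delivery_throughput_kb_per_s(50, 120)

# Ten mails (500 kB) batched into each sync on the same drive.
batched = delivery_throughput_kb_per_s(500, 120)

print(per_mail, "kB/s per-mail,", batched, "kB/s batched")
```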
Dovecot has two process models: a separate process for each client
connection, and an async in-process multiplexing method. The idea below
works for both, albeit the timing is somewhat different.
So here's the idea: instead of fsyncing immediately in the LDA (lmtpd)
every time the client says "\r\n.\r\n" at the end of the DATA phase,
let's introduce a user-settable timer (let's call it sync_delay from
now on) and only sync once every sync_delay seconds.
This would introduce a delay of up to sync_delay seconds before lmtpd
returns "250 Ok" to the client, but that's generally not a problem,
because in high-traffic setups there is a great amount of concurrency,
so you could easily use a lot of client connections.
Take an example setting of sync_delay = 100 ms.
With this, 10 syncs would happen every second from the Dovecot LDA
processes, meaning that if a client connects at t=0 it will immediately
get the 250 response, if a client connects at t=0.05 s, it will get the
response in 50 ms (in an ideal world, where syncing does not take time),
and the blocks to be committed could accumulate for a maximum of 100 ms.
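
To make the timing concrete, here is a small sketch of the wait a client
would see. It assumes syncs fire on a fixed grid at t = 0, sync_delay,
2 * sync_delay, ... and that a sync itself takes no time (the ideal
world mentioned above). The function name is hypothetical.

```python
def response_delay_ms(arrival_ms, sync_delay_ms):
    """Time a delivery waits for the next sync tick, in milliseconds.

    Ticks are assumed to happen at 0, sync_delay_ms, 2 * sync_delay_ms, ...
    and a delivery arriving exactly on a tick is committed by that tick.
    """
    # Integer ceiling division: index of the first tick at or after arrival.
    tick_index = -(-arrival_ms // sync_delay_ms)
    return tick_index * sync_delay_ms - arrival_ms


for arrival in (0, 50, 101):
    print(arrival, "ms ->", response_delay_ms(arrival, 100), "ms wait")
```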
In a busy system (where this setting would make sense), it means it
would be possible to write more data with fewer IOPS.
I can see two problems:
1. There is no call for committing many file descriptors in one
transaction, so instead of an fsync() for each of the modified FDs, a
sync() would be needed. sync() writes all buffers to stable storage,
which is bad if you have a mixed workload with a lot of non-fsynced
data, or other heavy fsync users. But modern file systems like ZFS will
write those back too, so there an fsync(fd) is -AFAIK- mostly
equivalent to a sync() of the pool the fd is on. sync() is of course
system wide, so if you have other file systems, those will be synced as
well (this setting isn't for everybody).
2. In a multiprocess environment this would need coordination: instead
of doing fsyncs in distinct processes, one process would be needed that
does the sync and returns OK to the others, so they can notify the
client about the commit to stable storage.
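
To illustrate the coordination in point 2, here is a rough sketch using
threads inside one process instead of separate processes (a real
multiprocess setup would need some IPC, which is not shown). Everything
in it is hypothetical: GroupSyncer, wait_durable and the sync_fn
parameter are names made up for this example, not anything in Dovecot.
On Unix, sync_fn could be Python's os.sync, which is system wide as
discussed in point 1.

```python
import threading


class GroupSyncer:
    """Hypothetical coordinator: one thread syncs, many writers wait."""

    def __init__(self, sync_delay, sync_fn):
        self._sync_delay = sync_delay   # seconds between sync attempts
        self._sync_fn = sync_fn         # e.g. os.sync on Unix
        self._cond = threading.Condition()
        self._started = 0               # syncs begun so far
        self._finished = 0              # syncs completed so far
        self._pending = False           # True if someone waits for a sync
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def wait_durable(self):
        """Call after writing the mail; returns once a later sync is done."""
        with self._cond:
            # Only a sync that starts after this point is guaranteed to
            # cover data written before the call.
            target = self._started + 1
            self._pending = True
            while self._finished < target:
                self._cond.wait()

    def _run(self):
        # Event.wait() doubles as the sync_delay timer and the stop check.
        while not self._stop.wait(self._sync_delay):
            with self._cond:
                if not self._pending:
                    continue            # idle tick, nothing to flush
                self._pending = False
                self._started += 1
                generation = self._started
            self._sync_fn()             # one flush on behalf of all waiters
            with self._cond:
                self._finished = generation
                self._cond.notify_all()

    def close(self):
        self._stop.set()
        self._thread.join()
```

Each delivery handler would write its mail, call wait_durable(), and
only then answer "250 Ok". Writers that arrive while a sync is already
running wait for the next one, so a reply is never sent for data that a
sync might have missed.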
Any opinions on this?