http://hg.dovecot.org/dovecot-2.0-sis contains the code for it.
Otherwise it's the latest (as of writing this) dovecot-2.0 hg tree.
Please test if you're interested in SIS. :)
Once there's at least some testing, I'll probably add this to v2.0.x
since very little of this new code is used when SIS is disabled (which
is the default of course).
SIS works pretty much like explained in
http://dovecot.org/list/dovecot/2010-July/050832.html and
http://dovecot.org/list/dovecot/2010-July/050992.html
Two things I'm not yet entirely sure about:
1. What hash algorithm to use? Currently it's hard coded to SHA1.
Besides more CPU usage, the other potential problem with larger hashes
is that they also generate larger filenames. The filenames are currently
hex-encoded, but to save space they could be changed to some kind of
modified-base64 (base64 uses '/' chars, so it can't be regular
base64).
Example filename lengths:

            hex   modified-base64
SHA1         73    50
SHA256       97    66
SHA512      161   109
Yet another possibility would be to use SHA256/SHA512 and just truncate
the hash to fewer bits.
2. Should I add support for trusting hash uniqueness, to avoid the disk
I/O generated by the byte-by-byte comparison? It could still first check
that the file sizes match.
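
The filename lengths in question 1 can be reproduced with a few lines of Python. This is only a sketch under an assumption: that a filename is "<hash>-<GUID>" with a 128-bit GUID, which is what the doveadm sis find output later in this mail suggests. The names GUID_BYTES, b64_len and filename_lengths are made up for the illustration, and the URL-safe base64 alphabet just stands in for "some kind of modified-base64".

```python
import base64
import hashlib

# Assumption: the part after the '-' is a 128-bit GUID (32 hex digits).
GUID_BYTES = 16


def b64_len(nbytes: int) -> int:
    """Length of unpadded base64 text for nbytes of binary input."""
    return len(base64.urlsafe_b64encode(bytes(nbytes)).rstrip(b"="))


def filename_lengths(algorithm: str) -> tuple[int, int]:
    """Return (hex length, modified-base64 length) of '<hash>-<guid>'."""
    digest_bytes = hashlib.new(algorithm).digest_size
    hex_len = 2 * digest_bytes + 1 + 2 * GUID_BYTES
    b64 = b64_len(digest_bytes) + 1 + b64_len(GUID_BYTES)
    return hex_len, b64


for name in ("sha1", "sha256", "sha512"):
    print(name, *filename_lengths(name))
```

Truncating a longer hash would shrink the first term in each sum accordingly.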
Usage
-----
You can enable SIS for sdbox and mdbox:
mail_attachment_dir = /var/attachments
Just setting the above enables "instant SIS", where the byte-by-byte
comparison is done immediately while saving mails. The alternative is to
defer the comparison by setting:
mail_attachment_fs = sis-queue /var/attachments/queue:posix
This alone does no deduplication yet. To do that you'll need a nightly
(or whatever) run that calls:
doveadm sis deduplicate /var/attachments /var/attachments/queue
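
To make the trade-off from question 2 concrete, here is a minimal Python sketch of the check that decides whether two files with the same hash are really the same attachment. It is not Dovecot's code; same_content and its trust_hash parameter are hypothetical names for the illustration. With trust_hash set, only the sizes are compared and the file contents are never read.

```python
import os


def same_content(path_a: str, path_b: str, trust_hash: bool = False,
                 chunk_size: int = 65536) -> bool:
    """Decide whether two files with equal hashes hold the same data."""
    # Cheap check first: different sizes can never be duplicates.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    if trust_hash:
        # Skip the disk I/O and rely on hash uniqueness.
        return True
    # Byte-by-byte comparison, read in chunks to bound memory use.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:
                return True
```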
There's also a feature to easily find all attachments based on a hash.
For example:
% sha1sum foo
351641b73feb7cf7e87e5a8c3ca9a37d7b21e525 foo
% doveadm sis find /var/attachments 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
/var/attachments/35/16/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525-e13a841f28ba764c123b00008c4a11c1
/var/attachments/35/16/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525-1d3b940628ba764c0b3b00008c4a11c1
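
Judging only from the output above, the files appear to live under <dir>/<hash chars 1-2>/<hash chars 3-4>/<hash>-<guid>. Assuming that layout holds, the same lookup can be sketched in Python. The function names are hypothetical and this is not how doveadm itself is implemented.

```python
import glob
import hashlib
import os


def sha1_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA1 of a file, read in chunks (same value as sha1sum prints)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_attachments(root: str, hex_hash: str) -> list[str]:
    """List files named '<hash>-<anything>' below root/<2 hex>/<2 hex>/."""
    # Assumed layout, inferred from the example output only.
    prefix = os.path.join(root, hex_hash[0:2], hex_hash[2:4], hex_hash)
    return sorted(glob.glob(glob.escape(prefix) + "-*"))
```

For example, find_attachments("/var/attachments", sha1_of_file("foo")) would list every stored copy of foo.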
If you want to save attachments to separate files without SIS (e.g.
you want to use your filesystem's deduplication), set:
mail_attachment_fs = posix
By default only attachments larger than 128 kB are written to attachment
storage. You can change this with:
mail_attachment_min_size = 128k
It's also possible to create a plugin that adds further restrictions on
when an attachment is saved separately. This might be useful to reduce
disk seeks for attachments that are typically shown inline by
clients/webmail. You can do this by overriding the
mailbox.save_is_attachment() method.
If you want to distribute attachments to multiple filesystems, just
create /var/attachments/[0-9a-f][0-9a-f] as symlinks pointing to
whatever mount paths you want.
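
Creating the 256 symlinks by hand is tedious, so here is a small Python sketch that does it. The round-robin spread over the mount paths and the per-prefix subdirectory on each mount are my own assumptions for the example; link_hash_dirs is a hypothetical helper, and any mapping from prefix to mount would work the same way.

```python
import os


def link_hash_dirs(attachment_dir: str, mount_paths: list[str]) -> None:
    """Create attachment_dir/00 .. ff as symlinks into mount_paths."""
    os.makedirs(attachment_dir, exist_ok=True)
    for i in range(256):
        name = f"{i:02x}"
        # Assumption: spread the prefixes round-robin over the mounts.
        target = os.path.join(mount_paths[i % len(mount_paths)], name)
        os.makedirs(target, exist_ok=True)
        link = os.path.join(attachment_dir, name)
        # Leave existing entries alone so the script can be re-run safely.
        if not os.path.lexists(link):
            os.symlink(target, link)
```

Usage would be something like link_hash_dirs("/var/attachments", ["/mnt/disk1", "/mnt/disk2"]), run before any attachments have been saved.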