Now that v2.0.0 is only waiting for people to report bugs (and me to figure out how to fix them), I've finally had time to start doing what I actually came here (Portugal Telecom/SAPO) to do. :)

The idea is to have dbox and mdbox support saving attachments (or MIME parts in general) to separate files, which with some magic makes single instance attachment storage possible. Comments welcome.

Reading attachments
-------------------

dbox metadata would contain entries like (this is a wrapped single-line entry):

X1442 2742784 94/b2/01f34a9def84372a440d7a103a159ac6c9fd752b
2744378 27423 27/c8/a1dccc34d0aaa40e413b449a18810f600b4ae77b

So the format is:

"X" 1*(<offset> <byte count> <link path>)

So when reading a dbox message body, it's read as:

offset=0: <first 1442 bytes from dbox body>
offset=1442: <next 2742784 bytes from external file>
offset=2744226: <next 152 bytes from dbox body>
offset=2744378: <next 27423 bytes from external file>
offset=2771801: <the rest from dbox body>

This is all done internally by creating a single istream that lazily opens the external files only when data is actually read from that part of the message.

The link paths don't have to be in any specific format. In the future it could perhaps recognize different formats (even http:// URLs and such).
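To make the format concrete, here's a rough Python sketch of that read path. The real implementation is a C istream that opens the external files lazily; this eager version (with invented names) just shows the offset arithmetic:

import os

def parse_extrefs(line):
    # "X" 1*(<offset> <byte count> <link path>)
    fields = line[1:].split()
    return [(int(fields[i]), int(fields[i + 1]), fields[i + 2])
            for i in range(0, len(fields), 3)]

def read_message(dbox_body, extrefs, attachment_root):
    # Reassemble the full message stream: bytes come from the in-dbox
    # body, except where an extref says to splice in an external file.
    out = bytearray()
    body_pos = 0
    for offset, size, path in extrefs:
        n = offset - len(out)              # dbox-body bytes before this part
        out += dbox_body[body_pos:body_pos + n]
        body_pos += n
        with open(os.path.join(attachment_root, path), "rb") as f:
            out += f.read(size)            # the externally saved MIME part
    out += dbox_body[body_pos:]            # the rest from the dbox body
    return bytes(out)

With the example entry above, this copies the first 1442 bytes from the dbox body, splices in 2742784 bytes from the first linked file, and so on.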
Saving attachments separately
-----------------------------

The message's MIME structure is parsed while the message is being saved. After each MIME part's headers are parsed, it's determined whether this part should be stored into attachment storage. By default it only checks that the MIME part isn't multipart/* (because then its child parts would contain the attachments). Plugins can also override this. For example, they could try to determine whether the commonly used clients/webmail always download and show the MIME part when opening the mail (text/*, inline images, etc.).

dbox_attachment_min_size specifies the minimum MIME part size that can be saved as an attachment. Anything smaller than that will be stored normally. While reading a potential attachment MIME part's body, it's first buffered into memory until the minimum size is reached. After that the attachment file is actually created and the buffer flushed to it.

Each attachment filename contains a global UID part, so no two (even identical) attachments will ever have the same filename. But there can be multiple attachment storages on different mount points, and each one could be configured to do deduplication internally. So identical attachments should somehow be stored to the same storage. This is done by taking a hash of the body and using a part of it as the path to the file. For example:

mail_location = dbox:~/dbox:ATTACHMENTS=/attachments/$/$

Each $ would be expanded to 8 bits of the hash in hex (00..ff). So the full path to an attachment could look like:

/attachments/04/f1/5ddf4d05177b3b4c7a7600008c4a11c1

The sysadmin can then create /attachments/00..ff as symlinks to different storages.

Hashing problems
----------------

Some problematic design decisions:

1) Hash is taken from the hardcoded first n kB vs. the first dbox_attachment_min_size bytes?

+ With the first n kB, dbox_attachment_min_size can be changed without causing duplication of attachments; otherwise, after the change, the same attachment could get hashed to a different storage than before the change.

- If n kB is larger than dbox_attachment_min_size, it uses more memory.

- If n kB turns out to be too small to get a uniform attachment distribution across the different storages, it can't be changed without recompiling.

2) Hash is taken from the first n bytes vs. everything?

+ The first n bytes are already read into memory anyway and can be hashed efficiently. The attachment file can be created without wasting extra memory or disk I/O. If everything is hashed, the whole attachment has to be stored first to memory or to a temporary file, and written from there to the final storage.

- With the first n bytes it's possible for an attacker to generate lots of different large attachments that begin with the same bytes, and so overflow a single storage. If everything is hashed with a secure hash function and a system-specific secret random value is added to the hash, this attack isn't possible.

I'm thinking that even though taking a hash of everything is the least efficient option, it's the safest option. It's pretty much guaranteed to give a uniform distribution across all storages, even against intentional attacks. Also, the worse performance probably isn't that noticeable, especially assuming a system where the local disk isn't used for storing mails, and the temporary files would be created there.
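As a sketch of what the "hash everything plus a secret" option could look like (Python; SHA-256 and all the names here are my illustration, not a decided format):

import hashlib

# Hypothetical system-specific secret, mixed into the hash so an attacker
# can't precompute which storage an attachment will land on.
SYSTEM_SECRET = b"random-per-installation-value"

def attachment_dir(body, template="/attachments/$/$"):
    # Hash the whole attachment body (the "safest option" above).
    digest = hashlib.sha256(SYSTEM_SECRET + body).hexdigest()
    # Each $ expands to 8 bits of the hash in hex (00..ff).
    it = (digest[i:i + 2] for i in range(0, len(digest), 2))
    return "/".join(next(it) if part == "$" else part
                    for part in template.split("/"))

The file created under the returned directory would still get its own unique, GUID-containing filename; only the directory choice comes from the hash.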
Single instance storage
-----------------------

All of the above assumes that if you want single instance storage, you'll need to enable it in your storage. Now, what if you can't do that?

I've been planning on making all index/dbox code use an abstracted-out simple filesystem API rather than using POSIX directly. This work can be started by making the attachment reading/writing code use the FS API, and then creating a single instance storage FS plugin. The plugin would work like this:

open(ha/sh/hash-guid): The destination storage is in the ha/sh/ directory, so a new temp file can be created under it. The hash is part of the filename to make unlink() easier to handle. Since the hash is already known at open() time, look up if a hashes/<hash> file exists. If it does, open it.

write(): Write to the temp file. If the hashes/ file is open, do a byte-by-byte comparison of the inputs. If there's a mismatch, close the hashes/ file and mark it as unusable.

finish():

a) If the hashes/ file is still open and it's at EOF, link() it to our final destination filename and delete the temp file. If link() fails with ENOENT (it was just expunged), goto b. If link() fails with EMLINK (too many links), goto c.

b) If the hashes/ file didn't exist, link() the temp file to the hash and rename() it to the destination file.

c) If the hashed file existed but wasn't the same, or if link() failed with EMLINK, link() our temp file to a second temp file, rename() it over the hashes/ file, and goto a.

unlink(): If hashes/<hash> has the same inode as our file and the link count is 2, unlink() the hash file. After that, unlink() our file.

One alternative to avoid using <hash> as part of the filename would be for unlink() to read the file and recalculate its hash, but that would waste disk I/O. Another possibility would be to not unlink() the hashes/ files immediately, but rather let some nightly cronjob stat() through all of the files and unlink() the ones that have link count=1. This could be wastefully inefficient though. Yet another possibility would be for the plugin to internally calculate the hash and write it somewhere. If it's at the beginning of the file, it could be read from there with some extra disk I/O. But is it worth it?..
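Going back to the finish() steps above, here's a minimal Python sketch of that logic. It's not the actual plugin code: the names are invented, and the races that real code has to handle (e.g. someone else creating hashes/<hash> between open() and finish()) are ignored.

import errno, os

def sis_finish(temp_path, hash_path, dest_path, state):
    # state: "match"    - hashes/<hash> existed and stayed byte-identical
    #                     during the write() comparison
    #        "missing"  - no hashes/<hash> file existed at open() time
    #        "mismatch" - it existed, but the contents differed
    while True:
        if state == "match":
            # a) an identical copy is already stored: link it into place
            try:
                os.link(hash_path, dest_path)
                os.unlink(temp_path)
                return
            except OSError as e:
                if e.errno == errno.ENOENT:    # hash file just expunged
                    state = "missing"          # goto b
                elif e.errno == errno.EMLINK:  # too many links to it
                    state = "mismatch"         # goto c
                else:
                    raise
        elif state == "missing":
            # b) our temp file becomes both hashes/<hash> and the dest
            os.link(temp_path, hash_path)
            os.rename(temp_path, dest_path)
            return
        else:
            # c) replace the differing hash file with our copy via a
            # second temp link, then retry a)
            temp2 = temp_path + ".2"
            os.link(temp_path, temp2)
            os.rename(temp2, hash_path)        # atomic replace
            state = "match"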
Extra features
--------------

The attachment files begin with an extensible header. This allows a couple of extra features to reduce disk space:

1) The attachment could be compressed (the header contains a compressed-flag)

2) If a base64 attachment is in a standardized form that can be 100% reliably converted back to its original form, it could be stored decoded and then encoded back to the original on the fly.

It would be nice if it were also possible to compress (and decompress) attachments after they have already been stored. This would be possible, but it would require finding all the links to the message and recreating them to point to the new message. (Simply overwriting the file in place would require that there are no readers at the same time, and that's not easy to guarantee unless Dovecot is stopped entirely. I also considered some symlinking schemes, but they seemed too complex and they'd also waste inodes and performance.)

Code status
-----------

Initial version of the attachment reading/writing code is already done and works (lacks some error handling and probably performance optimizations). The SIS plugin code is also started and should be working soon.

This code is very isolated and can't cause any destabilization unless it's enabled, so I'm thinking about just adding it to v2.0 as soon as it works, although the config file comments should indicate that it's still considered unstable.

On 7/19/2010 8:24 AM, Timo Sirainen wrote:
> Now that v2.0.0 is only waiting for people to report bugs (and me to
> figure out how to fix them), I've finally had time to start doing what I
> actually came here (Portugal Telecom/SAPO) to do. :)
>
> The idea is to have dbox and mdbox support saving attachments (or MIME
> parts in general) to separate files, which with some magic makes single
> instance attachment storage possible. Comments welcome.

YAAAY!!! Timo's gonna give us SIS!!! Is it done yet :) ?

I'm just thinking out loud here - and I'm probably way off base. Just tell me to shut up and I'll go away and hide until you're finished.

1. You've already identified that enabling this feature needs to avoid introducing problems - including treating different-but-similar attachments as identical. In your hashing choices, you only mentioned the attachment body. What about including size and date in the hash?

2. You didn't explicitly define whether SIS would be per-mailbox or system-wide. Speaking for myself, and probably a few others, I'll take whatever implementation I can get - but I'd love to see it system-wide.

3. Are you envisioning this as being handled totally within deliver, or would there be a server process for consolidating the messages? I'm wondering about the impact on high-traffic sites (which mine thankfully is NOT) - if deliver needs to crunch on large messages, could this lead to time-out issues from the MTAs?

A possible alternative: have deliver write the message out as normal, but flag it for attachment processing. Then have a secondary process be awakened to check for attachments and perform accordingly. That way any SIS overhead becomes invisible to the MTA - other than needing available system resources for the processing (and the attachment processing could be done at a lower priority).

--
Daniel
Timo Sirainen wrote:
> Now that v2.0.0 is only waiting for people to report bugs (and me to
> figure out how to fix them), I've finally had time to start doing what I
> actually came here (Portugal Telecom/SAPO) to do. :)
>
> The idea is to have dbox and mdbox support saving attachments (or MIME
> parts in general) to separate files, which with some magic makes single
> instance attachment storage possible. Comments welcome.

Cool.

> Extra features
> --------------
>
> The attachment files begin with an extensible header. This allows a
> couple of extra features to reduce disk space:
>
> 1) The attachment could be compressed (the header contains a
> compressed-flag)

Cool.

> 2) If a base64 attachment is in a standardized form that can be 100%
> reliably converted back to its original form, it could be stored decoded
> and then encoded back to the original on the fly.

Cool. I have thought about this issue in the past. What follows may be obvious to you already, but I might as well mention it rather than miss something.

Presumably you want to be able to recreate the original base64 stream exactly, verbatim? Under base64, the number of 4-byte (encoded) / 3-byte (decoded) cells per line is not fixed by the specs. I believe the optimal value is 19 cells per line, but I have seen some systems use 18 cells per line, and I think I have seen 15 as well. Once you have three possibilities, you might as well just store the number of cells per line.

I would suggest considering the base64 format as (conceptually) having an (integer) parameter for the number of cells in each line (except for the last line). So base64(19) would have on each line 19 cells encoding 57 (19 × 3) bytes into 76 (19 × 4) bytes.

Probably you would need a base64 matcher/decoder which is smarter than normal base64 decoders, and which checks to make sure that all lines (apart from the last) are encoded (a) canonically (e.g. with no trailing whitespace), and (b) using the same number of cells per line. The base64 matcher/decoder needs to return information about the cell count as well as the decoded data. If any line is not canonical base64 or uses a different number of cells, then the base64 may still be valid but "weird", so it would just be stored as the original base64 stream.

When recovering message data, obviously your base64 encoder needs to take a parameter which is the number of cells per line to encode. Then you get back your original base64 stream verbatim.
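A rough sketch of such a matcher/decoder (Python, operating on lines already split from the stream; how the stream terminates is a separate matter, see below):

import base64, binascii

def match_base64(lines):
    # Try to parse a base64 body as "canonical base64 with N cells per
    # line". Returns (cells_per_line, decoded bytes), or None if the
    # stream is valid-but-weird and must be stored as-is.
    if not lines or not lines[0] or len(lines[0]) % 4:
        return None
    cells = len(lines[0]) // 4
    decoded = bytearray()
    for i, line in enumerate(lines):
        last = (i == len(lines) - 1)
        if not last and (len(line) != cells * 4 or b"=" in line):
            return None                # wrong cell count or early padding
        if last and (not line or len(line) % 4 or len(line) > cells * 4):
            return None
        try:
            decoded += base64.b64decode(line, validate=True)
        except binascii.Error:
            return None                # non-canonical characters
    return cells, bytes(decoded)

def encode_base64(data, cells):
    # Re-encode with the stored cells-per-line parameter; the result is
    # the original line sequence, byte for byte.
    enc = base64.b64encode(data)
    step = cells * 4
    return [enc[i:i + step] for i in range(0, len(enc), step)]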
=

Some systems finish the base64 stream with a newline (which in a multipart manifests as a blank line between the base64 stream and the '--' of the MIME boundary), whereas some systems finish the base64 stream at the end of the final 4-byte cell (which in a multipart manifests as the '--' of the MIME boundary appearing on the line immediately following the base64 encoded data). Your encoding allows for arbitrary data between the objects, so you would have no problem storing these two cases verbatim. But it is something to watch out for when storing.

=

Maybe it would be a good idea to have the ability to say that an object was base64 decoded AND compressed (i.e. to recover the original stream fragment you need to decompress and then base64 encode, with the relevant number of base64 cells per line) - as well as options for just base64 decoded or just compressed. You could go nuts and say that it is an arbitrarily-sized filter stack, but my first guess would be that this would be too much flexibility. It might be better to say that there can be zero or one decode/encode layers (like base64 or something else), and zero or one compression layers (like gzip or bzip2 or xz/LZMA).

Obviously, whatever translations are required to recover the original stream should be encoded into the attachment file, so that sysadmins can tune the storage algorithm without affecting previously stored attachments.
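A toy sketch of that shape - at most one decode layer plus at most one compression layer - with zlib standing in for whichever compressor actually gets chosen (all names here are invented):

import base64, zlib

def store_filters(data, b64_cells=None, compress=False):
    # Applied before writing the attachment file: decode first, then
    # compress.
    if b64_cells is not None:
        data = base64.b64decode(data)
    if compress:
        data = zlib.compress(data)
    return data

def load_filters(data, b64_cells=None, compress=False, final_crlf=True):
    # Reverse the layers in the opposite order to recover the original
    # message-stream fragment byte for byte.
    if compress:
        data = zlib.decompress(data)
    if b64_cells is not None:
        enc = base64.b64encode(data)
        step = b64_cells * 4
        data = b"\r\n".join(enc[i:i + step]
                            for i in range(0, len(enc), step))
        if final_crlf:
            # whether the original stream ended with a newline has to be
            # recorded too (see the first point above)
            data += b"\r\n"
    return data

Bill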
Timo Sirainen wrote:
> X1442 2742784 94/b2/01f34a9def84372a440d7a103a159ac6c9fd752b
> 2744378 27423 27/c8/a1dccc34d0aaa40e413b449a18810f600b4ae77b
>
> So the format is:
>
> "X" 1*(<offset> <byte count> <link path>)
>
> ...
>
> Extra features
> --------------
>
> The attachment files begin with an extensible header. This allows a
> couple of extra features to reduce disk space:
>
> 1) The attachment could be compressed (the header contains a
> compressed-flag)
>
> 2) If a base64 attachment is in a standardized form that can be 100%
> reliably converted back to its original form, it could be stored decoded
> and then encoded back to the original on the fly.

Consider storing the recovery filter stack in the dbox metadata rather than in the attachment file, e.g. so I put "-b64_19" after the file path to indicate that it needs to be exploded to base64 with 19 cells per line before being incorporated in the message stream:

X1442 2742784 94/b2/01f34a9def84372a440d7a103a159ac6c9fd752b -b64_19
2744378 27423 27/c8/a1dccc34d0aaa40e413b449a18810f600b4ae77b -b64_19

This means that the attachment file can be purely the attachment data. This has a couple of upshots:

1. If one person receives a message with an attachment which is encoded with base64 at, say, 19 cells (76 bytes) per line, and then re-sends the same file as an attachment to someone else but their MUA encodes base64 at, say, 18 cells (72 bytes) per line, the attachment file can contain exactly the same data, allowing for deduplication even in this case.

2. Assuming we have configured Dovecot to decode base64 but not to compress, then the file in which we store the attachment data contains literally the exact same byte stream as if the attachment were saved out from the MUA. I don't know what practical use this might be, but it /sounds/ cool :-) Perhaps a suitable filesystem or backup-system could deduplicate both a file *and* its instance as a message attachment.
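A parsing sketch of that suffix idea (Python; the "-b64_19" syntax is just the proposal above, and this assumes link paths never start with "-"):

def parse_extrefs_with_filters(line):
    # "X" 1*(<offset> <byte count> <link path> *("-" <filter>))
    fields = line[1:].split()
    refs, i = [], 0
    while i < len(fields):
        offset, size, path = int(fields[i]), int(fields[i + 1]), fields[i + 2]
        i += 3
        filters = []
        while i < len(fields) and fields[i].startswith("-"):
            filters.append(fields[i][1:])   # e.g. "b64_19"
            i += 1
        refs.append((offset, size, path, filters))
    return refs

For the example entries above, each part comes back with filters == ["b64_19"], telling the reader to base64 encode the file's raw bytes at 19 cells per line before splicing them into the message stream.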
Bill

On Mon, 2010-07-19 at 16:24 +0100, Timo Sirainen wrote:
> Code status
> -----------
>
> Initial version of the attachment reading/writing code is already done
> and works (lacks some error handling and probably performance
> optimizations). The SIS plugin code is also started and should be
> working soon.

The initial version can be tested here: http://hg.dovecot.org/dovecot-2.0-sis/

It should work with sdbox, but the settings parsing is causing mdbox to crash at startup. I need to figure out some way to fix that. You can enable it by setting e.g.:

dbox_attachment_dir = ~/dbox/attachments

By default it does SIS, but if you want only external attachment storage, you can set:

dbox_attachment_fs = posix

(the default is dbox_attachment_fs = sis posix)

TODO has:

- attachments can generate very long metadata lines. the input stream reading them probably has a limit..
- save attachments base64-decoded
- if the attachment stream is too short/too long, log an error
- if a file is completely empty, maybe show it as spaces? this could be useful for deleting viruses by truncating their files to zero bytes
- delayed deduplication daemon?

Perhaps the settings troubles could be solved simply by making the code common to all storage backends, even though currently only dbox would use it. Then the setting names would be mail_attachment_fs and mail_attachment_dir.
Just keep in mind signed emails: the email contents need to be presented back to the client exactly as they came in, so the client can verify the email has not been altered.
Damon Atkins wrote:
> Just keep in mind signed emails: the email contents need to be
> presented back to the client exactly as they came in, so the client
> can verify the email has not been altered.

Timo will presumably correct me if I am wrong: I believe it is already a design goal that messages going into the mail store will come out byte-for-byte verbatim at the RFC5322 message stream level.

I suppose the suggestion to come out of this is that any partial or complete RFC2045 deconstruction of messages will suggest to some people the possibility of non-verbatim retrieval, so it might be an idea for the documentation to make it clear whether or not stored messages will be retrieved verbatim at the RFC5322 message stream level.

It might be useful if the documentation gave confidence-building examples, such as "for example, blocks of base64 encoded data will only be stored in decoded form if, when re-encoded, the original base64 stream can be reconstructed byte-for-byte verbatim". Or, if messages might not come back verbatim, it might be useful if the documentation explained in which scenarios retrieval might not be verbatim, and the nature of the differences which may occur.

Bill
Hi

> The idea is to have dbox and mdbox support saving attachments (or MIME
> parts in general) to separate files, which with some magic makes single
> instance attachment storage possible. Comments welcome.

This is a really interesting idea. I have previously given it some thought. My 2p:

1) Being able to ask "the server" if it has an attachment matching a specific hash would be useful for a bunch of other reasons. This result needs to be (cryptographically) unique, and hence the hash needs to be a good hash (MD5/SHA or better) of the complete attachment, ideally after decoding.

2) It might be useful to be able to find attachments with a specific hash regardless of whether the attachment has been spat out separately (think of a use case where we want to be able to spot a 2KB footer gif which on its own isn't worth worrying about, but some offline scan later discovers that 90% of emails contain this gif and we wish to split it off as a policy decision).

3) Storing attachments by hash may be interesting for use with specialist filesystems. E.g. an interesting direction for dbox might be to store the headers and message text in some (compressed?) format with high linear read rates, and most attachments in some key/value storage system.

4) Many modern IMAP clients are starting to download attachments on demand. We need to be able to serve only parts of the email efficiently, without needing to pull in the blobs. Stated another way, it's desirable not to have to peek inside the blobs in order to fetch arbitrary MIME parts.

5) It's going to be easy to break signed emails... Need to be careful.

6) In many cases this isn't a performance win... It's still a *great* feature, but two disk seeks outweigh a lot of linear read speed.

7) When something gets corrupted... It's worth pondering how we can audit and find unreferenced "blobs" later.

Some of the use cases I have for these features (just in case you care...):

We have a feature which is a bit like the opposite of one of those services for sending big attachments. When a user's email arrives we remove all attachments that meet our criteria and replace them with links to the files. This requires being able to give users a coded link which can later be decoded to refer to a specific attachment. If this change offered us additional ways to find attachments by hash or whatever, then it would be extremely useful.

Another feature we offer is a client application which compresses and reduces bandwidth when sending/receiving emails. We currently don't try to hash bits of email, but it's an idea I have been mulling over for IMAP users, where we typically see the data sent via SMTP, then uploaded to the IMAP "sent items", then often downloaded again when the client polls the sent items for new messages (durr). Being able to see if we have binary content which matches a specific hash could be extremely interesting.

I'm not sure whether I can do 100% of the above with your current proposal. For example, it's not clear if 4) is still possible. Also, without a "guaranteed" hash we can't use the hash as a lookup key in a key/value storage system (which implies another mapping of keys to keys is required). Can we do an (efficient) offline scan of messages looking for duplicated hash keys (i.e. can the server calculate hashes for all attachment parts ahead of time)?

Sounds extremely interesting. Looking forward to seeing this develop!

Cheers

Ed W