thr3ads.net - dovecot - [Dovecot] dbox redesign [Feb 2009]

If this information is useful, please help other people find it:
Share via:

Timo Sirainen

2009-Feb-11 22:20 UTC

[Dovecot] dbox redesign

This is about how to implement multiple msgs/file dbox format. The
current v1.1's one msg/file design would stay pretty much the same and
it would be compatible with this new design.

dbox directories with multiple msgs/file would be like:

~/dbox/storage/ has the actual mail data for all mailboxes
~/dbox/mailboxes/ has subdirectories containing mailboxes and their
indexes

Also since dbox supports already the single msg per file, those files
would be stored in the mailboxes/ directory. So the idea would be that
either you use multiple msgs per file using a global storage, or you use
single msg per file without a global storage (or it's also possible to
be in a mixed setup with some mails in storage/ and some in mailboxes/,
mainly to allow migration between those configurations).

The storage/ directory would have a new "map index" which is a regular
dovecot index (dovecot.index and dovecot.index.log). So the mailbox
index would point to mails using an intermediary "map UID". This way
if
mails are moved to another file only the map index needs to be updated.

GUID would be a globally unique 128 bit ID for messages. So if map
indexes get corrupted for any reason it's possible to rebuild it by
finding the mails using GUIDs.

v1.1 dbox has this "dbox.index" file which I was originally planning
on
using with multiple msgs/file. It had complex file range locking stuff.
Now I'm thinking that it's pretty much useless. The only reason for its
existence with the new design is for listing metadata for files
converted from Maildir.

Map index record would contain:
 - 32 bit map UID
 - 8 bit flags (MAIL_DELETED flag = message marked as expunged)
 - 8 bit unused wasted space
 - 16 bit refcount
 - 32 bit file sequence
 - 32 bit file offset
 --> total 128 bits/msg

Mailbox index:
 - IMAP UID, flags, keywords, etc.
 - 32 bit map UID
 - 128 bit GUID

dbox file metadata:
 - 128 bit GUID
 - size, vsize, received time, saved time, etc.
 - initial mailbox name (if all indexes get trashed, we can still figure
out at least one mailbox where to put the mail. copies would get lost
though.)
 (- no map UID, no imap UID)

How to save a message with multiple msgs/file:

1. Find dbox file where to append to:
1.1. Look up the last message from map index
1.2. Is the file "too old"? (or doesn't exist at all)
 - Yes -> Create new dbox file
1.3. Is the file "too large"?
 - Yes -> Look at the previous file (one sequence less) and goto 1.2.
1.4. Try to lock the file.
 - Fail -> Look at prev file and goto 1.2.

Now we have a locked/new dbox file where we can write to. Because 1.4.
step only tries to lock the file, there's no waiting on locks. This also
means that if e.g. two processes are writing new messages rapidly they
may be appending actively to two different files. I don't think that's a
problem, better than waiting for locks.

2a) We're using an existing file and we need to find the append offset.
Since we found the file by finding the last msg in the file, we also
know the last message's offset. I wasn't really planning on saving the
message sizes in the index file, so to get the append offset I guess it
needs to do an extra read on the last msg's header to find the size and
skip over it. Hmm. Or would it be less disk I/O to store the size on the
index so it could be found directly? I'm not really sure..

In any case, after we find the append offset, check to see if it's at
EOF. If not, that means that either another process just saved a new
message there or a process crashed previously and left garbage lying
around. Refresh the map index to see if this file+offset exists in it.
If not, truncate the file and just continue writing there. If it exists,
figure out the new append offset and see again if the file limit would
be reached. If the file would become too large, unlock the file and goto
step 1.

2b) We're writing to a new file. No need to worry about anything in 2a)

3. Write the message and its metadata to dbox file (including generated
128 bit GUID).

4. Assign map UIDs for the written mails and write APPEND records to map
index's transaction log. The record would contain the map UID, file seq,
offset, refcount=1. The transaction is saved with a "weak" flag
(wonder
if there's a better name for this) and its offset is remembered.
 - If we're creating a new dbox file, it's assigned the file seq and
rename()d to the final file name while the map index is locked.

5. Write APPEND record to mailbox index's transaction log with IMAP UID,
map UID and GUID (and flags, keywords, etc).

6. Write "commit offset=x" record where x is the offset remembered in
step 4. This marks the 4's weak transaction as being fully finished.

7. dbox file is unlocked (if we weren't creating a new file).

When reading the index and we see a weak transaction without a commit
record, call a resolve() function in dbox code. It finds the dbox file
in the weak transaction and tries to lock it. If it can't lock it, it
(probably) means that there's still a process that's going to finish the
save. If a locking succeeded, run a fixup code that goes through all
files that are referenced in weak transactions and makes sure all the
refcounts and such are valid. Then it writes commit records for each
such transaction.

Copying can be done by locking source dbox file, writing a weak
"refcount++" transaction and continuing from saving step 5.

The only step I can think of where there's a small problem is if data is
written to dbox file, then process dies, but no saving is attempted to
the same file later. The data is left there as garbage until maybe some
day later when a message is expunged from that file and cleanup code
notices that there's extra data. I thought about reordering steps so
that the records are written to map index first, but that's less
efficient when writing multiple messages, because each one would need to
be saved separately to the map index (where each save would be lock +
fstat (+ read) + write + unlock). This is probably a rare problem which
doesn't really matter..

Expunging:

The "expunge" and "cleanup to remove deleted space" are
separate steps.
Preferably the cleanup would be done when the server is "less busy",
but
I guess it could also be configured to happen either immediately or
during logout (or once an hour or so).

1. Write "want to expunge" record to mailbox's transaction log.
2. Write a weak "refcount--" to map index's transaction log.
3. Write "expunged" record to mailbox's transaction log.
4. Write "commit" for the weak refcount--.

(The two-stage want-to-expunge(1)/expunged(3) is already done for
maildir/mbox also.)

I was also thinking maybe this could be just 3 steps without a weak
transaction. Especially since without a fsync() between 2 and 3 there's
really no guarantee about the order in which 2 and 3 happen. And
fsync()s are kind of annoying. But then again if that isn't done and it
crashes between 2 and 3 the expunge is tried again later which now
decrements the refcount again. If the message's original refcount was 1
it would now be negative. That's not really a problem. More problematic
is if the message had been copied to another mailbox and refcount=2, now
the refcount would be 0 and the message in the other mailbox would get
unexpectedly lost. Not good.

Space deletion cleanup:

1. Check from map index header to see there are any refcount=0 mails.
Although having this step probably requires some annoying dbox-specific
code in index handling code. I guess I'll have to see later if there's a
non-horribly-ugly way to do this.

2. Go through all records in the map index. Get a list of files that
have refcount=0 records. Also make a copy of file seq+offsets and sort
them (so at step 3 we can easily find what mails exist).

3. For each file that has refcount=0 records:

3.1. Try to lock the file.
 -> Fail: Skip to next file. Either someone's appending to it or
cleaning it up already.
3.2. Refresh map index. Add any newly added records to the sorted
records list (rare).
3.2. Find the first refcount=0 mail.
3.3. Copy all mails after that with refcount>0 to a new file.
3.4. Lock map index, assign new file seq for the new file, rename() it,
for each mail we just copied, using a weak transaction write the seq
+offset updates to map index's transaction log. Unlock map index.
3.5. Depending on if first refcount=0 mail was the first mail, either
unlink() the file or ftruncate() it.
3.6. Write "commit" record to finish the weak transaction.

That's a bit inefficient when cleaning up multiple dbox files, but I'm
not sure if there's a much better way unless the map index is locked for
the entire time. Since the copying might take a while, a long map index
lock is bad since it prevents messages from being saved.

Reading mails can be done without locking. But since a file can be
truncated unexpectedly, the code must be able to find the new file and
offset in that situation and continue reading from there. If the message
we were reading was just expunged+cleaned up, the client will have to be
just disconnected. Since the cleanup shouldn't normally happen
immediately after expunge, this should happen rarely/never.

Oh and about all these new temp files that are created. When dbox is
opened the dir is stat()ed and if its atime is older than 1 week
(currently hardcoded) it readdir()s through the dir to see if there are
any old temp* files and then unlinks them.

With the new design flags/keywords would never be written to actual dbox
files. This helps with backups and also avoids the horribleness of
having to increase metadata area. To avoid complete data loss due to
index corruption (which really shouldn't be happening nowadays), the
indexes are also once in a while backed up to e.g. dovecot.index.backup
file. If Dovecot detects missing/broken index, it uses this backup copy
to rebuild the index as well as it can.

I'm also wondering if it's better for each mailbox to have its separate
dovecot.index.cache file or if there should be one cache file for the
map index. The upside with one index is that if a mail is copied there's
no need to duplicate its cached data. The downside is that I'm worried
it would cause more disk I/O, since mailboxes are typically accessed
separately, so it might be better that the cache data for one mailbox is
stored together rather than all around one bigger cache file.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL:
<http://dovecot.org/pipermail/dovecot/attachments/20090211/93ad255d/attachment-0002.bin>

Seth Mattinen

2009-Feb-11 22:32 UTC

head link

[Dovecot] dbox redesign

Timo Sirainen wrote:> This is about how to implement multiple msgs/file dbox format. The
> current v1.1's one msg/file design would stay pretty much the same and
> it would be compatible with this new design.
> 
Out of curiosity, what's the advantage to going to multiple messages per
file? Wouldn't this have the same problems as mbox?

~Seth

Mikkel

2009-Feb-12 10:29 UTC

head link

[Dovecot] dbox redesign

Hi Timo

I have a few comments. Please just disregard them if I have 
misunderstood your design.

Regarding your storage plan
I find it very important that users can be stored in different locations 
because:
1. Discount users could be placed on cheap storage while others are 
offered premium service on expensive hardware
2. It's easy to scale if you just add another LUN from your SAN or mount 
from NAS
3. In order to avoing huge directories you can put users into subdirs 
with each subdir containing only say 1000 users each
All this is very easy to achieve in 1.1 because you can return 
individual storage dirs for indexes and data from the user db.
I'm not sure from reading your post whether this will still be possible 
but I believe it?s a very important thing.

Regarding 7.
I very much for all the self healing you describe.
There is nothing worse than huge complex systems that fail just because 
of some minor error that could easily be fixed without manual intervention.
But also I'm a little worried in this regard.

Maildir is so robust that nothing can really go wrong. But here you have 
index files and data files located in different places.
Imagine the index file being on one NFS mount whilst the data resides on 
another.
Or if the administrator is purposely loading a different index file or 
data file from a backup.

Worst case scenario is that the self healing takes a manual operation 
for a failure and breaks something.
It should be very resilient to temporarily losing access to all files in 
this operation (could happen very often on NFS mounts).

Also I imagine the self-healing going into loops if it doesn't 
understand what?s going on.
If the data changes dues to manual intervention or par of the file 
system can be accessed you could imagine the self healing process trying 
again and again to fix something that isn't its job to fix.
In that case it would be better if it just skipped the apparent failures.

Timo wrote:
 >I'm also wondering if it's better for each mailbox to have its
separate
 >dovecot.index.cache file or if there should be one cache file for the
 >map index.
I think you should consider more files as the general choice (not only 
regarding cache files).
Imagine many dovecot servers accessing the same storage simultaneously. 
I figure it would be a lot easier if they weren?t all trying to 
read/update one essential file at the same time (with only one file, 
load can?t be spread across multiple mounts and everything goes down if 
the mount with the essential file is inaccessible).
If there is serious data corruption and you have only one file then all 
operations are paused while the self healing is trying to figure out 
what went wrong (and what happens if different servers decide to do 
self-healing on this one file at the same time?).
With one file per maildir only a small portion of the users are 
affected, the load is spread and really bad file corruption doesn?t 
break everything for thousands of users.

Other than that I?m just really glad that dbox is progressing. I 
consider it the feature.
Dbox is the email administrator?s wet dream. I?m already dreaming of 
completely avoiding the scalability issues of large Maildirs (which is 
the biggest challenge today in my opinion) and reducing the IO. Buying 
more IO is an order of magnitude more expensive than getting more RAM or 
CPU power (and dovecot barely needs any RAM and CPU anyway).

Best wishes, Mikkel

Possibly Parallel Threads

Search for more maybe matching threads

dovecot - Feb 2009 - dbox redesign

[Dovecot] dbox redesign

[Dovecot] dbox redesign

[Dovecot] dbox redesign

Possibly Parallel Threads