I was wondering if I should add compression support to mdbox one mail at
a time or one file (~2MB) at a time. The tradeoffs are:

 * one mail at a time allows quickly seeking to wanted mail inside the
   file, but it can't compress mails as well
 * one file at a time compresses better, but seeking is slow because it
   can only be done by uncompressing all the data until the wanted offset
   is reached

I did a quick test for this with 27 MB of my old INBOX mails:

(note the -b option, so it doesn't count wasted fs space)
mdbox/storage% du -sb .
15120350        .

Maildir/cur% du -sb .
16517320        .

% echo 1-15120350/16517320|bc -l
.08457606924125705623

So, compressed mdboxes take 8.5% less space. This was with regular gzip
compression with default level. With bzip2 -9 compression the difference
was 10%.

Any thoughts on if 8-10% is significant enough improvement to make
seeking performance worse? Or perhaps I should just implement both
ways.. :)
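As a rough illustration of the tradeoff (a plain Python sketch, not mdbox's
actual storage format; the messages and the offset index are made up for the
example): per-mail compression keeps an index so a single mail can be
decompressed directly, while whole-file compression gives a better ratio but
forces inflating everything up to the wanted mail.

import zlib

def compress_per_mail(mails):
    # one deflate stream per mail; remember (offset, length) so a single
    # mail can be read back without touching the others
    blob, index = b"", []
    for mail in mails:
        comp = zlib.compress(mail)
        index.append((len(blob), len(comp)))
        blob += comp
    return blob, index

def read_per_mail(blob, index, n):
    off, length = index[n]
    return zlib.decompress(blob[off:off + length])   # direct seek to mail n

def compress_whole_file(mails):
    # everything as one stream: better ratio, no random access
    return zlib.compress(b"".join(mails))

def read_whole_file(blob, mail_sizes, n):
    data = zlib.decompress(blob)                     # must inflate it all first
    off = sum(mail_sizes[:n])
    return data[off:off + mail_sizes[n]]

# tiny demo with made-up messages
mails = [(b"Subject: test %d\r\n\r\n" % i) + b"body text " * 200 for i in range(50)]
per_blob, idx = compress_per_mail(mails)
whole_blob = compress_whole_file(mails)
print("per-mail: %d bytes, whole-file: %d bytes" % (len(per_blob), len(whole_blob)))
assert read_per_mail(per_blob, idx, 7) == mails[7]
assert read_whole_file(whole_blob, [len(m) for m in mails], 7) == mails[7]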
On Fri, Feb 5, 2010 at 4:36 PM, Timo Sirainen <tss at iki.fi> wrote:
> I was wondering if I should add compression support to mdbox one mail at
> a time or one file (~2MB) at a time. The tradeoffs are:
>
> * one mail at a time allows quickly seeking to wanted mail inside the
>   file, but it can't compress mails as well
> * one file at a time compresses better, but seeking is slow because it
>   can only be done by uncompressing all the data until the wanted offset
>   is reached
>
> I did a quick test for this with 27 MB of my old INBOX mails:
>
> (note the -b option, so it doesn't count wasted fs space)
> mdbox/storage% du -sb .
> 15120350 .
>
> Maildir/cur% du -sb .
> 16517320 .
>
> % echo 1-15120350/16517320|bc -l
> .08457606924125705623
>
> So, compressed mdboxes take 8.5% less space. This was with regular gzip
> compression with default level. With bzip2 -9 compression the difference
> was 10%.
>
> Any thoughts on if 8-10% is significant enough improvement to make
> seeking performance worse? Or perhaps I should just implement both
> ways.. :)

Isn't the real difference even smaller?

15120350/28311552 = .534
16517320/28311552 = .583

So that's just under 5%.

Either way, I'd say go with compressing each mail individually for quick
seeking. Also, if you were compressing the whole file of mails as a single
stream, wouldn't you have to recompress and rewrite the whole file for each
new mail delivered?

Matt
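For what it's worth, both sets of figures can be reproduced if the
uncompressed mailbox really is 27 MiB (28311552 bytes is an assumption taken
from the "27 MB" above); a quick sketch of the arithmetic in Python:

maildir_plain = 28311552   # assumed: 27 MiB, the uncompressed INBOX size
mdbox_gzip    = 15120350   # du -sb of the gzip-compressed mdbox
maildir_cur   = 16517320   # du -sb of Maildir/cur

# Timo's figure: savings of compressed mdbox relative to Maildir/cur
print(1 - mdbox_gzip / maildir_cur)                    # ~0.085 -> 8.5%

# Matt's figures: both layouts measured against the uncompressed 27 MiB
print(mdbox_gzip / maildir_plain)                      # ~0.534
print(maildir_cur / maildir_plain)                     # ~0.583
print((maildir_cur - mdbox_gzip) / maildir_plain)      # ~0.049 -> just under 5%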
Timo Sirainen put forth on 2/5/2010 6:36 PM:
> So, compressed mdboxes take 8.5% less space. This was with regular gzip
> compression with default level. With bzip2 -9 compression the difference
> was 10%.
>
> Any thoughts on if 8-10% is significant enough improvement to make
> seeking performance worse? Or perhaps I should just implement both
> ways.. :)

Given the cost of mechanical storage today (1TB for less than $100 USD) I
can't see why anyone would want to implement compression. The cases I can
think of would be folks using strictly SSD (if there are any), those doing
backups, or very large sites. Then again, I'm thinking most such backup
solutions implement their own compression anyway, so it makes no difference
in that case except possibly LAN/SAN bandwidth in moving compressed vs
uncompressed data.

I would think only really large sites would consider compression. A 10%
space saving across 1 million mailboxen might add up to some significant
storage hardware dollar savings, not to mention the power savings. This is
just a guess, as I've never worked in such an environment. If a projected
infrastructure build-out is calling for a $1 million back end clustered
shared storage array for mailboxen (think NetApp, IBM, SGI), and this
compression cuts your number of required spindles by 10%, that's
potentially a $100,000 savings. In today's economy, folks would be
seriously looking at keeping that $100,000 in their pocketbook.

Very large sites would probably want maximum compression while retaining
maximum performance. You didn't state the CPU burn difference between the
two methods, or the total CPU burn for either method. If one burns 50% CPU
and the other 60% on a loaded system, say 500 concurrent users, the
relative difference is minor, but both are so horrible WRT CPU that no one
would use them. If the relative load is 10% for the first method and 12%
for the other, then I'd say some people would gladly adopt the second,
slightly less efficient method.

--
Stan
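One way to put numbers on the CPU-burn question would be a rough benchmark
along these lines (Python sketch, illustrative only; the Maildir/cur path is
just an example corpus) comparing default gzip/zlib against bzip2 -9 on the
same mails:

import bz2, glob, time, zlib

def cpu_cost(compress, mails):
    # CPU seconds and total compressed size for compressing every mail
    start = time.process_time()
    size = sum(len(compress(m)) for m in mails)
    return time.process_time() - start, size

# example corpus path -- point this at a real maildir to get real numbers
mails = [open(path, "rb").read() for path in glob.glob("Maildir/cur/*")]
plain = sum(len(m) for m in mails)

for name, fn in [("gzip default", lambda m: zlib.compress(m, 6)),
                 ("bzip2 -9",     lambda m: bz2.compress(m, 9))]:
    secs, size = cpu_cost(fn, mails)
    print("%-12s %6.2f CPU s   ratio %.3f" % (name, secs, size / plain))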
ZFS has support for compression on the file system (lzjb | gzip | gzip-N |
zle). gzip eats CPU even at levels as low as 3. From the zfs man page:

    The lzjb compression algorithm is optimized for performance while
    providing decent data compression. You can specify the gzip level by
    using the value gzip-N, where N is an integer from 1 (fastest) to 9
    (best compression ratio).

So it comes down to the CPU cost of supporting compression (gzip) vs the
disk cost if mailboxes are not compressed.

The other interesting feature ZFS (OpenSolaris) has is de-duplication: it
detects whether the block about to be written is the same as another block,
and instead of writing it again it puts in a pointer to the already existing
block, saving disk space. The only problem with this is that you would need
to filter out the headers, which are different even when the body is the
same.

Cheers
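A small sketch of that header/dedup concern (nothing ZFS-specific; the 4 KB
block size and the sample headers are made up for illustration): with
fixed-size block dedup, headers of different lengths shift every following
block, so two mails with identical bodies share no blocks unless the body is
stored block-aligned or separately.

import hashlib, os

BLOCK = 4096  # illustrative block size; ZFS dedups at its recordsize (often larger)

def block_hashes(data):
    # hashes of fixed-size blocks, roughly how block-level dedup sees the data
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

body = os.urandom(20000)                       # identical body in both mails
header_a = b"To: alice@example.com\r\nMessage-Id: <1@example>\r\n\r\n"
header_b = b"To: bob@example.org\r\nMessage-Id: <2@example>\r\n\r\n"

# headers inline: the different-length headers shift every body block,
# so nothing dedups even though the bodies are byte-identical
print(len(block_hashes(header_a + body) & block_hashes(header_b + body)))   # 0

# if the body were stored block-aligned (header padded out to a block
# boundary), the body blocks would line up and dedup
pad = lambda h: h + b"\0" * (-len(h) % BLOCK)
print(len(block_hashes(pad(header_a) + body) &
          block_hashes(pad(header_b) + body)))                              # 5: all body blocks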