I was wondering if I should add compression support to mdbox one mail at
a time or one file (~2MB) at a time. The tradeoffs are:

 * one mail at a time allows quickly seeking to wanted mail inside the
   file, but it can't compress mails as well
 * one file at a time compresses better, but seeking is slow because it
   can only be done by uncompressing all the data until the wanted offset
   is reached

I did a quick test for this with 27 MB of my old INBOX mails:

(note the -b option, so it doesn't count wasted fs space)
mdbox/storage% du -sb .
15120350        .

Maildir/cur% du -sb .
16517320        .

% echo 1-15120350/16517320|bc -l
.08457606924125705623

So, compressed mdboxes take 8.5% less space. This was with regular gzip
compression with default level. With bzip2 -9 compression the difference
was 10%.

Any thoughts on if 8-10% is significant enough improvement to make
seeking performance worse? Or perhaps I should just implement both
ways.. :)
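As a rough illustration of the tradeoff (a plain Python sketch, not mdbox's
actual storage format; the messages and the offset index are made up for the
example): per-mail compression keeps an index so a single mail can be
decompressed directly, while whole-file compression gives a better ratio but
forces inflating everything up to the wanted mail.

import zlib

def compress_per_mail(mails):
    # one deflate stream per mail; remember (offset, length) so a single
    # mail can be read back without touching the others
    blob, index = b"", []
    for mail in mails:
        comp = zlib.compress(mail)
        index.append((len(blob), len(comp)))
        blob += comp
    return blob, index

def read_per_mail(blob, index, n):
    off, length = index[n]
    return zlib.decompress(blob[off:off + length])   # direct seek to mail n

def compress_whole_file(mails):
    # everything as one stream: better ratio, no random access
    return zlib.compress(b"".join(mails))

def read_whole_file(blob, mail_sizes, n):
    data = zlib.decompress(blob)                     # must inflate it all first
    off = sum(mail_sizes[:n])
    return data[off:off + mail_sizes[n]]

# tiny demo with made-up messages
mails = [(b"Subject: test %d\r\n\r\n" % i) + b"body text " * 200 for i in range(50)]
per_blob, idx = compress_per_mail(mails)
whole_blob = compress_whole_file(mails)
print("per-mail: %d bytes, whole-file: %d bytes" % (len(per_blob), len(whole_blob)))
assert read_per_mail(per_blob, idx, 7) == mails[7]
assert read_whole_file(whole_blob, [len(m) for m in mails], 7) == mails[7]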
On Fri, Feb 5, 2010 at 4:36 PM, Timo Sirainen <tss at iki.fi> wrote:
> I was wondering if I should add compression support to mdbox one mail at
> a time or one file (~2MB) at a time. The tradeoffs are:
>
> * one mail at a time allows quickly seeking to wanted mail inside the
>   file, but it can't compress mails as well
> * one file at a time compresses better, but seeking is slow because it
>   can only be done by uncompressing all the data until the wanted offset
>   is reached
>
> I did a quick test for this with 27 MB of my old INBOX mails:
>
> (note the -b option, so it doesn't count wasted fs space)
> mdbox/storage% du -sb .
> 15120350 .
>
> Maildir/cur% du -sb .
> 16517320 .
>
> % echo 1-15120350/16517320|bc -l
> .08457606924125705623
>
> So, compressed mdboxes take 8.5% less space. This was with regular gzip
> compression with default level. With bzip2 -9 compression the difference
> was 10%.
>
> Any thoughts on if 8-10% is significant enough improvement to make
> seeking performance worse? Or perhaps I should just implement both
> ways.. :)

Isn't the real difference even smaller?

15120350/28311552 = .534
16517320/28311552 = .583

So that's just under 5%.

Either way, I'd say go with compressing each mail individually for quick
seeking. Also, if you were compressing the whole file of mails as a single
stream, wouldn't you have to recompress and rewrite the whole file for each
new mail delivered?

Matt
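For what it's worth, both sets of figures can be reproduced if the
uncompressed mailbox really is 27 MiB (28311552 bytes is an assumption taken
from the "27 MB" above); a quick sketch of the arithmetic in Python:

maildir_plain = 28311552   # assumed: 27 MiB, the uncompressed INBOX size
mdbox_gzip    = 15120350   # du -sb of the gzip-compressed mdbox
maildir_cur   = 16517320   # du -sb of Maildir/cur

# Timo's figure: savings of compressed mdbox relative to Maildir/cur
print(1 - mdbox_gzip / maildir_cur)                    # ~0.085 -> 8.5%

# Matt's figures: both layouts measured against the uncompressed 27 MiB
print(mdbox_gzip / maildir_plain)                      # ~0.534
print(maildir_cur / maildir_plain)                     # ~0.583
print((maildir_cur - mdbox_gzip) / maildir_plain)      # ~0.049 -> just under 5%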
Timo Sirainen put forth on 2/5/2010 6:36 PM:
> So, compressed mdboxes take 8.5% less space. This was with regular gzip
> compression with default level. With bzip2 -9 compression the difference
> was 10%.
>
> Any thoughts on if 8-10% is significant enough improvement to make
> seeking performance worse? Or perhaps I should just implement both
> ways.. :)

Given the cost of mechanical storage today (1TB for less than $100 USD) I
can't see why anyone would want to implement compression. The cases I can
think of would be folks using strictly SSD (if there are any), those doing
backups, or very large sites. Then again, I'm thinking most such backup
solutions implement their own compression anyway, so it makes no difference
in that case except possibly LAN/SAN bandwidth in moving compressed vs
uncompressed data.

I would think only really large sites would consider compression. A 10%
space saving across 1 million mailboxen might add up to some significant
storage hardware dollar savings, not to mention the power savings. This is
just a guess, as I've never worked in such an environment. If a projected
infrastructure build-out is calling for a $1 million back end clustered
shared storage array for mailboxen (think NetApp, IBM, SGI), and this
compression cuts your number of required spindles by 10%, that's
potentially a $100,000 savings. In today's economy, folks would be
seriously looking at keeping that $100,000 in their pocketbook.

Very large sites would probably want maximum compression while retaining
maximum performance. You didn't state the CPU burn difference between the
two methods, or the total CPU burn for either method. If one burns 50% CPU
and the other 60% on a loaded system, say 500 concurrent users, the
relative difference is minor, but both are so horrible WRT CPU that no one
would use them. If the relative load is 10% for the first method and 12%
for the other, then I'd say some people would gladly adopt the second,
slightly less efficient method.

--
Stan
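One way to put numbers on the CPU-burn question would be a rough benchmark
along these lines (Python sketch, illustrative only; the Maildir/cur path is
just an example corpus) comparing default gzip/zlib against bzip2 -9 on the
same mails:

import bz2, glob, time, zlib

def cpu_cost(compress, mails):
    # CPU seconds and total compressed size for compressing every mail
    start = time.process_time()
    size = sum(len(compress(m)) for m in mails)
    return time.process_time() - start, size

# example corpus path -- point this at a real maildir to get real numbers
mails = [open(path, "rb").read() for path in glob.glob("Maildir/cur/*")]
plain = sum(len(m) for m in mails)

for name, fn in [("gzip default", lambda m: zlib.compress(m, 6)),
                 ("bzip2 -9",     lambda m: bz2.compress(m, 9))]:
    secs, size = cpu_cost(fn, mails)
    print("%-12s %6.2f CPU s   ratio %.3f" % (name, secs, size / plain))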
ZFS has support for compression on the file system (lzjb | gzip | gzip-N |
zle). gzip eats CPU even at levels as low as 3. From the zfs man page:

    The lzjb compression algorithm is optimized for performance while
    providing decent data compression. You can specify the gzip level by
    using the value gzip-N, where N is an integer from 1 (fastest) to 9
    (best compression ratio).

So it comes down to the CPU cost of supporting compression (gzip) vs the
disk cost if mailboxes are not compressed.

The other interesting feature ZFS (OpenSolaris) has is de-duplication: it
detects whether the block about to be written is the same as another block,
and instead of writing it again it puts in a pointer to the already existing
block, saving disk space. The only problem with this is that you would need
to filter out the headers, which are different even when the body is the
same.

Cheers
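A small sketch of that header/dedup concern (nothing ZFS-specific; the 4 KB
block size and the sample headers are made up for illustration): with
fixed-size block dedup, headers of different lengths shift every following
block, so two mails with identical bodies share no blocks unless the body is
stored block-aligned or separately.

import hashlib, os

BLOCK = 4096  # illustrative block size; ZFS dedups at its recordsize (often larger)

def block_hashes(data):
    # hashes of fixed-size blocks, roughly how block-level dedup sees the data
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

body = os.urandom(20000)                       # identical body in both mails
header_a = b"To: alice@example.com\r\nMessage-Id: <1@example>\r\n\r\n"
header_b = b"To: bob@example.org\r\nMessage-Id: <2@example>\r\n\r\n"

# headers inline: the different-length headers shift every body block,
# so nothing dedups even though the bodies are byte-identical
print(len(block_hashes(header_a + body) & block_hashes(header_b + body)))   # 0

# if the body were stored block-aligned (header padded out to a block
# boundary), the body blocks would line up and dedup
pad = lambda h: h + b"\0" * (-len(h) % BLOCK)
print(len(block_hashes(pad(header_a) + body) &
          block_hashes(pad(header_b) + body)))                              # 5: all body blocks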