Hi Guys,

I am wondering about mail deduplication. I am looking into the possibility of separating out all of the message bodies with multiple parts inside mail that is received from `dovecot` and hashing them all.

The idea is that by hashing all of the parts inside the email, I will be able to ensure that each part of the email is saved only once.

This means that attachments and common parts of the body will only be saved once inside the storage.

How achievable would this be with the current state of dovecot? Would it even be worth doing?

Thanks,
Tim
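The per-part hashing idea can be illustrated with a small content-addressed store. This is only a sketch in bash, not how dovecot stores mail: `store_part` and the store directory are hypothetical names, it assumes GNU `sha1sum` is available, and it assumes each MIME part has already been extracted to its own file.

```shell
#!/bin/bash
# store_part <file> <store_dir> (hypothetical helper)
# Saves <file> in <store_dir> under its SHA-1 digest and prints the digest.
# A part whose content is already in the store is not written a second time.
store_part() {
    local file=$1 store=$2 digest
    # Hash via stdin so the output is always "<digest>  -".
    digest=$(sha1sum < "$file" | cut -d ' ' -f 1)
    mkdir -p "$store"
    # Copy only if no object with this digest exists yet.
    [ -e "$store/$digest" ] || cp "$file" "$store/$digest"
    echo "$digest"
}

# Example (paths are made up):
#   store_part /tmp/parts/report.pdf /tmp/part-store
```

A message would then keep only the list of digests of its parts, which is also why every read needs a lookup and reassembly step.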
On 30/04/13 03:28, Tim Groeneveld wrote:
> Hi Guys,
>
> I am wondering about mail deduplication. I am looking into the possibility
> of separating out all of the message bodies with multiple parts inside mail
> that is received from `dovecot` and hashing them all.
>
> The idea is that by hashing all of the parts inside the email, I will be
> able to ensure that each part of the email will only be saved once.
>
> This means that attachments & common parts of the body will only be
> saved once inside the storage.
>
> How achievable would this be with the current state of dovecot? Would it
> even be worth doing?

I asked the same question recently. As Timo responded at http://kevat.dovecot.org/list/dovecot/2013-March/089072.html, it seems that this feature is production stable in recent versions of dovecot.

And I think it is worth it. My estimate (based on only about 10 users in my organization, so it is not accurate) is that you can save more than 30% of total mail storage.

To configure it you need to use these options:

* mail_attachment_dir
* mail_attachment_min_size
* mail_attachment_fs
* mail_attachment_hash

-- 
Angel L. Mateo Martínez
Sección de Telemática
Área de Tecnologías de la Información y las Comunicaciones Aplicadas (ATICA)
http://www.um.es/atica
Tfo: 868889150
Fax: 868888337
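As a rough illustration of how those four settings fit together, here is a minimal configuration sketch. The directory path is a made-up example. The other three values are believed to be the defaults given in the Dovecot 2.x documentation, and single-instance attachment storage is documented for the sdbox and mdbox formats, so check all of this against the documentation for your version before relying on it.

```conf
# Sketch only: the path below is hypothetical.
# Directory where attachments are stored outside the mail files.
mail_attachment_dir = /var/vmail/attachments

# Attachments smaller than this stay inline in the message.
mail_attachment_min_size = 128k

# "sis posix" enables single-instance storage on a POSIX filesystem.
mail_attachment_fs = sis posix

# Hash used to detect identical attachments.
mail_attachment_hash = %{sha1}
```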
On 30.4.2013 03:28, Tim Groeneveld wrote:
> Hi Guys,
>
> I am wondering about mail deduplication. I am looking into the possibility
> of separating out all of the message bodies with multiple parts inside mail
> that is received from `dovecot` and hashing them all.
>
> The idea is that by hashing all of the parts inside the email, I will be
> able to ensure that each part of the email will only be saved once.
>
> This means that attachments & common parts of the body will only be
> saved once inside the storage.
>
> How achievable would this be with the current state of dovecot? Would it
> even be worth doing?
>
> Thanks,
> Tim

Hi Tim,

thank you for your question. I am pleased to be able to help. I had the same problem in the past and there was no solution, so I wrote a script that computes MD5 hashes from the received date and message body. The script then compares the MD5 hashes and deletes duplicate messages. It uses doveadm for message manipulation and openssl for computing the MD5 hashes. Deduplication can be run across all of a user's mailboxes.

The syntax is dedup <user> <mailbox>, for example: dedup name@domain.cz INBOX. If you want to dedup all mailboxes, enter -A instead of the mailbox name: dedup name@domain.cz -A.

The script is attached. I made it for my own use, so it isn't foolproof. My advice: work with care and make a backup ;-)

Good luck

#!/bin/bash
# Remove duplicate messages from a mailbox.
# Uses bash arrays, so it needs bash rather than plain sh.

dedup_mailbox () {
    local user=$1 box=$2 i j x a b
    local uids=( $(doveadm -f flow fetch -u "$user" "uid" mailbox "$box" all | cut -f 2 -d =) )
    if [ ${#uids[@]} -eq 0 ]; then
        echo " No messages"
        return
    elif [ ${#uids[@]} -eq 1 ]; then
        echo " Only one message"
        return
    fi

    # Build "md5,uid" entries from the received date and body of each message.
    local md5s_u=()
    for (( i=0; i<${#uids[@]}; i++ )); do
        # awk keeps only the digest in case openssl prefixes it with "(stdin)= ".
        md5s_u[$i]="$(doveadm -f flow fetch -u "$user" "date.received body" mailbox "$box" uid "${uids[$i]}" | openssl md5 | awk '{print $NF}'),${uids[$i]}"
        echo -en " Compute hashes: $i/${#uids[@]}(${md5s_u[$i]})\r"
    done
    echo -en "                                                            \r"

    # Sort so that entries with equal hashes become adjacent.
    local md5s=( $(printf '%s\n' "${md5s_u[@]}" | sort) )
    x=0
    i=0
    while [ $i -lt $((${#md5s[@]} - 1)) ]; do
        a=${md5s[$i]%%,*}
        # Expunge every following entry that has the same hash as entry i.
        for (( j=i+1; j<${#md5s[@]}; j++ )); do
            b=${md5s[$j]%%,*}
            if [ "$a" == "$b" ]; then
                doveadm expunge -u "$user" mailbox "$box" uid "${md5s[$j]##*,}"
                x=$((x + 1))
            else
                break
            fi
        done
        echo -en " Expunged $x message(s) from $((j + 1))/${#md5s[@]}\r"
        i=$j
    done
    echo ""
}

if [ "$2" == "-A" ]; then
    # One mailbox name per line; mapfile keeps names containing spaces intact.
    mapfile -t boxes < <(doveadm mailbox list -u "$1")
else
    boxes[0]=$2
fi

for (( k=0; k<${#boxes[@]}; k++ )); do
    echo "${boxes[$k]}:"
    dedup_mailbox "$1" "${boxes[$k]}"
done

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dedup
Type: text/x-shellscript
Size: 1538 bytes
Desc: not available
URL: <http://dovecot.org/pipermail/dovecot/attachments/20130430/35f88a5e/attachment.bin>
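The duplicate-detection step of the script can also be tried without touching a mailbox. The sketch below is an alternative formulation, not the attached script: the hypothetical helper `duplicate_uids` reads `hash,uid` lines on stdin and prints the UID of every entry whose hash has already been seen, so the first message with each hash is kept.

```shell
#!/bin/bash
# duplicate_uids (hypothetical helper)
# Reads "hash,uid" lines on stdin and prints the UIDs that repeat an
# earlier hash. The first UID seen for each hash is never printed.
duplicate_uids() {
    # seen[$1]++ is 0 (false) the first time a hash appears, non-zero after.
    awk -F, 'seen[$1]++ { print $2 }'
}

# Example:
#   printf '%s\n' 'aaa,1' 'bbb,2' 'aaa,3' | duplicate_uids
```

Printing the UIDs first and reviewing them before running `doveadm expunge` gives a simple dry run.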
Tim,

oops, I read your message again, more carefully, and I see my mistake. You don't want to delete whole duplicate messages, only to deduplicate their parts. Sorry for my reply; it was quite off topic.

Radek

On 30.4.2013 03:28, Tim Groeneveld wrote:
> Hi Guys,
>
> I am wondering about mail deduplication. I am looking into the possibility
> of separating out all of the message bodies with multiple parts inside mail
> that is received from `dovecot` and hashing them all.
>
> The idea is that by hashing all of the parts inside the email, I will be
> able to ensure that each part of the email will only be saved once.
>
> This means that attachments & common parts of the body will only be
> saved once inside the storage.
>
> How achievable would this be with the current state of dovecot? Would it
> even be worth doing?
>
> Thanks,
> Tim
On 2013-04-30 2:05 AM, Angel L. Mateo <amateo at um.es> wrote:
> On 30/04/13 03:28, Tim Groeneveld wrote:
>> I am wondering about mail deduplication. I am looking into the possibility
>> of separating out all of the message bodies with multiple parts inside mail
>> that is received from `dovecot` and hashing them all.
>>
>> The idea is that by hashing all of the parts inside the email, I will be
>> able to ensure that each part of the email will only be saved once.
>>
>> This means that attachments & common parts of the body will only be
>> saved once inside the storage.
>>
>> How achievable would this be with the current state of dovecot? Would it
>> even be worth doing?
>
> I asked the same question recently. As Timo responded at
> http://kevat.dovecot.org/list/dovecot/2013-March/089072.html it seems
> that this feature is production stable in recent versions of dovecot.
>
> And I think it is worth it. My estimate (based on only about 10 users
> in my organization, so it is not accurate) is that you can save more
> than 30% of total mail storage.
>
> To configure it you need to use options:
>
> * mail_attachment_dir
> * mail_attachment_min_size
> * mail_attachment_fs
> * mail_attachment_hash

This only dedupes attachments - which, in my opinion, is the only part of deduplicating email that is really worth it.

Yes, you might be able to recapture a minuscule amount of storage space as a percentage of total mailstore size by deduping the other MIME parts (headers, body, etc.), but the complexity of doing this for each message part is, in my opinion, overkill, way too error-prone for my comfort level, and just not enough bang for the buck.

Deduping attachments, on the other hand, can have a dramatic impact (depending on your system usage and requirements), and is reliable enough to make it well worth it for some.

I am expecting at least a 40-60% reduction in our storage when I implement this on my new server soon (will report back once it is completed). We use a lot of large attachments, and our idiot users save multiple copies, resending the same one sometimes many times to different people (so, maybe 3 or sometimes even 10+ copies of the same 20MB attachment in their Sent folder).

Anyway, that's my .02

-- 
Best regards,

Charles
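A rough way to check such estimates before enabling anything is to hash a sample of attachment files and compare the total bytes with the bytes of unique content. This is a sketch under assumptions: `estimate_savings` is a hypothetical helper, it needs bash 4 or later (associative arrays) and GNU `sha1sum`, and it assumes the attachments have already been saved as individual files under one directory.

```shell
#!/bin/bash
# estimate_savings <dir> (hypothetical helper)
# Prints "<total_bytes> <unique_bytes>" for all regular files under <dir>,
# where unique_bytes counts each distinct content (by SHA-1) only once.
estimate_savings() {
    local dir=$1 file digest size total=0 unique=0
    declare -A seen    # digest -> 1; local to this function
    while IFS= read -r -d '' file; do
        digest=$(sha1sum < "$file" | cut -d ' ' -f 1)
        size=$(wc -c < "$file")
        total=$((total + size))
        if [ -z "${seen[$digest]}" ]; then
            seen[$digest]=1
            unique=$((unique + size))
        fi
    done < <(find "$dir" -type f -print0)
    echo "$total $unique"
}

# Example (path is made up):
#   estimate_savings /tmp/attachment-sample
```

For a folder holding ten copies of the same attachment and nothing else, the total would be ten times the unique size, which is the kind of case where deduplication pays off most.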
----- Original Message -----
> This only dedupes attachments - which, in my opinion, is the only
> part of deduplicating email that is really worth it.
>
> [snip]
>
> I am expecting at least a 40-60% reduction in our storage when I
> implement this on my new server soon.

Thanks, guys, for all of your messages.

Maybe I was getting too excited about saving storage everywhere possible. After thinking about it a little more, I have decided that recombining the message parts to send them to the client would be too intensive and would add latency when retrieving emails.

Regards,
Tim