Jiri Bourek
2014-Mar-28 16:38 UTC
[Dovecot] Deduplicate not processing all messages - bug?
Hello,

I'm trying to create automated backup recovery using "doveadm import" and "doveadm deduplicate". During testing I noticed that deduplicate deletes only some of the duplicates and has to be called multiple times to find them all. Here's what I've been trying (in shell commands):

First, expunge the inbox (the end result is the same even if you delete only some messages):

# doveadm expunge -u test mailbox inbox all
# ls /home/mailboxes/test/cur | wc -l
0

Then import data from the backup - twice, so duplicates are created (again, if you don't delete all messages and call import only once, the resulting behaviour is the same):

# doveadm import -u test maildir:/home/test "" mailbox INBOX
# doveadm import -u test maildir:/home/test "" mailbox INBOX
# ls /home/mailboxes/test/cur | wc -l
1046

Then try to deduplicate:

# doveadm deduplicate -u test mailbox INBOX
# ls /home/mailboxes/test/cur | wc -l
1040

And again:

# doveadm deduplicate -u test mailbox INBOX
# ls /home/mailboxes/test/cur | wc -l
1029

And so on, until the message count holds at 523.

Each repetition removes 10 - 30 duplicates, so eventually all duplicates are removed if "doveadm deduplicate" is called enough times in a row. I also noticed that when I repeat the test (import the backup again and call deduplicate), the steps - how many messages are removed at one time - are the same. That is, I start with 1046 messages in the mailbox, after the first run there are 1040, then 1029 and so on. My guess would be that the behaviour depends on what is stored in the mailbox, but that's pretty much all I can figure out on my own at this time.

My question is: is this intended behaviour, i.e. are you supposed to run doveadm deduplicate as long as the number of messages in the mailbox keeps changing? Or is it a bug? I tried to Google for the answer but had no luck, so thanks for any answers.

Tested on Dovecot versions 2.2.9 and 2.2.12 (both from Debian repositories).
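For illustration, the "run it until the count stops changing" workaround described above could be scripted roughly as follows. This is only a sketch: the function names `count_messages` and `dedup_until_stable` are made up for this example, the Maildir path is passed in by the caller, and the loop is only meaningful if no new mail is delivered while it runs.

```shell
# Count the files in a Maildir "cur" directory, as done with "ls | wc -l" above.
count_messages() {
    n=$(ls "$1" | wc -l)
    echo "$((n))"   # arithmetic expansion strips any padding wc may add
}

# Hypothetical helper: call "doveadm deduplicate" repeatedly until the
# message count stays the same between two consecutive runs.
# $1 = user, $2 = path to the Maildir "cur" directory
dedup_until_stable() {
    user=$1
    dir=$2
    prev=-1
    cur=$(count_messages "$dir")
    while [ "$cur" -ne "$prev" ]; do
        prev=$cur
        doveadm deduplicate -u "$user" mailbox INBOX
        cur=$(count_messages "$dir")
    done
}
```

With the setup from the test above, it would be called as `dedup_until_stable test /home/mailboxes/test/cur`.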
Jiří Bourek
2014-Apr-01 08:48 UTC
[Dovecot] Deduplicate not processing all messages - bug?
Judging from the lack of replies, I guess either not many people use the feature, or it's supposed to work this way.

After a bit more research I realized that repeated calls of doveadm deduplicate won't be very reliable - the cycle is prone to being interrupted prematurely in a busy mailbox (if deduplicate removes x messages and x new messages arrive in the mailbox in the meantime, it looks as if nothing was done and the cycle stops).

Solving this requires knowing more details about the contents of the mailbox, which leads to avoiding deduplicate altogether. I'm thinking along the lines of:

- using doveadm fetch to get the guid, date.saved, mailbox-guid and uid fields
- finding duplicates by guid
- preserving the message with the oldest date.saved
- using doveadm expunge on the rest, selected by mailbox-guid and uid

I'll probably be duplicating most of doveadm deduplicate, but in the end it should prove more reliable.

Just my 2 cents in case someone else runs into this issue.
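The fetch/expunge approach outlined in this reply could be sketched like this. It is not a tested replacement for doveadm deduplicate, only an illustration under several assumptions: the function names `list_duplicates` and `expunge_duplicates` are invented for this example, `doveadm -f tab fetch` is assumed to print one tab-separated record per message (possibly preceded by a header row), date.saved is assumed to be printed in a "YYYY-MM-DD HH:MM:SS" form that sorts correctly as text, and both copies of an imported message are assumed to share the same guid.

```shell
# Read tab-separated "guid <TAB> date.saved <TAB> mailbox-guid <TAB> uid"
# records on stdin and print "mailbox-guid <TAB> uid" for every copy except
# the one with the oldest date.saved within each guid.
list_duplicates() {
    # Drop a header row if there is one, sort by guid and then by date.saved,
    # and print every row after the first one seen for each guid.
    awk -F '\t' '$1 != "guid"' |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2 |
        awk -F '\t' 'seen[$1]++ { print $3 "\t" $4 }'
}

# Hypothetical wrapper: expunge every duplicate found in INBOX for one user.
# $1 = user
expunge_duplicates() {
    user=$1
    doveadm -f tab fetch -u "$user" 'guid date.saved mailbox-guid uid' mailbox INBOX all |
        list_duplicates |
        while IFS="$(printf '\t')" read -r mailbox_guid uid; do
            doveadm expunge -u "$user" mailbox-guid "$mailbox_guid" uid "$uid"
        done
}
```

Before expunging anything, it is probably worth running only the fetch and `list_duplicates` part and checking the output by hand. Because each message is addressed by mailbox-guid and uid, mail delivered while the script runs should not be touched.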