When concatenating mbox files as described at
https://xaizek.github.io/2013-03-30/merge-mbox-mailboxes/, you end up
with an 'unsorted' mbox file. Is this going to be a problem, especially
when the files are large (more than 2 GB) and new emails will be written
to the result?

The email client nicely sorts the message "foldera 5 last" from folder A
as last, but of course the mbox file itself is not ordered like this.
Is there a better solution for merging files?

Having a folder A with messages:

From  Subject         Received  Size
test  foldera 1       16:18     665 B
test  foldera 2       16:18     665 B
test  foldera 3       16:18     665 B
test  foldera 4       16:18     665 B
test  foldera 5 last  16:29     670 B

and a folder B with messages:

From  Subject    Received  Size
test  folderb 1  16:23     665 B
test  folderb 2  16:24     665 B
test  folderb 3  16:24     665 B
test  folderb 4  16:24     665 B
test  folderb 5  16:24     665 B

[@ mail] cat .foldera .folderb > .folderc

gives a folder C with messages:

From                       Subject                                            Received  Size
test                       foldera 1                                          16:18     665 B
test                       foldera 2                                          16:18     665 B
test                       foldera 3                                          16:18     665 B
test                       foldera 4                                          16:18     665 B
Mail System Internal Data  DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA  16:19     454 B
test                       folderb 1                                          16:23     665 B
test                       folderb 2                                          16:24     665 B
test                       folderb 3                                          16:24     665 B
test                       folderb 4                                          16:24     665 B
test                       folderb 5                                          16:24     665 B
test                       foldera 5 last                                     16:29     670 B
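The listing for folder C is what the mail client displays after sorting by received time. To see the order in which the messages are physically stored in the file, one option is to print the envelope ("From ") lines. This is only a minimal sketch: `list_envelopes` is a hypothetical helper name, and it assumes that body lines beginning with "From " have been escaped as ">From ", as mbox writers normally do.

```sh
#!/bin/sh
# Print the envelope ("From ") line of every message in an mbox file,
# in the order the messages are physically stored in that file.
# Assumption: body lines starting with "From " are escaped as ">From ";
# unescaped ones would be reported as if they were envelope lines.
list_envelopes() {
    grep '^From ' "$1"
}

# Example, using the file name from the post:
#   list_envelopes .folderc
```

After a plain `cat`, the envelope lines of folder A would all be expected before those of folder B, regardless of their timestamps.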
On Thu, 29 Nov 2018, Marc Roos wrote:

> When concatenating mbox files like described here
> https://xaizek.github.io/2013-03-30/merge-mbox-mailboxes/. You will end
> up with an 'unsorted' mbox file. Is this going to be a problem
> esspecially when they are large >2GB's and new emails will be written to
> it?

I don't think it will be a problem, but you might have to remove
some headers (like the UUID header?). However, I think dovecot
ought to be able to cope with it anyway and regenerate the indices.

> The email client nicely sorts the message from folder A "foldera 5 last"
> as last, but of course the mbox is not like this.
> Is there a better solution for merging files?

As noted, the time order gets scrambled. Using your mail reader to
get it back in time order requires sorting, an intensive operation.

It just so happens I've done this recently with a (GNU) awk script that
merges multiple mailboxes into one mailbox, preserving time order.
It assumes that each message starts with a From envelope header and that
the messages within each mailbox are already sorted by timestamp, e.g.

    From mickey at disney.com Thu Nov 25 18:45:37 2018
    From mickey at disney.com Thu Nov 25 18:45:37 2018 -0400

You're welcome to use it. There's probably a more elegant way with
doveadm/dsync. Using a mail reader to sort the merged mailbox, then
drag/drop/copy everything into a final mailbox could also work.

Joseph Tam <jtam.home at gmail.com>

#!/bin/sh
#
# Merge multiple mbox's into one assuming that each message
# starts with /^From .* {year}$/ and they are sorted by time.
# The merged mailbox is written to stdout, progress to stderr.
#
# -- Joseph Tam <jtam.home at gmail.com>
#

[ x"$*" = x ] && {
    echo "Usage: $0 mbox-file ..."
    exit 1
}

gawk -v boxes="$*" </dev/null '
# Convert the date at the end of an envelope line to epoch seconds.
# The time zone offset, if present, is skipped and not applied.
function Tstamp(header) {
    # Format: Jan 22 21:00:48 2018 -0700
    #         12345678901234567890123456
    l = length(header)
    # Accept both "-" and "+" offsets (the original tested "-" only).
    spec = (substr(header,l-4,1)~/^[-+]$/)?
        substr(header,l-25,20) : substr(header,l-19,20)
    spec = substr(spec,17,4) " " ym[substr(spec,1,3)] substr(spec,4,3) \
        " " substr(spec,8,2) " " substr(spec,11,2) " " substr(spec,14,2)
    return int(mktime(spec))
}

# Print the pending message of mailbox i, then read ahead to the
# envelope line of its next message and remember that timestamp.
function DumpMessage(i) {
    if (header[i]!="") {
        printf("%s\n",header[i])
    }
    while ((getline x <mbox[i])>0) {
        if (x~/^From .*[0-9][0-9][0-9][0-9]$/) {
            stamp[i] = Tstamp(x)
            header[i] = x
            printf("%s => [%d] %d\n",header[i],i,stamp[i]) >"/dev/stderr"
            return
        }
        print x
    }

    # End of file: park this mailbox beyond any real timestamp
    printf("EOF[%d]\n",i) >"/dev/stderr"
    stamp[i] = 2147483647
    header[i] = ""
}

BEGIN {
    ym["Jan"] = "01"; ym["Feb"] = "02"; ym["Mar"] = "03"; ym["Apr"] = "04"
    ym["May"] = "05"; ym["Jun"] = "06"; ym["Jul"] = "07"; ym["Aug"] = "08"
    ym["Sep"] = "09"; ym["Oct"] = "10"; ym["Nov"] = "11"; ym["Dec"] = "12"

    n = split(boxes,mbox," ")

    # Read first header line from all boxes
    for (i=1; i<=n; i++) {
        DumpMessage(i)
    }

    # Loop until all mailboxes read
    while (1) {
        t = 2147483646

        # Find next message
        for (i=1; i<=n; i++) {
            if (stamp[i]<=t) {t=stamp[i]; j=i;}
        }

        # If no more messages, quit
        if (t==2147483646) exit

        # Dump next message from mbox[j]
        DumpMessage(j)
    }
}'
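Since the script relies on each input mailbox already being in time order, it may be worth checking that assumption first. The following is only a sketch and not part of the script: `envelope_key` is a hypothetical helper that uses POSIX awk fields instead of fixed `substr` offsets, and it ignores the time zone offset, as the script does.

```sh
#!/bin/sh
# Read mbox envelope lines on stdin and print one sortable key per line
# in the form "YYYY MM DD HH MM SS". A trailing numeric time zone such
# as -0400 or +0100 is skipped, not applied.
envelope_key() {
    awk '
    BEGIN {
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
        for (i = 1; i <= n; i++) month[name[i]] = sprintf("%02d", i)
    }
    /^From / {
        last = NF
        if ($last ~ /^[-+][0-9][0-9][0-9][0-9]$/) last--
        split($(last - 1), hms, ":")
        printf "%s %s %02d %s %s %s\n", $last, month[$(last - 3)], \
            $(last - 2), hms[1], hms[2], hms[3]
    }'
}

# Example: exits non-zero if the envelope dates of .foldera are out of order
#   grep '^From ' .foldera | envelope_key | sort -c
```

Because every field is zero-padded, plain lexical order of the keys matches chronological order, so `sort -c` can report the first out-of-order envelope line.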
aside from cat?

On Thu, Nov 29, 2018 at 03:07:58PM -0800, Joseph Tam wrote:
> On Thu, 29 Nov 2018, Marc Roos wrote:
>
> > The email client nicely sorts the message from folder A "foldera 5 last"
> > as last, but of course the mbox is not like this.
> > Is there a better solution for merging files?
>
> As noted, the time order gets scrambled -- using your mail reader to
> get it back in time order requires sorting, an intensive operation.
>
> It just so happen I've done this recently with a (GNU) awk script that
> merges multiple mailboxes into one mailbox, preserving time order.
>
> [...]

-- 
So many immigrant groups have swept through our town
that Brooklyn, like Atlantis, reaches mythological
proportions in the mind of the world - RI Safir 1998
http://www.mrbrklyn.com

DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
http://www.nylxs.com - Leadership Development in Free Software
http://www2.mrbrklyn.com/resources - Unpublished Archive
http://www.coinhangout.com - coins!
http://www.brooklyn-living.com

Being so tracked is for FARM ANIMALS and extermination camps,
but incompatible with living as a free human being. -RI Safir 2013