thr3ads.net - dovecot - Duplicate-ish email search? [Mar 2022]

@lbutlr

2022-Mar-02 19:00 UTC

Duplicate-ish email search?

I'm mulling over writing some code to find emails in a maildir that are
duplicates, ish. That is to say that sometimes the same message doesn't
quite show up as an exact match. Like some ad company may send you three
identical messages, except they aren't actually EXACTLY identical: the
Message-IDs are different, and maybe the quoted part of the To address is
different, so normal duplicate finders fail to find them.

Before I start, is this a solved problem?


-- 
I thought that they were angels, but to my surprise, we climbed
	aboard their starship, we headed for the skies.

Michael Slusarz

2022-Mar-02 20:40 UTC

> On 03/02/2022 12:00 PM @lbutlr <kremels at kreme.com> wrote:
> 
> I'm mulling over writing some code to find emails in a maildir that are
> duplicates, ish. That is to say that sometimes the same message doesn't
> quite show up as an exact match. Like some ad company may send you three
> identical messages, except they aren't actually EXACTLY identical: the
> Message-IDs are different, and maybe the quoted part of the To address is
> different, so normal duplicate finders fail to find them.
> 
> Before I start, is this a solved problem?

Besides the fact that you've pretty much described how modern AV/AS systems
work? :)

Joking aside, isn't this what Bayesian classification is essentially doing? 
Comparing the similarities between text (via tokens) in messages and then using
Bayesian probabilities to emphasize certain terms/relationships?  Although this
requires training and is not comparing any messages directly...

Maybe some form of perceptual hashing (or similar idea) would work?  E.g.
http://phash.org/
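
To make the "or similar idea" part concrete, here is a minimal sketch of a
text analogue. It is not pHash itself; it swaps in word shingling with
Jaccard similarity, a simple way to score how much two message bodies
overlap. The function names and the shingle size k=3 are assumptions chosen
for illustration, and any "same message" cutoff would need tuning on real
mail.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles from text, case-insensitively."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short to shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Score two texts from 0.0 (no shared shingles) to 1.0 (identical sets)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Two copies of an ad that differ only in the greeting name share most of
their shingles and score high, while unrelated messages score near zero.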

michael

Joseph Tam

2022-Mar-03 23:55 UTC

On Wed, 2 Mar 2022, @lbutlr wrote:
> I'm mulling over writing some code to find emails in a maildir that are
> duplicates, ish.  That is to say that sometimes the same message
> doesn't quite show up as an exact match.  Like some ad company may send
> you three identical messages, except they aren't actually EXACTLY
> identical: the Message-IDs are different, and maybe the quoted part of
> the To address is different, so normal duplicate finders fail to find them.
>
> Before I start, is this a solved problem?

Not perfectly, and maybe impossible in the general sense.

If you've ever had to anonymize mail by comparing samples sent by a
mailing list provider to two different recipients, you can see various
hashes and identifiers that show up in tracking headers and URLs.
Add customized name labels (e.g. "Dear Alfred P. Sloan") or
individual-specific information, and it becomes a complex question of
how different is different.

If you make some simplifying assumptions (e.g. exact same message body,
same headers for From, sending network or IP, time range, and Subject),
you can do a fairly good job.
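
A minimal sketch of that simplified approach, using only the Python
standard library. It groups messages whose From, Subject, and
whitespace-normalized text/plain body match exactly, and ignores
Message-ID and To. The function names are hypothetical, and the
sending-network and time-range checks are left out here; they would need
Received and Date parsing on top.

```python
import hashlib
import mailbox
from collections import defaultdict
from email import message_from_string


def body_text(msg):
    """Return the first text/plain part as text, or "" if there is none."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the header
                return payload.decode("utf-8", errors="replace")
    return ""


def dedup_key(msg):
    """Build a key from From, Subject, and a hash of the normalized body."""
    sender = str(msg.get("From", "")).strip().lower()
    subject = " ".join(str(msg.get("Subject", "")).split()).lower()
    body = " ".join(body_text(msg).split())  # collapse whitespace runs
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return (sender, subject, digest)


def group_duplicates(messages):
    """Take (key, message) pairs; return lists of keys that share a dedup_key."""
    groups = defaultdict(list)
    for key, msg in messages:
        groups[dedup_key(msg)].append(key)
    return [keys for keys in groups.values() if len(keys) > 1]


def scan_maildir(path):
    """Group likely duplicates in the maildir at path (must already exist)."""
    return group_duplicates(mailbox.Maildir(path, create=False).items())
```

Calling scan_maildir() on a maildir path returns lists of maildir keys, one
list per group of likely duplicates. Per-recipient tracking URLs or name
labels in the body would defeat the exact hash, which is where the "how
different is different" question comes back.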

Joseph Tam <jtam.home at gmail.com>
