thr3ads.net - dovecot - Processing incoming mail efficiently [Jan 2021]

Ron Garret

2021-Jan-30 18:11 UTC

Processing incoming mail efficiently

Sorry, I left out a few details.

The filter actually has two parts, one of which is on the MTA side (a milter). 
That part does things like keep track of outgoing mail from authorized users so
that it knows when an incoming message has a subject line that a user has sent
out or is from a sender that a user has previously sent a message to.  Those are
two very reliable ham signals.
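
As a rough illustration of the two ham signals described above, here is a minimal Python sketch. It is not the actual milter: the class name `OutgoingLog`, the in-memory sets, and the subject normalization (lower-casing and stripping "Re:"/"Fwd:" prefixes) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


def normalize_subject(subject: str) -> str:
    """Lower-case a subject and strip leading reply/forward prefixes."""
    s = subject.strip().lower()
    while True:
        for prefix in ("re:", "fwd:", "fw:"):
            if s.startswith(prefix):
                s = s[len(prefix):].strip()
                break
        else:
            # No prefix matched on this pass, so the subject is fully stripped.
            return s


@dataclass
class OutgoingLog:
    """Remembers what authorized users have sent (hypothetical in-memory store)."""
    recipients: set = field(default_factory=set)
    subjects: set = field(default_factory=set)

    def record_outgoing(self, recipient: str, subject: str) -> None:
        self.recipients.add(recipient.strip().lower())
        normalized = normalize_subject(subject)
        if normalized:  # an empty subject carries no signal
            self.subjects.add(normalized)

    def ham_signals(self, sender: str, subject: str) -> list:
        """Return the names of the ham signals that an incoming message triggers."""
        signals = []
        if sender.strip().lower() in self.recipients:
            signals.append("known_correspondent")
        if normalize_subject(subject) in self.subjects:
            signals.append("known_subject")
        return signals
```

A real deployment would presumably persist this state and expire old entries; the sketch only shows the lookup logic.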

The reason there is also a filter on the LDA side is that one of the filtering
strategies I'm using is looking for two messages from two different previously
unknown senders with the same subject received within a few minutes of each
other.  This turns out to be a very reliable spam signal.  But it requires
messages of unknown provenance to be held in temporary storage for a while to
see if another matching message comes in.  If one does, the held message then
needs to be processed as spam after the fact.
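
The matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DuplicateSubjectDetector`, the 300-second window (standing in for "a few minutes"), and the simple lower-cased subject comparison are all hypothetical choices, not the filter's real implementation.

```python
from collections import defaultdict


class DuplicateSubjectDetector:
    """Flags subjects seen from two different unknown senders within a window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds  # assumed value for "a few minutes"
        # subject -> list of (arrival_time, sender, message_key) still tracked
        self.held = defaultdict(list)

    def observe(self, subject, sender, message_key, now):
        """Record a message from an unknown sender.

        Returns the keys of the messages (earlier ones plus this one) that
        should now be treated as spam, or an empty list if nothing matched.
        """
        key = subject.strip().lower()
        sender = sender.strip().lower()
        # Drop tracked entries that have aged out of the window.
        fresh = [e for e in self.held[key] if now - e[0] <= self.window]
        # A match needs the same subject from a *different* sender.
        matched = any(e[1] != sender for e in fresh)
        fresh.append((now, sender, message_key))
        self.held[key] = fresh
        if not matched:
            return []
        return [e[2] for e in fresh]
```

Entries are kept after a match, so a third copy of the same subject reports the earlier keys again; a caller would need to handle that idempotently.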

rg

On Jan 30, 2021, at 9:56 AM, Tom Hendrikx <tom at whyscream.net> wrote:
> 
> 
> On 30-01-2021 17:49, Ron Garret wrote:
>> I've asked a related question on this list before but I now have a much
>> better handle on what I'm doing and I realize that I still don't know the
>> answer, so I'm going to ask this again in a slightly different form.
>> I'm writing a spam filter, so obviously I need to feed incoming mail to
>> it somehow.  The "obvious" way to do this is with a sieve script using the
>> pipe extension.  There are two problems with this:
>> 1.  This will always pipe the entire file no matter how big it is.  The
>> filter will often not need to process the body of the message, only the
>> headers, or only the first part of a multipart MIME message.  Is there any
>> way to allow my filter to open the file in which the message is stored
>> rather than piping it a copy of the message?
>> 2.  Once the filter has processed the message and decided if it's spam
>> it still needs to move the message to the appropriate folder (INBOX or
>> Junk).  To do this it needs to somehow correlate the *content* of the
>> message that was piped to it with the UID of the message that needs to be
>> moved.  One way to do this is to pull out the message-id header and then
>> use doveadm to find the file containing the message with that message-id,
>> but there are two problems with this.  First, not all messages have
>> message-ids.  I can work around this by adding my own message-id to
>> messages that don't already have them, but this just feels wrong.  And
>> second, unless dovecot keeps an index of message-ids (does it?) then this
>> will be horribly inefficient because it will have to essentially grep for
>> the message id every time I want to move a message.  So it seems like
>> there has to be a better way, but I can't think of what that would be.
> 
> Normally the flow is a bit different:
> 
> You configure the spam/content filter in your MTA (for instance SMTP-proxy,
> pre-queue, milter or post-queue content filter). The main benefit of doing
> this type of work in the MTA is that you have the ability to reject blatant
> spam messages during the SMTP stage. This means that you don't have to store
> the spam at all: you simply tell the sending server that you don't want to
> accept the message, and the sending server will have to deal with that
> decision (e.g. by sending a non-delivery notice to the sender).
> 
> The spam filter will add headers to the incoming message. If you decide to
> accept it, you can configure Sieve to deliver the message to the Inbox or the
> Junk folder. A nice implementation is the spamtest/virustest extension (see
> https://doc.dovecot.org/configuration_manual/sieve/extensions/spamtest_virustest/),
> but you can of course wrangle your own sieve recipes.
> 
> Spam scanning during the delivery phase (e.g. with a sieve filter) is less
> common because it has a few downsides.
> 
> So to answer your questions:
> 
> 1. Your content filter can be a spam filter, but it might also be an
> antivirus scanner. The latter is of course very interested in the complete
> e-mail including all attachments. So most setups try to send the complete
> message. There are also implementations that ignore messages with a size
> above a certain threshold, or others which just ignore the data after a
> certain threshold. What filter are you trying to implement? Something off
> the shelf, or a homebrewed one? Why is it so hard to consume the whole
> message? Please explain :)
> 
> 2. The normal flow is a bit different (as described above), but in general:
> the spam filter decides. Some (existing) filters take the whole message from
> the MTA, add headers and re-inject the message.
> Other filters use a mechanism (e.g. the milter protocol) which allows them to
> consume only a part of the message, and in response they instruct the MTA to
> add the result headers. This means that the filter must support the MTA's
> protocol, but it doesn't have to take care of re-delivering the message.
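
To illustrate the idea of a filter that consumes only part of a message, here is a minimal Python sketch that reads a message from a binary stream and stops at the end of the header section. The function name `read_headers_only` and the 64 KiB cap are assumptions for the example. Whether the program on the writing end of a pipe tolerates a reader that stops early is something to verify for the specific setup.

```python
from email.parser import BytesHeaderParser


def read_headers_only(stream, max_header_bytes=64 * 1024):
    """Read from a binary stream up to the blank line that ends the headers.

    Returns a Message object holding only the header fields. This function
    stops iterating at the blank line and does not consume the body.
    max_header_bytes is an arbitrary safety cap (an assumption).
    """
    chunks = []
    size = 0
    for line in stream:
        if line in (b"\n", b"\r\n"):
            break  # blank line: end of the header section
        size += len(line)
        if size > max_header_bytes:
            break  # give up on oversized headers; parse what we have
        chunks.append(line)
    return BytesHeaderParser().parsebytes(b"".join(chunks))
```

In a filter fed through a pipe, the stream would typically be `sys.stdin.buffer`, and fields can then be read with `msg["Subject"]` or `msg["From"]`.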
> 
> We need to know about the actual problem you're trying to solve. It
> sounds a lot like you're trying to reinvent things that have been solved many
> times before. Please give a broader explanation of your specific problem and
> we can give you better advice :)
> 
> Kind regards,
> 
> 	Tom

Tom Hendrikx

2021-Jan-30 19:54 UTC

Processing incoming mail efficiently

On 30-01-2021 19:11, Ron Garret wrote:
> Sorry, I left out a few details.
>
> The filter actually has two parts, one of which is on the MTA side (a
> milter).  That part does things like keep track of outgoing mail from
> authorized users so that it knows when an incoming message has a subject
> line that a user has sent out or is from a sender that a user has
> previously sent a message to.  Those are two very reliable ham signals.
>
> The reason there is also a filter on the LDA side is that one of the
> filtering strategies I'm using is looking for two messages from two
> different previously unknown senders with the same subject received within
> a few minutes of each other.  This turns out to be a very reliable spam
> signal.  But it requires messages of unknown provenance to be held in
> temporary storage for a while to see if another matching message comes in.
> If one does, the held message then needs to be processed as spam after the
> fact.
> 
If you don't want to deliver the message to the inbox of the recipient, you
should just do that: don't deliver it. Put it in some quarantine, and
when you're sure you want it to end up in the mailbox of the user, pick
up the message from quarantine and put it back in the mail queue, and
have it delivered using the normal delivery route.

How you set up the quarantine is up to you. This could be a simple
mailbox, which is reprocessed using a sieve filter (as you suggested).
The most logical routine would then be to have the sieve filter consume
the message and then re-inject it into the mail delivery queue. But
there are probably better solutions.
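
The quarantine-then-release flow described above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the class name `Quarantine`, the in-memory dictionary, the 300-second hold period, and the `deliver` callback, which stands in for whatever mechanism re-injects a message into the mail queue.

```python
import time
from dataclasses import dataclass


@dataclass
class HeldMessage:
    key: str
    raw: bytes
    release_at: float


class Quarantine:
    """Holds messages for a fixed period, then hands them to a delivery callback."""

    def __init__(self, deliver, hold_seconds=300.0, clock=time.time):
        self.deliver = deliver            # callable(raw_bytes), e.g. re-injection
        self.hold_seconds = hold_seconds  # assumed hold period
        self.clock = clock                # injectable for testing
        self.held = {}

    def hold(self, key, raw):
        """Place a message in quarantine until its hold period expires."""
        self.held[key] = HeldMessage(key, raw, self.clock() + self.hold_seconds)

    def pull(self, key):
        """Remove a held message (e.g. one judged to be spam) and return its bytes."""
        msg = self.held.pop(key, None)
        return msg.raw if msg is not None else None

    def release_due(self):
        """Deliver every message whose hold period has expired; return their keys."""
        now = self.clock()
        due = [m for m in self.held.values() if m.release_at <= now]
        for m in due:
            self.deliver(m.raw)
            del self.held[m.key]
        return [m.key for m in due]
```

A periodic job would call `release_due()`, while a spam verdict reached during the hold period would call `pull()` and file the returned message elsewhere. Existing tools such as amavis or rspamd handle storage and re-injection far more robustly than this sketch.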

I suggest that you look into existing OSS quarantine solutions and learn
from them; amavis and rspamd come to mind. IMHO you're still trying to
re-invent the wheel :)

Kind regards,
	Tom
