thr3ads.net - dovecot - Processing incoming mail efficiently [Jan 2021]

Ron Garret

2021-Jan-30 16:49 UTC

Processing incoming mail efficiently

I've asked a related question on this list before but I now have a much better
handle on what I'm doing and I realize that I still don't know the answer, so
I'm going to ask this again in a slightly different form.

I'm writing a spam filter, so obviously I need to feed incoming mail to it
somehow.  The "obvious" way to do this is with a sieve script using the pipe
extension.  There are two problems with this:

1.  This will always pipe the entire file no matter how big it is.  The filter
will often not need to process the body of the message, only the headers, or
only the first part of a multipart MIME message.  Is there any way to allow my
filter to open the file in which the message is stored rather than piping it a
copy of the message?

2.  Once the filter has processed the message and decided if it's spam it still
needs to move the message to the appropriate folder (INBOX or Junk).  To do this
it needs to somehow correlate the *content* of the message that was piped to it
with the UID of the message that needs to be moved.  One way to do this is to
pull out the message-id header and then use doveadm to find the file containing
the message with that message-id, but there are two problems with this.  First,
not all messages have message-ids.  I can work around this by adding my own
message-id to messages that don't already have them, but this just feels wrong.
And second, unless dovecot keeps an index of message-ids (does it?) then this
will be horribly inefficient because it will have to essentially grep for the
message id every time I want to move a message.  So it seems like there has to
be a better way, but I can't think of what that would be.

I figure this has to be a solved problem because I am obviously not the first
person to write a spam filter for dovecot.  What is the Right Way to do this?

Thanks,
rg

Tom Hendrikx

2021-Jan-30 17:56 UTC

Processing incoming mail efficiently

On 30-01-2021 17:49, Ron Garret wrote:
> I've asked a related question on this list before but I now have a much
> better handle on what I'm doing and I realize that I still don't know the
> answer, so I'm going to ask this again in a slightly different form.
>
> I'm writing a spam filter, so obviously I need to feed incoming mail to it
> somehow.  The "obvious" way to do this is with a sieve script using the pipe
> extension.  There are two problems with this:
>
> 1.  This will always pipe the entire file no matter how big it is.  The
> filter will often not need to process the body of the message, only the
> headers, or only the first part of a multipart MIME message.  Is there any
> way to allow my filter to open the file in which the message is stored
> rather than piping it a copy of the message?
>
> 2.  Once the filter has processed the message and decided if it's spam it
> still needs to move the message to the appropriate folder (INBOX or Junk).
> To do this it needs to somehow correlate the *content* of the message that
> was piped to it with the UID of the message that needs to be moved.  One way
> to do this is to pull out the message-id header and then use doveadm to find
> the file containing the message with that message-id, but there are two
> problems with this.  First, not all messages have message-ids.  I can work
> around this by adding my own message-id to messages that don't already have
> them, but this just feels wrong.  And second, unless dovecot keeps an index
> of message-ids (does it?) then this will be horribly inefficient because it
> will have to essentially grep for the message id every time I want to move a
> message.  So it seems like there has to be a better way, but I can't think
> of what that would be.

Normally the flow is a bit different:

You configure the spam/content filter in your MTA (for instance 
SMTP-proxy, pre-queue, milter or post-queue content filter). The main 
benefit of doing this type of work in the MTA is that you have the 
ability to reject blatant spam messages during the SMTP stage. This 
means that you don't have to store the spam at all, you simply tell the 
sending server that you don't want to accept the message, and the 
sending server will have to deal with that decision (f.i. by sending a 
non-delivery notice to the sender).

The spam filter will add headers to the incoming message. If you decide 
to accept it, you can configure Sieve to deliver the message to the 
Inbox or the Junk folder based on those headers. A nice implementation 
is the spamtest/virustest extension 
(https://doc.dovecot.org/configuration_manual/sieve/extensions/spamtest_virustest/), 
but you can of course wrangle your own Sieve recipes.
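
As a rough illustration of such a home-grown recipe (a sketch, not a 
recommended setup): the header name X-Spam-Flag and the value YES are 
assumptions here, so use whatever header your content filter really 
adds, and adjust the folder name to match your mailbox layout.

```sieve
require ["fileinto"];

# Assumption: the content filter in the MTA adds "X-Spam-Flag: YES"
# to messages it considers spam.
if header :is "X-Spam-Flag" "YES" {
    fileinto "Junk";
}
# Anything else falls through to the implicit keep (normally INBOX).
```

With a rule like this the filter never has to look up a UID or a 
message-id: the verdict travels with the message in its headers.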

Spam scanning during the delivery phase (f.i. with a sieve filter) is 
less common because it has a few downsides.

So to answer your questions:

1. Your content filter can be a spam filter, but it might also be an 
antivirus scanner. The latter is of course very interested in the 
complete e-mail including all attachments. So most setups try to send 
the complete message. There are also implementations that ignore 
messages with a size above a certain threshold, or others which just 
ignore the data after a certain threshold. What filter are you trying to 
implement? Something off the shelf, or a home-brewed one? Why is it so 
hard to consume the whole message? Please explain :)

2. The normal flow is a bit different (as described above), but in 
general: the spam filter decides. Some (existing) filters take the whole 
message from the MTA, add headers and re-inject the message again.
Other filters use a mechanism (f.i. milter protocol) which allows them 
to consume only a part of the message, and in response they instruct the 
MTA to add the result headers. This means that the filter must support 
the protocol spoken by the MTA, but it doesn't have to take care of 
re-delivering the message.
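
To make the two points above a bit more concrete, here is a minimal 
Python sketch of a filter that only looks at the first part of a message 
(the size-threshold idea from point 1) and then prepends a result header 
before the message is handed back (the "add headers" idea from point 2). 
Everything specific in it is an assumption made for illustration: the 
64 KiB cap, the X-Spam-Flag header name, and the toy blocklist rule are 
not taken from any existing filter, and the re-injection step itself is 
left out.

```python
import io
from email import policy
from email.parser import BytesParser

MAX_BYTES = 64 * 1024   # assumed cap on how much of the message is inspected
BLOCKLIST = ("pills",)  # hypothetical toy rule, stands in for a real classifier


def read_headers(stream, limit=MAX_BYTES):
    """Read at most `limit` bytes from `stream` and parse only the headers."""
    chunk = stream.read(limit)
    return BytesParser(policy=policy.default).parsebytes(chunk, headersonly=True)


def classify(headers):
    """Return "YES" if the Subject contains a blocklisted word, else "NO"."""
    subject = (headers["Subject"] or "").lower()
    return "YES" if any(word in subject for word in BLOCKLIST) else "NO"


def tag(raw):
    """Return the full message (bytes) with an X-Spam-Flag header prepended."""
    verdict = classify(read_headers(io.BytesIO(raw)))
    # Reuse the line ending the message already uses.
    eol = b"\r\n" if b"\r\n" in raw[:1024] else b"\n"
    return b"X-Spam-Flag: " + verdict.encode("ascii") + eol + raw
```

In a pipe-style setup one would feed `tag()` the bytes read from 
standard input and write the result to whatever re-injects the message. 
A milter-based filter would skip the copying altogether and just tell 
the MTA which header to add.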

We need to know about the actual problem you're trying to solve. It 
sounds a lot like you're trying to reinvent things that have been solved 
many times before. Please give a broader explanation of your specific 
problem and we can give you better advice :)

Kind regards,

	Tom

Marc Roos

2021-Jan-30 20:33 UTC

Processing incoming mail efficiently

> -----Original Message-----
> From: dovecot <dovecot-bounces at dovecot.org> On Behalf Of Ron Garret
> Sent: 30 January 2021 17:49
> To: Dovecot <dovecot at dovecot.org>
> Subject: Processing incoming mail efficiently
> 
> I've asked a related question on this list before but I now have a much
> better handle on what I'm doing and I realize that I still don't know
> the answer, so I'm going to ask this again in a slightly different form.
> 
> I'm writing a spam filter, so obviously I need to feed incoming mail to
> it somehow.  The "obvious" way to do this is with a sieve script using
> the pipe extension.  There are two problems with this:
No, that is not obvious; it would imply a dependency on sieve.
> 1.  This will always pipe the entire file no matter how big it is.  The
> filter will often not need to process the body of the message, 
Yes, because your starting point is wrong. Using mailfromd you can process a
specific milter state; see the envfrom and envrcpt handlers, etc.

https://puszcza.gnu.org.ua/software/mailfromd/manual/mailfromd.html#handler-names

> only the headers, or only the first part of a multipart MIME message.  Is there
> any way to allow my filter to open the file in which the message is
> stored rather than piping it a copy of the message?
> 
> 2.  Once the filter has processed the message and decided if it's spam
> it still needs to move the message to the appropriate folder (INBOX or
> Junk).  To do this it needs to somehow correlate the *content* of the
> message that was piped to it with the UID of the message that needs to
> be moved.  One way to do this is to pull out the message-id header and
> then use doveadm
No. In whatever milter state you are processing, you can add a message header
such as 'This is spam'. Then you need just one sieve rule that moves messages
based on the existence of that specific header.
> to find the file containing the message with that
> message-id, but there are two problems with this.  First, not all
> messages have message-ids.  I can work around this by adding my own
First you have to crawl before you walk, so learn how to crawl. It does not make
sense to try to build something if you do not know the specifics.
> message-id to messages that don't already have them, but this just feels
> wrong.  And second, unless dovecot keeps an index of message-ids (does
> it?) then this will be horribly inefficient because it will have to
> essentially grep for the message id every time I want to move a message.
> So it seems like there has to be a better way, but I can't think of what
> that would be.
Start playing with mailfromd. It has a scripting language to configure it, and
all the tools (functions) are available to do whatever you can think of.

https://puszcza.gnu.org.ua/software/mailfromd/manual/mailfromd.html#Filter-Script-Example
> I figure this has to be a solved problem because I am obviously not the
> first person to write a spam filter for dovecot.  What is the Right Way
> to do this?
> 

As written above
