Antonio Perez-Aranda
2011-May-23 11:11 UTC
[Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr
Indexing mail attachments with Dovecot + Solr.

This patch has been tested with these versions:
 * dovecot 2.0.9
 * apache-solr 1.4.1

This is a patch for the fts-solr plugin (which indexes mail messages for Dovecot with Solr). In mainline, the plugin does not index attachments. With this patch, you can index mails and their attachments (pdf, docs, openoffice docs...). You also get other goodies with this patch and the Solr config provided, such as synonyms and stemming (Spanish by default).

Attachment indexing is provided by Solr Cell and Tika (ExtractingRequestHandler):
 * http://wiki.apache.org/solr/ExtractingRequestHandler

Synonyms and stemming are provided by SnowballPorterFilterFactory from Solr Language Analysis:
 * http://wiki.apache.org/solr/LanguageAnalysis

We have tested Solr with Tomcat and Jetty. Tomcat handles UTF-8 and larger POSTs better.

Supported attachment file formats:
 * http://tika.apache.org/0.9/formats.html

At present, attachments inside attachments (for example, attachments in forwarded "eml" attachments) are not indexed. Also, keep in mind that there are many types of files, and many variants of the same file type. For example, some PDF files are "not readable" by Solr's PDF reader.

Config:

There are two new options added to the fts_solr property:
 * index-attachments
       Enable attachment indexing.
 * manual-update
       Do not index when a user searches. You can trigger indexing with the doveadm search or doveadm index commands.

There is a new property for the plugin section to filter the MIME types that you want to index:
 * fts_solr_mimetype
       Files with one of these MIME types will be sent to Solr.

After integrating the solr directory into your Solr config, and building Dovecot with fts-solr support and with fts-solr-attachments-r885.patch applied, you can update your Dovecot config by adding to your dovecot.conf:

...
mail_plugins = $mail_plugins fts fts_solr

plugin {
  fts = solr
  fts_solr = url=http://solrhost:8983/solr/ break-imap-search index-attachments
  fts_solr_mimetype = application/x-pdf application/vnd.openxmlformats-officedocument.wordprocessingml.document
}
...

-- 
Antonio Pérez-Aranda Alcaide
aperezaranda at yaco.es

Yaco Sistemas S.L.
http://www.yaco.es/
C/ Rioja 5, 41001 Sevilla
Teléfono +34 954 50 00 57
Fax      +34 954 50 09 29
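For readers unfamiliar with Solr Cell, here is a rough Python sketch of the two ideas in the message above: filtering attachments by a whitespace-separated MIME type list (as in fts_solr_mimetype) and building a request URL for the /update/extract handler. It is only an illustration, not the patch's code. The function names and the document ID format are hypothetical; `literal.<field>` and `commit` are standard ExtractingRequestHandler parameters.

```python
from urllib.parse import urlencode


def parse_mimetypes(setting):
    """Split a whitespace-separated MIME type list into a lowercase set."""
    return {token.lower() for token in setting.split()}


def should_index(content_type, allowed):
    """Return True if the part's MIME type (ignoring parameters) is allowed."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in allowed


def extract_url(base_url, doc_id, commit=False):
    """Build an /update/extract URL that sets the document's id field.

    The doc_id layout is an assumption made for this sketch only.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    allowed = parse_mimetypes(
        "application/x-pdf "
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    print(should_index('application/x-pdf; name="a.pdf"', allowed))
    print(extract_url("http://solrhost:8983/solr/", "user/INBOX/42/2"))
```

The attachment body itself would be sent as the POST payload to that URL; that part is left out here because it depends on the HTTP client in use.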
Antonio Perez-Aranda
2011-May-23 11:14 UTC
[Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr
Sorry, I forgot to include the attachment.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fts-solr-attachments-r885.tar.gz
Type: application/x-gzip
Size: 28370 bytes
Desc: not available
URL: <http://dovecot.org/pipermail/dovecot/attachments/20110523/c89bac0c/attachment-0002.gz>
Charles Marcus
2011-May-23 12:54 UTC
[Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr
On 2011-05-23 7:11 AM, Antonio Perez-Aranda wrote:
> Indexing mail attachments with Dovecot + Solr.
>
> This patch has been tested with these versions:
> * dovecot 2.0.9
> * apache-solr 1.4.1

Isn't it customary - and logical - to always test/patch against the current stable RELEASE version (ie, 2.0.13)?

-- 
Best regards, Charles
Timo Sirainen
2011-Aug-31 13:24 UTC
[Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr
On Mon, 2011-05-23 at 13:11 +0200, Antonio Perez-Aranda wrote:
> Indexing mail attachments with Dovecot + Solr.

I've been looking at this and wondering about a few things:

The example solrconfig.xml contains:

> <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
> ..
> <!-- capture link hrefs but ignore div attributes -->
> <str name="captureAttr">true</str>
> <str name="fmap.a">links</str>
> <str name="fmap.div">ignored_</str>
> </lst>

To me it looks like this requires that there exists a "links" field that is used for.. I guess content between <a>..</a> tags? Or also for the href URLs? In any case there's no links field in the schema.xml so I don't think this works? Similarly it looks like stuff between <div>..</div> is ignored here, which doesn't seem like a good idea.

> There is a new property for the section plugin to filter the mimetypes
> that you want to index.
> * fts_solr_mimetype
>       files with this mimetype will be sent to solr.

In v2.1 I've added a generic "fts decoder" script that can handle attachment decoding. The script contains stuff like:

  formats='application/pdf pdf
  application/x-pdf pdf
  application/msword doc
  ..

So there already exists a place which can list supported MIME types and also what filename extensions they have, so if there's application/octet-stream with filename=foo.pdf, Dovecot's fts code can change the MIME type to application/pdf. This sounds like it could be useful for the Solr attachments too. Maybe instead of the fts_solr_mimetype setting the script could be modified a bit so that it would even allow mixed Solr/script attachment extraction. For example:

  formats='+application/pdf pdf
  +application/x-pdf pdf
  application/msword doc'

The "+" prefix could tell that the FTS backend (Solr) handles the MIME type instead of the script. So with the above config Solr would decode .pdfs, but the script would decode .docs.
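To make the "+" prefix idea concrete, here is a minimal Python sketch of how such a formats list could be parsed and used to route an attachment, including the application/octet-stream remapping by filename extension. This is purely illustrative: the function names and the "backend"/"script" labels are made up for the sketch, and the real decoder script is a shell script whose exact behavior is not shown in this thread.

```python
def parse_formats(spec):
    """Parse 'mime ext' pairs; a '+' prefix marks types the FTS backend handles."""
    tokens = spec.split()
    if len(tokens) % 2 != 0:
        raise ValueError("formats must be a list of 'mimetype extension' pairs")
    table = {}
    for mime, ext in zip(tokens[0::2], tokens[1::2]):
        backend = mime.startswith("+")
        if backend:
            mime = mime[1:]
        # Hypothetical handler labels: "backend" (e.g. Solr) or "script".
        table[mime.lower()] = (ext.lower(), "backend" if backend else "script")
    return table


def resolve(table, mime, filename=None):
    """Return (effective_mime, handler), or None if the type is unsupported."""
    mime = mime.lower()
    if mime == "application/octet-stream" and filename and "." in filename:
        # Guess the real type from the filename extension.
        ext = filename.rsplit(".", 1)[1].lower()
        for known_mime, (known_ext, handler) in table.items():
            if known_ext == ext:
                return known_mime, handler
    if mime in table:
        return mime, table[mime][1]
    return None


if __name__ == "__main__":
    formats = """+application/pdf pdf
    +application/x-pdf pdf
    application/msword doc"""
    table = parse_formats(formats)
    print(resolve(table, "application/octet-stream", "foo.pdf"))
    print(resolve(table, "application/msword", "letter.doc"))
```

When several MIME types share an extension, this sketch picks the first one listed, which is one possible choice among others.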
I was also thinking that the attachment documents could contain some description fields as well, which could be useful if you're searching the Solr index directly instead of via Dovecot. Maybe fields like "attachment_filename" (parsed from Content-Disposition: header) and "attachment_description" (parsed from Content-Description: header). They could of course be empty if those fields don't exist (and probably should be optional anyway). Also there should be "attachment_part" field that would contain the IMAP MIME part number of the attachment (e.g. "2.1.3"), so it would be easy to find and fetch the attachment. This could also be used as part of the ID string instead of the attachment_count.
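As a rough illustration of the proposed fields, the following Python sketch walks a parsed message, numbers the leaf parts in IMAP style (e.g. "2.1.3") and collects attachment_filename, attachment_description and attachment_part for each part that has a filename. It is a sketch under assumptions, not Dovecot code: it only descends into multipart containers (embedded message/rfc822 parts are treated as leaves), and Python's `get_filename()` also falls back to the Content-Type `name` parameter when Content-Disposition carries no filename.

```python
from email import message_from_string


def walk_parts(msg, prefix=""):
    """Yield (imap_part_number, part) for every leaf part of the message."""
    if msg.get_content_maintype() == "multipart":
        for index, child in enumerate(msg.get_payload(), start=1):
            number = f"{prefix}.{index}" if prefix else str(index)
            yield from walk_parts(child, number)
    else:
        # A non-multipart top-level body is part "1" in IMAP numbering.
        yield (prefix or "1", msg)


def attachment_fields(msg):
    """Collect the proposed description fields for each named attachment."""
    docs = []
    for number, part in walk_parts(msg):
        filename = part.get_filename()
        if filename is None:
            continue  # no filename: not treated as an attachment here
        docs.append({
            "attachment_part": number,
            "attachment_filename": filename,
            # Empty when the optional Content-Description header is absent.
            "attachment_description": part.get("Content-Description", ""),
        })
    return docs


if __name__ == "__main__":
    raw = (
        "MIME-Version: 1.0\n"
        'Content-Type: multipart/mixed; boundary="b1"\n'
        "\n"
        "--b1\n"
        "Content-Type: text/plain\n"
        "\n"
        "hello\n"
        "--b1\n"
        "Content-Type: application/pdf\n"
        'Content-Disposition: attachment; filename="report.pdf"\n'
        "Content-Description: Quarterly report\n"
        "\n"
        "JVBERi0=\n"
        "--b1--\n"
    )
    print(attachment_fields(message_from_string(raw)))
```

Since the part number is unique within a message, it could take the place of a running attachment counter in the document ID, as suggested above.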