Hi,

I'm trying to resolve a few problems with indexing 'From' headers using FTS/Solr. I ran tcpdump on the traffic between Dovecot and Jetty/Solr and noticed that 'From' headers that also include the sender's name are double-escaped. This is what Dovecot was sending to Solr:

</field><field name="from">Name Surname &amp;lt;test at example.com&amp;gt;</field></doc></add>

As you can see, the characters < and > were escaped to &lt; and &gt;, which were then escaped again to &amp;lt; and &amp;gt;. This causes problems when indexing the whole e-mail address, because Solr sees it as '&lt;test at example.com&gt;'.

I spent hours trying to figure out why I was able to search on all parts of e-mail addresses, while searching for the full, exact e-mail address succeeded ONLY for messages whose 'From' header doesn't include the sender's name. Finally, after I found this bug, the following filters fixed all search problems:

<filter class="solr.PatternReplaceFilterFactory" pattern="&amp;lt;" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern="&amp;gt;" replacement=""/>

I hope that at least this bug will be fixed. Thank you.

azur
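The double escaping described in the report can be reproduced outside Dovecot. The following is a minimal Python sketch, not Dovecot's or Solr's actual code: it escapes a header value twice, parses the resulting field the way any XML parser would, and then strips the leftover entity text. The variable names are illustrative, the address is written as it appears in the archive, and the final regular expression only approximates the two PatternReplaceFilterFactory filters (Solr applies token filters per token, not to the whole field).

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Header value as in the report (address obfuscated as in the archive).
header = "Name Surname <test at example.com>"

# One round of XML escaping is correct; a second round is the bug.
once = escape(header)   # < and > become &lt; and &gt;
twice = escape(once)    # the & of each entity becomes &amp;

# An XML parser undoes exactly one level of escaping, so the
# field value still contains literal "&lt;" and "&gt;" text.
doc = '<field name="from">' + twice + "</field>"
seen_by_solr = ET.fromstring(doc).text

# Rough stand-in for the workaround filters: drop the leftover
# entity text (assumption: applied to the whole string here).
cleaned = re.sub(r"&(?:lt|gt);", "", seen_by_solr)

print(twice)
print(seen_by_solr)
print(cleaned)
```

With a single round of escaping, the parsed value would contain the real < and > characters, which is presumably what the tokenizer expects when splitting out the address.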
On 06.04.2017 14:58, azurit at pobox.sk wrote:
> I hope that, at least, this bug, reported by me, will be fixed. Thank
> you.
>
> azur

Hi!

Which Dovecot version was this?

Aki
Quoting Aki Tuomi <aki.tuomi at dovecot.fi>:
> Hi!
>
> Which dovecot version was this?
>
> Aki

Sorry, I forgot to mention it: 2.2.27, Debian Jessie (backports), 64-bit.
On 6 Apr 2017, at 14.58, azurit at pobox.sk wrote:
> I hope that, at least, this bug, reported by me, will be fixed. Thank you.

The attached patch should also help.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: solr.diff
Type: application/octet-stream
Size: 843 bytes
Desc: not available
URL: <http://dovecot.org/pipermail/dovecot/attachments/20170409/8aa32e7d/attachment.obj>
Quoting Timo Sirainen <tss at iki.fi>:
> The attached patch should also help.

Works fine, thank you!