Hello
Looking further, and after adding log statements throughout the Dovecot 
code, I can see that the core sends the original (undecoded) document 
FIRST, then the decoded version SECOND.
This is really weird, and the indexer ends up indexing a lot of binary garbage.
I am struggling to find where in the code this double call is made.
Does anyone know?
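
For anyone who wants to reproduce this, here is a minimal sketch of a logging helper that makes the raw and decoded calls easy to tell apart in the log. It is plain C with no Dovecot dependencies. The name preview_bytes is hypothetical, and it assumes the plugin function has the data pointer and its size at hand. Unlike a plain '%s', it escapes non-printable bytes, so binary input shows up as \xNN instead of being cut at the first NUL.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical debugging helper (not part of Dovecot).
 * Writes a printable preview of at most max_bytes bytes of data into out.
 * Printable ASCII is copied as-is; every other byte (and the backslash)
 * is written as \xNN. out is always NUL-terminated when out_size > 0.
 * Returns the number of input bytes that were written to the preview. */
static size_t preview_bytes(const unsigned char *data, size_t size,
                            size_t max_bytes, char *out, size_t out_size)
{
    size_t used = 0;
    size_t i = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';
    if (data == NULL)
        return 0;

    for (i = 0; i < size && i < max_bytes; i++) {
        unsigned char c = data[i];

        if (c >= 0x20 && c < 0x7f && c != '\\') {
            /* one character plus the terminating NUL must fit */
            if (used + 2 > out_size)
                break;
            out[used++] = (char)c;
        } else {
            /* four characters (\xNN) plus the terminating NUL must fit */
            if (used + 5 > out_size)
                break;
            snprintf(out + used, 5, "\\x%02x", (unsigned int)c);
            used += 4;
        }
        out[used] = '\0';
    }
    return i;
}
```

Inside the plugin, one could call preview_bytes(data, size, 20, buf, sizeof(buf)) and pass buf to i_info() together with size, assuming the function receives both the buffer and its length. Logging the size next to the preview should show directly whether the same part arrives twice.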
On 2021-02-10 00:05, John Fawcett wrote:
> On 09/02/2021 15:33, Joan Moreau wrote:
> 
>> If I place the following code in the plugin 
>> fts_backend_xxx_update_build_more function (lucene, squat and xapian, 
>> as solr refuses to work properly on my setup)
>> 
>> {
>> char * s = i_strdup("EMPTY");
>> if(data != NULL) { i_free(s); s = i_strndup(data,20); }
>> i_info("fts_backend_update_build_more: data like '%s'",s);
>> i_free(s);
>> }
>> 
>> and if I send a PDF by email, the data shown in the log is "%PDF-1.7 "
>> 
>> so it means that the decoded data is not properly transmitted to the 
>> plugin.
>> 
>> Something is wrong in the data transmission.
> 
> Joan
> 
> I too see something similar with fts_solr. I do see the raw %PDF string 
> and PDF binary data being passed through to the 
> fts_backend_xxx_update_build_more function, but I disagree with the 
> conclusion you draw from it.
> 
> After the raw data I also see the decoded data, so at least in my case 
> it is possible to see both the raw and decoded data in 
> fts_backend_xxx_update_build_more function. In the rawlog I no longer 
> see the binary data (but some blank lines), so something is filtering 
> it. I do see the decoded data in the rawlog. I do get hits on the solr 
> search for the decoded text.
> 
> John