On 2021-02-08, Joan Moreau <jom at grosjo.net> wrote:
> Well, in the function xxx_build_more of FTS plugin, the data received in
> the original PDF, not the output of pdftotext
>
> Can you clarify where do you put your log in the solr plugin , so I can
> check the situation in the xapian plugin ?

The log is particular to fts_solr, you set it with e.g.

"fts_solr = url=http://127.0.0.1:8983/solr/dovecot/ rawlog_dir=/tmp/solr"

Confirmed it works for me, i.e. passes text from inside the pdf, and not
the whole pdf itself.

Did you check that decode2text.sh works ok on your system (when running
as the relevant uid)?

cat foo.pdf | sudo -u dovecot /usr/libexec/dovecot/decode2text.sh application/pdf
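One way to make the decoder check above less dependent on eyeballing the output is to test whether the result still starts with the PDF magic bytes. The following is only a sketch: `is_raw_pdf` is a hypothetical helper, not part of Dovecot, and it assumes a `head` that supports `-c` (GNU and BSD do).

```shell
#!/bin/sh
# Hypothetical helper (not part of Dovecot): exits 0 when stdin begins
# with the PDF magic bytes "%PDF-", i.e. the data is still a raw PDF,
# and non-zero otherwise.
is_raw_pdf() {
    # Read only the first 5 bytes of stdin and compare them to the marker.
    [ "$(head -c 5)" = "%PDF-" ]
}

# Usage, assuming the decoder path and uid from the message above:
#   cat foo.pdf \
#     | sudo -u dovecot /usr/libexec/dovecot/decode2text.sh application/pdf \
#     | is_raw_pdf \
#     && echo "output is still a raw PDF" \
#     || echo "output does not start with a PDF header"
```

A non-zero exit only says the output does not begin with a PDF header; it does not prove the extracted text is complete or correct.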
Yes, once again: the output of the decoder is fine. I also put logging inside the dovecot core to check whether the data is properly transmitted, and it is (i.e. dovecot core receives the proper output of pdftotext via the decoder).

Now, that data is /not/ what is sent from dovecot core to the FTS plugin (and this is the same issue for solr and all other plugins).

Of course, the stemming will show good results (as the PDF content will be stemmed), but the problem remains.

How can I make sure the data sent to the FTS plugins (xapian, solr, whatever...) is the output of the decoder and /not/ the original data?
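For the fts_solr case, one rough way to check what actually reaches the plugin is to scan the rawlog directory for the PDF header marker. This is a sketch under assumptions: `find_raw_pdf_in_logs` is a hypothetical helper, the default path `/tmp/solr` is the rawlog_dir from the example setting quoted earlier in the thread, and the rawlog files are assumed to be ordinary files that grep can read. The `-r` and `-a` options are GNU/BSD grep extensions.

```shell
#!/bin/sh
# Hypothetical helper: print the names of files under a directory that
# contain the PDF header marker "%PDF-" anywhere in their content.
# Exits non-zero when no file matches.
find_raw_pdf_in_logs() {
    dir=${1:-/tmp/solr}   # assumed rawlog_dir; pass another path to override
    # -r: recurse, -l: print file names only,
    # -a: treat binary data as text, -F: match a fixed string
    grep -rlaF -- '%PDF-' "$dir" 2>/dev/null
}

# Usage:
#   find_raw_pdf_in_logs /tmp/solr \
#     && echo "at least one rawlog file contains a PDF header" \
#     || echo "no PDF header found in the rawlog files"
```

A match is only a hint: a message whose text happens to contain the string "%PDF-" would also be listed, so any hit still needs a manual look.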