PGNet Dev
2020-Oct-11 22:27 UTC
v2.3.11.3 solr plugin search via MUA fails to match accented ascii characters; cmd line exec of `doveadm fts lookup` PANICs (assertion failed)
I'm running,

    dovecot --version
        2.3.11.3 (502c39af9)

    solr -version
        8.6.3

    uname -rm
        5.8.13-200.fc32.x86_64 x86_64

    grep _NAME /etc/os-release
        PRETTY_NAME="Fedora 32 (Server Edition)"
        CPE_NAME="cpe:/o:fedoraproject:fedora:32"

Solr FTS plugin is enabled/configured,

    mail_plugins = virtual acl fts fts_solr

    plugin {
        fts = solr
        fts_autoindex = yes
        fts_solr = url=https://solr.example.com:8984/solr/dovecot/
        fts_enforced = body
        fts_filters = normalizer-icu stopwords snowball
        fts_language_config = /usr/share/libexttextcat/fpdb.conf
        fts_languages = en es de fr it pt
        soft_commit = yes
    }

IMAP capability returns,

    a OK [CAPABILITY IMAP4rev1 SASL-IR LOGIN-REFERRALS ID ENABLE IDLE SORT SORT=DISPLAY THREAD=REFERENCES THREAD=REFS THREAD=ORDEREDSUBJECT MULTIAPPEND URL-PARTIAL CATENATE UNSELECT CHILDREN NAMESPACE UIDPLUS LIST-EXTENDED I18NLEVEL=1 CONDSTORE QRESYNC ESEARCH ESORT SEARCHRES WITHIN CONTEXT=SEARCH LIST-STATUS BINARY MOVE SNIPPET=FUZZY PREVIEW=FUZZY STATUS=SIZE SAVEDATE SPECIAL-USE LITERAL+ NOTIFY SPECIAL-USE QUOTA ACL RIGHTS=texk] Logged in

I've got two messages in my IMAP store,

    cd /data/vmail/example.com/myuser/Maildir/cur/
    ls -altr | grep S= | /bin/tail -n2
        -rw------- 1 vmail vmail 1.3K Oct 11 14:05 1602450306.M393628P65260.mx.example.com,S=1278,W=1304:2,S
        -rw------- 1 vmail vmail 1.3K Oct 11 14:05 1602450353.M756184P65260.mx.example.com,S=1277,W=1303:2,S

that differ in BODY CONTENT

    -- one message has ascii txt with NO character accents
    -- the other has the same text, but with ONE character accent

    cat "1602450306.M393628P65260.mx.example.com,S=1278,W=1304:2,S"
        ...
        From: M User <myuser@example.com>
        Subject: test
        Reply-To: myuser@example.com
        To: "User, My" <myuser@example.com>
        Message-ID: <6fc7ac30-b460-7dd4-f85d-ca4403ad7188@example.com>
        Date: Sun, 11 Oct 2020 14:05:06 -0700
        User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.3.2
        Content-Type: text/plain; charset=utf-8; format=flowed
        Content-Language: en-US
        Content-Transfer-Encoding: 8bit

    !!!!    también

    cat "1602450353.M756184P65260.mx.example.com,S=1277,W=1303:2,S"
        ...
        From: M User <myuser@example.com>
        Subject: test
        Reply-To: myuser@example.com
        To: "User, My" <myuser@example.com>
        Message-ID: <015b3fb4-46f9-87cc-d541-060db0a13086@example.com>
        Date: Sun, 11 Oct 2020 14:05:53 -0700
        User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.3.2
        Content-Type: text/plain; charset=utf-8; format=flowed
        Content-Language: en-US
        Content-Transfer-Encoding: 7bit

    !!!!    tambien

i manually re-scan & index

    doveadm fts rescan -u myuser@example.com
    doveadm index -u myuser@example.com -q '*'

    ...
    ==> /var/log/dovecot/dovecot-info.log <==
    2020-10-11 15:06:34 indexer-worker(myuser@example.com)<OyUmLeqBg18fDAEA+IOfAw>: Info: Indexed 21 messages in accts (UIDs 14399..130699)
    2020-10-11 15:06:34 indexer-worker(myuser@example.com)<6NnOMuqBg18fDAEA+IOfAw>: Info: Indexed 16 messages in accts/v007132 (UIDs 13414..14778)
    ...

with no errors.

then search in mail client, here TBird 78, with

    [X] Run Search on Server

for _un_accented "tambien", match is correctly -- and quickly -- returned.

in logs,

    ==> /var/log/dovecot/dovecot-info.log <==
    2020-10-11 14:57:05 imap-login: Info: Login: user=<myuser@example.com>, method=PLAIN, rip=10.0.1.7, lip=10.0.1.50, mpid=67743, TLS
    2020-10-11 14:57:16 indexer-worker(myuser@example.com)<3ZUzQ2yx2JKsHgsH:9gu0MbF/g1+hCAEA+IOfAw>: Info: Indexed 4788 messages in INBOX (UIDs 135476..140263)

BUT, repeating search for ACCENTED "también" returns *no* match/result.

No errors in log, simply no match.

Attempting to test/debug from the cmd line,

    doveadm fts lookup -u myuser@example.com body "tambien"

causes a PANIC

    doveadm(myuser@example.com): Panic: file mail-storage.c: line 2112 (mailbox_get_open_status): assertion failed: (box->opened)
    doveadm(myuser@example.com): Error: Raw backtrace: /usr/lib64/dovecot/libdovecot.so.0(backtrace_append+0x46) [0x7f3ee94accc6] -> /usr/lib64/dovecot/libdovecot.so.0(backtrace_get+0x22) [0x7f3ee94acde2] -> /usr/lib64/dovecot/libdovecot.so.0(+0x10025b) [0x7f3ee94b625b] -> /usr/lib64/dovecot/libdovecot.so.0(+0x100297) [0x7f3ee94b6297] -> /usr/lib64/dovecot/libdovecot.so.0(+0x59bc6) [0x7f3ee940fbc6] -> /usr/lib64/dovecot/libdovecot-storage.so.0(+0x4779e) [0x7f3ee95c379e] -> /usr/lib64/dovecot/lib21_fts_solr_plugin.so(+0x5849) [0x7f3ee9015849] -> /usr/lib64/dovecot/lib20_fts_plugin.so(fts_backend_lookup+0x51) [0x7f3ee8c37491] -> /usr/lib64/dovecot/doveadm/lib20_doveadm_fts_plugin.so(+0x3280) [0x7f3ee8ba9280] -> doveadm(+0x343cd) [0x5637e99443cd] -> doveadm(+0x34fe0) [0x5637e9944fe0] -> doveadm(doveadm_cmd_ver2_to_mail_cmd_wrapper+0x22d) [0x5637e9945e2d] -> doveadm(doveadm_cmd_run_ver2+0x4e8) [0x5637e99568d8] -> doveadm(doveadm_cmd_try_run_ver2+0x3e) [0x5637e995692e] -> doveadm(main+0x1d4) [0x5637e9934cf4] -> /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f3ee9071042] -> doveadm(_start+0x2e) [0x5637e99351ce]
    Aborted

(1) What config -- dovecot &/or solr -- is needed to match on accented characters?

(2) What add'l detail, if any, is needed for troubleshooting the panic?
John Fawcett
2020-Oct-18 09:54 UTC
On 12/10/2020 00:27, PGNet Dev wrote:
> for _un_accented "tambien", match is correctly -- and quickly -- returned.
>
> in logs,
>
> ==> /var/log/dovecot/dovecot-info.log <==
> 2020-10-11 14:57:05 imap-login: Info: Login: user=<myuser@example.com>, method=PLAIN, rip=10.0.1.7, lip=10.0.1.50, mpid=67743, TLS
> 2020-10-11 14:57:16 indexer-worker(myuser@example.com)<3ZUzQ2yx2JKsHgsH:9gu0MbF/g1+hCAEA+IOfAw>: Info: Indexed 4788 messages in INBOX (UIDs 135476..140263)
>
> BUT, repeating search for ACCENTED "también" returns *no* match/result.
>
> No errors in log, simply no match.
>

I have no issues searching for accented characters from Thunderbird. For example, I found your message when searching for either tambien or también. My configuration is somewhat simpler though.

Maybe a silly question, but if you repeat the test for other words with accents, does it work? I noticed you have configured stopwords, so some words are not going to get indexed, and it seems that también is one of those.

> Attempting to test/debug from from cmd line,
>
> doveadm fts lookup -u myuser@example.com body "tambien"
>
> causes a PANIC
>
> doveadm(myuser@example.com): Panic: file mail-storage.c: line 2112 (mailbox_get_open_status): assertion failed: (box->opened)
> doveadm(myuser@example.com): Error: Raw backtrace: /usr/lib64/dovecot/libdovecot.so.0(backtrace_append+0x46) [0x7f3ee94accc6] -> /usr/lib64/dovecot/libdovecot.so.0(backtrace_get+0x22) [0x7f3ee94acde2] -> /usr/lib64/dovecot/libdovecot.so.0(+0x10025b) [0x7f3ee94b625b] -> /usr/lib64/dovecot/libdovecot.so.0(+0x100297) [0x7f3ee94b6297] -> /usr/lib64/dovecot/libdovecot.so.0(+0x59bc6) [0x7f3ee940fbc6] -> /usr/lib64/dovecot/libdovecot-storage.so.0(+0x4779e) [0x7f3ee95c379e] -> /usr/lib64/dovecot/lib21_fts_solr_plugin.so(+0x5849) [0x7f3ee9015849] -> /usr/lib64/dovecot/lib20_fts_plugin.so(fts_backend_lookup+0x51) [0x7f3ee8c37491] -> /usr/lib64/dovecot/doveadm/lib20_doveadm_fts_plugin.so(+0x3280) [0x7f3ee8ba9280] -> doveadm(+0x343cd) [0x5637e99443cd] -> doveadm(+0x34fe0) [0x5637e9944fe0] -> doveadm(doveadm_cmd_ver2_to_mail_cmd_wrapper+0x22d) [0x5637e9945e2d] -> doveadm(doveadm_cmd_run_ver2+0x4e8) [0x5637e99568d8] -> doveadm(doveadm_cmd_try_run_ver2+0x3e) [0x5637e995692e] -> doveadm(main+0x1d4) [0x5637e9934cf4] -> /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f3ee9071042] -> doveadm(_start+0x2e) [0x5637e99351ce]
> Aborted
>
>
> (1) What config -- dovecot &/or solr -- is needed to match on accented characters?
> (2) What add'l detail, if any, is needed for troubleshooting the panic?
>

I've had more luck searching the index from the command line with the following:

    doveadm search -u myuser@example.com body tambien

I've noticed various errors when running some of the doveadm commands, and I've always put it down to not having run them under the right user or in the right initial conditions, or to having a virtual setup rather than system users. Not sure if that's the case with this error. I confirm I get the same error as you.

John
Shawn Heisey
2020-Oct-18 21:58 UTC
On 10/11/2020 4:27 PM, PGNet Dev wrote:
> I'm running,
>
> dovecot --version
> 2.3.11.3 (502c39af9)
>
> solr -version
> 8.6.3

<snip>

> Attempting to test/debug from from cmd line,
>
> doveadm fts lookup -u myuser@example.com body "tambien"
>
> causes a PANIC

I am a committer on the lucene-solr project, so I know that product very well. I am less confident about dovecot, but I do use it. I do not use the fts-solr plugin, because my mail host in AWS does not have enough memory for that.

If you are using something like the following schema:

https://raw.githubusercontent.com/dovecot/core/master/doc/solr-schema-7.7.0.xml

then note that this schema does not have anything that would fold accented characters. I do see "normalizer-icu" in your dovecot config ... if this filters messages before they get to Solr during indexing, then maybe the Solr config does not need to do the folding.

Solr does have a set of ICU filters, which I would recommend using rather than the lowercase filter, because they are aware of all of Unicode. Those filters are not present in the main Solr distribution, but they are in the Solr binary package under "contrib".

I do not have a setup where I can test this. If I did, I would have done that testing.

I cannot say much about the panic you're getting when using the doveadm command. The stacktrace says it is happening in dovecot code, not Solr code. And it looks like the panic had nothing to do with FTS or Solr ... what I see points to mailbox storage code.

Thanks,
Shawn
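As a rough illustration of what "folding" accented characters means here, the sketch below uses only Python's standard library. It is an approximation under stated assumptions, not the algorithm that Dovecot's normalizer-icu filter or Solr's ICU filters actually run: it decomposes each character, drops the combining marks, and case-folds the result, so that accented and unaccented spellings end up as the same index term. The function name `fold_accents` is made up for this example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding (illustrative only, not ICU's implementation)."""
    # NFKD splits a precomposed character such as U+00E9 into
    # a base letter plus a combining mark (U+0065 U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that upper- and lower-case spellings also compare equal.
    return stripped.casefold()


if __name__ == "__main__":
    # Unaccented, precomposed-accent and decomposed-accent spellings.
    for word in ("tambien", "tambi\u00e9n", "tambie\u0301n"):
        print(repr(word), "->", fold_accents(word))
```

If both the indexed text and the search term pass through the same folding step, a search for either spelling can match either message; if only one side is folded, the accented search can come back empty even though indexing reported no errors.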
PGNet Dev
2020-Oct-18 23:49 UTC
I've since rebuilt/reconfig'd all parts of my setup from scratch; some good cleanup along the way.

Atm, my entire system for send/recv, store/retrieve, + rules & search is working as I intend.

Ok, mostly ...

Except for this accented-character search mystery. I've got a _lot_ of mail with various languages in bodies, so _do_ need to get this sorted.

On 10/18/20 2:58 PM, John Fawcett wrote:
...
> silly question...

hardly!

creating 2 messages

    (1) Subject: tambien
        Body:    tambien

    (2) Subject: también
        Body:    también

and two more, to avoid known stop words

    (3) Subject: aausdfrhyetdwgyatrdf
        Body:    aausdfrhyetdwgyatrdf

    (4) Subject: aausdfrhy?tdwgyatrdf
        Body:    aausdfrhy?tdwgyatrdf

1st,

    doveadm fts rescan -u myuser@example.com
    doveadm index -u myuser@example.com -q '*'

TBird/solr searches,

    Subject: tambien               ==> FOUND
    Subject: también               ==> FOUND
    Subject: aausdfrhyetdwgyatrdf  ==> FOUND
    Subject: aausdfrhy?tdwgyatrdf  ==> FOUND

    Body: tambien                  ==> FOUND
    Body: también                  ==> (empty)
    Body: aausdfrhyetdwgyatrdf     ==> FOUND
    Body: aausdfrhy?tdwgyatrdf     ==> (empty)

suggests it's _not_ (just) an existing-stopword problem

notable/odd that subject searches are OK, but not body.

On 10/18/20 2:58 PM, Shawn Heisey wrote:
...
> If you are using something like the following schema:
> https://raw.githubusercontent.com/dovecot/core/master/doc/solr-schema-7.7.0.xml

I am

> Solr does have a set of ICU filters, which I would recommend using rather than the lowercase filter

I'll give that a try; haven't used solr outside of the dovecot context -- so need to find a doc/example on how, exactly, that's done correctly.

> I cannot say much about the panic you're getting when using the doveadm command. The stacktrace says it is happening in
> dovecot code, not Solr code. And it looks like the panic had nothing to do with FTS or Solr ... what I see points to
> mailbox storage code.

again/still

    doveadm fts lookup -u myuser@example.com <any key> "<any str>"

_all_ panic, as above,

    doveadm(myuser@example.com): Panic: file mail-storage.c: line 2112 (mailbox_get_open_status): assertion failed: (box->opened)
    doveadm(myuser@example.com): Error: Raw backtrace: /usr/lib64/dovecot/libdovecot.so.0(backtrace_append+0x46) [0x7f61bba4ecc6] -> /usr/lib64/dovecot/libdovecot.so.0(backtrace_get+0x22) [0x7f61bba4ede2] -> /usr/lib64/dovecot/libdovecot.so.0(+0x10025b) [0x7f61bba5825b] -> /usr/lib64/dovecot/libdovecot.so.0(+0x100297) [0x7f61bba58297] -> /usr/lib64/dovecot/libdovecot.so.0(+0x59bc6) [0x7f61bb9b1bc6] -> /usr/lib64/dovecot/libdovecot-storage.so.0(+0x4779e) [0x7f61bbb6579e] -> /usr/lib64/dovecot/lib21_fts_solr_plugin.so(+0x5849) [0x7f61bb5b7849] -> /usr/lib64/dovecot/lib20_fts_plugin.so(fts_backend_lookup+0x51) [0x7f61bb1d9491] -> /usr/lib64/dovecot/doveadm/lib20_doveadm_fts_plugin.so(+0x3280) [0x7f61bb14b280] -> doveadm(+0x343cd) [0x55f5def873cd] -> doveadm(+0x34fe0) [0x55f5def87fe0] -> doveadm(doveadm_cmd_ver2_to_mail_cmd_wrapper+0x22d) [0x55f5def88e2d] -> doveadm(doveadm_cmd_run_ver2+0x4e8) [0x55f5def998d8] -> doveadm(doveadm_cmd_try_run_ver2+0x3e) [0x55f5def9992e] -> doveadm(main+0x1d4) [0x55f5def77cf4] -> /lib64/libc.so.6(__libc_start_main+0xf2) [0x7f61bb613042] -> doveadm(_start+0x2e) [0x55f5def781ce]
    Aborted

Hopefully dovecot devs might comment further.

I'll see what I find with using the ICU filters -- if perhaps anything changes
PGNet Dev
2020-Oct-19 00:33 UTC
re-reading your mail ...

On 10/18/20 2:58 PM, Shawn Heisey wrote:
> I do not use the fts-solr plugin, because my mail host in AWS does not have enough memory for that.

is it that you're not using the dovecot plugin, but _DO_ have solr search setup? by what method/means?

or that you're avoiding solr usage altogether?
PGNet Dev
2020-Oct-19 17:36 UTC
exec'ing search from Roundcube client, instead of TBird, accented-text search WORKS in both cases,

"subject = aausdfrhy?tdwgyatrdf" only,

    2020-10-19 17:28:17.847 INFO (qtp1533985074-18) [ x:dovecot] o.a.s.c.S.Request [dovecot] webapp=/solr path=/select params={q={!lucene+q.op%3DAND}subject:aausdfrhy?tdwgyatrdf+OR+subject:aausdfrhy?tdwgyatrdf+OR+subject:aausdfrhyetdwgyatrdf&fl=uid,score&sort=uid+asc&fq=%2Bbox:c92f64f79f0d1ed01e6d5b314f04886c+%2Buser:testuser@example.com&rows=135790&wt=xml} hits=4 status=0 QTime=8

"body = aausdfrhy?tdwgyatrdf" only

    2020-10-19 17:28:27.802 INFO (qtp1533985074-53) [ x:dovecot] o.a.s.c.S.Request [dovecot] webapp=/solr path=/select params={q={!lucene+q.op%3DAND}body:aausdfrhy?tdwgyatrdf+OR+body:aausdfrhy?tdwgyatrdf+OR+body:aausdfrhyetdwgyatrdf&fl=uid,score&sort=uid+asc&fq=%2Bbox:c92f64f79f0d1ed01e6d5b314f04886c+%2Buser:testuser@example.com&rows=135790&wt=xml} hits=4 status=0 QTime=25

note the apparent folded search for _all_ of

    aausdfrhy?tdwgyatrdf
    aausdfrhy?tdwgyatrdf
    aausdfrhyetdwgyatrdf

this^ is with normalizer-icu _out_ of the loop, due to the apparent missing libicu lib links in pkg build
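The archive renders both accented variants in the logged query as "?", so they cannot be told apart here. Against an original Solr request log, a small helper along the following lines could pull the search terms and hit count out of a request line and print each term's code points, which would show whether two look-alike terms differ (for example, a precomposed versus a decomposed accent). This is a sketch only: the function names are made up, the regular expression assumes the `params={...} hits=N` layout seen in the log lines above, and the sample line in the usage note is hypothetical.

```python
import re
from urllib.parse import unquote_plus


def parse_solr_request(line: str):
    """Return ([(field, term), ...], hits) from one Solr request-log line."""
    # Assumes the "params={...} hits=N" layout shown in the log lines above.
    match = re.search(r"params=\{(.*)\}\s+hits=(\d+)", line)
    if match is None:
        return [], None
    raw_params, hits = match.group(1), int(match.group(2))

    # Find the q= parameter and undo the URL encoding ('+' -> space, %3D -> '=').
    query = ""
    for param in raw_params.split("&"):
        if param.startswith("q="):
            query = unquote_plus(param[2:])
            break

    # Strip the leading local-params block, e.g. "{!lucene q.op=AND}".
    query = re.sub(r"^\{![^}]*\}", "", query)

    terms = []
    for clause in query.split(" OR "):
        field, _, term = clause.partition(":")
        if term:
            terms.append((field.strip(), term.strip()))
    return terms, hits


def code_points(term: str) -> str:
    """Render a term as U+XXXX code points so look-alike strings can be compared."""
    return " ".join(f"U+{ord(ch):04X}" for ch in term)
```

Usage, with a hypothetical log line: `terms, hits = parse_solr_request(line)` followed by `for field, term in terms: print(field, code_points(term))`. Two terms that display identically but print different code-point sequences are distinct strings as far as the index is concerned.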
PGNet Dev
2020-Oct-19 18:08 UTC
for any interested, let's see how far this gets:

TBird "search on server" doesn't -- no comm with backend IMAP/Solr; appears to be local-only search?

    https://groups.google.com/g/mozilla.dev.apps.thunderbird/c/SP-r2OEMZ24
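One way to take the mail client out of the loop is to send the IMAP SEARCH directly. The sketch below uses Python's standard `imaplib`; it is a minimal example under stated assumptions, not a tested diagnostic for this setup. `imaplib` encodes command arguments as ASCII, so the non-ASCII term is passed as an IMAP literal through the connection's `literal` attribute. The helper names, and any host, user or password passed in, are placeholders.

```python
import imaplib


def search_body_utf8(conn, term: str):
    """Run 'SEARCH CHARSET UTF-8 BODY <term>' and return message sequence numbers."""
    # The term is attached as a literal so that non-ASCII bytes reach the server intact.
    conn.literal = term.encode("utf-8")
    status, data = conn.search("UTF-8", "BODY")
    if status != "OK":
        raise RuntimeError(f"SEARCH failed: {status} {data!r}")
    # data[0] is a space-separated byte string of matches, empty when nothing matched.
    return [int(num) for num in data[0].split()] if data and data[0] else []


def run_checks(host: str, user: str, password: str, terms):
    """Log in, open INBOX read-only, and report the matches for each term."""
    conn = imaplib.IMAP4_SSL(host)  # host is a placeholder supplied by the caller
    try:
        conn.login(user, password)
        conn.select("INBOX", readonly=True)
        return {term: search_body_utf8(conn, term) for term in terms}
    finally:
        conn.logout()
```

Calling, say, `run_checks("mail.example.com", "myuser@example.com", password, ["tambien", "tambi\u00e9n"])` while watching the Dovecot and Solr logs would show whether the server-side search itself handles the accented term, independently of what a given client sends.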