vuser1 at test123.ru
2009-Nov-17 17:53 UTC
[Dovecot] fts squat non-english search for 2 words
Hello,

It looks like I have encountered a bug or a misconfiguration. fts_squat search on subject and body works very well for English mails. For non-English mails (in particular, Russian) it works only when the query consists of one word. Phrases of two or more words always return nothing.

Example: a search for "planet" ("???????") returns results, and a search for "Earth" ("?????") also returns results, but "planet Earth" ("??????? ?????") returns nothing, even though there are emails containing the exact phrase "planet Earth". The problem occurs only for non-English queries, both in subject and in body searches.

I tried the Horde 3.2 webmail and Thunderbird. When I turned the fts plugin off, phrases of two or more Russian words were found correctly, so the problem is in squat. Is it a bug or a known config issue?

OS: Debian 5.0, installed as an OpenVZ container inside Ubuntu 8.04
Dovecot: 1.2.4, from backports.org
Filesystem: ext3

dovecot -n
--------------
# 1.2.4: /etc/dovecot/dovecot.conf
# OS: Linux 2.6.24-24-openvz i686 Debian 5.0.3 simfs
log_timestamp: %Y-%m-%d %H:%M:%S
protocols: imap imaps pop3 pop3s
ssl_cert_file: /etc/ssl/test123/test123.full.crt
ssl_key_file: /etc/ssl/test123/priv/test123.key
disable_plaintext_auth: no
login_dir: /var/run/dovecot/login
login_executable(default): /usr/lib/dovecot/imap-login
login_executable(imap): /usr/lib/dovecot/imap-login
login_executable(pop3): /usr/lib/dovecot/pop3-login
last_valid_uid: 500
mail_privileged_group: mail
mail_location: maildir:/var/mail/%u
mbox_write_locks: fcntl dotlock
mail_executable(default): /usr/lib/dovecot/imap
mail_executable(imap): /usr/lib/dovecot/imap
mail_executable(pop3): /usr/lib/dovecot/pop3
mail_plugins(default): quota imap_quota fts fts_squat
mail_plugins(imap): quota imap_quota fts fts_squat
mail_plugins(pop3):
mail_plugin_dir(default): /usr/lib/dovecot/modules/imap
mail_plugin_dir(imap): /usr/lib/dovecot/modules/imap
mail_plugin_dir(pop3): /usr/lib/dovecot/modules/pop3
imap_client_workarounds(default): outlook-idle delay-newmail
imap_client_workarounds(imap): outlook-idle delay-newmail
imap_client_workarounds(pop3):
pop3_client_workarounds(default):
pop3_client_workarounds(imap):
pop3_client_workarounds(pop3): outlook-no-nuls oe-ns-eoh
lda:
  postmaster_address: postmaster at test123.ru
  hostname: test123.ru
  sendmail_path: /usr/sbin/sendmail
  auth_socket_path: /var/run/dovecot/auth-master
  mail_plugins: quota sieve
  log_path:
  info_log_path:
auth default:
  mechanisms: plain login
  user: nobody
  passdb:
    driver: sql
    args: /etc/dovecot/dovecot-sql.conf
  userdb:
    driver: passwd
  userdb:
    driver: sql
    args: /etc/dovecot/dovecot-sql.conf
  userdb:
    driver: prefetch
  socket:
    type: listen
    client:
      path: /var/spool/postfix/private/auth
      mode: 432
      user: postfix
      group: mail
    master:
      path: /var/run/dovecot/auth-master
      mode: 432
      user: vmail
      group: mail
plugin:
  acl: vfile:/etc/dovecot/acls
  trash: /etc/dovecot/trash.conf
  fts: squat
  fts_squat: partial=4 full=20
---------------
(with full=10 the problem persists)
vuser1 at test123.ru
2009-Nov-18 09:19 UTC
[Dovecot] fts squat non-english search for 2 words
Maybe I asked the wrong question. OK, does anybody use fts_squat for non-English emails? Can you find emails with a query of 2 WORDS, such as "planet Earth"?

On my system this works only when both words are in the Latin alphabet; otherwise the search returns nothing. For Latin queries, it even finds emails that contain both Latin and Russian letters (UTF-8 encoding). For non-Latin queries, the query must consist of one word only.

Thanks for any ideas.
On Wed, 2009-11-18 at 00:53 +0700, vuser1 at test123.ru wrote:
> It looks I encountered a bug or misconfiguration. fts_squat search for
> subject and body works excellent for English mails. For non-English
> (in particular, Russian) it works only when query consists of 1 word.
> Phrases - 2 and more words - always returns nothing. Example: search
> for "planet" ("???????") returns results, search for "Earth"
> ("?????") also returns results, but "planet Earth" ("??????? ?????")
> returns nothing. But there are emails having exact phrase "planet
> Earth". This problem occurs only for non-English queries, both for
> search in subject and in email body.

This should fix it: http://hg.dovecot.org/dovecot-1.2/rev/6541fcc3bf54
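
A later message in this thread describes the linked changeset as "fts-squat: Fixed searching multi-byte characters". As a rough illustration of why Latin and Russian queries can behave differently in byte-oriented code, the sketch below compares character counts with UTF-8 byte counts. It is not Dovecot code and does not reproduce the actual bug. The Russian sample phrase is an assumption, because the archive replaced the original text with question marks, and the helper name `utf8_widths` is hypothetical.

```python
def utf8_widths(text: str) -> tuple:
    """Return (character count, UTF-8 byte count) for the given text."""
    return len(text), len(text.encode("utf-8"))


# ASCII letters take one byte each in UTF-8, so both counts agree.
english = utf8_widths("planet Earth")

# Assumed sample phrase: Cyrillic letters take two bytes each in UTF-8,
# so the byte count is almost double the character count.
russian = utf8_widths("планета Земля")

print("English:", english)
print("Russian:", russian)
```

Any code that mixes up the two counts would only misbehave on the second kind of input, which is consistent with the symptom reported above, although the thread itself does not say where exactly the error was.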
vuser1 at test123.ru
2009-Nov-25 17:00 UTC
[Dovecot] fts squat non-english search for 2 words
Timo Sirainen <tss at iki.fi>:
> On Sun, 2009-11-22 at 20:35 +0700, vuser1 at test123.ru wrote:
>> Timo, thank you for the answer. Meanwhile I was trying to set up
>> horde+dovecot+search. Next step was dovecot 1.2.4 + solr 1.4. It
>> works! Now it can find 2 non-latin words.
>> 1) I cannot search by substrings - neither "plane" nor "plane*"
>> finds "planet"
>
> Try if attached patch helps?

Quick answer is "no" 8)). Now the story.

I debugged it and found that, for the search "xxx yyy", the patched plugin generates:

    q=body:"XXX YYY*"

It should be:

    q=body:XXX* +body:YYY*

(not q=body:"XXX*" +body:"YYY*" - the quotation marks do matter).

But even this does not work as expected, because prefix searches (with an asterisk) are case-sensitive. I googled around and found this post: http://michaelkimsal.com/blog/solr-case-sensitivty/comment-page-1/#comment-78198 . It is old (2007), but it looks like Solr is still case-sensitive for *. Because Dovecot upper-cases the query (which I think is right), the search never finds anything.

I played with the Solr admin interface for 3 evenings and I have to say its behaviour is strange. For example, if I send several emails whose body is 2 words, "xxx yyy", "yyy xxx", "xxX yYY" etc. (different case and different word order), a search for "xxx" does not match all of them. Maybe Solr 1.4 is not production-ready yet and 1.3 is better. But that is enough for me; maybe I will return to it next year.

Now I will try to apply your changeset (fts-squat: Fixed searching multi-byte characters) to dovecot 1.2.4 (debian stable). If you think 1.2.8 is better, I will follow your recommendation.
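
The query shape described in this message can be sketched in a few lines. This is only an illustration of the string the poster says should be sent, not the fts_solr plugin's code. The function name `build_prefix_query` is hypothetical, the upper-casing mirrors the poster's observation about Dovecot, and no escaping of Solr special characters is attempted.

```python
def build_prefix_query(field: str, text: str) -> str:
    """Build one unquoted prefix term per word, in the form the poster describes.

    Example shape: body:XXX* +body:YYY*
    """
    # Upper-case the query, as the poster reports Dovecot does.
    words = text.upper().split()
    # One prefix term per word, without quotation marks around the term.
    terms = [f"{field}:{word}*" for word in words]
    # The first term is bare and each following term gets a leading "+",
    # matching the form given in the message.
    return " +".join(terms)


print("q=" + build_prefix_query("body", "xxx yyy"))
```

As the message notes, a query built this way would still match nothing if the prefix search is case-sensitive and the indexed terms are not stored in upper case.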
vuser1 at test123.ru
2010-Jan-07 10:07 UTC
[Dovecot] fts squat non-english search for 2 words
Timo, many thanks for this! I finally installed dovecot 1.2.9 from debian backports, and your fix has solved the problem.

But there is another issue, and it happens for both English and Russian emails:

1) I have a test mailbox with ~27000 emails, big and small, 13 GB in total.
2) A squat search for the single word "planet" runs for 2-4 seconds.
3) A search for another word, "Earth", runs fast as well.
4) A search for "planet Earth" runs for more than 3 minutes! It also uses a lot of I/O - the server's HDD LED blinks constantly during the search.

I use the horde/imp mail client. With Thunderbird/Win32 there is the same search delay. Moreover, Thunderbird can't search for Russian words at all - it always returns no results. I can't believe the problem is in squat's internal design; there must be something wrong in the implementation of the algorithm. There are things to stabilize.

I must say that squat is my preferred FTS engine; as you know, the Solr engine has issues. I am very interested in easy and powerful IMAP search and would like to help you make it even better, as a tester.

Anyway, thank you for a great product!

-----Original Message-----
From: dovecot-bounces+vuser1=test123.ru at dovecot.org [mailto:dovecot-bounces+vuser1=test123.ru at dovecot.org] On Behalf Of Timo Sirainen
Sent: Tuesday, November 24, 2009 12:52 AM
To: vuser1 at test123.ru
Cc: dovecot at dovecot.org
Subject: Re: [Dovecot] fts squat non-english search for 2 words

On Wed, 2009-11-18 at 00:53 +0700, vuser1 at test123.ru wrote:
> It looks I encountered a bug or misconfiguration. [...]

This should fix it: http://hg.dovecot.org/dovecot-1.2/rev/6541fcc3bf54