thr3ads.net - dovecot - [Dovecot] Full text search improvements [Nov 2013]

If this information is useful, please help other people find it:
Share via:

Timo Sirainen

2013-Nov-30 17:18 UTC

[Dovecot] Full text search improvements

FTS indexing is something I hear quite often nowadays. I?ve added some hacks to
make it work better for some installations, but it?s about time to think about
the whole design and how it could be improved for everyone in future. Here are
some of my initial thoughts.

Currently Dovecot supports 3 full text search engines: Solr, CLucene and Dovecot
Squat. CLucene plugin has various features built in, which should have been
built in a generic way to work with all the engines (although Solr has most of
those already built-in). Squat was abandoned a few years ago in favor of
Solr/CLucene, but perhaps it could be brought back to life, since it looks like
its index sizes could be smaller than Lucene's.

Here's a list of things that should be added to generic Dovecot FTS code to
improve all the backends:

1. Support for multiple languages. Use textcat while indexing to guess the
language of the indexed data. (Perhaps run it separately for each paragraph to
handle multi-language mails? Or at least many emails begin/end with different
language than the text in the middle, e.g. "Foo Bar wrote:" is often
in various languages.) Index the data using the detected language's stemming
and other features. Keep track of which languages have been used in the index,
and when searching stem the search words to all the used languages. Since each
added language requires additional searches and there's the possibility of
wrong detection, the list of allowed languages could be configurable. See also
http://ntextcat.codeplex.com/ or at least change textcat to use UTF8.

2. Word stemming. This can be done for many languages with Snowball library.
Solr has also implemented several other languages, perhaps its code can be
somehow automatically translated to C(++) for use with Dovecot?

3. Don't index language-specific stopwords. We can get the word lists from
e.g. Solr.

4. Try to detect compound words and index each part separately for languages
that use them. http://wiki.apache.org/solr/LanguageAnalysis#Decompounding
suggests two possible ways to do it.

5. Normalize words (e.g. drop diacritics). libicu can be used for this.

6. Drop (Unicode) characters that don't belong to the language? Or
especially don't index most of the weird Unicode characters. This would
avoid filling the index with unnecessary garbage.

7. Don't index non-text data? For example if there is large block of base64
data or something else that definitely doesn't look like text, it's
pretty useless to index it. Then again, we do want to index all kinds of IDs
that someone might want to search. This could be a bit difficult to implement
well.

8. Index attachments separately, so it would be possible to search only
attachments. (Should "SEARCH BODY word1 BODY word2" return matches if
word1 and word2 are in different attachments?)

9. Attachments can be translated to indexable UTF-8 text already with
fts_decoder setting by doing it via a conversion script. This could also support
Apache Tika server directly.

10. It should be configurable which fields are indexed. Body and header would
always be separately indexed. Optionally there could be also at least:
attachments, From, To, Cc, Bcc and Subject. The From/To/Cc/Bcc could also be
indexed together in one "addresses" field. The more fields there are,
the larger the index, but better/faster search results.

11. Each indexed mail should have metadata: Mailbox GUID, mail UID and the
language the mail was indexed with. For attachments there should also be the
MIME part number. When matching results, drop results if returned language
doesn't match the query language.

Squat
-----

Currently Squat index consists of a trie containing all the words and pointer to
a file listing all the message UIDs that contain them. Each node in the trie has
a pointer to the UIDs, so e.g. with "abc" the "a" node will
contain UIDs of all mails that contain the "a" letter (e.g. 1,3-5,10).
"ab" node will contain mails that have the "ab" substring.
Since the "ab" is a subset of "a", the "ab"
won't contain UIDs directly but instead it contains indexes to the
"a" list to get a better compression (e.g. UID 3-5,10 -> 2-4
indexes in the "1,3-5,10" list). The "abc" node then
similarly refers to the "ab" node's indexes.

It's configurable how long words Squat will index. Also substring matching
is configurable. By default both are 4 letters, so words longer than 4 letters
will be split to 4 letter pieces which are indexed (e.g. "dovecot"
-> "dove", "ovec", "veco", "ecot").
When searching these pieces are looked up and the results are merged.

It's pretty pointless to do a search for 1-2 letter substrings. Most likely
the user wants to find 1-2 letter word instead. Perhaps this is true also for 3
letters? The Squat index could be changed to only add results for the first 1-2
(or 1-3?) letters only for full words, not to word prefixes. This of course
would mean that the "ab" referring to "a" UID list would no
longer work for the first nodes.

Substring searching likely wouldn't work very nicely for stemmed words. So
Squat should probably index the full stemmed word and then also index the
unstemmed word in the small 4 letter pieces. It should be possible to also
disable substring searching entirely.

Squat already attempts to reduce disk space by encoding the common characters
with less bits than other characters. This is hardcoded for English language
though. Each index compression could analyze the most common letters and
dynamically use those. Perhaps different languages could even use separate Squat
index files to get the most advantage of this? Although if there are a lot of
languages it's a bit annoying to have many different index files.

Squat currently indexes each Unicode character as one letter, so there can be
quite a huge number of different 4 letter words. This substring indexing should
probably be disabled for words that contain letters not used by the current
language.

The UID lists are not compressed very efficiently. See
http://www2009.eprints.org/41/1/p401.pdf for alternatives.

There is currently one Squat index per mailbox. There should be one per user.
This requires adding a separate ID -> { mailbox GUID, UID, language, mime
part } mapping file.

Squat indexes should be updated in two parts. For old mails there are the large
trie+uidlist base files in optimized format. For new mails there are small
trie+uidlist files in a format where they are just appended to. When the trie or
uidlist becomes large enough, it's first internally sorted/optimized, and
then merged with the base files by recreating new optimized base files. This
fixes the scalability problem with current large Squat indexes where with large
unsorted indexes the sorting could have taken forever.

Since substring searches in Squat produce only "maybe matches" UID
lists, the mails still need to be opened and searched through. This step could
probably also do the same language detection + stemming as is done during
indexing to improve the search results. (And this feature could be enabled even
if no FTS index is enabled.)

Mike Abbott

2013-Dec-02 18:50 UTC

head link

[Dovecot] Full text search improvements

> how [FTS indexing] could be improved for everyone in future
For sites which set client_limit > 1 it would help performance not to stall
for INDEXER_WAIT_MSECS when polling the indexer for input.  Currently dovecot
unwinds back out to the main command loop repeatedly to allow other clients to
use the process but it also stalls the whole process for INDEXER_WAIT_MSECS
every time it finds no input from the indexer, which hurts responsiveness for
those other clients.  This can be avoided by removing the client's I/O from
the main ioloop and adding the indexer's instead, or perhaps by leveraging
CLIENT_COMMAND_STATE_WAIT_EXTERNAL.

Third-party FTS implementations may benefit from having the NOT/AND/OR
seq_range_array merging logic in squat_lookup_arg() generalized and made
available to all.

It would also be helpful if FTS expunge were asynchronous, but this is not
critical.

Metro Domain Admin

2013-Dec-04 15:22 UTC

head link

[Dovecot] Full text search improvements

Substring match is important to us, so we'd love to see Squat reinstated 
with speed improvements. It seems like Solr can handle substrings as 
well ([Edge]NGramFilterFactory), but for small deployments, having the 
engine built right in is a plus.

Michael M Slusarz

2013-Dec-04 17:20 UTC

head link

[Dovecot] Full text search improvements

Quoting Timo Sirainen <tss at iki.fi>:
> 1. Support for multiple languages. Use textcat while indexing to  
> guess the language of the indexed data.
FWIW, you could probably use the Content-Language header (if it  
exists) to at least give a hint.  No guarantee it is correct, but it's  
a better starting place than simply scanning all languages.

And, for that matter, you could leverage Accept-Language also (again,  
if it exists).  Which might be more useful, since it lists all the  
languages the user recognizes.

michael

Steffen Kaiser

2013-Dec-05 08:40 UTC

head link

[Dovecot] Full text search improvements

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, 30 Nov 2013, Timo Sirainen wrote:
> 7. Don't index non-text data? For example if there is large block of 
> base64 data or something else that definitely doesn't look like text, 
> it's pretty useless to index it. Then again, we do want to index all 
> kinds of IDs that someone might want to search. This could be a bit 
> difficult to implement well.
> 9. Attachments can be translated to indexable UTF-8 text already with 
> fts_decoder setting by doing it via a conversion script. This could also 
> support Apache Tika server directly.
This means some kind of MIME type based (or file type guesser) "... to 
UTF8 text" converter script? Some users would find that very very very ^ n 
nice. There are already several programs used in the field of CMS.

- -- 
Steffen Kaiser
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQEVAwUBUqA8A13r2wJMiz2NAQLQYwf/bAyrg080/i2khM/XGXLlhjlcPcyxGHym
KgoFFBhh2sgfl+ecRHCM4BP+WX/c5coxAScyXhSy9JjwcQz8MXUHzkbGL4d8kwa4
pgdhaD4hFhPqpOJGf1ULwBSIBEsJfZeHaOkJHlMqDgd3yKY5APoJPKJtG2z+lI+7
vqR/Pe8n8EhCcWcLC1CfEGKxcci09XYj09Sai96VGbCO2coVCm+xIKRSCW6pasoQ
NTqpJBTCe2gCD3KdVA5jUNqFeEj2AQF5+nkujtSF4B1G/xrpfoABLkJ+lyQ8F5hc
DTJFiHhlvJKRIIKbhuyQukeqDSzeln2UtSRce3q59fek4foFzDrhTw==l3mf
-----END PGP SIGNATURE-----

Apparently Analagous Threads

Search for more seemingly similar threads

dovecot - Nov 2013 - Full text search improvements

[Dovecot] Full text search improvements

[Dovecot] Full text search improvements

[Dovecot] Full text search improvements

[Dovecot] Full text search improvements

[Dovecot] Full text search improvements

Apparently Analagous Threads