FTS indexing is something I hear quite often nowadays. I?ve added some hacks to
make it work better for some installations, but it?s about time to think about
the whole design and how it could be improved for everyone in future. Here are
some of my initial thoughts.
Currently Dovecot supports 3 full text search engines: Solr, CLucene and Dovecot
Squat. CLucene plugin has various features built in, which should have been
built in a generic way to work with all the engines (although Solr has most of
those already built-in). Squat was abandoned a few years ago in favor of
Solr/CLucene, but perhaps it could be brought back to life, since it looks like
its index sizes could be smaller than Lucene's.
Here's a list of things that should be added to generic Dovecot FTS code to
improve all the backends:
1. Support for multiple languages. Use textcat while indexing to guess the
language of the indexed data. (Perhaps run it separately for each paragraph to
handle multi-language mails? Or at least many emails begin/end with different
language than the text in the middle, e.g. "Foo Bar wrote:" is often
in various languages.) Index the data using the detected language's stemming
and other features. Keep track of which languages have been used in the index,
and when searching stem the search words to all the used languages. Since each
added language requires additional searches and there's the possibility of
wrong detection, the list of allowed languages could be configurable. See also
http://ntextcat.codeplex.com/ or at least change textcat to use UTF8.
2. Word stemming. This can be done for many languages with Snowball library.
Solr has also implemented several other languages, perhaps its code can be
somehow automatically translated to C(++) for use with Dovecot?
3. Don't index language-specific stopwords. We can get the word lists from
e.g. Solr.
4. Try to detect compound words and index each part separately for languages
that use them. http://wiki.apache.org/solr/LanguageAnalysis#Decompounding
suggests two possible ways to do it.
5. Normalize words (e.g. drop diacritics). libicu can be used for this.
6. Drop (Unicode) characters that don't belong to the language? Or
especially don't index most of the weird Unicode characters. This would
avoid filling the index with unnecessary garbage.
7. Don't index non-text data? For example if there is large block of base64
data or something else that definitely doesn't look like text, it's
pretty useless to index it. Then again, we do want to index all kinds of IDs
that someone might want to search. This could be a bit difficult to implement
well.
8. Index attachments separately, so it would be possible to search only
attachments. (Should "SEARCH BODY word1 BODY word2" return matches if
word1 and word2 are in different attachments?)
9. Attachments can be translated to indexable UTF-8 text already with
fts_decoder setting by doing it via a conversion script. This could also support
Apache Tika server directly.
10. It should be configurable which fields are indexed. Body and header would
always be separately indexed. Optionally there could be also at least:
attachments, From, To, Cc, Bcc and Subject. The From/To/Cc/Bcc could also be
indexed together in one "addresses" field. The more fields there are,
the larger the index, but better/faster search results.
11. Each indexed mail should have metadata: Mailbox GUID, mail UID and the
language the mail was indexed with. For attachments there should also be the
MIME part number. When matching results, drop results if returned language
doesn't match the query language.
Squat
-----
Currently Squat index consists of a trie containing all the words and pointer to
a file listing all the message UIDs that contain them. Each node in the trie has
a pointer to the UIDs, so e.g. with "abc" the "a" node will
contain UIDs of all mails that contain the "a" letter (e.g. 1,3-5,10).
"ab" node will contain mails that have the "ab" substring.
Since the "ab" is a subset of "a", the "ab"
won't contain UIDs directly but instead it contains indexes to the
"a" list to get a better compression (e.g. UID 3-5,10 -> 2-4
indexes in the "1,3-5,10" list). The "abc" node then
similarly refers to the "ab" node's indexes.
It's configurable how long words Squat will index. Also substring matching
is configurable. By default both are 4 letters, so words longer than 4 letters
will be split to 4 letter pieces which are indexed (e.g. "dovecot"
-> "dove", "ovec", "veco", "ecot").
When searching these pieces are looked up and the results are merged.
It's pretty pointless to do a search for 1-2 letter substrings. Most likely
the user wants to find 1-2 letter word instead. Perhaps this is true also for 3
letters? The Squat index could be changed to only add results for the first 1-2
(or 1-3?) letters only for full words, not to word prefixes. This of course
would mean that the "ab" referring to "a" UID list would no
longer work for the first nodes.
Substring searching likely wouldn't work very nicely for stemmed words. So
Squat should probably index the full stemmed word and then also index the
unstemmed word in the small 4 letter pieces. It should be possible to also
disable substring searching entirely.
Squat already attempts to reduce disk space by encoding the common characters
with less bits than other characters. This is hardcoded for English language
though. Each index compression could analyze the most common letters and
dynamically use those. Perhaps different languages could even use separate Squat
index files to get the most advantage of this? Although if there are a lot of
languages it's a bit annoying to have many different index files.
Squat currently indexes each Unicode character as one letter, so there can be
quite a huge number of different 4 letter words. This substring indexing should
probably be disabled for words that contain letters not used by the current
language.
The UID lists are not compressed very efficiently. See
http://www2009.eprints.org/41/1/p401.pdf for alternatives.
There is currently one Squat index per mailbox. There should be one per user.
This requires adding a separate ID -> { mailbox GUID, UID, language, mime
part } mapping file.
Squat indexes should be updated in two parts. For old mails there are the large
trie+uidlist base files in optimized format. For new mails there are small
trie+uidlist files in a format where they are just appended to. When the trie or
uidlist becomes large enough, it's first internally sorted/optimized, and
then merged with the base files by recreating new optimized base files. This
fixes the scalability problem with current large Squat indexes where with large
unsorted indexes the sorting could have taken forever.
Since substring searches in Squat produce only "maybe matches" UID
lists, the mails still need to be opened and searched through. This step could
probably also do the same language detection + stemming as is done during
indexing to improve the search results. (And this feature could be enabled even
if no FTS index is enabled.)