thr3ads.net - dovecot - [Dovecot] Full text search indexing [Apr 2006]

If this information is useful, please help other people find it:
Share via:

Timo Sirainen

2006-Apr-03 07:44 UTC

[Dovecot] Full text search indexing

SEARCH is probably the slowest command that Dovecot has, especially when
searching texts from message bodies. There exists a lot of text search
engines, but unfortunately as far as I know there exists only one that
can be used with IMAP.

The reason why IMAP is special is because it requires that when
searching eg. "ello", it must match also mails which contain eg. word
"hello". Most search engines support only matching either complete
words
or matching from the beginning of the words.

So, the one usable engine which I know of is Cyrus's squat indexing. It
works basically by indexing the mails in 4 byte blocks and storing them
in a trie. So if there exists a string "he llo", it indexes "he
l",
"e ll" and " llo" strings. When searching it then gets the
first 4 bytes
of the search keyword, looks them up from the squat index and then once
it gets a list of matched mails, it reads the mails and does a proper
search for only them. So the squat index doesn't give exact results, but
only narrows down the number of mails which have to be searched.

The biggest problem with those squat indexes is that it doesn't support
incremental updates, so there has to be some nightly runs to rebuild the
indexes from scratch. And with lots of users that's not practical, so I
think that's the reason why the squat indexes aren't all that commonly
used with Cyrus.

I was thinking about taking the basic ideas from the Cyrus squat indexer
but making it possible to update it incrementally so Dovecot could do it
automatically. Also to reduce space I was thinking about sorting the 4
byte strings so instead of eg. "abcd", "dcba" and
"abdc" all being
indexed, they could all be stored as "abcd" (so the search key would
also have to be sorted before use). But that might cause larger result
sets making the search slower, so I'll have to test if it's worth it.

Unicode characters might also be a problem. If the characters are stored
in 16bit characters, it would waste space. If they're stored in UTF-8
the 4 bytes might not be enough to give a useful index (as a single
character might take even 5 bytes). One thing I was thinking was to not
index in 4 byte blocks, but 4 character blocks in UTF-8, so the trie
could be 4..20 levels deep depending on what unicode characters are used
(in finnish text it would probably be max. 7 levels deep).

So, now would be a good time to tell if you can think of some better
ideas how to do this, especially if you know about some new algorithms
specifically designed for this kind of searching. :) Pretty much all the
algorithms I've found about this were suffix trees/arrays and this basic
trie thing.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:
<http://dovecot.org/pipermail/dovecot/attachments/20060403/d2314789/attachment.bin>

Jens Laas

2006-Apr-06 07:33 UTC

head link

[Dovecot] Full text search indexing

(06.04.03 kl.10:44) Timo Sirainen skrev f?ljande till dovecot at dovecot.org:
> SEARCH is probably the slowest command that Dovecot has, especially when
> searching texts from message bodies. There exists a lot of text search
> engines, but unfortunately as far as I know there exists only one that
> can be used with IMAP.
>
> The reason why IMAP is special is because it requires that when
> searching eg. "ello", it must match also mails which contain eg.
word
> "hello". Most search engines support only matching either
complete words
> or matching from the beginning of the words.
>
> So, the one usable engine which I know of is Cyrus's squat indexing. It
> works basically by indexing the mails in 4 byte blocks and storing them
> in a trie. So if there exists a string "he llo", it indexes
"he l",
> "e ll" and " llo" strings. When searching it then gets
the first 4 bytes
> of the search keyword, looks them up from the squat index and then once
> it gets a list of matched mails, it reads the mails and does a proper
> search for only them. So the squat index doesn't give exact results,
but
> only narrows down the number of mails which have to be searched.
>
How would you search for '*lo' in 'hello' ? Only 'ello'
would be indexed.
> Unicode characters might also be a problem. If the characters are stored
> in 16bit characters, it would waste space. If they're stored in UTF-8
> the 4 bytes might not be enough to give a useful index (as a single
> character might take even 5 bytes). One thing I was thinking was to not
> index in 4 byte blocks, but 4 character blocks in UTF-8, so the trie
> could be 4..20 levels deep depending on what unicode characters are used
> (in finnish text it would probably be max. 7 levels deep).
Using UTF-8 in the index sounds good, but how hard would it be in 
practise? The non-ascii characters are already encoded someway in the 
mails arent they?
>
> So, now would be a good time to tell if you can think of some better
> ideas how to do this, especially if you know about some new algorithms
> specifically designed for this kind of searching. :) Pretty much all the
> algorithms I've found about this were suffix trees/arrays and this
basic
> trie thing.
>
I am very interested in this area.
The best full text search method I've yet to come across is the suffix 
index.
You index every suffix of a word.
For 'hello':
'hello'
'ello'
'llo'
'lo'
'o'

So the index maps:
suffix -> word -> list ( id1 .. idN )

You can also map word to a phrase as an intermediate step.

I feel certain there must be better methods out there.

Cheers
Jens

-----------------------------------------------------------------------
     'Old C programmers don't die ... they're just cast into
void*'
-----------------------------------------------------------------------
     Jens L??s                              Email: jens.laas at data.slu.se
     Department of Computer Services, SLU   Phone: +46 18 67 35 15
     Vindbrov?gen 1
     P.O. Box 7079
     S-750 07 Uppsala
     SWEDEN
-----------------------------------------------------------------------

Apparently Analagous Threads

Search for more reasonably related threads

dovecot - Apr 2006 - Full text search indexing

[Dovecot] Full text search indexing

[Dovecot] Full text search indexing

Apparently Analagous Threads