Displaying 2 results from an estimated 2 matches for "languageanalysis".
2011 May 23
[PATCH] Indexing mail attachments with Dovecot + Solr
...and Stemming (Spanish by default).
Attachment indexing is provided by Solr Cell and Tika (ExtractingRequestHandler)
* http://wiki.apache.org/solr/ExtractingRequestHandler
Synonyms and Stemming are provided by SnowballPorterFilterFactory from
Solr Language Analysis:
* http://wiki.apache.org/solr/LanguageAnalysis
We have tested Solr with Tomcat and Jetty. Tomcat handles UTF-8 and
larger POSTs better.
Supported attachment file formats:
* http://tika.apache.org/0.9/formats.html
At present, attachments inside attachments (for example, attachments
in forwarded "eml" attachments) are not ind...
2013 Nov 30
Full text search improvements
...perhaps its code can be somehow automatically translated to C(++) for use with Dovecot?
3. Don't index language-specific stopwords. We can get the word lists from e.g. Solr.
4. Try to detect compound words and index each part separately for languages that use them. http://wiki.apache.org/solr/LanguageAnalysis#Decompounding suggests two possible ways to do it.
5. Normalize words (e.g. drop diacritics). libicu can be used for this.
6. Drop (Unicode) characters that don't belong to the language? Or especially don't index most of the weird Unicode characters. This would avoid filling the index wit...
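
To make points 3, 5 and 6 above concrete, here is a rough sketch in Python rather than C, using only the standard library instead of libicu. Everything in it is an assumption for illustration: the function names, the tiny STOPWORDS set (a real list would come from a per-language source such as Solr's), and the choice to keep only letters and digits. It is not how Dovecot does or would do this.

```python
import unicodedata
from typing import List

# Hypothetical, deliberately tiny stopword list; illustration only.
STOPWORDS = {"the", "a", "and", "of"}


def strip_diacritics(word: str) -> str:
    """Decompose to NFD and drop the combining marks (point 5)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text: str) -> List[str]:
    """Return the words of `text` that would be worth indexing."""
    terms = []
    for raw in text.split():
        word = strip_diacritics(raw)
        # Point 6: keep letters and digits, drop symbols and punctuation.
        word = "".join(ch for ch in word if ch.isalnum()).casefold()
        # Point 3: skip empty results and language-specific stopwords.
        if word and word not in STOPWORDS:
            terms.append(word)
    return terms
```

For instance, index_terms("The café and the naïve résumé ☃") yields ['cafe', 'naive', 'resume']. Note that blindly dropping diacritics also folds letters that are distinct in some languages (Spanish "ñ" becomes "n"), so a real implementation would probably need per-language rules.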