Displaying 2 results from an estimated 2 matches for "languageanalysis".

2011 May 23
3
[PATCH] Indexing mail attachments with Dovecot + Solr
...and Stemming (Spanish by default).
Attachment indexing is provided by Solr Cell and Tika (ExtractingRequestHandler):
* http://wiki.apache.org/solr/ExtractingRequestHandler
Synonyms and Stemming are provided by SnowballPorterFilterFactory from Solr Language Analysis:
* http://wiki.apache.org/solr/LanguageAnalysis
We have tested Solr with Tomcat and Jetty. Tomcat handles UTF-8 and larger POST requests better.
Supported attachment file formats:
* http://tika.apache.org/0.9/formats.html
At present, nested attachments (for example, attachments inside forwarded "eml" attachments) are not ind...
2013 Nov 30
4
Full text search improvements
...perhaps its code can be somehow automatically translated to C(++) for use with Dovecot?
3. Don't index language-specific stopwords. We can get the word lists from e.g. Solr.
4. Try to detect compound words and index each part separately for languages that use them. http://wiki.apache.org/solr/LanguageAnalysis#Decompounding suggests two possible ways to do it.
5. Normalize words (e.g. drop diacritics). libicu can be used for this.
6. Drop (Unicode) characters that don't belong to the language? Or especially don't index most of the weird Unicode characters. This would avoid filling the index wit...
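
Items 3 and 5 above (skipping stopwords and dropping diacritics) can be illustrated with a small sketch. This is not Dovecot code and does not use libicu: it is a Python approximation that relies on the standard-library unicodedata module, and the names STOPWORDS, strip_diacritics and index_terms, as well as the tiny English word list, are hypothetical and chosen only for illustration. A real word list would come from an external source such as the Solr lists mentioned above.

```python
import unicodedata
from typing import List

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"the", "a", "and", "of"}


def strip_diacritics(word: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text: str) -> List[str]:
    """Return the normalized words that would be sent to the index."""
    terms = []
    for raw in text.split():
        # Case-fold first, then remove diacritics.
        word = strip_diacritics(raw.casefold())
        # Keep only letters and digits; this discards punctuation and symbols.
        word = "".join(ch for ch in word if ch.isalnum())
        if word and word not in STOPWORDS:
            terms.append(word)
    return terms


if __name__ == "__main__":
    print(index_terms("The café and the señor"))
```

Stopwords are checked after normalization here, so the word list itself would need to be stored in normalized form. Splitting on whitespace is a simplification and does not address the compound-word detection from item 4.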