On the pure Lucene side of things, it stores text entirely as Unicode
(for the most part - the KinoSearch guy has identified that it's the
Java version of Unicode, but for all intents and purposes it's
Unicode). So indexing and storing text containing any character
works just fine, and there is no problem indexing and searching
any of the languages you mention.
The complexity comes in how you _analyze_ the text to extract words.
Lots of rhetorical questions... what to do about accented
characters? Flatten them to the standard unaccented alphabet? Do
you need stemming? What about handling of transliterations and
spell checking?
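To make the "flatten accents" option concrete, here is a minimal
sketch using only the JDK's java.text.Normalizer. It is not a Lucene
filter, and the class and method names are made up for this example.
The sample string is the Hungarian phrase "árvíztűrő tükörfúrógép",
written with \u escapes so the source stays plain ASCII.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Lucene.
public class AccentFolding {

    // Decompose each character into base letter + combining marks (NFD),
    // then drop the combining marks so only the base letters remain.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // A Hungarian phrase that uses every accented vowel.
        String hungarian =
            "\u00e1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p";
        System.out.println(fold(hungarian)); // prints "arvizturo tukorfurogep"
    }
}
```

Whether this is acceptable for Hungarian, where the accents can
distinguish one word from another, is exactly the kind of decision
the analysis step forces you to make.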
One good test - try out the StandardAnalyzer with some Hungarian text
to see what happens. To do this, download the code from Lucene in
Action at http://www.lucenebook.com - it can be tried out easily as
long as you have Ant and a JDK (see the README for details).
Here's a sample output:
$ ant AnalyzerDemo
Buildfile: build.xml
AnalyzerDemo:
[echo]
[echo] Demonstrates analysis of sample text.
[echo]
[echo] Refer to the "Analysis" chapter for much more on this
[echo] extremely crucial topic.
[echo]
[input] Press return to continue...
[input] String to analyze: [This string will be analyzed.]
[echo] Running lia.analysis.AnalyzerDemo...
[java] Analyzing "This string will be analyzed."
[java] WhitespaceAnalyzer:
[java] [This] [string] [will] [be] [analyzed.]
[java] SimpleAnalyzer:
[java] [this] [string] [will] [be] [analyzed]
[java] StopAnalyzer:
[java] [string] [analyzed]
[java] StandardAnalyzer:
[java] [this] [string] [will] [be] [analyzed]
[java] SnowballAnalyzer:
[java] [this] [string] [will] [be] [analyz]
[java] SnowballAnalyzer:
[java] [this] [string] [wil] [be] [analyzed]
[java] SnowballAnalyzer:
[java] [thi] [string] [will] [be] [analyz]
BUILD SUCCESSFUL
Total time: 10 seconds
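If you don't have the book's code handy, the difference between the
first two analyzers in that output can be imitated with plain JDK
calls. This is only a sketch of the behaviour visible above (split on
whitespace versus split on non-letters plus lowercasing); it does not
use Lucene's classes, and the names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical imitation of two tokenizing strategies; not Lucene code.
public class TokenizeSketch {

    // Split on runs of whitespace only; punctuation stays attached.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Split on anything that is not a Unicode letter, then lowercase.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String english = "This string will be analyzed.";
        System.out.println(whitespaceTokens(english)); // [This, string, will, be, analyzed.]
        System.out.println(letterTokens(english));     // [this, string, will, be, analyzed]

        // Same Hungarian phrase as a sentence, capitalised, with a full stop.
        String hungarian =
            "\u00c1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p.";
        System.out.println(whitespaceTokens(hungarian));
        System.out.println(letterTokens(hungarian));
    }
}
```

Because \p{L} covers all Unicode letters, the accented Hungarian
characters stay inside their words here; stemming and stop words are
where the language-specific work starts.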
On 26 Oct 2005, at 07:44, Jean-Etienne Durand wrote:
> Hi,
>
> A few questions about available languages by ferret or lucene
> 1. what are the languages currently supported by the engine? Only
> english?
> 2. is there any plan to have western european languages like
> french, german, or italian?
> 3. is there any plan to have hungarian? Or does it exist already?
>
> Thank you,
> --
> Jean-Etienne Durand
> Mail: etienne dot durand at mail dot com
> Blog: http://spaces.msn.com/members/jetienne