Hi all,

Maybe not a Ferret question, but I assume someone here might have come
across this already.

I wrote a simple CGI app that adds docs to a Ferret index. The idea is
to test Asian-language input and searching.

The script that does the input seems to be OK. As David mentioned in a
question I asked a little while ago, Ferret's index is agnostic, in the
sense that you can store anything in it. I then wrote another script to
search the index it creates. This is what it looks like:

####################################
#!/usr/bin/ruby

$KCODE = 'u'
require 'cgi'
require 'ferret'
include Ferret

index = Index::Index.new(:path => '/var/index', :default_field => "*")

cgi = CGI.new("html4")

result = ""
if cgi['query'] and not cgi['query'].empty?
  index.search_each(cgi['query']) do |doc, score|
    result << "<table border='1'>
      <tr><td>#{index[doc]['tileid']}</td><td>#{index[doc]['title']}</td><td>#{index[doc]['description']}</td></tr>
      </table>
    "
  end
end
####################################

It's A-OK for searching English. But when I try to input Chinese
characters in the "query" field, I get the following error in my
lighttpd log file:

####################################
/var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:15:in
`search_each': : Error occured at <analysis.c>:701 (Exception)
  Error: exception 2 not handled: Error decoding input string. Check
that you have the locale set correctly
        from /var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:15
####################################

Is the error message above suggesting I should specify a Chinese
locale and not UTF-8? I thought UTF-8 could handle Chinese and
anything else one could throw at it, as long as it's a human language.

Any help is appreciated.

--
Julio C. Ody
http://rootshell.be/~julioody
On 7/18/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> It's A-OK for searching English. But when I try to input Chinese
> characters in the "query" field, I get the following error in my
> lighttpd log file:
>
> /var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:15:in
> `search_each': : Error occured at <analysis.c>:701 (Exception)
>   Error: exception 2 not handled: Error decoding input string. Check
> that you have the locale set correctly
>
> Is the error message above suggesting I should specify a Chinese
> locale and not UTF-8? I thought UTF-8 could handle Chinese and
> anything else one could throw at it, as long as it's a human language.

The error is being raised when the analyzer tries to tokenize the
query string. My guess would be that the query string either starts
out in the wrong encoding (when you type it in) or it gets converted
somewhere between being typed in the browser and going into your
script. UTF-8 can certainly handle Chinese characters if they are
UTF-8 encoded, but there are other encodings for Chinese as well.

If I were trying to debug this, the first thing I'd do is log the
query string to a file and check its encoding. Something like:

    File.open("query.log", "w") {|f| f.write(cgi['query'])}

If you want, send me the file and I'll try and see what encoding it is.

Cheers,
Dave
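(A quick way to do that encoding check is to dump the logged bytes in
hex. This is a minimal sketch, reusing the query.log file from Dave's
suggestion: genuine UTF-8 will show bytes above 0x7f, while HTML
numeric entities like "&#26032;" will show up as plain ASCII.)

    #!/usr/bin/ruby
    # Sketch: print the logged query byte by byte so the actual
    # encoding is visible regardless of terminal or locale settings.
    bytes = File.read("query.log")
    bytes.each_byte { |b| printf("%02x ", b) }
    puts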
On 7/18/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> The error is being raised when the analyzer tries to tokenize the
> query string. My guess would be that the query string either starts
> out in the wrong encoding (when you type it in)

Didn't get that bit.

> or it gets converted somewhere between being typed in the browser
> and going into your script.

Umm... maybe yes.

> UTF-8 can certainly handle Chinese characters if they are UTF-8
> encoded, but there are other encodings for Chinese as well. If I
> were trying to debug this, the first thing I'd do is log the query
> string to a file and check its encoding. Something like:
>
>     File.open("query.log", "w") {|f| f.write(cgi['query'])}
>
> If you want, send me the file and I'll try and see what encoding it is.

I wrote another script that does just that (writes cgi['query'] to
/tmp/query.log). After typing this Chinese string into a text field
named "query" and submitting it:

新闻

this is what appears in /tmp/query.log:

&#26032;&#38395;

Note that the only thing I did, hoping to have everything magically
working in UTF-8, was to put this in my script:

$KCODE = 'u'

Anything I'm missing?

--
Julio C. Ody
http://rootshell.be/~julioody
On 7/18/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> this is what appears in /tmp/query.log:
>
> &#26032;&#38395;
>
> Anything I'm missing?

The browser is submitting the characters as HTML numeric entities, so
you need to unescape them first:

dbalmain at ubuntu:~/ $ irb -Ku
irb(main):001:0> require 'cgi'
=> true
irb(main):002:0> CGI.unescapeHTML("&#26032;&#38395;")
=> "新闻"

That should fix your problem.

Dave
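(Applied to the search script from the start of the thread, that
amounts to unescaping the parameter before it reaches Ferret. A sketch;
`index` and `cgi` are the objects built earlier in that script:)

    # "&#26032;&#38395;" becomes the actual UTF-8 characters
    query = CGI.unescapeHTML(cgi['query'])
    index.search_each(query) do |doc, score|
      # handle hits as before
    end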
Yep, it did. Thanks tons!

But I'm not getting any results now. I take it this is because of the
default analyzer being used, right?

How can I use a whitespace analyzer in my query? (Or something that
could work effectively with Asian languages.) For my needs, I suppose
the whitespace one could do...

On 7/18/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> dbalmain at ubuntu:~/ $ irb -Ku
> irb(main):001:0> require 'cgi'
> => true
> irb(main):002:0> CGI.unescapeHTML("&#26032;&#38395;")
> => "新闻"
>
> That should fix your problem.

--
Julio C. Ody
http://rootshell.be/~julioody
On 7/19/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> Yep, it did. Thanks tons!
>
> But I'm not getting any results now. I take it this is because of the
> default analyzer being used, right?
>
> How can I use a whitespace analyzer in my query? (Or something that
> could work effectively with Asian languages.) For my needs, I suppose
> the whitespace one could do...

index = Index::Index.new(:path => '/var/index', :default_field => "*",
            :analyzer => Ferret::Analysis::WhiteSpaceAnalyzer.new)

Although you should probably use the same analyzer I gave you for
indexing:

http://www.ruby-forum.com/topic/72086#101764

Cheers,
Dave
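(Why the linked per-field analyzer is the better fit: written Chinese
has no spaces between words, so a whitespace tokenizer sees a whole
phrase as a single token, while the RegExpAnalyzer(/./, false) from the
linked post emits one token per character, which is what lets
single-character queries match. A rough sketch to see the difference;
the sample string is hypothetical, and the token_stream / Token#text
accessors may differ between Ferret versions:)

    #!/usr/bin/ruby
    ENV['LANG'] = 'en_US.utf8'  # assumption: a UTF-8 locale (see end of thread)
    $KCODE = 'u'
    require 'ferret'
    include Ferret::Analysis

    text = "新闻"  # hypothetical sample string ("news")
    { "whitespace" => WhiteSpaceAnalyzer.new,
      "per-char"   => RegExpAnalyzer.new(/./, false) }.each do |name, a|
      ts = a.token_stream("chinese", text)
      tokens = []
      while t = ts.next
        tokens << t.text
      end
      # expect one token for whitespace vs. one token per character
      puts "#{name}: #{tokens.inspect}"
    end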
Thanks, and sorry. I checked the documentation for Index::Index and
found it right after I asked the question. My bad.

I'm getting segfaults when trying to initialize an index with any
analyzer other than the default one (it works otherwise). But as I can
see in this thread:

http://www.ruby-forum.com/topic/71620

it ain't stable yet for 64-bit. So I'll wait.

Thanks again.

On 7/19/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> index = Index::Index.new(:path => '/var/index', :default_field => "*",
>             :analyzer => Ferret::Analysis::WhiteSpaceAnalyzer.new)
>
> Although you should probably use the same analyzer I gave you for
> indexing:
>
> http://www.ruby-forum.com/topic/72086#101764

--
Julio C. Ody
http://rootshell.be/~julioody
Just sharing my experience and asking another question.

I tried the analyzer suggested here:
http://www.ruby-forum.com/topic/72086#101764. It works fine if you
specify the search field you want to use (anyway, it seems that's how
it's supposed to work).

# CODE
analyzer = Ferret::Analysis::PerFieldAnalyzer.new(Ferret::Analysis::StandardAnalyzer.new)
analyzer["chinese"] = Ferret::Analysis::RegExpAnalyzer.new(/./, false)

index = Index::Index.new(:path => '/var/index', :analyzer => analyzer,
                         :default_field => "*")

...

index.search_each("chinese: #{val}") do |doc, score| # val is a chinese char
  puts "#{doc} - #{score}"
end
# END CODE

This works OK. However, if you try searching like this:

# CODE
index.search_each(val) do |doc, score| # val is a chinese char
  puts "#{doc} - #{score}"
end
# END CODE

I get this in my lighttpd error log:

/var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:19:in
`search_each': : Error occured at <analysis.c>:701 (StandardError)
  Error: exception 2 not handled: Error decoding input string. Check
that you have the locale set correctly
        from /var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:19

Which MAKES SENSE, since the docs I created before were created like
this:

doc = { "author" => "englishchars", "title" => "more regular chars",
        "chinese" => "??"}
index << doc

and I think search_each is going through all the fields (since I
explicitly told it to with :default_field => "*" up there), finding
English chars, and trying to match them against the Chinese ones I
supplied as a search query.

So alright, I can use the suggested analyzer. But my question is: is
there a way to use an analyzer that would work with both character
types (English and Asian), simply returning no matches instead of
giving me an error?

Thanks a ton for any help.

On 7/19/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> I'm getting segfaults when trying to initialize an index with any
> analyzer other than the default one (it works otherwise).

--
Julio C. Ody
http://rootshell.be/~julioody
On 7/19/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> and I think search_each is going through all the fields (since I
> explicitly told it to with :default_field => "*" up there), finding
> English chars, and trying to match them against the Chinese ones I
> supplied as a search query.

Actually, it's not because there is a comparison between Chinese and
English characters. That shouldn't cause an error. The error is being
thrown because val can't be decoded by the StandardAnalyzer. Again,
you need to check that val is correctly encoded and that you have your
locale set correctly.

The only times tokenizing happens are when you add documents to the
index and when you run a query through the query parser. Apart from
that, all operations on strings are done at the byte level. I hope
that makes sense.

> So alright, I can use the suggested analyzer. But my question is: is
> there a way to use an analyzer that would work with both character
> types (English and Asian), simply returning no matches instead of
> giving me an error?

The answer to this question is that it already should work correctly.
Just make sure the locale is set correctly when the search method is
called, and that whatever you pass as a query to the search method is
correctly encoded according to the locale.

Cheers,
Dave
Does it take anything other than simply:

$KCODE = 'u'

right at the beginning of the script? I have that in place already.
(It's CGI we're talking about.)

On 7/19/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> The answer to this question is that it already should work correctly.
> Just make sure the locale is set correctly when the search method is
> called, and that whatever you pass as a query to the search method is
> correctly encoded according to the locale.

--
Julio C. Ody
http://rootshell.be/~julioody
Replying to myself: yes.

ENV['LANG'] = 'en_US.utf8'

did the job. Thanks!

On 7/19/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> Does it take anything other than simply:
>
> $KCODE = 'u'
>
> right at the beginning of the script? I have that in place already.

--
Julio C. Ody
http://rootshell.be/~julioody
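(Pulling the whole thread together, the working combination is: set a
UTF-8 locale before Ferret decodes anything, set $KCODE, unescape the
HTML entities the browser submits, and use the per-field analyzer so
the chinese field is tokenized per character. A minimal, untested
sketch of the final search script, with the paths and field names used
above:)

####################################
#!/usr/bin/ruby

ENV['LANG'] = 'en_US.utf8'   # set the locale before Ferret decodes any input
$KCODE = 'u'

require 'cgi'
require 'ferret'
include Ferret

# Per-field analyzer from http://www.ruby-forum.com/topic/72086#101764:
# StandardAnalyzer everywhere, one token per character for "chinese".
analyzer = Analysis::PerFieldAnalyzer.new(Analysis::StandardAnalyzer.new)
analyzer["chinese"] = Analysis::RegExpAnalyzer.new(/./, false)

index = Index::Index.new(:path => '/var/index',
                         :analyzer => analyzer,
                         :default_field => "*")

cgi = CGI.new("html4")
result = ""
if cgi['query'] and not cgi['query'].empty?
  # Browsers may submit CJK characters as numeric entities
  # such as "&#26032;&#38395;", so unescape before searching.
  query = CGI.unescapeHTML(cgi['query'])
  index.search_each(query) do |doc, score|
    result << "<tr><td>#{index[doc]['title']}</td>" <<
              "<td>#{index[doc]['description']}</td></tr>\n"
  end
end
####################################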