Hi all,

Maybe not a Ferret question, but I assume someone here might have come
across this already.

I wrote a simple CGI app that adds docs to a Ferret index. The idea is
to test Asian-language input and searching.

The script that does the input seems to be OK. As David mentioned in a
question I asked a little while ago, Ferret's index is agnostic, in the
sense that you can store anything in it. I then wrote another script to
search the index it creates. This is what it looks like:

####################################
#!/usr/bin/ruby

$KCODE = 'u'
require 'cgi'
require 'ferret'
include Ferret

index = Index::Index.new(:path => '/var/index', :default_field => "*")

cgi = CGI.new("html4")

result = ""
if cgi['query'] and not cgi['query'].empty?
  index.search_each(cgi['query']) do |doc, score|
    result << "<table border='1'>
      <tr><td>#{index[doc]['tileid']}</td><td>#{index[doc]['title']}</td><td>#{index[doc]['description']}</td></tr>
      </table>
    "
  end
end
####################################

It's A-OK for searching English. But when I try to input Chinese
characters in the "query" field, I get the following error in my
lighttpd log file:

####################################
/var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:15:in
`search_each': : Error occured at <analysis.c>:701 (Exception)
  Error: exception 2 not handled: Error decoding input string. Check
that you have the locale set correctly
        from /var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:15
####################################

Is the error message above suggesting I should specify a Chinese
locale and not UTF-8? I thought UTF-8 could handle Chinese and
anything else one could throw at it, as long as it's a human language.

Any help is appreciated.

--
Julio C. Ody
http://rootshell.be/~julioody
On 7/18/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> It's A-OK for searching English. But when I try to input Chinese
> characters in the "query" field, I get the following error in my
> lighttpd log file:
>
> /var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:15:in
> `search_each': : Error occured at <analysis.c>:701 (Exception)
>   Error: exception 2 not handled: Error decoding input string. Check
> that you have the locale set correctly
>
> Is the error message above suggesting I should specify a Chinese
> locale and not UTF-8? I thought UTF-8 could handle Chinese and
> anything else one could throw at it, as long as it's a human language.

The error is being raised when the analyzer tries to tokenize the
query string. My guess would be that the query string either starts
out in the wrong encoding (when you type it in) or it gets converted
somewhere between being typed in the browser and going into your
script. UTF-8 can certainly handle Chinese characters if they are
UTF-8 encoded, but there are other encodings for Chinese as well.

If I were trying to debug this, the first thing I'd do is log the
query string to a file and check its encoding. Something like:

    File.open("query.log", "w") {|f| f.write(cgi['query'])}

If you want, send me the file and I'll try and see what encoding it is.

Cheers,
Dave
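(A quick way to do that encoding check is to dump the logged bytes in
hex. This is a minimal sketch, reusing the query.log file from Dave's
suggestion: genuine UTF-8 will show bytes above 0x7f, while HTML
numeric entities like "&#26032;" will show up as plain ASCII.)

    #!/usr/bin/ruby
    # Sketch: print the logged query byte by byte so the actual
    # encoding is visible regardless of terminal or locale settings.
    bytes = File.read("query.log")
    bytes.each_byte { |b| printf("%02x ", b) }
    puts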
On 7/18/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> The error is being raised when the analyzer tries to tokenize the
> query string. My guess would be that the query string either starts
> out in the wrong encoding (when you type it in)

Didn't get that bit.

> or it gets converted somewhere between being typed in the browser
> and going into your script.

Umm... maybe yes.

> UTF-8 can certainly handle Chinese characters if they are UTF-8
> encoded, but there are other encodings for Chinese as well. If I
> were trying to debug this, the first thing I'd do is log the query
> string to a file and check its encoding. Something like:
>
>     File.open("query.log", "w") {|f| f.write(cgi['query'])}
>
> If you want, send me the file and I'll try and see what encoding it is.

I wrote another script that does just that (writes cgi['query'] to
/tmp/query.log). After typing this Chinese string into a text field
named "query" and submitting it:

新闻

this is what appears in /tmp/query.log:

&#26032;&#38395;

Note that the only thing I did, hoping to have everything magically
working in UTF-8, was to put this in my script:

$KCODE = 'u'

Anything I'm missing?

--
Julio C. Ody
http://rootshell.be/~julioody
On 7/18/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> this is what appears in /tmp/query.log:
>
> &#26032;&#38395;
>
> Anything I'm missing?

The browser is submitting the characters as HTML numeric entities, so
you need to unescape them first:

dbalmain at ubuntu:~/ $ irb -Ku
irb(main):001:0> require 'cgi'
=> true
irb(main):002:0> CGI.unescapeHTML("&#26032;&#38395;")
=> "新闻"

That should fix your problem.

Dave
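(Applied to the search script from the start of the thread, that
amounts to unescaping the parameter before it reaches Ferret. A sketch;
`index` and `cgi` are the objects built earlier in that script:)

    # "&#26032;&#38395;" becomes the actual UTF-8 characters
    query = CGI.unescapeHTML(cgi['query'])
    index.search_each(query) do |doc, score|
      # handle hits as before
    end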
Yep, it did. Thanks tons!

But I'm not getting any results now. I take it this is because of the
default analyzer being used, right?

How can I use a whitespace analyzer in my query? (Or something that
could work effectively with Asian languages.) For my needs, I suppose
the whitespace one could do...

On 7/18/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> dbalmain at ubuntu:~/ $ irb -Ku
> irb(main):001:0> require 'cgi'
> => true
> irb(main):002:0> CGI.unescapeHTML("&#26032;&#38395;")
> => "新闻"
>
> That should fix your problem.

--
Julio C. Ody
http://rootshell.be/~julioody
On 7/19/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> Yep, it did. Thanks tons!
>
> But I'm not getting any results now. I take it this is because of the
> default analyzer being used, right?
>
> How can I use a whitespace analyzer in my query? (Or something that
> could work effectively with Asian languages.) For my needs, I suppose
> the whitespace one could do...

index = Index::Index.new(:path => '/var/index', :default_field => "*",
            :analyzer => Ferret::Analysis::WhiteSpaceAnalyzer.new)

Although you should probably use the same analyzer I gave you for
indexing:

http://www.ruby-forum.com/topic/72086#101764

Cheers,
Dave
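(Why the linked per-field analyzer is the better fit: written Chinese
has no spaces between words, so a whitespace tokenizer sees a whole
phrase as a single token, while the RegExpAnalyzer(/./, false) from the
linked post emits one token per character, which is what lets
single-character queries match. A rough sketch to see the difference;
the sample string is hypothetical, and the token_stream / Token#text
accessors may differ between Ferret versions:)

    #!/usr/bin/ruby
    ENV['LANG'] = 'en_US.utf8'  # assumption: a UTF-8 locale (see end of thread)
    $KCODE = 'u'
    require 'ferret'
    include Ferret::Analysis

    text = "新闻"  # hypothetical sample string ("news")
    { "whitespace" => WhiteSpaceAnalyzer.new,
      "per-char"   => RegExpAnalyzer.new(/./, false) }.each do |name, a|
      ts = a.token_stream("chinese", text)
      tokens = []
      while t = ts.next
        tokens << t.text
      end
      # expect one token for whitespace vs. one token per character
      puts "#{name}: #{tokens.inspect}"
    end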
Thanks, and sorry. I checked the documentation for Index::Index and
found it right after I asked the question. My bad.

I'm getting segfaults when trying to initialize an index with any
analyzer other than the default one (it works otherwise). But as I can
see in this thread:

http://www.ruby-forum.com/topic/71620

it ain't stable yet for 64-bit. So I'll wait.

Thanks again.

On 7/19/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> index = Index::Index.new(:path => '/var/index', :default_field => "*",
>             :analyzer => Ferret::Analysis::WhiteSpaceAnalyzer.new)
>
> Although you should probably use the same analyzer I gave you for
> indexing:
>
> http://www.ruby-forum.com/topic/72086#101764

--
Julio C. Ody
http://rootshell.be/~julioody
Just sharing my experience and asking another question.

I tried the analyzer suggested here:
http://www.ruby-forum.com/topic/72086#101764. It works fine if you
specify the search field you want to use (anyway, it seems that's how
it's supposed to work).

# CODE
analyzer = Ferret::Analysis::PerFieldAnalyzer.new(Ferret::Analysis::StandardAnalyzer.new)
analyzer["chinese"] = Ferret::Analysis::RegExpAnalyzer.new(/./, false)

index = Index::Index.new(:path => '/var/index', :analyzer => analyzer,
                         :default_field => "*")

...

index.search_each("chinese: #{val}") do |doc, score| # val is a chinese char
  puts "#{doc} - #{score}"
end
# END CODE

This works OK. However, if you try searching like this:

# CODE
index.search_each(val) do |doc, score| # val is a chinese char
  puts "#{doc} - #{score}"
end
# END CODE

I get this in my lighttpd error log:

/var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:19:in
`search_each': : Error occured at <analysis.c>:701 (StandardError)
  Error: exception 2 not handled: Error decoding input string. Check
that you have the locale set correctly
        from /var/www/localhost/htdocs/cgi-bin/search_chinese.ruby:19

Which MAKES SENSE, since the docs I created before were created like
this:

doc = { "author" => "englishchars", "title" => "more regular chars",
        "chinese" => "??"}
index << doc

and I think search_each is going through all the fields (since I
explicitly told it to with :default_field => "*" up there), finding
English chars, and trying to match them against the Chinese ones I
supplied as a search query.

So alright, I can use the suggested analyzer. But my question is: is
there a way to use an analyzer that would work with both character
types (English and Asian), simply returning no matches instead of
giving me an error?

Thanks a ton for any help.

On 7/19/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> I'm getting segfaults when trying to initialize an index with any
> analyzer other than the default one (it works otherwise).

--
Julio C. Ody
http://rootshell.be/~julioody
On 7/19/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> and I think search_each is going through all the fields (since I
> explicitly told it to with :default_field => "*" up there), finding
> English chars, and trying to match them against the Chinese ones I
> supplied as a search query.

Actually, it's not because there is a comparison between Chinese and
English characters. That shouldn't cause an error. The error is being
thrown because val can't be decoded by the StandardAnalyzer. Again,
you need to check that val is correctly encoded and that you have your
locale set correctly.

The only times tokenizing happens are when you add documents to the
index and when you run a query through the query parser. Apart from
that, all operations on strings are done at the byte level. I hope
that makes sense.

> So alright, I can use the suggested analyzer. But my question is: is
> there a way to use an analyzer that would work with both character
> types (English and Asian), simply returning no matches instead of
> giving me an error?

The answer to this question is that it already should work correctly.
Just make sure the locale is set correctly when the search method is
called, and that whatever you pass as a query to the search method is
correctly encoded according to the locale.

Cheers,
Dave
Does it take anything other than simply:

$KCODE = 'u'

right at the beginning of the script? I have that in place already.
(It's CGI we're talking about.)

On 7/19/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> The answer to this question is that it already should work correctly.
> Just make sure the locale is set correctly when the search method is
> called, and that whatever you pass as a query to the search method is
> correctly encoded according to the locale.

--
Julio C. Ody
http://rootshell.be/~julioody
Replying to myself: yes.

ENV['LANG'] = 'en_US.utf8'

did the job. Thanks!

On 7/19/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> Does it take anything other than simply:
>
> $KCODE = 'u'
>
> right at the beginning of the script? I have that in place already.

--
Julio C. Ody
http://rootshell.be/~julioody
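(Pulling the whole thread together, the working combination is: set a
UTF-8 locale before Ferret decodes anything, set $KCODE, unescape the
HTML entities the browser submits, and use the per-field analyzer so
the chinese field is tokenized per character. A minimal, untested
sketch of the final search script, with the paths and field names used
above:)

####################################
#!/usr/bin/ruby

ENV['LANG'] = 'en_US.utf8'   # set the locale before Ferret decodes any input
$KCODE = 'u'

require 'cgi'
require 'ferret'
include Ferret

# Per-field analyzer from http://www.ruby-forum.com/topic/72086#101764:
# StandardAnalyzer everywhere, one token per character for "chinese".
analyzer = Analysis::PerFieldAnalyzer.new(Analysis::StandardAnalyzer.new)
analyzer["chinese"] = Analysis::RegExpAnalyzer.new(/./, false)

index = Index::Index.new(:path => '/var/index',
                         :analyzer => analyzer,
                         :default_field => "*")

cgi = CGI.new("html4")
result = ""
if cgi['query'] and not cgi['query'].empty?
  # Browsers may submit CJK characters as numeric entities
  # such as "&#26032;&#38395;", so unescape before searching.
  query = CGI.unescapeHTML(cgi['query'])
  index.search_each(query) do |doc, score|
    result << "<tr><td>#{index[doc]['title']}</td>" <<
              "<td>#{index[doc]['description']}</td></tr>\n"
  end
end
####################################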