thr3ads.net - Ferret talk - [Ferret-talk] Accented characters [May 2007]

Marcello parra

2007-May-23 02:25 UTC

[Ferret-talk] Accented characters

Hello,

I want to clean up accented characters in my index, using acts_as_ferret
in a Rails project. I searched this forum and found that the best solution
is to use an analyzer.
I created something like this:

class PortugueseAnalyzer
  include Ferret::Analysis
  MAPPING = {
    ['?','?','?','?','?','?','?','?']         => 'a',
    '?'                                       => 'ae',
    ['?','?']                                 => 'd',
    ['?','?','?','?','?']                     => 'c',
    ['?','?','?','?','?','?','?','?','?']     => 'e',
    ['?']                                     => 'f',
    ['?','?','?','?']                         => 'g',
    ['?','?']                                 => 'h',
    ['?','?','?','?','?','?','?','?']         => 'i',
    ['?','?','?','?']                         => 'j',
    ['?','?']                                 => 'k',
    ['?','?','?','?','?']                     => 'l',
    ['?','?','?','?','?','?']                 => 'n',
    ['?','?','?','?','?','?','?','?','?','?'] => 'o',
    ['?']                                     => 'oek',
    ['?']                                     => 'q',
    ['?','?','?']                             => 'r',
    ['?','?','?','?','?']                     => 's',
    ['?','?','?','?']                         => 't',
    ['?','?','?','?','?','?','?','?','?','?'] => 'u',
    ['?']                                     => 'w',
    ['?','?','?']                             => 'y',
    ['?','?','?']                             => 'z'
  }
  def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
  end
end


And inserted this code at the end of environment.rb.
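
For reference, the folding that a mapping like this is meant to perform can be sketched in plain Ruby, without Ferret. This is only an illustration of the idea, not how MappingFilter is implemented: SAMPLE_MAPPING, expand_mapping and fold_accents are hypothetical names, and the table is a shortened stand-in with a few Portuguese characters, not the full MAPPING from the class.

```ruby
# Hypothetical, shortened mapping in the same shape as MAPPING in the class:
# a key is either one string or an array of strings sharing a replacement.
SAMPLE_MAPPING = {
  ['á', 'à', 'â', 'ã'] => 'a',
  'ç'                  => 'c',
  ['é', 'ê']           => 'e',
  'í'                  => 'i',
  ['ó', 'ô', 'õ']      => 'o',
  ['ú', 'ü']           => 'u'
}

# Flatten the array keys so every accented character has its own entry.
def expand_mapping(mapping)
  mapping.each_with_object({}) do |(from, to), flat|
    Array(from).each { |char| flat[char] = to }
  end
end

# Replace every mapped character; anything not in the table is left alone.
def fold_accents(text, mapping = SAMPLE_MAPPING)
  flat = expand_mapping(mapping)
  pattern = Regexp.union(flat.keys)
  text.gsub(pattern) { |char| flat[char] }
end

puts fold_accents('prejuízo')      # prejuizo
puts fold_accents('José Antonio')  # Jose Antonio
```

The sketch only covers lowercase characters and works on whole strings; inside an analyzer the same replacement would be applied to each token instead.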



In my model:

acts_as_ferret({ :fields => ['name'] }, :analyzer => PortugueseAnalyzer.new)



But this did not work.

Can someone tell me what I did wrong?

Thanks


Marcello

-- 
Posted via http://www.ruby-forum.com/.

Jens Kraemer

2007-May-23 07:52 UTC

On Wed, May 23, 2007 at 04:25:12AM +0200, Marcello parra wrote:
> Hello,
> 
> I want to clean up accented characters in my index, using acts_as_ferret
> in a Rails project. I searched this forum and found that the best solution
> is to use an analyzer.
> I created something like this:
> 
> class PortugueseAnalyzer

Try inheriting your analyzer from Ferret::Analysis::Analyzer. This does not
seem to be necessary API-wise, but IMHO it should help.

Jens


-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
 
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

Marcello parra

2007-May-23 09:42 UTC

> Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
> seem to be necessary API-wise, but imho this should help.
> 
> Jens
> 

Thanks Jens.
I changed from "class PortugueseAnalyzer" to
"class PortugueseAnalyzer < Ferret::Analysis::Analyzer",
but that did not work either.

Did I put this in the right place?

Thanks

Marcello

Jens Kraemer

2007-May-23 09:46 UTC

On Wed, May 23, 2007 at 11:42:04AM +0200, Marcello parra wrote:
> > Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
> > seem to be necessary API-wise, but imho this should help.
> > 
> > Jens
> 
> Thanks Jens.
> I changed from "class PortugueseAnalyzer" to
> "class PortugueseAnalyzer < Ferret::Analysis::Analyzer",
> but that did not work either.
> 
> Did I put this in the right place?

I think so. To help debug this, a small Ruby script reproducing the
exact problem would be helpful.

Jens


Marcello parra

2007-May-23 10:21 UTC

> I think so. To help debug this, a small Ruby script reproducing the
> exact problem would be helpful.
> 
Jens,

In the log, I get:

creating doc for class: Conta, id: 164
Adding field name with value 'José Antonio' to index


So the name is not being translated from UTF-8 to ASCII.
It's the same output I get when I do not use the analyzer.

Thanks

Marcello parra

2007-May-23 10:43 UTC

> In the log, I get:
> 
> creating doc for class: Conta, id: 164
> Adding field name with value 'José Antonio' to index

I added the word prejuízo, which should be translated to prejuizo.
I put in some code to output information when it builds the index. This is
what I get:

Analyzing: field:nome  str:prejuízo
token["preju":0:5:1]
token["zo":7:9:1]


So the problem is that it breaks the word in two, right at the accented
character.

I guess the problem is in:

def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
end


But I can't figure out how.

Jens Kraemer

2007-May-23 11:37 UTC

On Wed, May 23, 2007 at 12:43:21PM +0200, Marcello parra wrote:
> > In the log, I get:
> > 
> > creating doc for class: Conta, id: 164
> > Adding field name with value 'José Antonio' to index
> 
> I added the word prejuízo, which should be translated to prejuizo.
> I put in some code to output information when it builds the index. This is
> what I get:
> 
> Analyzing: field:nome  str:prejuízo
> token["preju":0:5:1]
> token["zo":7:9:1]

With the script at http://pastie.caboo.se/63808 I get:

token["prejuizo":0:9:1]

It seems that Ferret doesn't recognize the í as a character and
therefore splits the word at this position.
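
The offsets in the two tokens fit that explanation: in UTF-8 the í takes two bytes, so "preju" covers bytes 0 to 5 and "zo" covers bytes 7 to 9. A rough plain-Ruby sketch of the effect, assuming a tokenizer that only accepts ASCII letters (ascii_tokens is a made-up helper for illustration, not Ferret's tokenizer):

```ruby
# "prejuízo" written with an escape so the bytes are UTF-8 no matter how
# this file is saved: the í becomes the two bytes 0xC3 0xAD.
word = "preju\u00EDzo"

# Treat the string as raw bytes, the way a tokenizer without UTF-8 locale
# information would see it.
raw = word.dup.force_encoding(Encoding::BINARY)

# Hypothetical ASCII-only tokenizer: a token is a run of A-Z / a-z bytes,
# reported together with its start and end byte offsets.
def ascii_tokens(raw)
  tokens = []
  raw.scan(/[A-Za-z]+/n) do
    match = Regexp.last_match
    tokens << [match[0], match.begin(0), match.end(0)]
  end
  tokens
end

ascii_tokens(raw).each do |text, from, to|
  puts %(token["#{text}":#{from}:#{to}])
end
# token["preju":0:5]
# token["zo":7:9]
```

The two bytes of the í (offsets 5 and 6) match nothing, which is why the mapping never gets a chance to replace the character.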

You have to make sure that everything in your environment uses UTF-8
as the character encoding for this to work (locale settings are
especially relevant to Ferret).
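
One quick way to check the locale side is to look at the environment the Rails process was started with. The sketch below assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG); effective_ctype and utf8_locale? are hypothetical helper names. It only inspects environment variables and does not ask Ferret which locale it actually ended up using.

```ruby
# POSIX lookup order for the character-type locale: LC_ALL overrides
# LC_CTYPE, which overrides LANG. Empty values count as unset.
def effective_ctype(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  nil
end

# True for names such as "pt_BR.UTF-8" or "en_US.utf8".
def utf8_locale?(locale)
  !!(locale && locale =~ /utf-?8/i)
end

locale = effective_ctype
puts "character-type locale from environment: #{locale.inspect}"
puts "looks like UTF-8: #{utf8_locale?(locale)}"
```

If this prints nil or something like "C" or "POSIX" under the web server's environment, that would be consistent with the word being split at the accented character.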

Jens
