Hello,
I want to clean up accented characters in my index, using acts_as_ferret
in a Rails project. I searched this forum, and found the best solution
is to use an analyser.
I created something like this:
class PortugueseAnalyzer
  include Ferret::Analysis

  MAPPING = {
    ['?','?','?','?','?','?','?','?'] => 'a',
    '?' => 'ae',
    ['?','?'] => 'd',
    ['?','?','?','?','?'] => 'c',
    ['?','?','?','?','?','?','?','?','?'] => 'e',
    ['?'] => 'f',
    ['?','?','?','?'] => 'g',
    ['?','?'] => 'h',
    ['?','?','?','?','?','?','?','?'] => 'i',
    ['?','?','?','?'] => 'j',
    ['?','?'] => 'k',
    ['?','?','?','?','?'] => 'l',
    ['?','?','?','?','?','?'] => 'n',
    ['?','?','?','?','?','?','?','?','?','?'] => 'o',
    ['?'] => 'oek',
    ['?'] => 'q',
    ['?','?','?'] => 'r',
    ['?','?','?','?','?'] => 's',
    ['?','?','?','?'] => 't',
    ['?','?','?','?','?','?','?','?','?','?'] => 'u',
    ['?'] => 'w',
    ['?','?','?'] => 'y',
    ['?','?','?'] => 'z'
  }

  def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
  end
end
I inserted this code at the end of environment.rb.
In my model:
  acts_as_ferret({ :fields => [ 'name' ] }, :analyzer => PortugueseAnalyzer.new)
But this did not work.
Can someone tell me what I did wrong?
Thanks
Marcello
--
Posted via http://www.ruby-forum.com/.
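
A mapping table like the one in the post can be checked on its own, without Ferret, before it is wired into an analyzer. The following is a minimal sketch in plain Ruby (2.0 or later, UTF-8 source assumed). SAMPLE_MAPPING is a small illustrative subset of Portuguese accented characters, not the poster's full table, and strip_accents is a hypothetical helper name. It only shows the intended character replacement; it does not reproduce Ferret's tokenizing.

```ruby
# Illustrative subset only, grouped the same way as the MAPPING in the post.
SAMPLE_MAPPING = {
  ['á', 'à', 'â', 'ã'] => 'a',
  ['ç']                => 'c',
  ['é', 'ê']           => 'e',
  ['í']                => 'i',
  ['ó', 'ô', 'õ']      => 'o',
  ['ú', 'ü']           => 'u'
}.freeze

# Expand the grouped keys into a flat { "á" => "a", ... } lookup.
FLAT_MAPPING = SAMPLE_MAPPING.each_with_object({}) do |(keys, replacement), flat|
  Array(keys).each { |key| flat[key] = replacement }
end.freeze

# Matches any single character that has an entry in the lookup.
MAPPED_CHARS = Regexp.union(FLAT_MAPPING.keys)

# Hypothetical helper: replace every mapped character in +text+.
def strip_accents(text)
  text.gsub(MAPPED_CHARS, FLAT_MAPPING)
end

puts strip_accents("prejuízo")
puts strip_accents("José Antonio")
```

If the plain-Ruby result is already wrong, the table is at fault; if it is right but the index still contains accents or split words, the problem is more likely in the tokenizer or the environment.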
On Wed, May 23, 2007 at 04:25:12AM +0200, Marcello parra wrote:
> Hello,
>
> I want to clean up accented characters in my index, using acts_as_ferret
> in a Rails project. I searched this forum, and found the best solution
> is to use an analyser.
> I created something like this:
>
> class PortugueseAnalyzer

Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
seem to be necessary API-wise, but imho this should help.

Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
> Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
> seem to be necessary API-wise, but imho this should help.
>
> Jens

Thanks Jens.
I changed "class PortugueseAnalyzer" to
"class PortugueseAnalyzer < Ferret::Analysis::Analyzer",
but that did not work either.

Did I put this in the right place?

Thanks
Marcello
On Wed, May 23, 2007 at 11:42:04AM +0200, Marcello parra wrote:
> > Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
> > seem to be necessary API-wise, but imho this should help.
> >
> > Jens
>
> Thanks Jens.
> I changed "class PortugueseAnalyzer" to
> "class PortugueseAnalyzer < Ferret::Analysis::Analyzer",
> but that did not work either.
>
> Did I put this in the right place?

I think so. To help debug this, a small Ruby script reproducing the
exact problem would be cool.

Jens
> I think so. To help debug this, a small Ruby script reproducing the
> exact problem would be cool.

Jens,

In the log, I get:

  creating doc for class: Conta, id: 164
  Adding field name with value 'José Antonio' to index

So the name is not being translated from UTF-8 to ASCII.
The output is the same as when I do not use the analyzer.

Thanks
> In the log, I get:
>
> creating doc for class: Conta, id: 164
> Adding field name with value 'José Antonio' to index

I included the word prejuízo, which should be translated to prejuizo.
I added some code to output information when the index is built. This is
what I get:

  Analyzing: field:nome str:prejuízo
  token["preju":0:5:1]
  token["zo":7:9:1]

So the problem is that the word is broken in two, exactly at the accented
character. I guess the problem is in:

  def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
  end

But I can't figure out how.
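
The offsets in the token dump above (0:5 for "preju", 7:9 for "zo") are consistent with byte offsets into a UTF-8 string, where the accented character occupies the two bytes between positions 5 and 7. This is a small sketch in plain Ruby (2.0 or later assumed) that prints the byte span of each character. byte_spans is a hypothetical helper, and the sketch says nothing about how Ferret computes offsets internally; it only shows that the numbers line up.

```ruby
# Return [character, start_byte, end_byte] for each character of +text+.
def byte_spans(text)
  offset = 0
  text.each_char.map do |char|
    span = [char, offset, offset + char.bytesize]
    offset += char.bytesize
    span
  end
end

# "\u00ED" is the precomposed character í.
word = "preju\u00EDzo"

byte_spans(word).each do |char, from, to|
  puts "#{char.inspect} #{from}:#{to}"
end
puts "characters: #{word.length}, bytes: #{word.bytesize}"
```

A tokenizer that treats those two bytes as non-letters would produce exactly the two tokens shown in the log.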
On Wed, May 23, 2007 at 12:43:21PM +0200, Marcello parra wrote:
> > In the log, I get:
> >
> > creating doc for class: Conta, id: 164
> > Adding field name with value 'José Antonio' to index
>
> I included the word prejuízo, which should be translated to prejuizo.
> I added some code to output information when the index is built. This is
> what I get:
>
> Analyzing: field:nome str:prejuízo
> token["preju":0:5:1]
> token["zo":7:9:1]

With the script at http://pastie.caboo.se/63808 I get:

  token["prejuizo":0:9:1]

It seems that Ferret doesn't recognize the í as a character and
therefore splits the word at this position. You have to make sure that
everything in your environment is using UTF-8 as the character encoding
for these things to work (locale settings are especially relevant to
Ferret).

Jens
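
One way to check the locale part of this advice is to look at the environment variables the process actually sees. The sketch below is plain Ruby and makes no Ferret calls. The helper names are hypothetical, and it assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). It only reports what is set; whether Ferret picks the locale up from these variables in a given setup is something to verify separately.

```ruby
# Hypothetical helper: return the first non-empty locale variable, in the
# usual POSIX precedence order (LC_ALL, then LC_CTYPE, then LANG).
def effective_ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  nil
end

# Hypothetical helper: true if the effective locale names a UTF-8 charset.
def utf8_locale?(env = ENV)
  locale = effective_ctype_locale(env)
  !locale.nil? && !(locale =~ /utf-?8/i).nil?
end

puts "effective locale: #{effective_ctype_locale.inspect}"
puts "UTF-8 locale?     #{utf8_locale?}"
```

Running this from inside the Rails process (for example from script/console) rather than from an interactive shell matters, because a server started by an init script often has a different environment than a login shell.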