Hello,

I want to strip accented characters from my index, using acts_as_ferret in a Rails project. I searched this forum and found that the best solution is to use an analyzer. I created something like this:

    class PortugueseAnalyzer
      include Ferret::Analysis

      MAPPING = {
        ['?','?','?','?','?','?','?','?'] => 'a',
        '?' => 'ae',
        ['?','?'] => 'd',
        ['?','?','?','?','?'] => 'c',
        ['?','?','?','?','?','?','?','?','?'] => 'e',
        ['?'] => 'f',
        ['?','?','?','?'] => 'g',
        ['?','?'] => 'h',
        ['?','?','?','?','?','?','?','?'] => 'i',
        ['?','?','?','?'] => 'j',
        ['?','?'] => 'k',
        ['?','?','?','?','?'] => 'l',
        ['?','?','?','?','?','?'] => 'n',
        ['?','?','?','?','?','?','?','?','?','?'] => 'o',
        ['?'] => 'oek',
        ['?'] => 'q',
        ['?','?','?'] => 'r',
        ['?','?','?','?','?'] => 's',
        ['?','?','?','?'] => 't',
        ['?','?','?','?','?','?','?','?','?','?'] => 'u',
        ['?'] => 'w',
        ['?','?','?'] => 'y',
        ['?','?','?'] => 'z'
      }

      def token_stream(field, string)
        return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
      end
    end

I inserted this code at the end of environment.rb. In my model:

    acts_as_ferret({ :fields => [ 'name' ] }, :analyzer => PortugueseAnalyzer.new)

But this did not work. Can someone tell me what I did wrong?

Thanks,
Marcello

--
Posted via http://www.ruby-forum.com/.
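For reference, the effect such a mapping is meant to have can be sketched in plain Ruby, without Ferret. This is only an illustration: the character lists below are assumptions chosen for Portuguese text (the keys in the post above did not survive), and `fold_accents` is a hypothetical helper, not Ferret's MappingFilter.

```ruby
# Sketch only: plain-Ruby illustration of array-keyed character folding.
# The characters listed here are assumptions, not the original mapping.
MAPPING = {
  ["\u00E0", "\u00E1", "\u00E2", "\u00E3"] => "a",  # à á â ã
  ["\u00E7"]                               => "c",  # ç
  ["\u00E9", "\u00EA"]                     => "e",  # é ê
  ["\u00ED"]                               => "i",  # í
  ["\u00F3", "\u00F4", "\u00F5"]           => "o",  # ó ô õ
  ["\u00FA", "\u00FC"]                     => "u"   # ú ü
}

# Flatten { [chars] => replacement } into { char => replacement }
# so that each character can be looked up directly.
FLAT_MAPPING = MAPPING.each_with_object({}) do |(chars, replacement), flat|
  chars.each { |char| flat[char] = replacement }
end

# Hypothetical helper: replace every mapped non-ASCII character and
# leave everything else (including unmapped characters) untouched.
def fold_accents(text)
  text.gsub(/[^\x00-\x7F]/) { |char| FLAT_MAPPING.fetch(char, char) }
end

puts fold_accents("preju\u00EDzo")      # prints "prejuizo"
puts fold_accents("Jos\u00E9 Antonio")  # prints "Jose Antonio"
```

The folding only works if the string is seen as a sequence of UTF-8 characters; if it is processed byte by byte, a two-byte character never matches a one-character key.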
On Wed, May 23, 2007 at 04:25:12AM +0200, Marcello parra wrote:
> Hello,
>
> I want to clean up accented characters in my index, using acts_as_ferret
> in a Rails project. I searched this forum, and found the best solution
> is to use an analyser.
> I created something like this:
>
> class PortugueseAnalyzer

Try inheriting your analyzer from Ferret::Analysis::Analyzer. It does not seem to be necessary API-wise, but IMHO it should help.

Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
> Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
> seem to be necessary API-wise, but imho this should help.
>
> Jens

Thanks, Jens.

I changed "class PortugueseAnalyzer" to "class PortugueseAnalyzer < Ferret::Analysis::Analyzer", but that did not work either.

Did I put this in the right place?

Thanks,
Marcello
On Wed, May 23, 2007 at 11:42:04AM +0200, Marcello parra wrote:
> > Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
> > seem to be necessary API-wise, but imho this should help.
> >
> > Jens
>
> Thanks Jens.
> I changed from "class PortugueseAnalyzer" to
> "class PortugueseAnalyzer < Ferret::Analysis::Analyzer",
> but did not work also....
>
> Did I put this in the right place ??

I think so. To help debug this, a small Ruby script that reproduces the exact problem would be useful.

Jens
> I think so. To help debugging this a small ruby skript reproducing the
> exact problem would be cool.

Jens,

In the log, I get:

    creating doc for class: Conta, id: 164
    Adding field name with value 'José Antonio' to index

So the name is not being translated from UTF-8 to ASCII. The output is the same as when I do not use the analyzer.

Thanks
> In the log, I get:
>
> creating doc for class: Conta, id: 164
> Adding field name with value 'José Antonio' to index

I included the word "prejuízo", which should be translated to "prejuizo". I added some code to output information while the index is being built. This is what I get:

    Analyzing: field:nome str:prejuízo
    token["preju":0:5:1]
    token["zo":7:9:1]

So the problem is that it breaks the word in two, right at the accented character.

I guess the problem is in:

    def token_stream(field, string)
      return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
    end

But I can't figure out how.
On Wed, May 23, 2007 at 12:43:21PM +0200, Marcello parra wrote:
> > In the log, I get:
> >
> > creating doc for class: Conta, id: 164
> > Adding field name with value 'José Antonio' to index
>
> I included a word prejuízo... that should be translated to prejuizo...
> I put some code to output information when it builds the index. This is
> what a get:
>
> Analyzing: field:nome str:prejuízo
> token["preju":0:5:1]
> token["zo":7:9:1]

With the script at http://pastie.caboo.se/63808 I get:

    token["prejuizo":0:9:1]

It seems that Ferret doesn't recognize the í as a character and therefore splits the word at that position. You have to make sure that everything in your environment uses UTF-8 as its character encoding for this to work (locale settings are especially relevant to Ferret).

Jens
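The split reported above is consistent with the accented character being seen as two separate bytes rather than as one UTF-8 character: í occupies bytes 5 and 6, which matches the offsets 0:5 and 7:9. A minimal plain-Ruby sketch of the difference follows; `byte_tokens` and `utf8_tokens` are hypothetical helpers written for illustration and are not Ferret's tokenizer.

```ruby
# Sketch only: both helpers are hypothetical stand-ins, not Ferret code.
# Each returns [text, start_byte, end_byte] triples.

# Treats the input as raw bytes and accepts only ASCII letters, roughly
# what happens when a multibyte sequence is not recognised as a letter.
def byte_tokens(text)
  tokens = []
  text.b.scan(/[A-Za-z]+/) do
    match = Regexp.last_match
    tokens << [match[0], match.begin(0), match.end(0)]
  end
  tokens
end

# Treats the input as UTF-8 characters and accepts any Unicode letter.
def utf8_tokens(text)
  tokens = []
  text.scan(/\p{L}+/) do
    match = Regexp.last_match
    start_byte = match.pre_match.bytesize
    tokens << [match[0], start_byte, start_byte + match[0].bytesize]
  end
  tokens
end

word = "preju\u00EDzo"  # "prejuízo", 8 characters but 9 bytes in UTF-8

p byte_tokens(word)  # prints [["preju", 0, 5], ["zo", 7, 9]]
p utf8_tokens(word)  # a single token covering bytes 0 to 9
```

On POSIX systems the locale is usually taken from the LC_ALL, LC_CTYPE or LANG environment variables, so a value ending in ".UTF-8" there is a reasonable first thing to check for the process that builds the index.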