I'm having trouble with PerFieldAnalyzer (ferret version 0.10.14).

Script:

  require 'rubygems'
  require 'ferret'
  require 'pp'

  include Ferret::Analysis
  include Ferret::Index

  class TestAnalyzer
    def token_stream field, input
      pp field
      pp input
      LetterTokenizer.new(input)
    end
  end

  pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
  pfa[:test] = TestAnalyzer.new
  index = Index.new(:analyzer => pfa)
  index << {:test => 'foo'}
  index.search_each('bar')

Output:

  :test
  ""
  :test
  "bar"

Why is input "" the first time token_stream is called?

I hope that the answer isn't "upgrade to 0.11". :(

-ryan
On 3/28/07, Ryan King <ryansking at gmail.com> wrote:
> I'm having trouble with PerFieldAnalyzer (ferret version 0.10.14).
[..]
> Why is input "" the first time token_stream is called?
>
> I hope that the answer isn't "upgrade to 0.11". :(

FWIW, I upgraded to 0.11.3 on my test box and it didn't change
anything. Are my assumptions about PFA wrong? Or is there a bug?

-ryan
On Mon, Apr 02, 2007 at 11:57:37AM -0700, Ryan King wrote:
[..]
> FWIW, I upgraded to 0.11.3 on my test box and it didn't change
> anything. Are my assumptions about PFA wrong? Or is there a bug?

I guess that's a bug - I can reproduce that behaviour perfectly here.

The funny thing is that this does not necessarily mean that it doesn't
work as intended. Just for fun I wrote an analyzer that completely
ignores the input it should analyze, and always uses a fixed text
instead:

  class TestAnalyzer
    def token_stream field, input
      ts = LetterTokenizer.new("senseless standard text")
      puts "token_stream for :#{field} and input <#{input}>: #{ts.inspect}\n #{ts.text}"
      ts
    end
  end

  a = TestAnalyzer.new
  ts = a.token_stream :test, 'foo bar'
  puts ts.text  # 'senseless standard text' as expected

  pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
  pfa[:test] = TestAnalyzer.new
  ts = pfa.token_stream :test, 'foo bar'
  puts ts.text  # surprise: 'foo bar'

I guess the pfa does not pass the text to analyze via the token_stream
method, but sets it later using the Tokenizer's text=() method.

Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
kraemer at webit.de | www.webit.de
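If that guess is right, the empty-looking input at index time should be
harmless: the field text would still get tokenized after the stream is
handed back. A quick way to check, reusing Ryan's setup (a sketch only --
the field-scoped query 'test:foo' and the expected single hit follow from
Jens's observation above, they haven't been verified here):

  require 'rubygems'
  require 'ferret'

  include Ferret::Analysis
  include Ferret::Index

  # Same analyzer as in Ryan's original script: just wrap the input in a
  # LetterTokenizer, even though the input may look empty at index time.
  class TestAnalyzer
    def token_stream field, input
      LetterTokenizer.new(input)
    end
  end

  pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
  pfa[:test] = TestAnalyzer.new
  index = Index.new(:analyzer => pfa)
  index << {:test => 'foo'}

  # If the real text is injected into the tokenizer after construction,
  # the document was indexed from 'foo' and this should print one hit.
  index.search_each('test:foo') { |doc_id, score| puts "hit: doc #{doc_id}" }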
On 4/3/07, Jens Kraemer <kraemer at webit.de> wrote:
[..]
> I guess the pfa does not pass the text to analyze via the token_stream
> method, but sets it later using the Tokenizer's text=() method.

I don't think so. I've tried overriding #text=, but it never gets called.

-ryan
On Tue, Apr 03, 2007 at 10:29:49AM -0700, Ryan King wrote:
[..]
> I don't think so. I've tried overriding #text=, but it never gets called.

OK, then it's happening somewhere else - in ferret's analysis.c there's
a method a_standard_get_ts that clones an existing token stream instance
and calls a method named reset on it, with the text to be tokenized.

I guess we'll need Dave's help to sort this out...

Jens
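For what it's worth, a very rough pure-Ruby sketch of that clone-and-reset
pattern might look like the following. Everything here is illustrative
(the real code is C in analysis.c, and none of these class names are
Ferret's); the point is only to show how an analyzer's stream can be
created with an empty string and still end up tokenizing the right text:

  # Minimal stand-in for a token stream: splits its text on whitespace.
  class FakeTokenizer
    def initialize(input)
      self.text = input
    end
    def text=(text)
      @terms = text.split
    end
    def next
      @terms.shift
    end
  end

  class FakeAnalyzer
    def token_stream(field, input)
      puts "token_stream(:#{field}, #{input.inspect})"
      FakeTokenizer.new(input)
    end
  end

  # The wrapper asks the analyzer for a stream once, with an empty
  # string, and then pushes each real string into it via text= -- the
  # "reset" step. This is why the analyzer can see "" as its input.
  class ResettingWrapper
    def initialize(analyzer)
      @analyzer = analyzer
    end
    def each_token(field, strings)
      ts = @analyzer.token_stream(field, "")
      strings.each do |s|
        ts.text = s
        while term = ts.next
          yield term
        end
      end
    end
  end

  w = ResettingWrapper.new(FakeAnalyzer.new)
  w.each_token(:test, ['one two', 'three']) { |t| puts t }
  # prints: token_stream(:test, ""), then one, two, three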
On 4/4/07, Jens Kraemer <kraemer at webit.de> wrote:
[..]
> OK, then it's happening somewhere else - in ferret's analysis.c there's
> a method a_standard_get_ts that clones an existing token stream instance
> and calls a method named reset on it, with the text to be tokenized.
>
> I guess we'll need Dave's help to sort this out...

Ok, I can see why this is confusing. To show you how it works, try this
code:

  require 'rubygems'
  require 'ferret'
  require 'pp'
  require 'strscan'

  include Ferret::Analysis
  include Ferret::Index

  class TestAnalyzer
    class TestTokenizer
      def initialize(input)
        puts "initialize => (#{input})"
        @input = input
      end
      def next()
        term, @input = @input, nil
        return term ? Token.new(term, 0, term.size) : nil
      end
      def text=(text)
        puts "reset => (#{text})"
        @input = text
      end
    end

    def token_stream field, input
      pp field
      pp input
      TestTokenizer.new(input)
    end
  end

  pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
  pfa[:test] = TestAnalyzer.new
  index = Index.new(:analyzer => pfa)
  index << {:test => 'foo'}
  index.search_each('bar')

The output is:

  :test
  ""
  initialize => ()
  r_analysis.c, 563: cwrts_reset   #<= debugging bug :-0
  reset => (foo)
  :test
  "bar"
  initialize => (bar)

There is a stray debugging print in there which I'm embarrassed I
didn't pick up earlier. But otherwise it should show you what is
happening. The tokenizer gets created with an empty string and then
TestTokenizer#text= gets called. This was actually an optimization for
multi-string fields. For example:

  index << {:test => ['one', 'two', 'three']}
  # =>
  initialize => ()
  reset => (one)
  reset => (two)
  reset => (three)

So the tokenizer only needs to be instantiated once and then it gets
reset for each string. This is a good example of premature optimization,
particularly since most people will never even have multi-string fields
like this. Getting rid of this optimization makes things a lot clearer.
The next version of Ferret will give this output:

  index << {:test => ['one', 'two', 'three']}
  # =>
  initialize => (one)
  initialize => (two)
  initialize => (three)

So Ryan, you will now get the output you expect. It will require
updating to Ferret 0.11.4 though. Is there any reason this is a
problem?

Hope that helps,
Dave

--
Dave Balmain
http://www.davebalmain.com/
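For anyone who has to stay on a pre-0.11.4 gem, the practical takeaway
from Dave's example seems to be: if you write a pure-Ruby tokenizer,
implement text= and rebuild all tokenization state there, instead of
relying on the constructor argument. A sketch along those lines (the
letter-splitting tokenizer and class names below are my own example,
not Ferret's built-in LetterTokenizer):

  require 'rubygems'
  require 'ferret'
  include Ferret::Analysis

  # Works whether the text arrives in the constructor (0.11.4+) or is
  # injected later through text= (the clone-and-reset behaviour above).
  class MyLetterTokenizer
    def initialize(input)
      self.text = input
    end

    # All state is (re)built here, so a later reset simply replaces
    # whatever the constructor happened to see (possibly "").
    def text=(text)
      @tokens = []
      text.scan(/[a-zA-Z]+/) do |term|
        start = Regexp.last_match.begin(0)
        @tokens << [term, start, start + term.size]
      end
      @pos = 0
      text
    end

    def next
      entry = @tokens[@pos]
      return nil unless entry
      @pos += 1
      term, s, e = entry
      Token.new(term, s, e)
    end
  end

  class MyAnalyzer
    def token_stream field, input
      MyLetterTokenizer.new(input)
    end
  end

  pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
  pfa[:test] = MyAnalyzer.new
  index = Ferret::Index::Index.new(:analyzer => pfa)
  index << {:test => 'foo bar'}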
On 4/5/07, David Balmain <dbalmain.ml at gmail.com> wrote:
[..]
> So Ryan, you will now get the output you expect. It will require
> updating to Ferret 0.11.4 though. Is there any reason this is a
> problem?

I'm at the point where I need to upgrade for other reasons anyway, so it
shouldn't be a problem.

Thanks for your help.

-ryan