Doug Smith
2007-May-05 00:50 UTC
[Ferret-talk] Stop words, fields, StandardAnalyzer quagmire
Hello,

I'm using: Ruby 1.8.6, Rails 1.2.3, ferret 0.11.4, acts_as_ferret from svn stable.

I've had quite a day wrestling with trying to remove the use of stopwords. The problem was that when searching for words like "no" or "the", no results were found. I found a confusing behavior that has taken me some time to figure out, and I hope sharing it saves someone else some time.

From searching around online and in the source code I came up with the following config in my ActiveRecord model:

    acts_as_ferret({:fields => {:name => {:boost => 10},
                                :type => {:boost => 2},
                                :email => {:boost => 10},
                                :bio => {:store => :no},
                                :status_id => {:boost => 1}},
                    :store_class_name => true,
                    :remote => true,
                    :ferret => { :analyzer =>
                      Ferret::Analysis::StandardAnalyzer.new([]) }
                   } )

With the StandardAnalyzer added, I do find results with "no" or "the". The complicating factor is that, as you can see, I have a field "status_id". This field lets me filter for profiles that are published or draft in my CMS.

Before I added the StandardAnalyzer, the status_id field worked fine in queries like this:

    a = Profile.find_by_contents("smith status_id:100")
    a.total_hits
    => 2 # this is correct, only 2 are published

    a = Profile.find_by_contents("smith")
    a.total_hits
    => 4 # this is correct, there are 4 total

So, you can see that the status_id was automatically "AND"-ed to the query word. However, after adding the above StandardAnalyzer config, the status_id was now "OR"-ed, like so:

    a = Profile.find_by_contents("no")
    a.total_hits
    => 5 # this is good

    a = Profile.find_by_contents("no status_id:100")
    a.total_hits
    => 208 # this is bad -- it's the same as if I only searched for status_id:100.

    a = Profile.find_by_contents("smith status_id:100")
    a.total_hits
    => 208 # this is just as bad -- it's the same as if I only searched for status_id:100.

The fix here is to add the AND keyword explicitly to the query:

    a = Profile.find_by_contents("smith AND status_id:100")
    a.total_hits
    => 2 # works just like before.

In fact, OR becomes the default search regardless of whether I use a field in the query:

    a = Profile.find_by_contents("smith jones")
    a.total_hits
    => 5 # OR'ed results

    a = Profile.find_by_contents("smith AND jones")
    a.total_hits
    => 0

Again, before StandardAnalyzer, "AND" was the default, so the first "smith jones" query would have returned 0, as it should.

Any insight as to why this might be? I would prefer AND to be the default.

Thanks,
Doug
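The two effects described in this message (stop words removing query terms, and the default operator changing the hit count) can be seen in a toy model. The sketch below is plain Ruby written for a current interpreter; it is not Ferret's analyzer or query parser, and the stop list, the `analyze` and `search` helpers, and the sample documents are all hypothetical.

```ruby
# Toy illustration only: this is not Ferret's analyzer or query parser.
STOP_WORDS = %w[a an and no not of or the to].freeze # hypothetical stop list

# Lowercase the text, split it on non-alphanumerics, and drop stop words.
def analyze(text, stop_words = STOP_WORDS)
  text.downcase.scan(/[a-z0-9]+/).reject { |token| stop_words.include?(token) }
end

# Return the documents that match the query terms under the default operator.
def search(docs, query, default_operator: :and, stop_words: STOP_WORDS)
  terms = analyze(query, stop_words)
  return [] if terms.empty? # every term was a stop word, nothing left to match

  docs.select do |doc|
    tokens = analyze(doc, stop_words)
    if default_operator == :and
      terms.all? { |term| tokens.include?(term) }
    else
      terms.any? { |term| tokens.include?(term) }
    end
  end
end

DOCS = ["John Smith", "Mary Jones", "No Doubt fan club"].freeze

with_stop_list    = search(DOCS, "no")                                 # query term is filtered out
without_stop_list = search(DOCS, "no", stop_words: [])                 # empty stop list keeps "no"
and_default       = search(DOCS, "smith jones")                        # both terms required
or_default        = search(DOCS, "smith jones", default_operator: :or) # either term suffices
```

With an empty stop list the query "no" reaches the index; switching the default operator from AND to OR is what turns "smith jones" from a query that matches nothing into one that matches every Smith and every Jones.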
Jens Kraemer
2007-May-08 09:35 UTC
[Ferret-talk] Stop words, fields, StandardAnalyzer quagmire
Hi!

On Fri, May 04, 2007 at 05:50:39PM -0700, Doug Smith wrote:
> Hello,
>
> I'm using: Ruby 1.8.6, Rails 1.2.3, ferret 0.11.4, acts_as_ferret from
> svn stable.
[..]
> acts_as_ferret({:fields => {:name => {:boost => 10},
>                             :type => {:boost => 2},
>                             :email => {:boost => 10},
>                             :bio => {:store => :no},
>                             :status_id => {:boost => 1}},
>                 :store_class_name => true,
>                 :remote => true,
>                 :ferret => { :analyzer =>
>                   Ferret::Analysis::StandardAnalyzer.new([]) }
>                } )
>
> With the StandardAnalyzer added, I do find results with "no" or "the".
> The complicating factor is that as you can see, I have a field
> "status_id". This field lets me filter for profiles that are
> published or draft in my CMS.
>
[..]
> In fact, OR becomes the default search regardless of whether I use a
> field in the query:
[..]
> Again, before StandardAnalyzer, "AND" was the default so the first
> "smith jones" query would have returned 0 as it should.
>
> Any insight as to why this might be? I would prefer AND to be the default.

Then you shouldn't override acts_as_ferret's default behaviour by using the completely unsupported and only internally used :ferret option :-)

I admit that this is a bug in how aaf handles its parameters and I'll fix this. For the time being, however, you can use this statement, which should work as intended:

    acts_as_ferret({ :fields => {:name => {:boost => 10},
                                 :type => {:boost => 2},
                                 :email => {:boost => 10},
                                 :bio => {:store => :no},
                                 :status_id => {:boost => 1}},
                     :store_class_name => true,
                     :remote => true
                   }, {
                     :analyzer => Ferret::Analysis::StandardAnalyzer.new([])
                   })

Please note the difference: the analyzer option is part of a second options hash. The reason for this separation is that aaf more or less passes the last hash directly to Ferret, while the first options hash is used for aaf options Ferret itself doesn't know about.

However, I plan to rework this in the future, so your original statement should then work correctly.

Btw, where did you find that solution? I've never seen the :ferret option being used outside aaf before.

Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
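The message above does not say what the parameter-handling bug actually was. One plausible mechanism, offered purely as an assumption, is that a user-supplied hash replaced the plugin's own Ferret defaults instead of being merged with them, so a setting such as `:or_default => false` was lost and the library fell back to OR. The sketch below shows that general pitfall in plain Ruby; `PLUGIN_FERRET_DEFAULTS`, `ferret_options_replacing` and `ferret_options_merging` are hypothetical names, not aaf code.

```ruby
# Hypothetical defaults a plugin might want to hand to the search library.
# :or_default => false stands for "combine query terms with AND".
PLUGIN_FERRET_DEFAULTS = { :or_default => false }.freeze

# Buggy variant (an assumption, not the actual aaf source): a user-supplied
# :ferret hash replaces the plugin defaults wholesale.
def ferret_options_replacing(options)
  options[:ferret] || PLUGIN_FERRET_DEFAULTS
end

# Safer variant: user-supplied settings are merged over the defaults,
# so anything the user did not mention keeps its default value.
def ferret_options_merging(options)
  PLUGIN_FERRET_DEFAULTS.merge(options[:ferret] || {})
end

# :custom_analyzer is a placeholder symbol standing in for an analyzer object.
user_options = { :remote => true, :ferret => { :analyzer => :custom_analyzer } }

replaced = ferret_options_replacing(user_options) # :or_default is gone
merged   = ferret_options_merging(user_options)   # :or_default survives
```

In the replacing variant the `:or_default` key disappears as soon as the caller sets any Ferret option at all, which would match the symptom reported in this thread; whether that is what happened inside aaf is not confirmed here.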
Jens Kraemer
2007-May-08 09:51 UTC
[Ferret-talk] Stop words, fields, StandardAnalyzer quagmire
On Tue, May 08, 2007 at 11:35:48AM +0200, Jens Kraemer wrote:
[..]
> > acts_as_ferret({:fields => {:name => {:boost => 10},
> >                             :type => {:boost => 2},
> >                             :email => {:boost => 10},
> >                             :bio => {:store => :no},
> >                             :status_id => {:boost => 1}},
> >                 :store_class_name => true,
> >                 :remote => true,
> >                 :ferret => { :analyzer =>
> >                   Ferret::Analysis::StandardAnalyzer.new([]) }
> >                } )
> >

I just committed a fix, so the above call should be working correctly now. I'd go so far as to say that this should be the preferred way of passing ferret options to aaf now. The two-hash calling style I suggested below will still work of course, so nothing should break.

Thoughts anyone?

Old calling style:
>
> acts_as_ferret({ :fields => {:name => {:boost => 10},
>                              :type => {:boost => 2},
>                              :email => {:boost => 10},
>                              :bio => {:store => :no},
>                              :status_id => {:boost => 1}},
>                  :store_class_name => true,
>                  :remote => true
>                }, {
>                  :analyzer => Ferret::Analysis::StandardAnalyzer.new([])
>                })
>
> Please note the difference: the analyzer option is part of a second
> options hash.
>

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
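Supporting both calling styles at once comes down to normalizing the arguments before they are used. The following is a minimal sketch of that idea in plain Ruby, not the committed aaf fix; `split_options` is a hypothetical helper, and the choice to let the second hash win on conflicts is an assumption.

```ruby
# Hypothetical helper, not the aaf source: fold both calling styles into
# one pair of [plugin_options, ferret_options].
def split_options(options = {}, ferret_options = {})
  plugin_options = options.dup                    # do not mutate the caller's hash
  nested = plugin_options.delete(:ferret) || {}   # new style: nested :ferret key
  # Assumed precedence: the explicit second hash wins over the nested key.
  [plugin_options, nested.merge(ferret_options)]
end

# :my_analyzer is a placeholder symbol standing in for an analyzer object.
new_style = split_options({ :remote => true, :ferret => { :analyzer => :my_analyzer } })
old_style = split_options({ :remote => true }, { :analyzer => :my_analyzer })
```

Both calls yield the same pair of hashes, so code further down only ever has to deal with one shape regardless of how the caller wrote the declaration.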
Doug Smith
2007-May-08 15:14 UTC
[Ferret-talk] Stop words, fields, StandardAnalyzer quagmire
On 5/8/07, Jens Kraemer <kraemer at webit.de> wrote:
> Hi!
>
> However I plan to rework this in the Future so then your original statement
> should work correctly then. Btw, where did you find that solution? I've
> never seen the :ferret option being used outside aaf before.

Hi Jens,

Thank you for your fast response. I found this as an option by searching through the aaf source code. There was a commented-out version of it in act_methods.rb, in the acts_as_ferret() method.

I'll try your latest change and let you know how it works.

Thanks again,
Doug
Doug Smith
2007-May-08 15:54 UTC
[Ferret-talk] Stop words, fields, StandardAnalyzer quagmire
On 5/8/07, Jens Kraemer <kraemer at webit.de> wrote:
> On Tue, May 08, 2007 at 11:35:48AM +0200, Jens Kraemer wrote:
> [..]
> > > acts_as_ferret({:fields => {:name => {:boost => 10},
> > >                             :type => {:boost => 2},
> > >                             :email => {:boost => 10},
> > >                             :bio => {:store => :no},
> > >                             :status_id => {:boost => 1}},
> > >                 :store_class_name => true,
> > >                 :remote => true,
> > >                 :ferret => { :analyzer =>
> > >                   Ferret::Analysis::StandardAnalyzer.new([]) }
> > >                } )
> > >
>
> I just committed a fix so that the above call should be working
> correctly now. I'd go so far to say that this should be the preferred
> way of passing ferret options to aaf now. The two-hash calling style I
> suggested below will still work of course, so nothing should break.

Hi Jens,

This is excellent. It works well in my initial testing. I think it's a great way to go.

Thanks for your great support,
Doug