Julio Cesar Ody
2007-Mar-13 05:07 UTC
[Ferret-talk] index returns all results for specific queries
Hey all,

I'm getting some really weird results when searching documents. It *seems* to be somehow related to the document format I'm using.

I wrote a small script to replicate it:

################
#!/usr/bin/ruby

require 'rubygems'
require 'ferret'
include Ferret
index = Index::Index.new(:path => '/tmp/fooindex', :key => :id)

# dummy data
index << {:visibility=>"private", :type=>"media", :title=>"example title", :owner=>"user/3003", :author=>"user/3003", :description=>"description example", :id=>"user/3003/media/1"}
index << {:visibility=>"private", :type=>"media", :title=>"a new title", :owner=>"user/3003", :author=>"user/3003", :description=>"more foo desc", :id=>"user/3003/media/2"}
index << {:visibility=>"private", :type=>"media", :title=>"random title", :owner=>"user/3003", :author=>"user/3003", :description=>"random description", :id=>"user/3003/media/4"}
index << {:visibility=>"private", :type=>"media", :title=>"random title", :owner=>"user/3003", :author=>"user/3003", :description=>"random description", :id=>"user/3003/media/5"}

index.search_each(ARGV.shift) { |doc, score|
  puts index[doc].load.inspect
}
################

The following queries return *all* the documents currently in the index:

$ ruby script.rb "title:me"
{:author=>"user/3003", :description=>"description example", :visibility=>"private", :id=>"user/3003/media/1", :title=>"example title", :type=>"media", :owner=>"user/3003"}
... (remaining results)
$ ruby script.rb "title:my"
(same as above)

And weirdly enough, the following

$ ruby script.rb "title:mo"

won't return anything. There are more variants of this, but I think you get my meaning.

The following works properly:

$ ruby script.rb "title:random"
(returns the two results that contain "random" in the title, which is the expected behaviour)

Is there something I'm missing? It doesn't make sense to me that the queries above should return all the documents in the index, especially considering they don't actually match anything.

Any help is appreciated.

Thanks.

--
Julio C. Ody
David Balmain
2007-Mar-13 06:38 UTC
[Ferret-talk] index returns all results for specific queries
On 3/13/07, Julio Cesar Ody <julioody at gmail.com> wrote:
> Hey all,
>
> I'm getting some really weird results when searching documents. It
> *seems* to be somehow related to the document format I'm using.
>
> I wrote a small script to replicate it:
>
> ################
> #!/usr/bin/ruby
>
> require 'rubygems'
> require 'ferret'
> include Ferret
> index = Index::Index.new(:path => '/tmp/fooindex', :key => :id)
>
> # dummy data
> index << {:visibility=>"private", :type=>"media", :title=>"example
> title", :owner=>"user/3003", :author=>"user/3003",
> :description=>"description example", :id=>"user/3003/media/1"}
> index << {:visibility=>"private", :type=>"media", :title=>"a new
> title", :owner=>"user/3003", :author=>"user/3003", :description=>"more
> foo desc", :id=>"user/3003/media/2"}
> index << {:visibility=>"private", :type=>"media", :title=>"random
> title", :owner=>"user/3003", :author=>"user/3003",
> :description=>"random description", :id=>"user/3003/media/4"}
> index << {:visibility=>"private", :type=>"media", :title=>"random
> title", :owner=>"user/3003", :author=>"user/3003",
> :description=>"random description", :id=>"user/3003/media/5"}
>
> index.search_each(ARGV.shift) { |doc, score|
>   puts index[doc].load.inspect
> }
> ################

Thanks for including the script. It makes my job much easier. :)

> The following queries are returning *all* the results currently in the index:
>
> $ ruby script.rb "title:me"
> {:author=>"user/3003", :description=>"description example",
> :visibility=>"private", :id=>"user/3003/media/1", :title=>"example
> title", :type=>"media", :owner=>"user/3003"}
> ... (remaining results)
> $ ruby script.rb "title:my"
> (same as above)
>
> And weird enough, the following
>
> $ ruby script.rb "title:mo"
>
> Won't return anything. There's more variants to that, but I think you
> get my meaning.

The problem is that 'me' and 'my' are stop words. When they get removed, the query becomes 'title:', which is invalid. By default Ferret catches query parse exceptions and attempts to parse the query as a simple boolean term query, removing all special characters, so this query then becomes 'title'. Since 'title' can be found in the title field of every document, all documents are returned. 'mo' is not a stop word, so 'title:mo' parses normally and simply matches nothing. So I don't think this is a bug, but it is definitely undesired behaviour. I'll try to think of a better way to parse this.

In the meantime, you may want to think about changing the stop word list or removing stop words altogether to prevent this problem from occurring.

--
Dave Balmain
http://www.davebalmain.com/
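
The sequence David describes (stop word removal, a failed strict parse, then a fallback that strips special characters) can be sketched in plain Ruby. This is a toy model only, not Ferret's parser: the names `STOP_WORDS`, `strict_parse`, `fallback_parse` and `search` are hypothetical, and the stop word list is an assumed subset chosen for illustration.

```ruby
# Toy model of the behaviour described above. This is NOT Ferret code;
# every name here is made up for illustration.
STOP_WORDS = %w[me my the a an and of].freeze # assumed subset

DOCS = [
  { :id => "user/3003/media/1", :title => "example title" },
  { :id => "user/3003/media/2", :title => "a new title" },
  { :id => "user/3003/media/4", :title => "random title" },
  { :id => "user/3003/media/5", :title => "random title" }
].freeze

# Lowercase the text and split it into alphanumeric tokens.
def tokens(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Strict "field:term" parse. Raises when no field is given or when
# stop word removal leaves the field with no term (e.g. "title:").
def strict_parse(query)
  field, text = query.split(":", 2)
  raise ArgumentError, "no field given" if text.nil?
  terms = tokens(text) - STOP_WORDS
  raise ArgumentError, "empty term for field '#{field}'" if terms.empty?
  [field.to_sym, terms]
end

# Fallback: drop special characters and treat the remaining words as
# plain terms to look for in any field.
def fallback_parse(query)
  [:any, tokens(query) - STOP_WORDS]
end

def search(query)
  field, terms = begin
    strict_parse(query)
  rescue ArgumentError
    fallback_parse(query)
  end
  DOCS.select do |doc|
    fields = field == :any ? doc.keys : [field]
    words = fields.flat_map { |f| tokens(doc[f].to_s) }
    terms.any? { |t| words.include?(t) }
  end
end

puts search("title:me").size     # prints 4: fallback searches for "title"
puts search("title:mo").size     # prints 0: "mo" is a normal term, no match
puts search("title:random").size # prints 2
```

In this model "title:me" only matches everything because the field name itself survives the fallback and happens to be a word in every title, which mirrors the explanation above.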
Julio Cesar Ody
2007-Mar-13 22:29 UTC
[Ferret-talk] index returns all results for specific queries
Thanks David,

I instantiated a StandardAnalyzer, passing an empty array for the stop words, and it did the trick.

If anyone wants to comment on what I'm losing by doing this, it would be really nice.

--
Julio C. Ody
http://rootshell.be/~julioody