I'm getting some results that I don't understand from a search. The code, based on the tutorial, and the results are below. Everything makes sense to me, except the results for the 'title:"Some"' query. I would think that it should match the first two documents, but not the third. What am I missing here?

Thanks for any help!

--- code -----------------------------------------------------
require 'ferret'

def query(index, query_str)
  puts("Query '#{query_str}'...")
  index.search_each(query_str) do |id, score|
    puts(" id=#{id} score=#{score} uid=#{index[id][:uid]} title='#{index[id][:title]}'")
  end
end

index = Ferret::Index::Index.new
index << {:uid => 'one', :title => 'Some Title', :content => 'my first text'}
index << {:uid => 'two', :title => 'Some Title', :content => 'some second content'}
index << {:uid => 'three', :title => 'Other Title', :content => 'my third text'}

query(index, 'content:"text"')
query(index, 'content:"some"')
query(index, 'title:"Some"')
query(index, 'title:"Title"')
query(index, 'uid:"two"')

--- results ---------------------------------------
Query 'content:"text"'...
 id=0 score=0.625 uid=one title='Some Title'
 id=2 score=0.625 uid=three title='Other Title'
Query 'content:"some"'...
 id=1 score=0.125318586826324 uid=two title='Some Title'
Query 'title:"Some"'...
 id=0 score=0.0554137788712978 uid=one title='Some Title'
 id=1 score=0.0554137788712978 uid=two title='Some Title'
 id=2 score=0.0554137788712978 uid=three title='Other Title'
Query 'title:"Title"'...
 id=0 score=0.712317943572998 uid=one title='Some Title'
 id=1 score=0.712317943572998 uid=two title='Some Title'
 id=2 score=0.712317943572998 uid=three title='Other Title'
Query 'uid:"two"'...
 id=1 score=1.0 uid=two title='Some Title'
On Mar 31, 2007, at 5:36 PM, Jeff Mallatt wrote:
> I'm getting some results that I don't understand from a search.
>
> index << {:uid => 'one', :title => 'Some Title', :content => 'my first text'}
> index << {:uid => 'two', :title => 'Some Title', :content => 'some second content'}
> index << {:uid => 'three', :title => 'Other Title', :content => 'my third text'}
>
> query(index, 'title:"Some"')
> query(index, 'title:"Title"')
> query(index, 'uid:"two"')

Nice one.

When people don't understand search results, it usually has to do with stop words. The StandardAnalyzer, which parses documents and (!) queries, uses a list of stop words that are ignored. See Ferret::Analysis::FULL_ENGLISH_STOP_WORDS for a complete list of (English) stop words.

In the case of "title:Some", "Some" is removed by the analyzer, giving only "title:", i.e. an empty query which (surprisingly) matches all documents.

However, the same should happen with "content:some", but this one returns only one document, which leaves me completely puzzled. This just isn't consistent.

So I'm afraid I can't be of much help here, but I'm sure somebody else will enlighten us. This might well be a bug, but even if it's not, it's definitely not what anyone would reasonably expect.

--

@David: You should probably consider changing StandardAnalyzer not to use stop words by default. It confuses people because no one would suspect such a feature to be enabled by default. It just doesn't follow the principle of least astonishment.

Even if people want to use stop words, they might not be happy with the ones built into Ferret. It very much depends on the nature of the content that is indexed, and instead of using a one-size-fits-all stop word list one is usually better off compiling a custom one for any particular application.

Cheers,
Andy
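
[Editor's sketch] The point that the analyzer is applied to documents and to queries can be illustrated with a toy model in plain Ruby. This is not Ferret's implementation: `SAMPLE_STOP_WORDS` and `analyze` are hypothetical names, and the word list is a small made-up sample rather than Ferret's actual FULL_ENGLISH_STOP_WORDS.

```ruby
# Toy model of stop-word filtering; NOT Ferret's implementation.
# SAMPLE_STOP_WORDS is a small hypothetical list for illustration only.
SAMPLE_STOP_WORDS = %w[a an and my some the].freeze

# Lowercase the text, split it into word tokens, and drop stop words.
def analyze(text, stop_words = SAMPLE_STOP_WORDS)
  text.downcase.scan(/\w+/).reject { |token| stop_words.include?(token) }
end

puts analyze('Some Title').inspect           # tokens indexed for a title  => ["title"]
puts analyze('some second content').inspect  # tokens indexed for content  => ["second", "content"]
puts analyze('Some').inspect                 # the query term is removed   => []
puts analyze('Some', []).inspect             # with an empty stop list     => ["some"]
```

Under this model the word "some" never reaches the index and never survives query analysis, so a query for it has nothing left to match on.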
On Mar 31, 2007, at 10:41 AM, Andreas Korth wrote:
> @David: You should probably consider changing StandardAnalyzer not to
> use stop words by default. It confuses people because no one would
> suspect such a feature to be enabled by default. It just doesn't
> follow the principle of least astonishment.
>
> Even if people want to use stop words, they might not be happy with
> the ones built into Ferret. It very much depends on the nature of the
> content that is indexed and instead of using a one-size-fit-all stop
> word list one is usually better off with compiling a custom one for
> any particular application.

I concur.

Ferret's StandardAnalyzer is based upon Lucene's class of the same name, so some parallelism would be lost, but I think omitting stop lists is better nonetheless.

There are performance and disk-space implications for avoiding stop lists by default. However, disk space is cheap, Ferret is fast, and search results are slightly better when you avoid stop lists (e.g. searching for '"the who"' actually returns something). Users with large deployments will be able to trade away some amount of IR precision for increased performance by enabling stop lists if they so choose.

KinoSearch doesn't have a StandardAnalyzer; a class called PolyAnalyzer fills that role. By default, it performs lowercasing, tokenizing and stemming -- but no stopalizing.

<http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Analysis/PolyAnalyzer.html>

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
On Sat, Mar 31, 2007 at 07:41:06PM +0200, Andreas Korth wrote:
>
> On Mar 31, 2007, at 5:36 PM, Jeff Mallatt wrote:
>
> > I'm getting some results that I don't understand from a search.
> >
> > index << {:uid => 'one', :title => 'Some Title', :content => 'my first text'}
> > index << {:uid => 'two', :title => 'Some Title', :content => 'some second content'}
> > index << {:uid => 'three', :title => 'Other Title', :content => 'my third text'}
> >
> > query(index, 'title:"Some"')
> > query(index, 'title:"Title"')
> > query(index, 'uid:"two"')
>
> Nice one.
>
> When people don't understand search results, it's usually to do with
> stop words. The StandardAnalyzer which parses documents and(!)
> queries, uses a list of stop words which are ignored. See
> Ferret::Analysis::FULL_ENGLISH_STOP_WORDS for a complete list of
> (english) stop words.
>
> In the case of "title:Some", "Some" is removed by the analyzer giving
> only "title:", i.e. an empty query which (surprisingly) matches all
> documents.
>
> However, the same should happen with "content:some" but this one
> returns only one document which leaves me completely puzzled. This
> just isn't consistent.

Adding the output of index.process_query to the script, I get:

Query 'content:"some"'...
processed to <title:content uid:content content:content>
Query 'title:"Some"'...
processed to <title:title uid:title content:title>

So it seems the stop word is stripped first, then the query is recognized as invalid, and the parser does its best to run it anyway: it takes the remaining word that once was the field name and interprets it as the query string, searched across all fields. That is why 'title:"Some"' matches every document (each title contains "Title") and 'content:"some"' matches only the document whose content contains "content".

Setting handle_parse_errors to false turns this behaviour off and leads to no results for the empty queries.

Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
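
[Editor's sketch] The fallback described in the message above can be reconstructed as a toy model in plain Ruby, assuming the parser behaves the way the process_query output suggests. Nothing here is Ferret code: `STOP_WORDS`, `FIELDS`, `analyze`, `toy_process_query` and `search` are hypothetical stand-ins, and the stop list is a made-up two-word sample.

```ruby
# Toy reconstruction of the reported behaviour; NOT Ferret's parser.
STOP_WORDS = %w[some my].freeze          # hypothetical sample stop list
FIELDS = %i[title uid content].freeze    # the fields used in the thread

DOCS = [
  { uid: 'one',   title: 'Some Title',  content: 'my first text' },
  { uid: 'two',   title: 'Some Title',  content: 'some second content' },
  { uid: 'three', title: 'Other Title', content: 'my third text' }
].freeze

# Lowercase, tokenize, and drop stop words (used for documents and queries).
def analyze(text)
  text.downcase.scan(/\w+/).reject { |token| STOP_WORDS.include?(token) }
end

# Turn 'field:"phrase"' into [field, term] clauses. If stop-word removal
# empties the phrase, fall back to searching the field NAME in all fields,
# mirroring the "<title:title uid:title content:title>" output quoted above.
def toy_process_query(query_str)
  field, phrase = query_str.split(':', 2)
  terms = analyze(phrase)
  return FIELDS.map { |f| [f, field.downcase] } if terms.empty?

  terms.map { |term| [field.to_sym, term] }
end

# Return the uids of documents matching any clause of the processed query.
def search(query_str)
  clauses = toy_process_query(query_str)
  matches = DOCS.select do |doc|
    clauses.any? { |field, term| analyze(doc[field]).include?(term) }
  end
  matches.map { |doc| doc[:uid] }
end

puts search('title:"Some"').inspect    # => ["one", "two", "three"]
puts search('content:"some"').inspect  # => ["two"]
puts search('content:"text"').inspect  # => ["one", "three"]
```

The toy model returns the same document sets as the results posted at the top of the thread: every title contains the token "title", and only document "two" contains the token "content".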
At 2007-04-01 06:12, you wrote:
>On Sat, Mar 31, 2007 at 07:41:06PM +0200, Andreas Korth wrote:
> > On Mar 31, 2007, at 5:36 PM, Jeff Mallatt wrote:
> > > I'm getting some results that I don't understand from a search.
> > >
>[snip]
>adding the output of index.process_query to the script I get:
>
>Query 'content:"some"'...
>processed to <title:content uid:content content:content>
>Query 'title:"Some"'...
>processed to <title:title uid:title content:title>
>
>so it seems the stop word is stripped first, then the query is
>recognized as invalid, and the parser does it's best to run it anyway -
>it takes the remaining word that once was the field name, and interprets
>it as the query string.
>
>Setting handle_parse_errors to false turns this behaviour off and leads
>to no results for the empty queries.

That explains it all. Thanks much!