thr3ads.net - Ferret talk - [Ferret-talk] not understanding search results [Mar 2007]

Jeff Mallatt

2007-Mar-31 15:36 UTC

[Ferret-talk] not understanding search results

I'm getting some results that I don't understand from a search.

The code, based on the tutorial, and the results are below.

Everything makes sense to me, except the results for the
'title:"Some"' query.  I would think that it should match the first
two documents, but not the third.

What am I missing here?

Thanks for any help!

--- code -----------------------------------------------------

require 'ferret'

# Run a query and print each hit with its score and stored fields.
def query(index, query_str)
  puts("Query '#{query_str}'...")
  index.search_each(query_str) do |id, score|
    puts("  id=#{id} score=#{score} uid=#{index[id][:uid]} title='#{index[id][:title]}'")
  end
end

# In-memory index using the default analyzer.
index = Ferret::Index::Index.new

index << {:uid => 'one', :title => 'Some Title', :content => 'my first text'}
index << {:uid => 'two', :title => 'Some Title', :content => 'some second content'}
index << {:uid => 'three', :title => 'Other Title', :content => 'my third text'}

query(index, 'content:"text"')
query(index, 'content:"some"')
query(index, 'title:"Some"')
query(index, 'title:"Title"')
query(index, 'uid:"two"')

--- results ---------------------------------------

Query 'content:"text"'...
  id=0 score=0.625 uid=one title='Some Title'
  id=2 score=0.625 uid=three title='Other Title'
Query 'content:"some"'...
  id=1 score=0.125318586826324 uid=two title='Some Title'
Query 'title:"Some"'...
  id=0 score=0.0554137788712978 uid=one title='Some Title'
  id=1 score=0.0554137788712978 uid=two title='Some Title'
  id=2 score=0.0554137788712978 uid=three title='Other Title'
Query 'title:"Title"'...
  id=0 score=0.712317943572998 uid=one title='Some Title'
  id=1 score=0.712317943572998 uid=two title='Some Title'
  id=2 score=0.712317943572998 uid=three title='Other Title'
Query 'uid:"two"'...
  id=1 score=1.0 uid=two title='Some Title'

Andreas Korth

2007-Mar-31 17:41 UTC

On Mar 31, 2007, at 5:36 PM, Jeff Mallatt wrote:
> I'm getting some results that I don't understand from a search.
>
> index << {:uid => 'one', :title => 'Some Title', :content => 'my first text'}
> index << {:uid => 'two', :title => 'Some Title', :content => 'some second content'}
> index << {:uid => 'three', :title => 'Other Title', :content => 'my third text'}
>
> query(index, 'title:"Some"')
> query(index, 'title:"Title"')
> query(index, 'uid:"two"')

Nice one.

When people don't understand search results, it's usually to do with
stop words. The StandardAnalyzer, which parses documents and(!)
queries, uses a list of stop words which are ignored. See
Ferret::Analysis::FULL_ENGLISH_STOP_WORDS for a complete list of
(English) stop words.
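
As a rough illustration of the filtering described above, here is a minimal pure-Ruby sketch. It does not call Ferret at all: `STOP_WORDS` is a small hypothetical subset (the real list is Ferret::Analysis::FULL_ENGLISH_STOP_WORDS) and `analyze` is an invented helper name. It only shows why a title indexed as "Some Title" ends up with a single searchable token, and why the query term "Some" vanishes.

```ruby
# Toy model of stop-word filtering; this is NOT Ferret's implementation.
# STOP_WORDS is a hypothetical subset picked for this thread's example.
STOP_WORDS = %w[a an and the some].freeze

# Lowercase the text, split it into word tokens, drop the stop words.
def analyze(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |token| STOP_WORDS.include?(token) }
end

p analyze('Some Title')  # => ["title"]  (what would be indexed)
p analyze('Some')        # => []         (what is left of the query term)
```

Since the same analysis runs on both sides, a stop word can neither be stored in the index nor searched for.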

In the case of "title:Some", "Some" is removed by the analyzer, giving
only "title:", i.e. an empty query which (surprisingly) matches all
documents.

However, the same should happen with "content:some" but this one  
returns only one document which leaves me completely puzzled. This  
just isn''t consistent.

So I'm afraid I can't be of much help here, but I'm sure somebody
else will enlighten us. This might as well be a bug, but even if it's
not, it's definitely not what anyone would reasonably expect.

--

@David: You should probably consider changing StandardAnalyzer not to  
use stop words by default. It confuses people because no one would  
suspect such a feature to be enabled by default. It just doesn't
follow the principle of least astonishment.

Even if people want to use stop words, they might not be happy with  
the ones built into Ferret. It very much depends on the nature of the  
content that is indexed and instead of using a one-size-fits-all stop
word list one is usually better off with compiling a custom one for  
any particular application.

Cheers,
Andy

Marvin Humphrey

2007-Mar-31 18:46 UTC

On Mar 31, 2007, at 10:41 AM, Andreas Korth wrote:
> @David: You should probably consider changing StandardAnalyzer not to
> use stop words by default. It confuses people because no one would
> suspect such a feature to be enabled by default. It just doesn't
> follow the principle of least astonishment.
>
> Even if people want to use stop words, they might not be happy with
> the ones built into Ferret. It very much depends on the nature of the
> content that is indexed and instead of using a one-size-fits-all stop
> word list one is usually better off with compiling a custom one for
> any particular application.

I concur.  Ferret's StandardAnalyzer is based upon Lucene's class of
the same name, so some parallelism would be lost, but I think
omitting stop lists is better nonetheless.

There are performance and disk-space implications for avoiding stop
lists by default.  However, disk space is cheap, Ferret is fast, and
search results are slightly better when you avoid stop lists (e.g.
searching for '"the who"' actually returns something).  Users with
large deployments will be able to trade away some amount of IR
precision for increased performance by enabling stop lists if they so
choose.
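
To make that trade-off concrete, here is a small pure-Ruby sketch of a toy positional index. None of it is Ferret or Lucene code: the sample documents, the stop list, and the helper names `build_index` and `phrase_search` are all made up for illustration. It counts how many postings are stored with and without a stop list, and checks whether the phrase "the who" can still be found.

```ruby
# Toy positional inverted index; NOT how Ferret or Lucene store postings.
SAMPLE_DOCS  = ['the who live at leeds', 'who is the drummer', 'the kids are alright'].freeze
SAMPLE_STOPS = %w[the who is at are].freeze # hypothetical stop list

# Map each token to a list of [doc_id, position] postings.
def build_index(docs, stop_words)
  index = Hash.new { |hash, key| hash[key] = [] }
  docs.each_with_index do |text, doc_id|
    text.split.each_with_index do |token, pos|
      index[token] << [doc_id, pos] unless stop_words.include?(token)
    end
  end
  index
end

# Return ids of documents that contain the phrase's surviving terms
# at the same relative positions they have inside the phrase.
def phrase_search(index, phrase, stop_words)
  terms = phrase.split.each_with_index.reject { |term, _| stop_words.include?(term) }
  return [] if terms.empty?

  first_term, first_off = terms.first
  hits = index.fetch(first_term, []).select do |doc_id, pos|
    terms.all? { |term, off| index.fetch(term, []).include?([doc_id, pos - first_off + off]) }
  end
  hits.map(&:first).uniq
end

full   = build_index(SAMPLE_DOCS, [])
pruned = build_index(SAMPLE_DOCS, SAMPLE_STOPS)

puts full.values.sum(&:size)                             # 13 postings stored
puts pruned.values.sum(&:size)                           # 5 postings stored
p phrase_search(full, 'the who', [])                     # => [0]
p phrase_search(pruned, 'the who', SAMPLE_STOPS)         # => []
```

The pruned index is smaller, but the phrase query has nothing left to search for, which is the loss of precision mentioned above.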

KinoSearch doesn't have a StandardAnalyzer; a class called
PolyAnalyzer fills that role.  By default, it performs lowercasing,
tokenizing and stemming -- but no stopalizing.
<http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Analysis/PolyAnalyzer.html>

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Jens Kraemer

2007-Apr-01 10:12 UTC

On Sat, Mar 31, 2007 at 07:41:06PM +0200, Andreas Korth wrote:
> On Mar 31, 2007, at 5:36 PM, Jeff Mallatt wrote:
>
> > I'm getting some results that I don't understand from a search.
> >
> > index << {:uid => 'one', :title => 'Some Title', :content => 'my first text'}
> > index << {:uid => 'two', :title => 'Some Title', :content => 'some second content'}
> > index << {:uid => 'three', :title => 'Other Title', :content => 'my third text'}
> >
> > query(index, 'title:"Some"')
> > query(index, 'title:"Title"')
> > query(index, 'uid:"two"')
>
> Nice one.
>
> When people don't understand search results, it's usually to do with
> stop words. The StandardAnalyzer which parses documents and(!)
> queries, uses a list of stop words which are ignored. See
> Ferret::Analysis::FULL_ENGLISH_STOP_WORDS for a complete list of
> (english) stop words.
>
> In the case of "title:Some", "Some" is removed by the analyzer giving
> only "title:", i.e. an empty query which (surprisingly) matches all
> documents.
>
> However, the same should happen with "content:some" but this one
> returns only one document which leaves me completely puzzled. This
> just isn't consistent.

adding the output of index.process_query to the script I get:

Query 'content:"some"'...
processed to <title:content uid:content content:content>
Query 'title:"Some"'...
processed to <title:title uid:title content:title>

so it seems the stop word is stripped first, then the query is
recognized as invalid, and the parser does its best to run it anyway -
it takes the remaining word that once was the field name, and interprets
it as the query string.

Setting handle_parse_errors to false turns this behaviour off and leads
to no results for the empty queries.
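
The behaviour described above can be imitated with a short pure-Ruby sketch. This is a toy model only and not Ferret's real query parser: the one-word stop list, the helper names `tokens`, `toy_process_query` and `toy_search`, and the way the `handle_parse_errors` flag is modelled (returning no clauses at all) are assumptions made for illustration. It reuses the three documents from the original post.

```ruby
# Toy model of the fallback described above; NOT Ferret's real parser.
TOY_STOP_WORDS = %w[some].freeze # hypothetical one-word stop list
TOY_FIELDS = %i[title uid content].freeze
TOY_DOCS = [
  { uid: 'one',   title: 'Some Title',  content: 'my first text' },
  { uid: 'two',   title: 'Some Title',  content: 'some second content' },
  { uid: 'three', title: 'Other Title', content: 'my third text' }
].freeze

# Lowercase, split into word tokens, drop stop words.
def tokens(text)
  text.downcase.scan(/[a-z0-9]+/) - TOY_STOP_WORDS
end

# Turn 'field:"phrase"' into [field, term] clauses, any of which may match.
def toy_process_query(query_str, handle_parse_errors: true)
  field, phrase = query_str.split(':', 2)
  terms = tokens(phrase.to_s)
  return terms.map { |term| [field.to_sym, term] } unless terms.empty?
  return [] unless handle_parse_errors

  # Fallback: the leftover field name becomes the term, tried in every field.
  tokens(field).flat_map { |term| TOY_FIELDS.map { |f| [f, term] } }
end

# Return the uids of all documents matched by at least one clause.
def toy_search(query_str, handle_parse_errors: true)
  clauses = toy_process_query(query_str, handle_parse_errors: handle_parse_errors)
  matches = TOY_DOCS.select do |doc|
    clauses.any? { |field, term| tokens(doc[field]).include?(term) }
  end
  matches.map { |doc| doc[:uid] }
end

p toy_search('content:"some"')                           # => ["two"]
p toy_search('title:"Some"')                             # => ["one", "two", "three"]
p toy_search('title:"Some"', handle_parse_errors: false) # => []
```

In this model 'content:"some"' hits only document two because the word "content" appears in its content field, and 'title:"Some"' hits everything because every title contains the word "title", which lines up with the results in the original post.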


Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
 
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

Jeff Mallatt

2007-Apr-01 14:56 UTC

At 2007-04-01 06:12, you wrote:
>On Sat, Mar 31, 2007 at 07:41:06PM +0200, Andreas Korth wrote:
> > On Mar 31, 2007, at 5:36 PM, Jeff Mallatt wrote:
> > > I'm getting some results that I don't understand from a search.
> > >
>[snip]
>adding the output of index.process_query to the script I get:
>
>Query 'content:"some"'...
>processed to <title:content uid:content content:content>
>Query 'title:"Some"'...
>processed to <title:title uid:title content:title>
>
>so it seems the stop word is stripped first, then the query is
>recognized as invalid, and the parser does its best to run it anyway -
>it takes the remaining word that once was the field name, and interprets
>it as the query string.
>
>Setting handle_parse_errors to false turns this behaviour off and leads
>to no results for the empty queries.

That explains it all.

Thanks much!
