Aldous D. Penaranda
2007-Jan-19 09:33 UTC
[Ferret-talk] Double-quoted query with "and" fails.
Hi,

We're using Ferret 0.9.4 and we've observed the following behavior. Searching for 'fieldname: foo and bar' works fine, while 'fieldname: "foo and bar"' doesn't return any results. Is there a way to make Ferret recognize the 'and' inside the query as a search term and not an operator? (I hope I got the terminology right.)

Thanks in advance.

--
Linux Just Simply Rocks!
dous at penarmac.com | dous at ubuntu.com
http://deathwing.penarmac.com/
GPG: 0xD6655C18
Excerpts from Aldous D. Penaranda's message of Fri Jan 19 01:33:51 -0800 2007:

> Is there a way to make ferret recognize the 'and' inside the query as
> a search term and not an operator? (I hope I got the terminology
> right)

You need to use an Analyzer that does not remove 'and'. The default analyzer removes all words in FULL_ENGLISH_STOP_WORDS, which includes 'and'. (So does ENGLISH_STOP_WORDS.) The analyzer needs to be used both while adding documents to the index and at query parsing time (i.e. passed to both QueryParser.new and IndexWriter.new/Index.new). If you've been using the default analyzer, you'll have to reindex so that the occurrences of 'and' get written to disk.

--
William <wmorgan-ferret at masanjin.net>
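William's diagnosis can be sketched in plain Ruby (no Ferret install needed). The stop list, tokenizer, and method below are illustrative stand-ins, not Ferret's actual implementation:

```ruby
# Toy stand-ins for Ferret's stop list and analyzer; illustrative only.
FULL_ENGLISH_STOP_WORDS = %w[a an and are as at be but by for if in into is
                             it no not of on or such that the their then
                             there these they this to was will with].freeze

# Lowercase, split on non-alphanumerics, drop stopwords -- roughly what a
# stop-filtering analyzer does to field text and query text alike.
def analyze(text, stop_words)
  text.downcase.scan(/[a-z0-9]+/) - stop_words
end

p analyze('foo and bar', FULL_ENGLISH_STOP_WORDS)  # => ["foo", "bar"]
p analyze('foo and bar', [])                       # => ["foo", "and", "bar"]
```

With a stop-filtering analyzer, 'and' never reaches the index, so a quoted phrase containing it has nothing to match against. Passing an empty stop list (roughly `Ferret::Analysis::StandardAnalyzer.new([])` handed to both `Index.new` and `QueryParser.new` in the 0.10-era API; check the docs for your version) preserves the term, as William describes.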
Aldous D. Penaranda
2007-Jan-19 17:01 UTC
[Ferret-talk] Double-quoted query with "and" fails.
On 1/20/07, William Morgan <wmorgan-ferret at masanjin.net> wrote:

> Excerpts from Aldous D. Penaranda's message of Fri Jan 19 01:33:51 -0800 2007:
> > Is there a way to make ferret recognize the 'and' inside the query as
> > a search term and not an operator? (I hope I got the terminology
> > right)
>
> You need to use an Analyzer that does not remove 'and'. The default
> analyzer removes all words in FULL_ENGLISH_STOP_WORDS, which includes
> 'and'. (So does ENGLISH_STOP_WORDS.)

Thanks. I noticed, however, that the documentation for Ferret::Index::Index says that the default analyzer is StandardAnalyzer. The StandardAnalyzer documentation says that it filters LetterTokenizer with LowerCaseFilter. Are you talking about StopAnalyzer? If so, perhaps the documentation is wrong and should be updated. I've checked both the 0.9 and 0.10 API documentation and they say the same thing.

> The analyzer needs to be used both while adding documents to the index
> and at query parsing time (i.e. passed to both QueryParser.new and
> IndexWriter.new/Index.new). If you've been using the default analyzer,
> you'll have to reindex so that the occurrences of 'and' get written to
> disk.

Again, many thanks! I'll try this out after I get some sleep. :)

--
Linux Just Simply Rocks!
dous at penarmac.com | dous at ubuntu.com
http://deathwing.penarmac.com/
GPG: 0xD6655C18
Excerpts from Aldous D. Penaranda's message of Fri Jan 19 09:01:43 -0800 2007:

> The StandardAnalyzer documentation says that it filters
> LetterTokenizer with LowerCaseFilter.

My interpretation of http://ferret.davebalmain.com/api/classes/Ferret/Analysis/StandardAnalyzer.html is that StandardAnalyzer uses FULL_ENGLISH_STOP_WORDS as the stopword list. Perhaps I'm wrong; I've never verified it empirically.

I'm of the opinion that the whole concept of stopwords is a relic of 1970's technology and the TREC ad-hoc query paradigm, neither of which is particularly relevant for modern-day web search, so I typically turn them off.

--
William <wmorgan-ferret at masanjin.net>
On 19.01.2007, at 20:47, William Morgan wrote:

> Perhaps I'm wrong; I've never verified it empirically. I'm of the
> opinion that the whole concept of stopwords is a relic of 1970's
> technology and the TREC ad-hoc query paradigm, neither of which is
> particularly relevant for modern-day web search, so I typically turn
> them off.

Could you elaborate on that, please? What exactly has changed since the 70's that isn't relevant any more, and what is the TREC ad-hoc query paradigm anyway?

My understanding is that stop words reduce the size of the index (and hence speed up queries) by filtering out words that occur frequently in almost any text of considerable length. Isn't it even worse if you store term vectors?

I'd turn off stop words right away if there weren't any considerable impact on performance, but I'd like to have a little more information on that. I'd appreciate it if you could give some pointers.

Thanks!
Andy
Aldous D. Penaranda
2007-Jan-20 01:16 UTC
[Ferret-talk] Double-quoted query with "and" fails.
On 1/20/07, William Morgan <wmorgan-ferret at masanjin.net> wrote:

> Excerpts from Aldous D. Penaranda's message of Fri Jan 19 09:01:43 -0800 2007:
> > The StandardAnalyzer documentation says that it filters
> > LetterTokenizer with LowerCaseFilter.
>
> My interpretation of
> http://ferret.davebalmain.com/api/classes/Ferret/Analysis/StandardAnalyzer.html
> is that StandardAnalyzer uses FULL_ENGLISH_STOP_WORDS as the stopword
> list.

Yes, my bad. The latest documentation does say that. The 0.9 API doesn't, and that's the version we're using.

What if the Document in question looks like this:

Document {
  stored/uncompressed,indexed,tokenized,<fieldname:foo and bar>
}

Should a search for 'fieldname:"foo and bar"' return the said document?

--
Linux Just Simply Rocks!
dous at penarmac.com | dous at ubuntu.com
http://deathwing.penarmac.com/
GPG: 0xD6655C18
Excerpts from Andreas Korth's message of Fri Jan 19 12:26:20 -0800 2007:

> Could you elaborate on that, please? What exactly has changed since
> the 70's that isn't relevant any more, and what is the TREC ad-hoc
> query paradigm anyway?

TREC is a competition that arguably drove most information retrieval research for the past several decades. The ad-hoc task is one of the tasks in the competition, and is essentially what we think of as "search": given a fixed set of documents, take an arbitrary query and produce a subset of documents that are considered "relevant". (Other TREC tasks involve things like document clustering, or question answering, or responding to a fixed query on a changing set of documents.)

Almost all the ideas behind Ferret, Lucene, etc. come from the IR research community, and were evaluated and found to be favorable in the context of TREC. The "inverted" index, stop words, boosting, the twiddle operator, etc., are all many decades old.

The problem is that the ad-hoc task is pretty different from, say, web search, or email search in Sup. An ad-hoc query is essentially a mini-document, with a separate title and several complete, grammatical sentences describing the "information need" in somewhat formal English. By contrast, in our case, the user is typically entering just a few words, and is typically making explicit use of the mechanics of the search (glorified word matching), and thus isn't entering a grammatical English description of what he'd like to find. Stop words make a lot of sense for the ad-hoc task because they eliminate "content-free" words. But I think they don't make nearly as much sense for the uses that you and I have for Ferret.

The other big difference, of course, is that disk space is much cheaper now than when this stuff was developed.

> My understanding is that stop words reduce the size of the index (and
> hence speed up queries) by filtering out words that occur frequently
> in almost any text of considerable length. Isn't it even worse if you
> store term vectors?

True, and yes. The question is: by how much?

> I'd turn off stop words right away if there wasn't any considerable
> impact on performance, but I'd like to have a little more information
> on that. I'd appreciate if you could give some pointers.

Unfortunately all I have are opinions. :) I'd be very interested in an empirical analysis of just how much bigger the index gets when stopwords are indexed (with and without term vectors), and just how much slower queries get. I'm guessing that neither will be serious, but I could be wrong.

--
William <wmorgan-ferret at masanjin.net>
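A first step toward the empirical analysis asked for above is simply counting how many postings stopwords account for. The snippet below is a toy Ruby count over a made-up sample; the stop list and text are stand-ins, and real numbers depend entirely on the corpus:

```ruby
# Rough back-of-the-envelope: what fraction of postings would a stop list
# eliminate? Sample text and stop list are illustrative stand-ins.
STOP = %w[the of and a to in is it that was].freeze

text = <<~TEXT
  the quick brown fox jumps over the lazy dog and the dog barks at the fox
  it was the best of times it was the worst of times
TEXT

tokens      = text.downcase.scan(/[a-z]+/)
stop_tokens = tokens.count { |t| STOP.include?(t) }
puts "total postings:    #{tokens.size}"
puts "stopword postings: #{stop_tokens} " \
     "(#{(100.0 * stop_tokens / tokens.size).round}%)"
```

On real English prose the share is typically sizable, which is why the index-size question is worth measuring rather than guessing.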
On Jan 19, 2007, at 8:22 PM, William Morgan wrote:

> Stop words make a lot of sense for the ad-hoc task because they
> eliminate "content-free" words. But I think they don't make nearly as
> much sense for the uses that you and I have for Ferret.
>
> The other big difference, of course, is that disk space is much cheaper
> now than when this stuff was developed.

You've expressed pretty much the reasons why the default "PolyAnalyzer" configuration in KinoSearch consists of an LCNormalizer, a Tokenizer, and a Stemmer -- no Stopalizer. See <http://www.rectangular.com/downloads/KinoSearch_OSCON2006.pdf>, pages 74-80.

> Unfortunately all I have are opinions. :) I'd be very interested in an
> empirical analysis of just how much bigger the index gets when using
> stopwords (with and without term vectors), and just how much slower
> queries get. I'm guessing that neither will be serious, but I could be
> wrong.

The search-time benefit from using a stoplist can be substantial. Search-time costs are dominated by time spent pawing through postings for common terms. Eliminating the most common terms can make a big difference.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
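Marvin's point about common-term postings can be made concrete with a toy Zipf model (an assumption, not a measurement of any real index): if the rank-r term's frequency is proportional to 1/r, then the top k of N terms hold a share H_k / H_N of all postings, where H_n is the nth harmonic number.

```ruby
# Toy Zipf model: frequency of the rank-r term is proportional to 1/r,
# so the top k of N terms account for H_k / H_N of all postings.
def harmonic(n)
  (1..n).sum { |r| 1.0 / r }
end

vocab_size = 100_000  # assumed vocabulary size, illustrative only
h_n = harmonic(vocab_size)
[10, 100].each do |k|
  share = 100.0 * harmonic(k) / h_n
  puts format('top %d terms hold ~%d%% of all postings', k, share.round)
end
```

Under these assumptions the ten most common terms hold roughly a quarter of all postings, which is why skipping them at query time can matter.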
Excerpts from Marvin Humphrey's message of Fri Jan 19 23:48:36 -0800 2007:

> The search-time benefit from using a stoplist can be substantial.
> Search-time costs are dominated by time spent pawing through postings
> for common terms. Eliminating the most common terms can make a big
> difference.

I agree that common terms can really affect search-time cost. I just don't think it's a problem. At least, I don't think it's a problem in a world where the query creators are motivated, sophisticated users who have developed an understanding of how search engines work (i.e. glorified word matching). You don't have to use a search engine more than a few times before you understand that putting stopwords in your query is basically a waste of time.

One can certainly argue about just how much we are in that world. Perhaps the AARP website search folks are in a different one. In my case, a text-only email client backed by an IR engine and with a user interface that smacks of Emacs is a pretty selective filter. :)

--
William <wmorgan-ferret at masanjin.net>
On Jan 22, 2007, at 9:02 AM, William Morgan wrote:

> Excerpts from Marvin Humphrey's message of Fri Jan 19 23:48:36 -0800 2007:
> > The search-time benefit from using a stoplist can be substantial.
> > Search-time costs are dominated by time spent pawing through postings
> > for common terms. Eliminating the most common terms can make a big
> > difference.
>
> I agree that common terms can really affect search time cost. I just
> don't think it's a problem.

Yes. If your corpus is small enough and your machine is fast enough, the absolute search-time costs of using an engine as efficient as Ferret or KinoSearch aren't consequential. As the corpus grows, you have the option of trading away some relevance for speed, or, in the case of KS, distributing the index over multiple machines and aggregating search results.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/