thr3ads.net - Ferret talk - [Ferret-talk] index browser inconsistent with IndexReader [Jun 2007]

Richard Jones

2007-Jun-12 15:32 UTC

[Ferret-talk] index browser inconsistent with IndexReader

Hi,

We have an index of around 1M web pages as part of our web app.  The
app uses ferret by way of RDig to perform searches.   We have noticed
anecdotally that some searches don't work the way we thought they
should, as if documents were missing from the index.  Yesterday we
came upon a concrete instance of this.

Our documents have several fields, one of which is called :keywords
and another called :data, both of which are used for searching.  We
isolated a single document that is not found on the web app by terms
in the :data field, but which can be found by the terms in its
:keywords field.

We assumed first that a problem occurred in the indexing which
resulted in the :data field being lost.  However, the index browser
that's included with version 0.11.4 showed the document with all its
fields intact, including the :data field.  All the :data field terms
that failed to retrieve the document on the web app were indeed
present, according to the browser.

We then built a short script with the API that instantiated an
IndexReader and called IndexReader.term_vectors() with the id of our
subject doc.  The term_vectors returned included a vector for
:keywords, but not for :data.

Somehow the core API funcs are not finding this document's :data field
when the 0.11.4 browser is.  Are there differences between the two
that would explain this?  Does this problem description ring a bell
with anyone out there?

Many thanks.

-- 
Richard Jones
dickjr at gmail.com

Richard Jones

2007-Jun-12 15:46 UTC

[Ferret-talk] index browser inconsistent with IndexReader

Follow-up to my recent post with the same subject:

It seems that within the API scripting world I can view the suspect
document by instantiating and then loading the LazyDoc returned by
Ferret::Search::Searcher.get_document(doc_id).  It contains the :data
field data and is perhaps what is being used by the browser.

So my question is then this:  what would cause a document in an index
to have a non-empty field when looked at through a LazyDoc, but for
which no non-empty term_vector is available for the same field on the
same document?
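
One way to look for other documents in the same state is to compare each
document's stored field with its term vector. This is a minimal sketch and
not something posted in the thread: the helper name stored_but_unsearchable
is made up for illustration, and it assumes a reader object that answers
reader[doc_id][field] with the stored value and
reader.term_vector(doc_id, field) with the vector (or nil), which is how
Ferret's IndexReader is expected to behave.

```ruby
# Hypothetical helper: returns the doc ids whose stored field has content
# but whose term vector for that field is missing or has no terms.
# `reader` is assumed to respond to [] and term_vector like an IndexReader.
def stored_but_unsearchable(reader, field, doc_ids)
  doc_ids.select do |id|
    stored = reader[id][field].to_s
    tv = reader.term_vector(id, field)
    has_text = !stored.strip.empty?
    no_terms = tv.nil? || tv.terms.nil? || tv.terms.empty?
    has_text && no_terms
  end
end

# Assumed usage with Ferret (not executed here; the path is a placeholder):
#   require 'ferret'
#   reader = Ferret::Index::IndexReader.new('/path/to/index')
#   p stored_but_unsearchable(reader, :data, (0...reader.max_doc))
```

Running it over the whole id range would show whether the problem is
limited to particular indexing batches.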

-- 
Richard Jones
dickjr at gmail.com

Jens Kraemer

2007-Jun-12 15:58 UTC

[Ferret-talk] index browser inconsistent with IndexReader

On Tue, Jun 12, 2007 at 11:46:02AM -0400, Richard Jones wrote:
> Follow-up to my recent post with the same subject:
> 
> It seems that within the API scripting world I can view the suspect
> document by instantiating and then loading the LazyDoc returned by
> Ferret::Search::Searcher.get_document(doc_id).  It contains the :data
> field data and is perhaps what is being used by the browser.
> 
> So my question is then this:  what would cause a document in an index
> to have a non-empty field when looked at through a LazyDoc, but for
> which no non-empty term_vector is available for the same field on the
> same document?

Having the field data stored in the index does not imply that the field
is searchable. It all depends on which options are set for the field (see
the FieldInfos API docs for the available options).

So it's perfectly possible to create an index with fields f1 and f2,
where only f1 can be searched, but the contents of f2 can be shown for
search results:

require 'ferret'

fi = Ferret::Index::FieldInfos.new
# f1 is stored and indexed, so it is searchable
fi.add_field :f1, :store => :yes, :index => :yes
# f2 is stored only: retrievable, but not searchable
fi.add_field :f2, :store => :yes, :index => :no, :term_vector => :no
i = Ferret::I.new :field_infos => fi
i << { :f1 => 'field one', :f2 => 'field two' }

i.search 'one' # finds the document
i.search 'two' # won't find anything

i[0][:f1] # outputs 'field one'
i[0][:f2] # outputs 'field two'


However, that does not explain why some documents seem to have different
indexing options from the rest. Maybe you changed them at some point
without doing a rebuild?


Jens

 

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
 
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

Richard Jones

2007-Jun-13 12:58 UTC

[Ferret-talk] index browser inconsistent with IndexReader

According to my IndexReader's field_infos, all the fields are stored
and indexed, with :with_positions_offsets for the term_vectors.

A look at a term vector for one of these :data fields gives:

#<struct Ferret::Index::TermVector field=:data, terms=[], offsets=nil>

Is this what they look like when you index with :index=>no?


-- 
Richard Jones
dickjr at gmail.com

Jens Kraemer

2007-Jun-13 14:00 UTC

[Ferret-talk] index browser inconsistent with IndexReader

On Wed, Jun 13, 2007 at 08:58:36AM -0400, Richard Jones wrote:
> According to my IndexReader's field_infos, all the fields are stored
> and indexed, with :with_positions_offsets for the term_vectors.
> 
> A look at a term vector for one of these :data fields gives:
> 
> #<struct Ferret::Index::TermVector field=:data, terms=[], offsets=nil>
> 
> Is this what they look like when you index with :index=>no?

No. With :index => :no, no term vectors can be stored, so term_vector
returns nil rather than an empty term vector.

The scenario you have could happen if your analyzer choked at indexing
time and returned not a single term for your document (just like if you
had a doc full of stop words).
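
The distinction drawn here (nil versus an empty term vector) can be written
down as a small check. This is only a sketch under stated assumptions: the
helper name classify_term_vector and the three symbols it returns are made
up for illustration, and the argument is assumed to be whatever a per-field
term vector lookup returned, i.e. nil or an object with a terms list like
the TermVector struct shown in the quoted message.

```ruby
# Hypothetical helper: interpret the result of a per-field term vector lookup.
# `tv` is nil or an object responding to `terms`.
def classify_term_vector(tv)
  if tv.nil?
    # No vector at all: the field was not indexed, or term vectors were off.
    :no_term_vector
  elsif tv.terms.nil? || tv.terms.empty?
    # A vector exists but holds nothing: the analyzer produced no terms.
    :no_terms_produced
  else
    :ok
  end
end

# Assumed usage with Ferret (not executed here):
#   classify_term_vector(reader.term_vector(doc_id, :data))
```

For the term vector quoted above (terms=[]), this check would land in the
second branch, which points at analysis rather than at the field options.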

Since you have the stored contents, could you try to index that data
again and see if the problem can be reproduced?

Jens

 

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
 
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

Richard Jones

2007-Jun-13 14:31 UTC

[Ferret-talk] index browser inconsistent with IndexReader

I ran one of the :data fields through the StandardAnalyzer - the only
one we have used - and it tokenized it with no complaints.
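
For anyone repeating this check, a sketch of the kind of helper that can be
used to drain a token stream follows. It is not the script actually run
here. The name collect_token_texts is made up, and the helper only assumes
an object whose next method returns a token with a text attribute until it
returns nil, which is how a Ferret token stream is expected to behave.

```ruby
# Hypothetical helper: pull every token out of a stream and return the texts.
# `stream` is assumed to respond to `next`, yielding tokens and then nil.
def collect_token_texts(stream)
  texts = []
  while (token = stream.next)
    texts << token.text
  end
  texts
end

# Assumed usage with Ferret (not executed here):
#   require 'ferret'
#   analyzer = Ferret::Analysis::StandardAnalyzer.new
#   stream = analyzer.token_stream(:data, stored_text)
#   p collect_token_texts(stream)
```

An empty result for a non-empty field would match the "analyzer returned no
terms" scenario; a non-empty result, as seen here, suggests the analyzer
itself is not the cause for that text.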

Interestingly, the last batch of 1700 sites that we added
incrementally to our index does not seem to suffer from this problem.



-- 
Richard Jones
dickjr at gmail.com
