Richard Jones
2007-Jun-12 15:32 UTC
[Ferret-talk] index browser inconsistent with IndexReader
Hi,

We have an index of around 1M web pages as part of our web app. The app uses ferret by way of RDig to perform searches. We have noticed anecdotally that some searches don't work the way we thought they should, as if documents were missing from the index. Yesterday we came upon a concrete instance of this.

Our documents have several fields, one of which is called :keywords and another called :data, both of which are used for searching. We isolated a single document that is not found on the web app by terms in the :data field, but which can be found by the terms in its :keywords field.

We assumed first that a problem occurred in the indexing which resulted in the :data field being lost. However, the index browser that's included with version 0.11.4 showed the document with all its fields intact, including the :data field. All the :data field terms that failed to retrieve the document on the web app were indeed present, according to the browser.

We then built a short script with the API that instantiated an IndexReader and called IndexReader.term_vectors() with the id of our subject doc. The term_vectors returned included a vector for :keywords, but not for :data.

Somehow the core API funcs are not finding this document's :data field when the 0.11.4 browser is. Are there differences between the two that would explain this? Does this problem description ring a bell with anyone out there?

Many thanks.

-- 
Richard Jones
dickjr at gmail.com
Richard Jones
2007-Jun-12 15:46 UTC
[Ferret-talk] index browser inconsistent with IndexReader
Follow-up to my recent post with the same subject:

It seems that within the API scripting world I can view the suspect document by instantiating and then loading the LazyDoc returned by Ferret::Search::Searcher.get_document(doc_id). It contains the :data field data and is perhaps what is being used by the browser.

So my question is then this: what would cause a document in an index to have a non-empty field when looked at through a LazyDoc, but for which no non-empty term_vector is available for the same field on the same document?

-- 
Richard Jones
dickjr at gmail.com
Jens Kraemer
2007-Jun-12 15:58 UTC
[Ferret-talk] index browser inconsistent with IndexReader
On Tue, Jun 12, 2007 at 11:46:02AM -0400, Richard Jones wrote:
> Follow-up to my recent post with the same subject:
>
> It seems that within the API scripting world I can view the suspect
> document by instantiating and then loading the LazyDoc returned by
> Ferret::Search::Searcher.get_document(doc_id). It contains the :data
> field data and is perhaps what is being used by the browser.
>
> So my question is then this: what would cause a document in an index
> to have a non-empty field when looked at through a LazyDoc, but for
> which no non-empty term_vector is available for the same field on the
> same document?

Having the field data stored in the index does not imply that this field is searchable. It all depends on what options are set for the field (see the FieldInfos API docs for the available options).

So it's perfectly possible to create an index with fields f1 and f2, where only f1 can be searched, but the contents of f2 can be shown for search results:

    require 'ferret'

    fi = Ferret::Index::FieldInfos.new
    fi.add_field :f1, :store => :yes, :index => :yes
    fi.add_field :f2, :store => :yes, :index => :no, :term_vector => :no
    i = Ferret::I.new :field_infos => fi
    i << { :f1 => 'field one', :f2 => 'field two' }

    i.search 'one' # finds the document
    i.search 'two' # won't find anything

    i[0][:f1] # outputs 'field one'
    i[0][:f2] # outputs 'field two'

However, that does not explain why some documents seem to have other indexing options than the rest - maybe you changed them at some point without doing a rebuild?

Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
Richard Jones
2007-Jun-13 12:58 UTC
[Ferret-talk] index browser inconsistent with IndexReader
According to my IndexReader's field_infos, all the fields are stored and indexed, with :with_positions_offsets for the term_vectors.

A look at a term vector for one of these :data fields gives:

    #<struct Ferret::Index::TermVector field=:data, terms=[], offsets=nil>

Is this what they look like when you index with :index=>no?

-- 
Richard Jones
dickjr at gmail.com
Jens Kraemer
2007-Jun-13 14:00 UTC
[Ferret-talk] index browser inconsistent with IndexReader
On Wed, Jun 13, 2007 at 08:58:36AM -0400, Richard Jones wrote:
> According to my IndexReader's field_infos, all the fields are stored
> and indexed, with :with_positions_offsets for the term_vectors.
>
> A look at a term vector for one of these :data fields gives:
>
> #<struct Ferret::Index::TermVector field=:data, terms=[], offsets=nil>
>
> Is this what they look like when you index with :index=>no?

No. With :index => :no, no term vectors can be stored, and term_vector then returns nil, not an empty term vector.

The scenario you have could happen if your analyzer choked at indexing time and returned not a single term for your document (just like if you had a doc full of stop words).

Since you have the stored contents, could you try to index that data again and see if the problem can be reproduced?

Jens
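The distinction drawn in this reply (nil when no term vector was stored, an empty term vector when analysis produced no terms) can be written down as a small check. This is only a sketch: `classify_term_vector` is a hypothetical helper, not part of Ferret, and it assumes its argument is either nil or an object responding to `terms`, like the TermVector struct printed in the thread. A plain Struct stands in for the real class so the example runs without an index.

```ruby
# Stand-in for the TermVector struct shown in the thread (field, terms, offsets).
# Assumption: the real object responds to #terms in the same way.
StubTermVector = Struct.new(:field, :terms, :offsets)

# Hypothetical helper, not part of Ferret.
# Returns a symbol describing what a term vector lookup tells us.
def classify_term_vector(tv)
  # nil: no term vector was stored for this field (e.g. :index => :no).
  return :no_term_vector if tv.nil?
  # Present but empty: the field was processed, yet no terms were recorded.
  return :no_terms if tv.terms.nil? || tv.terms.empty?
  :has_terms
end

puts classify_term_vector(nil)
puts classify_term_vector(StubTermVector.new(:data, [], nil))
puts classify_term_vector(StubTermVector.new(:data, ["ferret"], nil))
```

Against a real index, the argument would presumably come from the reader's term_vector lookup for the document id and the :data field; the exact call is assumed from the thread rather than checked against a particular Ferret version.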
Richard Jones
2007-Jun-13 14:31 UTC
[Ferret-talk] index browser inconsistent with IndexReader
I ran one of the :data fields through the StandardAnalyzer - the only one we have used - and it tokenized it with no complaints.

Interestingly, the last batch of 1700 sites that we added incrementally to our index does not seem to suffer from this problem.

-- 
Richard Jones
dickjr at gmail.com
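Since the most recent batch appears unaffected, one way to narrow things down might be to list every document id whose :data term vector is present but empty and compare that list against when each document was added. The sketch below is an assumption-laden illustration: `docs_with_empty_term_vector` is a hypothetical helper, and the reader is only assumed to respond to `max_doc`, `deleted?(id)` and `term_vector(id, field)`. A stub reader is used so the example runs without a Ferret index.

```ruby
# Stand-in for the TermVector struct shown in the thread.
StubVector = Struct.new(:field, :terms, :offsets)

# Stub reader for illustration only; a real IndexReader is assumed to
# offer the same three methods.
class StubReader
  def initialize(vectors_by_doc)
    @vectors_by_doc = vectors_by_doc
  end

  def max_doc
    @vectors_by_doc.size
  end

  def deleted?(_doc_id)
    false
  end

  def term_vector(doc_id, field)
    @vectors_by_doc[doc_id][field]
  end
end

# Hypothetical helper, not part of Ferret.
# Collects ids of live documents whose term vector for `field` exists
# but contains no terms.
def docs_with_empty_term_vector(reader, field)
  (0...reader.max_doc).select do |doc_id|
    next false if reader.deleted?(doc_id)
    tv = reader.term_vector(doc_id, field)
    !tv.nil? && tv.terms.empty?
  end
end

stub_reader = StubReader.new([
  { :data => StubVector.new(:data, ["ferret"], nil) }, # healthy document
  { :data => StubVector.new(:data, [], nil) },         # empty term vector
  { :data => nil }                                     # no term vector stored
])

p docs_with_empty_term_vector(stub_reader, :data)
```

If the affected ids all fall below the range of the last incremental batch, that would point at something that differed during the earlier indexing runs rather than at the documents themselves.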