Sorry if this is a repost -- I wasn't sure if the www.ruby-forum.com list works for postings.

I've been having trouble indexing a large number of documents (2.4M).

Essentially, I have one process that follows the tutorial, dumping documents into an index stored on the file system. If I open the index with another process and run the size() method, it is stuck at a number of documents much smaller than the number I've added to the index -- e.g. 290k when the indexer process has already gone through 1M.

Additionally, if I search, I don't get results past an even smaller number of docs (22k). I've tried the two latest Ferret releases.

Does this listing of the index directory look right?

-rw------- 1 blee blee 3.8M Oct 10 17:06 _v.fdt
-rw------- 1 blee blee  51K Oct 10 17:06 _v.fdx
-rw------- 1 blee blee  12M Oct 10 16:49 _u.cfs
-rw------- 1 blee blee   97 Oct 10 16:49 fields
-rw------- 1 blee blee   78 Oct 10 16:49 segments
-rw------- 1 blee blee  11M Oct 10 16:23 _t.cfs
-rw------- 1 blee blee  11M Oct 10 15:56 _s.cfs
-rw------- 1 blee blee  15M Oct 10 15:11 _r.cfs
-rw------- 1 blee blee  13M Oct 10 14:48 _q.cfs
-rw------- 1 blee blee  14M Oct 10 14:37 _p.cfs
-rw------- 1 blee blee  13M Oct 10 14:28 _o.cfs
-rw------- 1 blee blee  12M Oct 10 14:19 _n.cfs
-rw------- 1 blee blee  12M Oct 10 14:16 _m.cfs
-rw------- 1 blee blee 118M Oct 10 14:10 _l.cfs
-rw------- 1 blee blee 129M Oct 10 13:24 _a.cfs
-rw------- 1 blee blee    0 Oct 10 13:00 ferret-write.lck

Thanks,
Ben
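P.S. Roughly what the two processes look like (a simplified sketch, not my exact code; the data source and field names here are made up):

  require 'ferret'

  # Process 1: the indexer, following the tutorial
  index = Ferret::Index::Index.new(:path => '/data/my_index')
  docs.each do |doc|   # `docs` stands in for the real data source
    index << {:id => doc[:id], :content => doc[:text]}
  end

  # Process 2: opened separately to check progress
  reader = Ferret::Index::Index.new(:path => '/data/my_index')
  puts reader.size   # stuck around 290k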
We've had somewhat of a similar situation ourselves, where we are indexing about a million records to an index, and each record can be somewhat large.

Now... what happened on our side was that the index files (very similar in structure to what you posted) came up against a 2 gig limit and stopped there, and the indexer started crashing each time it hit that limit.

On your side, I don't see your index file sizes being all that large. I think compiling with large-file support only really kicks in when you hit this 2 gig size limit.

A couple of thoughts that might help:

1. On our side, to keep size down, I would optimize the index every 100,000 documents. The optimize call also flushes the index.

2. Make sure you close the index once you've indexed your data. Small thing... but just making sure.

3. With the index being this large, we actually keep two copies: one for searching against an already-optimized index, and the other doing the indexing. This way, no items are being searched on while the indexing is taking place.

4. One neat thing I learned while indexing large items is that I don't have to actually store everything. I can have a field set to tokenize but not store, so it can be searched but isn't stored for display in the search results. Not storing it kept my index size down (a sketch of this follows below).
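For number 4, here's roughly how the field setup looks (a simplified sketch, assuming Ferret's FieldInfos API; the field names are made up, not our actual schema):

  require 'ferret'

  # defaults for any field not declared explicitly
  field_infos = Ferret::Index::FieldInfos.new(:store => :yes)

  # tokenized into the index so it's searchable, but never stored,
  # so it adds postings without bloating the stored-field files
  field_infos.add_field(:body, :store => :no, :index => :yes,
                        :term_vector => :no)

  # small id field we do want back in search results
  field_infos.add_field(:id, :store => :yes, :index => :untokenized)

  index = Ferret::Index::Index.new(:path => '/data/search_index',
                                   :field_infos => field_infos)
  index << {:id => '42', :body => 'big slab of text ...'}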
On 10/11/06, Ben Lee <benlee at ece.ucsb.edu> wrote:
> If I open the index with another process and run the size() method,
> it is stuck at a number of documents much smaller than the number
> I've added to the index -- e.g. 290k when the indexer process has
> already gone through 1M. Additionally, if I search, I don't get
> results past an even smaller number of docs (22k).

I thought this was possibly due to the fact that you didn't have Ferret compiled with large-file support, but by the looks of it you aren't getting near that limit yet.

In the directory listing you posted, there is no way you could have added more than 290K documents unless you set :max_buffered_docs to a different value (> 10,000). Perhaps the index is getting overwritten at some stage. Could you show us the code you are using for indexing?

As for search results only showing for the top 22k documents, I'm not sure what the problem might be. You need to make sure you open the index reader or searcher after committing the index writer, otherwise the latest results won't show up. I don't think this is your problem though, as I'm sure you would have opened the index reader much later than after indexing 22k documents.

Cheers,
Dave
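P.S. To illustrate the reopen-after-commit point (the path and counts are just for illustration; :max_buffered_docs is set explicitly only to make the buffering visible -- 10,000 is the default anyway):

  require 'ferret'

  path = '/tmp/my_index'

  # writer side: buffered docs only become visible once flushed
  writer = Ferret::Index::Index.new(:path => path,
                                    :max_buffered_docs => 10_000)
  20_000.times {|i| writer << {:id => i.to_s, :content => "doc #{i}"} }
  writer.flush   # commit the buffered segments to disk

  # reader side: size() reflects the index as it was when this handle
  # was opened, so open (or reopen) it after the flush
  reader = Ferret::Index::Index.new(:path => path)
  puts reader.size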
On 10/11/06, peter <peter at ioffer.com> wrote:
> Now... what happened on our side was that the index files (very
> similar in structure to what you posted) came up against a 2 gig
> limit and stopped there, and the indexer started crashing each time
> it hit that limit.
>
> On your side, I don't see your index file sizes being all that large.
> I think compiling with large-file support only really kicks in when
> you hit this 2 gig size limit.

Hi Peter,
Did you manage to compile Ferret successfully with large-file support yourself?

> 1. On our side, to keep size down, I would optimize the index every
> 100,000 documents. The optimize call also flushes the index.

You can also just call Index#flush to flush the index without having to optimize, or IndexWriter#commit. Actually, they should both be commit, so I'm going to alias commit to flush in the Index class in the next version (sketch below).

> 2. Make sure you close the index once you've indexed your data. Small
> thing... but just making sure.
>
> 3. With the index being this large, we actually keep two copies: one
> for searching against an already-optimized index, and the other doing
> the indexing. This way, no items are being searched on while the
> indexing is taking place.

This shouldn't be necessary. Whatever version of the index you open the IndexReader on will be the version of the index that you are searching; even when its files are deleted, it will hold on to the file handles, so the data will still be available. The operating system just won't be able to reuse that disk space until you close the IndexReader (or Searcher).

> 4. One neat thing I learned while indexing large items is that I
> don't have to actually store everything. I can have a field set to
> tokenize but not store, so it can be searched but isn't stored for
> display in the search results. Not storing it kept my index size
> down.

Very good tip. You should also set :term_vector to :no unless you are using term-vectors.

Cheers,
Dave
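P.S. For example, flushing periodically instead of optimizing (the interval and document source here are made up for illustration):

  require 'ferret'

  index = Ferret::Index::Index.new(:path => '/data/big_index')

  documents.each_with_index do |doc, i|   # `documents` is a placeholder
    index << doc
    # flush commits the buffered docs to disk without the full
    # segment-merge cost of optimize
    index.flush if (i + 1) % 100_000 == 0
  end

  index.optimize   # one final merge once everything is in
  index.close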
On 10/10/06, peter <peter at ioffer.com> wrote:
> A couple of thoughts that might help: [snip]

Thanks for the tips, things seem happier now. Yeah, the size of each document (number of tokens) is actually quite small in my case -- I think this is just a case of me messing up the flush/optimize/close tactics.
On 10/11/06, Ben Lee <benlee at ece.ucsb.edu> wrote:
> Thanks for the tips, things seem happier now. Yeah, the size of each
> document (number of tokens) is actually quite small in my case -- I
> think this is just a case of me messing up the flush/optimize/close
> tactics.

That's great to hear, Ben.
Hey Dave!

Yes... we actually compiled with large-file support, and things seem to be working just fine. And in the end, once I figured out that I could tokenize a large bit of text without actually storing it, the optimized index came out at only about 1 gig, so large-file support never became an issue -- though we did compile it that way, just in case.

With the two-copies thing, we actually have two boxes in our cluster, each with a copy of the index used for searching, but only one copy used for indexing. That way, each box in the cluster can search locally, while the "indexing" box can index away and update the copies when it's done.

Oh... and I do turn off :term_vector for most of my fields. Thanks for the tip.

By the way, thanks for all the hard work you put into making this product the best it can be.