Sorry if this is a repost -- I wasn't sure if the www.ruby-forum.com list works for postings.

I've been having trouble indexing a large number of documents (2.4M).

Essentially, I have one process that follows the tutorial, dumping documents into an index stored on the file system. If I open the index with another process and run the size() method, it is stuck at a number of documents much smaller than the number I've added to the index -- e.g. 290k when the indexer process has already gone through 1M.

Additionally, if I search, I don't get results past an even smaller number of docs (22k). I've tried the two latest Ferret releases.

Does this listing of the index directory look right?

-rw------- 1 blee blee 3.8M Oct 10 17:06 _v.fdt
-rw------- 1 blee blee  51K Oct 10 17:06 _v.fdx
-rw------- 1 blee blee  12M Oct 10 16:49 _u.cfs
-rw------- 1 blee blee   97 Oct 10 16:49 fields
-rw------- 1 blee blee   78 Oct 10 16:49 segments
-rw------- 1 blee blee  11M Oct 10 16:23 _t.cfs
-rw------- 1 blee blee  11M Oct 10 15:56 _s.cfs
-rw------- 1 blee blee  15M Oct 10 15:11 _r.cfs
-rw------- 1 blee blee  13M Oct 10 14:48 _q.cfs
-rw------- 1 blee blee  14M Oct 10 14:37 _p.cfs
-rw------- 1 blee blee  13M Oct 10 14:28 _o.cfs
-rw------- 1 blee blee  12M Oct 10 14:19 _n.cfs
-rw------- 1 blee blee  12M Oct 10 14:16 _m.cfs
-rw------- 1 blee blee 118M Oct 10 14:10 _l.cfs
-rw------- 1 blee blee 129M Oct 10 13:24 _a.cfs
-rw------- 1 blee blee    0 Oct 10 13:00 ferret-write.lck

Thanks,
Ben
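P.S. Roughly what the two processes look like (a simplified sketch, not my exact code; the data source and field names here are made up):

  require 'ferret'

  # Process 1: the indexer, following the tutorial
  index = Ferret::Index::Index.new(:path => '/data/my_index')
  docs.each do |doc|   # `docs` stands in for the real data source
    index << {:id => doc[:id], :content => doc[:text]}
  end

  # Process 2: opened separately to check progress
  reader = Ferret::Index::Index.new(:path => '/data/my_index')
  puts reader.size   # stuck around 290k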
We've had somewhat of a similar situation ourselves, where we are indexing about a million records to an index, and each record can be somewhat large.

Now... what happened on our side was that the index files (very similar in structure to what you posted) came up against a 2 gig limit and stopped there, and the indexer started crashing each time it hit that limit.

On your side, I don't see your index file sizes being all that large. I think compiling with large-file support only really kicks in when you hit this 2 gig size limit.

A couple of thoughts that might help:

1. On our side, to keep size down, I would optimize the index every 100,000 documents. The optimize call also flushes the index.

2. Make sure you close the index once you've indexed your data. Small thing... but just making sure.

3. With the index being this large, we actually keep two copies: one for searching against an already-optimized index, and the other doing the indexing. This way, no items are being searched on while the indexing is taking place.

4. One neat thing I learned while indexing large items is that I don't have to actually store everything. I can have a field set to tokenize but not store, so it can be searched but isn't stored for display in the search results. Not storing it kept my index size down (a sketch of this follows below).
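For number 4, here's roughly how the field setup looks (a simplified sketch, assuming Ferret's FieldInfos API; the field names are made up, not our actual schema):

  require 'ferret'

  # defaults for any field not declared explicitly
  field_infos = Ferret::Index::FieldInfos.new(:store => :yes)

  # tokenized into the index so it's searchable, but never stored,
  # so it adds postings without bloating the stored-field files
  field_infos.add_field(:body, :store => :no, :index => :yes,
                        :term_vector => :no)

  # small id field we do want back in search results
  field_infos.add_field(:id, :store => :yes, :index => :untokenized)

  index = Ferret::Index::Index.new(:path => '/data/search_index',
                                   :field_infos => field_infos)
  index << {:id => '42', :body => 'big slab of text ...'}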
On 10/11/06, Ben Lee <benlee at ece.ucsb.edu> wrote:
> If I open the index with another process and run the size() method,
> it is stuck at a number of documents much smaller than the number
> I've added to the index -- e.g. 290k when the indexer process has
> already gone through 1M. Additionally, if I search, I don't get
> results past an even smaller number of docs (22k).

I thought this was possibly due to the fact that you didn't have Ferret compiled with large-file support, but by the looks of it you aren't getting near that limit yet.

In the directory listing you posted, there is no way you could have added more than 290K documents unless you set :max_buffered_docs to a different value (> 10,000). Perhaps the index is getting overwritten at some stage. Could you show us the code you are using for indexing?

As for search results only showing for the top 22k documents, I'm not sure what the problem might be. You need to make sure you open the index reader or searcher after committing the index writer, otherwise the latest results won't show up. I don't think this is your problem though, as I'm sure you would have opened the index reader much later than after indexing 22k documents.

Cheers,
Dave
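P.S. To illustrate the reopen-after-commit point (the path and counts are just for illustration; :max_buffered_docs is set explicitly only to make the buffering visible -- 10,000 is the default anyway):

  require 'ferret'

  path = '/tmp/my_index'

  # writer side: buffered docs only become visible once flushed
  writer = Ferret::Index::Index.new(:path => path,
                                    :max_buffered_docs => 10_000)
  20_000.times {|i| writer << {:id => i.to_s, :content => "doc #{i}"} }
  writer.flush   # commit the buffered segments to disk

  # reader side: size() reflects the index as it was when this handle
  # was opened, so open (or reopen) it after the flush
  reader = Ferret::Index::Index.new(:path => path)
  puts reader.size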
On 10/11/06, peter <peter at ioffer.com> wrote:
> Now... what happened on our side was that the index files (very
> similar in structure to what you posted) came up against a 2 gig
> limit and stopped there, and the indexer started crashing each time
> it hit that limit.
>
> On your side, I don't see your index file sizes being all that large.
> I think compiling with large-file support only really kicks in when
> you hit this 2 gig size limit.

Hi Peter,
Did you manage to compile Ferret successfully with large-file support yourself?

> 1. On our side, to keep size down, I would optimize the index every
> 100,000 documents. The optimize call also flushes the index.

You can also just call Index#flush to flush the index without having to optimize, or IndexWriter#commit. Actually, they should both be commit, so I'm going to alias commit to flush in the Index class in the next version (sketch below).

> 2. Make sure you close the index once you've indexed your data. Small
> thing... but just making sure.
>
> 3. With the index being this large, we actually keep two copies: one
> for searching against an already-optimized index, and the other doing
> the indexing. This way, no items are being searched on while the
> indexing is taking place.

This shouldn't be necessary. Whatever version of the index you open the IndexReader on will be the version of the index that you are searching; even when its files are deleted, it will hold on to the file handles, so the data will still be available. The operating system just won't be able to reuse that disk space until you close the IndexReader (or Searcher).

> 4. One neat thing I learned while indexing large items is that I
> don't have to actually store everything. I can have a field set to
> tokenize but not store, so it can be searched but isn't stored for
> display in the search results. Not storing it kept my index size
> down.

Very good tip. You should also set :term_vector to :no unless you are using term-vectors.

Cheers,
Dave
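P.S. For example, flushing periodically instead of optimizing (the interval and document source here are made up for illustration):

  require 'ferret'

  index = Ferret::Index::Index.new(:path => '/data/big_index')

  documents.each_with_index do |doc, i|   # `documents` is a placeholder
    index << doc
    # flush commits the buffered docs to disk without the full
    # segment-merge cost of optimize
    index.flush if (i + 1) % 100_000 == 0
  end

  index.optimize   # one final merge once everything is in
  index.close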
On 10/10/06, peter <peter at ioffer.com> wrote:
> A couple of thoughts that might help: [snip]

Thanks for the tips, things seem happier now. Yeah, the size of each document (number of tokens) is actually quite small in my case -- I think this is just a case of me messing up the flush/optimize/close tactics.
On 10/11/06, Ben Lee <benlee at ece.ucsb.edu> wrote:
> Thanks for the tips, things seem happier now. Yeah, the size of each
> document (number of tokens) is actually quite small in my case -- I
> think this is just a case of me messing up the flush/optimize/close
> tactics.

That's great to hear, Ben.
Hey Dave!

Yes... we actually compiled with large-file support, and things seem to be working just fine. And in the end, once I figured out that I could tokenize a large bit of text without actually storing it, the optimized index came out at only about 1 gig, so large-file support never became an issue -- though we did compile it that way, just in case.

With the two-copies thing, we actually have two boxes in our cluster, each with a copy of the index used for searching, but only one copy used for indexing. That way, each box in the cluster can search locally, while the "indexing" box can index away and update the copies when it's done.

Oh... and I do turn off :term_vector for most of my fields. Thanks for the tip.

By the way, thanks for all the hard work you put into making this product the best it can be.