William Mitchell
2006-Jun-02 19:16 UTC
[Ferret-talk] Indexing fails -- _ntc6.tmp exceeds 2 gigabyte maximum
Ferret 0.9.3
Ruby 1.8.2
NOT storing file contents in the index.
Only indexing the first 25k of each file.
Very large data set (1 million files, 350 GB).
Code based on a snippet from David Balmain's forum posts.

After 6 hours, Ferret bails out with a Ruby "exceeds max file size" error.

Cache:

-rw-r--r-- 1 bill bill 2147483647 2006-06-01 22:45 _ntc6.tmp
-rw-r--r-- 1 bill bill 1690862924 2006-06-01 22:42 _ntc6.prx
-rw-r--r-- 1 bill bill  646302802 2006-06-01 22:42 _ntc6.frq
-rw-r--r-- 1 bill bill  165561698 2006-06-01 22:42 _ntc6.tis
-rw-r--r-- 1 bill bill   50541430 2006-06-01 22:14 _ntc6.fdt
-rw-r--r-- 1 bill bill    8000000 2006-06-01 22:14 _ntc6.fdx
-rw-r--r-- 1 bill bill    2097842 2006-06-01 22:42 _ntc6.tii
-rw-r--r-- 1 bill bill    1000000 2006-06-01 22:42 _ntc6.f0
-rw-r--r-- 1 bill bill    1000000 2006-06-01 22:42 _ntc6.f1
-rw-r--r-- 1 bill bill         30 2006-06-01 22:42 segments
-rw-r--r-- 1 bill bill         16 2006-06-01 22:14 _ntc6.fnm

Code:

#------------

require 'ferret'
include Ferret

index = Index::Index.new(:path => "/var/cache/ferrets")

# Index only the first 25,000 bytes of each file.
max_file_length = 25000

# allfiles is a glob pattern defined elsewhere.
Dir.glob(allfiles).each do |file|
  doc = Document::Document.new()
  # Store the file path so it can be returned from hits; don't tokenize it.
  doc << Document::Field.new(:file, file,
                             Document::Field::Store::YES,
                             Document::Field::Index::UNTOKENIZED)
  # Tokenize the truncated contents for searching, but don't store them.
  doc << Document::Field.new(:content, IO.read(file, max_file_length),
                             Document::Field::Store::NO,
                             Document::Field::Index::TOKENIZED)
  index << doc
end

#------------

Is there a workaround, or is this exceeding Ferret's limits?

Thanks! By the way, retrieval is usably fast for my purposes, even on a
big index like this. Very impressive.

--
Posted via http://www.ruby-forum.com/.
David Balmain
2006-Jun-02 23:47 UTC
[Ferret-talk] Indexing fails -- _ntc6.tmp exceeds 2 gigabyte maximum
On 6/3/06, William Mitchell <wemitchell at gmail.com> wrote:
> Ferret 0.9.3
> Ruby 1.8.2
> NOT storing file contents in the index.
> Only indexing the first 25k of each file.
> Very large data set (1 million files, 350 GB).
>
> After 6 hours, Ferret bails out with a Ruby "exceeds max file size" error.
>
> [file listing and code snipped]
>
> Is there a workaround, or is this exceeding Ferret's limits?

You need to set :max_merge_docs when you create the index. This stops
Ferret from merging segments once they reach a certain size. It also
means you will always have multiple segments in your index, which will
slow searching down a little, but it shouldn't be a problem.

Judging by the file names, you had almost merged 1,000,000 documents by
the time it failed ("ntc6".to_i(36) == 1,111,110, i.e. 1,000,000
documents plus 111,110 merges), so you are pretty close to finishing.
If you create your index like this, it should work:

  index = Index::Index.new(:path => "/var/cache/ferrets",
                           :max_merge_docs => 100_000)

This will leave you with at least 10 segments at the end. You could also
set :max_merge_docs to 500_000 and run index.optimize at the end. That
should keep you under the maximum file size, and with 2-3 segments
searching should be easily fast enough.

As an aside, you can also set :max_field_length (default 10,000) to
limit the number of terms indexed from any one document, instead of
truncating each file to 25,000 bytes. That prevents a half term at the
end of the document, since a 25,000-byte cut might fall in the middle of
a word. It shouldn't affect search results much, though, so you can keep
doing it your way. In a future version you'll be able to pass a File
handle instead of a string, in which case it will definitely be better
to set :max_field_length.

> Thanks! By the way, retrieval is usably fast for my purposes, even on a
> big index like this. Very impressive.

Thanks. Please let me know how it goes. This is possibly the largest
document set to be indexed with Ferret so far.

Cheers,
Dave
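P.S. Putting the two suggestions together, here is a minimal sketch of a
full run. It is only a sketch, assuming the 0.9.x option names above and
the allfiles glob from your original post:

#------------

require 'ferret'
include Ferret

# Cap segment merges so no merged segment can grow past the 2 GB file
# limit, and let :max_field_length cap the terms indexed per document
# instead of truncating the raw file bytes.
index = Index::Index.new(:path => "/var/cache/ferrets",
                         :max_merge_docs => 500_000,
                         :max_field_length => 10_000)

# allfiles: same glob pattern as in the original post.
Dir.glob(allfiles).each do |file|
  doc = Document::Document.new()
  doc << Document::Field.new(:file, file,
                             Document::Field::Store::YES,
                             Document::Field::Index::UNTOKENIZED)
  doc << Document::Field.new(:content, IO.read(file),
                             Document::Field::Store::NO,
                             Document::Field::Index::TOKENIZED)
  index << doc
end

# Merge down once everything is indexed.
index.optimize

#------------

With 1,000,000 documents this should finish with the 2-3 segments
mentioned above, each staying under the merge cap.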