William Mitchell
2006-Jun-02 19:16 UTC
[Ferret-talk] Indexing fails -- _ntc6.tmp exceeds 2 gigabyte maximum
Ferret 0.9.3
Ruby 1.8.2
NOT storing file contents in the index.
Only indexing the first 25k of each file.
Very large data set (1 million files, 350 GB).
Code based on a snippet from David Balmain's forum posts.

After 6 hours, Ferret bails out with a Ruby "exceeds max file size" error.

Cache:

-rw-r--r-- 1 bill bill 2147483647 2006-06-01 22:45 _ntc6.tmp
-rw-r--r-- 1 bill bill 1690862924 2006-06-01 22:42 _ntc6.prx
-rw-r--r-- 1 bill bill  646302802 2006-06-01 22:42 _ntc6.frq
-rw-r--r-- 1 bill bill  165561698 2006-06-01 22:42 _ntc6.tis
-rw-r--r-- 1 bill bill   50541430 2006-06-01 22:14 _ntc6.fdt
-rw-r--r-- 1 bill bill    8000000 2006-06-01 22:14 _ntc6.fdx
-rw-r--r-- 1 bill bill    2097842 2006-06-01 22:42 _ntc6.tii
-rw-r--r-- 1 bill bill    1000000 2006-06-01 22:42 _ntc6.f0
-rw-r--r-- 1 bill bill    1000000 2006-06-01 22:42 _ntc6.f1
-rw-r--r-- 1 bill bill         30 2006-06-01 22:42 segments
-rw-r--r-- 1 bill bill         16 2006-06-01 22:14 _ntc6.fnm

Code:

#------------

require 'ferret'
include Ferret

index = Index::Index.new(:path => "/var/cache/ferrets")

# Index only the first 25,000 bytes of each file.
max_file_length = 25000

# allfiles is a glob pattern defined elsewhere.
Dir.glob(allfiles).each do |file|
  doc = Document::Document.new()
  # Store the file path so it can be returned from hits; don't tokenize it.
  doc << Document::Field.new(:file, file,
                             Document::Field::Store::YES,
                             Document::Field::Index::UNTOKENIZED)
  # Tokenize the truncated contents for searching, but don't store them.
  doc << Document::Field.new(:content, IO.read(file, max_file_length),
                             Document::Field::Store::NO,
                             Document::Field::Index::TOKENIZED)
  index << doc
end

#------------

Is there a workaround, or is this exceeding Ferret's limits?

Thanks! By the way, retrieval is usably fast for my purposes, even on a
big index like this. Very impressive.

--
Posted via http://www.ruby-forum.com/.
David Balmain
2006-Jun-02 23:47 UTC
[Ferret-talk] Indexing fails -- _ntc6.tmp exceeds 2 gigabyte maximum
On 6/3/06, William Mitchell <wemitchell at gmail.com> wrote:
> Ferret 0.9.3
> Ruby 1.8.2
> NOT storing file contents in the index.
> Only indexing the first 25k of each file.
> Very large data set (1 million files, 350 GB).
>
> After 6 hours, Ferret bails out with a Ruby "exceeds max file size" error.
>
> [file listing and code snipped]
>
> Is there a workaround, or is this exceeding Ferret's limits?

You need to set :max_merge_docs when you create the index. This stops
Ferret from merging segments once they reach a certain size. It also
means you will always have multiple segments in your index, which will
slow searching down a little, but it shouldn't be a problem.

Judging by the file names, you had almost merged 1,000,000 documents by
the time it failed ("ntc6".to_i(36) == 1,111,110, i.e. 1,000,000
documents plus 111,110 merges), so you are pretty close to finishing.
If you create your index like this, it should work:

  index = Index::Index.new(:path => "/var/cache/ferrets",
                           :max_merge_docs => 100_000)

This will leave you with at least 10 segments at the end. You could also
set :max_merge_docs to 500_000 and run index.optimize at the end. That
should keep you under the maximum file size, and with 2-3 segments
searching should be easily fast enough.

As an aside, you can also set :max_field_length (default 10,000) to
limit the number of terms indexed from any one document, instead of
truncating each file to 25,000 bytes. That prevents a half term at the
end of the document, since a 25,000-byte cut might fall in the middle of
a word. It shouldn't affect search results much, though, so you can keep
doing it your way. In a future version you'll be able to pass a File
handle instead of a string, in which case it will definitely be better
to set :max_field_length.

> Thanks! By the way, retrieval is usably fast for my purposes, even on a
> big index like this. Very impressive.

Thanks. Please let me know how it goes. This is possibly the largest
document set to be indexed with Ferret so far.

Cheers,
Dave
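P.S. Putting the two suggestions together, here is a minimal sketch of a
full run. It is only a sketch, assuming the 0.9.x option names above and
the allfiles glob from your original post:

#------------

require 'ferret'
include Ferret

# Cap segment merges so no merged segment can grow past the 2 GB file
# limit, and let :max_field_length cap the terms indexed per document
# instead of truncating the raw file bytes.
index = Index::Index.new(:path => "/var/cache/ferrets",
                         :max_merge_docs => 500_000,
                         :max_field_length => 10_000)

# allfiles: same glob pattern as in the original post.
Dir.glob(allfiles).each do |file|
  doc = Document::Document.new()
  doc << Document::Field.new(:file, file,
                             Document::Field::Store::YES,
                             Document::Field::Index::UNTOKENIZED)
  doc << Document::Field.new(:content, IO.read(file),
                             Document::Field::Store::NO,
                             Document::Field::Index::TOKENIZED)
  index << doc
end

# Merge down once everything is indexed.
index.optimize

#------------

With 1,000,000 documents this should finish with the 2-3 segments
mentioned above, each staying under the merge cap.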