thr3ads.net - Ferret talk - [Ferret-talk] Indexing Speed? [May 2006]

steven

2006-May-02 15:16 UTC

[Ferret-talk] Indexing Speed?

Hi all,

Have been looking at Lucene and Ferret.

Have noticed that Ferret takes ~463 seconds to index 200 MB of docs,
whereas Lucene takes ~60 seconds.

I'm using the standard "get you started" sort of code provided by both
libraries.

My ruby code is: (abridged)

@index = Index::Index.new(:path => inIndexPath)

def createIndex(inRepositoryPath)
    Find.find(inRepositoryPath) do |path|
        if FileTest.file?(path)
            File.open(path) do |file|
                @index.add_document(:file => path, :content => file.readlines)
            end
        end
    end
end

My Java code is basically a direct port.

Has anyone else noticed this difference in speed? Am I doing something
wrong? Is this speed normal?

Any advice gratefully received.
Thanks,
Steven

-- 
Posted via http://www.ruby-forum.com/.

David Balmain

2006-May-03 02:53 UTC

Hi Steven,

Are the indexes you get the same size? My guess is that the code isn't
really equivalent. Ferret should be faster than Lucene. Try this:

include Ferret::Document

@index = Index::Index.new(:path => inIndexPath)

def createIndex(inRepositoryPath)
    Find.find(inRepositoryPath) do |path|
        if FileTest.file?(path)
            File.open(path) do |file|
                doc = Document.new()
                doc << Field.new(:file, path,
                              Field::Store::YES, Field::Index::UNTOKENIZED)
                doc << Field.new(:content, file.readlines,
                              Field::Store::NO, Field::Index::TOKENIZED)
                @index << doc
            end
        end
    end
end

Let me know if this helps.

Cheers,
Dave


steven

2006-May-05 13:15 UTC

Hi Dave,

Thanks very much for getting back to me.

You were right about the indexes being different...

Your snippet has helped - but still nowhere near as fast as the Java 
version:

doc.add(new Field("path", f.getPath(),
                  Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("modified",
                  DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                  Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("contents", new FileReader(f)));

Could it be that Ruby's file.readlines is slower than Java's
FileReader?

Another possible snafu is that the directory contains loads of PDFs and
other binary files which neither Lucene nor Ferret can index - could it
be that Ferret is slower at dealing with things like that? (Just a
thought)
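
One way to take binary files out of the comparison is to skip them before they reach either indexer. The following is only a rough sketch, not something from Ferret or Lucene: the helper names likely_binary? and indexable_files are made up for illustration, and the NUL-byte check is a common heuristic that can misclassify some files (for example, a PDF whose first few kilobytes happen to contain no NUL byte).

```ruby
require 'find'

# Heuristic (an assumption, not part of Ferret): treat a file as binary
# if the first sample_size bytes contain a NUL byte.
def likely_binary?(path, sample_size = 4096)
  sample = File.open(path, 'rb') { |f| f.read(sample_size) }
  return false if sample.nil? # empty file: nothing to inspect
  sample.include?("\x00")
end

# Hypothetical helper: collect the regular files under root that pass
# the heuristic above, so the same list can be fed to both indexers.
def indexable_files(root)
  files = []
  Find.find(root) do |path|
    files << path if File.file?(path) && !likely_binary?(path)
  end
  files
end
```

Feeding the same filtered file list to both the Ruby and the Java indexer should make the two runs easier to compare.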

Would love to hear any thoughts.

Many Thanks,
Steven.


David Balmain

2006-May-05 14:41 UTC

Hi Steven,

Once you made those changes were the indexes approximately the same
size? You'll get the most accurate results if the indexes are
identical. Also, which version of Ferret are you using? I just tried
200 MB here (~600 files). In my case all of it is text and everything
gets indexed. Lucene took ~120 seconds and Ferret took ~55 seconds.
Both indexes are identical. I'm using the Sun JVM.
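
For anyone repeating this comparison, the two numbers discussed here (elapsed time and on-disk index size) can be collected with the Ruby standard library alone. This is a minimal sketch under assumptions: directory_size and time_indexing are illustrative helper names, and create_index and the paths in the usage comment are placeholders for whatever indexing code is being measured.

```ruby
require 'benchmark'
require 'find'

# Total size in bytes of all regular files under dir,
# e.g. the directory an index was written to.
def directory_size(dir)
  total = 0
  Find.find(dir) do |path|
    total += File.size(path) if File.file?(path)
  end
  total
end

# Runs the given block and returns the elapsed wall-clock time in seconds.
def time_indexing
  Benchmark.realtime { yield }
end

# Example usage (create_index and both paths are hypothetical):
#   seconds = time_indexing { create_index('/path/to/docs') }
#   puts "indexed in #{seconds.round(1)}s"
#   puts "index size: #{directory_size('/path/to/index')} bytes"
```

Comparing the byte counts of the Ferret and Lucene index directories is a quick sanity check that both runs indexed roughly the same content before the timings are compared.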

I look forward to your reply.

Cheers,
Dave



steven shingler

2006-May-11 15:18 UTC

Just for completeness' sake...

After conversations offline with David, it turns out I have been working
with the pure-Ruby version of Ferret, without the C extensions, which
explains the slower performance.
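
A quick way to tell which variant is in use is to look at what Ruby has actually loaded. The sketch below relies only on Ruby's built-in $LOADED_FEATURES list; the helper name feature_loaded? is made up, and the extension name ferret_ext in the usage comment is an assumption about how the compiled library is named, so check it against your own install.

```ruby
# True if any loaded feature's file name matches pattern.
# features defaults to Ruby's own list of loaded files.
def feature_loaded?(pattern, features = $LOADED_FEATURES)
  features.any? { |f| File.basename(f) =~ pattern }
end

# Example usage, after requiring the library
# (assumption: the compiled extension file is named ferret_ext):
#   require 'ferret'
#   if feature_loaded?(/\Aferret_ext\./)
#     puts 'C extension loaded'
#   else
#     puts 'running pure Ruby'
#   end
```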
