I am indexing over 10,000 rows of data. Indexing slows down noticeably around rows 100, 1,000 and 10,000, and it has now been stuck on row 10,000 for over an hour. How can I make it faster? Here is my code:

=================
doc = Document.new
doc << Field.new("id", t.id, Field::Store::YES,
                 Field::Index::UNTOKENIZED)
doc << Field.new("title", t.title, Field::Store::NO,
                 Field::Index::TOKENIZED)
doc << Field.new("body", t.body, Field::Store::NO,
                 Field::Index::TOKENIZED)
doc << Field.new("album", t.album.name, Field::Store::NO,
                 Field::Index::TOKENIZED)
doc << Field.new("artist", t.album.artist.name, Field::Store::NO,
                 Field::Index::TOKENIZED)
doc << Field.new("release", t.album.release, Field::Store::NO,
                 Field::Index::UNTOKENIZED)
index << doc
=================

I only store the id; the other data stays in the database, because if I store everything in Ferret my PC practically dies.
This is very likely due to the merge factors. Lucene (and thus Ferret) reorganizes the index periodically. These settings are controllable, at least with Java Lucene. The trade-off is how much memory you want the indexing process to use.

	Erik

On Dec 19, 2005, at 9:11 AM, hui wrote:
> I am indexing over 10,000 rows of data. Indexing slows down noticeably
> around rows 100, 1,000 and 10,000, and it has now been stuck on row
> 10,000 for over an hour. How can I make it faster?
> <snip/>
On 12/19/05, hui <fortez at gmail.com> wrote:
> I am indexing over 10,000 rows of data. Indexing slows down noticeably
> around rows 100, 1,000 and 10,000, and it has now been stuck on row
> 10,000 for over an hour. How can I make it faster?
> <snip/>

Hi Hui,

This looks fine. Some suggestions:

* Make sure you are not using auto_flush. That will slow indexing down
  considerably. In fact, it is probably better to use Index::IndexWriter
  rather than Index::Index.
* You could index everything in memory and then write it to disk. This
  will depend on how much memory you have and how big the index becomes.
* You can play around with :merge_factor, :min_merge_docs and
  :max_merge_docs in IndexWriter. They are currently set to 10 but you
  might get more speed with different settings. Try anything between 2
  and 100.
* You can switch :use_compound_file to false. This will speed things up,
  but you may get an error for having too many files open.

There are a few other things you can do, like indexing in parallel, but I won't go into them yet. I'm currently working on some pretty big speed-ups by implementing everything in C. After I release that version I think most performance problems will go away. It will certainly speed things up more than any changes you might make to the current version. Anyway, I thought I'd have it out by Christmas but it's turning into a bigger task than I thought (doesn't this always seem to happen). I will get it finished early next year though, so if you can put up with the slow performance until then, relief is on its way. :-)

Cheers,
Dave

PS The reason it takes a long time at 100, 1,000 and 10,000 is that indexing is done in segments, and at 100, 1,000, etc. a number of the segments are merged together into bigger segments. Here is a good picture for you:

http://nutch.sourceforge.net/blog/2004/11/dynamization-and-lucene.html

The :merge_factor in the picture is 3. The current merge factor in Ferret is 10. Hope that makes sense.
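To make the first two suggestions concrete, here is a minimal batch-indexing sketch. It uses only calls that appear elsewhere in this thread (Document.new, Field.new, IndexWriter.new, <<); the `rows` enumerable is a hypothetical stand-in for the database result set, and the bare Document/Field names assume includes along the lines of hui's original snippet, which may need adjusting for your Ferret version.

==========================================
require 'ferret'
include Ferret::Index
include Ferret::Document

# Open one writer for the whole batch instead of an auto-flushing
# Index::Index, add every document, and flush once at the end.
writer = IndexWriter.new("db/index.db", :create_if_missing => true)

rows.each do |t|  # `rows`: hypothetical DB result set
  doc = Document.new
  doc << Field.new("id", t.id, Field::Store::YES,
                   Field::Index::UNTOKENIZED)
  doc << Field.new("title", t.title, Field::Store::NO,
                   Field::Index::TOKENIZED)
  writer << doc   # no per-document flush
end

writer.close      # a single flush when the batch is done
==========================================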
Sorry, just to correct myself: :max_merge_docs is set to a very big number, not 10. I'll try to quickly explain these numbers and the trade-offs.

:min_merge_docs => the minimum number of documents a segment must have before it is merged. Set this to a larger number if you want to use more RAM and speed things up.

:merge_factor => this tells Ferret when to merge segments larger than min_merge_docs. A high value means merges are done less often, which means faster indexing but slower searching. You can set this to a high value when you do your batch index, then optimize the index and lower the value afterwards when search speed is more important. A higher value also requires more memory.

:max_merge_docs => this sets the maximum number of documents in a segment. Once this count is reached, that segment is no longer merged with the other segments unless optimize is called. You might set this to a lower value to stop the IndexWriter from holding the lock for too long while doing a merge. For example, if you set it to 1000 you wouldn't get that really long hang time at 10,000.

On 12/19/05, David Balmain <dbalmain.ml at gmail.com> wrote:
> * You can play around with :merge_factor, :min_merge_docs and
> :max_merge_docs in IndexWriter. They are currently set to 10 but you
> might get more speed with different settings.
> <snip/>
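Pulling these three knobs together: a hedged sketch of batch-time settings, using the writer attributes hui confirms working later in this thread (min_merge_docs, max_merge_docs). Treating merge_factor as settable the same way is an assumption here, and the values are only starting points to experiment with.

==========================================
writer = IndexWriter.new("db/index.db",
                         :create_if_missing => true,
                         :use_compound_file => false)  # more files, less copying
writer.min_merge_docs = 4000   # buffer more docs in RAM before the first merge
writer.merge_factor   = 100    # assumption: merge rarely while batch indexing
writer.max_merge_docs = 30000  # cap segment size to bound each merge pause
==========================================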
Oh, and one last thing: Erik's book "Lucene in Action" has a much better explanation of all of this.

On 12/19/05, David Balmain <dbalmain.ml at gmail.com> wrote:
> Sorry, just to correct myself: :max_merge_docs is set to a very big
> number, not 10. I'll try to quickly explain these numbers and the
> trade-offs.
> <snip/>
On 12/19/05, David Balmain <dbalmain.ml at gmail.com> wrote:
> Oh, and one last thing: Erik's book "Lucene in Action" has a much
> better explanation of all of this.

I'll second this. Even though I am not using the Java version of Lucene, I have found this book very helpful in explaining the concepts underlying the various Lucene frameworks.

-F
On Mon, 2005-12-19 at 11:21 -0500, Finn Smith wrote:
> I'll second this. Even though I am not using the Java version of
> Lucene, I have found this book very helpful in explaining the concepts
> underlying the various Lucene frameworks.

As long as we're voting... I'll third it! I was trying to figure things out from the API docs and source code and was horribly confused until I went out and picked up Lucene in Action last week. It helps that it's a very well written book (good job Erik!).

Thomas
On Dec 19, 2005, at 12:23 PM, Thomas Lockney wrote:
> As long as we're voting... I'll third it! I was trying to figure things
> out from the API docs and source code and was horribly confused until I
> went out and picked up Lucene in Action last week.
> <snip/>

Thanks everyone! I'll pass these kind words on to Otis as well, who is probably not tuned into the Ferret community (Python people... geez!).

	Erik
Thank you very much, everybody! I will try all the suggestions and re-index my data. Ferret is great! I cannot wait for the new version, and Unicode support ;-)

hui
The indexing is quite fast now. I use the following code:

==========================================
index = IndexWriter.new("db/index.db", :create_if_missing => true,
                        :use_compound_file => false)
index.max_merge_docs = 30000
index.min_merge_docs = 4000
==========================================

But a problem comes when optimizing: the data files grow from about 200M to 3G, and the process finally fails with:

============================================
D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/buffered_index_io.rb:178:in `refill': EOFError (EOFError)
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/buffered_index_io.rb:94:in `read_byte'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/index_io.rb:61:in `read_vint'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/term_doc_enum.rb:131:in `next?'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/term_doc_enum.rb:273:in `next?'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:269:in `append_postings'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:262:in `times'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:262:in `append_postings'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:240:in `merge_term_info'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:215:in `merge_term_infos'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:176:in `merge_terms'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:48:in `merge'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:403:in `merge_segments'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:183:in `optimize'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:173:in `synchronize'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:173:in `optimize'
        from script/indexdb.rb:55
============================================

Are there any tips about optimizing?

Thanks again.

hui
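For context, the flow being attempted looks roughly like the sketch below. optimize is the method at the bottom of the backtrace; add_all_documents is a hypothetical helper standing in for the indexing loop shown earlier. Note that merging every segment down to one legitimately needs temporary disk space on top of the final index size, though the 200M-to-3G growth above is far beyond that.

==========================================
writer = IndexWriter.new("db/index.db", :create_if_missing => true,
                         :use_compound_file => false)
writer.max_merge_docs = 30000
writer.min_merge_docs = 4000

add_all_documents(writer)  # hypothetical helper: the << loop shown earlier

writer.optimize  # merge all segments into one (slow and disk-hungry)
writer.close
==========================================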
Hi Hui,

Can you email me your index directory listing? 3GB sounds very large. Also, how much data are you indexing? How many files/records, and what total size? This will help me work out what is wrong. I have a few ideas.

Cheers,
Dave

On 12/20/05, hui <fortez at gmail.com> wrote:
> But a problem comes when optimizing: the data files grow from about
> 200M to 3G, and the process finally fails with:
> <snip/>
Have you got my email, David?

I find searching is slow without optimizing: it takes 5s to query one word across 130,000 records, and 30s for a four-word phrase.

Hui

2005/12/21, David Balmain <dbalmain.ml at gmail.com>:
> Can you email me your index directory listing? 3GB sounds very large.
> <snip/>
Hi hui,

Sorry, I'm taking a couple of weeks off for Christmas. I'll be back to work on Ferret on the 11th of Jan. Hope you can wait till then. With the number of records you're working on, I suggest you wait until I finish cFerret.

Merry Christmas.
Dave

On 12/23/05, hui <fortez at gmail.com> wrote:
> I find searching is slow without optimizing: it takes 5s to query one
> word across 130,000 records, and 30s for a four-word phrase.
I tried Lucene last night, and it is so fast: it indexed 130,000 records (storing all the data) in about one hour, which took 10 hours with Ferret (storing only the id, and without optimizing).

It also seems Ferret cannot use the Lucene index data; I got an error:

====================================================
D:\InstantRails\rails_apps\muvava>ruby script\console
Loading development environment.
>> index = Ferret::Index::Index.new("db/index.db")
IndexError: index 11667591 out of string
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:122:in `[]='
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:122:in `initialize'
        from (irb):1:in `new'
        from (irb):1
>>
====================================================

2005/12/23, hui <fortez at gmail.com>:
> I find searching is slow without optimizing: it takes 5s to query one
> word across 130,000 records, and 30s for a four-word phrase.
> <snip/>
Have a nice holiday :)

2005/12/24, David Balmain <dbalmain.ml at gmail.com>:
> Sorry, I'm taking a couple of weeks off for Christmas. I'll be back to
> work on Ferret on the 11th of Jan.
> <snip/>
Hi Hui,

On 12/24/05, hui <fortez at gmail.com> wrote:
> I tried Lucene last night, and it is so fast: it indexed 130,000
> records (storing all the data) in about one hour, which took 10 hours
> with Ferret (storing only the id, and without optimizing).

Lucene is certainly a lot faster than Ferret. That's why I'm working on cFerret.

> It also seems Ferret cannot use the Lucene index data; I got an error:
> <snip/>

The error probably occurred because of a difference in the way Lucene handles UTF-8 strings. Ferret always treats strings as arrays of bytes, while Lucene treats them as arrays of characters in some instances (not all). For example, a Chinese character might have a length of 1 in a Lucene index and 4 in a Ferret index, so the two indexes are incompatible. This is something I discovered just before I went on holiday.

I'm still contemplating what to do about this. Treating strings as arrays of bytes still seems preferable to me, but it will make the Lucene indexes incompatible. Anyway, I'm afraid it's not high priority right now. I'd rather get cFerret finished so fewer people need/want to use Lucene. Hope this isn't too confusing.

Cheers,
Dave
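A small illustration of the mismatch Dave describes, assuming the Ruby 1.8 string semantics visible in the stack traces above (String#length counts bytes, not characters):

==========================================
kanji = "\346\227\245"  # the UTF-8 bytes of the character 日
puts kanji.length       # => 3 : three bytes, which is what Ferret records

# Java's String.length() counts UTF-16 chars, so Lucene sees the same
# character as length 1; the lengths and offsets written by the two
# libraries therefore disagree, producing errors like the one above.
==========================================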
On Jan 11, 2006, at 7:58 PM, David Balmain wrote:
> Treating strings as arrays of bytes still seems preferable to me,
> but it will make the Lucene indexes incompatible.

Please, please don't let Ferret stay incompatible with Java Lucene. Interoperating indexes is a major *feature*, for me at least. There are very solid reasons to want interoperability. For example, there are fine libraries in Java for indexing various types of content that don't have decent Ruby counterparts, so indexing with Java can be preferable in those cases.

If Java Lucene needs to change, so be it (and yes, this issue has come up before, with someone doing a port of Java Lucene to Perl; no, not Plucene, but a different port). Check the java-dev archives (or maybe java-user?). There were patches offered, but there were downsides to them in terms of performance, if I recall correctly.

	Erik
On 1/12/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> Please, please don't let Ferret stay incompatible with Java Lucene.
> Interoperating indexes is a major *feature*, for me at least.

Agreed. Don't for one second assume that I don't think this is important. It's just that it's not an easy issue to solve, and I'd be wasting my time if I started working on it in the pure Ruby version of Ferret; I'd have to repeat the work once cFerret is finished.

> If Java Lucene needs to change, so be it (and yes, this issue has come
> up before, with someone doing a port of Java Lucene to Perl; no, not
> Plucene, but a different port). Check the java-dev archives (or maybe
> java-user?).

Here is the discussion:

http://www.gossamer-threads.com/lists/lucene/java-dev/28334?search_string=perl%20unicode;#28334

From reading this, there are more issues at hand than just the performance. And I haven't seen any patches coming in for this, so I'm evidently not the only person who thinks this is a difficult problem. My feeling is that I'll be better off submitting a patch to Lucene rather than fitting Ferret to work with the current Lucene files. That is probably what I'll do once I finish cFerret. Hopefully someone will get to it before I do. ;-)

Just for the sake of discussion, the alternative is to add another Directory implementation that is compatible with the Lucene index format. Not the most elegant solution, but it would do the job, and we wouldn't have to sacrifice performance in Ferret for non-Java indexes. I should note at this point that there would be a definite sacrifice in performance to make Ferret compatible with Lucene indexes, but I'm not so sure the same is true the other way around.

Dave