I am indexing over 10,000 rows of data. Indexing slows down noticeably around rows 100, 1,000 and 10,000, and it has now been stuck on row 10,000 for over an hour. How can I make it faster? Here is my code:

=================
doc = Document.new
doc << Field.new("id", t.id, Field::Store::YES,
                 Field::Index::UNTOKENIZED)
doc << Field.new("title", t.title, Field::Store::NO,
                 Field::Index::TOKENIZED)
doc << Field.new("body", t.body, Field::Store::NO,
                 Field::Index::TOKENIZED)
doc << Field.new("album", t.album.name, Field::Store::NO,
                 Field::Index::TOKENIZED)
doc << Field.new("artist", t.album.artist.name, Field::Store::NO,
                 Field::Index::TOKENIZED)
doc << Field.new("release", t.album.release, Field::Store::NO,
                 Field::Index::UNTOKENIZED)
index << doc
=================

I only store the id; the other data stays in the database, because if I store everything in Ferret my PC practically dies.
This is very likely due to the merge factors. Lucene (and thus Ferret) reorganizes the index periodically. These settings are controllable, at least with Java Lucene. The trade-off is how much memory you want the indexing process to use.

	Erik

On Dec 19, 2005, at 9:11 AM, hui wrote:
> I am indexing over 10,000 rows of data. Indexing slows down noticeably
> around rows 100, 1,000 and 10,000, and it has now been stuck on row
> 10,000 for over an hour. How can I make it faster?
> <snip/>
On 12/19/05, hui <fortez at gmail.com> wrote:
> I am indexing over 10,000 rows of data. Indexing slows down noticeably
> around rows 100, 1,000 and 10,000, and it has now been stuck on row
> 10,000 for over an hour. How can I make it faster?
> <snip/>

Hi Hui,

This looks fine. Some suggestions:

* Make sure you are not using auto_flush. That will slow indexing down
  considerably. In fact, it is probably better to use Index::IndexWriter
  rather than Index::Index.
* You could index everything in memory and then write it to disk. This
  will depend on how much memory you have and how big the index becomes.
* You can play around with :merge_factor, :min_merge_docs and
  :max_merge_docs in IndexWriter. They are currently set to 10 but you
  might get more speed with different settings. Try anything between 2
  and 100.
* You can switch :use_compound_file to false. This will speed things up,
  but you may get an error for having too many files open.

There are a few other things you can do, like indexing in parallel, but I won't go into them yet. I'm currently working on some pretty big speed-ups by implementing everything in C. After I release that version I think most performance problems will go away. It will certainly speed things up more than any changes you might make to the current version. Anyway, I thought I'd have it out by Christmas but it's turning into a bigger task than I thought (doesn't this always seem to happen). I will get it finished early next year though, so if you can put up with the slow performance until then, relief is on its way. :-)

Cheers,
Dave

PS The reason it takes a long time at 100, 1,000 and 10,000 is that indexing is done in segments, and at 100, 1,000, etc. a number of the segments are merged together into bigger segments. Here is a good picture for you:

http://nutch.sourceforge.net/blog/2004/11/dynamization-and-lucene.html

The :merge_factor in the picture is 3. The current merge factor in Ferret is 10. Hope that makes sense.
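To make the first two suggestions concrete, here is a minimal batch-indexing sketch. It uses only calls that appear elsewhere in this thread (Document.new, Field.new, IndexWriter.new, <<); the `rows` enumerable is a hypothetical stand-in for the database result set, and the bare Document/Field names assume includes along the lines of hui's original snippet, which may need adjusting for your Ferret version.

==========================================
require 'ferret'
include Ferret::Index
include Ferret::Document

# Open one writer for the whole batch instead of an auto-flushing
# Index::Index, add every document, and flush once at the end.
writer = IndexWriter.new("db/index.db", :create_if_missing => true)

rows.each do |t|  # `rows`: hypothetical DB result set
  doc = Document.new
  doc << Field.new("id", t.id, Field::Store::YES,
                   Field::Index::UNTOKENIZED)
  doc << Field.new("title", t.title, Field::Store::NO,
                   Field::Index::TOKENIZED)
  writer << doc   # no per-document flush
end

writer.close      # a single flush when the batch is done
==========================================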
Sorry, just to correct myself: :max_merge_docs is set to a very big number, not 10. I'll try to quickly explain these numbers and the trade-offs.

:min_merge_docs => the minimum number of documents a segment must have before it is merged. Set this to a larger number if you want to use more RAM and speed things up.

:merge_factor => this tells Ferret when to merge segments larger than min_merge_docs. A high value means merges are done less often, which means faster indexing but slower searching. You can set this to a high value when you do your batch index, then optimize the index and lower the value afterwards when search speed is more important. A higher value also requires more memory.

:max_merge_docs => this sets the maximum number of documents in a segment. Once this count is reached, that segment is no longer merged with the other segments unless optimize is called. You might set this to a lower value to stop the IndexWriter from holding the lock for too long while doing a merge. For example, if you set it to 1000 you wouldn't get that really long hang time at 10,000.

On 12/19/05, David Balmain <dbalmain.ml at gmail.com> wrote:
> * You can play around with :merge_factor, :min_merge_docs and
> :max_merge_docs in IndexWriter. They are currently set to 10 but you
> might get more speed with different settings.
> <snip/>
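Pulling these three knobs together: a hedged sketch of batch-time settings, using the writer attributes hui confirms working later in this thread (min_merge_docs, max_merge_docs). Treating merge_factor as settable the same way is an assumption here, and the values are only starting points to experiment with.

==========================================
writer = IndexWriter.new("db/index.db",
                         :create_if_missing => true,
                         :use_compound_file => false)  # more files, less copying
writer.min_merge_docs = 4000   # buffer more docs in RAM before the first merge
writer.merge_factor   = 100    # assumption: merge rarely while batch indexing
writer.max_merge_docs = 30000  # cap segment size to bound each merge pause
==========================================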
Oh, and one last thing: Erik's book "Lucene in Action" has a much better explanation of all of this.

On 12/19/05, David Balmain <dbalmain.ml at gmail.com> wrote:
> Sorry, just to correct myself: :max_merge_docs is set to a very big
> number, not 10. I'll try to quickly explain these numbers and the
> trade-offs.
> <snip/>
On 12/19/05, David Balmain <dbalmain.ml at gmail.com> wrote:
> Oh, and one last thing: Erik's book "Lucene in Action" has a much
> better explanation of all of this.

I'll second this. Even though I am not using the Java version of Lucene, I have found this book very helpful in explaining the concepts underlying the various Lucene frameworks.

-F
On Mon, 2005-12-19 at 11:21 -0500, Finn Smith wrote:
> I'll second this. Even though I am not using the Java version of
> Lucene, I have found this book very helpful in explaining the concepts
> underlying the various Lucene frameworks.

As long as we're voting... I'll third it! I was trying to figure things out from the API docs and source code and was horribly confused until I went out and picked up Lucene in Action last week. It helps that it's a very well written book (good job Erik!).

Thomas
On Dec 19, 2005, at 12:23 PM, Thomas Lockney wrote:
> As long as we're voting... I'll third it! I was trying to figure things
> out from the API docs and source code and was horribly confused until I
> went out and picked up Lucene in Action last week.
> <snip/>

Thanks everyone! I'll pass these kind words on to Otis as well, who is probably not tuned into the Ferret community (Python people... geez!).

	Erik
Thank you very much, everybody! I will try all the suggestions and re-index my data. Ferret is great! I cannot wait for the new version, and Unicode support ;-)

hui
The indexing is quite fast now. I use the following code:

==========================================
index = IndexWriter.new("db/index.db", :create_if_missing => true,
                        :use_compound_file => false)
index.max_merge_docs = 30000
index.min_merge_docs = 4000
==========================================

But a problem comes when optimizing: the data files grow from about 200M to 3G, and the process finally fails with:

============================================
D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/buffered_index_io.rb:178:in `refill': EOFError (EOFError)
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/buffered_index_io.rb:94:in `read_byte'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/index_io.rb:61:in `read_vint'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/term_doc_enum.rb:131:in `next?'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/term_doc_enum.rb:273:in `next?'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:269:in `append_postings'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:262:in `times'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:262:in `append_postings'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:240:in `merge_term_info'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:215:in `merge_term_infos'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:176:in `merge_terms'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:48:in `merge'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:403:in `merge_segments'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:183:in `optimize'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:173:in `synchronize'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:173:in `optimize'
        from script/indexdb.rb:55
============================================

Are there any tips about optimizing?

Thanks again.

hui
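For context, the flow being attempted looks roughly like the sketch below. optimize is the method at the bottom of the backtrace; add_all_documents is a hypothetical helper standing in for the indexing loop shown earlier. Note that merging every segment down to one legitimately needs temporary disk space on top of the final index size, though the 200M-to-3G growth above is far beyond that.

==========================================
writer = IndexWriter.new("db/index.db", :create_if_missing => true,
                         :use_compound_file => false)
writer.max_merge_docs = 30000
writer.min_merge_docs = 4000

add_all_documents(writer)  # hypothetical helper: the << loop shown earlier

writer.optimize  # merge all segments into one (slow and disk-hungry)
writer.close
==========================================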
Hi Hui,

Can you email me your index directory listing? 3GB sounds very large. Also, how much data are you indexing? How many files/records, and what total size? This will help me work out what is wrong. I have a few ideas.

Cheers,
Dave

On 12/20/05, hui <fortez at gmail.com> wrote:
> But a problem comes when optimizing: the data files grow from about
> 200M to 3G, and the process finally fails with:
> <snip/>
Have you got my email, David?

I find searching is slow without optimizing: it takes 5s to query one word across 130,000 records, and 30s for a four-word phrase.

Hui

2005/12/21, David Balmain <dbalmain.ml at gmail.com>:
> Can you email me your index directory listing? 3GB sounds very large.
> <snip/>
Hi hui,

Sorry, I'm taking a couple of weeks off for Christmas. I'll be back to work on Ferret on the 11th of Jan. Hope you can wait till then. With the number of records you're working on, I suggest you wait until I finish cFerret.

Merry Christmas.
Dave

On 12/23/05, hui <fortez at gmail.com> wrote:
> I find searching is slow without optimizing: it takes 5s to query one
> word across 130,000 records, and 30s for a four-word phrase.
I tried Lucene last night, and it is so fast: it indexed 130,000 records (storing all the data) in about one hour, which took 10 hours with Ferret (storing only the id, and without optimizing).

It also seems Ferret cannot use the Lucene index data; I got an error:

====================================================
D:\InstantRails\rails_apps\muvava>ruby script\console
Loading development environment.
>> index = Ferret::Index::Index.new("db/index.db")
IndexError: index 11667591 out of string
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:122:in `[]='
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:122:in `initialize'
        from (irb):1:in `new'
        from (irb):1
>>
====================================================

2005/12/23, hui <fortez at gmail.com>:
> I find searching is slow without optimizing: it takes 5s to query one
> word across 130,000 records, and 30s for a four-word phrase.
> <snip/>
Have a nice holiday :)

2005/12/24, David Balmain <dbalmain.ml at gmail.com>:
> Sorry, I'm taking a couple of weeks off for Christmas. I'll be back to
> work on Ferret on the 11th of Jan.
> <snip/>
Hi Hui,

On 12/24/05, hui <fortez at gmail.com> wrote:
> I tried Lucene last night, and it is so fast: it indexed 130,000
> records (storing all the data) in about one hour, which took 10 hours
> with Ferret (storing only the id, and without optimizing).

Lucene is certainly a lot faster than Ferret. That's why I'm working on cFerret.

> It also seems Ferret cannot use the Lucene index data; I got an error:
> <snip/>

The error probably occurred because of a difference in the way Lucene handles UTF-8 strings. Ferret always treats strings as arrays of bytes, while Lucene treats them as arrays of characters in some instances (not all). For example, a Chinese character might have a length of 1 in a Lucene index and 4 in a Ferret index, so the two indexes are incompatible. This is something I discovered just before I went on holiday.

I'm still contemplating what to do about this. Treating strings as arrays of bytes still seems preferable to me, but it will make the Lucene indexes incompatible. Anyway, I'm afraid it's not high priority right now. I'd rather get cFerret finished so fewer people need/want to use Lucene. Hope this isn't too confusing.

Cheers,
Dave
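A small illustration of the mismatch Dave describes, assuming the Ruby 1.8 string semantics visible in the stack traces above (String#length counts bytes, not characters):

==========================================
kanji = "\346\227\245"  # the UTF-8 bytes of the character 日
puts kanji.length       # => 3 : three bytes, which is what Ferret records

# Java's String.length() counts UTF-16 chars, so Lucene sees the same
# character as length 1; the lengths and offsets written by the two
# libraries therefore disagree, producing errors like the one above.
==========================================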
On Jan 11, 2006, at 7:58 PM, David Balmain wrote:
> Treating strings as arrays of bytes still seems preferable to me,
> but it will make the Lucene indexes incompatible.

Please, please don't let Ferret stay incompatible with Java Lucene. Interoperating indexes is a major *feature*, for me at least. There are very solid reasons to want interoperability. For example, there are fine libraries in Java for indexing various types of content that don't have decent Ruby counterparts, so indexing with Java can be preferable in those cases.

If Java Lucene needs to change, so be it (and yes, this issue has come up before, with someone doing a port of Java Lucene to Perl; no, not Plucene, but a different port). Check the java-dev archives (or maybe java-user?). There were patches offered, but there were downsides to them in terms of performance, if I recall correctly.

	Erik
On 1/12/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> Please, please don't let Ferret stay incompatible with Java Lucene.
> Interoperating indexes is a major *feature*, for me at least.

Agreed. Don't for one second assume that I don't think this is important. It's just that it's not an easy issue to solve, and I'd be wasting my time if I started working on it in the pure Ruby version of Ferret; I'd have to repeat the work once cFerret is finished.

> If Java Lucene needs to change, so be it (and yes, this issue has come
> up before, with someone doing a port of Java Lucene to Perl; no, not
> Plucene, but a different port). Check the java-dev archives (or maybe
> java-user?).

Here is the discussion:

http://www.gossamer-threads.com/lists/lucene/java-dev/28334?search_string=perl%20unicode;#28334

From reading this, there are more issues at hand than just the performance. And I haven't seen any patches coming in for this, so I'm evidently not the only person who thinks this is a difficult problem. My feeling is that I'll be better off submitting a patch to Lucene rather than fitting Ferret to work with the current Lucene files. That is probably what I'll do once I finish cFerret. Hopefully someone will get to it before I do. ;-)

Just for the sake of discussion, the alternative is to add another Directory implementation that is compatible with the Lucene index format. Not the most elegant solution, but it would do the job, and we wouldn't have to sacrifice performance in Ferret for non-Java indexes. I should note at this point that there would be a definite sacrifice in performance to make Ferret compatible with Lucene indexes, but I'm not so sure the same is true the other way around.

Dave