Fez Bojangles
2007-Jan-05 00:18 UTC
[Ferret-talk] Hitting Files per Directory Limits with Ferret?
Hey all!

We've been using Ferret to great success these past six months. But recently we tried adding many new ContentItems (the only thing being indexed by Ferret at the moment), and things came crashing to a halt.

ferret gem: 0.10.9
acts_as_ferret plugin (not sure which version)

How we're using the plugin:

class ContentItem < ActiveRecord::Base
  acts_as_ferret :fields => { 'title'       => {},
                              'description' => {} }
  ...
end

In the directory (on production) 'index/production/content_item', there are now 45812 files. (This is on Fedora Core 5, btw.) This leads me to believe this could be the culprit... and if not the culprit now, it will be soon.

>> ContentItem.count
=> 19603

Any ideas? Any help would be much appreciated. Thanks!

- Fez
http://ummyeah.com/

--
Posted via http://www.ruby-forum.com/.
Ferret can optimize its index, which will collapse the files in an index directory. Sadly enough, acts_as_ferret does not call it unless you choose to rebuild the entire index. This could solve your problem: ContentItem.rebuild_index. This might take a while...

Regards,
Ewout
Fez Bojangles
2007-Jan-05 04:02 UTC
[Ferret-talk] Hitting Files per Directory Limits with Ferret?
Just a heads up... rebuilding the index did the trick.
http://www.ruby-forum.com/topic/89245

I'm curious though: how many items can Ferret reasonably be expected to scale to? And, if anyone has hit Ferret's natural limits, are there any solutions (e.g. partitioning the index into manageable chunks) that still use Ferret as the base search indexer / engine?
Jan Prill
2007-Jan-05 10:48 UTC
[Ferret-talk] Hitting Files per Directory Limits with Ferret?
Hey Fez,

the limit of indexed items of ferret (and lucene) shouldn't be in the thousands but in the millions. I've indexed hundreds of thousands of documents myself with ferret as well as with lucene, and 20,000 is not even near the limit.

Regarding the file count in the index directory: it seems as if the index was never optimized. Optimization defragments the chunks into one big index file. You should investigate why this didn't happen. I have not looked into the aaf code for some time, but I think that it should do index optimization from time to time.

Cheers,
Jan

--
------------------------------
http://www.inviado.de - Internetseiten für RAe
http://www.xing.com/profile/Jan_Prill
Actually, it does not. The only call to index.optimize is in the rebuild_index method. A possible extension for aaf would be to call index.optimize automatically every C insertions, where C is some constant (1000 seems reasonable).

I can only agree with Jan on scalability: at the moment I'm keeping an index of over 700,000 bibliographic records. Searches are instant.

Regards,
Ewout
I created a patch for acts_as_ferret that will optimize the index every 100 insertions (experience will have to show whether this constant is adequate).

The only prerequisite is that your model has an id attribute that increases one by one, automatically, since the id is used to determine when to optimize.

Just apply this patch to instance_methods.rb of acts_as_ferret to try it. Hope this will be of use.
(Patch attachment: http://rubyforge.org/pipermail/ferret-talk/attachments/20070105/62aa2743/attachment.obj)
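[The patch itself is not reproduced in the archive, but the idea it describes — optimize after every N additions — can be sketched independently of acts_as_ferret as a small wrapper around any index object that responds to << and optimize (as Ferret::Index::Index does). The class and parameter names below are hypothetical, not part of the actual patch:]

```ruby
# Sketch: wrap an index and call optimize once per +every+ additions.
# Only assumes the wrapped index responds to << and optimize.
class OptimizingIndex
  def initialize(index, every = 100)
    @index = index
    @every = every
    @count = 0
  end

  # Add a document; trigger an optimize on every Nth addition.
  def <<(doc)
    @index << doc
    @count += 1
    @index.optimize if @count % @every == 0
  end
end
```

Counting insertions inside the wrapper avoids the patch's prerequisite of a strictly sequential id column, at the cost of the counter resetting when the process restarts.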
Erik Hatcher
2007-Jan-05 15:55 UTC
[Ferret-talk] Hitting Files per Directory Limits with Ferret?
Ferret itself does not automatically optimize after so many document insertions? Lucene does, but maybe Ferret does not? It certainly causes indexing hiccups when Lucene hits that optimization, so care has to be taken to account for that possible optimization delay, or to tune the parameters so you know when to expect it.

	Erik
If ferret were to implement automatic optimization, it should indeed be optional and parameterizable. For example: suppose you are indexing 500,013 documents. After indexing, you would naturally call index.optimize. But suppose ferret automatically optimizes every 1000 insertions. Obviously, there's lots of overhead here (optimizing 501 times instead of just once).

The ideal solution would be parallel:
- index optimization happens in a separate process
- while optimizing, the old index is still available

Is this possible now? Is Ferret safe enough to allow one process to optimize the index while another is using it? Also, does anyone have data about the duration of an optimization process? I don't think it takes too long, but I haven't got any concrete data on that (yet).

Ewout
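[The 501 figure above follows from quick arithmetic, assuming one automatic optimize per full batch of 1000 insertions plus the final explicit call:]

```ruby
docs  = 500_013
every = 1_000

# Integer division: one automatic optimize per completed batch of 1000.
auto_optimizes  = docs / every        # 500
# Plus the one explicit index.optimize after indexing finishes.
total_optimizes = auto_optimizes + 1  # 501
```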