I'm building a new index from scratch based on a number of documents stored in a database, loaded using my Rails env (using Ruby Ferret 0.9x, installed today with Gem, on Windows). At first everything goes nicely, but after a number of documents it starts to go slower and slower until it grinds to a halt (at least it feels like it).

Am I doing something wrong? Is there some way to work around this?

/Marcus

Code in question:

ENV['RAILS_ENV'] ||= 'development'
puts "Environment : #{ENV['RAILS_ENV']}"

require 'config/environment.rb'
require 'ferret'

index = Ferret::Index::Index.new(:path => Node.class_index_dir,
                                 :create => true)
Node.find_all_by_type("PageNode").each { |content|
  puts "ID: #{content.id} => name: #{content.title}"
  index << content.to_doc if content.respond_to?("to_doc")
}
index.flush
index.optimize
index.close

-- 
Posted via http://www.ruby-forum.com/.
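One way to find out where the slowdown actually sits (not from the thread; a hypothetical sketch using Ruby's standard Benchmark library) is to time each add so the slow documents identify themselves. In the real script the block would contain `index << content.to_doc`; `timed_add` and the threshold are made-up names for illustration:

```ruby
require 'benchmark'

# Hypothetical instrumentation: time each indexing call and report the
# ones that exceed a threshold. The yielded block stands in for the
# real `index << content.to_doc` call.
def timed_add(label, threshold = 1.0)
  seconds = Benchmark.realtime { yield }
  puts "SLOW (#{format('%.2f', seconds)}s): #{label}" if seconds > threshold
  seconds
end

# Usage with a cheap stand-in for the real indexing work:
elapsed = timed_add("PageNode 42") { 10_000.times { |n| n.to_s } }
```

With the per-document times in hand it becomes obvious whether the whole run degrades or a handful of documents eat all the time.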
Hi Marcus,

by using Ferret 0.9.3 on Windows you are using the pure Ruby version. Some time ago someone (I think it was Jens Kraemer) suggested that on Windows, downgrading to 0.3.2 might be a good idea, because that version comes with a native extension even on Windows (not as feature-rich as cFerret, of course, but a predecessor). Pure Ruby, as clean and wonderful as the language is, is slow compared to Java or C, so pure Ruby Ferret isn't really the first choice for building an index over a large document set.

Another possibility you might want to think about while waiting for cFerret on Windows: do the initial big indexing batch on a Linux or OS X/FreeBSD machine, transfer the index, and perform only the ongoing updates on Windows.

Regardless of what I've said before: what performance are you seeing with your pure Ruby installation? How many records do you need to index initially? After how many records do you hit the bottleneck?

Regards
Jan

On 5/24/06, Marcus Andersson <m-lists at bristav.se> wrote:
> I'm building a new index from scratch based on a number of documents
> stored in a database, loaded using my Rails env (using Ruby Ferret 0.9x,
> installed today with Gem, on Windows). At first everything goes nicely,
> but after a number of documents it starts to go slower and slower until
> it grinds to a halt (at least it feels like it).
>
> Am I doing something wrong? Is there some way to work around this?
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
Jan Prill wrote:
> Regardless of what I've said before: what performance are you seeing
> with your pure Ruby installation? How many records do you need to index
> initially? After how many records do you hit the bottleneck?

After quite a bit more testing, it seems that speed is content-dependent. The slow content turns out to be ugly test content, where someone has apparently just hammered random keystrokes. The content it chokes on is down at the end.

/Marcus

Each document is built this way (documents may contain UTF-8 chars, but I ignore that for now):

class Node < ActiveRecord::Base
  acts_as_ferret ...
end

class PageNode < Node
  def to_doc
    doc = super
    page.content_items.each { |item|
      item.to_doc(doc) if item.searchable?
    } if page
    doc
  end
end

class ContentItem
  def to_doc(doc)
    doc << Ferret::Document::Field.new(
      'content_item', self.content,
      Ferret::Document::Field::Store::NO,
      Ferret::Document::Field::Index::TOKENIZED)
  end
end

Content:

<h1>Huvudrubrik svart</h1>ldfkgjdflkgjdflkgjdflgkdflgkdflgkjdflkgj<br><br><h2>Huvudrubrik orange</h2>sdlkfjsdfkljsdlfksjdflsjflskfjslkfjslkdfsd<br>fsd fsdfsd<br>fsdfsdfsddfdsdfsdf<br><h3>Underrubrik svart</h3><p>dfgfgdfgdfgdfgdfgdfgdfgdf<br>gdfgdgdfgkjhdfkjghdkjgh dkjghd kgjhd kgfjh d<br></p><h4>Underrubrik orange</h4>lkdfjgldfkgjdlfkgjdlfkgjdflkgdfg<br>dfgdfgdfgdfgdfgdfg<br><br><h5>Styckerubrik svart</h5>fghfhfghfhkfjglhkjfglhkfjhlkfjghlfkhjflkgh jflgkhjflgkhf<br>ghfghfgh<br>fgh<br>fghfghfghgfh<br><br><h6>Styckerubrik orange</h6>fghkfgjhlfgkjhflkghj flghkjfgl hkjfg lhkfgjhlfgkhfghfgh<br>
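A rough way to flag this kind of keyboard-mash content before it ever reaches the index (purely a hypothetical heuristic, not anything Ferret provides): random keystroke runs tend to produce overly long tokens that almost never repeat, unlike natural language, so token length and vocabulary diversity give a cheap signal. `suspicious_content?` and its thresholds are made up for illustration:

```ruby
# Hypothetical pre-indexing filter: flag content whose tokens look like
# random keystrokes rather than natural language.
def suspicious_content?(text, max_token_len = 40, min_tokens = 50)
  # Strip markup first, then split into word-like tokens.
  tokens = text.gsub(/<[^>]*>/, ' ').scan(/[[:alnum:]]+/)
  return false if tokens.empty?
  # Natural-language words rarely exceed a few dozen characters.
  return true if tokens.any? { |t| t.length > max_token_len }
  # Natural text repeats words; a near-100% unique vocabulary over many
  # tokens is a red flag.
  tokens.length >= min_tokens && tokens.uniq.length.to_f / tokens.length > 0.9
end
```

Documents that trip the filter could be skipped or logged for manual review instead of being fed to the analyzer.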
Hi, Marcus, I don''t know too much about the internals of ferret. But I''m not too much surprised that ferret is choking on this ''content''. As all fulltext search engines ferret will presume that it''s human readable language that is going to be indexed. It would be only because of coincidence that tests of the stemming, analyzing (and so on) algorithms won''t fail, which results in lengthy parsings at least. Is it only because of problems to get ''real world'' test content? You''ll find loads of content on http://www.gutenberg.org/ for example... Regards Jan On 5/25/06, Marcus Andersson <m-lists at bristav.se> wrote:> > Jan Prill wrote: > > > > Regardless what I''ve said before: What performance are you experiencing > > with > > your pure ruby installation? How much datasets do you need to index > > initially? When (after how much datasets) are you experiencing the > > bottleneck? > > > After doing quite a bit more of testing it seems that speed seems to be > content dependant. The content is ugly test content it seems where > someone have just made random key strokes. > > Content that it shokes on is down at the end. > > /Marcus > > Each document is built this way (documents may contain UTF-8 chars but I > ignore that for now): > > class Node < ActiveRecord::Base > acts_as_ferret ... > end > > class PageNode < Node > def to_doc > doc = super > page.content_items.each { |item| item.to_doc(doc) if > item.searchable? 
} if page > doc > end > end > > class ContentItem > def to_doc(doc) > doc << Ferret::Document::Field.new( > ''content_item'', self.content, > Ferret::Document::Field::Store::NO, > Ferret::Document::Field::Index::TOKENIZED) > end > end > > Content: > > > > > > > > <h1>Huvudrubrik > svart</h1>ldfkgjdflkgjdflkgjdflgkdflgkdflgkjdflkgj<br><br><h2>Huvudrubrik > orange</h2>sdlkfjsdfkljsdlfksjdflsjflskfjslkfjslkdfsd<br>fsd > fsdfsd<br>fsdfsdfsddfdsdfsdf<br><h3>Underrubrik > svart</h3><p>dfgfgdfgdfgdfgdfgdfgdfgdf<br>gdfgdgdfgkjhdfkjghdkjgh dkjghd > kgjhd kgfjh d<br></p><h4>Underrubrik > > orange</h4>lkdfjgldfkgjdlfkgjdlfkgjdflkgdfg<br>dfgdfgdfgdfgdfgdfg<br><br><h5>Styckerubrik > svart</h5>fghfhfghfhkfjglhkjfglhkfjhlkfjghlfkhjflkgh > jflgkhjflgkhf<br>ghfghfgh<br>fgh<br>fghfghfghgfh<br><br><h6>Styckerubrik > orange</h6>fghkfgjhlfgkjhflkghj flghkjfgl hkjfg lhkfgjhlfgkhfghfgh<br> > > > > > > > -- > Posted via http://www.ruby-forum.com/. > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060525/81b01868/attachment-0001.htm
This is actually content from the customer's database. Most of the content in the database is real (it's actually in live deployment). The problem seems to be that they created a number of test pages in the beginning that are still there.

How do I, as a developer, ensure that the content isn't of a form that Ferret chokes on? I mean, even if I take the test data out now, I cannot guarantee that someone else won't put similar data into the database again. Then it's me, the developer, who will take the blame when search isn't working.

It must be possible to either:

- somehow test the data before indexing to ensure it's not "deadly", or
- have the indexing algorithm skip ahead after a (configurable) time if it's stuck on a small chunk of data (or something like it).

Would it help in this case to replace the <html> tags with spaces (as those aren't significant anyway)?

Regards
Marcus

ps. Thanks for the comments.
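The "skip after a configurable time" idea can be approximated in plain Ruby with the standard timeout library (a sketch only, not a Ferret feature; `index_with_budget` and the budget value are invented here):

```ruby
require 'timeout'

# Hypothetical wrapper: give each document a time budget and skip it if
# indexing (the yielded block, e.g. `index << doc`) runs over.
def index_with_budget(doc_id, budget_seconds)
  Timeout.timeout(budget_seconds) { yield }
  true
rescue Timeout::Error
  puts "Skipped document #{doc_id}: exceeded #{budget_seconds}s"
  false
end
```

One caveat: Ruby's Timeout interrupts the block from a watchdog thread, so aborting mid-write could leave the index in an inconsistent state; collecting the offending IDs and re-indexing them separately would be safer than relying on the interrupt alone.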
Marcus Andersson wrote:
> Would it help in this case to replace the <html> tags with spaces (as
> those aren't significant anyway)?

Answering myself here: no, it doesn't (after testing...).

Marcus
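For reference, the kind of tag stripping being discussed can be a one-line gsub (Marcus's actual code isn't shown in the thread; this is a minimal stand-in):

```ruby
# Minimal stand-in for the tag stripping discussed above (the actual
# implementation tried is not shown in the thread).
def strip_tags(html)
  html.gsub(/<[^>]*>/, ' ')
end

strip_tags("<h1>Huvudrubrik svart</h1>text<br>more")
# => " Huvudrubrik svart text more"
```

Since stripping the markup didn't change the timing, the suspicion falls on the random-keystroke tokens themselves rather than on the tags.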
More testing: this document (with several fields in it) took 15 seconds to index:

Field: new item
Field: Presentationsmaterial
Field: Ppt-presentationer
Field:
Field: new item
Field: new item
Field: new item

A bit long for that little content, if you ask me. I have several similar documents that take a lot of time ("new item" is an ugly default value that all content items get from the beginning, don't ask me why; does it affect indexing speed when a lot of documents contain similar tokens?).

But I don't know. I'm using the Ruby version, which is supposed to be slow. Maybe the super-fast C implementation would take 150 ms to handle a document of this size? What affects indexing speed?

Regards,
Marcus
Hi Marcus,

as you may read in http://ferret.davebalmain.com/trac/wiki/MyFirstBenchmark, indexing 408 MB of Project Gutenberg files took around a minute. That should give you an impression of the indexing speed.

I haven't got the time right now to test the performance on a Windows box and with cFerret; maybe someone else can jump in. But 15 seconds for this document is obviously strange.

cheers,
Jan

On 5/25/06, Marcus Andersson <m-lists at bristav.se> wrote:
> More testing: this document (with several fields in it) took 15 seconds
> to index:
>
> Field: new item
> Field: Presentationsmaterial
> Field: Ppt-presentationer
> Field:
> Field: new item
> Field: new item
> Field: new item
>
> A bit long for that little content, if you ask me.
Hi Marcus,

if it would be of any help to you and you've got the time to make some preparations, you might send me a test.sql (or a migration) with a little test data and your essential AR models. Then I can test it on a Windows box and we can compare the results...

cheers,
Jan

On 5/25/06, Jan Prill <jan.prill at gmail.com> wrote:
> I haven't got the time right now to test the performance on a Windows
> box and with cFerret; maybe someone else can jump in. But 15 seconds for
> this document is obviously strange.
Jan Prill wrote:
> if it would be of any help to you and you've got the time to make some
> preparations, you might send me a test.sql (or a migration) with a
> little test data and your essential AR models. Then I can test it on a
> Windows box and we can compare the results...

Thanks for your time. I think I'll wait for the Windows C version, though. I've implemented an ugly straight DB search for the time being.

Regards,
Marcus
On 5/26/06, Marcus Andersson <m-lists at bristav.se> wrote:
> This document (with several fields in it) took 15 seconds to index:
>
> Field: new item
> Field: Presentationsmaterial
> Field: Ppt-presentationer
> Field:
> Field: new item
> Field: new item
> Field: new item
>
> But I don't know. I'm using the Ruby version, which is supposed to be
> slow. Maybe the super-fast C implementation would take 150 ms to handle
> a document of this size? What affects indexing speed?

Hi Marcus,

I just tested this here:

require 'lib/rferret.rb'

include Ferret
include Ferret::Document
include Ferret::Index

doc = Document.new
doc << Field.new(:field, "new item")
doc << Field.new(:field, "Presentationsmaterial")
doc << Field.new(:field, "Ppt-presentationer")
doc << Field.new(:field, " ")
doc << Field.new(:field, "new item")
doc << Field.new(:field, "new item")
doc << Field.new(:field, "new item")

i = Index.new(:path => "index_dir")
i << doc
i.close

dbalmain at ubuntu:~/workspace/ferret $ time ruby test.rb

real    0m0.147s
user    0m0.125s
sys     0m0.022s

This is with the pure Ruby version. If this document is taking 15 seconds, then something is going wrong. Similarly, the bad data shouldn't hurt indexing speed considerably, although it will make your index larger than usual and merging will take a little longer. Could you post a simple test case that takes a long time for you?

Cheers,
Dave