I'm building a new index from scratch based on a number of documents stored in a database, loaded using my Rails env (using Ruby Ferret 0.9x, installed today with Gem, on Windows). At first everything goes nicely, but after a number of documents it starts to go slower and slower until it grinds to a halt (at least it feels like it).

Am I doing something wrong? Is there some way to work around this?

/Marcus

Code in question:

ENV['RAILS_ENV'] ||= 'development'
puts "Environment : #{ENV['RAILS_ENV']}"

require 'config/environment.rb'
require 'ferret'

index = Ferret::Index::Index.new(:path => Node.class_index_dir,
                                 :create => true)
Node.find_all_by_type("PageNode").each { |content|
  puts "ID: #{content.id} => name: #{content.title}"
  index << content.to_doc if content.respond_to?("to_doc")
}
index.flush
index.optimize
index.close

-- 
Posted via http://www.ruby-forum.com/.
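One way to find out where the slowdown actually sits (not from the thread; a hypothetical sketch using Ruby's standard Benchmark library) is to time each add so the slow documents identify themselves. In the real script the block would contain `index << content.to_doc`; `timed_add` and the threshold are made-up names for illustration:

```ruby
require 'benchmark'

# Hypothetical instrumentation: time each indexing call and report the
# ones that exceed a threshold. The yielded block stands in for the
# real `index << content.to_doc` call.
def timed_add(label, threshold = 1.0)
  seconds = Benchmark.realtime { yield }
  puts "SLOW (#{format('%.2f', seconds)}s): #{label}" if seconds > threshold
  seconds
end

# Usage with a cheap stand-in for the real indexing work:
elapsed = timed_add("PageNode 42") { 10_000.times { |n| n.to_s } }
```

With the per-document times in hand it becomes obvious whether the whole run degrades or a handful of documents eat all the time.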
Hi Marcus,

by using Ferret 0.9.3 on Windows you are using the pure Ruby version. Some time ago someone (I think it was Jens Kraemer) suggested that on Windows, downgrading to 0.3.2 might be a good idea, because that version comes with a native extension even on Windows (not as feature-rich as cFerret, of course, but a predecessor). Pure Ruby, as clean and wonderful as the language is, is slow compared to Java or C, so pure Ruby Ferret isn't really the first choice for building an index over a large document set.

Another possibility you might want to think about while waiting for cFerret on Windows: do the initial big indexing batch on a Linux or OS X/FreeBSD machine, transfer the index, and perform only the ongoing updates on Windows.

Regardless of what I've said before: what performance are you seeing with your pure Ruby installation? How many records do you need to index initially? After how many records do you hit the bottleneck?

Regards
Jan

On 5/24/06, Marcus Andersson <m-lists at bristav.se> wrote:
> I'm building a new index from scratch based on a number of documents
> stored in a database, loaded using my Rails env (using Ruby Ferret 0.9x,
> installed today with Gem, on Windows). At first everything goes nicely,
> but after a number of documents it starts to go slower and slower until
> it grinds to a halt (at least it feels like it).
>
> Am I doing something wrong? Is there some way to work around this?
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
Jan Prill wrote:
> Regardless of what I've said before: what performance are you seeing
> with your pure Ruby installation? How many records do you need to index
> initially? After how many records do you hit the bottleneck?

After quite a bit more testing, it seems that speed is content-dependent. The slow content turns out to be ugly test content, where someone has apparently just hammered random keystrokes. The content it chokes on is down at the end.

/Marcus

Each document is built this way (documents may contain UTF-8 chars, but I ignore that for now):

class Node < ActiveRecord::Base
  acts_as_ferret ...
end

class PageNode < Node
  def to_doc
    doc = super
    page.content_items.each { |item|
      item.to_doc(doc) if item.searchable?
    } if page
    doc
  end
end

class ContentItem
  def to_doc(doc)
    doc << Ferret::Document::Field.new(
      'content_item', self.content,
      Ferret::Document::Field::Store::NO,
      Ferret::Document::Field::Index::TOKENIZED)
  end
end

Content:

<h1>Huvudrubrik svart</h1>ldfkgjdflkgjdflkgjdflgkdflgkdflgkjdflkgj<br><br><h2>Huvudrubrik orange</h2>sdlkfjsdfkljsdlfksjdflsjflskfjslkfjslkdfsd<br>fsd fsdfsd<br>fsdfsdfsddfdsdfsdf<br><h3>Underrubrik svart</h3><p>dfgfgdfgdfgdfgdfgdfgdfgdf<br>gdfgdgdfgkjhdfkjghdkjgh dkjghd kgjhd kgfjh d<br></p><h4>Underrubrik orange</h4>lkdfjgldfkgjdlfkgjdlfkgjdflkgdfg<br>dfgdfgdfgdfgdfgdfg<br><br><h5>Styckerubrik svart</h5>fghfhfghfhkfjglhkjfglhkfjhlkfjghlfkhjflkgh jflgkhjflgkhf<br>ghfghfgh<br>fgh<br>fghfghfghgfh<br><br><h6>Styckerubrik orange</h6>fghkfgjhlfgkjhflkghj flghkjfgl hkjfg lhkfgjhlfgkhfghfgh<br>
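A rough way to flag this kind of keyboard-mash content before it ever reaches the index (purely a hypothetical heuristic, not anything Ferret provides): random keystroke runs tend to produce overly long tokens that almost never repeat, unlike natural language, so token length and vocabulary diversity give a cheap signal. `suspicious_content?` and its thresholds are made up for illustration:

```ruby
# Hypothetical pre-indexing filter: flag content whose tokens look like
# random keystrokes rather than natural language.
def suspicious_content?(text, max_token_len = 40, min_tokens = 50)
  # Strip markup first, then split into word-like tokens.
  tokens = text.gsub(/<[^>]*>/, ' ').scan(/[[:alnum:]]+/)
  return false if tokens.empty?
  # Natural-language words rarely exceed a few dozen characters.
  return true if tokens.any? { |t| t.length > max_token_len }
  # Natural text repeats words; a near-100% unique vocabulary over many
  # tokens is a red flag.
  tokens.length >= min_tokens && tokens.uniq.length.to_f / tokens.length > 0.9
end
```

Documents that trip the filter could be skipped or logged for manual review instead of being fed to the analyzer.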
Hi, Marcus, I don''t know too much about the internals of ferret. But I''m not too much surprised that ferret is choking on this ''content''. As all fulltext search engines ferret will presume that it''s human readable language that is going to be indexed. It would be only because of coincidence that tests of the stemming, analyzing (and so on) algorithms won''t fail, which results in lengthy parsings at least. Is it only because of problems to get ''real world'' test content? You''ll find loads of content on http://www.gutenberg.org/ for example... Regards Jan On 5/25/06, Marcus Andersson <m-lists at bristav.se> wrote:> > Jan Prill wrote: > > > > Regardless what I''ve said before: What performance are you experiencing > > with > > your pure ruby installation? How much datasets do you need to index > > initially? When (after how much datasets) are you experiencing the > > bottleneck? > > > After doing quite a bit more of testing it seems that speed seems to be > content dependant. The content is ugly test content it seems where > someone have just made random key strokes. > > Content that it shokes on is down at the end. > > /Marcus > > Each document is built this way (documents may contain UTF-8 chars but I > ignore that for now): > > class Node < ActiveRecord::Base > acts_as_ferret ... > end > > class PageNode < Node > def to_doc > doc = super > page.content_items.each { |item| item.to_doc(doc) if > item.searchable? 
} if page > doc > end > end > > class ContentItem > def to_doc(doc) > doc << Ferret::Document::Field.new( > ''content_item'', self.content, > Ferret::Document::Field::Store::NO, > Ferret::Document::Field::Index::TOKENIZED) > end > end > > Content: > > > > > > > > <h1>Huvudrubrik > svart</h1>ldfkgjdflkgjdflkgjdflgkdflgkdflgkjdflkgj<br><br><h2>Huvudrubrik > orange</h2>sdlkfjsdfkljsdlfksjdflsjflskfjslkfjslkdfsd<br>fsd > fsdfsd<br>fsdfsdfsddfdsdfsdf<br><h3>Underrubrik > svart</h3><p>dfgfgdfgdfgdfgdfgdfgdfgdf<br>gdfgdgdfgkjhdfkjghdkjgh dkjghd > kgjhd kgfjh d<br></p><h4>Underrubrik > > orange</h4>lkdfjgldfkgjdlfkgjdlfkgjdflkgdfg<br>dfgdfgdfgdfgdfgdfg<br><br><h5>Styckerubrik > svart</h5>fghfhfghfhkfjglhkjfglhkfjhlkfjghlfkhjflkgh > jflgkhjflgkhf<br>ghfghfgh<br>fgh<br>fghfghfghgfh<br><br><h6>Styckerubrik > orange</h6>fghkfgjhlfgkjhflkghj flghkjfgl hkjfg lhkfgjhlfgkhfghfgh<br> > > > > > > > -- > Posted via http://www.ruby-forum.com/. > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060525/81b01868/attachment-0001.htm
This is actually content from the customer's database. Most of the content in the database is real (it's actually in live deployment). The problem seems to be that they created a number of test pages in the beginning that are still there.

How do I, as a developer, ensure that the content isn't of a form that Ferret chokes on? I mean, even if I take the test data out now, I cannot guarantee that someone else won't put similar data into the database again. Then it's me, the developer, who will take the blame when search isn't working.

It must be possible to either:

- somehow test the data before indexing to ensure it's not "deadly", or
- have the indexing algorithm skip ahead after a (configurable) time if it's stuck on a small chunk of data (or something like it).

Would it help in this case to replace the <html> tags with spaces (as those aren't significant anyway)?

Regards
Marcus

ps. Thanks for the comments.
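The "skip after a configurable time" idea can be approximated in plain Ruby with the standard timeout library (a sketch only, not a Ferret feature; `index_with_budget` and the budget value are invented here):

```ruby
require 'timeout'

# Hypothetical wrapper: give each document a time budget and skip it if
# indexing (the yielded block, e.g. `index << doc`) runs over.
def index_with_budget(doc_id, budget_seconds)
  Timeout.timeout(budget_seconds) { yield }
  true
rescue Timeout::Error
  puts "Skipped document #{doc_id}: exceeded #{budget_seconds}s"
  false
end
```

One caveat: Ruby's Timeout interrupts the block from a watchdog thread, so aborting mid-write could leave the index in an inconsistent state; collecting the offending IDs and re-indexing them separately would be safer than relying on the interrupt alone.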
Marcus Andersson wrote:
> Would it help in this case to replace the <html> tags with spaces (as
> those aren't significant anyway)?

Answering myself here: no, it doesn't (after testing...).

Marcus
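For reference, the kind of tag stripping being discussed can be a one-line gsub (Marcus's actual code isn't shown in the thread; this is a minimal stand-in):

```ruby
# Minimal stand-in for the tag stripping discussed above (the actual
# implementation tried is not shown in the thread).
def strip_tags(html)
  html.gsub(/<[^>]*>/, ' ')
end

strip_tags("<h1>Huvudrubrik svart</h1>text<br>more")
# => " Huvudrubrik svart text more"
```

Since stripping the markup didn't change the timing, the suspicion falls on the random-keystroke tokens themselves rather than on the tags.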
More testing: this document (with several fields in it) took 15 seconds to index:

Field: new item
Field: Presentationsmaterial
Field: Ppt-presentationer
Field:
Field: new item
Field: new item
Field: new item

A bit long for that little content, if you ask me. I have several similar documents that take a lot of time ("new item" is an ugly default value that all content items get from the beginning, don't ask me why; does it affect indexing speed when a lot of documents contain similar tokens?).

But I don't know. I'm using the Ruby version, which is supposed to be slow. Maybe the super-fast C implementation would take 150 ms to handle a document of this size? What affects indexing speed?

Regards,
Marcus
Hi Marcus,

as you may read in http://ferret.davebalmain.com/trac/wiki/MyFirstBenchmark, indexing 408 MB of Project Gutenberg files took around a minute. That should give you an impression of the indexing speed.

I haven't got the time right now to test the performance on a Windows box and with cFerret; maybe someone else can jump in. But 15 seconds for this document is obviously strange.

cheers,
Jan

On 5/25/06, Marcus Andersson <m-lists at bristav.se> wrote:
> More testing: this document (with several fields in it) took 15 seconds
> to index:
>
> Field: new item
> Field: Presentationsmaterial
> Field: Ppt-presentationer
> Field:
> Field: new item
> Field: new item
> Field: new item
>
> A bit long for that little content, if you ask me.
Hi Marcus,

if it would be of any help to you and you've got the time to make some preparations, you might send me a test.sql (or a migration) with a little test data and your essential AR models. Then I can test it on a Windows box and we can compare the results...

cheers,
Jan

On 5/25/06, Jan Prill <jan.prill at gmail.com> wrote:
> I haven't got the time right now to test the performance on a Windows
> box and with cFerret; maybe someone else can jump in. But 15 seconds for
> this document is obviously strange.
Jan Prill wrote:
> if it would be of any help to you and you've got the time to make some
> preparations, you might send me a test.sql (or a migration) with a
> little test data and your essential AR models. Then I can test it on a
> Windows box and we can compare the results...

Thanks for your time. I think I'll wait for the Windows C version, though. I've implemented an ugly straight DB search for the time being.

Regards,
Marcus
On 5/26/06, Marcus Andersson <m-lists at bristav.se> wrote:
> This document (with several fields in it) took 15 seconds to index:
>
> Field: new item
> Field: Presentationsmaterial
> Field: Ppt-presentationer
> Field:
> Field: new item
> Field: new item
> Field: new item
>
> But I don't know. I'm using the Ruby version, which is supposed to be
> slow. Maybe the super-fast C implementation would take 150 ms to handle
> a document of this size? What affects indexing speed?

Hi Marcus,

I just tested this here:

require 'lib/rferret.rb'

include Ferret
include Ferret::Document
include Ferret::Index

doc = Document.new
doc << Field.new(:field, "new item")
doc << Field.new(:field, "Presentationsmaterial")
doc << Field.new(:field, "Ppt-presentationer")
doc << Field.new(:field, " ")
doc << Field.new(:field, "new item")
doc << Field.new(:field, "new item")
doc << Field.new(:field, "new item")

i = Index.new(:path => "index_dir")
i << doc
i.close

dbalmain at ubuntu:~/workspace/ferret $ time ruby test.rb

real    0m0.147s
user    0m0.125s
sys     0m0.022s

This is with the pure Ruby version. If this document is taking 15 seconds, then something is going wrong. Similarly, the bad data shouldn't hurt indexing speed considerably, although it will make your index larger than usual and merging will take a little longer. Could you post a simple test case that takes a long time for you?

Cheers,
Dave