Hi,
here's what I did to speed up index rebuilds:
1. Database
------------------
Use raw SQL to retrieve objects from the DB with
ModelClass.connection.select_all() instead of
ModelClass.find_by_sql(). This prevents an ActiveRecord object from
being instantiated for every row, which is fairly expensive.
Instead of ActiveRecord objects, select_all() returns hashes with
string keys, so you have to access the values with object['name']
instead of object.name. Another consequence is that helper methods
such as
def full_name
   last_name + ' ' + first_name
end
are no longer available. You have to do this in SQL instead:
"SELECT ..., CONCAT(last_name, ' ', first_name) AS
full_name, ...".
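A minimal sketch of the difference, assuming a hypothetical Person
model with first_name and last_name columns (CONCAT is MySQL syntax):

# Instantiates a full ActiveRecord object per row -- expensive:
people = Person.find_by_sql("SELECT * FROM people")
name = people.first.last_name

# Returns an array of plain hashes with string keys -- much cheaper:
rows = Person.connection.select_all(
  "SELECT id, CONCAT(last_name, ' ', first_name) AS full_name " +
  "FROM people")
name = rows.first['full_name']   # computed in SQL, accessed as a hash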
Another important point is associations. If you index objects with a
has_many association, the objects themselves are loaded in a batch
statement but the associated objects are loaded separately.
Assume, for example, you index Articles which have many Comments. If
the Articles are loaded in batches of 1000, there will be another
1000 statements per batch for loading the comments.
With ActiveRecord find() you can use the :include option as a remedy.  
With raw SQL you have to use joins and grouping to get all the data  
you need in one row. Hint: have a look at the group functions that  
your RDBMS supports.
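For the Articles/Comments example above, here's a rough sketch using
MySQL's GROUP_CONCAT (table and column names are made up, and note
that GROUP_CONCAT truncates its result at group_concat_max_len, so
you may have to raise that server setting):

rows = Article.connection.select_all(
  "SELECT articles.id, articles.title,
          GROUP_CONCAT(comments.body SEPARATOR ' ') AS comments_text
   FROM articles
   LEFT JOIN comments ON comments.article_id = articles.id
   GROUP BY articles.id, articles.title")
# One row per article, with all comment bodies already concatenated:
text = rows.first['comments_text']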
Select only the columns that you need for indexing. Instead of
"SELECT products.*" use "SELECT products.name,
products.description, ...". This is especially important if your
table includes BLOB or CLOB columns.
Bottom line: Brush up your SQL skills and do as much work as possible  
inside the database. ActiveRecord is definitely not made for batch  
processing, especially when dealing with millions of rows.
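To illustrate, here's a rough sketch of a batched rebuild loop tying
the above together (names are made up; keyset pagination on id would
beat LIMIT/OFFSET on very large tables):

index = Ferret::Index::Index.new(:path => '/path/to/index')
batch_size = 1000
offset = 0
loop do
  rows = Article.connection.select_all(
    "SELECT id, title, body FROM articles " +
    "ORDER BY id LIMIT #{batch_size} OFFSET #{offset}")
  break if rows.empty?
  rows.each do |row|
    # Ferret's Index#<< accepts a hash of fields:
    index << { :id => row['id'], :title => row['title'],
               :body => row['body'] }
  end
  offset += batch_size
end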
2. Analyzers
------------------
Use fast analyzers if possible. StandardAnalyzer is fine for most
purposes, but it includes HyphenFilter and StopFilter, which are
slow. Consider using the lightweight LetterAnalyzer or
WhiteSpaceAnalyzer, or even no analyzer at all for particular
fields. You can set individual analyzers for different fields using
PerFieldAnalyzer.
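For instance, a small sketch (the field names are made up):

include Ferret::Analysis

pfa = PerFieldAnalyzer.new(StandardAnalyzer.new)  # default analyzer
pfa[:name] = WhiteSpaceAnalyzer.new               # cheap analyzer for :name
index = Ferret::Index::Index.new(:path => '/path/to/index',
                                 :analyzer => pfa)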
Don't be reluctant to write your own custom analyzer. I'm using an
analyzer written in pure Ruby based on regular expressions. For its
particular purpose, it's as fast as StandardAnalyzer but yields
better results. Text processing in Ruby is actually pretty fast.
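If one regular expression per token is enough, Ferret's built-in
RegExpAnalyzer gives you this without writing a full analyzer class
(the pattern below is just an illustration):

include Ferret::Analysis

# Tokens are runs of word characters, lowercased:
analyzer = RegExpAnalyzer.new(/\w+/, true)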
3. Ferret
------------
Tweak the Index parameters for rebuilds. See the Ferret documentation
for what you can do. Here's an example:
i = Ferret::Index::Index.new(
    ...,
    :max_buffer_memory => 0x8000000,   # 128 MB
    :max_buffered_docs => 0x10000,     # 65,536
    :merge_factor => 0x10,             # 16
    :term_index_interval => 0x40,      # 64
    :doc_skip_interval => 0x8          # 8
)
Experiment with these values, but be careful! Setting them
arbitrarily can cause Ferret to crash. Start with the values above
and tweak one at a time.
This is the last optimization measure. Don't bother fiddling with
these parameters until you have got steps 1 and 2 right.
Hope this helps.
Cheers.
Andreas
On 14.07.2007, at 13:14, golak Sarangi wrote:
> I have got my database in Mysql. I used ferret to index a table  
> with 10 million rows. On limiting the selection of data to 1000  
> initial retrieval, it takes 200 seconds but for the whole table it  
> took more than four hours and after which i had to close my  
> indexing application. I used the StandardAnalyser for it. There is  
> no problem from the database side as retrieval of all the data in  
> the table into a document takes very less time. So please suggest  
> me ways to overcome it.