thr3ads.net - Ferret talk - [Ferret-talk] Sorting performance [Jul 2006]

Pedro Côrte-Real

2006-Jul-31 10:05 UTC

[Ferret-talk] Sorting performance

I'm using acts_as_ferret to index one of my rails models. Right after I
start the app the first request that orders by some ferret field will
take very long. Subsequent ones seem to be fast. I guess some caching is
going on. Any tips on solving this?

Pedro.

David Balmain

2006-Jul-31 10:17 UTC

On 7/31/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> I'm using acts_as_ferret to index one of my rails models. Right after I
> start the app the first request that orders by some ferret field will
> take very long. Subsequent ones seem to be fast. I guess some caching is
> going on. Any tips on solving this?
>
> Pedro.

You guessed correctly. The sort fields are cached. You can easily
preload the cache by running a search when you start up your app. You
should also be careful what fields you sort on. You should only sort
on untokenized fields. You can also speed up sorting by dates by
lowering the precision that you use. For example, if you are storing
the date with time to the nearest second, eg 2006-08-01 10:13:24 you
may get a much faster sort by only storing up to the nearest day, ie
2006-08-01. By the way, what kind of times are we talking about here?

Cheers,
Dave
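
To illustrate the point about date precision: the plain-Ruby sketch below makes no Ferret calls, and its sample timestamps are made up. It counts how many distinct terms a date field would produce at second precision versus day precision. The assumption, hedged here, is that fewer distinct terms means less work when the sort cache is first loaded.

```ruby
# Hypothetical sample data: one timestamp per hour across three days.
start = Time.utc(2006, 8, 1, 0, 0, 0)
timestamps = (0...72).map { |hour| start + hour * 3600 }

# The same values rendered at two precisions, as they might be indexed.
second_terms = timestamps.map { |t| t.strftime("%Y-%m-%d %H:%M:%S") }
day_terms    = timestamps.map { |t| t.strftime("%Y-%m-%d") }

unique_second_terms = second_terms.uniq.size
unique_day_terms    = day_terms.uniq.size

puts "second precision: #{unique_second_terms} distinct terms"
puts "day precision:    #{unique_day_terms} distinct terms"
```

With this sample every timestamp is a distinct term at second precision, while day precision collapses them into one term per calendar day.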

Pedro Côrte-Real

2006-Jul-31 10:26 UTC

On Mon, 2006-07-31 at 19:17 +0900, David Balmain wrote:
> On 7/31/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> > I'm using acts_as_ferret to index one of my rails models. Right after I
> > start the app the first request that orders by some ferret field will
> > take very long. Subsequent ones seem to be fast. I guess some caching is
> > going on. Any tips on solving this?
> >
> > Pedro.
>
> You guessed correctly. The sort fields are cached. You can easily
> preload the cache by running a search when you start up your app. You
> should also be careful what fields you sort on. You should only sort
> on untokenized fields.

Is it ok if the field isn't stored in the index?

Anyone know how to set a field to be untokenized in acts_as_ferret?

> You can also speed up sorting by dates by
> lowering the precision that you use. For example, if you are storing
> the date with time to the nearest second, eg 2006-08-01 10:13:24 you
> may get a much faster sort by only storing up to the nearest day, ie
> 2006-08-01.

I'm only using dates so it should be alright.

> By the way, what kind of times are we talking about here?

300 seconds for a 100MB index.

Pedro.

Pedro Côrte-Real

2006-Jul-31 15:10 UTC

On Mon, 2006-07-31 at 11:26 +0100, Pedro Côrte-Real wrote:
> Anyone know how to set a field to be untokenized in acts_as_ferret?

I forgot that I was actually supplying my own #to_doc so it was a matter
of changing it to not tokenize the fields I want. When using
acts_as_ferret the regular way I don't know if this is possible.

Pedro.

Pedro Côrte-Real

2006-Jul-31 15:11 UTC

On Mon, 2006-07-31 at 19:17 +0900, David Balmain wrote:
> By the way, what kind of times are we talking about here?

I added a preloading of this at the start of my app and it takes 14
minutes for a 100MB index with 4 fields I order by. Any way to speed
this up? Shouldn't this be cached in the on-disk structure?

Don't think I'm being critical, ferret is great software, many thanks
for it.

Pedro.
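
The preloading described in this thread can be pictured with a toy model. The sketch below is plain Ruby and does not use Ferret; `ToyIndex`, `sorted_ids` and `cache_builds` are hypothetical names invented for illustration. It assumes only what the thread states: the sort data for a field is built the first time that field is used for ordering and is reused afterwards.

```ruby
# Hypothetical stand-in for an index that builds a per-field sort cache
# lazily, the first time that field is used for ordering.
class ToyIndex
  attr_reader :cache_builds

  def initialize(docs)
    @docs = docs          # array of hashes, one per document
    @sort_cache = {}      # field name => array of sort keys, indexed by doc id
    @cache_builds = 0     # how many times the expensive step has run
  end

  # Return document ids ordered by the given field.
  def sorted_ids(field)
    keys = (@sort_cache[field] ||= build_cache(field))
    (0...@docs.size).sort_by { |id| keys[id] }
  end

  private

  # The expensive step: visit every document once to collect its sort key.
  def build_cache(field)
    @cache_builds += 1
    @docs.map { |doc| doc[field] }
  end
end

docs = [
  { title: "b", date: "20060801" },
  { title: "a", date: "20060731" },
  { title: "c", date: "20060802" }
]
index = ToyIndex.new(docs)

# Warm-up at application start: touch every field used for ordering.
[:title, :date].each { |field| index.sorted_ids(field) }
builds_after_warmup = index.cache_builds

# Later requests reuse the cached keys and trigger no further builds.
by_title = index.sorted_ids(:title)
by_date  = index.sorted_ids(:date)
```

The warm-up moves the cost to startup rather than removing it, which matches the long preload time reported above.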

Jens Kraemer

2006-Jul-31 17:36 UTC

On Mon, Jul 31, 2006 at 04:10:03PM +0100, Pedro Côrte-Real wrote:
> On Mon, 2006-07-31 at 11:26 +0100, Pedro Côrte-Real wrote:
> > Anyone know how to set a field to be untokenized in acts_as_ferret?
>
> I forgot that I was actually supplying my own #to_doc so it was a matter
> of changing it to not tokenize the fields I want. When using
> acts_as_ferret the regular way I don't know if this is possible.

it is, just provide a hash with the desired options to each field name:

acts_as_ferret(
  :fields => {
    'title'       => { :boost => 2 },
    'description' => { :boost => 1,
                       :index => Ferret::Document::Field::Index::UNTOKENIZED
                     }
  })

options that can be set this way are (with their defaults given):

  :store       => Ferret::Document::Field::Store::NO
  :index       => Ferret::Document::Field::Index::TOKENIZED
  :term_vector => Ferret::Document::Field::TermVector::NO
  :binary      => false
  :boost       => 1.0


Jens

-- 
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66

David Balmain

2006-Aug-01 00:24 UTC

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Mon, 2006-07-31 at 19:17 +0900, David Balmain wrote:
> > By the way, what kind of times are we talking about here?
>
> I added a preloading of this at the start of my app and it takes 14
> minutes for a 100MB index with 4 fields I order by. Any way to speed
> this up? Shouldn't this be cached in the on-disk structure?

How many documents and what is the date range (eg 2001-01-01 ->
2006-08-01). These are the critical variables for sort performance.
Once I know these numbers I'll be able to replicate the task here and
I'll see what I can do.

> Don't think I'm being critical, ferret is great software, many thanks
> for it.

No offence taken. I'd definitely like to be able to help. I'm guessing
I'll probably have to optimize the C code to rectify this.

Cheers,
Dave

Pedro Côrte-Real

2006-Aug-01 09:30 UTC

On Mon, 2006-07-31 at 19:36 +0200, Jens Kraemer wrote:
> > I forgot that I was actually supplying my own #to_doc so it was a matter
> > of changing it to not tokenize the fields I want. When using
> > acts_as_ferret the regular way I don't know if this is possible.
>
> it is, just provide a hash with the desired options to each field name:
>
> acts_as_ferret(
>   :fields => {
>     'title'       => { :boost => 2 },
>     'description' => { :boost => 1,
>                        :index => Ferret::Document::Field::Index::UNTOKENIZED
>                      }
>   })
>
> options that can be set this way are (with their defaults given):
>
>   :store       => Ferret::Document::Field::Store::NO
>   :index       => Ferret::Document::Field::Index::TOKENIZED
>   :term_vector => Ferret::Document::Field::TermVector::NO
>   :binary      => false
>   :boost       => 1.0

Cool. Didn't know about this. I started reading the code to understand
how it worked but then remembered I was doing my own to_doc so I should
just change that. I'll be sure to remember that for any future
projects.

By the way, does storing the TermVectors only increase the size of the
index or does it alter performance in any way?

Pedro.

Pedro Côrte-Real

2006-Aug-01 09:36 UTC

On Tue, 2006-08-01 at 09:24 +0900, David Balmain wrote:
> How many documents and what is the date range (eg 2001-01-01 ->
> 2006-08-01). These are the critical variables for sort performance.
> Once I know these numbers I'll be able to replicate the task here and
> I'll see what I can do.

I have around 600_000 documents and the date range is rather large,
something like from year 1000 to now. I don't know for sure but I can
check if it makes a difference.

But not all my sort fields are dates. I also have regular text fields
that I have now made untokenized (by using separate fields for sorting
and searching). Got to check if that made them faster.

> > Don't think I'm being critical, ferret is great software, many thanks
> > for it.
>
> No offence taken. I'd definitely like to be able to help. I'm guessing
> I'll probably have to optimize the C code to rectify this.

That would be great,

Thanks,

Pedro.

David Balmain

2006-Aug-01 09:39 UTC

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Mon, 2006-07-31 at 19:36 +0200, Jens Kraemer wrote:
> > > I forgot that I was actually supplying my own #to_doc so it was a matter
> > > of changing it to not tokenize the fields I want. When using
> > > acts_as_ferret the regular way I don't know if this is possible.
> >
> > it is, just provide a hash with the desired options to each field name:
> >
> > acts_as_ferret(
> >   :fields => {
> >     'title'       => { :boost => 2 },
> >     'description' => { :boost => 1,
> >                        :index => Ferret::Document::Field::Index::UNTOKENIZED
> >                      }
> >   })
> >
> > options that can be set this way are (with their defaults given):
> >
> >   :store       => Ferret::Document::Field::Store::NO
> >   :index       => Ferret::Document::Field::Index::TOKENIZED
> >   :term_vector => Ferret::Document::Field::TermVector::NO
> >   :binary      => false
> >   :boost       => 1.0
>
> By the way, does storing the TermVectors only increase the size of the
> index or does it alter performance in any way?

It increases the size of the index and affects indexing performance
since a lot of extra data needs to be written and merged during the
indexing process. Search performance won't be affected.

Dave

Pedro Côrte-Real

2006-Aug-01 09:52 UTC

On Tue, 2006-08-01 at 18:39 +0900, David Balmain wrote:
> > By the way, does storing the TermVectors only increase the size of the
> > index or does it alter performance in any way?
>
> It increases the size of the index and affects indexing performance
> since a lot of extra data needs to be written and merged during the
> indexing process. Search performance won't be affected.

Ah, but the default when creating a new field is already not to store it
so I'm already doing it.

Pedro.

David Balmain

2006-Aug-01 09:59 UTC

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Tue, 2006-08-01 at 09:24 +0900, David Balmain wrote:
> > How many documents and what is the date range (eg 2001-01-01 ->
> > 2006-08-01). These are the critical variables for sort performance.
> > Once I know these numbers I'll be able to replicate the task here and
> > I'll see what I can do.
>
> I have around 600_000 documents and the date range is rather large,
> something like from year 1000 to now. I don't know for sure but I can
> check if it makes a difference.
>
> But not all my sort fields are dates. I also have regular text fields
> that I have now made untokenized (by using separate fields for sorting
> and searching). Got to check if that made them faster.

Hmmm. Sounds like an interesting application. One solution would be to
cache the sort index on disk. The problem with this is that the cache
would still need to be recalculated every time you add more documents
to the index so you'll still have the long wait occasionally. I'll
look into it anyway at a later stage.

Another idea that I can implement now is to add a BYTES sort type
which would basically sort by the order the terms appear in the index.
Let's say you index dates in the format "YYYYMMDD" and you sort by
INTEGER. Every time you load the sort index you need to go through
every single date and convert it from string to integer. But this is
unnecessary since the dates are already in order in the index. A BYTES
sort type would take advantage of this. You'd get an even bigger
benefit for ascii strings. strcoll is used to sort strings but this is
unnecessary for ascii strings as they are already correctly ordered in
the index. Also, the index needs to keep each string in memory which
would also be unnecessary.

Sorry if this isn't very clear. I'm not sure how much it will help.
We'll have to wait and see.

Dave
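
The observation behind the proposed BYTES sort type can be checked in plain Ruby. The sketch below makes no Ferret calls and uses made-up sample dates. It shows that fixed-width "YYYYMMDD" strings come out in the same order whether compared as bytes or converted to integers, and that this stops holding once the values are not zero-padded to a common width.

```ruby
# Hypothetical sample dates in fixed-width "YYYYMMDD" form.
dates = %w[20060801 10000101 19991231 20010101]

by_bytes   = dates.sort              # String comparison, byte by byte
by_integer = dates.sort_by(&:to_i)   # convert each term to an integer first

same_order = (by_bytes == by_integer)

# Counterexample: without a common width the two orders diverge.
unpadded         = %w[9 10 2]
unpadded_bytes   = unpadded.sort
unpadded_integer = unpadded.sort_by(&:to_i)

puts "fixed-width dates agree: #{same_order}"
puts "unpadded by bytes:   #{unpadded_bytes.inspect}"
puts "unpadded by integer: #{unpadded_integer.inspect}"
```

For fixed-width values the integer conversion adds work without changing the result, which is the saving a byte-order sort would aim for.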

Pedro Côrte-Real

2006-Aug-01 10:08 UTC

On Tue, 2006-08-01 at 18:59 +0900, David Balmain wrote:
> Hmmm. Sounds like an interesting application. One solution would be to
> cache the sort index on disk. The problem with this is that the cache
> would still need to be recalculated every time you add more documents
> to the index so you'll still have the long wait occasionally. I'll
> look into it anyway at a later stage.

For my application this wouldn't really be a problem since data is only
loaded maybe once a week. But does the cache need to be recalculated
completely? Database indexes work incrementally.

> Another idea that I can implement now is to add a BYTES sort type
> which would basically sort by the order the terms appear in the index.
> Let's say you index dates in the format "YYYYMMDD" and you sort by
> INTEGER. Every time you load the sort index you need to go through
> every single date and convert it from string to integer. But this is
> unnecessary since the dates are already in order in the index. A BYTES
> sort type would take advantage of this.

For my date fields this would work.

> You'd get an even bigger
> benefit for ascii strings. strcoll is used to sort strings but this is
> unnecessary for ascii strings as they are already correctly ordered in
> the index. Also, the index needs to keep each string in memory which
> would also be unnecessary.

One of my text order fields should have nothing but ASCII. The other is
a title and can include arbitrary UTF-8, so I guess it wouldn't work for
that one.

> Sorry if this isn't very clear. I'm not sure how much it will help.
> We'll have to wait and see.

The BYTES ordering would probably speed it up but for my specific case,
storing it on disk would be perfect. It would probably be a very good
thing in case someone uses ferret to code command line tools that access
a common index. Without storing the sorting on disk it will get
recreated every time a command is run.

Pedro.
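
The concern about UTF-8 titles can be seen with a small plain-Ruby sketch that makes no Ferret calls and uses made-up titles. Byte order places an accented capital after every ASCII letter. The accent-stripping key used for comparison is only a crude stand-in for locale-aware collation such as strcoll, not an equivalent of it.

```ruby
# Hypothetical titles; "\u00C9mile" is "Emile" with an acute accent on the E.
titles = ["Zola", "\u00C9mile", "Adam"]

# Byte order: the accented letter is encoded as bytes above the ASCII range,
# so it sorts after "Zola".
byte_order = titles.sort

# Crude approximation of a human-friendly order: decompose accented letters,
# drop the combining marks, and compare case-insensitively.
fold = ->(s) { s.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase }
folded_order = titles.sort_by { |t| fold.call(t) }

puts "byte order:   #{byte_order.inspect}"
puts "folded order: #{folded_order.inspect}"
```

For a field that holds only ASCII, as with the other text field mentioned above, the byte order raises no such issue.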

David Balmain

2006-Aug-01 10:32 UTC

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Tue, 2006-08-01 at 18:59 +0900, David Balmain wrote:
> > Hmmm. Sounds like an interesting application. One solution would be to
> > cache the sort index on disk. The problem with this is that the cache
> > would still need to be recalculated every time you add more documents
> > to the index so you'll still have the long wait occasionally. I'll
> > look into it anyway at a later stage.
>
> For my application this wouldn't really be a problem since data is only
> loaded maybe once a week. But does the cache need to be recalculated
> completely? Database indexes work incrementally.

Sure it's possible but it's a fair bit of work. Lucene doesn't have
anything like this yet (not that that has stopped me adding features
before). I'll think about it.

Dave

David Balmain

2006-Aug-02 02:13 UTC

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Tue, 2006-08-01 at 18:59 +0900, David Balmain wrote:
> > Hmmm. Sounds like an interesting application. One solution would be to
> > cache the sort index on disk. The problem with this is that the cache
> > would still need to be recalculated every time you add more documents
> > to the index so you'll still have the long wait occasionally. I'll
> > look into it anyway at a later stage.
>
> For my application this wouldn't really be a problem since data is only
> loaded maybe once a week. But does the cache need to be recalculated
> completely? Database indexes work incrementally.

Have you tried optimizing your index? I found an order of magnitude
difference in speed here with an optimized index. Even with 1,000,000
unique documents though sorting is taking less than 10 seconds for an
unoptimized index and less than 1 second for optimized index. What
kind of system are you running on?

Dave

Pedro Côrte-Real

2006-Aug-02 09:13 UTC

On Wed, 2006-08-02 at 11:13 +0900, David Balmain wrote:
> On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> > On Tue, 2006-08-01 at 18:59 +0900, David Balmain wrote:
> > > Hmmm. Sounds like an interesting application. One solution would be to
> > > cache the sort index on disk. The problem with this is that the cache
> > > would still need to be recalculated every time you add more documents
> > > to the index so you'll still have the long wait occasionally. I'll
> > > look into it anyway at a later stage.
> >
> > For my application this wouldn't really be a problem since data is only
> > loaded maybe once a week. But does the cache need to be recalculated
> > completely? Database indexes work incrementally.
>
> Have you tried optimizing your index? I found an order of magnitude
> difference in speed here with an optimized index. Even with 1,000,000
> unique documents though sorting is taking less than 10 seconds for an
> unoptimized index and less than 1 second for optimized index. What
> kind of system are you running on?

I was guessing acts_as_ferret did that. But apparently only on
rebuild_index. I'll try adding an optimize call at the start of the
app.

I'm running this on a 2.66 GHz Celeron with 1GB ram.

Pedro.

Pedro Côrte-Real

2006-Aug-03 15:24 UTC

On Wed, 2006-08-02 at 11:13 +0900, David Balmain wrote:
> Have you tried optimizing your index? I found an order of magnitude
> difference in speed here with an optimized index. Even with 1,000,000
> unique documents though sorting is taking less than 10 seconds for an
> unoptimized index and less than 1 second for optimized index. What
> kind of system are you running on?

You were right. I benchmarked it at about 10x faster to preload the
indexes, even counting the time to run #optimize.

Thanks for the tip.

Pedro.
