I'm using acts_as_ferret to index one of my rails models. Right after I start the app the first request that orders by some ferret field will take very long. Subsequent ones seem to be fast. I guess some caching is going on. Any tips on solving this?

Pedro.

On 7/31/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> I'm using acts_as_ferret to index one of my rails models. Right after I
> start the app the first request that orders by some ferret field will
> take very long. Subsequent ones seem to be fast. I guess some caching is
> going on. Any tips on solving this?
>
> Pedro.

You guessed correctly. The sort fields are cached. You can easily preload the cache by running a search when you start up your app. You should also be careful what fields you sort on. You should only sort on untokenized fields.

You can also speed up sorting by dates by lowering the precision that you use. For example, if you are storing the date with time to the nearest second, e.g. 2006-08-01 10:13:24, you may get a much faster sort by only storing up to the nearest day, i.e. 2006-08-01.

By the way, what kind of times are we talking about here?

Cheers,
Dave

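To make the date-precision advice above concrete, here is a minimal plain-Ruby sketch. It does not touch Ferret at all: the sample timestamps are invented, and the idea that fewer distinct terms means less work when the sort cache is built is an assumption drawn from the advice, not a measured result.

```ruby
# Hypothetical sample data: one timestamp every 90 minutes,
# starting at the example time from the message above.
start = Time.utc(2006, 8, 1, 10, 13, 24)
timestamps = Array.new(160) { |i| start + i * 90 * 60 }

# Second precision: each document ends up with its own term.
second_terms = timestamps.map { |t| t.strftime("%Y-%m-%d %H:%M:%S") }

# Day precision: all documents from the same day share one term.
day_terms = timestamps.map { |t| t.strftime("%Y-%m-%d") }

puts "unique terms at second precision: #{second_terms.uniq.size}"
puts "unique terms at day precision:    #{day_terms.uniq.size}"
```

Truncating at index time only makes sense if nothing else needs the time of day from that field; a separate, coarser field used only for sorting would keep both.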
On Mon, 2006-07-31 at 19:17 +0900, David Balmain wrote:
> On 7/31/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> > I'm using acts_as_ferret to index one of my rails models. Right after I
> > start the app the first request that orders by some ferret field will
> > take very long. Subsequent ones seem to be fast. I guess some caching is
> > going on. Any tips on solving this?
> >
> > Pedro.
>
> You guessed correctly. The sort fields are cached. You can easily
> preload the cache by running a search when you start up your app. You
> should also be careful what fields you sort on. You should only sort
> on untokenized fields.

Is it ok if the field isn't stored in the index? Anyone know how to set a field to be untokenized in acts_as_ferret?

> You can also speed up sorting by dates by
> lowering the precision that you use. For example, if you are storing
> the date with time to the nearest second, eg 2006-08-01 10:13:24 you
> may get a much faster sort by only storing up to the nearest day, ie
> 2006-08-01.

I'm only using dates so it should be alright.

> By the way, what kind of times are we talking about here?

300 seconds for a 100MB index.

Pedro.

On Mon, 2006-07-31 at 11:26 +0100, Pedro Côrte-Real wrote:
> Anyone know how to set a field to be untokenized in acts_as_ferret?

I forgot that I was actually supplying my own #to_doc so it was a matter of changing it to not tokenize the fields I want. When using acts_as_ferret the regular way I don't know if this is possible.

Pedro.

On Mon, 2006-07-31 at 19:17 +0900, David Balmain wrote:
> By the way, what kind of times are we talking about here?

I added a preloading of this at the start of my app and it takes 14 minutes for a 100MB index with 4 fields I order by. Any way to speed this up? Shouldn't this be cached in the on-disk structure?

Don't think I'm being critical, ferret is great software, many thanks for it.

Pedro.

On Mon, Jul 31, 2006 at 04:10:03PM +0100, Pedro Côrte-Real wrote:
> On Mon, 2006-07-31 at 11:26 +0100, Pedro Côrte-Real wrote:
> > Anyone know how to set a field to be untokenized in acts_as_ferret?
>
> I forgot that I was actually supplying my own #to_doc so it was a matter
> of changing it to not tokenize the fields I want. When using
> acts_as_ferret the regular way I don't know if this is possible.

It is, just provide a hash with the desired options to each field name:

    acts_as_ferret(
      :fields => {
        'title' => { :boost => 2 },
        'description' => { :boost => 1,
          :index => Ferret::Document::Field::Index::UNTOKENIZED
        }
    })

Options that can be set this way are (with their defaults given):

    :store => Ferret::Document::Field::Store::NO
    :index => Ferret::Document::Field::Index::TOKENIZED
    :term_vector => Ferret::Document::Field::TermVector::NO
    :binary => false
    :boost => 1.0

Jens

--
webit! Gesellschaft für neue Medien mbH
www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
kraemer at webit.de
Schnorrstraße 76, D-01069 Dresden
Tel +49 351 46766 0
Fax +49 351 46766 66

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Mon, 2006-07-31 at 19:17 +0900, David Balmain wrote:
> > By the way, what kind of times are we talking about here?
>
> I added a preloading of this at the start of my app and it takes 14
> minutes for a 100MB index with 4 fields I order by. Any way to speed
> this up? Shouldn't this be cached in the on-disk structure?

How many documents and what is the date range (e.g. 2001-01-01 -> 2006-08-01)? These are the critical variables for sort performance. Once I know these numbers I'll be able to replicate the task here and I'll see what I can do.

> Don't think I'm being critical, ferret is great software, many thanks
> for it.

No offence taken. I'd definitely like to be able to help. I'm guessing I'll probably have to optimize the C code to rectify this.

Cheers,
Dave

On Mon, 2006-07-31 at 19:36 +0200, Jens Kraemer wrote:
> > I forgot that I was actually supplying my own #to_doc so it was a matter
> > of changing it to not tokenize the fields I want. When using
> > acts_as_ferret the regular way I don't know if this is possible.
>
> it is, just provide a hash with the desired options to each field name:
>
> acts_as_ferret(
>   :fields => {
>     'title' => { :boost => 2 },
>     'description' => { :boost => 1,
>       :index => Ferret::Document::Field::Index::UNTOKENIZED
>     }
> })
>
> options that can be set this way are (with their defaults given):
>
> :store => Ferret::Document::Field::Store::NO
> :index => Ferret::Document::Field::Index::TOKENIZED
> :term_vector => Ferret::Document::Field::TermVector::NO
> :binary => false
> :boost => 1.0

Cool. Didn't know about this. I started reading the code to understand how it worked but then remembered I was doing my own to_doc so I should just change that. I'll be sure to remember that for any future projects.

By the way, does storing the TermVectors only increase the size of the index or does it alter performance in any way?

Pedro.

On Tue, 2006-08-01 at 09:24 +0900, David Balmain wrote:
> How many documents and what is the date range (eg 2001-01-01 ->
> 2006-08-01). These are the critical variables for sort performance.
> Once I know these numbers I'll be able to replicate the task here and
> I'll see what I can do.

I have around 600_000 documents and the date range is rather large, something like from year 1000 to now. I don't know for sure but I can check if it makes a difference.

But not all my sort fields are dates. I also have regular text fields that I have now made untokenized (by using separate fields for sorting and searching). Got to check if that made them faster.

> > Don't think I'm being critical, ferret is great software, many thanks
> > for it.
>
> No offence taken. I'd definitely like to be able to help. I'm guessing
> I'll probably have to optimize the C code to rectify this.

That would be great.

Thanks,
Pedro.

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Mon, 2006-07-31 at 19:36 +0200, Jens Kraemer wrote:
> > > I forgot that I was actually supplying my own #to_doc so it was a matter
> > > of changing it to not tokenize the fields I want. When using
> > > acts_as_ferret the regular way I don't know if this is possible.
> >
> > it is, just provide a hash with the desired options to each field name:
> >
> > acts_as_ferret(
> >   :fields => {
> >     'title' => { :boost => 2 },
> >     'description' => { :boost => 1,
> >       :index => Ferret::Document::Field::Index::UNTOKENIZED
> >     }
> > })
> >
> > options that can be set this way are (with their defaults given):
> >
> > :store => Ferret::Document::Field::Store::NO
> > :index => Ferret::Document::Field::Index::TOKENIZED
> > :term_vector => Ferret::Document::Field::TermVector::NO
> > :binary => false
> > :boost => 1.0
>
> By the way, does storing the TermVectors only increase the size of the
> index or does it alter performance in any way?

It increases the size of the index and affects indexing performance since a lot of extra data needs to be written and merged during the indexing process. Search performance won't be affected.

Dave

On Tue, 2006-08-01 at 18:39 +0900, David Balmain wrote:
> > By the way, does storing the TermVectors only increase the size of the
> > index or does it alter performance in any way?
>
> It increases the size of the index and affects indexing performance
> since a lot of extra data needs to be written and merged during the
> indexing process. Search performance won't be affected.

Ah, but the default when creating a new field is already not to store it so I'm already doing it.

Pedro.

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Tue, 2006-08-01 at 09:24 +0900, David Balmain wrote:
> > How many documents and what is the date range (eg 2001-01-01 ->
> > 2006-08-01). These are the critical variables for sort performance.
> > Once I know these numbers I'll be able to replicate the task here and
> > I'll see what I can do.
>
> I have around 600_000 documents and the date range is rather large,
> something like from year 1000 to now. I don't know for sure but I can
> check if it makes a difference.
>
> But not all my sort fields are dates. I also have regular text fields
> that I have now made untokenized (by using separate fields for sorting
> and searching). Got to check if that made them faster.

Hmmm. Sounds like an interesting application. One solution would be to cache the sort index on disk. The problem with this is that the cache would still need to be recalculated every time you add more documents to the index so you'll still have the long wait occasionally. I'll look into it anyway at a later stage.

Another idea that I can implement now is to add a BYTES sort type which would basically sort by the order the terms appear in the index. Let's say you index dates in the format "YYYYMMDD" and you sort by INTEGER. Every time you load the sort index you need to go through every single date and convert it from string to integer. But this is unnecessary since the dates are already in order in the index. A BYTES sort type would take advantage of this.

You'd get an even bigger benefit for ASCII strings. strcoll is used to sort strings but this is unnecessary for ASCII strings as they are already correctly ordered in the index. Also, the index needs to keep each string in memory, which would also be unnecessary.

Sorry if this isn't very clear. I'm not sure how much it will help. We'll have to wait and see.

Dave

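The reasoning behind the proposed BYTES sort type can be checked with a short plain-Ruby sketch. It uses no Ferret API (BYTES is only a proposal at this point in the thread), and the sample dates and titles are invented. It shows that fixed-width "YYYYMMDD" strings come out in the same order whether compared as integers or as raw bytes, and that raw byte order stops being a natural order once non-ASCII or mixed-case text is involved.

```ruby
# Hypothetical sample dates stored as fixed-width "YYYYMMDD" terms.
dates = %w[20060801 10000101 19991231 20010101 15170310]

# INTEGER-style ordering: every term is converted to an integer first.
by_integer = dates.sort_by { |d| d.to_i }

# BYTES-style ordering: compare the raw bytes, no conversion at all.
by_bytes = dates.sort { |a, b| a.bytes <=> b.bytes }

puts "same order: #{by_integer == by_bytes}"

# Byte order is not a locale-aware order: uppercase ASCII sorts before
# lowercase, and the multi-byte "É" sorts after every ASCII letter.
titles = ["zoo", "Émile", "apple", "Zebra"]
puts titles.sort { |a, b| a.bytes <=> b.bytes }.inspect
```

The equivalence relies on every date having the same width; an unpadded year such as "999" would sort after "10000101" bytewise, so years would need zero-padding to four digits.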
On Tue, 2006-08-01 at 18:59 +0900, David Balmain wrote:
> Hmmm. Sounds like an interesting application. One solution would be to
> cache the sort index on disk. The problem with this is that the cache
> would still need to be recalculated every time you add more documents
> to the index so you'll still have the long wait occasionally. I'll
> look into it anyway at a later stage.

For my application this wouldn't really be a problem since data is only loaded maybe once a week. But does the cache need to be recalculated completely? Database indexes work incrementally.

> Another idea that I can implement now is to add a BYTES sort type
> which would basically sort by the order the terms appear in the index.
> Let's say you index dates in the format "YYYYMMDD" and you sort by
> INTEGER. Every time you load the sort index you need to go through
> every single date and convert it from string to integer. But this is
> unnecessary since the dates are already in order in the index. A BYTES
> sort type would take advantage of this.

For my date fields this would work.

> You'd get an even bigger
> benefit for ascii strings. strcoll is used to sort strings but this is
> unnecessary for ascii strings as they are already correctly ordered in
> the index. Also, the index needs to keep each string in memory which
> would also be unnecessary.

One of my text order fields should have nothing but ASCII. The other is a title and can include arbitrary UTF-8, so I guess it wouldn't work for that one.

> Sorry if this isn't very clear. I'm not sure how much it will help.
> We'll have to wait and see.

The BYTES ordering would probably speed it up but for my specific case, storing it on disk would be perfect. It would probably be a very good thing in case someone uses ferret to code command line tools that access a common index. Without storing the sorting on disk it will get recreated every time a command is run.

Pedro.

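For readers wondering what an on-disk sort cache might look like in principle, here is a minimal plain-Ruby sketch. It is not how Ferret works and uses no Ferret API: `load_sort_order`, the cache file name, and the integer "index version" are all hypothetical. It recomputes the whole ordering whenever the version changes, which is exactly the non-incremental behaviour discussed above.

```ruby
require "tmpdir"

# Hypothetical helper, not part of Ferret.
# Returns [doc ids ordered by field value, whether the cache was reused].
def load_sort_order(cache_path, index_version, field_values)
  if File.exist?(cache_path)
    cached = File.open(cache_path, "rb") { |f| Marshal.load(f) }
    return [cached[:order], true] if cached[:version] == index_version
  end

  # Cache missing or stale: recompute the full ordering (the slow step).
  order = field_values.each_index.sort_by { |doc_id| field_values[doc_id] }
  File.open(cache_path, "wb") do |f|
    Marshal.dump({ version: index_version, order: order }, f)
  end
  [order, false]
end

results = Dir.mktmpdir do |dir|
  path = File.join(dir, "date.sortcache")
  values = %w[20060801 10000101 19991231]

  first  = load_sort_order(path, 1, values)                # computed and saved
  second = load_sort_order(path, 1, values)                # read back from disk
  third  = load_sort_order(path, 2, values + ["20010101"]) # index changed
  [first, second, third]
end

results.each { |order, reused| puts "order=#{order.inspect} reused=#{reused}" }
```

A separate process, such as a command line tool sharing the index, would hit the reused branch as long as the version stamp is unchanged, which is the scenario described in the message above.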
On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Tue, 2006-08-01 at 18:59 +0900, David Balmain wrote:
> > Hmmm. Sounds like an interesting application. One solution would be to
> > cache the sort index on disk. The problem with this is that the cache
> > would still need to be recalculated every time you add more documents
> > to the index so you'll still have the long wait occasionally. I'll
> > look into it anyway at a later stage.
>
> For my application this wouldn't really be a problem since data is only
> loaded maybe once a week. But does the cache need to be recalculated
> completely? Database indexes work incrementally.

Sure it's possible but it's a fair bit of work. Lucene doesn't have anything like this yet (not that that has stopped me adding features before). I'll think about it.

Dave

On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> On Tue, 2006-08-01 at 18:59 +0900, David Balmain wrote:
> > Hmmm. Sounds like an interesting application. One solution would be to
> > cache the sort index on disk. The problem with this is that the cache
> > would still need to be recalculated every time you add more documents
> > to the index so you'll still have the long wait occasionally. I'll
> > look into it anyway at a later stage.
>
> For my application this wouldn't really be a problem since data is only
> loaded maybe once a week. But does the cache need to be recalculated
> completely? Database indexes work incrementally.

Have you tried optimizing your index? I found an order of magnitude difference in speed here with an optimized index. Even with 1,000,000 unique documents, though, sorting is taking less than 10 seconds for an unoptimized index and less than 1 second for an optimized index. What kind of system are you running on?

Dave

On Wed, 2006-08-02 at 11:13 +0900, David Balmain wrote:
> On 8/1/06, Pedro Côrte-Real <Pedro.CorteReal at iantt.pt> wrote:
> > On Tue, 2006-08-01 at 18:59 +0900, David Balmain wrote:
> > > Hmmm. Sounds like an interesting application. One solution would be to
> > > cache the sort index on disk. The problem with this is that the cache
> > > would still need to be recalculated every time you add more documents
> > > to the index so you'll still have the long wait occasionally. I'll
> > > look into it anyway at a later stage.
> >
> > For my application this wouldn't really be a problem since data is only
> > loaded maybe once a week. But does the cache need to be recalculated
> > completely? Database indexes work incrementally.
>
> Have you tried optimizing your index? I found an order of magnitude
> difference in speed here with an optimized index. Even with 1,000,000
> unique documents though sorting is taking less than 10 seconds for an
> unoptimized index and less than 1 second for optimized index. What
> kind of system are you running on?

I was guessing acts_as_ferret did that. But apparently it does so only on rebuild_index. I'll try adding an optimize call at the start of the app.

I'm running this on a 2.66 GHz Celeron with 1 GB RAM.

Pedro.

On Wed, 2006-08-02 at 11:13 +0900, David Balmain wrote:
> Have you tried optimizing your index? I found an order of magnitude
> difference in speed here with an optimized index. Even with 1,000,000
> unique documents though sorting is taking less than 10 seconds for an
> unoptimized index and less than 1 second for optimized index. What
> kind of system are you running on?

You were right. I benchmarked it and preloading the indexes is about 10x faster, even counting the time to run #optimize. Thanks for the tip.

Pedro.