Daniel Einspanjer
2007-Jun-07 17:19 UTC
[Ferret-talk] Advise on slowness in bootstrapping?
I am looking at trying to use ferret/aaf to supplement my querying against a medium and large table with lots of columns. Some facts first:

Ferret 0.11.4
AAF 0.4.0
Ruby 1.8.6
Rails 1.2.3

Medium table:
105,464 rows
168 columns (mostly varchar(20))
11 actual columns indexed in aaf plus
40 virtual columns indexed in aaf (virtual is concat of two physical columns, e.g. cast_first_name_1 + cast_last_name_1 through cast_first_name_20 + cast_last_name_20)

Large table:
1,244,716 rows
same column/index structure

These tables are not updated via Ruby, only read. I am trying to use rebuild_index to bootstrap the medium sized table and it is taking a very long time (running for about 4 hours, indicates 50% complete with 4 hours remaining) and creating a massive number of files in the index directory (currently about 65k, was 90k earlier).

I have not done any tuning of ferret/aaf so far, and I fear what it will look like to do the big table. Does anyone have any advise on how to speed this process up?

Because the tables are updated by an external batch process, if I were to continue down this ferret/aaf path, I'd have to be looking at running this rebuild_index a couple of times per week, which would be rather painful given the present time and might not be possible if the large table took more than 48 hours...
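The virtual columns described above (one per cast slot, built from a first-name and a last-name column) could be generated rather than written out by hand. The following is only a sketch in plain Ruby, not aaf configuration: the class ProgramRow, the method names cast_name_N, and the choice to join the two parts with a space are all assumptions made for illustration.

```ruby
# Stand-in for a database row; only the cast name columns are modeled.
# ProgramRow and cast_name_N are hypothetical names, not part of aaf.
class ProgramRow
  CAST_SLOTS = (1..20)

  def initialize(attrs)
    @attrs = attrs
  end

  CAST_SLOTS.each do |n|
    # Physical columns: cast_first_name_N and cast_last_name_N.
    define_method("cast_first_name_#{n}") { @attrs["cast_first_name_#{n}"] }
    define_method("cast_last_name_#{n}")  { @attrs["cast_last_name_#{n}"] }

    # Virtual column: first and last name joined by a space (an assumption),
    # skipping whichever part is nil or empty.
    define_method("cast_name_#{n}") do
      [send("cast_first_name_#{n}"), send("cast_last_name_#{n}")]
        .compact.reject(&:empty?).join(" ")
    end
  end

  # Names of the generated virtual columns, e.g. for passing to an indexer.
  def self.virtual_fields
    CAST_SLOTS.map { |n| :"cast_name_#{n}" }
  end
end

row = ProgramRow.new("cast_first_name_1" => "Ada", "cast_last_name_1" => "Lovelace")
puts row.cast_name_1
puts ProgramRow.virtual_fields.length
```

Each virtual method is computed per row at indexing time, so 40 such fields mean 40 extra method calls and string joins for every document built.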
Daniel Einspanjer
2007-Jun-07 18:07 UTC
[Ferret-talk] Advise on slowness in bootstrapping?
p.s. Please forgive my lack of attention to the changes I let the spell checker make. All instances of the verb advise should be mentally replaced with the noun advice. :)
On Thu, Jun 07, 2007 at 05:19:26PM +0000, Daniel Einspanjer wrote:
> I am looking at trying to use ferret/aaf to supplement my querying against a
> medium and large table with lots of columns. Some facts first:
>
> Ferret 0.11.4
> AAF 0.4.0
> Ruby 1.8.6
> Rails 1.2.3
>
> Medium table:
> 105,464 rows
> 168 columns (mostly varchar(20))
> 11 actual columns indexed in aaf plus
> 40 virtual columns indexed in aaf (virtual is concat of two physical columns.
> e.g. cast_first_name_1 + cast_last_name_1 through cast_first_name_20 +
> cast_last_name_20)
>
> Large table:
> 1,244,716 rows
> same column/index structure
>
> These tables are not updated via Ruby, only read. I am trying to use
> rebuild_index to bootstrap the medium sized table and it is taking a very long
> time (running for about 4 hours, indicates 50% complete with 4 hours remaining)
> and creating a massive number of files in the index directory (currently about
> 65k, was 90k earlier)

Strange. Ferret is faster than that - I have a test script that builds an index of 100000 documents with 50 fields, each containing a single random word, in under 10 minutes here on standard hardware.

Maybe the problem is something else? For starters, change line 220 of local_index.rb from

  index << rec.to_doc if rec.ferret_enabled?(true)

to

  doc = rec.to_doc if rec.ferret_enabled?(true)

so nothing is added to the index. How long does that take?

Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
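The experiment suggested above separates the cost of building each document from the cost of writing it to the index. The same comparison can be sketched outside the plugin with Ruby's standard Benchmark module. FakeRecord and FakeIndex below are stand-ins invented for this sketch; they do not touch Ferret, so the timings only show the shape of the measurement, not real indexing cost.

```ruby
require "benchmark"

# Stand-in record (hypothetical): to_doc builds the hash of fields
# that would be handed to the index.
FakeRecord = Struct.new(:id) do
  def to_doc
    { id: id, title: "program #{id}" }
  end
end

# Stand-in index (hypothetical): << only collects documents in memory.
class FakeIndex
  attr_reader :docs

  def initialize
    @docs = []
  end

  def <<(doc)
    @docs << doc
    self
  end
end

records = (1..1_000).map { |i| FakeRecord.new(i) }
index = FakeIndex.new

# Pass 1: build the documents but throw them away (the patched line).
build_only = Benchmark.realtime { records.each { |rec| rec.to_doc } }

# Pass 2: build the documents and add them to the index (the original line).
build_and_add = Benchmark.realtime { records.each { |rec| index << rec.to_doc } }

puts format("to_doc only:      %.4f s", build_only)
puts format("to_doc and index: %.4f s", build_and_add)
```

The difference between the two figures approximates the time spent in the index itself; if the first figure is already large, the bottleneck is more likely the database reads or the document construction.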
Daniel Einspanjer
2007-Jun-08 14:25 UTC
[Ferret-talk] Advise on slowness in bootstrapping?
The bootstrap indexing actually ended up taking twice the amount of time listed below. When there was no index directory and I made the call to rebuild_index, the ferret_index.log file had these lines in it:

# Logfile created on Thu Jun 07 08:46:34 -0400 2007 by logger.rb/1.5.2.9
rebuild index: []
reindexing model CurrentProgram
reindex model CurrentProgram : 0.00% complete : 25658.57 secs to finish
...

When it hit 100%, the following lines appeared:

reindex model CurrentProgram : 99.56% complete : 219.29 secs to finish
Created Ferret index in: ./script/../config/../config/../index/production/current_program
rebuild index: [["CurrentProgram"]]
reindexing model CurrentProgram
reindex model CurrentProgram : 0.00% complete : 25740.65 secs to finish
reindex model CurrentProgram : 0.95% complete : 26065.95 secs to finish

So it looks like for some reason, it performed the rebuild twice. :(

When I looked at it this morning, it had over 116k files in the current_program directory. Not the most healthy thing. I ran CurrentProgram.aaf_index.ferret_index.optimize and it took a few minutes and fully optimized down to three files.

I made the testing patch suggested and am running now. I did not delete the index directory. The ferret_index.log started out with these lines:

rebuild index: [["CurrentProgram"]]
reindexing model CurrentProgram
reindex model CurrentProgram : 0.00% complete : 3540.78 secs to finish
reindex model CurrentProgram : 0.95% complete : 3510.69 secs to finish

So it is a significantly shorter time when it isn't actually adding the doc to the index.

If you have any further ideas on things to try or any other information you'd like to collect, please let me know. In the meantime, I'm going to try out the acts_as_solr plugin since I've had a bit more experience with tuning solr and see what the indexing performance on that looks like.

Daniel
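The figures in this message allow a rough projection for the large table. The sketch below assumes that indexing time scales linearly with row count and that the initial "secs to finish" estimates in the log are representative of a single pass; both are assumptions, and the row counts are taken from the first message in the thread.

```ruby
# Row counts from the first message in the thread.
MEDIUM_ROWS = 105_464
LARGE_ROWS  = 1_244_716

# Initial estimates logged by rebuild_index for the medium table.
full_rebuild_secs = 25_658.57 # documents built and added to the index
to_doc_only_secs  = 3_540.78  # documents built but not added (test patch)

# Assumption: cost per row stays constant as the table grows.
secs_per_row = full_rebuild_secs / MEDIUM_ROWS
large_hours  = secs_per_row * LARGE_ROWS / 3600.0

# Share of a full pass that is spent outside the index write.
to_doc_share = to_doc_only_secs / full_rebuild_secs

puts format("per row:            %.4f s", secs_per_row)
puts format("large table (est.): %.1f hours", large_hours)
puts format("to_doc share:       %.1f%%", to_doc_share * 100)
```

Under these assumptions a single pass over the large table comes out at roughly 84 hours, well past the 48-hour limit mentioned in the first message, and only about 14% of a pass is spent building documents rather than writing them to the index.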
On Fri, Jun 08, 2007 at 10:25:07AM -0400, Daniel Einspanjer wrote:
> The bootstrap indexing actually ended up taking twice the amount of
> time listed below. When there was no index directory and I made the
> call to rebuild_index, the ferret_index.log file had these lines in
> it:
> # Logfile created on Thu Jun 07 08:46:34 -0400 2007 by logger.rb/1.5.2.9
> rebuild index: []
> reindexing model CurrentProgram
> reindex model CurrentProgram : 0.00% complete : 25658.57 secs to finish
> ...
>
> when it hit 100%, the following lines appeared:
> reindex model CurrentProgram : 99.56% complete : 219.29 secs to finish
> Created Ferret index in:
> ./script/../config/../config/../index/production/current_program
> rebuild index: [["CurrentProgram"]]
> reindexing model CurrentProgram
> reindex model CurrentProgram : 0.00% complete : 25740.65 secs to finish
> reindex model CurrentProgram : 0.95% complete : 26065.95 secs to finish
>
> So it looks like for some reason, it performed the rebuild twice. :(

Damn, that bug seems to come back from time to time, I'll try to fix this over the weekend.

> When I looked at it this morning, it had over 116k files in the
> current_program directory. Not the most healthy thing. I ran
> CurrentProgram.aaf_index.ferret_index.optimize and it took a few
> minutes and fully optimized down to three files.

It should optimize the index automatically after re-indexing.

> I made the testing patch suggested and am running now. I did not
> delete the index directory. The ferret_index.log started out with
> these lines:
> rebuild index: [["CurrentProgram"]]
> reindexing model CurrentProgram
> reindex model CurrentProgram : 0.00% complete : 3540.78 secs to finish
> reindex model CurrentProgram : 0.95% complete : 3510.69 secs to finish
>
> So it is a significantly shorter time when it isn't actually adding
> the doc to the index.

Yeah, looks like it's really the indexing that takes the time. Can you make sure for your testing that nothing else accesses the index while the rebuild runs (i.e. shut down any running mongrels)? Or try aaf trunk and the DRb server, which will ensure that by design and, for performance measurements, is the more realistic scenario anyway.

> If you have any further ideas on things to try or any other
> information you'd like to collect, please let me know. In the
> meantime, I'm going to try out the acts_as_solr plugin since I've had
> a bit more experience with tuning solr and see what the indexing
> performance on that looks like.

From what I've heard it should be on par with aaf when things are working normally (I guess they don't for some reason in your case).

btw, what platform do you run on?

Jens
Daniel Einspanjer
2007-Jun-08 16:44 UTC
[Ferret-talk] Advise on slowness in bootstrapping?
On 6/8/07, Jens Kraemer <kraemer at webit.de> wrote:
> On Fri, Jun 08, 2007 at 10:25:07AM -0400, Daniel Einspanjer wrote:
> damn, that bug seems to come back from time to time, I'll try to fix
> this over the weekend.

I saw a couple of other threads mentioning something similar to this, so I figured it either wasn't fixed in the version I was working with or it might have been a regression.

> > When I looked at it this morning, it had over 116k files in the
> > current_program directory. Not the most healthy thing. I ran
> > CurrentProgram.aaf_index.ferret_index.optimize and it took a few
> > minutes and fully optimized down to three files.
>
> It should optimize the index automatically after re-indexing.

I see in the rebuild_index method where it calls optimize, but it certainly didn't seem to fully optimize it at that time. Maybe there was something specific to the case of a newly created index instead of opening an existing one?

> > I made the testing patch suggested and am running now. I did not
> > delete the index directory. The ferret_index.log started out with
> > these lines:
> > rebuild index: [["CurrentProgram"]]
> > reindexing model CurrentProgram
> > reindex model CurrentProgram : 0.00% complete : 3540.78 secs to finish
> > reindex model CurrentProgram : 0.95% complete : 3510.69 secs to finish
> >
> > So it is a significantly shorter time when it isn't actually adding
> > the doc to the index.
>
> Yeah, looks like it's really the indexing that takes the time. Can you
> make sure for your testing that nothing else accesses the index while
> the rebuild runs (i.e. shutdown any mongrels running?

Since this was a bootstrapping test, I had no processes running other than the script\console production from which I issued the rebuild_index command.

> Or try aaf trunk and the DRb server which will ensure that by design and
> for performance measurements is the more realistical scenario anyway.

I'm currently planning on running this as a single instance application because the index will be read only at run time and only used by one or two people at a time.

> From what I've heard it [aas] should be on par with aaf when things are
> working normal (I guess they don't for some reason in your case).

I've heard the same. The only reason I thought to try it out was because of my prior experience with Solr.

> btw, what platform do you run on?

This is a Windows box connecting to a MSSQL server. (I know.. ick. ;) I did some preliminary testing to make sure that the pagination was working properly, since I saw in the list that other people had some difficulties with it.

Daniel