Jeff Gortatowsky
2006-Oct-12 15:46 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
Hi,

Obviously my question is: is that normal? To have so many files? I was indexing 6 string fields from 25000+ model records (all of the same model). The index appears to be working. I guess I was expecting a few hundred files after optimizing, not more files than records indexed.

Please understand I am brand spanking new to Lucene, Ferret, and AaF.

I was using acts_as_ferret with

    :fields => ["user_id",
                "answer1",
                "answer2",
                "answer3",
                "answer4",
                "answer5",
                "answer6"],
    :merge_factor => 1000,
    :max_merge_document => 10000,
    :max_memory_buffer => 0x4000000

The fields are from 15 to 500 characters long.

Also, is there any way to stop AaF from trying to create a new index with all the existing model data? I was surprised when, after creating and updating one model object in the Rails console, AaF took off trying to index all 8 million rows of the underlying table!

I did search here on "too many files" and "large number of files" but came up empty. I am sure my lack of domain knowledge is most likely what is hurting me.

Thanks,
Jeff
-- Posted via http://www.ruby-forum.com/.
David Balmain
2006-Oct-13 00:40 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
On 10/13/06, Jeff Gortatowsky <indanapt at yahoo.com> wrote:
> Hi
>
> Obviously my question is, is that normal? To have so many files? I was
> indexing 6 string fields from 25000+ model records (all of the same
> model). The index appears to be working. I guess I was expecting a few
> hundred files after optimizing, not more files than records indexed.

Hi Jeff, this doesn't sound right at all. Could you send a partial listing of the directory so I can see what files are in it? Do `ls -l` so I can see their sizes too.

> Please understand I am brand spanking new to Lucene, Ferret, and AaF.

No problem, we're here to help.

> I was using acts_as_ferret with
>
> :fields => ["user_id",
>             "answer1",
>             "answer2",
>             "answer3",
>             "answer4",
>             "answer5",
>             "answer6"],
> :merge_factor => 1000,
> :max_merge_document => 10000,
> :max_memory_buffer => 0x4000000
>
> The fields are from 15 to 500 characters long.
>
> Also, was there any way to stop AaF from trying to create a new index
> with all the existing model data? I was surprised when after creating
> and updating one model object in the Rails Console, AaF took off trying
> to index all 8 million rows of the underlying table!

<snip>

I'll leave these kinds of questions to the acts_as_ferret users.

Cheers,
Dave
Jens Kraemer
2006-Oct-13 08:31 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
On Fri, Oct 13, 2006 at 09:40:23AM +0900, David Balmain wrote:
> On 10/13/06, Jeff Gortatowsky <indanapt at yahoo.com> wrote:
[..]
> > Also, was there any way to stop AaF from trying to create a new
> > index with all the existing model data? I was surprised when after
> > creating and updating one model object in the Rails Console, AaF
> > took off trying to index all 8 million rows of the underlying table!

aaf always tries to create the index if it doesn't exist yet. The whole point of aaf is to keep the index in sync with your database. Therefore it is necessary to add all existing records to a newly created index.

Although it would be easy to add an option to suppress the indexing of existing data, I don't think this is useful, because you'll end up with an index containing only new or updated records, but not those that already existed at index creation time. I can't imagine this is what you want ;-)

To keep the index creation from happening when the index is accessed the first time from your app (could be a search, or some update/create operation), you can build up the index from the console, i.e.

    RAILS_ENV=production script/console
    >> Model.rebuild_index

cheers,
Jens

--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
Jeff Gortatowsky
2006-Oct-19 23:15 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
David Balmain wrote:
> On 10/13/06, Jeff Gortatowsky <indanapt at yahoo.com> wrote:
> Hi Jeff, this doesn't sound right at all. Could send a partial listing
> of the directory so I can see what files are in it? Do `ls -l` so I

Below is a very, very partial listing. My env is Windows XP Pro. The versions of my gems are listed below as well. Basically I accessed the first model object and called model.save! to kick off the indexing, which it did. BTW: this is SQL Server, if it matters. BTW: searching the index works. Well...

I found out when I asked to highlight() that I never get anything back. Looking at the source code and my fields, I find I must have had (or defaulted to) :store => :no, so I have to retrieve the row, iterate over the fields myself to find out which field matched, and then display the results. That is not pretty but I have to admit, it's painless.

Still, 25,000 records made 28,000+ files. Can you imagine all 8.1 million records!! Is it because one of the fields being indexed is always unique (think user ID/primary key)?

I was going to try it in Lucene and see what happens. I figure if it is different, they must be doing something odd in Ferret/AaF. Plus I can try native Ferret to create the index and forgo AaF for the initial index creation (assuming that is a "fix").

Thank you for any time and effort. I am becoming quite a Ruby/Rails/Ferret fan for prototyping. I can't say I am ready for Rails on my production environment hosting 40k logged-in users a night, but it's wonderful for concept exploration.

Here is the partial listing (the files shown are representative of all the others, except for the last two, which are the only ones of their kind). After the listing are my gem versions.

    10/11/2006  08:23 PM             1,300 _z.cfs
    10/11/2006  08:25 PM             1,314 _z0.cfs
    10/11/2006  08:25 PM             1,705 _z1.cfs
    10/11/2006  08:26 PM             3,039 _z2.cfs
    10/11/2006  08:26 PM               970 _z3.cfs
    10/11/2006  08:26 PM             3,015 _z4.cfs
    10/11/2006  08:26 PM            14,266 _z5.cfs
    10/11/2006  08:26 PM               770 _z6.cfs
    10/11/2006  08:26 PM               815 _z7.cfs
    10/11/2006  08:26 PM             1,150 _z8.cfs
    10/11/2006  08:26 PM             1,564 _z9.cfs
    10/11/2006  08:26 PM             2,283 _za.cfs
    10/11/2006  08:26 PM             1,259 _zb.cfs
    10/11/2006  08:26 PM             1,598 _zc.cfs
    10/11/2006  08:26 PM             1,655 _zd.cfs
    10/11/2006  08:26 PM             5,466 _ze.cfs
    10/11/2006  08:26 PM             1,242 _zf.cfs
    10/11/2006  08:26 PM            13,609 _zg.cfs
    10/11/2006  08:26 PM             2,081 _zh.cfs
    10/11/2006  08:26 PM             1,101 _zi.cfs
    10/11/2006  08:26 PM             1,053 _zj.cfs
    10/11/2006  08:26 PM             2,208 _zk.cfs
    10/11/2006  08:26 PM               920 _zl.cfs
    10/11/2006  08:26 PM             3,003 _zm.cfs
    10/11/2006  08:26 PM             2,148 _zn.cfs
    10/11/2006  08:26 PM             1,195 _zo.cfs
    10/11/2006  08:26 PM             1,707 _zp.cfs
    10/11/2006  08:26 PM             1,747 _zq.cfs
    10/11/2006  08:26 PM            12,889 _zr.cfs
    10/11/2006  08:26 PM             2,531 _zs.cfs
    10/11/2006  08:26 PM             1,359 _zt.cfs
    10/11/2006  08:26 PM             2,330 _zu.cfs
    10/11/2006  08:26 PM             1,793 _zv.cfs
    10/11/2006  08:26 PM             1,788 _zw.cfs
    10/11/2006  08:26 PM             3,135 _zx.cfs
    10/11/2006  08:26 PM             2,603 _zy.cfs
    10/11/2006  08:26 PM             2,210 _zz.cfs
    10/12/2006  08:39 AM               213 fields
    10/12/2006  08:40 AM                29 segments
               28381 File(s)    261,021,758 bytes
                   2 Dir(s)  35,192,127,488 bytes free

    actionmailer (1.2.5)
    actionpack (1.12.5)
    actionwebservice (1.1.6)
    activerecord (1.14.4)
    activesupport (1.3.1)
    ferret (0.10.9)
    fxri (0.3.3)
    fxruby (1.6.2, 1.6.1, 1.6.0, 1.2.6)
    gem_plugin (0.2.1)
    log4r (1.0.5)
    mongrel (0.3.13.3)
    rails (1.1.6)
    rake (0.7.1)
    sources (0.0.1)
    win32-clipboard (0.4.1, 0.4.0)
    win32-dir (0.3.0)
    win32-eventlog (0.4.2, 0.4.1)
    win32-file (0.5.2)
    win32-file-stat (1.2.2)
    win32-process (0.5.1, 0.4.2)
    win32-sapi (0.1.3)
    win32-service (0.5.0)
    win32-sound (0.4.0)
    windows-pr (0.5.4, 0.5.1)
Jeff Gortatowsky
2006-Oct-19 23:22 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
Jens Kraemer wrote:
> To keep the index creation from happening when the index is accessed the
> first time from your app (could be a search, or some update/create
> operation), you can build up the index from the console, i.e.
>
> RAILS_ENV=production script/console
> >> Model.rebuild_index
>
> cheers,
> Jens

Thank you Jens. While all you say is true, the original rowset was over 8,000,000 rows and would have taken days or more. I just wanted to do some experimentation to see if my code would work. AaF is not that well documented (well, perhaps for those smarter than I) and therefore I thought my best bet was to play with it in the console. Little did I realize it would go off and index the whole table to start.

And while you are correct that most of the time you want a complete index, there really are use cases where you only want to index data from that time forward. Anyway, I created a much smaller table and worked with it to start. However, it created 28,000 files for 25,000 records. Still not quite right. But it does work, in that I can search it.

BTW: Is there a method to return only the fields that matched, and not all the fields of the documents that had matches? Of course I did my own filter.

Best wishes and thank you for the advice and counsel.

Jeff
David Balmain
2006-Oct-20 06:07 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
On 10/20/06, Jeff Gortatowsky <indanapt at yahoo.com> wrote:
> Jens Kraemer wrote:
>
> > To keep the index creation from happening when the index is accessed the
> > first time from your app (could be a search, or some update/create
> > operation), you can build up the index from the console, i.e.
> >
> > RAILS_ENV=production script/console
> > >> Model.rebuild_index
> >
> > cheers,
> > Jens
>
> Thank you Jens. While all you say is true, the original rowset was over
> 8,000,000 rows and would have taken days or more. I just wanted to do
> some experimentation to see if my code would work. AaF is not that well
> documented (well perhaps for those smarter than I) and therefore I
> thought my best bet was to play with it in the console. Little did I
> realize it would go off and index the table to start.
>
> And while you are correct that most of the time, you want an index,
> really there are use cases where you only want to index data from that
> time forward.

That may be true, but I don't think the goal of acts_as_ferret should be to cover all possible use cases. Its job is to make using Ferret with ActiveRecord as easy as possible. If you need to do anything more complicated than usual, then why not just use Ferret directly?

> Anyway I created a much smaller table and worked with it
> to start. However it created 28,000 files for 25,000 records. Still not
> quite right. But it does work in that I can search it.

There is something very wrong there, but I have no idea what the problem is. For some reason Ferret doesn't seem to be merging the index segments (judging by your following email).

> BTW: Is there a method to say only return fields from the documents that
> matched and not all the fields of documents that had matches? Of course
> I did my own filter.

Ferret documents are lazy loading, so only the fields that you view get loaded. However, there is currently no way to find out which fields matched.

Cheers,
Dave
David Balmain
2006-Oct-20 06:35 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
On 10/20/06, Jeff Gortatowsky <indanapt at yahoo.com> wrote:
> David Balmain wrote:
> > On 10/13/06, Jeff Gortatowsky <indanapt at yahoo.com> wrote:
> > Hi Jeff, this doesn't sound right at all. Could send a partial listing
> > of the directory so I can see what files are in it? Do `ls -l` so I
>
> Below is a very very very partial listing. My env is Windows XP Pro. The
> versions of gems is listed below as well. Basically I accessed the first
> model object and said model.save! to kick off the indexing. Which it
> did. BTW: this is SQLServer if it matters. BTW: The searching the index
> works. well...

Ahhh. I've had this problem in Windows before but I thought it was fixed. For some reason the operating system mustn't be allowing Ferret to delete the index files when it is finished with them. I'm not sure why this would be happening though. This would give us approximately 25_000 + 2500 + 250 + 25 + 2 = 27777 files after merging. This is still short of the 28300 files you have though. :-(

> I found out when I asked to highlight() that I never get anything back.
> Looking at the source code and my fields I find I must have had (or
> defaulted) to :store=>no so I have to retrieve the row, iterate myself
> over the fields to find out which field matched, and then display the
> results. That is not pretty but I have to admit, it's painless.

This is one of the reasons I want to implement a database based on Ferret, so that operations like this will be very simple. I could add a highlighting method for externally stored fields, but you need to store term vectors for the highlighting to work exactly (i.e. for stemmed terms and for matching sloppy phrases exactly), so if you are storing term vectors, you may as well store the field as well. For externally stored fields the highlighting method you are using is best.

> Still
> 25,000 records made 28,000+ files. Can you imagine all 8.1 million
> records!! Is it because one of the fields being indexed is always unique
> (think User ID/Primary key)?

No, I think the majority of those files are obsolete. In fact I'm not sure if Windows would even allow you to open that many files at once (and Ferret does open all of the files in the index directory). If you open up the segments file you'll see a list of the segments that are actually still being used by Ferret (along with a bunch of binary data). Given that your segments file is only 29 bytes, I'm guessing that you have optimized your index and you only have one valid index segment. The rest is junk. For the record, I indexed 2,000,000 records the other day (approximately 4000kb each) in 2 1/2 hours and I had at most 120 files in my index directory.

> I was going to try it in Lucene and see what happens. I figure if it
> is different, they must be doing something odd in Ferret/AaF. Plus I can
> try native ferret to create the index and forego AaF for the initial
> index creation (assuming that is a 'fix').

Lucene actually records a list of files it fails to delete and continues to try to delete those files. It's a bit of a hack and I was hoping to get away with not doing that in Ferret. Looks like I was wrong. I wonder why it works for me and not for you. I have XP Home edition so it should be the same.

> Thank you for any time and effort. I am becoming quite a
> Ruby/Rails/Ferret fan for prototyping.

<snip>
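The file-count estimate in the message above (25_000 + 2500 + 250 + 25 + 2 = 27777) can be reproduced with a few lines of Ruby. This is only a sketch under stated assumptions: every record starts out as its own segment, segments are merged ten at a time (the merge factor of 10 is inferred from the figures and is not stated in the thread, where the poster's own configuration asked for 1000), and no merged-away file is ever deleted. The helper name `leftover_segment_files` is made up for illustration and is not part of Ferret.

```ruby
# Count the segment files left on disk if files that have been merged
# into a larger segment are never removed.
# ASSUMPTIONS: one initial segment per document; merge_factor segments
# are merged into one at each level; nothing is ever deleted.
def leftover_segment_files(doc_count, merge_factor)
  total = 0
  segments = doc_count
  while segments > 0
    total += segments            # files written at this merge level
    segments /= merge_factor     # full groups merged into the next level
  end
  total
end

puts leftover_segment_files(25_000, 10)  # prints 27777
```

The loop adds 25000, 2500, 250, 25 and 2 before integer division reaches zero, which matches the sum quoted in the message.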
Jeff Gortatowsky
2006-Oct-20 16:08 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
Thanks everyone for your help. Of course I meant no disrespect about AaF indexing tables as soon as it discovers there is no index. I am quite appreciative of the plugin as it is. Please don't think I don't appreciate it. :)

And I really thank you all for the help.

    C:\ruby\omi\index\development\string_answer>od -c segments
    0000000  \0  \0  \0  \0  \0  \0  \0  \0   E   .   #   ?  \0  \0  \0  \0
    0000020  \0  \0   n   ? 001 004   _   l   w   a   +   ? 001
    0000035

So am I correct in saying that the file _lwa.cfs is the only file really needed?

Thanks again. It's great to see that it really worked and that the only problem is 'Windoze' related. I would be working on Ubuntu but the SQL Server adapter I tried there could not page through data sets.
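Instead of reading the od dump by eye, the segment names can be pulled out of the segments file from Ruby. The following is a rough sketch, not a parser for Ferret's file format: it assumes only that segment names are stored as plain ASCII, an underscore followed by lowercase base-36 digits, and it ignores every other byte. The helper name `segment_names` is hypothetical, and the sample bytes are a stand-in modelled on the tail of the dump above, with 0xFF substituted for the bytes od printed as "?" (their real values are unknown).

```ruby
# Return everything in the raw bytes that looks like a segment name.
# ASSUMPTION: names are stored as ASCII; the binary layout around them
# is not interpreted at all.
def segment_names(raw_bytes)
  raw_bytes.b.scan(/_[0-9a-z]+/)
end

# Stand-in data, not the poster's actual file contents.
sample = "\x00\x00n\xFF\x01\x04_lwa+\xFF\x01".b
p segment_names(sample)  # prints ["_lwa"]
```

Against a real index directory the call would be `segment_names(File.binread("segments"))`.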
Jeff Gortatowsky
2006-Oct-20 17:48 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
Jeff Gortatowsky wrote:
> Thanks everyone for your help.

Sorry for all the typos.

To place an ending on this: I did indeed move all the files out except fields, segments, and _lwa.cfs, and it -seems- to be working. If I have a moment to look at where I may have created a problem closing files, I will. And I can easily try recreating the index on Ubuntu, as it involves no paging (which I only deal with when interacting with end users).

Thanks Dave and Jens for the advice and counsel.
David Balmain
2006-Oct-20 17:58 UTC
[Ferret-talk] Newbie question: 28000+ files for 25000+ records?
On 10/21/06, Jeff Gortatowsky <indanapt at yahoo.com> wrote:
> Thanks everyone for your help. Of course I meant no disrespect about AaF
> indexing tables as soon as it discovers there is none. I am quite
> appreciative of the Plugin as it is. Please don't think I don't
> appreciate it. :)
>
> And I really thank you all for the help
>
> C:\ruby\omi\index\development\string_answer>od -c segments
> 0000000  \0  \0  \0  \0  \0  \0  \0  \0   E   .   #   ?  \0  \0  \0  \0
> 0000020  \0  \0   n   ? 001 004   _   l   w   a   +   ? 001
> 0000035
>
> So am I correct in saying that the file _lwa.cfs is the only file really
> needed?

Well, sort of. Ferret does write a couple of files while it is indexing that won't appear in the segments file. Also, don't delete the fields file. Otherwise, "lwa" is a base-36 integer, so you can delete any file labeled _lw9 and below.

> Thanks again. It's great to see that it really worked and that the only
> problem is 'Windoze' related. I would be working on Ubuntu but the
> SQLServer adapter I tried there could not page through data sets.

Ferret definitely works on Ubuntu. That's where it was developed, and I think Jens may actually develop acts_as_ferret on Ubuntu too.

Cheers,
Dave
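The "base-36 integer" rule in the reply above can be checked mechanically. Here is a minimal Ruby sketch of it, assuming segment files are named with an underscore, a base-36 counter and a .cfs extension, as in the listing earlier in the thread. The helper names `segment_number` and `stale_segment_files` are invented for this example and are not part of Ferret or acts_as_ferret.

```ruby
# Convert a segment name such as "_lwa" to its base-36 counter value.
def segment_number(name)
  name.sub(/\A_/, "").to_i(36)
end

# Pick the .cfs files whose counter is lower than the live segment's.
# Files such as "fields" and "segments" are never selected, and neither
# are segments numbered above the live one (Ferret may still be writing them).
def stale_segment_files(filenames, live_segment)
  live = segment_number(live_segment)
  filenames.select do |file|
    base = File.basename(file, ".cfs")
    file.end_with?(".cfs") && base.start_with?("_") && segment_number(base) < live
  end
end

puts segment_number("_lwa")  # prints 28378
files = ["_z.cfs", "_lw9.cfs", "_lwa.cfs", "fields", "segments"]
p stale_segment_files(files, "_lwa")  # prints ["_z.cfs", "_lw9.cfs"]
```

The value 28378 is close to the 28,381 files reported in the directory listing. Moving the selected files aside first, as the original poster did, is safer than deleting them outright.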