Craig Jolicoeur
2007-Aug-08 20:16 UTC
[Ferret-talk] issues with index for table with over 18 million records
I have a MySQL table with over 18 million records in it. We are indexing about 10 fields in this table with Ferret.

I am having problems with the initial build of the index. I created a rake task to run the "Model.rebuild_index" command in the background. That process ran fine for about 2.5 days before it suddenly stopped. The log/ferret_index.log file says it got to about 28% before ending. I'm not sure whether the process died because of something on my server or because of something related to Ferret.

It appears that it will take close to 10 days for the full index to be built with rebuild_index. Is this normal for a table of this size? Also, is there a way to resume from where the index left off instead of having to rebuild the entire index from scratch? I got about 28% of the way through, so I would rather not waste another 2.5 days rebuilding that part while trying to get the full index 100% built.

Also, is there a way I can rebuild the index non-destructively, since it didn't complete? Meaning, can I rebuild it without overwriting what is already there, so the existing index stays searchable while the rebuild takes place, and then move the new index over the old one? I'm not running Ferret as a DRb server, so I don't know if I can.

Finally, is there a faster or better way that I can or should be building the index? Will I have an issue with the index file sizes with a DB this size?

--
Posted via http://www.ruby-forum.com/.
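One way to make a long build resumable, sketched here purely as an illustration (the checkpoint scheme, `id_batches` helper, and `to_doc`/`save_checkpoint` names are hypothetical, not part of acts_as_ferret), is to walk the table in primary-key ranges and record the last id indexed, so a crashed build can restart from the checkpoint rather than from scratch:

```ruby
# Sketch: split the remaining id space into batches so an interrupted
# index build can resume from a recorded checkpoint. Helper name and
# checkpoint scheme are illustrative, not acts_as_ferret API.
def id_batches(last_indexed_id, max_id, batch_size)
  batches = []
  start = last_indexed_id + 1
  while start <= max_id
    stop = [start + batch_size - 1, max_id].min
    batches << (start..stop)
    start = stop + 1
  end
  batches
end

# Resuming at id 5,000,000 of 18,000,000 in 10,000-row batches
# might then look like (hypothetical model/index methods):
#
# id_batches(5_000_000, 18_000_000, 10_000).each do |range|
#   Model.find(:all, :conditions => ["id BETWEEN ? AND ?",
#                                    range.first, range.last]).each do |rec|
#     index << rec.to_doc        # hypothetical document conversion
#   end
#   save_checkpoint(range.last)  # hypothetical checkpoint write
# end
```

Batching also keeps memory flat, since only one slice of the table is instantiated at a time.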
Erik Morton
2007-Aug-08 20:50 UTC
[Ferret-talk] issues with index for table with over 18 million records
We have a 1 million record index that is about 6GB in size. We build it in parallel without AAF, so it's hard to comment on the speed of your index build. However, I will say that I did need to manually patch Ferret to better handle large indexes. Here is the diff:

--- /usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/ext/index.c
+++ index.c
@@ -1375,7 +1375,7 @@
     lazy_doc = lazy_doc_new(stored_cnt, fdt_in);
     for (i = 0; i < stored_cnt; i++) {
-        int start = 0, end, data_cnt;
+        off_t start = 0, end, data_cnt;
         field_num = is_read_vint(fdt_in);
         fi = fr->fis->fields[field_num];
         data_cnt = is_read_vint(fdt_in);
@@ -1449,7 +1449,7 @@
     if (store_offsets) {
         int num_positions = tv->offset_cnt = is_read_vint(fdt_in);
         Offset *offsets = tv->offsets = ALLOC_N(Offset, num_positions);
-        int offset = 0;
+        off_t offset = 0;
         for (i = 0; i < num_positions; i++) {
             offsets[i].start = offset += is_read_vint(fdt_in);
             offsets[i].end = offset += is_read_vint(fdt_in);
@@ -1683,8 +1683,8 @@
     int last_end = 0;
     os_write_vint(fdt_out, offset_count); /* write shared prefix length */
     for (i = 0; i < offset_count; i++) {
-        int start = offsets[i].start;
-        int end = offsets[i].end;
+        off_t start = offsets[i].start;
+        off_t end = offsets[i].end;
         os_write_vint(fdt_out, start - last_end);
         os_write_vint(fdt_out, end - start);
         last_end = end;
@@ -4799,7 +4799,7 @@
  *
  ************************************************************************
  ****/
-Offset *offset_new(int start, int end)
+Offset *offset_new(off_t start, off_t end)
 {
     Offset *offset = ALLOC(Offset);
     offset->start = start;

On Aug 8, 2007, at 4:16 PM, Craig Jolicoeur wrote:
> [...]
_______________________________________________
Ferret-talk mailing list
Ferret-talk at rubyforge.org
http://rubyforge.org/mailman/listinfo/ferret-talk
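The reason the int-to-off_t change above matters: a 32-bit signed int tops out at 2**31 - 1 (roughly 2 GB), so any byte offset past that boundary wraps negative. A quick Ruby illustration of the wraparound (using pack/unpack to reinterpret a value as a 32-bit signed int):

```ruby
# A 32-bit signed int holds at most 2**31 - 1 (about 2 GB). Storing a
# file offset just past that boundary in an int wraps it negative,
# which is the kind of corruption the off_t patch avoids.
INT32_MAX = 2**31 - 1

def as_int32(n)
  [n].pack('l').unpack('l').first  # reinterpret n as 32-bit signed
end

puts as_int32(INT32_MAX)      # => 2147483647  (still fine)
puts as_int32(INT32_MAX + 1)  # => -2147483648 (wrapped negative)
```

On platforms with large-file support, off_t is 64 bits, so offsets well past 2 GB remain representable.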
Craig Jolicoeur
2007-Aug-08 20:53 UTC
[Ferret-talk] issues with index for table with over 18 million records
Erik Morton wrote:
> We have a 1 million record index that is about 6GB in size. We build
> it in parallel w/out AAF so it's hard to comment on the speed of your
> index build. However I will say that I did need to manually patch
> Ferret to better handle large indexes.

Erik,

What issues did you find that caused you to patch the Ferret code?

Also, you say you build the index in parallel without AAF; how do you do that? I'm not sure I follow how to do that, so if you can explain, I'd appreciate it.
Erik Morton
2007-Aug-08 21:16 UTC
[Ferret-talk] issues with index for table with over 18 million records
We had to patch it because we were getting seemingly random errors while searching a 2GB+ index. This is the Trac ticket: http://ferret.davebalmain.com/trac/ticket/215. The patch I included changes some ints to off_t's, which solved the problem. As far as I know, this patch was never applied to the trunk.

We build our index using a modified version of RDig. We basically run up to 80 EC2 servers in parallel to create 80 separate indexes, which we later combine into a single index. You could follow a similar route and still have AAF manage the index after it is built. You'd need to make sure that the documents created by RDig (or whatever you use) have the same fields that AAF expects.

Erik

On Aug 8, 2007, at 4:53 PM, Craig Jolicoeur wrote:
> [...]