I'm trying to index ~130,000 documents [soon to grow to about 500,000 documents] and I'm wondering if it's possible to combine Ferret databases or in some other way split up the building process.

Normally, indexing 130k documents wouldn't be that painful, except that there are different types of links between these documents and they are not absolute (so, for example, doc A refers to a document B, but there are multiple different documents labelled document A and document B, and to prevent false links I have to use some fairly computationally intensive heuristics).

If it's not possible to split up the building of a Ferret index, I'll probably resolve the links into absolute links as a separate part of the process [which I can split up] and then build the Ferret index on one machine after that.

-- 
Posted via http://www.ruby-forum.com/.
On Mon, Nov 20, 2006 at 03:52:21AM +0100, Holden Karau wrote:
> I'm trying to index ~130,000 documents [soon to grow to about 500,000
> documents] and I'm wondering if it's possible to combine ferret databases
> or in some other way split up the building process.
[..]

Only one process or thread may write to the index at once, so you'll have to serialize your writing to the index somehow, i.e. by gathering the data on two machines (or threads) and handing it over to a single indexer.

Jens

-- 
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
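
The "gather in parallel, hand over to one indexer" arrangement described above could look roughly like the following. This is only a sketch under assumptions: resolve_links, build_serialized and the document hashes are hypothetical names made up for illustration, and a plain Array stands in for the index so the example runs without Ferret installed. Anything that accepts `<<` with a hash, as a Ferret index does, could take its place.

```ruby
require 'thread'

# Hypothetical stand-in for the expensive link-resolution heuristics.
def resolve_links(doc)
  doc.merge(:links => doc[:raw_links].map { |l| "abs:#{l}" })
end

# Workers prepare documents concurrently; a single writer thread is the
# only code that ever appends to the index.
def build_serialized(docs, index, worker_count = 4)
  input  = Queue.new
  output = Queue.new
  docs.each { |d| input << d }
  worker_count.times { input << :done }   # one stop marker per worker

  workers = Array.new(worker_count) do
    Thread.new do
      while (doc = input.pop) != :done
        output << resolve_links(doc)
      end
      output << :done                     # tell the writer this worker is finished
    end
  end

  writer = Thread.new do
    finished = 0
    while finished < worker_count
      item = output.pop
      if item == :done
        finished += 1
      else
        index << item                     # serialized write
      end
    end
  end

  workers.each(&:join)
  writer.join
  index
end

# Demo with an Array as the "index"; arrival order is not guaranteed.
docs = (1..10).map { |i| { :id => i, :raw_links => ["doc#{i + 1}"] } }
result = build_serialized(docs, [])
puts result.size
```

Note that threads in the standard Ruby interpreter do not run CPU-bound Ruby code in parallel, so for heavy heuristics the workers would more likely be separate processes or machines feeding the same single writer; the queue-plus-one-writer shape stays the same.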
Jens Kraemer wrote:
> Only one process or thread may write to the index at once, so you'll
> have to serialize your writing to the index somehow, i.e. gathering the
> data on two machines (or threads) and hand it over to the indexer.

*Ferret newbie warning*

Shouldn't it be possible to use the add_indexes method to merge one or more indexes?

http://ferret.davebalmain.com/api/classes/Ferret/Index/Index.html#M000035

Cheers!
Patrick
On Mon, Nov 20, 2006 at 09:04:21AM -0500, Patrick Ritchie wrote:
> Shouldn't it be possible to use the add_indexes method to merge one or
> more indexes?
>
> http://ferret.davebalmain.com/api/classes/Ferret/Index/Index.html#M000035

Interesting :-)
I've never tried this, so if you do, please let me know how it works.

Jens
Jens Kraemer wrote:
> Interesting :-)
> I've never tried this, so if you do, please let me know how it works.

I just did the following in IRB:

    # setup assumed; not part of the original transcript
    require 'ferret'
    include Ferret::Index

    i1 = Index.new
    i2 = Index.new
    i1 << {:text => 'one'}
    i2 << {:text => 'two'}
    i1.search_each("text:one") {|id, score| puts "#{i1[id][:text]}"}
    => "one"
    i1.search_each("text:two") {|id, score| puts "#{i1[id][:text]}"}
    => nil
    i1.add_indexes i2
    i1.search_each("text:two") {|id, score| puts "#{i1[id][:text]}"}
    => "two"

Seems to work as advertised...

Cheers!
Patrick
Patrick Ritchie wrote:
> Shouldn't it be possible to use the add_indexes method to merge one or
> more indexes?
>
> http://ferret.davebalmain.com/api/classes/Ferret/Index/Index.html#M000035

I can't believe I missed that. I'll give it a shot sometime over the weekend, thanks :-)