thr3ads.net - Ferret talk - [Ferret-talk] parallel indexing with unique id? [Mar 2008]

R. Bryan Hughes

2008-Mar-25 05:29 UTC

[Ferret-talk] parallel indexing with unique id?

Hello all,
Is it possible to use parallel indexing and still ensure unique documents in
the merged index? Using the canned example, I'm ending up with non-unique
entries. It's just adding them all together even though I've defined a unique
:key.

How can I tell the IndexWriter to keep my uniqueness constraints?

For example, imagine that I have two indexes of a phone book:

"index_one" contains a unique set of names A-through-P (let's say the key is
their phone number).

"index_two" contains a unique set of names K-through-Z.

When I merge them, I would hope to get a unique index of A-through-Z, but
I'm getting double entries where they overlap, K-through-P.


Here's some code to demonstrate. My :id field is a long-ish unique
alphanumeric string. In the example below, "one" and "two" are actually
identical copies, each containing about 60,000 docs. I was hoping to get a
combined index containing the same 60,000 docs, but ended up with 120,000.

Any help will be greatly appreciated. Thanks!

####################

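# Setup assumed by the unqualified class names used below
require 'rubygems'
require 'ferret'
include Ferret::Analysis
include Ferret::Index

one = "Documents/bucket/index_1"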
two = "Documents/bucket/index_2"
merged = "Documents/bucket/merged_index"

pfa = PerFieldAnalyzer.new(LetterAnalyzer.new)
pfa[:id] = WhiteSpaceAnalyzer.new

field_infos = FieldInfos.new(:term_vector => :no)
field_infos.add_field(:id, :index => :untokenized)

# Open each source index through the Index class (Ferret::I)
index_one = Ferret::I.new(
               :key => :id,
               :max_buffer_memory => 0x8000000,
               :merge_factor => 5,
               :path => one,
               :analyzer => pfa,
               :field_infos => field_infos)

index_two = Ferret::I.new(
               :key => :id,
               :max_buffer_memory => 0x8000000,
               :merge_factor => 5,
               :path => two,
               :analyzer => pfa,
               :field_infos => field_infos)


readers = []
readers << IndexReader.new(one)
readers << IndexReader.new(two)


puts "size of index_one = "+index_one.size.to_s
puts "size of index_two = "+index_two.size.to_s


index_writer = IndexWriter.new(:path => merged)
index_writer.add_readers(readers)
index_writer.close()
readers.each{ |reader| reader.close() }

i = Ferret::I.new(:path => merged)

puts "size before optimize = "+i.size.to_s
i.optimize
puts "size after optimize = "+i.size.to_s

Jens Kraemer

2008-Mar-25 08:12 UTC

Hi!

On Mon, Mar 24, 2008 at 11:29:14PM -0600, R. Bryan Hughes wrote:
> Hello all,
> Is it possible to use parallel indexing and still ensure unique documents in
> the merged index? Using the canned example, I'm ending up with non-unique
> entries. It's just adding them all together even though I've defined a unique
> :key.
>
> How can I tell the IndexWriter to keep my uniqueness constraints?

You can't. The :key option is only interpreted by Ferret's Index class,
which will delete any already existing records with the same key field
value before adding a new record.

Cheers,
Jens


-- 
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database
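
The behaviour described above can be illustrated with a small plain-Ruby model. This is only a sketch and does not call Ferret at all: KeyedIndex, merge_raw and merge_unique are hypothetical names invented for the illustration. KeyedIndex imitates the delete-before-add handling of :key in the Index class, merge_raw imitates a merge that never looks at the key, and merge_unique re-adds every document through a keyed index so that the overlap collapses.

```ruby
# Toy stand-in for an index opened with :key => :id (NOT a Ferret class).
# Adding a document first removes any existing document with the same key.
class KeyedIndex
  include Enumerable

  def initialize(key)
    @key  = key
    @docs = {}                      # key value => document (a Hash)
  end

  def <<(doc)
    @docs.delete(doc[@key])         # delete-before-add, as described for :key
    @docs[doc[@key]] = doc
    self
  end

  def each(&block)
    @docs.each_value(&block)
  end

  def size
    @docs.size
  end
end

# Plain concatenation with no key check: overlapping documents appear twice.
def merge_raw(sources)
  sources.flat_map(&:to_a)
end

# Re-add every document through a keyed target so duplicates collapse.
def merge_unique(sources, target)
  sources.each do |source|
    source.each { |doc| target << doc }
  end
  target
end

# Phone-book example from the question: A-through-P and K-through-Z.
index_one = KeyedIndex.new(:id)
index_two = KeyedIndex.new(:id)
("A".."P").each { |n| index_one << { :id => "555-#{n}", :name => n } }
("K".."Z").each { |n| index_two << { :id => "555-#{n}", :name => n } }

raw    = merge_raw([index_one, index_two])
merged = merge_unique([index_one, index_two], KeyedIndex.new(:id))

puts "raw merge size    = #{raw.size}"     # overlap K-through-P counted twice
puts "unique merge size = #{merged.size}"  # one entry per key
```

With real indexes the analogous approach would presumably be to open the merged index through Ferret::I with :key => :id and add the documents one at a time, rather than handing readers to the IndexWriter. That is likely to be slower than a reader-level merge, and it can only carry over fields that were stored in the source indexes.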

R. Bryan Hughes

2008-Mar-26 15:22 UTC

Thanks! You saved me lots of time.

On 3/25/08, Jens Kraemer <jk at jkraemer.net> wrote:
>
> Hi!
>
> On Mon, Mar 24, 2008 at 11:29:14PM -0600, R. Bryan Hughes wrote:
> > Hello all,
> > Is it possible to use parallel indexing and still ensure unique documents in
> > the merged index? Using the canned example, I'm ending up with non-unique
> > entries. It's just adding them all together even though I've defined a unique
> > :key.
> >
> > How can I tell the IndexWriter to keep my uniqueness constraints?
>
> You can't. The :key option is only interpreted by Ferret's Index class,
> which will delete any already existing records with the same key field
> value before adding a new record.
>
> Cheers,
> Jens
