Hey guys,

Now that the Lucy[1] project has Apache approval and is about to begin, the onus is no longer on Ferret to strive for Lucene compatibility. (We'll be doing that in Lucy.) So I'm starting to think about ways to improve Ferret's API. The first part that needs to be improved is the Document API. It's annoying having to type all the attributes to initialize a field just to change the boost. So this;

    field = Field.new(:name, "data...", Field::Store::YES,
                      Field::Index::TOKENIZED, Field::TermVector::NO,
                      false, 5.0)

would become;

    field = Field.new(:name, "data...",
                      :index => Field::Index::TOKENIZED, :boost => 5.0)

It'd also be nice to replace the Parameter objects with symbols;

    field = Field.new(:name, "data...", :index => :tokenized, :boost => 5.0)

Of course, this raises the question: why do we need to specify that field :name is tokenized every time we create a :name field? Isn't it always going to be the same? What if we use a different value the next time we add a :name field? Well, the answer to this last question is a specific set of rules;

  1. Once you choose to index a field, that field is always indexed
     from that point forward.
  2. Once you store term vectors, always store term vectors.
  3. Once you store positions, always store positions.
  4. Once you store offsets, always store offsets.
  5. Once you store norms, always store norms.

So currently, if you add a field like this (I'll use the newer notation as it's easier to type);

    doc << Field.new(:field, "data...", :index => :yes,
                     :term_vector => :with_positions_offsets)

and later add a field like this;

    doc << Field.new(:field, "diff...", :index => :no, :term_vector => :no)

this field will be indexed and its term vectors will be stored regardless. This is good because if you are using TermVectors in a particular field then you probably expect them to be there for all instances of that field. The problem is that earlier documents will have been added without storing term vectors. Now, I don't know the exact thinking behind these rules, but it seems to me that it would be better to just keep whatever rule you used when you first added the document. If you want to add term vectors later, then re-index.

So here's my radical API change proposal. You set a field's properties when you create the index, and Document becomes (almost) a simple Hash object. Actually, you may not have realized this, but you can almost do this currently in Ferret. Once you add the first instance of a field, that field's properties are set. From then on you can just add documents as Hash objects, and each field will have the same properties as in that first document that was added. (This isn't true of the Store or boost properties. These are set on a per-document basis.)

So here is a possible example of the way I'd implement this;

    # the following might even look better in a YAML file.
    field_props = {
      :default => {:store => :no, :index => :tokenized, :term_vector => :no},
      :fields => {
        :id    => {:store => :yes, :index => :no},
        :title => {:store => :yes, :term_vector => :with_positions_offsets},
        [:created_on, :updated_on] => {:store => :yes, :index => :untokenized}
      }
    }
    index = Index.new(:field_properties => field_props)

    # ...
    # And if later, you want to add a new field
    index.add_field(:image, {:store => :compressed, :index => :no})

Now you would just create Hashes instead of Documents. The only exception would be if you needed to set the boost for a particular field or document.
So you would have this;

    index << {:title => "title", :data => "data..."}
    # boost a field
    index << {:title => Field.new("important title", 50.0), :data => "normal data"}
    # boost a document
    index << Document.new({:title => "important doc", :data => "data"}, 100.0)

So what do you all think? These are just ideas at the moment and it'd be a while before I could actually implement them. And don't worry, I'll do my best to keep backwards compatibility. Please give me your feedback.

Cheers,
Dave

[1] - http://wiki.apache.org/jakarta-lucene/LucyProposal
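To make the :default/:fields lookup above concrete, here is a minimal sketch of how an index might resolve a field's properties -- the FieldProperties class and for_field method are invented for illustration, not actual Ferret API. Note how an Array key like [:created_on, :updated_on] expands to one entry per field:

    # Minimal sketch: resolve field properties with a :default fallback.
    # Hypothetical helper, not part of Ferret's real API.
    class FieldProperties
      def initialize(props)
        @default = props[:default] || {}
        @fields  = {}
        (props[:fields] || {}).each do |key, spec|
          # An Array key like [:created_on, :updated_on] defines several
          # fields with the same properties.
          Array(key).each { |name| @fields[name] = spec }
        end
      end

      # Per-field settings win; anything unspecified comes from :default.
      def for_field(name)
        @default.merge(@fields[name] || {})
      end
    end

    props = FieldProperties.new(field_props)
    props.for_field(:id)      #=> {:store => :yes, :index => :no, :term_vector => :no}
    props.for_field(:unknown) #=> {:store => :no, :index => :tokenized, :term_vector => :no}

The important design point is the last line: a field that was never declared still gets sensible defaults, which is what keeps a scheme like this compatible with dynamically appearing fields.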
Posted it first from my gmail address (with which I'm not subscribed to the mailing list) and therefore got an approval message. So here it comes again:

Hi Dave,

First of all: congrats to you and Marvin for being approved by the ASF. I think this is great because it ensures an even more prosperous future for Lucene ports to dynamic languages. The Apache license is a great one and ensures true open source and all the freedom that any user of the C core would want for his projects, regardless of whether they are free, private, commercial or whatever.

Apache is providing great software. I wouldn't know what I would have done in many projects without their webserver, XML libraries, Lucene, Nutch et al. This must be a great place to be for a software developer; especially working with a "search legend" like Doug Cutting sounds like a great opportunity. Congrats again!

The notation you are explaining looks great. The field props should indeed stay in a YAML file; the notation above looks a little too stuffed IMHO. But I'm really looking forward to the index creation with hashes. I think the Rails crowd would love this, because it will look and feel Rails-ish to do things this way...

Regards
Jan
Neville Burnell
2006-Jun-05 02:33 UTC
[Ferret-talk] Proposal of some radical changes to API
Hi Dave,

Congrats on getting Lucy approved!

WRT the proposed Ferret API changes, is there a good reason you chose :yes/:no as opposed to true/false for some of the boolean settings?

Kind Regards

Neville Burnell
On 6/5/06, Neville Burnell <Neville.Burnell at bmsoft.com.au> wrote:

> WRT the proposed Ferret API changes, is there a good reason you chose
> :yes/:no as opposed to true/false for some of the boolean settings?

Hi Neville,

I don't know if it's a "good" reason. That's up to you to decide. The reason is that yes and no aren't the only options. For example, :store could be :yes, :no or :compressed, and :term_vector could be :yes, :no, :with_positions, :with_offsets or :with_positions_offsets. It would seem strange to me to have the choices true, false and :compressed. I hope that makes sense.

Cheers,
Dave
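As a quick illustration of why symbols scale better than booleans here, a hypothetical validation table for the options Dave mentions might look like this (the option values come from the proposal; the check itself is invented):

    # Illustrative only: the legal symbols per option under the proposed
    # API. true/false couldn't express :compressed or the term-vector
    # variants.
    FIELD_OPTIONS = {
      :store       => [:no, :yes, :compressed],
      :index       => [:no, :yes, :untokenized, :tokenized],
      :term_vector => [:no, :yes, :with_positions, :with_offsets,
                       :with_positions_offsets]
    }

    def check_field_options(opts)
      opts.each do |key, value|
        legal = FIELD_OPTIONS[key] or raise ArgumentError, "unknown option #{key}"
        unless legal.include?(value)
          raise ArgumentError,
                "#{value.inspect} is not one of #{legal.inspect} for #{key}"
        end
      end
    end

    check_field_options(:store => :compressed, :index => :no)  # fine
    check_field_options(:store => true)                        # raises ArgumentError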
Neville Burnell
2006-Jun-05 03:25 UTC
[Ferret-talk] Proposal of some radical changes to API
>> The reason is that yes and no aren't the only options.

Yep, that's a good reason <grin>

I thought that was the case, but I wasn't sure.

Cheers,
Neville
Marvin Humphrey
2006-Jun-05 04:35 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 3, 2006, at 8:42 PM, David Balmain wrote:

> Now that the Lucy[1] project has Apache approval and is about to
> begin, the onus is no longer on Ferret to strive for Lucene
> compatibility. (We'll be doing that in Lucy.)

We'll take this up more aggressively once some under-appreciated volunteers at Apache create mailing lists and other infrastructure for Lucy, but I doubt we'll want to have Lucy's API mirror Lucene's 100%. Do we really want separate Hits and HitIterator classes, for instance? Other, more substantial issues are on the table too as far as I'm concerned, such as whether deletions should be handled by the IndexReader rather than the IndexWriter. APIs are really hard to change once defined, and not taking hard-won lessons from Lucene into account would be a crime.

Lucy will definitely need a define-fields-once interface, so although you're proposing stuff specifically for Ferret here, I'm studying it with an eye towards using it with Lucy. My inclination is to start with define-fields-once, then add dynamic field definitions later if we have to.

> Of course, this raises the question: why do we need to specify that
> field :name is tokenized every time we create a :name field? Isn't it
> always going to be the same?

The primary argument for allowing dynamic field definitions I've seen is not that the definition might change, but that each document might contain previously undefined fields which are unknowable in advance. The CNET/Solr folks, Yonik and Hoss, really, really care about that.

I think the idea of dynamic field definitions is weird. (A database that allows you to change the table definition with each INSERT? Huh?) I'm sure that CNET could have been done another way if dynamic field definitions hadn't been available, but they're committed now. :(

> What if we use a different value the next time we add a :name field?
> Well, the answer to this last question is a specific set of rules;
>
> 1. Once you choose to index a field, that field is always indexed
>    from that point forward.
> 2. Once you store term vectors, always store term vectors.
> 3. Once you store positions, always store positions.
> 4. Once you store offsets, always store offsets.
> 5. Once you store norms, always store norms.

It's actually messier than that, isn't it? Just because you've started marking a field as indexed doesn't mean that Lucene goes back to all the documents that you've already processed and indexes that field. Same deal with TermVectors, etc.

At least in SQL, when you add a field to a table it goes and adds a default value for every row.

> The problem is that earlier documents will
> have been added without storing term vectors. Now, I don't know the
> exact thinking behind these rules, but it seems to me that it would be
> better to just keep whatever rule you used when you first added the
> document. If you want to add term vectors later, then re-index.

'Zactly!

> So here's my radical API change proposal. You set a field's properties
> when you create the index, and Document becomes (almost) a simple Hash
> object.

KinoSearch thinks of documents like hashes, too. Lucene, however, thinks of documents like arrays.

> Actually, you may not have realized this, but you can almost
> do this currently in Ferret. Once you add the first instance of a
> field, that field's properties are set. From then on you can just add
> documents as Hash objects, and each field will have the same properties
> as in that first document that was added. (This isn't true of the
> Store or boost properties. These are set on a per-document basis.)

Why not set Store once and for all per-field? And heck, why not start with a default boost, but allow it to be overridden?

> So here is a possible example of the way I'd implement this;
>
> # the following might even look better in a YAML file.

Ooo, nifty idea! How about a class whose sole purpose is to define fields and generate the YAML file? Or, if we're thinking future Lucene 2.1 file format, some Lucene-readable index definition file?

> field_props = {
>   :default => {:store => :no, :index => :tokenized, :term_vector => :no},
>   :fields => {
>     :id    => {:store => :yes, :index => :no},
>     :title => {:store => :yes, :term_vector => :with_positions_offsets},
>     [:created_on, :updated_on] => {:store => :yes, :index => :untokenized}
>   }
> }
> index = Index.new(:field_properties => field_props)

This is nice and dense, but maybe a tad complicated.

KinoSearch's take on doing field defs has some problems too. It was a mistake to make spec_field() a method of InvIndexer (KinoSearch's index writer/modifier class). The index writer and reader classes suffer from serious bloat no matter what, so anything that can be shunted somewhere else should be.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
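One way Marvin's default-boost-with-override idea might look in the proposed hash-based notation (purely illustrative; a :boost key inside field_props is not part of Dave's proposal as written):

    # Illustrative: a per-field default boost set at index creation,
    # overridable per document with an explicit Field object.
    field_props = {
      :default => {:store => :no, :index => :tokenized, :boost => 1.0},
      :fields  => {:title => {:store => :yes, :boost => 10.0}}
    }
    index = Index.new(:field_properties => field_props)

    index << {:title => "an ordinary title"}                 # boost 10.0 (field default)
    index << {:title => Field.new("a crucial title", 50.0)}  # boost 50.0 (override)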
On 6/5/06, Marvin Humphrey <marvin at rectangular.com> wrote:

> We'll take this up more aggressively once some under-appreciated
> volunteers at Apache create mailing lists and other infrastructure
> for Lucy, but I doubt we'll want to have Lucy's API mirror Lucene's
> 100%. Do we really want separate Hits and HitIterator classes, for
> instance? Other, more substantial issues are on the table too as far
> as I'm concerned, such as whether deletions should be handled by the
> IndexReader rather than the IndexWriter. APIs are really hard to
> change once defined, and not taking hard-won lessons from Lucene into
> account would be a crime.

Thanks for pointing that out. I couldn't agree with you more. What I meant was that Lucy would be striving to maintain "index file format" compatibility (which I believe was the plan). I didn't make this very clear, though: I was talking about changes to the API, but as I was writing this I was thinking about what changes to the index file format would allow.

> Lucy will definitely need a define-fields-once interface, so although
> you're proposing stuff specifically for Ferret here, I'm studying it
> with an eye towards using it with Lucy. My inclination is to start
> with define-fields-once, then add dynamic field definitions later if
> we have to.

This sounds good to me.

> The primary argument for allowing dynamic field definitions I've seen
> is not that the definition might change, but that each document might
> contain previously undefined fields which are unknowable in advance.
> The CNET/Solr folks, Yonik and Hoss, really, really care about that.
>
> I think the idea of dynamic field definitions is weird. (A database
> that allows you to change the table definition with each INSERT?
> Huh?) I'm sure that CNET could have been done another way if dynamic
> field definitions hadn't been available, but they're committed now. :(

Actually, I fall into the category of people who like dynamic field definitions. I agree that they are not necessary, but they certainly make some things easy. For instance, in a Rails application you can add models to an index, and you get to specify within the model itself which of its fields will be added to the index. The index itself doesn't need to know which models will be indexed or how they will be indexed; it just needs to know to store the id field and the model-name field and index everything else. It's all about keeping it DRY. The part I don't like about Lucene is *sometimes* being able to change a field's properties.

> Why not set Store once and for all per-field? And heck, why not
> start with a default boost, but allow it to be overridden?

My plan exactly. In my experimental version of Ferret I have a fields file along with the segments file. The fields file stores all the field metadata such as store, index, term-vector and field boosts. That way there is no need to maintain a separate FieldInfos file per segment. (This will make merging a lot more difficult, but I'm still thinking about that one.)

> Ooo, nifty idea! How about a class whose sole purpose is to define
> fields and generate the YAML file? Or, if we're thinking future
> Lucene 2.1 file format, some Lucene-readable index definition file?

Now this idea I like. Perhaps even a simple question/answer app to generate the index definition file. I'd guess that Lucene will probably end up going with XML rather than YAML.

Cheers,
Dave
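A rough sketch of the index-wide fields file Dave describes, where a field number, once assigned, is fixed for the life of the index -- the FieldRegistry class and its one-line-per-field file format are invented for illustration, and a real fields file would also persist the store/index/term-vector metadata:

    # Illustrative sketch of an index-wide fields file: one file records
    # every field, and a field's number, once assigned, never changes.
    class FieldRegistry
      def initialize(path)
        @path = path
        @numbers = {}
        if File.exist?(path)
          File.foreach(path) do |line|
            name, number = line.split("\t")
            @numbers[name] = number.to_i
          end
        end
      end

      # Return the field's permanent number, assigning the next free one
      # on first sight. Existing numbers are never reassigned.
      def field_number(name)
        @numbers[name.to_s] ||= @numbers.size
      end

      def save
        File.open(@path, 'w') do |f|
          @numbers.each { |name, num| f.puts("#{name}\t#{num}") }
        end
      end
    end

    registry = FieldRegistry.new('fields')
    registry.field_number(:title)  #=> 0
    registry.field_number(:body)   #=> 1
    registry.field_number(:title)  #=> 0  (stable across calls and sessions)
    registry.save

Because numbers only grow and are shared index-wide, merging segments never needs a per-segment field-number remapping, which is the expense Dave is trying to avoid.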
Marvin Humphrey
2006-Jun-05 16:48 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 4, 2006, at 10:46 PM, David Balmain wrote:

> What I meant was that Lucy would be striving to maintain "index file
> format" compatibility (which I believe was the plan).

It's funny that we haven't actually settled that. I used to think index compatibility was really important, but I don't so much any more.

Index compatibility is DOA unless Lucene adopts bytecounts as string headers, because it would be insanity for Lucy to deal with the current format. So we're talking compatibility no sooner than Lucene 2.1, and adapting Lucene will be a challenge. I think the only way to make up the lost speed is to bring in the KinoSearch merge model. I strongly suspect that that will prove to be a marked improvement over not just the patched version, but the current release. However... it's a lot of work, and I think I'm the only obvious candidate with both the expertise and (maybe) the desire to do it, unless you want to take it on.

Two stages out of four are complete. The bytecounts patch was stage 1, and last night I supplied stage 2: a Java port of KinoSearch's external sorting module. Stage 3 is adapting Lucene's indexing apparatus to write indexes by the segment rather than the document -- porting KinoSearch's SegWriter module and eliminating DocumentWriter and SegmentMerger would be a start. The last stage is adapting everything to be backwards compatible with char-counts as string headers.

I'm not sure that I want to dedicate that much of my time to Lucene, at least not right now. The changes outlined above are pretty major. It's likely that some bugs will get introduced simply because of the volume of code change, so that's an argument against making any change at all unless there's a real benefit. There would be -- the KinoSearch merge model is faster -- but politically speaking, selling the whole package to the Lucene community would be a PITA. Not only do I have to argue that the tangible benefits justify the disruption, I have to make the argument that it's not OK for compatibility to begin and end with Java[1][2], plus deal with outright hostility and abuse from extreme Java partisans[3].

I'd rather spend my time and energy contributing to Lucy. Besides, I think that ultimately, trying to be compatible with other ports would be as much of a drag on Lucy as on Lucene, and I think it's advisable for both projects to declare their file formats private. The Lucene file format is just too complex and difficult to serve as a good interchange medium. The only major reason for Lucy to be file-format-compatible with Lucene is Luke. IMO, if we want Luke's benefits, we should be hacking Luke.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] http://xrl.us/m2o3 (Link to mail-archives.apache.org)
[2] http://xrl.us/m2o7 (Link to mail-archives.apache.org)
[3] http://xrl.us/m2kp (Link to mail-archives.apache.org)
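Some context on the bytecounts issue: classic Lucene prefixes each string with a VInt count of Java chars, so a non-Java reader has to decode characters just to find where the string ends. With a byte count, the reader can grab the raw UTF-8 in one read. A minimal sketch of the proposed scheme (the VInt encoding follows Lucene's variable-length integers; the helper names are invented):

    # Sketch: write a string as a VInt *byte* count followed by raw
    # UTF-8 bytes, instead of Java's char count.
    def write_vint(io, n)
      # Lucene-style variable-length integer: 7 bits per byte, with the
      # high bit set on every byte except the last.
      while n >= 0x80
        io.putc((n & 0x7f) | 0x80)
        n >>= 7
      end
      io.putc(n)
    end

    def write_string(io, str)
      utf8 = str.encode('UTF-8')
      write_vint(io, utf8.bytesize)  # byte count, not char count
      io.write(utf8)
    end

    require 'stringio'
    io = StringIO.new
    write_string(io, "porridge")
    io.string.bytes.first  #=> 8: a reader can now slurp exactly 8 bytes
                           #   without decoding UTF-8 to count characters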
Marvin Humphrey
2006-Jun-05 17:28 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 4, 2006, at 10:46 PM, David Balmain wrote:

> In my experimental version of Ferret I have a fields
> file along with the segments file. The fields file stores all the
> field metadata such as store, index, term-vector and field boosts.
> That way there is no need to maintain a separate FieldInfos file per
> segment. (This will make merging a lot more difficult, but I'm still
> thinking about that one.)

Robert Kirchgessner made a similar proposal:

http://xrl.us/m2qq (Link to mail-archives.apache.org)

Robert addresses the merging issue in a subsequent email, and I think his arguments are compelling.

IMO, field defs should be immutable and consistent over the entire index.

>>> So here is a possible example of the way I'd implement this;
>>>
>>> # the following might even look better in a YAML file.
>>
>> Ooo, nifty idea! How about a class whose sole purpose is to define
>> fields and generate the YAML file? Or, if we're thinking future
>> Lucene 2.1 file format, some Lucene-readable index definition file?
>
> Now this idea I like. Perhaps even a simple question/answer app to
> generate the index definition file. I'd guess that Lucene will
> probably end up going with XML rather than YAML.

I think it would be a binary file, using Lucene's standard writeString, writeVInt, etc. methods. A question/answer app could easily be built based around a module.

How does "IndexCreator" sound? Take away the ability of the IndexWriter module to create or redefine indexes, and encapsulate that functionality within one module. Using Java as our lingua franca...

    IndexCreator creator = new IndexCreator(filePath);
    FieldDefinition titleDef = new FieldDefinition("title",
        Field.Store.YES, Field.Index.TOKENIZED);
    FieldDefinition bodyDef = new FieldDefinition("body",
        Field.Store.YES, Field.Index.TOKENIZED,
        Field.TermVector.YES);
    creator.addFieldDefinition(titleDef);
    creator.addFieldDefinition(bodyDef);
    creator.createIndex();

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
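For the Ferret side of the thread, the same idea might be rendered in Ruby as follows -- IndexCreator and FieldDefinition are names from Marvin's sketch, not an existing API, and the stand-in definitions below exist only so the sketch runs:

    # Hypothetical Ruby rendering of the IndexCreator sketch, using the
    # symbol notation from Dave's proposal.
    FieldDefinition = Struct.new(:name, :props)

    class IndexCreator
      def initialize(path)
        @path = path
        @defs = []
      end

      def add_field_definition(fdef)
        @defs << fdef
      end

      def create_index
        # A real implementation would write a field-definitions file at
        # @path; freezing here just makes the define-once contract explicit.
        @defs.freeze
      end
    end

    creator = IndexCreator.new("/path/to/index")
    creator.add_field_definition(
      FieldDefinition.new(:title, :store => :yes, :index => :tokenized))
    creator.add_field_definition(
      FieldDefinition.new(:body,  :store => :yes, :index => :tokenized,
                                  :term_vector => :yes))
    creator.create_index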
Do you mean that all fields would have to be known at index creation time, or just that once a field is defined its properties are the same across all documents? Right now I'm indexing documents that create new fields as needed based on user-defined properties, so we don't know all the fields initially.
Marvin Humphrey
2006-Jun-06 18:21 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 6, 2006, at 10:11 AM, Lee Marlow wrote:

> Do you mean that all fields would have to be known at index creation
> time, or just that once a field is defined its properties are the same
> across all documents? Right now I'm indexing documents that create
> new fields as needed based on user-defined properties, so we don't
> know all the fields initially.

How would you handle this if you were using an SQL database rather than Ferret? Your app wouldn't be able to modify the table on the fly in that case, unless you did something insane like run a remote "ALTER TABLE" command.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Hi Marvin,

This statement tempted me to jump in, even without using something like dynamic field creation myself __right now__. But I have been, especially on CMS-like projects, badly in need of dynamic fields.

That something isn't common in SQL doesn't mean that there is no need for this "something". This limitation of SQL is the reason for doing things like storing XML in relational DBs, as well as the reason for people using object DBs. I don't know if you had a look at Dabble DB, but imagine something like this with a relational DBMS. Not funny! Because of this they haven't even thought about using SQL for Dabble DB. So maybe it's just me, but the argument "you can't do this in SQL either" doesn't sound too convincing...

Cheers,
Jan
Marvin Humphrey
2006-Jun-06 21:07 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 6, 2006, at 11:37 AM, Jan Prill wrote:

> This statement tempted me to jump in, even without using something
> like dynamic field creation myself __right now__. But I have been,
> especially on CMS-like projects, badly in need of dynamic fields.
>
> That something isn't common in SQL doesn't mean that there is no
> need for this "something". This limitation of SQL is the reason for
> doing things like storing XML in relational DBs, as well as the
> reason for people using object DBs. I don't know if you had a look
> at Dabble DB, but imagine something like this with a relational
> DBMS. Not funny! Because of this they haven't even thought about
> using SQL for Dabble DB. So maybe it's just me, but the argument
> "you can't do this in SQL either" doesn't sound too convincing...

Jan, I don't understand the requirement, and I'm not familiar with either Dabble DB or Rails, so neither that example nor the "models" example Dave cited earlier has spoken to me. I asked the question because I honestly wanted to see a concrete example of an application that couldn't be handled within the constraint of pre-defined fields.

Behind the scenes in Lucene is an elaborate, expensive apparatus for dealing with dynamic fields. Each document gets turned into its own miniature inverted index, complete with its own FieldInfos, FieldsWriter, DocumentWriter, TermInfosWriter, and so on. When these mini-indexes get merged, field definitions have to be reconciled. This merge stage is one of the bottlenecks which slow down interpreted-language ports of Lucene so severely, because there's a lot of object creation and destruction and a lot of method calls.

KinoSearch uses a fixed-field-definition model. Before you add any documents to an index, you have to tell the index writer about all the possible fields you might use. When you add the first document, it creates the FieldInfos, FieldsWriter, etc., which persist throughout the life of the index writer. Instead of reconciling field definitions each time a document gets added, the field defs are defined as invariant for that indexing session. This is much faster, because there is far less object creation and destruction, and far less disk shuffling as well -- no segment merging, therefore no movement of stored fields, term vectors, etc.

There are several possible ways to add dynamic fields back in to the fixed-field-def model. My main priority in doing so, if it proves to be necessary, is to keep table-alteration logic separate from insertion operations. Having the two conflated introduces needless complexity and computational expense at the back end. It's also just plain confusing -- if you accidentally forget to set OMIT_NORMS just once, all of a sudden that field is going to have norms for ever and ever, amen. I think the user ought to have absolute control over field definitions. Inserting a field with a conflicting definition ought to be an error.

Lucy is going to start with the KinoSearch merge model. I will do a better job of adding dynamic capabilities to it if you or someone else can articulate some specific examples of situations where static definitions would not suffice. I can think of a few tasks which would be slightly more convenient if new fields could be added on the fly, but maybe you can go one better and illustrate why dynamic field defs are essential.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
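A tiny sketch of the "conflicting definition is an error" behaviour Marvin argues for; spec_field is the KinoSearch method name he mentions elsewhere in the thread, while the conflict check itself is illustrative:

    # Illustrative: once a field is specced, re-speccing it with
    # different properties raises instead of silently "upgrading" it.
    class FieldDefError < StandardError; end

    class IndexWriter
      def initialize
        @field_defs = {}
      end

      def spec_field(name, props = {})
        if (existing = @field_defs[name])
          unless existing == props
            raise FieldDefError,
                  "field #{name} already defined as #{existing.inspect}"
          end
        else
          @field_defs[name] = props.freeze
        end
      end
    end

    writer = IndexWriter.new
    writer.spec_field(:body, :index => :tokenized)
    writer.spec_field(:body, :index => :tokenized)  # fine, identical
    writer.spec_field(:body, :index => :no)         # raises FieldDefError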
Neville Burnell
2006-Jun-06 23:33 UTC
[Ferret-talk] Proposal of some radical changes to API
>> I asked the question because I honestly wanted to see a concrete
>> example of an application that couldn't be handled within the
>> constraint of pre-defined fields.

My current application involves writing a web application which can search a Ferret index built from a SQL database.

The idea is that the customer supplies SQLs for, say, customers, suppliers, sales and purchases etc. The app then retrieves the rows from the datasource and indexes them using Ferret. The app provides both an HTML website as an interface to the index and an XML API which can be used by non-browser clients.

The field set is quite different for each SQL [and is essentially out of our control].

HTH,
Neville
On 6/7/06, Lee Marlow <lmarlow at yahoo.com> wrote:

> Do you mean that all fields would have to be known at index creation
> time, or just that once a field is defined its properties are the same
> across all documents? Right now I'm indexing documents that create
> new fields as needed based on user-defined properties, so we don't
> know all the fields initially.

Hi Lee,

Dynamic fields will definitely be remaining in Ferret. But, as you said, once a field is defined its properties are set for all documents. So in your case, you would set the default properties for a field to match those that you use for your user-defined fields. Otherwise you could use Index#add_field(<field properties>) to add a field with whatever properties you need. This functionality is going to exist in Ferret but not necessarily in Lucy. Could you describe in more detail what kind of user-defined properties you are indexing, to help convince Marvin that dynamic fields are a good thing?

Cheers,
Dave
On 6/7/06, Marvin Humphrey <marvin at rectangular.com> wrote:

> Behind the scenes in Lucene is an elaborate, expensive apparatus for
> dealing with dynamic fields. Each document gets turned into its own
> miniature inverted index, complete with its own FieldInfos,
> FieldsWriter, DocumentWriter, TermInfosWriter, and so on. When these
> mini-indexes get merged, field definitions have to be reconciled.
> This merge stage is one of the bottlenecks which slow down
> interpreted-language ports of Lucene so severely, because there's a
> lot of object creation and destruction and a lot of method calls.

The way I'm dealing with this now is by having all the field definitions in a single file. When a field is defined it gets assigned a field number which is set for the life of the index. Hence, dynamic fields without the expense.

> KinoSearch uses a fixed-field-definition model. Before you add any
> documents to an index, you have to tell the index writer about all
> the possible fields you might use. When you add the first document,
> it creates the FieldInfos, FieldsWriter, etc., which persist
> throughout the life of the index writer. Instead of reconciling
> field definitions each time a document gets added, the field defs are
> defined as invariant for that indexing session. This is much faster,
> because there is far less object creation and destruction, and far
> less disk shuffling as well -- no segment merging, therefore no
> movement of stored fields, term vectors, etc.

What happens when there are deletes? Which files should I look in to see how this works? I really need to get my head around the KinoSearch merge model.

> I think the user ought to have absolute control over field
> definitions. Inserting a field with a conflicting definition ought
> to be an error.

I mostly agree, but I don't think it is too expensive (computationally or with regard to complexity) to dynamically add unknown fields with default properties.

> Lucy is going to start with the KinoSearch merge model. I will do a
> better job of adding dynamic capabilities to it if you or someone
> else can articulate some specific examples of situations where static
> definitions would not suffice. I can think of a few tasks which
> would be slightly more convenient if new fields could be added on the
> fly, but maybe you can go one better and illustrate why dynamic field
> defs are essential.

Hopefully Lee will be able to describe his needs in a little more detail. I must admit that in most cases dynamic fields just make things a little easier, but you could do without them. Having said that, I don't think Ferret would be a very Ruby-like search library if it didn't allow dynamic fields. Ruby allows me to add methods not only to the core classes but also to already instantiated objects. Coming from a language that didn't allow you to do things like this, you'd probably think this feature is totally unnecessary.

Earlier I said I'd be using Hashes as documents. Here is an example of how I could add lazy loading to documents in Ferret:

    def get_doc(doc_num)
      doc = {}
      class <<doc
        attr_accessor :ferret_index, :ferret_doc_num
        def [](key)
          if val = super
            return val
          else
            # lazily load the field from the index and cache it
            return self[key] =
              @ferret_index.get_doc_field(@ferret_doc_num, key)
          end
        end
      end
      doc.ferret_index = self
      doc.ferret_doc_num = doc_num
      return doc
    end

This example may be difficult to understand coming from Perl. Basically, what it does is return an empty Hash object when get_doc is called. Now, whenever you reference a field in that Hash object, for example doc[:title], it lazily loads that field from the index. All other Hash objects are unaffected. Perhaps you can do this sort of thing in Perl also, but I suspect it's a lot more common in Ruby. A language like this definitely deserves a search library with dynamic fields. Not necessarily because they solve an otherwise impossible problem, but because they make other problems much easier to solve.
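Usage of the sketch above would then look something like this (get_doc_field being the hypothetical accessor it relies on):

    doc = index.get_doc(42)  # an empty Hash; no fields loaded yet
    doc[:title]              # first access fetches :title from the index
    doc[:title]              # later accesses hit the cached value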
We index properties for products that vary from product to product. For instance, a shoe could have a color field with values of red, blue and green. It would also have a size field with 3,4,5,6,7,8,9,10 for values. Another product could be a car with a transmission field with values automatic and manual.

I index all the properties into their own field as well as dump them into another generic field for searching. In the database we have a property_types table where size, color, and transmission go. Then there is a many-to-many table from that to the products table that holds the actual values of those properties (e.g. automatic, manual, red, green, 8, 9, etc.)

I hope that helps explain it.

-Lee
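In the proposed hash-as-document notation, Lee's pattern might look like this; the field names and the catch-all :properties field are illustrative:

    # Illustrative: each user-defined property becomes its own field,
    # and every value is also dumped into one generic catch-all field.
    def product_to_doc(product)
      doc = {:name => product[:name]}
      product[:properties].each do |prop, values|
        doc[prop] = values.join(" ")   # e.g. :color => "red blue green"
      end
      doc[:properties] = product[:properties].values.flatten.join(" ")
      doc
    end

    shoe = {:name => "sneaker",
            :properties => {:color => %w(red blue green),
                            :size  => %w(3 4 5 6 7 8 9 10)}}
    product_to_doc(shoe)
    #=> {:name => "sneaker", :color => "red blue green",
    #    :size => "3 4 5 6 7 8 9 10",
    #    :properties => "red blue green 3 4 5 6 7 8 9 10"}

Adding product_to_doc(shoe) to the index would create the :color and :size fields dynamically, which is exactly the capability under debate.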
Marvin Humphrey
2006-Jun-07 04:17 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 6, 2006, at 6:08 PM, David Balmain wrote:

> What happens when there are deletes? Which files should I look in to
> see how this works? I really need to get my head around the KinoSearch
> merge model.

Let's say we're indexing a book. It has three pages.

    page 1 => "peas porridge hot"
    page 2 => "peas porridge cold"
    page 3 => "peas porridge in the pot, nine days old"

Here's what Lucene does. First, create a mini-inverted index for each page...

    hot      => 1
    peas     => 1
    porridge => 1

    cold     => 2
    peas     => 2
    porridge => 2

    days     => 3
    in       => 3
    nine     => 3
    old      => 3
    peas     => 3
    porridge => 3
    pot      => 3
    the      => 3

Then combine the indexes...

    cold     => 2
    days     => 3
    hot      => 1
    in       => 3
    nine     => 3
    old      => 3
    peas     => 1, 2, 3
    porridge => 1, 2, 3
    pot      => 3
    the      => 3

... and here's what KinoSearch does. First, dump everything into one giant pool...

    peas     => 1
    porridge => 1
    hot      => 1
    peas     => 2
    porridge => 2
    cold     => 2
    peas     => 3
    porridge => 3
    in       => 3
    pot      => 3
    the      => 3
    nine     => 3
    days     => 3
    old      => 3

... then sort the whole thing in one go. Make sense?

The big problem with the KinoSearch method is that you can't just keep dumping stuff into an array indefinitely -- you'll run out of memory, duh! So what you need is an object that looks like an array that you can keep dumping stuff into forever. Then you "sort" that "array". That's where the external sort algorithm comes in. The sortex object is basically a PriorityQueue of unlimited size, but which never occupies more than 20 or 30 megs of RAM, because it periodically sorts and flushes its payload to disk. It recovers that stuff from disk later -- in sorted order -- when it's in fetching mode.

If you want to spelunk KinoSearch to see how this happens, start with InvIndexer::add_doc(). After some minor fiddling, it feeds SegWriter::add_doc(). SegWriter goes through each field, having TokenBatch invert the field's contents, feeding the inverted and serialized but unordered postings into PostingsWriter (which is where the external sort object lives), and writing the norms byte. Last, SegWriter hands the Doc object to FieldsWriter so that it can write the stored fields.

The most important part of the previous chain is the step that never happened: nobody ever invoked SegmentMerger by calling the equivalent of Lucene's maybeMergeSegments(). There IS no SegmentMerger in KinoSearch.

The rest of the process takes place when InvIndexer::finish() gets called. This time, InvIndexer has a lot to do.

First, InvIndexer has to decide which segments need to be merged, if any, which it does using an algorithm based on the Fibonacci series. If there are segments that need mergin', InvIndexer feeds each one of them to SegWriter::add_segment(). SegWriter has DelDocs generate a doc map which maps around deleted documents (just like Lucene). Next it has FieldInfos reconcile the field defs and create a field number map, which maps field numbers from the segment that's about to get merged away to field numbers for the current segment. SegWriter merges the norms itself. Then it calls FieldsWriter::add_segment(), which reads fields off disk (without decompressing compressed fields, or creating document objects, or doing anything important except mapping to new field numbers) and writes them into their new home in the current segment. Last, SegWriter arranges for PostingsWriter::add_segment() to dump all the postings from the old segment into the current sort pool -- which *still* hasn't been sorted -- mapping to new field and document numbers as it goes. (Think of add_segment as add_doc on steroids.)
Now that all documents and all merge-worthy segments have been processed, it's finally time to deal with the sort pool. InvIndexer calls SegWriter::finish(), which calls PostingsWriter::finish(). PostingsWriter::finish() does a little bit in Perl, then hands off to a heavy-duty C routine that goes through the sort pool one posting at a time, writing the .frq and .prx files itself, and feeding TermInfosWriter so that it can write the .tis and .tii files. SegWriter::finish() also invokes closing routines for the FieldsWriter, the norms filehandles, and so on. Last, it writes the compound file. (For simplicity's sake, and because there isn't much benefit to using the non-compound format under the KinoSearch merge model, KinoSearch always uses the compound format.)

Now that all the writing is complete, InvIndexer has to commit the changes by rewriting the 'segments' file. One interesting aspect of the KinoSearch merge model is that no matter how many documents you add or segments you merge, if the process gets interrupted at any time up till that single commit, the index remains unchanged. In KinoSearch, InvIndexer handles deletions too (IndexReader isn't even a public class), and deletions -- at least those deletions which affect segments that haven't been optimized away -- are committed at the same moment. Deletable files are deleted if possible, the write lock is released... TADA! We're done.

... and since I spent so much time writing this up, I don't have time to respond to the other points. Check y'all later...

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
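A toy version of the sort-pool idea, in Ruby for consistency with the rest of the thread (KinoSearch's real sortex module is C and Perl; the buffer size and tab-separated posting format here are illustrative):

    require 'tempfile'

    # Toy external sort pool: buffer postings in RAM, sort and flush a
    # run to disk whenever the buffer fills, then merge the sorted runs
    # when fetching.
    class SortPool
      def initialize(max_buffered = 100_000)
        @max = max_buffered
        @buffer = []
        @runs = []
      end

      def add(posting)   # posting: e.g. "peas\t3"
        @buffer << posting
        flush if @buffer.size >= @max
      end

      # Yield every posting in sorted order by merging the sorted runs.
      def fetch
        flush
        files = @runs.map { |run| File.open(run.path) }
        heads = files.map { |f| f.gets&.chomp }
        until heads.compact.empty?
          i = heads.each_index.select { |j| heads[j] }.min_by { |j| heads[j] }
          yield heads[i]
          heads[i] = files[i].gets&.chomp
        end
      ensure
        files&.each(&:close)
      end

      private

      def flush
        return if @buffer.empty?
        run = Tempfile.new('sort_run')
        @buffer.sort.each { |p| run.puts(p) }
        run.flush
        run.rewind
        @runs << run
        @buffer = []
      end
    end

    pool = SortPool.new(4)               # tiny buffer to force several runs
    %w(peas porridge hot peas porridge cold peas).each_with_index do |term, i|
      pool.add("#{term}\t#{i / 3 + 1}")  # fake doc numbers
    end
    pool.fetch { |posting| puts posting }  # postings emerge fully sorted

The point of the design is that memory use is bounded by the buffer, no matter how many postings pass through, which is what lets SegWriter keep dumping documents and whole segments into the pool before a single final sort.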
Marvin Humphrey
2006-Jun-08 05:06 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 6, 2006, at 4:33 PM, Neville Burnell wrote:

>>> I asked the question because I honestly wanted to see a concrete
>>> example of an application that couldn't be handled within the
>>> constraint of pre-defined fields.
>
> My current application involves writing a web application which can
> search a ferret index built from a SQL database.
>
> The idea is that the customer supplies SQLs for say customers,
> suppliers, sales and purchases etc. The app then retrieves the rows
> from the datasource and indexes using Ferret. The app provides both a
> html website as an interface to the index, and also an XML api which
> can be used by non browser clients.
>
> The field set is quite different for each SQL [and is essentially out
> of our control].

So at what point does your app learn the structure of the SQL table? Would it work if you were to start each session by telling the index writer about the fields that were coming?

def connect(field_names)
  field_names.each do |field_name|
    index.spec_field(field_name) # use default properties
  end
end

def add_to_index(submission)
  index.add_hash_as_doc(submission)
end

I can imagine a scenario where that's not possible, and the fields may change on each insert. In that case, under the interface I envision, you'd have to do something like...

def add_to_index(submission)
  submission.each do |field_name, value|
    index.spec_field(field_name) # use default properties
  end
  index.add_hash_as_doc(submission)
end

FWIW, this stuff is happening anyway, behind the scenes. Essentially, every time you add a field to an index, Ferret asks, "Say, is this field indexed? And how about TermVectors, you want those?" The 10_000th time you add the field, Ferret asks, "This field wasn't indexed before -- have you changed your mind? OK, I'll check back again later."... 1_000_000th doc: "You sure? How about I make it indexed? Awwwww, c'mon... Hey, could you use some TermVectors?"

When it makes sense, of course you want to simplify the interface and hide the complexity inside the library. However, given that it's not possible to make coherent updates to existing data within a Lucene-esque file format, my argument is that field definitions should never change. So the repeated calls to spec_field above would be completely redundant -- you'd get an error if you ever tried to change the field def.

Your app would be a little less elegant, it's true (the performance impact would be somewhere between insignificant and tiny unless you had a zillion very short fields). However, I think the use case where the fields are not known in advance is the exception rather than the rule.

It would also be possible to use Dave's polymorphic hash-as-doc technique, where if the hash value is a Field object, you spec out the field definition using that Field object's properties -- you would just use full-on Field objects for each field. My argument would be, again, that the field definitions should not change. If you don't agree with that, and the definition has to be modifiable (within the current constraints), then that single-method technique is probably better. However, if the definition is not modifiable, then I'd argue it's cleaner to separate the two functions.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
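For the curious, the error-on-redefinition behavior Marvin describes might look something like this in Ruby. spec_field is his proposed method name, not an existing Ferret or Lucy API, and the Schema class and property hash here are invented for the example:

class FieldDefError < StandardError; end

class Schema
  DEFAULT_PROPS = { :store => :no, :index => :tokenized, :term_vector => :no }

  def initialize
    @fields = {}
  end

  def spec_field(name, props = {})
    props = DEFAULT_PROPS.merge(props)
    if (existing = @fields[name])
      return if existing == props  # redundant but harmless re-spec
      raise FieldDefError, "field definition for :#{name} cannot change"
    end
    @fields[name] = props
  end
end

schema = Schema.new
schema.spec_field(:title, :store => :yes)
schema.spec_field(:title, :store => :yes)  # fine: a no-op
schema.spec_field(:title)                  # raises FieldDefError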
On 6/8/06, Marvin Humphrey <marvin at rectangular.com> wrote:

> [...] my argument is that field definitions should never change. So
> the repeated calls to spec_field above would be completely redundant
> -- you'd get an error if you ever tried to change the field def.

I completely agree with you that field definitions should not change once they are set.
However, I don't think having the library add missing fields with a default set of properties (set when you create the index) adds too much complexity. You simply need to check whether the field already exists, and you have to look up the field number anyway. So, to add dynamic fields, just check that a valid field number was found and add the field if it wasn't. Of course, this is just as easy to implement in the binding code, so I don't mind whether it gets into the Lucy core or not. As long as you can add new fields to an index after documents have been added, I'm happy, and it seems from your example (nice Ruby code, by the way) that that is your plan.

Dave
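A rough Ruby illustration of the check Dave is describing -- the field-number lookup that has to happen for every field anyway doubles as the existence test. The class and method names here are made up for the example, not Ferret's internals:

class FieldInfos
  def initialize(default_props)
    @default_props = default_props
    @numbers = {}  # field name => field number
    @props   = []  # field number => field properties
  end

  # Look up the field number; if no valid number is found, the field is
  # new, so register it with the index-wide defaults.
  def field_number(name)
    @numbers[name] ||= begin
      @props << @default_props.dup
      @props.size - 1
    end
  end
end

infos = FieldInfos.new(:store => :no, :index => :tokenized)
infos.field_number(:title)  # => 0 (added on first sight)
infos.field_number(:body)   # => 1
infos.field_number(:title)  # => 0 (already known, no extra cost)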
Neville Burnell
2006-Jun-08 07:03 UTC
[Ferret-talk] Proposal of some radical changes to API
>> So at what point does your app learn the structure of the SQL table?

At the moment I know the structure after executing the SQL and fetching the first row [a ruby hash]. But the field set will change from SQL to SQL, and Ferret is doing all the field specification for me via hash-as-doc, a la:

def create
  @index = Ferret::Index::Index.new()
  conn = ODBC.connect(@odbc[:dsn], @odbc[:uid], @odbc[:pwd])
  @sqls.each do |sql|
    stmt = conn.prepare(sql)
    stmt.execute.each_hash { |row| @index << row }
    stmt.close
    stmt.drop
  end
  conn.disconnect
end

The field definitions do not change though, so I'm happy as long as the hash-as-doc support remains in Ferret.

Cheers,

Neville