Hey guys,

Now that the Lucy[1] project has Apache approval and is about to begin, the onus is no longer on Ferret to strive for Lucene compatibility. (We'll be doing that in Lucy.) So I'm starting to think about ways to improve Ferret's API. The first part that needs to be improved is the Document API. It's annoying having to type all the attributes to initialize a field just to change the boost. So this;

    field = Field.new(:name, "data...", Field::Store::YES,
                      Field::Index::TOKENIZED, Field::TermVector::NO,
                      false, 5.0)

would become;

    field = Field.new(:name, "data...",
                      :index => Field::Index::TOKENIZED, :boost => 5.0)

It'd also be nice to replace the Parameter objects with symbols;

    field = Field.new(:name, "data...", :index => :tokenized, :boost => 5.0)

Of course, this raises the question: why do we need to specify that field :name is tokenized every time we create a :name field? Isn't it always going to be the same? What if we use a different value the next time we add a :name field? Well, the answer to this last question is a specific set of rules;

  1. Once you choose to index a field, that field is always indexed
     from that point forward.
  2. Once you store term vectors, always store term vectors.
  3. Once you store positions, always store positions.
  4. Once you store offsets, always store offsets.
  5. Once you store norms, always store norms.

So currently, if you add a field like this (I'll use the newer notation as it's easier to type);

    doc << Field.new(:field, "data...", :index => :yes,
                     :term_vector => :with_positions_offsets)

and later add a field like this;

    doc << Field.new(:field, "diff...", :index => :no, :term_vector => :no)

this field will be indexed and its term vectors will be stored regardless. This is good because if you are using TermVectors in a particular field then you probably expect them to be there for all instances of that field. The problem is that earlier documents will have been added without storing term vectors. Now, I don't know the exact thinking behind these rules, but it seems to me that it would be better to just keep whatever rule you used when you first added the document. If you want to add term vectors later, then re-index.

So here's my radical API change proposal. You set a field's properties when you create the index, and Document becomes (almost) a simple Hash object. Actually, you may not have realized this, but you can almost do this currently in Ferret. Once you add the first instance of a field, that field's properties are set. From then on you can just add documents as Hash objects, and each field will have the same properties as in that first document that was added. (This isn't true of the Store or boost properties. These are set on a per-document basis.)

So here is a possible example of the way I'd implement this;

    # the following might even look better in a YAML file.
    field_props = {
      :default => {:store => :no, :index => :tokenized, :term_vector => :no},
      :fields => {
        :id    => {:store => :yes, :index => :no},
        :title => {:store => :yes, :term_vector => :with_positions_offsets},
        [:created_on, :updated_on] => {:store => :yes, :index => :untokenized}
      }
    }
    index = Index.new(:field_properties => field_props)

    # ...
    # And if later, you want to add a new field
    index.add_field(:image, {:store => :compressed, :index => :no})

Now you would just create Hashes instead of Documents. The only exception would be if you needed to set the boost for a particular field or document.
So you would have this;

    index << {:title => "title", :data => "data..."}
    # boost a field
    index << {:title => Field.new("important title", 50.0), :data => "normal data"}
    # boost a document
    index << Document.new({:title => "important doc", :data => "data"}, 100.0)

So what do you all think? These are just ideas at the moment and it'd be a while before I could actually implement them. And don't worry, I'll do my best to keep backwards compatibility. Please give me your feedback.

Cheers,
Dave

[1] - http://wiki.apache.org/jakarta-lucene/LucyProposal
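To make the :default/:fields lookup above concrete, here is a minimal sketch of how an index might resolve a field's properties -- the FieldProperties class and for_field method are invented for illustration, not actual Ferret API. Note how an Array key like [:created_on, :updated_on] expands to one entry per field:

    # Minimal sketch: resolve field properties with a :default fallback.
    # Hypothetical helper, not part of Ferret's real API.
    class FieldProperties
      def initialize(props)
        @default = props[:default] || {}
        @fields  = {}
        (props[:fields] || {}).each do |key, spec|
          # An Array key like [:created_on, :updated_on] defines several
          # fields with the same properties.
          Array(key).each { |name| @fields[name] = spec }
        end
      end

      # Per-field settings win; anything unspecified comes from :default.
      def for_field(name)
        @default.merge(@fields[name] || {})
      end
    end

    props = FieldProperties.new(field_props)
    props.for_field(:id)      #=> {:store => :yes, :index => :no, :term_vector => :no}
    props.for_field(:unknown) #=> {:store => :no, :index => :tokenized, :term_vector => :no}

The important design point is the last line: a field that was never declared still gets sensible defaults, which is what keeps a scheme like this compatible with dynamically appearing fields.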
Posted it first from my gmail address (with which I'm not subscribed to the mailing list) and therefore got an approval message. So here it comes again:

Hi Dave,

First of all: congrats to you and Marvin for being approved by the ASF. I think this is great because it ensures an even more prosperous future for Lucene ports to dynamic languages. The Apache license is a great one and ensures true open source and all the freedom that any user of the C core would want for his projects, regardless of whether they are free, private, commercial or whatever.

Apache is providing great software. I wouldn't know what I would have done in many projects without their webserver, XML libraries, Lucene, Nutch et al. This must be a great place to be for a software developer; especially working with a "search legend" like Doug Cutting sounds like a great opportunity. Congrats again!

The notation you are explaining looks great. The field props should indeed stay in a YAML file; the notation above looks a little too stuffed IMHO. But I'm really looking forward to the index creation with hashes. I think the Rails crowd would love this, because it will look and feel Rails-ish to do things this way...

Regards
Jan
Neville Burnell
2006-Jun-05 02:33 UTC
[Ferret-talk] Proposal of some radical changes to API
Hi Dave,

Congrats on getting Lucy approved!

WRT the proposed Ferret API changes, is there a good reason you chose :yes/:no as opposed to true/false for some of the boolean settings?

Kind Regards

Neville Burnell
On 6/5/06, Neville Burnell <Neville.Burnell at bmsoft.com.au> wrote:

> WRT the proposed Ferret API changes, is there a good reason you chose
> :yes/:no as opposed to true/false for some of the boolean settings?

Hi Neville,

I don't know if it's a "good" reason. That's up to you to decide. The reason is that yes and no aren't the only options. For example, :store could be :yes, :no or :compressed, and :term_vector could be :yes, :no, :with_positions, :with_offsets or :with_positions_offsets. It would seem strange to me to have the choices true, false and :compressed. I hope that makes sense.

Cheers,
Dave
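As a quick illustration of why symbols scale better than booleans here, a hypothetical validation table for the options Dave mentions might look like this (the option values come from the proposal; the check itself is invented):

    # Illustrative only: the legal symbols per option under the proposed
    # API. true/false couldn't express :compressed or the term-vector
    # variants.
    FIELD_OPTIONS = {
      :store       => [:no, :yes, :compressed],
      :index       => [:no, :yes, :untokenized, :tokenized],
      :term_vector => [:no, :yes, :with_positions, :with_offsets,
                       :with_positions_offsets]
    }

    def check_field_options(opts)
      opts.each do |key, value|
        legal = FIELD_OPTIONS[key] or raise ArgumentError, "unknown option #{key}"
        unless legal.include?(value)
          raise ArgumentError,
                "#{value.inspect} is not one of #{legal.inspect} for #{key}"
        end
      end
    end

    check_field_options(:store => :compressed, :index => :no)  # fine
    check_field_options(:store => true)                        # raises ArgumentError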
Neville Burnell
2006-Jun-05 03:25 UTC
[Ferret-talk] Proposal of some radical changes to API
>> The reason is that yes and no aren't the only options.

Yep, that's a good reason <grin>

I thought that was the case, but I wasn't sure.

Cheers,
Neville
Marvin Humphrey
2006-Jun-05 04:35 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 3, 2006, at 8:42 PM, David Balmain wrote:

> Now that the Lucy[1] project has Apache approval and is about to
> begin, the onus is no longer on Ferret to strive for Lucene
> compatibility. (We'll be doing that in Lucy.)

We'll take this up more aggressively once some under-appreciated volunteers at Apache create mailing lists and other infrastructure for Lucy, but I doubt we'll want to have Lucy's API mirror Lucene's 100%. Do we really want separate Hits and HitIterator classes, for instance? Other, more substantial issues are on the table too as far as I'm concerned, such as whether deletions should be handled by the IndexReader rather than the IndexWriter. APIs are really hard to change once defined, and not taking hard-won lessons from Lucene into account would be a crime.

Lucy will definitely need a define-fields-once interface, so although you're proposing stuff specifically for Ferret here, I'm studying it with an eye towards using it with Lucy. My inclination is to start with define-fields-once, then add dynamic field definitions later if we have to.

> Of course, this raises the question: why do we need to specify that
> field :name is tokenized every time we create a :name field? Isn't it
> always going to be the same?

The primary argument for allowing dynamic field definitions I've seen is not that the definition might change, but that each document might contain previously undefined fields which are unknowable in advance. The CNET/Solr folks, Yonik and Hoss, really, really care about that.

I think the idea of dynamic field definitions is weird. (A database that allows you to change the table definition with each INSERT? Huh?) I'm sure that CNET could have been done another way if dynamic field definitions hadn't been available, but they're committed now. :(

> What if we use a different value the next time we add a :name field?
> Well, the answer to this last question is a specific set of rules;
>
> 1. Once you choose to index a field, that field is always indexed
>    from that point forward.
> 2. Once you store term vectors, always store term vectors.
> 3. Once you store positions, always store positions.
> 4. Once you store offsets, always store offsets.
> 5. Once you store norms, always store norms.

It's actually messier than that, isn't it? Just because you've started marking a field as indexed doesn't mean that Lucene goes back to all the documents that you've already processed and indexes that field. Same deal with TermVectors, etc.

At least in SQL, when you add a field to a table it goes and adds a default value for every row.

> The problem is that earlier documents will
> have been added without storing term vectors. Now, I don't know the
> exact thinking behind these rules, but it seems to me that it would be
> better to just keep whatever rule you used when you first added the
> document. If you want to add term vectors later, then re-index.

'Zactly!

> So here's my radical API change proposal. You set a field's properties
> when you create the index, and Document becomes (almost) a simple Hash
> object.

KinoSearch thinks of documents like hashes, too. Lucene, however, thinks of documents like arrays.

> Actually, you may not have realized this, but you can almost
> do this currently in Ferret. Once you add the first instance of a
> field, that field's properties are set. From then on you can just add
> documents as Hash objects, and each field will have the same properties
> as in that first document that was added. (This isn't true of the
> Store or boost properties. These are set on a per-document basis.)

Why not set Store once and for all per-field? And heck, why not start with a default boost, but allow it to be overridden?

> So here is a possible example of the way I'd implement this;
>
> # the following might even look better in a YAML file.

Ooo, nifty idea! How about a class whose sole purpose is to define fields and generate the YAML file? Or, if we're thinking future Lucene 2.1 file format, some Lucene-readable index definition file?

> field_props = {
>   :default => {:store => :no, :index => :tokenized, :term_vector => :no},
>   :fields => {
>     :id    => {:store => :yes, :index => :no},
>     :title => {:store => :yes, :term_vector => :with_positions_offsets},
>     [:created_on, :updated_on] => {:store => :yes, :index => :untokenized}
>   }
> }
> index = Index.new(:field_properties => field_props)

This is nice and dense, but maybe a tad complicated.

KinoSearch's take on doing field defs has some problems too. It was a mistake to make spec_field() a method of InvIndexer (KinoSearch's index writer/modifier class). The index writer and reader classes suffer from serious bloat no matter what, so anything that can be shunted somewhere else should be.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
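One way Marvin's default-boost-with-override idea might look in the proposed hash-based notation (purely illustrative; a :boost key inside field_props is not part of Dave's proposal as written):

    # Illustrative: a per-field default boost set at index creation,
    # overridable per document with an explicit Field object.
    field_props = {
      :default => {:store => :no, :index => :tokenized, :boost => 1.0},
      :fields  => {:title => {:store => :yes, :boost => 10.0}}
    }
    index = Index.new(:field_properties => field_props)

    index << {:title => "an ordinary title"}                 # boost 10.0 (field default)
    index << {:title => Field.new("a crucial title", 50.0)}  # boost 50.0 (override)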
On 6/5/06, Marvin Humphrey <marvin at rectangular.com> wrote:

> We'll take this up more aggressively once some under-appreciated
> volunteers at Apache create mailing lists and other infrastructure
> for Lucy, but I doubt we'll want to have Lucy's API mirror Lucene's
> 100%. Do we really want separate Hits and HitIterator classes, for
> instance? Other, more substantial issues are on the table too as far
> as I'm concerned, such as whether deletions should be handled by the
> IndexReader rather than the IndexWriter. APIs are really hard to
> change once defined, and not taking hard-won lessons from Lucene into
> account would be a crime.

Thanks for pointing that out. I couldn't agree with you more. What I meant was that Lucy would be striving to maintain "index file format" compatibility (which I believe was the plan). I didn't make this very clear, though: I was talking about changes to the API, but as I was writing this I was thinking about what changes to the index file format would allow.

> Lucy will definitely need a define-fields-once interface, so although
> you're proposing stuff specifically for Ferret here, I'm studying it
> with an eye towards using it with Lucy. My inclination is to start
> with define-fields-once, then add dynamic field definitions later if
> we have to.

This sounds good to me.

> The primary argument for allowing dynamic field definitions I've seen
> is not that the definition might change, but that each document might
> contain previously undefined fields which are unknowable in advance.
> The CNET/Solr folks, Yonik and Hoss, really, really care about that.
>
> I think the idea of dynamic field definitions is weird. (A database
> that allows you to change the table definition with each INSERT?
> Huh?) I'm sure that CNET could have been done another way if dynamic
> field definitions hadn't been available, but they're committed now. :(

Actually, I fall into the category of people who like dynamic field definitions. I agree that they are not necessary, but they certainly make some things easy. For instance, in a Rails application you can add models to an index, and you get to specify within the model itself which of its fields will be added to the index. The index itself doesn't need to know which models will be indexed or how they will be indexed; it just needs to know to store the id field and the model-name field and index everything else. It's all about keeping it DRY. The part I don't like about Lucene is *sometimes* being able to change a field's properties.

> Why not set Store once and for all per-field? And heck, why not
> start with a default boost, but allow it to be overridden?

My plan exactly. In my experimental version of Ferret I have a fields file along with the segments file. The fields file stores all the field metadata such as store, index, term-vector and field boosts. That way there is no need to maintain a separate FieldInfos file per segment. (This will make merging a lot more difficult, but I'm still thinking about that one.)

> Ooo, nifty idea! How about a class whose sole purpose is to define
> fields and generate the YAML file? Or, if we're thinking future
> Lucene 2.1 file format, some Lucene-readable index definition file?

Now this idea I like. Perhaps even a simple question/answer app to generate the index definition file. I'd guess that Lucene will probably end up going with XML rather than YAML.

Cheers,
Dave
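A rough sketch of the index-wide fields file Dave describes, where a field number, once assigned, is fixed for the life of the index -- the FieldRegistry class and its one-line-per-field file format are invented for illustration, and a real fields file would also persist the store/index/term-vector metadata:

    # Illustrative sketch of an index-wide fields file: one file records
    # every field, and a field's number, once assigned, never changes.
    class FieldRegistry
      def initialize(path)
        @path = path
        @numbers = {}
        if File.exist?(path)
          File.foreach(path) do |line|
            name, number = line.split("\t")
            @numbers[name] = number.to_i
          end
        end
      end

      # Return the field's permanent number, assigning the next free one
      # on first sight. Existing numbers are never reassigned.
      def field_number(name)
        @numbers[name.to_s] ||= @numbers.size
      end

      def save
        File.open(@path, 'w') do |f|
          @numbers.each { |name, num| f.puts("#{name}\t#{num}") }
        end
      end
    end

    registry = FieldRegistry.new('fields')
    registry.field_number(:title)  #=> 0
    registry.field_number(:body)   #=> 1
    registry.field_number(:title)  #=> 0  (stable across calls and sessions)
    registry.save

Because numbers only grow and are shared index-wide, merging segments never needs a per-segment field-number remapping, which is the expense Dave is trying to avoid.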
Marvin Humphrey
2006-Jun-05 16:48 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 4, 2006, at 10:46 PM, David Balmain wrote:

> What I meant was that Lucy would be striving to maintain "index file
> format" compatibility (which I believe was the plan).

It's funny that we haven't actually settled that. I used to think index compatibility was really important, but I don't so much any more.

Index compatibility is DOA unless Lucene adopts bytecounts as string headers, because it would be insanity for Lucy to deal with the current format. So we're talking compatibility no sooner than Lucene 2.1, and adapting Lucene will be a challenge. I think the only way to make up the lost speed is to bring in the KinoSearch merge model. I strongly suspect that that will prove to be a marked improvement over not just the patched version, but the current release. However... it's a lot of work, and I think I'm the only obvious candidate with both the expertise and (maybe) the desire to do it, unless you want to take it on.

Two stages out of four are complete. The bytecounts patch was stage 1, and last night I supplied stage 2: a Java port of KinoSearch's external sorting module. Stage 3 is adapting Lucene's indexing apparatus to write indexes by the segment rather than the document -- porting KinoSearch's SegWriter module and eliminating DocumentWriter and SegmentMerger would be a start. The last stage is adapting everything to be backwards compatible with char-counts as string headers.

I'm not sure that I want to dedicate that much of my time to Lucene, at least not right now. The changes outlined above are pretty major. It's likely that some bugs will get introduced simply because of the volume of code change, so that's an argument against making any change at all unless there's a real benefit. There would be -- the KinoSearch merge model is faster -- but politically speaking, selling the whole package to the Lucene community would be a PITA. Not only do I have to argue that the tangible benefits justify the disruption, I have to make the argument that it's not OK for compatibility to begin and end with Java[1][2], plus deal with outright hostility and abuse from extreme Java partisans[3].

I'd rather spend my time and energy contributing to Lucy. Besides, I think that ultimately, trying to be compatible with other ports would be as much of a drag on Lucy as on Lucene, and I think it's advisable for both projects to declare their file formats private. The Lucene file format is just too complex and difficult to serve as a good interchange medium. The only major reason for Lucy to be file-format-compatible with Lucene is Luke. IMO, if we want Luke's benefits, we should be hacking Luke.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] http://xrl.us/m2o3 (Link to mail-archives.apache.org)
[2] http://xrl.us/m2o7 (Link to mail-archives.apache.org)
[3] http://xrl.us/m2kp (Link to mail-archives.apache.org)
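Some context on the bytecounts issue: classic Lucene prefixes each string with a VInt count of Java chars, so a non-Java reader has to decode characters just to find where the string ends. With a byte count, the reader can grab the raw UTF-8 in one read. A minimal sketch of the proposed scheme (the VInt encoding follows Lucene's variable-length integers; the helper names are invented):

    # Sketch: write a string as a VInt *byte* count followed by raw
    # UTF-8 bytes, instead of Java's char count.
    def write_vint(io, n)
      # Lucene-style variable-length integer: 7 bits per byte, with the
      # high bit set on every byte except the last.
      while n >= 0x80
        io.putc((n & 0x7f) | 0x80)
        n >>= 7
      end
      io.putc(n)
    end

    def write_string(io, str)
      utf8 = str.encode('UTF-8')
      write_vint(io, utf8.bytesize)  # byte count, not char count
      io.write(utf8)
    end

    require 'stringio'
    io = StringIO.new
    write_string(io, "porridge")
    io.string.bytes.first  #=> 8: a reader can now slurp exactly 8 bytes
                           #   without decoding UTF-8 to count characters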
Marvin Humphrey
2006-Jun-05 17:28 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 4, 2006, at 10:46 PM, David Balmain wrote:

> In my experimental version of Ferret I have a fields
> file along with the segments file. The fields file stores all the
> field metadata such as store, index, term-vector and field boosts.
> That way there is no need to maintain a separate FieldInfos file per
> segment. (This will make merging a lot more difficult, but I'm still
> thinking about that one.)

Robert Kirchgessner made a similar proposal:

http://xrl.us/m2qq (Link to mail-archives.apache.org)

Robert addresses the merging issue in a subsequent email, and I think his arguments are compelling.

IMO, field defs should be immutable and consistent over the entire index.

>>> So here is a possible example of the way I'd implement this;
>>>
>>> # the following might even look better in a YAML file.
>>
>> Ooo, nifty idea! How about a class whose sole purpose is to define
>> fields and generate the YAML file? Or, if we're thinking future
>> Lucene 2.1 file format, some Lucene-readable index definition file?
>
> Now this idea I like. Perhaps even a simple question/answer app to
> generate the index definition file. I'd guess that Lucene will
> probably end up going with XML rather than YAML.

I think it would be a binary file, using Lucene's standard writeString, writeVInt, etc. methods. A question/answer app could easily be built based around a module.

How does "IndexCreator" sound? Take away the ability of the IndexWriter module to create or redefine indexes, and encapsulate that functionality within one module. Using Java as our lingua franca...

    IndexCreator creator = new IndexCreator(filePath);
    FieldDefinition titleDef = new FieldDefinition("title",
        Field.Store.YES, Field.Index.TOKENIZED);
    FieldDefinition bodyDef = new FieldDefinition("body",
        Field.Store.YES, Field.Index.TOKENIZED,
        Field.TermVector.YES);
    creator.addFieldDefinition(titleDef);
    creator.addFieldDefinition(bodyDef);
    creator.createIndex();

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
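For the Ferret side of the thread, the same idea might be rendered in Ruby as follows -- IndexCreator and FieldDefinition are names from Marvin's sketch, not an existing API, and the stand-in definitions below exist only so the sketch runs:

    # Hypothetical Ruby rendering of the IndexCreator sketch, using the
    # symbol notation from Dave's proposal.
    FieldDefinition = Struct.new(:name, :props)

    class IndexCreator
      def initialize(path)
        @path = path
        @defs = []
      end

      def add_field_definition(fdef)
        @defs << fdef
      end

      def create_index
        # A real implementation would write a field-definitions file at
        # @path; freezing here just makes the define-once contract explicit.
        @defs.freeze
      end
    end

    creator = IndexCreator.new("/path/to/index")
    creator.add_field_definition(
      FieldDefinition.new(:title, :store => :yes, :index => :tokenized))
    creator.add_field_definition(
      FieldDefinition.new(:body,  :store => :yes, :index => :tokenized,
                                  :term_vector => :yes))
    creator.create_index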
Do you mean that all fields would have to be known at index creation time, or just that once a field is defined its properties are the same across all documents? Right now I'm indexing documents that create new fields as needed based on user-defined properties, so we don't know all the fields initially.
Marvin Humphrey
2006-Jun-06 18:21 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 6, 2006, at 10:11 AM, Lee Marlow wrote:

> Do you mean that all fields would have to be known at index creation
> time, or just that once a field is defined its properties are the same
> across all documents? Right now I'm indexing documents that create
> new fields as needed based on user-defined properties, so we don't
> know all the fields initially.

How would you handle this if you were using an SQL database rather than Ferret? Your app wouldn't be able to modify the table on the fly in that case, unless you did something insane like run a remote "ALTER TABLE" command.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Hi Marvin,

This statement tempted me to jump in, even without using something like dynamic field creation myself __right now__. But I have been, especially on CMS-like projects, badly in need of dynamic fields.

That something isn't common in SQL doesn't mean that there is no need for this "something". This limitation of SQL is the reason for doing things like storing XML in relational DBs, as well as the reason for people using object DBs. I don't know if you had a look at Dabble DB, but imagine something like this with a relational DBMS. Not funny! Because of this they haven't even thought about using SQL for Dabble DB. So maybe it's just me, but the argument "you can't do this in SQL either" doesn't sound too convincing...

Cheers,
Jan
Marvin Humphrey
2006-Jun-06 21:07 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 6, 2006, at 11:37 AM, Jan Prill wrote:

> This statement tempted me to jump in, even without using something
> like dynamic field creation myself __right now__. But I have been,
> especially on CMS-like projects, badly in need of dynamic fields.
>
> That something isn't common in SQL doesn't mean that there is no
> need for this "something". This limitation of SQL is the reason for
> doing things like storing XML in relational DBs, as well as the
> reason for people using object DBs. I don't know if you had a look
> at Dabble DB, but imagine something like this with a relational
> DBMS. Not funny! Because of this they haven't even thought about
> using SQL for Dabble DB. So maybe it's just me, but the argument
> "you can't do this in SQL either" doesn't sound too convincing...

Jan, I don't understand the requirement, and I'm not familiar with either Dabble DB or Rails, so neither that example nor the "models" example Dave cited earlier has spoken to me. I asked the question because I honestly wanted to see a concrete example of an application that couldn't be handled within the constraint of pre-defined fields.

Behind the scenes in Lucene is an elaborate, expensive apparatus for dealing with dynamic fields. Each document gets turned into its own miniature inverted index, complete with its own FieldInfos, FieldsWriter, DocumentWriter, TermInfosWriter, and so on. When these mini-indexes get merged, field definitions have to be reconciled. This merge stage is one of the bottlenecks which slow down interpreted-language ports of Lucene so severely, because there's a lot of object creation and destruction and a lot of method calls.

KinoSearch uses a fixed-field-definition model. Before you add any documents to an index, you have to tell the index writer about all the possible fields you might use. When you add the first document, it creates the FieldInfos, FieldsWriter, etc., which persist throughout the life of the index writer. Instead of reconciling field definitions each time a document gets added, the field defs are defined as invariant for that indexing session. This is much faster, because there is far less object creation and destruction, and far less disk shuffling as well -- no segment merging, therefore no movement of stored fields, term vectors, etc.

There are several possible ways to add dynamic fields back in to the fixed-field-def model. My main priority in doing so, if it proves to be necessary, is to keep table-alteration logic separate from insertion operations. Having the two conflated introduces needless complexity and computational expense at the back end. It's also just plain confusing -- if you accidentally forget to set OMIT_NORMS just once, all of a sudden that field is going to have norms for ever and ever, amen. I think the user ought to have absolute control over field definitions. Inserting a field with a conflicting definition ought to be an error.

Lucy is going to start with the KinoSearch merge model. I will do a better job of adding dynamic capabilities to it if you or someone else can articulate some specific examples of situations where static definitions would not suffice. I can think of a few tasks which would be slightly more convenient if new fields could be added on the fly, but maybe you can go one better and illustrate why dynamic field defs are essential.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
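A tiny sketch of the "conflicting definition is an error" behaviour Marvin argues for; spec_field is the KinoSearch method name he mentions elsewhere in the thread, while the conflict check itself is illustrative:

    # Illustrative: once a field is specced, re-speccing it with
    # different properties raises instead of silently "upgrading" it.
    class FieldDefError < StandardError; end

    class IndexWriter
      def initialize
        @field_defs = {}
      end

      def spec_field(name, props = {})
        if (existing = @field_defs[name])
          unless existing == props
            raise FieldDefError,
                  "field #{name} already defined as #{existing.inspect}"
          end
        else
          @field_defs[name] = props.freeze
        end
      end
    end

    writer = IndexWriter.new
    writer.spec_field(:body, :index => :tokenized)
    writer.spec_field(:body, :index => :tokenized)  # fine, identical
    writer.spec_field(:body, :index => :no)         # raises FieldDefError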
Neville Burnell
2006-Jun-06 23:33 UTC
[Ferret-talk] Proposal of some radical changes to API
>> I asked the question because I honestly wanted to see a concrete
>> example of an application that couldn't be handled within the
>> constraint of pre-defined fields.

My current application involves writing a web application which can search a Ferret index built from a SQL database.

The idea is that the customer supplies SQLs for, say, customers, suppliers, sales and purchases etc. The app then retrieves the rows from the datasource and indexes them using Ferret. The app provides both an HTML website as an interface to the index and an XML API which can be used by non-browser clients.

The field set is quite different for each SQL [and is essentially out of our control].

HTH,
Neville
On 6/7/06, Lee Marlow <lmarlow at yahoo.com> wrote:

> Do you mean that all fields would have to be known at index creation
> time, or just that once a field is defined its properties are the same
> across all documents? Right now I'm indexing documents that create
> new fields as needed based on user-defined properties, so we don't
> know all the fields initially.

Hi Lee,

Dynamic fields will definitely be remaining in Ferret. But, as you said, once a field is defined its properties are set for all documents. So in your case, you would set the default properties for a field to match those that you use for your user-defined fields. Otherwise you could use Index#add_field(<field properties>) to add a field with whatever properties you need. This functionality is going to exist in Ferret but not necessarily in Lucy. Could you describe in more detail what kind of user-defined properties you are indexing, to help convince Marvin that dynamic fields are a good thing?

Cheers,
Dave
On 6/7/06, Marvin Humphrey <marvin at rectangular.com> wrote:

> Behind the scenes in Lucene is an elaborate, expensive apparatus for
> dealing with dynamic fields. Each document gets turned into its own
> miniature inverted index, complete with its own FieldInfos,
> FieldsWriter, DocumentWriter, TermInfosWriter, and so on. When these
> mini-indexes get merged, field definitions have to be reconciled.
> This merge stage is one of the bottlenecks which slow down
> interpreted-language ports of Lucene so severely, because there's a
> lot of object creation and destruction and a lot of method calls.

The way I'm dealing with this now is by having all the field definitions in a single file. When a field is defined it gets assigned a field number which is set for the life of the index. Hence, dynamic fields without the expense.

> KinoSearch uses a fixed-field-definition model. Before you add any
> documents to an index, you have to tell the index writer about all
> the possible fields you might use. When you add the first document,
> it creates the FieldInfos, FieldsWriter, etc., which persist
> throughout the life of the index writer. Instead of reconciling
> field definitions each time a document gets added, the field defs are
> defined as invariant for that indexing session. This is much faster,
> because there is far less object creation and destruction, and far
> less disk shuffling as well -- no segment merging, therefore no
> movement of stored fields, term vectors, etc.

What happens when there are deletes? Which files should I look in to see how this works? I really need to get my head around the KinoSearch merge model.

> I think the user ought to have absolute control over field
> definitions. Inserting a field with a conflicting definition ought
> to be an error.

I mostly agree, but I don't think it is too expensive (computationally or with regard to complexity) to dynamically add unknown fields with default properties.

> Lucy is going to start with the KinoSearch merge model. I will do a
> better job of adding dynamic capabilities to it if you or someone
> else can articulate some specific examples of situations where static
> definitions would not suffice. I can think of a few tasks which
> would be slightly more convenient if new fields could be added on the
> fly, but maybe you can go one better and illustrate why dynamic field
> defs are essential.

Hopefully Lee will be able to describe his needs in a little more detail. I must admit that in most cases dynamic fields just make things a little easier, but you could do without them. Having said that, I don't think Ferret would be a very Ruby-like search library if it didn't allow dynamic fields. Ruby allows me to add methods not only to the core classes but also to already instantiated objects. Coming from a language that didn't allow you to do things like this, you'd probably think this feature is totally unnecessary.

Earlier I said I'd be using Hashes as documents. Here is an example of how I could add lazy loading to documents in Ferret:

    def get_doc(doc_num)
      doc = {}
      class <<doc
        attr_accessor :ferret_index, :ferret_doc_num
        def [](key)
          if val = super
            return val
          else
            # lazily load the field from the index and cache it
            return self[key] =
              @ferret_index.get_doc_field(@ferret_doc_num, key)
          end
        end
      end
      doc.ferret_index = self
      doc.ferret_doc_num = doc_num
      return doc
    end

This example may be difficult to understand coming from Perl. Basically, what it does is return an empty Hash object when get_doc is called. Now, whenever you reference a field in that Hash object, for example doc[:title], it lazily loads that field from the index. All other Hash objects are unaffected. Perhaps you can do this sort of thing in Perl also, but I suspect it's a lot more common in Ruby. A language like this definitely deserves a search library with dynamic fields. Not necessarily because they solve an otherwise impossible problem, but because they make other problems much easier to solve.
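Usage of the sketch above would then look something like this (get_doc_field being the hypothetical accessor it relies on):

    doc = index.get_doc(42)  # an empty Hash; no fields loaded yet
    doc[:title]              # first access fetches :title from the index
    doc[:title]              # later accesses hit the cached value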
We index properties for products that vary from product to product. For instance, a shoe could have a color field with values of red, blue and green. It would also have a size field with 3,4,5,6,7,8,9,10 for values. Another product could be a car with a transmission field with values automatic and manual.

I index all the properties into their own field as well as dump them into another generic field for searching. In the database we have a property_types table where size, color, and transmission go. Then there is a many-to-many table from that to the products table that holds the actual values of those properties (e.g. automatic, manual, red, green, 8, 9, etc.)

I hope that helps explain it.

-Lee
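In the proposed hash-as-document notation, Lee's pattern might look like this; the field names and the catch-all :properties field are illustrative:

    # Illustrative: each user-defined property becomes its own field,
    # and every value is also dumped into one generic catch-all field.
    def product_to_doc(product)
      doc = {:name => product[:name]}
      product[:properties].each do |prop, values|
        doc[prop] = values.join(" ")   # e.g. :color => "red blue green"
      end
      doc[:properties] = product[:properties].values.flatten.join(" ")
      doc
    end

    shoe = {:name => "sneaker",
            :properties => {:color => %w(red blue green),
                            :size  => %w(3 4 5 6 7 8 9 10)}}
    product_to_doc(shoe)
    #=> {:name => "sneaker", :color => "red blue green",
    #    :size => "3 4 5 6 7 8 9 10",
    #    :properties => "red blue green 3 4 5 6 7 8 9 10"}

Adding product_to_doc(shoe) to the index would create the :color and :size fields dynamically, which is exactly the capability under debate.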
Marvin Humphrey
2006-Jun-07 04:17 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 6, 2006, at 6:08 PM, David Balmain wrote:

> What happens when there are deletes? Which files should I look in to
> see how this works? I really need to get my head around the KinoSearch
> merge model.

Let's say we're indexing a book. It has three pages.

    page 1 => "peas porridge hot"
    page 2 => "peas porridge cold"
    page 3 => "peas porridge in the pot, nine days old"

Here's what Lucene does. First, create a mini-inverted index for each page...

    hot      => 1
    peas     => 1
    porridge => 1

    cold     => 2
    peas     => 2
    porridge => 2

    days     => 3
    in       => 3
    nine     => 3
    old      => 3
    peas     => 3
    porridge => 3
    pot      => 3
    the      => 3

Then combine the indexes...

    cold     => 2
    days     => 3
    hot      => 1
    in       => 3
    nine     => 3
    old      => 3
    peas     => 1, 2, 3
    porridge => 1, 2, 3
    pot      => 3
    the      => 3

... and here's what KinoSearch does. First, dump everything into one giant pool...

    peas     => 1
    porridge => 1
    hot      => 1
    peas     => 2
    porridge => 2
    cold     => 2
    peas     => 3
    porridge => 3
    in       => 3
    pot      => 3
    the      => 3
    nine     => 3
    days     => 3
    old      => 3

... then sort the whole thing in one go. Make sense?

The big problem with the KinoSearch method is that you can't just keep dumping stuff into an array indefinitely -- you'll run out of memory, duh! So what you need is an object that looks like an array that you can keep dumping stuff into forever. Then you "sort" that "array". That's where the external sort algorithm comes in. The sortex object is basically a PriorityQueue of unlimited size, but which never occupies more than 20 or 30 megs of RAM, because it periodically sorts and flushes its payload to disk. It recovers that stuff from disk later -- in sorted order -- when it's in fetching mode.

If you want to spelunk KinoSearch to see how this happens, start with InvIndexer::add_doc(). After some minor fiddling, it feeds SegWriter::add_doc(). SegWriter goes through each field, having TokenBatch invert the field's contents, feeding the inverted and serialized but unordered postings into PostingsWriter (which is where the external sort object lives), and writing the norms byte. Last, SegWriter hands the Doc object to FieldsWriter so that it can write the stored fields.

The most important part of the previous chain is the step that never happened: nobody ever invoked SegmentMerger by calling the equivalent of Lucene's maybeMergeSegments(). There IS no SegmentMerger in KinoSearch.

The rest of the process takes place when InvIndexer::finish() gets called. This time, InvIndexer has a lot to do.

First, InvIndexer has to decide which segments need to be merged, if any, which it does using an algorithm based on the Fibonacci series. If there are segments that need mergin', InvIndexer feeds each one of them to SegWriter::add_segment(). SegWriter has DelDocs generate a doc map which maps around deleted documents (just like Lucene). Next it has FieldInfos reconcile the field defs and create a field number map, which maps field numbers from the segment that's about to get merged away to field numbers for the current segment. SegWriter merges the norms itself. Then it calls FieldsWriter::add_segment(), which reads fields off disk (without decompressing compressed fields, or creating document objects, or doing anything important except mapping to new field numbers) and writes them into their new home in the current segment. Last, SegWriter arranges for PostingsWriter::add_segment() to dump all the postings from the old segment into the current sort pool -- which *still* hasn't been sorted -- mapping to new field and document numbers as it goes. (Think of add_segment as add_doc on steroids.)
Now that all documents and all merge-worthy segments have been processed, it's finally time to deal with the sort pool. InvIndexer calls SegWriter::finish(), which calls PostingsWriter::finish(). PostingsWriter::finish() does a little bit in Perl, then hands off to a heavy-duty C routine that goes through the sort pool one posting at a time, writing the .frq and .prx files itself, and feeding TermInfosWriter so that it can write the .tis and .tii files. SegWriter::finish() also invokes closing routines for the FieldsWriter, the norms filehandles, and so on. Last, it writes the compound file. (For simplicity's sake, and because there isn't much benefit to using the non-compound format under the KinoSearch merge model, KinoSearch always uses the compound format.)

Now that all the writing is complete, InvIndexer has to commit the changes by rewriting the 'segments' file. One interesting aspect of the KinoSearch merge model is that no matter how many documents you add or segments you merge, if the process gets interrupted at any time up till that single commit, the index remains unchanged. In KinoSearch, InvIndexer handles deletions too (IndexReader isn't even a public class), and deletions -- at least those deletions which affect segments that haven't been optimized away -- are committed at the same moment. Deletable files are deleted if possible, the write lock is released... TADA! We're done.

... and since I spent so much time writing this up, I don't have time to respond to the other points. Check y'all later...

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
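A toy version of the sort-pool idea, in Ruby for consistency with the rest of the thread (KinoSearch's real sortex module is C and Perl; the buffer size and tab-separated posting format here are illustrative):

    require 'tempfile'

    # Toy external sort pool: buffer postings in RAM, sort and flush a
    # run to disk whenever the buffer fills, then merge the sorted runs
    # when fetching.
    class SortPool
      def initialize(max_buffered = 100_000)
        @max = max_buffered
        @buffer = []
        @runs = []
      end

      def add(posting)   # posting: e.g. "peas\t3"
        @buffer << posting
        flush if @buffer.size >= @max
      end

      # Yield every posting in sorted order by merging the sorted runs.
      def fetch
        flush
        files = @runs.map { |run| File.open(run.path) }
        heads = files.map { |f| f.gets&.chomp }
        until heads.compact.empty?
          i = heads.each_index.select { |j| heads[j] }.min_by { |j| heads[j] }
          yield heads[i]
          heads[i] = files[i].gets&.chomp
        end
      ensure
        files&.each(&:close)
      end

      private

      def flush
        return if @buffer.empty?
        run = Tempfile.new('sort_run')
        @buffer.sort.each { |p| run.puts(p) }
        run.flush
        run.rewind
        @runs << run
        @buffer = []
      end
    end

    pool = SortPool.new(4)               # tiny buffer to force several runs
    %w(peas porridge hot peas porridge cold peas).each_with_index do |term, i|
      pool.add("#{term}\t#{i / 3 + 1}")  # fake doc numbers
    end
    pool.fetch { |posting| puts posting }  # postings emerge fully sorted

The point of the design is that memory use is bounded by the buffer, no matter how many postings pass through, which is what lets SegWriter keep dumping documents and whole segments into the pool before a single final sort.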
Marvin Humphrey
2006-Jun-08 05:06 UTC
[Ferret-talk] Proposal of some radical changes to API
On Jun 6, 2006, at 4:33 PM, Neville Burnell wrote:

>>> I asked the question because I honestly wanted to see a concrete
>>> example of an application that couldn't be handled within the
>>> constraint of pre-defined fields.
>
> My current application involves writing a web application which can
> search a ferret index built from a SQL database.
>
> The idea is that the customer supplies SQLs for say customers,
> suppliers, sales and purchases etc. The app then retrieves the rows
> from the datasource and indexes using Ferret. The app provides both a
> html website as an interface to the index, and also an XML api which
> can be used by non browser clients.
>
> The field set is quite different for each SQL [and is essentially out
> of our control].

So at what point does your app learn the structure of the SQL table? Would it work if you were to start each session by telling the index writer about the fields that were coming?

def connect(field_names)
  field_names.each do |field_name|
    index.spec_field(field_name) # use default properties
  end
end

def add_to_index(submission)
  index.add_hash_as_doc(submission)
end

I can imagine a scenario where that's not possible, and the fields may change on each insert. In that case, under the interface I envision, you'd have to do something like...

def add_to_index(submission)
  submission.each do |field_name, value|
    index.spec_field(field_name) # use default properties
  end
  index.add_hash_as_doc(submission)
end

FWIW, this stuff is happening anyway, behind the scenes. Essentially, every time you add a field to an index, Ferret asks, "Say, is this field indexed? And how about TermVectors, you want those?" The 10_000th time you add the field, Ferret asks, "This field wasn't indexed before -- have you changed your mind? OK, I'll check back again later."... 1_000_000th doc: "You sure? How about I make it indexed? Awwwww, c'mon... Hey, could you use some TermVectors?"

When it makes sense, of course you want to simplify the interface and hide the complexity inside the library. However, given that it's not possible to make coherent updates to existing data within a Lucene-esque file format, my argument is that field definitions should never change. So the repeated calls to spec_field above would be completely redundant -- you'd get an error if you ever tried to change the field def.

Your app would be a little less elegant, it's true (the performance impact would be somewhere between insignificant and tiny unless you had a zillion very short fields). However, I think the use case where the fields are not known in advance is the exception rather than the rule.

It would also be possible to use Dave's polymorphic hash-as-doc technique, where if the hash value is a Field object, you spec out the field definition using that Field object's properties -- you would just use full-on Field objects for each field. My argument would be, again, that the field definitions should not change. If you don't agree with that, and the definition has to be modifiable (within the current constraints), then that single-method technique is probably better. However, if the definition is not modifiable, then I'd argue it's cleaner to separate the two functions.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
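For the curious, the error-on-redefinition behavior Marvin describes might look something like this in Ruby. spec_field is his proposed method name, not an existing Ferret or Lucy API, and the Schema class and property hash here are invented for the example:

class FieldDefError < StandardError; end

class Schema
  DEFAULT_PROPS = { :store => :no, :index => :tokenized, :term_vector => :no }

  def initialize
    @fields = {}
  end

  def spec_field(name, props = {})
    props = DEFAULT_PROPS.merge(props)
    if (existing = @fields[name])
      return if existing == props  # redundant but harmless re-spec
      raise FieldDefError, "field definition for :#{name} cannot change"
    end
    @fields[name] = props
  end
end

schema = Schema.new
schema.spec_field(:title, :store => :yes)
schema.spec_field(:title, :store => :yes)  # fine: a no-op
schema.spec_field(:title)                  # raises FieldDefError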
On 6/8/06, Marvin Humphrey <marvin at rectangular.com> wrote:

> [...] my argument is that field definitions should never change. So
> the repeated calls to spec_field above would be completely redundant
> -- you'd get an error if you ever tried to change the field def.

I completely agree with you that field definitions should not change once they are set.
However, I don't think having the library add missing fields with a default set of properties (set when you create the index) adds too much complexity. You simply need to check whether the field already exists, and you have to look up the field number anyway. So, to add dynamic fields, just check that a valid field number was found and add the field if it wasn't. Of course, this is just as easy to implement in the binding code, so I don't mind whether it gets into the Lucy core or not. As long as you can add new fields to an index after documents have been added, I'm happy, and it seems from your example (nice Ruby code, by the way) that that is your plan.

Dave
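A rough Ruby illustration of the check Dave is describing -- the field-number lookup that has to happen for every field anyway doubles as the existence test. The class and method names here are made up for the example, not Ferret's internals:

class FieldInfos
  def initialize(default_props)
    @default_props = default_props
    @numbers = {}  # field name => field number
    @props   = []  # field number => field properties
  end

  # Look up the field number; if no valid number is found, the field is
  # new, so register it with the index-wide defaults.
  def field_number(name)
    @numbers[name] ||= begin
      @props << @default_props.dup
      @props.size - 1
    end
  end
end

infos = FieldInfos.new(:store => :no, :index => :tokenized)
infos.field_number(:title)  # => 0 (added on first sight)
infos.field_number(:body)   # => 1
infos.field_number(:title)  # => 0 (already known, no extra cost)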
Neville Burnell
2006-Jun-08 07:03 UTC
[Ferret-talk] Proposal of some radical changes to API
>> So at what point does your app learn the structure of the SQL table?

At the moment I know the structure after executing the SQL and fetching the first row [a ruby hash]. But the field set will change from SQL to SQL, and Ferret is doing all the field specification for me via hash-as-doc, a la:

def create
  @index = Ferret::Index::Index.new()
  conn = ODBC.connect(@odbc[:dsn], @odbc[:uid], @odbc[:pwd])
  @sqls.each do |sql|
    stmt = conn.prepare(sql)
    stmt.execute.each_hash { |row| @index << row }
    stmt.close
    stmt.drop
  end
  conn.disconnect
end

The field definitions do not change though, so I'm happy as long as the hash-as-doc support remains in Ferret.

Cheers,

Neville