Hi,

I have a model which has properties. These are your standard name/value pairs, but they also have attributes that affect how I want to store them in Ferret. I was using Ferret 0.9.5 with version 0.2 of aaf (acts_as_ferret), which seemed fine: I just copied and pasted (yes, I know, ick) the to_doc method and added code to iterate through the properties that the model had, adding the relevant fields to the document.

It seems that this will be a bit harder now with the FieldInfos. Has anyone else done this, and is there a recognised way of doing it?

David

-- Posted via http://www.ruby-forum.com/.
On Mon, Sep 18, 2006 at 05:28:45PM +0200, David Sheldon wrote:
> Hi,
>
> I have a model which has properties, these are your standard name/value
> pairs, but also have attributes that affect how I want to store them in
> ferret. I was using 0.9.5 with 0.2 of aaf, which seemed fine, I just
> copied and pasted (yes, I know, ick) the to_doc method and added code to
> iterate through the properties that that model had, and add relevant
> fields to the document.

Instead of copy and paste you could just call super:

    def to_doc
      doc = super
      # custom code here
      doc
    end

> It seems that this will be a bit harder now with the FieldInfos. Has
> anyone else done this, and is there a recognised way of doing it?

IMHO adding arbitrary fields should work; you just can't specify any special per-field storage/indexing options, since the defaults determined at index creation will be used. With aaf this means :store => :no, :index => :tokenize.

Changing the characteristics of a field for a particular document doesn't seem to be possible any more. Was that what you did until now, i.e. tokenize or store a field's value sometimes, and sometimes not?

Jens

--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66
Jens Kraemer wrote:
> instead of copy and paste you could just call super:
>
>   def to_doc
>     doc = super
>     # custom code here
>     doc
>   end

Ah, I had missed out on that; I don't really understand how super works in Ruby. I had been trying to rename the method and create a new one aliased to it, which didn't work. I'm still a bit confused: as to_doc is created by the mixin as an instance method, is there still a superclass version? Anyway, thanks for that tip, I'll try it.

> changing the characteristics of a field for a special document doesn't
> seem to be possible any more. Was that what you did until now, i.e.
> tokenize or store a field's value sometimes, and sometimes not?

Yes. Some are strings (tokenize), some are integers (don't tokenize, ideally use a different analyser), and some are choices from lists (either an untokenized String or treated as the integer index of the choice). Dates are treated as integers, and we may want to include some strings in the DB so they can be displayed in the search results.

David
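On the question of whether a "superclass version" exists when the method comes from a mixin: Ruby places an included module in the class's ancestor chain directly behind the class itself, so a method written in the class body can reach the module's version with super. The following is a minimal sketch in plain Ruby, with no Ferret or aaf involved. The module and class names are hypothetical stand-ins, and it assumes the plugin supplies to_doc through an included module rather than defining it directly on the class.

```ruby
# Hypothetical stand-in for the instance methods a plugin mixes in.
module IndexableInstanceMethods
  # Builds the base document; a plain Hash stands in for a Ferret document.
  def to_doc
    { :id => id }
  end
end

class Person
  include IndexableInstanceMethods

  attr_reader :id, :properties

  def initialize(id, properties)
    @id = id
    @properties = properties
  end

  # Defined in the class, so it is found before the module's to_doc.
  # super then continues the lookup into IndexableInstanceMethods.
  def to_doc
    doc = super
    properties.each { |name, value| doc[name.to_sym] = value }
    doc
  end
end

p Person.ancestors.first(2)
p Person.new(1, "name" => "Bob", "matriculation" => 1978).to_doc
```

If a plugin instead defined to_doc on the class itself (for example with define_method), a later definition in the class body would replace it and super would find nothing, which is why the way the method is supplied matters.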
On Tue, Sep 19, 2006 at 08:50:29AM +0200, David Sheldon wrote:
> Jens Kraemer wrote:
> > instead of copy and paste you could just call super:
> >
> >   def to_doc
> >     doc = super
> >     # custom code here
> >     doc
> >   end
>
> Ah, I had missed out on that, I don't really understand how super works
> in ruby. I had been trying to rename the method and create a new one
> aliased to it which didn't work. I'm still a bit confused as to_doc is
> created by the mixin as an instance method, is there still a superclass
> version? Anyway thanks for that tip, I'll try it.

Ah, good point. But this should still work if you do the override after calling acts_as_ferret.

> > changing the characteristics of a field for a special document doesn't
> > seem to be possible any more. Was that what you did until now, i.e.
> > tokenize or store a field's value sometimes, and sometimes not?
>
> Yes. Some are strings (tokenize), some are integers (don't tokenize,
> ideally use a different analyser), and some are choices from lists
> (either untokenized String or treat as integer index of choice). Dates
> are treated as integers, and we may want to include some strings in the
> DB so they can be displayed in the search results.

Difficult. You could declare one field per type of data (in terms of indexed/stored) that you might run into, and then decide in your to_doc which data has to go into which field. That doesn't sound really nice to me, but it might work. For searching you would then always have to search all these fields, of course.

Jens
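The "one field per type of data" idea can be sketched roughly as follows. This is plain Ruby only: the bucket field names (:prop_text, :prop_exact, :prop_number) and the routing rules are hypothetical, a Hash stands in for the document, and each bucket field would still have to be declared up front with the matching index options.

```ruby
# Hypothetical routing rule: pick a bucket field by the kind of value.
def bucket_for(value)
  case value
  when Integer then :prop_number  # numbers and dates, not tokenized
  when Symbol  then :prop_exact   # choice from a list, not tokenized
  else              :prop_text    # free text, tokenized
  end
end

# Collects every property value into the bucket field for its kind.
def route_properties(properties)
  doc = Hash.new { |hash, field| hash[field] = [] }
  properties.each_value { |value| doc[bucket_for(value)] << value }
  doc
end

p route_properties("name" => "Bob", "matriculation" => 1978, "college" => :trinity)
```

Note that the property name is lost in this arrangement, so a query such as matriculation:1978 is no longer possible without further encoding, and every search has to cover all the bucket fields.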
On 9/19/06, David Sheldon <david.sheldon at torchbox.com> wrote:
> Jens Kraemer wrote:
> > changing the characteristics of a field for a special document doesn't
> > seem to be possible any more. Was that what you did until now, i.e.
> > tokenize or store a field's value sometimes, and sometimes not?
>
> Yes. Some are strings (tokenize), some are integers (don't tokenize,
> ideally use a different analyser), and some are choices from lists
> (either untokenized String or treat as integer index of choice). Dates
> are treated as integers, and we may want to include some strings in the
> DB so they can be displayed in the search results.
>
> David

Hi David,

Is there any reason you need them all to be in the same field? Or am I misunderstanding you? You do realize that different fields can have different properties, right?

Cheers,
Dave
David Balmain wrote:
> Is there any reason you need them all to be in the same field? Or am I
> misunderstanding you? You do realize that different fields can have
> different properties right?

Yes, I want them all in different fields, named after the property. That way you could search for someone's name with 'name:Bob' or their year of matriculation with 'matriculation:1978'. The problem is that when the index is created I do not know what properties will be associated with users, so I cannot define their field infos. Previously I was able to just specify the properties when adding that field to the document.

David
Without reading the whole thread:

1. You know that users have properties, right?
2. These properties are like key-value pairs. One user could have a property like hobby: 'cars', another might have a property like place-of-birth: 'Hamburg, Germany'.
3. Users might build their property key-value pairs dynamically. You don't know which user chooses to tell you about which property.
4. Couldn't you use Ruby's reflection features to iterate over the properties a user has, and then put the key-value pairs into the index?
5. This would mean that the field list of the index might grow to a great number. I don't know how this would affect Ferret. It also means that you need to know which fields one is able to search on. You would need to build something like an extended search form with all of these fields, or tell the user which fields he can usefully put in his queries. He should also be told that the mere existence of a field does not mean every user has provided that information. Maybe only one user told you about his place-of-birth.

cheers,
Jan
IMHO the described problem of a growing field list is one of the reasons for the popularity of tags. Simply let the user choose how to tag himself, his question, comment, whatever, and don't care about the field. It's full-text search for a reason.

IMHO you've got two sides in things like this:

1. Predefine a field list that would be filled in by most users and is therefore valuable information for your search.
2. Choose tags for the stuff where users should be able to decide freely about the categorization.

cheers,
Jan
On 9/19/06, David Sheldon <david.sheldon at torchbox.com> wrote:
> David Balmain wrote:
>
> > Is there any reason you need them all to be in the same field? Or am I
> > misunderstanding you? You do realize that different fields can have
> > different properties right?
>
> Yes, I want them all in different fields, named after the property, that
> way you could search for someone's name by 'name:Bob' or their year of
> matriculation with 'matriculation:1978'. The problem is that on creation
> of the index I do not know what properties will be associated with users
> so cannot define their field infos. Previously I was able to just
> specify the properties when adding that field to the document.
>
> David

I'm assuming the matriculation field is always going to be a number. It won't change at a later date. So you can just set up the field whenever you use it for the first time.

    require 'rubygems'
    require 'ferret'

    i = Ferret::I.new
    puts i.field_infos
    if not i.field_infos[:matriculation]
      i.field_infos.add_field(:matriculation,
                              :index => :untokenized)
    end
    puts i.field_infos
    i << {:matriculation => 1978}

Of course you only need to do this for fields which vary from the norm. Whatever properties you instantiated the FieldInfos with will be used for fields added with the FieldInfos#add_field method unless otherwise specified. So if most of your fields are number or date fields you'd create the FieldInfos like this:

    fis = FieldInfos.new(:index => :untokenized_omit_norms,
                         :term_vector => :no)

Now when you add a text field you'll need to explicitly set it to tokenized and store term vectors:

    if not i.field_infos[:content]
      i.field_infos.add_field(:content,
                              :term_vector => :with_positions_offsets,
                              :index => :yes)
    end

Let me know if this helps or not.

Cheers,
Dave
David Balmain wrote:
> On 9/19/06, David Sheldon <david.sheldon at torchbox.com> wrote:
>> so cannot define their field infos. Previously I was able to just
>> specify the properties when adding that field to the document.
>>
>> David
>
> I'm assuming the matriculation field is always going to be a number.
> It won't change at a later date. So you can just set up the field
> whenever you use it for the first time.

I've considered this. I use aaf, and this requires the model that describes which fields are allowed on objects to have access to the indexed model's indexer, which isn't too bad. The only problem is when the index is created by something like rebuild_index, which needs to be extended to create all the extra fields.

I don't want to add the fields to fields_for_ferret, as that would mean calling #{fieldname}_for_ferret for each possible property, rather than taking the properties defined on that user and adding them.

Would the fields_for_ferret solution be the correct way, somehow populating this out of the database and then overriding the foo_to_ferret methods to look in a cache?

This was really easy with the old API. It seems a shame that it is so hard now.

David
David Balmain wrote:
> I'm assuming the matriculation field is always going to be a number.
> It won't change at a later date. So you can just set up the field
> whenever you use it for the first time.
>
>   require 'rubygems'
>   require 'ferret'
>   i = Ferret::I.new
>   puts i.field_infos
>   if not i.field_infos[:matriculation]
>     i.field_infos.add_field(:matriculation,
>                             :index => :untokenized)
>   end
>   puts i.field_infos
>   i << {:matriculation => 1978}

Oh, I didn't really read this last time.

It looks like this might be handy, although http://ferret.davebalmain.com/api/classes/Ferret/Index/Index.html only lists the IndexReader as having field_infos.

How much overhead would it be to write an "add_value" method that is called, say, 10 times per doc, which looks up the field we're going to add in the index, and adds it if it isn't already there?

Is this what the old code did anyway?

David
On 9/20/06, David Sheldon <david.sheldon at torchbox.com> wrote:
> David Balmain wrote:
>
> > I'm assuming the matriculation field is always going to be a number.
> > It won't change at a later date. So you can just set up the field
> > whenever you use it for the first time.
> >
> >   require 'rubygems'
> >   require 'ferret'
> >   i = Ferret::I.new
> >   puts i.field_infos
> >   if not i.field_infos[:matriculation]
> >     i.field_infos.add_field(:matriculation,
> >                             :index => :untokenized)
> >   end
> >   puts i.field_infos
> >   i << {:matriculation => 1978}
>
> Oh, I didn't really read this last time.
>
> It looks like this might be handy,
>
> http://ferret.davebalmain.com/api/classes/Ferret/Index/Index.html only
> lists the IndexReader as having the field_infos.
>
> How much overhead would it be to write an "add_value" method that is
> called, say 10 times per doc, which will lookup the field we're going to
> add in the index, and add it if it isn't already there?

Not a lot. It's a hash lookup, so it's fast, and it should be rare (after a while at least) that new fields are added, i.e. it's probably not going to happen for every document.

> Is this what the old code did anyway?
>
> David

The old code created a completely new FieldInfos object for every document you added to the index. It then merged the FieldInfos objects when the documents were merged. In other words, it was a lot more complex. This is one of the reasons for the API change. Even after adding the add_value method, I'd guess that the newer version of Ferret will still index a lot faster.

Cheers,
Dave
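The "add_value" idea discussed above could look roughly like this. It is a sketch only: FakeFieldInfos is a stand-in written for this example, not the Ferret class, and it mimics just the two calls that appear in the thread ([] and add_field). The add_value helper and its parameter list are likewise hypothetical, and a plain Hash stands in for the document.

```ruby
# Stand-in for the index's field registry; NOT the real Ferret FieldInfos.
class FakeFieldInfos
  def initialize(defaults = {})
    @defaults = defaults
    @fields = {}
  end

  # Hash lookup: returns the stored options, or nil for an unknown field.
  def [](name)
    @fields[name]
  end

  # Registers a field; explicit options override the index-wide defaults.
  def add_field(name, options = {})
    @fields[name] = @defaults.merge(options)
  end
end

# Hypothetical helper: register the field on first use, then set the value.
def add_value(doc, field_infos, name, value, options = {})
  field_infos.add_field(name, options) unless field_infos[name]
  doc[name] = value
  doc
end

infos = FakeFieldInfos.new(:index => :yes, :store => :no)
doc = {}
add_value(doc, infos, :matriculation, 1978, :index => :untokenized)
add_value(doc, infos, :name, "Bob")
p doc
p infos[:matriculation]
```

Only the first call for a given field pays for the registration; later calls cost one lookup, which matches the point that new fields should become rare once the index has seen most property names.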
David Sheldon wrote:
> How much overhead would it be to write an "add_value" method that is
> called, say 10 times per doc, which will lookup the field we're going to
> add in the index, and add it if it isn't already there?

OK, I've done this. But it was causing problems when called from rebuild_index, as there isn't an index at that point, and I was calling ferret_index on my model, which was creating a new index that couldn't get a write lock for my new fields.

I have solved this by giving to_doc an optional index parameter that is passed in when the rebuild is running; if it is nil, to_doc calls Model.ferret_index.

It seems like an incorrect separation of concerns for the index to be passed in to the to_doc method. Have you any suggestions on how to make this nicer?

David
Hi!

On Wed, Sep 20, 2006 at 11:22:52AM +0200, David Sheldon wrote:
> David Sheldon wrote:
>
> > How much overhead would it be to write an "add_value" method that is
> > called, say 10 times per doc, which will lookup the field we're going to
> > add in the index, and add it if it isn't already there?
>
> Ok, I've done this. But it was causing problems when called from
> rebuild_index, as there isn't an index at that point, and I was calling
> ferret_index on my model, which was creating a new index which couldn't
> get a write lock for my new fields.
>
> I have solved this by giving to_doc an optional index parameter that is
> passed in when rebuild is running, but if it is nil, it will call
> Model.ferret_index.
>
> It seems like an incorrect separation for the index to be passed in to
> the to_doc method. Have you any suggestions on how to make this nicer?

I could change the way rebuild_index works so that it uses and initializes the Ferret index instance returned by ferret_index. Then you could access the index instance in to_doc when it is called by rebuild_index, too.

Jens
Jens Kraemer wrote:
> I could change the way rebuild_index works so that it uses and
> initializes the Ferret index instance returned by ferret_index. So you
> could access the index instance in to_doc when being called by
> rebuild_index, too.

That sounds good.

The other thing I noticed was that if you want a field that is created by rebuild_index but isn't actually filled in by the standard to_doc, you can specify the field along with :ignore => true, for example { :index => :untokenized, :ignore => true }. I want to do this because there is a field that I want to include many times on a document, and returning an array from foo_for_ferret didn't add a field for each value.

David, are you supposed to be able to set several values for a field in the document?

Thanks for all your support, guys.

David
On 9/20/06, David Sheldon <david.sheldon at torchbox.com> wrote:
> David, are you supposed to be able to set several values for a field in
> the document?

I think I know what you are asking here but I'm not sure. You can do this in Ferret:

    index << {:content => "yada yada yada",
              :tags => ["ruby", "rails", "ferret"]}

So :tags has multiple values. But you can't do this:

    doc = Ferret::Document.new
    doc[:tag] = "ruby"
    doc[:tag] = "rails"
    doc[:tag] = "ferret"

You should do this:

    doc[:tag] = ["ruby", "rails", "ferret"]

Or this:

    doc[:tag] = ["ruby"]
    doc[:tag] << "rails"
    doc[:tag] << "ferret"

After all, Ferret::Document is just a Hash with a boost field. Perhaps I have just misunderstood you completely, so please let me know if I did.

Cheers,
Dave
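Because Ferret::Document is described above as just a Hash with a boost field, the difference between the two styles can be seen with a plain Hash. The sketch below uses no Ferret at all, and the two helper method names are made up for the illustration.

```ruby
# Repeated assignment to the same key: each one replaces the previous value.
def assign_repeatedly(tags)
  doc = {}
  tags.each { |tag| doc[:tag] = tag }
  doc
end

# Start from an array and append: every value is kept under the one key.
def assign_as_array(tags)
  doc = {}
  doc[:tag] = []
  tags.each { |tag| doc[:tag] << tag }
  doc
end

tags = ["ruby", "rails", "ferret"]
p assign_repeatedly(tags)  # only the last tag survives
p assign_as_array(tags)    # all three tags are kept, in order
```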
David Balmain wrote:
> So :tags has multiple values. But you can't do this:
>
>   doc = Ferret::Document.new
>   doc[:tag] = "ruby"
>   doc[:tag] = "rails"
>   doc[:tag] = "ferret"
>
> You should do this:
>
>   doc[:tag] = ["ruby", "rails", "ferret"]

That is exactly what I mean. And it looks like that is another way I can simplify my code with the new API: I can return an array from foo_for_ferret and have all the individual values counted. Previously I basically did

    networks.each { |net| doc << Field.new('network', net.name) }

Thanks.

David