Hi,

Apologies for reposting this for those who read this via ruby-forum, but it didn't make it to the list before, and the list seems more active...

I'm using ferret (via acts_as_ferret) in a somewhat unorthodox manner and am having a strange wildcard problem. Before anyone wonders why we're doing things this way, the answer is basically that it lets us precompute what would be expensive database queries and store the results in a simple way (ferret index) prior to pushing the static data to our production server.

Basically, I've got two (for the sake of simplicity) models, both of which are indexed on a similar (but separate) non-model field. However, one of those two models does not seem to get the proper number of results for a wildcard search:

First of all, there's a non-indexed model called ProductTuple that's got a supplier_id as well as a product_category_id and product_material_id as well as some other id fields that aren't really important here. Thus, a ProductTuple has foreign key relationships to Suppliers and ProductCategories and ProductMaterials, but for ferret purposes just think of those foreign keys as what they are - ids (e.g. integers).

The first model, Supplier, is ferret-indexed on several fields, such as the supplier name and supplier country, as well as the 'ferret_product_tuples' non-model field. ferret_product_tuples simply takes all the product tuples for a supplier and concatenates their product_category_id, product_material_id, etc. with delimiters.

So, for a product tuple with product_category_id 82, product_material_id 88, and undefined product_technique_id, the resulting part of the ferret_product_tuple string would look like x00082_00088_00000x (where we use 00000 to indicate null). The xs are used as anchors, essentially, as a given supplier's ferret_product_tuple string might look like 'x00082_00088_00000x x00000_00081_00013x'.
Now, the ferret query that gets constructed when we do the relevant queries simply looks like:

'ferret_product_tuple:x00082_?????_?????x'

and this would, in the above instance, match that supplier.

Everything I've described works _perfectly_, EXCEPT... we also index product_categories on this same string. So product category #82 would have a bunch of ferret_product_tuple strings that start out x00082 and have various things in the other positions.

Here's what's strange... a product_category query for 'ferret_product_tuple:x?????_?????_?????x' should return ALL product categories, right? Yet it only returns six. A product category query for 'ferret_product_tuple:x?????_00081_?????x' should return all the product categories that share product_tuples with product_material #81, but in fact returns only a small number of categories. Yet making the wildcard match MORE restrictive by substituting 'ferret_product_tuple:x00082_00081_?????x' into that query yields product_category #82, which is erroneously not included in the 6 results for 'ferret_product_tuple:x?????_00081_?????x'.

So, have I stumbled upon a bug in the wildcard handling? My initial thought was that the different analyzer I was using for the product_category index was the culprit, but I changed that analyzer out to no effect, so I've ruled that out.

Any ideas? Thanks!
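The tuple encoding and the wildcard pattern described in the post above can be sketched in plain Ruby. This is only an illustration under stated assumptions: the helper names (encode_tuple, tuple_pattern, wildcard_match?) are hypothetical and not taken from the poster's code, and the matcher is a regex stand-in for single-character wildcard matching, not Ferret itself.

```ruby
# Hypothetical helpers; names are illustrative, not from the original code.
NULL_ID = 0 # a missing id is written as 00000, as described in the post

# Encode one tuple of ids as an anchored, zero-padded term,
# e.g. encode_tuple(82, 88, nil) gives "x00082_00088_00000x".
def encode_tuple(*ids)
  "x" + ids.map { |id| format("%05d", id || NULL_ID) }.join("_") + "x"
end

# Build a wildcard pattern; nil means "any value" and becomes five ? characters.
def tuple_pattern(*ids)
  "x" + ids.map { |id| id ? format("%05d", id) : "?" * 5 }.join("_") + "x"
end

# Plain-Ruby stand-in for wildcard matching where ? matches exactly one character.
def wildcard_match?(pattern, term)
  regex = Regexp.new("\\A" + Regexp.escape(pattern).gsub("\\?", ".") + "\\z")
  regex.match?(term)
end

# A supplier with two product tuples, joined by a space so each tuple is one term.
field_value = [encode_tuple(82, 88, nil), encode_tuple(nil, 81, 13)].join(" ")
pattern = tuple_pattern(82, nil, nil)

puts field_value # x00082_00088_00000x x00000_00081_00013x
puts pattern     # x00082_?????_?????x
puts field_value.split.any? { |term| wildcard_match?(pattern, term) } # true
```

Because every id is padded to a fixed width, the position of each id inside a term is fixed, which is what lets a pattern pin one position and leave the others open.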
Hi!

Wildcard queries have a built-in upper limit on the number of terms they search for, which by default is set to 512 (according to http://ferret.davebalmain.com/api/classes/Ferret/Search/WildcardQuery.html).

So when you query for asdf*, Ferret expands this to all terms in your index starting with asdf, but stops after collecting 512 terms. It then retrieves all documents containing these 512 terms, and so misses any document that matches your query only through a term that wasn't collected in the first step.

Of course you can set the max_term count to a higher value, but in the long run this isn't really a solution. If I understand you correctly, your tuple field right now has a single term for each document, and that term is different for each document. Splitting up your tuple values into several different terms could help to reduce the number of terms that have to be fetched for a wildcard query.

Cheers,
Jens

On Mon, Nov 05, 2007 at 04:11:53PM -0500, Noah M. Daniels wrote:
> Here's what's strange... a product_category query for 'ferret_product_tuple:x?????_?????_?????x' should return ALL product categories, right? Yet it only returns six.
> [...]
> So, have I stumbled upon a bug in the wildcard handling?
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
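The effect of a cap on term expansion can be shown with a toy model in plain Ruby. This is a sketch under assumptions, not Ferret's implementation: the function name capped_wildcard_search is hypothetical, the index is a simple hash from term to document ids, and keeping the first terms in sorted order is an assumption about which terms survive the cap.

```ruby
# Toy model of capped wildcard expansion; this is not Ferret's implementation.
# index maps each term to the ids of the documents that contain it.
def capped_wildcard_search(index, regex, max_terms)
  matching_terms = index.keys.sort.select { |term| regex.match?(term) }
  expanded = matching_terms.first(max_terms) # expansion stops at the cap
  expanded.flat_map { |term| index[term] }.uniq.sort
end

# 10 categories, each with 100 distinct tuple terms, so 1000 terms in total.
index = {}
(1..10).each do |category|
  (1..100).each do |material|
    index[format("x%05d_%05d_00000x", category, material)] = [category]
  end
end

any_tuple = /\Ax\d{5}_\d{5}_\d{5}x\z/

# The broad pattern matches 1000 terms, but only 512 are expanded.
puts capped_wildcard_search(index, any_tuple, 512).inspect  # [1, 2, 3, 4, 5, 6]

# Raising the cap above the number of matching terms restores all categories.
puts capped_wildcard_search(index, any_tuple, 5000).inspect # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# A more restrictive pattern expands to a single term, so it stays under the cap.
puts capped_wildcard_search(index, /\Ax00009_00081_\d{5}x\z/, 512).inspect # [9]
```

In this model category 9 is absent from the broad query but found by the narrower one, which mirrors the symptom reported in the thread: a more restrictive pattern expands to fewer terms and therefore never hits the limit.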
Jens Kraemer wrote:
> Of course you can set the max_term count to a higher value, but in the long run this isn't really a solution. If I understand you correctly, your tuple field right now has a single term for each document, and that term is different for each document. Splitting up your tuple values into several different terms could help to reduce the number of terms needed to fetch for a wild card query.

Interesting, thanks. Actually I can't split the tuple values up -- the requirement is to see those terms occur together in the same tuple, not just for the same document (there is a difference in this case). So, I'll try expanding the max_term count to see if that helps; otherwise I'll have to rethink the solution.

--
Posted via http://www.ruby-forum.com/.
On Nov 6, 2007, at 9:00 AM, Noah Daniels wrote:
> So, I'll try expanding the max_term count to see if that helps; otherwise I'll have to rethink the solution.

Jens, many thanks; upping the max_terms (max_clauses seems to be the same thing) solved the problem beautifully. However, now I'm trying to get this working with a remote ferret server (using acts_as_ferret) and not having any luck. Particularly, I can't figure out where to set max_terms (or Ferret::Search::MultiTermQuery.default_max_terms= ) such that the remote ferret server will pick it up -- including in the start script for the remote ferret server. Where can I change this option so it'll work for a remote server with AAF?

Thanks!
On Tue, Nov 06, 2007 at 11:25:56AM -0500, Noah M. Daniels wrote:
> Where can I change this option so it'll work for a remote server with AAF?

Placing it at the end of acts_as_ferret's init.rb should work.

Cheers,
Jens
On Nov 6, 2007, at 11:35 AM, Jens Kraemer wrote:
> Placing it at the end of acts_as_ferret's init.rb should work.

Unfortunately, it doesn't seem to. For a local index, I can just put this anywhere in code (even in a controller, or in the console) and I start getting correct results from my query:

Ferret::Search::MultiTermQuery.default_max_terms = 5000

but on my staging server, where a drb ferret server is used, putting that line in the init.rb doesn't seem to do anything -- in fact, even putting it into the initialize method of the LocalIndex class doesn't help! Any ideas?

Thanks!
Just a 'ping' since I still haven't been able to solve this without doing what I don't want to do (putting this into my local copy of ferret itself).

Setting this in init.rb in the acts_as_ferret plugin does nothing. Does anyone have a suggestion for where it would work?

Thanks!

On Nov 6, 2007, at 11:41 AM, Noah M. Daniels wrote:
> Ferret::Search::MultiTermQuery.default_max_terms = 5000
>
> but on my staging server, where a drb ferret server is used, putting that line in the init.rb doesn't seem to do anything -- in fact, even putting it into the initialize method of the LocalIndex class doesn't help! Any ideas?