I am looking to have a number of categories populated from my results of a search. For example, searching on "sport" would display all results for sport. I want to also have a number of categories to refine the documents down. So by clicking on the "Fishing" category or the "Shooting" category, I would only see the results on sport around that category.

Now for the fun. I want to determine the total number of results in each category for a given search. So in the above, for a search on sport I want to display the results but in the Fishing item I want to say how many results there are in total before the user clicks on the item. For example in the pull down I want to display "Fishing (10001), Shooting (2003)".

I was going to do this in Ruby by doing a simple count for each category item on the returned result set, but I believe that this would mean returning all the results of a given query to Ruby in order to do this count and I am concerned that this would cause performance issues for large result sets.

If I put pagination into the mix and only display the first 50 results on the screen, would this add an additional complexity or would this just be called through Ruby?

Thanks for your assistance with this...

--
Posted via http://www.ruby-forum.com/.

On 7/10/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
> I am looking to have a number of categories populated from my results of
> a search. For example, searching on "sport" would display all results
> for sport. I want to also have a number of categories to refine the
> documents down. So by clicking on the "Fishing" category or the
> "Shooting" category, I would only see the results on sport around that
> category.
>
> Now for the fun. I want to determine the total number of results in each
> category for a given search. So in the above, for a search on sport I
> want to display the results but in the Fishing item I want to say how
> many results there are in total before the user clicks on the item. For
> example in the pull down I want to display "Fishing (10001), Shooting
> (2003)".

Hi Clare,

The fastest way to do this would be to run the query multiple times.
So for your "sport" example you'd do something like this:

    fishing_count = index.search_each("sport AND fishing", :num_docs => 1) {}
    shooting_count = index.search_each("sport AND shooting", :num_docs => 1) {}
    # etc.

Then go ahead and paginate your query as you usually would.

> I was going to do this in Ruby by doing a simple count for each category
> item on the returned result set, but I believe that this would mean
> returning all the results of a given query to Ruby in order to do this
> count and I am concerned that this would cause performance issues for
> large result sets.

Quite possibly. But running the query multiple times should be fine in
terms of performance. You could use filters instead of the code I
demonstrated above to further improve performance.

> If I put pagination into the mix and only display the first 50 results
> on the screen, would this add an additional complexity or would this
> just be called through Ruby?
>
> Thanks for your assistance with this...

I'm not exactly sure what you mean here when you say "would this be
called through ruby". I hope I've already answered your question. Let
me know if I didn't.

Cheers,
Dave
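
A minimal sketch of the one-count-query-per-category idea from the reply above. It does not use Ferret: `DOCS`, `count_hits`, `category_counts` and `pulldown_labels` are hypothetical stand-ins so that the example runs on its own. With a real index, each `count_hits` call would instead be a search that only needs to report the total number of hits.

```ruby
# Toy stand-in for a search index: each document is just a list of terms.
# This is NOT Ferret; it only mimics "run a query, get the total hit count".
DOCS = [
  %w[sport fishing],
  %w[sport fishing],
  %w[sport shooting],
  %w[news fishing]
].freeze

# Hypothetical helper: number of documents containing every one of `terms`.
def count_hits(docs, terms)
  docs.count { |doc| (terms - doc).empty? }
end

# One count query per category, as in "sport AND fishing", "sport AND shooting".
def category_counts(docs, query_term, categories)
  categories.each_with_object({}) do |category, counts|
    counts[category] = count_hits(docs, [query_term, category])
  end
end

# Build the pull-down labels from the counts.
def pulldown_labels(counts)
  counts.map { |name, n| "#{name.capitalize} (#{n})" }.join(", ")
end

puts pulldown_labels(category_counts(DOCS, "sport", %w[fishing shooting]))
# prints "Fishing (2), Shooting (1)"
```

The displayed page of results is fetched separately, so pagination does not affect these counts.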

David Balmain wrote:
> On 7/10/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
>> many results there are in total before the user clicks on the item. For
>> example in the pull down I want to display "Fishing (10001), Shooting
>> (2003)".
>
> Hi Clare,
> The fastest way to do this would be to run the query multiple times.
> So for your "sport" example you'd do something like this:
>
>     fishing_count = index.search_each("sport AND fishing", :num_docs => 1) {}
>     shooting_count = index.search_each("sport AND shooting", :num_docs => 1) {}
>     # etc.
>
> Then go ahead and paginate your query as you usually would.

Thank you very much for your quick response.

I have several sub categories (taxonomy really) and what I was thinking of doing was this in 2 queries. Index the data as per normal so that you can do the full text search but also index the structure of the taxonomy and have each branch contain the records that contain it. Run one big search over the fulltext to get the list of hits and then use this list as a query against the second index to get all the category bits.

This would be a big query though - although it should be quick but I would need to re-index the category bits every time a document was added.

Does this make sense and/or would it make sense in Ferret? I have done this before in another search engine that required special category manipulation but never with Ferret and not sure how to go about doing this in Ferret.

I am not sure about your idea around filtering the results

>> I was going to do this in Ruby by doing a simple count for each category
>> item on the returned result set, but I believe that this would mean
>> returning all the results of a given query to Ruby in order to do this
>> count and I am concerned that this would cause performance issues for
>> large result sets.
>
> Quite possibly. But running the query multiple times should be fine in
> terms of performance. You could use filters instead of the code I
> demonstrated above to further improve performance.
>
>> If I put pagination into the mix and only display the first 50 results
>> on the screen, would this add an additional complexity or would this
>> just be called through Ruby?
>>
>> Thanks for your assistance with this...
>
> I'm not exactly sure what you mean here when you say "would this be
> called through ruby". I hope I've already answered your question. Let
> me know if I didn't.
>
> Cheers,
> Dave

On 7/10/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
> David Balmain wrote:
> > On 7/10/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
> >> many results there are in total before the user clicks on the item. For
> >> example in the pull down I want to display "Fishing (10001), Shooting
> >> (2003)".
> >
> > Hi Clare,
> > The fastest way to do this would be to run the query multiple times.
> > So for your "sport" example you'd do something like this:
> >
> >     fishing_count = index.search_each("sport AND fishing", :num_docs => 1) {}
> >     shooting_count = index.search_each("sport AND shooting", :num_docs => 1) {}
> >     # etc.
> >
> > Then go ahead and paginate your query as you usually would.
>
> Thank you very much for your quick response.
>
> I have several sub categories (taxonomy really) and what I was thinking
> of doing was this in 2 queries. Index the data as per normal so that you
> can do the full text search but also index the structure of the taxonomy
> and have each branch contain the records that contain it.
> Run one big search over the fulltext to get the list of hits and then
> use this list as a query against the second index to get all the
> category bits.

I'm not sure what you mean by "category bits". Can you possibly
implement the categories like this:

    sport/
    sport/shooting/
    sport/fishing/
    sport/fishing/fly
    sport/fishing/deep_sea
    etc.

Then, let's say you have a query in query_str. You can get all results
in the sport category like this:

    index.search_each(query_str + " AND category:sport/*") {
      # ...
    }

You can get all results in the fishing category like this:

    index.search_each(query_str + " AND category:sport/fishing/*") {
      # ...
    }

Am I making sense?

> This would be a big query though - although it should be quick but I
> would need to re-index the category bits every time a document was added.

You've lost me. Could you give some example code?

> Does this make sense and/or would it make sense in Ferret? I have done
> this before in another search engine that required special category
> manipulation but never with Ferret and not sure how to go about doing
> this in Ferret.
>
> I am not sure about your idea around filtering the results

I'll explain filtering once I understand better what it is you are
trying to do.

Cheers,
Dave
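
A small sketch of the path-style categories suggested above. `category_query` and `in_category?` are hypothetical helpers, not part of Ferret. The first only builds the query string in the form used in the message, taking care to leave a space before `AND`. The second shows in plain Ruby what the trailing `/*` is meant to select.

```ruby
# Hypothetical helper: append a category restriction to a user query,
# in the "query AND category:path/*" form used in the message above.
def category_query(query_str, category_path)
  "#{query_str} AND category:#{category_path}/*"
end

# Plain-Ruby illustration of the prefix match: a document is in a category
# when its category path starts with that category's path plus "/".
def in_category?(doc_category, category_path)
  doc_category.start_with?("#{category_path}/")
end

# Example category values, one per document (illustrative data only).
DOC_CATEGORIES = %w[sport/shooting/ sport/fishing/fly sport/fishing/deep_sea].freeze

puts category_query("trout", "sport/fishing")
# prints "trout AND category:sport/fishing/*"
puts DOC_CATEGORIES.count { |c| in_category?(c, "sport/fishing") }
# prints 2
```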

David Balmain wrote:

David

Thanks for your continued help and assistance.

I don't have code at this stage because I started writing it one way and realised that the way I was writing it through counts in Ruby would not work because of pagination.

A little more background is in order. The user will be presented with a pull down menu with 5 selections in a main category. Doing 6 queries (one main query and 5 count queries) in this instance is not a problem. The problem arises when they select one of these categories.

They will then be presented with up to 5 other category structures. One would be new or old, another would be type (up to 5 nodes), another would be, for example, book type (such as fiction, non-fiction, autobiography etc., up to 20 categories), another could have up to 40 categories. The user is free to select any of these category nodes because they may be interested in old books and fiction. I will therefore have to populate all of the nodes with the number of documents in each node. This could leave me spawning 60-odd queries to count the number of documents in each node. Subsequent selections of nodes would refine the result set down further.

What I really would like to do is 2 or 3 queries. One which does the normal search over the document set (collection) and the second to populate each node in the classification structure with the number of documents that match each node.

It is pretty easy in 2 queries to tell if there are any documents in each node but doing a count over all the nodes is more tricky. I was originally going to have another table which had a row for each node with the name of the node (and structure) in one field and the document ids in another field. For example, [Fishing, "doc1 doc2 doc3 doc4"], [Fishing/Fiction, "doc2 doc3"], [Fishing/Non Fiction, "doc1"] etc. I would then get a result set that provided all the categories that had hits against a given query. However, it does not provide the number of documents against each node. So I could not populate the pull down categories with Fishing (2), Fiction (1), Non Fiction (1) etc.

Therefore, what I really need is a function that will return the number of documents in each node of a given classification structure. An addition to the Num_Docs capability already available perhaps.

I could easily produce a results set that would be like this....

    Fishing doc1
    Fishing doc2
    Fishing/Fiction doc3
    Fishing/Fiction doc1
    Fishing/Non Fiction doc4
    etc...

Num_Docs would provide 5 in this instance but what I really want is:

    Fishing 2
    Fishing/Fiction 2
    Fishing/Non Fiction 1
    etc...

All that, and done in 1 or 2 queries over and above the original search.... Simple eh!

I hope that I have not confused you too much, but this is something that I desperately need or my project is kaput!

I found this:
http://www.mail-archive.com/ferret-talk@rubyforge.org/msg00343.html and
http://www.ruby-forum.com/topic/56232#40931

Do you think that this is the way to go?

Thanks very much.

> On 7/10/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
>> > fishing_count = index.search_each("sport AND fishing", :num_docs =>
>> I have several sub categories (taxonomy really) and what I was thinking
>> of doing was this in 2 queries. Index the data as per normal so that you
>> can do the full text search but also index the structure of the taxonomy
>> and have each branch contain the records that contain it.
>> Run one big search over the fulltext to get the list of hits and then
>> use this list as a query against the second index to get all the
>> category bits.
>
> I'm not sure what you mean by "category bits". Can you possibly
> implement the categories like this:
>
>     sport/
>     sport/shooting/
>     sport/fishing/
>     sport/fishing/fly
>     sport/fishing/deep_sea
>     etc.
>
> Then, let's say you have a query in query_str. You can get all results
> in the sport category like this:
>
>     index.search_each(query_str + " AND category:sport/*") {
>       # ...
>     }
>
> You can get all results in the fishing category like this:
>
>     index.search_each(query_str + " AND category:sport/fishing/*") {
>       # ...
>     }
>
> Am I making sense?
>
>> This would be a big query though - although it should be quick but I
>> would need to re-index the category bits every time a document was added.
>
> You've lost me. Could you give some example code?
>
>> Does this make sense and/or would it make sense in Ferret? I have done
>> this before in another search engine that required special category
>> manipulation but never with Ferret and not sure how to go about doing
>> this in Ferret.
>>
>> I am not sure about your idea around filtering the results
>
> I'll explain filtering once I understand better what it is you are
> trying to do.
>
> Cheers,
> Dave
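
To make the wanted output concrete, here is a minimal plain-Ruby sketch that turns the (node, document) rows from the message above into one count per node. `ROWS` and `counts_per_node` are illustrative names only. The sketch shows the shape of the result being asked for, not how a search engine would compute it.

```ruby
# (node, document) rows, as in the example result set in the message above.
ROWS = [
  ["Fishing", "doc1"],
  ["Fishing", "doc2"],
  ["Fishing/Fiction", "doc3"],
  ["Fishing/Fiction", "doc1"],
  ["Fishing/Non Fiction", "doc4"]
].freeze

# Count the rows belonging to each node in a single pass.
def counts_per_node(rows)
  rows.each_with_object(Hash.new(0)) { |(node, _doc), counts| counts[node] += 1 }
end

counts_per_node(ROWS).each { |node, n| puts "#{node} #{n}" }
# prints:
#   Fishing 2
#   Fishing/Fiction 2
#   Fishing/Non Fiction 1
```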

On 7/11/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
> David Balmain wrote:
>
> David
>
> Thanks for your continued help and assistance.
>
> I don't have code at this stage because I started writing it one way and
> realised that the way I was writing it through counts in Ruby would not
> work because of pagination.
>
> A little more background is in order. The user will be presented with a
> pull down menu with 5 selections in a main category. Doing 6 queries
> (one main query and 5 count queries) in this instance is not a problem.
> The problem arises when they select one of these categories.
>
> They will then be presented with up to 5 other category structures. One
> would be new or old, another would be type (up to 5 nodes), another
> would be, for example, book type (such as fiction, non-fiction,
> autobiography etc., up to 20 categories), another could have up to 40
> categories. The user is free to select any of these category nodes
> because they may be interested in old books and fiction. I will
> therefore have to populate all of the nodes with the number of documents
> in each node. This could leave me spawning 60-odd queries to count
> the number of documents in each node. Subsequent selections of nodes
> would refine the result set down further.
>
> What I really would like to do is 2 or 3 queries. One which does the
> normal search over the document set (collection) and the second to
> populate each node in the classification structure with the number of
> documents that match each node.
>
> It is pretty easy in 2 queries to tell if there are any documents in
> each node but doing a count over all the nodes is more tricky. I was
> originally going to have another table which had a row for each node
> with the name of the node (and structure) in one field and the
> document ids in another field. For example, [Fishing, "doc1 doc2 doc3
> doc4"], [Fishing/Fiction, "doc2 doc3"], [Fishing/Non Fiction, "doc1"]
> etc. I would then get a result set that provided all the categories that
> had hits against a given query. However, it does not provide the number
> of documents against each node. So I could not populate the pull down
> categories with Fishing (2), Fiction (1), Non Fiction (1) etc.
>
> Therefore, what I really need is a function that will return the number
> of documents in each node of a given classification structure. An
> addition to the Num_Docs capability already available perhaps.
>
> I could easily produce a results set that would be like this....
>
>     Fishing doc1
>     Fishing doc2
>     Fishing/Fiction doc3
>     Fishing/Fiction doc1
>     Fishing/Non Fiction doc4
>     etc...
>
> Num_Docs would provide 5 in this instance but what I really want is:
>
>     Fishing 2
>     Fishing/Fiction 2
>     Fishing/Non Fiction 1
>     etc...
>
> All that, and done in 1 or 2 queries over and above the original
> search.... Simple eh!
>
> I hope that I have not confused you too much, but this is something that
> I desperately need or my project is kaput!
>
> I found this:
> http://www.mail-archive.com/ferret-talk@rubyforge.org/msg00343.html and
> http://www.ruby-forum.com/topic/56232#40931
>
> Do you think that this is the way to go?

I think I finally understand what you want now and I do think this is
the way to go. What you will need to do is build BitVectors for each
of your categories and sub-categories using the examples in those
threads. Or you could just use a QueryFilter.

    filter = QueryFilter.new(PrefixQuery.new(:category, "fishing"))
    fishing_bits = filter.bits(index_reader)

    filter = QueryFilter.new(PrefixQuery.new(:category, "fishing/fiction"))
    fishing_fiction_bits = filter.bits(index_reader)

    filter = QueryFilter.new(PrefixQuery.new(:category, "fishing/nonfiction"))
    fishing_nonfiction_bits = filter.bits(index_reader)

This assumes that everything in fishing/fiction is also in fishing/.
In your example, it doesn't seem to be the case, so you should use a
TermQuery instead of a PrefixQuery.

Now you just need to run your search the same way. Something like this:

    query = query_parser.parse(query_str)
    query_bits = QueryFilter.new(query).bits(index_reader)

And now you can get your counts like this:

    fishing_count = (fishing_bits & query_bits).count
    fishing_fiction_count = (fishing_fiction_bits & query_bits).count
    fishing_nonfiction_count = (fishing_nonfiction_bits & query_bits).count

Sadly, this code only works in theory since I haven't released the code
that &s bit vectors yet, and I used the new-style PrefixQuery
declarations so they won't work either. But if this solution seems
like it will work for you and you can wait a week, you'll be set.

Cheers,
Dave
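
The counting step above can be tried out in plain Ruby while waiting for the release. The following is a sketch that uses ordinary Integers as bitsets, with one bit per document number. `bits_for`, `bit_count`, `CATEGORY_BITS` and `QUERY_BITS` are hypothetical names and made-up data, and nothing here is Ferret's own bit vector class. It only illustrates "AND the category bits with the query bits, then count the set bits".

```ruby
# Build a bitset (a plain Integer) with one bit set per document number.
# A stand-in for the bit vectors in the message above, not Ferret's class.
def bits_for(doc_ids)
  doc_ids.reduce(0) { |bits, id| bits | (1 << id) }
end

# Number of set bits in the bitset.
def bit_count(bits)
  bits.to_s(2).count("1")
end

# Made-up data: which document numbers fall in each category...
CATEGORY_BITS = {
  "fishing"            => bits_for([0, 1, 2, 3, 4]),
  "fishing/fiction"    => bits_for([0, 1, 2]),
  "fishing/nonfiction" => bits_for([3, 4])
}.freeze

# ...and which document numbers match the user's query.
QUERY_BITS = bits_for([1, 3, 4, 7])

CATEGORY_BITS.each do |name, bits|
  puts "#{name}: #{bit_count(bits & QUERY_BITS)}"
end
# prints:
#   fishing: 3
#   fishing/fiction: 1
#   fishing/nonfiction: 2
```

The category bitsets depend only on the index, so they can be built once and reused for every query. Only the query bitset changes per search.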

David Balmain wrote:
>
> I think I finally understand what you want now and I do think this is
> the way to go. What you will need to do is build BitVectors for each
> of your categories and sub-categories using the examples in those
> threads. Or you could just use a QueryFilter.
>
>     filter = QueryFilter.new(PrefixQuery.new(:category, "fishing"))
>     fishing_bits = filter.bits(index_reader)
>
>     filter = QueryFilter.new(PrefixQuery.new(:category, "fishing/fiction"))
>     fishing_fiction_bits = filter.bits(index_reader)
>
>     filter = QueryFilter.new(PrefixQuery.new(:category, "fishing/nonfiction"))
>     fishing_nonfiction_bits = filter.bits(index_reader)
>
> This assumes that everything in fishing/fiction is also in fishing/.
> In your example, it doesn't seem to be the case, so you should use a
> TermQuery instead of a PrefixQuery.
>
> Now you just need to run your search the same way. Something like this:
>
>     query = query_parser.parse(query_str)
>     query_bits = QueryFilter.new(query).bits(index_reader)
>
> And now you can get your counts like this:
>
>     fishing_count = (fishing_bits & query_bits).count
>     fishing_fiction_count = (fishing_fiction_bits & query_bits).count
>     fishing_nonfiction_count = (fishing_nonfiction_bits & query_bits).count
>
> Sadly, this code only works in theory since I haven't released the code
> that &s bit vectors yet, and I used the new-style PrefixQuery
> declarations so they won't work either. But if this solution seems
> like it will work for you and you can wait a week, you'll be set.
>
> Cheers,
> Dave

Dave

Thanks very much for this and I can wait a week for this to be released.

I am sorry if I was not clear about this but everything in the sub categories will have to be in the category above as this is the way that the system is designed.

    Fishing contains documents A B C D E F G H I J
    Fishing_Fiction contains A B C
    Fishing_Non_Fiction contains D and E
    Fishing_Fiction_New contains A B
    Fishing_Fiction_Old contains C
    etc.

I am assuming that I need to still wait in this case? I will try and understand this in more detail in the meantime.

Thanks once again for all your assistance.

On 7/11/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
> David Balmain wrote:
> >
> > I think I finally understand what you want now and I do think this is
> > the way to go. What you will need to do is build BitVectors for each
> > of your categories and sub-categories using the examples in those
> > threads. Or you could just use a QueryFilter.
> >
> >     filter = QueryFilter.new(PrefixQuery.new(:category, "fishing"))
> >     fishing_bits = filter.bits(index_reader)
> >
> >     filter = QueryFilter.new(PrefixQuery.new(:category, "fishing/fiction"))
> >     fishing_fiction_bits = filter.bits(index_reader)
> >
> >     filter = QueryFilter.new(PrefixQuery.new(:category, "fishing/nonfiction"))
> >     fishing_nonfiction_bits = filter.bits(index_reader)
> >
> > This assumes that everything in fishing/fiction is also in fishing/.
> > In your example, it doesn't seem to be the case, so you should use a
> > TermQuery instead of a PrefixQuery.
> >
> > Now you just need to run your search the same way. Something like this:
> >
> >     query = query_parser.parse(query_str)
> >     query_bits = QueryFilter.new(query).bits(index_reader)
> >
> > And now you can get your counts like this:
> >
> >     fishing_count = (fishing_bits & query_bits).count
> >     fishing_fiction_count = (fishing_fiction_bits & query_bits).count
> >     fishing_nonfiction_count = (fishing_nonfiction_bits & query_bits).count
> >
> > Sadly, this code only works in theory since I haven't released the code
> > that &s bit vectors yet, and I used the new-style PrefixQuery
> > declarations so they won't work either. But if this solution seems
> > like it will work for you and you can wait a week, you'll be set.
> >
> > Cheers,
> > Dave
>
> Dave
>
> Thanks very much for this and I can wait a week for this to be released.
> <snip>

Great. A word of warning though, it's all new code and you'll be
riding on the bleeding edge. But hopefully it will stabilize quickly.
I'm working on this full time at the moment (when I'm not answering
emails ;-)).

Cheers,
Dave

David Balmain wrote:
> On 7/11/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
>> > "fishing/fiction"))
>> > Now you just need to run your search the same way. Something like this:
>> >
>> Thanks very much for this and I can wait a week for this to be released.
>> <snip>
>
> Great. A word of warning though, it's all new code and you'll be
> riding on the bleeding edge. But hopefully it will stabilize quickly.
> I'm working on this full time at the moment (when I'm not answering
> emails ;-)).
>
> Cheers,
> Dave

Dave

One last thought on this... because this will be new code.... originally I was going to write the count as a client side piece of code to count the documents in each category. I realised that I would have to return the full result set in order to do this which would cause problems with performance.

If I were to write this as a server side script, outside of Ferret, I believe that I could achieve the same result as in your example. Can you think of any gotchas that would make this a stupid idea?

Thanks

(Sorry in advance for taking this outside Ferret!)

On 7/12/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
> David Balmain wrote:
> > On 7/11/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
> >> > "fishing/fiction"))
> >> > Now you just need to run your search the same way. Something like this:
> >> >
> >> Thanks very much for this and I can wait a week for this to be released.
> >> <snip>
> >
> > Great. A word of warning though, it's all new code and you'll be
> > riding on the bleeding edge. But hopefully it will stabilize quickly.
> > I'm working on this full time at the moment (when I'm not answering
> > emails ;-)).
> >
> > Cheers,
> > Dave
>
> Dave
>
> One last thought on this... because this will be new code.... originally
> I was going to write the count as a client side piece of code to count
> the documents in each category. I realised that I would have to return
> the full result set in order to do this which would cause problems with
> performance.
>
> If I were to write this as a server side script, outside of Ferret, I
> believe that I could achieve the same result as in your example. Can
> you think of any gotchas that would make this a stupid idea?

If you mean grab the whole result set and loop through every result
taking a running count then yes, this should work fine. I'd say my
example would be a lot faster but you never know without trying it.

Dave
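
A rough sketch of the "loop through every result taking a running count" option, combined with the 50-per-page display from earlier in the thread. `RESULTS` is made-up data standing in for a full result set, and `page_with_counts` is a hypothetical helper. The point is that the counts come from walking every hit once, while only the hits for the requested page are kept for display.

```ruby
# Made-up result set: one [doc_id, category] pair per hit, in rank order.
RESULTS = (1..120).map { |i| ["doc#{i}", i.even? ? "fishing" : "shooting"] }.freeze

# Walk the full result set once: tally every hit by category, but keep only
# the hits that belong on the requested page (pages are numbered from 1).
def page_with_counts(results, page:, per_page: 50)
  first = (page - 1) * per_page
  counts = Hash.new(0)
  page_hits = []
  results.each_with_index do |(doc_id, category), i|
    counts[category] += 1
    page_hits << doc_id if i >= first && i < first + per_page
  end
  [page_hits, counts]
end

hits, counts = page_with_counts(RESULTS, page: 1)
puts hits.size          # prints 50
puts counts["fishing"]  # prints 60
puts counts["shooting"] # prints 60
```

The cost of this approach grows with the size of the whole result set rather than with the page size, which is the performance concern raised at the start of the thread.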

David Balmain wrote:
> On 7/12/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
>> > I'm working on this full time at the moment (when I'm not answering
>> the full result set in order to do this which would cause problems with
>> performance.
>>
>> If I were to write this as a server side script, outside of ferret, I
>> believe that I could achieve the same result as in your example. Can
>> you think of any gotchas that would make this a stupid idea?
>
> If you mean grab the whole result set and loop through every result
> taking a running count then yes, this should work fine. I'd say my
> example would be a lot faster but you never know without trying it.
>
> Dave

Again, many thanks for replying to my queries. I may go ahead and implement it this way just to see it working and then when your code is available implement it that way. It will give us the opportunity to compare but my suspicion is that the larger the dataset the faster your approach will be....

Would it be possible to ping me when your code is available?

Thanks

On 7/12/06, Guest <clare.cavanagh at argoent.co.uk> wrote:
> David Balmain wrote:
> > On 7/12/06, BlueJay <clare.cavanagh at argoent.co.uk> wrote:
> >> > I'm working on this full time at the moment (when I'm not answering
> >> the full result set in order to do this which would cause problems with
> >> performance.
> >>
> >> If I were to write this as a server side script, outside of ferret, I
> >> believe that I could achieve the same result as in your example. Can
> >> you think of any gotchas that would make this a stupid idea?
> >
> > If you mean grab the whole result set and loop through every result
> > taking a running count then yes, this should work fine. I'd say my
> > example would be a lot faster but you never know without trying it.
> >
> > Dave
>
> Again, many thanks for replying to my queries. I may go ahead and
> implement it this way just to see it working and then when your code is
> available implement it that way. It will give us the opportunity to
> compare but my suspicion is that the larger the dataset the faster your
> approach will be....
>
> Would it be possible to ping me when your code is available?

Sure. There will be an announcement on this mailing list as well as
the ruby and rails lists.

Dave