Hello, I need to be able to count the occurrences of certain terms in the results. Currently my setup is Ferret 0.10.1 with aaf bleeding edge.

    results = VoObject.find_by_contents(query, :offset => page, :limit => 20, :sort => sort_fields)

I use results.total_hits for pagination. This all works really nicely. However, I need to know how many occurrences of certain predefined terms occur in each result set. So in the animals field there can be "mouse", "cat", "fish". A perfect solution would be for the result set to have some extra attributes like results.cat_hits (that would be amazing). In reality there need to be counts for 5 different fields.

So is this something that Ferret can do easily? How do I get Ferret and aaf to produce this data for each search result? What should I go and investigate?

Best regards
caspar

--
Posted via http://www.ruby-forum.com/.
Jens Kraemer
2006-Sep-07 15:10 UTC
[Ferret-talk] counting occurrences of words in the result set
On Thu, Sep 07, 2006 at 04:58:08PM +0200, Caspar wrote:
> Hello, I need to be able to count the occurrences of certain terms in the
> results. Currently my setup is Ferret 0.10.1 with aaf bleeding edge.
>
>     results = VoObject.find_by_contents(query, :offset => page, :limit => 20, :sort => sort_fields)
>
> I use results.total_hits for pagination. This all works really nicely.
> However, I need to know how many occurrences of certain predefined terms
> occur in each result set. So in the animals field there can be "mouse",
> "cat", "fish". A perfect solution would be for the result set to have
> some extra attributes like results.cat_hits (that would be amazing). In
> reality there need to be counts for 5 different fields.
>
> So is this something that Ferret can do easily?
> How do I get Ferret and aaf to produce this data for each search result?
> What should I go and investigate?

I'd first try issuing a separate query for each of your special terms (ANDed with the original query), and taking its result count.

Ideally you wouldn't use find_by_contents for this (because it fetches results from the db, which you don't want here), but use something like

    VoObject.ferret_index.search(query + " AND cat", ...).total_hits

Jens

--
webit! Gesellschaft für neue Medien mbH     www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer      kraemer at webit.de
Schnorrstraße 76                            Tel +49 351 46766 0
D-01069 Dresden                             Fax +49 351 46766 66
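Jens's suggestion amounts to a single loop over the predefined terms. A minimal plain-Ruby sketch of that loop, with a `search_count` lambda standing in for the real `VoObject.ferret_index.search(...).total_hits` call; the `DOCS` corpus and `ANIMAL_TERMS` list are made-up stand-ins for illustration, not aaf API:

```ruby
# Made-up corpus; each entry stands in for an indexed VoObject record.
DOCS = [
  { :animals => "cat",   :text => "tabby cat"   },
  { :animals => "mouse", :text => "field mouse" },
  { :animals => "cat",   :text => "black cat"   },
  { :animals => "fish",  :text => "gold fish"   },
]

ANIMAL_TERMS = %w{mouse cat fish}

# Stand-in for:
#   VoObject.ferret_index.search("#{query} AND animals:#{term}").total_hits
search_count = lambda do |term|
  DOCS.count { |doc| doc[:animals] == term }
end

counts = {}
ANIMAL_TERMS.each { |term| counts[term] = search_count.call(term) }
puts counts.inspect
```

With 35 predefined values this becomes 35 extra term queries per page, so whether it is fast enough is an empirical question.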
Hi Jens,

Thank you for getting back so quickly. I should have given more information about the problem, though. One of the fields contains about 35 predefined values. I hope there is a more efficient way of producing these counts, or I may well have to drop this functionality from the app.

Any other ideas? I really appreciate the speed with which people reply on this forum.

Regards
c

Jens Kraemer wrote:
> On Thu, Sep 07, 2006 at 04:58:08PM +0200, Caspar wrote:
>> there can be "mouse", "cat", "fish".
>> A perfect solution would be for the result set to have some extra
>> attributes like results.cat_hits (that would be amazing). In reality
>> there need to be counts for 5 different fields.
>>
>> So is this something that Ferret can do easily?
>> How do I get Ferret and aaf to produce this data for each search result?
>> What should I go and investigate?
>
> I'd first try issuing a separate query for each of your special
> terms (ANDed with the original query), and taking its result count.
>
> Ideally you wouldn't use find_by_contents for this (because it fetches
> results from the db, which you don't want here), but use something like
>
>     VoObject.ferret_index.search(query + " AND cat", ...).total_hits
>
> Jens
Hi,

Okay, we have spent the last few hours trawling through the Ferret API and have come across lots of promising leads, and many questions.

    index_reader.doc_freq(field, term) -> integer

Return the number of documents in which the term term appears in the field field.

This would seem to partly fit the requirements. However, when I have tried to instantiate a new index_reader like this

    reader = Ferret::Index::IndexReader.new("/home/c/V_O_2/index/development/vo_object/")

and then try to access some of the documents returned by search_each, I am only able to access the :id field.

Q1: How do you create an index_reader that is able to access your aaf index?
Q2: How do you actually return the contents of a field?
Q3: How can I combine doc_freq (which seems perfect) with a search to count the frequency of terms?

Any answers would be brilliant.

Best regards
caspar
Caspar,

I have been trying to get the same thing working for a while but never found a solution. It would help greatly if someone has the answer to this, because I want to add this capability to my search to provide additional information to the user on the results page.

But I only got the :id from the index also... :(

Any help would be appreciated on this one. Thanks in advance as always!

Clare
David Balmain
2006-Sep-08 02:58 UTC
[Ferret-talk] counting occurrences of words in the result set
On 9/8/06, Clare <clare.cavanagh at nospam.co.uk> wrote:
> Caspar
>
> I have been trying to get the same thing working for a while but never
> found a solution. It would help greatly if someone has the answer to
> this, because I want to add this capability to my search to provide
> additional information to the user on the results page.
>
> But I only got the :id from the index also... :(
>
> Any help would be appreciated on this one.
>
> Thanks in advance as always!

By default acts_as_ferret only stores the :id. You need to set the :store parameter of any other fields that you want stored. Something like this:

    acts_as_ferret :fields => {
      :title   => { :store => :yes },
      :content => { :store => :yes }
    }

As for counting the frequency of terms in a resultset, IndexReader#doc_freq probably won't work. It counts the frequency of terms in the index, not in the resultset.

So back to the problem. Jens gave the solution I would probably use. Ferret's searches are fast enough that this solution is quite feasible for most indexes. Try it. You might be surprised.

The alternative is counting through the resultset. To do this you will need to set :limit => :all in the search_each method so you get all results back, then iterate through each result counting the occurrences. For a huge index - slow query - small resultset this might be faster.

Also, with the new filter_proc method there is another way you can do this without having to retrieve all results:

    require 'rubygems'
    require 'ferret'
    include Ferret

    index = I.new
    words = %w{one two three four five}
    100000.times do |i|
      index << {:id => "%05d" % i, :word => words[rand(words.size)]}
    end

    counter = Hash.new(0)
    filter_proc = lambda do |doc, score, searcher|
      counter[searcher[doc][:word]] += 1
    end

    resultset = index.search("id:[10000 20000}", :limit => 1,
                             :filter_proc => filter_proc)
    puts resultset.total_hits
    puts counter.inspect

Hope that helps,
Dave
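The "iterate through the resultset" alternative David names reduces to the `Hash.new(0)` counting idiom. A minimal sketch, with a plain array standing in for the documents that search_each with :limit => :all would yield (the `results` data and the :word field are made up for illustration):

```ruby
# Stand-in for documents fetched via search_each with :limit => :all;
# the :word field is a hypothetical stored field.
results = [
  { :id => 1, :word => "cat"  },
  { :id => 2, :word => "fish" },
  { :id => 3, :word => "cat"  },
]

# Hash.new(0) returns 0 for missing keys, so each document simply
# increments the tally for its field value.
counter = Hash.new(0)
results.each { |doc| counter[doc[:word]] += 1 }

puts counter.inspect
```

The filter_proc variant in the post above does the same counting without first pulling every result out of Ferret.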
Clare
2006-Sep-08 07:40 UTC
[Ferret-talk] Performance Testing +counting occurrences of words in the res
Thanks David

I will try both options. I am in fact doing some performance testing now. I have created a 100,000-record search result set and it takes around 5 seconds (end to end) on my internal server to be returned (with 1 user). I am only doing 6 significant searches on this set: one for the main results and one for the top level categories. This is only on my test server and not on the larger production server, and I am happy with this performance. If however I were to do my second level category search, which has around 40 nodes in it, that would be 30 searches. I am not sure how this would perform.

What I am seeing is a CPU-hungry search but not a memory-hungry one. This makes sense to me.

Q - I have test data set up in my tests that has some random junk in it and then a word such as "fish" at the end. I am starting to think that I may have set up the test data wrong and should use a lot of different words in the result set, because I am sure that Ferret will cache the search. This would give me a false impression of the speed of search.

I will create more test data at the weekend, but my instinct is that your method outlined above may be faster.

I have 5 top level categories and this will not change much. Depending on the search there will be a lot more results in one category than the rest after the initial search.

Drilling into the second level categories, the most nodes I have in a single second level category is around 40 at the moment, although this is likely to be added to over time. The results again will not be normally distributed over the result set, but assuming for now that they were and I had 500,000 records, and drilled into the second tier category structure, I would have 100,000 records in this category. I would be doing 40 searches over 100,000 records.

Q - What do you think will perform faster in this instance?
I would love to have the time to build an x-dimensional memory-resident result (bucket set) that kept all the results parameterised for all the categories, built at the initial time of the search. It would be memory hungry but would make searching through categories and nodes and parameters in subsequent searches lightning fast.

Would be a great addition, or am I missing something?

I am really interested in the performance testing scenarios. As stated above, I only have one word "FISH" in my test data, with random made-up text before it, e.g. "sadssderssdaatg FISH" etc.

Q - Would I be better off using more words in my test data?

Also - I am interested in the round trip performance of search: the length of time it takes from when the user clicks on search until the results come back. I will do this on the production server in the production environment. My rule of thumb is that it should not take longer than 8 seconds to return the results or the user will refresh (even worse for performance). With one user on my test system with 6 searches over 100,000 records it takes 5 seconds at the moment.

I am expecting a large number of concurrent searches. I am defining concurrency as someone searching at the same time as another user is either searching or waiting for the results to be returned.

Most testing tools that I can see only show you what is happening on the server. I am interested in the user's perspective.

I had a thought of setting up a script that would open a number of browser sessions and do random searches concurrently, hammering the server to see when it 1) breaks search, 2) breaks something else, 3) search goes over the 8 second limit.

Q - Does anyone have any experience in this area? Even better, does anyone have a script to do this? If not, and I do write a script to do this, would it be of value to the greater community?

Sorry for the long winded post.
My search page and category search are the most critical part of my site and I am anal about their performance, because if they do not work then my site will not work.

Thanks once again for all your assistance. Sorry for any stupid or ignorant thoughts/remarks.

Ferret rocks!

Clare
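The concurrent-hammering script Clare describes can be sketched with plain Ruby threads. Here `do_search` is a placeholder for the real request (a production script might use Net::HTTP against the search page); the harness only records per-request wall-clock times so they can be checked against the 8-second budget:

```ruby
# do_search is a placeholder for a real search request (e.g. an HTTP GET
# against the search page); the sleep just simulates server work.
do_search = lambda { sleep 0.01 }

CONCURRENT_USERS  = 5
REQUESTS_PER_USER = 3
LIMIT_SECONDS     = 8

timings = []
mutex   = Mutex.new

threads = Array.new(CONCURRENT_USERS) do
  Thread.new do
    REQUESTS_PER_USER.times do
      started = Time.now
      do_search.call
      # Record how long this simulated user waited for results.
      mutex.synchronize { timings << Time.now - started }
    end
  end
end
threads.each(&:join)

slow = timings.count { |t| t > LIMIT_SECONDS }
puts "#{timings.size} requests, #{slow} over the #{LIMIT_SECONDS}s limit"
```

Raising CONCURRENT_USERS until the recorded timings cross the limit gives a rough answer to the "when does it break" question, from the user's side rather than the server's.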
David Balmain
2006-Sep-08 08:49 UTC
[Ferret-talk] Performance Testing +counting occurrences of words in the res
On 9/8/06, Clare <Clare at nospam.com> wrote:
> Thanks David
>
> I will try both options. I am in fact doing some performance testing now.
> I have created a 100,000-record search result set and it takes around
> 5 seconds (end to end) on my internal server to be returned (with 1
> user). I am only doing 6 significant searches on this set: one for the
> main results and one for the top level categories. This is only on my
> test server and not on the larger production server, and I am happy with
> this performance. If however I were to do my second level category
> search, which has around 40 nodes in it, that would be 30 searches. I am
> not sure how this would perform.
>
> What I am seeing is a CPU-hungry search but not a memory-hungry one.
> This makes sense to me.
>
> Q - I have test data set up in my tests that has some random junk in it
> and then a word such as "fish" at the end. I am starting to think that
> I may have set up the test data wrong and should use a lot of different
> words in the result set, because I am sure that Ferret will cache the
> search. This would give me a false impression of the speed of search.

Firstly, searches don't get cached. Only filters do. If you want to cache the results from a query (which you would in this instance) then you should use a QueryFilter.

Secondly, I'm not sure exactly what you are saying when you say your tests have some random junk and then the word "fish". If you are putting data like this into every document:

    index << "asdlgkjhasd askdj asdg asdg asdg asdg lkjh asd fish"

then you probably should work on your test data. As far as search performance goes, this will be no different from doing this:

    index << "fish"

What is important to remember is that TermQueries (fish) perform a lot better than BooleanQueries (fish AND rod) and PhraseQueries ("fishing rod"), which perform better again than WildCardQueries (fi*), so you should try these queries too.
Here is a much better way to create random strings:

    WORDS = %w{one two three}

    def random_sentence(min_size, max_size)
      len = min_size + rand(max_size - min_size)
      sentence = []
      len.times { sentence << WORDS[Math.sqrt(rand(WORDS.size * WORDS.size))] }
      sentence.join(" ")
    end

    10.times { puts random_sentence(10, 100) }

The Math.sqrt stuff makes sure that words aren't evenly distributed, to be more realistic. Words appearing later in the WORDS array will be much more common. Even better than this would be to use a copy of the real data that you will be using, though.

> I will create more test data at the weekend, but my instinct is
> that your method outlined above may be faster.
>
> I have 5 top level categories and this will not change much. Depending
> on the search there will be a lot more results in one category than the
> rest after the initial search.
>
> Drilling into the second level categories, the most nodes I have in a
> single second level category is around 40 at the moment, although this
> is likely to be added to over time. The results again will not be
> normally distributed over the result set, but assuming for now that they
> were and I had 500,000 records, and drilled into the second tier
> category structure, I would have 100,000 records in this category. I
> would be doing 40 searches over 100,000 records.
>
> Q - What do you think will perform faster in this instance?

Impossible to say without testing. Both methods are pretty simple though, so I'd try both with a variety of search strings.

> I would love to have the time to build an x-dimensional memory-resident
> result (bucket set) that kept all the results parameterised for all the
> categories, built at the initial time of the search. Would be memory
> hungry but would make searching through categories and nodes and
> parameters in subsequent searches lightning fast.
> Would be a great addition, or am I missing something?

As far as I'm concerned this functionality is already there with the filter_proc parameter. Make it any less general than this and it isn't much use anymore. For example:

    require 'rubygems'
    require 'ferret'
    include Ferret

    index = I.new
    words = %w{one two three four five}
    100000.times do |i|
      index << {:id => "%05d" % i, :word => words[rand(words.size)]}
    end

    groups = {}
    filter_proc = lambda do |doc, score, searcher|
      word = searcher[doc][:word]
      (groups[word] ||= []) << doc
    end

    resultset = index.search("id:[09900 10000}", :limit => 1,
                             :filter_proc => filter_proc)
    puts resultset.total_hits
    puts groups.inspect
    puts groups["two"].size

I really can't see how you could make it any easier than that.

> I am really interested in the performance testing scenarios. As stated
> above, I only have one word "FISH" in my test data, with random made-up
> text before it, e.g. "sadssderssdaatg FISH" etc.
>
> Q - Would I be better off using more words in my test data?

See above.

> Also - I am interested in the round trip performance of search: the
> length of time it takes from when the user clicks on search until the
> results come back. I will do this on the production server in the
> production environment. My rule of thumb is that it should not take
> longer than 8 seconds to return the results or the user will refresh
> (even worse for performance). With one user on my test system with 6
> searches over 100,000 records it takes 5 seconds at the moment.

5 seconds seems like a long time. Try optimizing your index and see how you go then. The example above took 0.028109 seconds. Personally, I would be worried about anything taking over 1 second, which was the whole reason I wrote Ferret in C.

> I am expecting a large number of concurrent searches. I am
> defining concurrency as someone searching at the same time as another
> user is either searching or waiting for the results to be returned.
> Most testing tools that I can see only show you what is happening on
> the server. I am interested in the user's perspective.
>
> I had a thought of setting up a script that would open a number of
> browser sessions and do random searches concurrently, hammering the
> server to see when it 1) breaks search, 2) breaks something else, 3)
> search goes over the 8 second limit.
>
> Q - Does anyone have any experience in this area? Even better, does
> anyone have a script to do this? If not, and I do write a script to do
> this, would it be of value to the greater community?

If I were you, I'd test plain old search performance before I tested performance through a browser. And, again, it is pretty hard to generalize a script like this, since so many people have different search needs. In my opinion, Ruby makes it easy enough to write this from scratch each time.

> Sorry for the long winded post. My search page and category search are
> the most critical part of my site and I am anal about their performance,
> because if they do not work then my site will not work.
>
> Thanks once again for all your assistance. Sorry for any stupid or
> ignorant thoughts/remarks.
>
> Ferret rocks!

You're welcome,
Dave

> Clare
>
> --
> Posted via http://www.ruby-forum.com/.
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
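A quick sanity check of the skew in David's random_sentence generator above: tallying many draws of the same WORDS[Math.sqrt(rand(WORDS.size * WORDS.size))] expression should show later words dominating. This tally script is an illustration only, not part of Ferret:

```ruby
# Mirrors the word-picking expression in David's random_sentence.
WORDS = %w{one two three}

tally = Hash.new(0)
10_000.times do
  # Array#[] truncates the Float index, which skews picks toward
  # later positions in WORDS.
  tally[WORDS[Math.sqrt(rand(WORDS.size * WORDS.size))]] += 1
end

puts tally.inspect
```

With three words the truncated sqrt maps 1 of the 9 possible rand values to index 0, 3 to index 1, and 5 to index 2, so the expected proportions are roughly 11%, 33%, and 56%.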