Hello, I need to be able to count the occurrences of certain terms in the results. Currently my setup is Ferret 0.10.1 with aaf bleeding edge.

    results = VoObject.find_by_contents(query, :offset => page, :limit => 20, :sort => sort_fields)

I use results.total_hits for pagination. This all works really nicely. However, I need to know how many occurrences of certain predefined terms occur in each result set. So in the animals field there can be "mouse", "cat", "fish". A perfect solution would be for the result set to have some extra attributes like results.cat_hits (that would be amazing). In reality there need to be counts for 5 different fields.

So is this something that Ferret can do easily? How do I get Ferret and aaf to produce this data for each search result? What should I go and investigate?

Best regards
caspar

--
Posted via http://www.ruby-forum.com/.
Jens Kraemer
2006-Sep-07 15:10 UTC
[Ferret-talk] counting occurrences of words in the result set
On Thu, Sep 07, 2006 at 04:58:08PM +0200, Caspar wrote:
> Hello, I need to be able to count the occurrences of certain terms in the
> results. Currently my setup is Ferret 0.10.1 with aaf bleeding edge.
>
>     results = VoObject.find_by_contents(query, :offset => page, :limit => 20, :sort => sort_fields)
>
> I use results.total_hits for pagination. This all works really nicely.
> However, I need to know how many occurrences of certain predefined terms
> occur in each result set. So in the animals field there can be "mouse",
> "cat", "fish". A perfect solution would be for the result set to have
> some extra attributes like results.cat_hits (that would be amazing). In
> reality there need to be counts for 5 different fields.
>
> So is this something that Ferret can do easily?
> How do I get Ferret and aaf to produce this data for each search result?
> What should I go and investigate?

I'd first try issuing a separate query for each of your special terms (ANDed with the original query), and taking its result count.

Ideally you wouldn't use find_by_contents for this (because it fetches results from the db, which you don't want here), but use something like

    VoObject.ferret_index.search(query + " AND cat", ...).total_hits

Jens

--
webit! Gesellschaft für neue Medien mbH     www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer      kraemer at webit.de
Schnorrstraße 76                            Tel +49 351 46766 0
D-01069 Dresden                             Fax +49 351 46766 66
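Jens's suggestion amounts to a single loop over the predefined terms. A minimal plain-Ruby sketch of that loop, with a `search_count` lambda standing in for the real `VoObject.ferret_index.search(...).total_hits` call; the `DOCS` corpus and `ANIMAL_TERMS` list are made-up stand-ins for illustration, not aaf API:

```ruby
# Made-up corpus; each entry stands in for an indexed VoObject record.
DOCS = [
  { :animals => "cat",   :text => "tabby cat"   },
  { :animals => "mouse", :text => "field mouse" },
  { :animals => "cat",   :text => "black cat"   },
  { :animals => "fish",  :text => "gold fish"   },
]

ANIMAL_TERMS = %w{mouse cat fish}

# Stand-in for:
#   VoObject.ferret_index.search("#{query} AND animals:#{term}").total_hits
search_count = lambda do |term|
  DOCS.count { |doc| doc[:animals] == term }
end

counts = {}
ANIMAL_TERMS.each { |term| counts[term] = search_count.call(term) }
puts counts.inspect
```

With 35 predefined values this becomes 35 extra term queries per page, so whether it is fast enough is an empirical question.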
Hi Jens,

Thank you for getting back so quickly. I should have given more information about the problem, though. One of the fields contains about 35 predefined values. I hope there is a more efficient way of producing these counts, or I may well have to drop this functionality from the app.

Any other ideas? I really appreciate the speed with which people reply on this forum.

Regards
c

Jens Kraemer wrote:
> On Thu, Sep 07, 2006 at 04:58:08PM +0200, Caspar wrote:
>> there can be "mouse", "cat", "fish".
>> A perfect solution would be for the result set to have some extra
>> attributes like results.cat_hits (that would be amazing). In reality
>> there need to be counts for 5 different fields.
>>
>> So is this something that Ferret can do easily?
>> How do I get Ferret and aaf to produce this data for each search result?
>> What should I go and investigate?
>
> I'd first try issuing a separate query for each of your special
> terms (ANDed with the original query), and taking its result count.
>
> Ideally you wouldn't use find_by_contents for this (because it fetches
> results from the db, which you don't want here), but use something like
>
>     VoObject.ferret_index.search(query + " AND cat", ...).total_hits
>
> Jens
Hi,

Okay, we have spent the last few hours trawling through the Ferret API and have come across lots of promising leads, and many questions.

    index_reader.doc_freq(field, term) -> integer

Return the number of documents in which the term term appears in the field field.

This would seem to partly fit the requirements. However, when I have tried to instantiate a new index_reader like this

    reader = Ferret::Index::IndexReader.new("/home/c/V_O_2/index/development/vo_object/")

and then try to access some of the documents returned by search_each, I am only able to access the :id field.

Q1: How do you create an index_reader that is able to access your aaf index?
Q2: How do you actually return the contents of a field?
Q3: How can I combine doc_freq (which seems perfect) with a search to count the frequency of terms?

Any answers would be brilliant.

Best regards
caspar
Caspar,

I have been trying to get the same thing working for a while but never found a solution. It would help greatly if someone has the answer to this, because I want to add this capability to my search to provide additional information to the user on the results page.

But I only got the :id from the index also... :(

Any help would be appreciated on this one. Thanks in advance as always!

Clare
David Balmain
2006-Sep-08 02:58 UTC
[Ferret-talk] counting occurrences of words in the result set
On 9/8/06, Clare <clare.cavanagh at nospam.co.uk> wrote:
> Caspar
>
> I have been trying to get the same thing working for a while but never
> found a solution. It would help greatly if someone has the answer to
> this, because I want to add this capability to my search to provide
> additional information to the user on the results page.
>
> But I only got the :id from the index also... :(
>
> Any help would be appreciated on this one.
>
> Thanks in advance as always!

By default acts_as_ferret only stores the :id. You need to set the :store parameter of any other fields that you want stored. Something like this:

    acts_as_ferret :fields => {
      :title   => { :store => :yes },
      :content => { :store => :yes }
    }

As for counting the frequency of terms in a resultset, IndexReader#doc_freq probably won't work. It counts the frequency of terms in the index, not in the resultset.

So back to the problem. Jens gave the solution I would probably use. Ferret's searches are fast enough that this solution is quite feasible for most indexes. Try it. You might be surprised.

The alternative is counting through the resultset. To do this you will need to set :limit => :all in the search_each method so you get all results back, then iterate through each result counting the occurrences. For a huge index - slow query - small resultset this might be faster.

Also, with the new filter_proc method there is another way you can do this without having to retrieve all results:

    require 'rubygems'
    require 'ferret'
    include Ferret

    index = I.new
    words = %w{one two three four five}
    100000.times do |i|
      index << {:id => "%05d" % i, :word => words[rand(words.size)]}
    end

    counter = Hash.new(0)
    filter_proc = lambda do |doc, score, searcher|
      counter[searcher[doc][:word]] += 1
    end

    resultset = index.search("id:[10000 20000}", :limit => 1,
                             :filter_proc => filter_proc)
    puts resultset.total_hits
    puts counter.inspect

Hope that helps,
Dave
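The "iterate through the resultset" alternative David names reduces to the `Hash.new(0)` counting idiom. A minimal sketch, with a plain array standing in for the documents that search_each with :limit => :all would yield (the `results` data and the :word field are made up for illustration):

```ruby
# Stand-in for documents fetched via search_each with :limit => :all;
# the :word field is a hypothetical stored field.
results = [
  { :id => 1, :word => "cat"  },
  { :id => 2, :word => "fish" },
  { :id => 3, :word => "cat"  },
]

# Hash.new(0) returns 0 for missing keys, so each document simply
# increments the tally for its field value.
counter = Hash.new(0)
results.each { |doc| counter[doc[:word]] += 1 }

puts counter.inspect
```

The filter_proc variant in the post above does the same counting without first pulling every result out of Ferret.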
Clare
2006-Sep-08 07:40 UTC
[Ferret-talk] Performance Testing +counting occurrences of words in the res
Thanks David

I will try both options. I am in fact doing some performance testing now. I have created a 100,000-record search result set and it takes around 5 seconds (end to end) on my internal server to be returned (with 1 user). I am only doing 6 significant searches on this set: one for the main results and one for the top level categories. This is only on my test server and not on the larger production server, and I am happy with this performance. If however I were to do my second level category search, which has around 40 nodes in it, that would be 30 searches. I am not sure how this would perform.

What I am seeing is a CPU-hungry search but not a memory-hungry one. This makes sense to me.

Q - I have test data set up in my tests that has some random junk in it and then a word such as "fish" at the end. I am starting to think that I may have set up the test data wrong and should use a lot of different words in the result set, because I am sure that Ferret will cache the search. This would give me a false impression of the speed of search.

I will create more test data at the weekend, but my instinct is that your method outlined above may be faster.

I have 5 top level categories and this will not change much. Depending on the search there will be a lot more results in one category than the rest after the initial search.

Drilling into the second level categories, the most nodes I have in a single second level category is around 40 at the moment, although this is likely to be added to over time. The results again will not be normally distributed over the result set, but assuming for now that they were and I had 500,000 records, and drilled into the second tier category structure, I would have 100,000 records in this category. I would be doing 40 searches over 100,000 records.

Q - What do you think will perform faster in this instance?
I would love to have the time to build an x-dimensional memory-resident result (bucket set) that kept all the results parameterised for all the categories, built at the initial time of the search. It would be memory hungry but would make searching through categories and nodes and parameters in subsequent searches lightning fast.

Would be a great addition, or am I missing something?

I am really interested in the performance testing scenarios. As stated above, I only have one word "FISH" in my test data, with random made-up text before it, e.g. "sadssderssdaatg FISH" etc.

Q - Would I be better off using more words in my test data?

Also - I am interested in the round trip performance of search: the length of time it takes from when the user clicks on search until the results come back. I will do this on the production server in the production environment. My rule of thumb is that it should not take longer than 8 seconds to return the results or the user will refresh (even worse for performance). With one user on my test system with 6 searches over 100,000 records it takes 5 seconds at the moment.

I am expecting a large number of concurrent searches. I am defining concurrency as someone searching at the same time as another user is either searching or waiting for the results to be returned.

Most testing tools that I can see only show you what is happening on the server. I am interested in the user's perspective.

I had a thought of setting up a script that would open a number of browser sessions and do random searches concurrently, hammering the server to see when it 1) breaks search, 2) breaks something else, 3) search goes over the 8 second limit.

Q - Does anyone have any experience in this area? Even better, does anyone have a script to do this? If not, and I do write a script to do this, would it be of value to the greater community?

Sorry for the long winded post.
My search page and category search are the most critical part of my site and I am anal about their performance, because if they do not work then my site will not work.

Thanks once again for all your assistance. Sorry for any stupid or ignorant thoughts/remarks.

Ferret rocks!

Clare
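The concurrent-hammering script Clare describes can be sketched with plain Ruby threads. Here `do_search` is a placeholder for the real request (a production script might use Net::HTTP against the search page); the harness only records per-request wall-clock times so they can be checked against the 8-second budget:

```ruby
# do_search is a placeholder for a real search request (e.g. an HTTP GET
# against the search page); the sleep just simulates server work.
do_search = lambda { sleep 0.01 }

CONCURRENT_USERS  = 5
REQUESTS_PER_USER = 3
LIMIT_SECONDS     = 8

timings = []
mutex   = Mutex.new

threads = Array.new(CONCURRENT_USERS) do
  Thread.new do
    REQUESTS_PER_USER.times do
      started = Time.now
      do_search.call
      # Record how long this simulated user waited for results.
      mutex.synchronize { timings << Time.now - started }
    end
  end
end
threads.each(&:join)

slow = timings.count { |t| t > LIMIT_SECONDS }
puts "#{timings.size} requests, #{slow} over the #{LIMIT_SECONDS}s limit"
```

Raising CONCURRENT_USERS until the recorded timings cross the limit gives a rough answer to the "when does it break" question, from the user's side rather than the server's.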
David Balmain
2006-Sep-08 08:49 UTC
[Ferret-talk] Performance Testing +counting occurrences of words in the res
On 9/8/06, Clare <Clare at nospam.com> wrote:
> Thanks David
>
> I will try both options. I am in fact doing some performance testing now.
> I have created a 100,000-record search result set and it takes around
> 5 seconds (end to end) on my internal server to be returned (with 1
> user). I am only doing 6 significant searches on this set: one for the
> main results and one for the top level categories. This is only on my
> test server and not on the larger production server, and I am happy with
> this performance. If however I were to do my second level category
> search, which has around 40 nodes in it, that would be 30 searches. I am
> not sure how this would perform.
>
> What I am seeing is a CPU-hungry search but not a memory-hungry one.
> This makes sense to me.
>
> Q - I have test data set up in my tests that has some random junk in it
> and then a word such as "fish" at the end. I am starting to think that
> I may have set up the test data wrong and should use a lot of different
> words in the result set, because I am sure that Ferret will cache the
> search. This would give me a false impression of the speed of search.

Firstly, searches don't get cached. Only filters do. If you want to cache the results from a query (which you would in this instance) then you should use a QueryFilter.

Secondly, I'm not sure exactly what you are saying when you say your tests have some random junk and then the word "fish". If you are putting data like this into every document:

    index << "asdlgkjhasd askdj asdg asdg asdg asdg lkjh asd fish"

then you probably should work on your test data. As far as search performance goes, this will be no different from doing this:

    index << "fish"

What is important to remember is that TermQueries (fish) perform a lot better than BooleanQueries (fish AND rod) and PhraseQueries ("fishing rod"), which perform better again than WildCardQueries (fi*), so you should try these queries too.
Here is a much better way to create random strings:

    WORDS = %w{one two three}

    def random_sentence(min_size, max_size)
      len = min_size + rand(max_size - min_size)
      sentence = []
      len.times { sentence << WORDS[Math.sqrt(rand(WORDS.size * WORDS.size))] }
      sentence.join(" ")
    end

    10.times { puts random_sentence(10, 100) }

The Math.sqrt stuff makes sure that words aren't evenly distributed, to be more realistic. Words appearing later in the WORDS array will be much more common. Even better than this would be to use a copy of the real data that you will be using, though.

> I will create more test data at the weekend, but my instinct is
> that your method outlined above may be faster.
>
> I have 5 top level categories and this will not change much. Depending
> on the search there will be a lot more results in one category than the
> rest after the initial search.
>
> Drilling into the second level categories, the most nodes I have in a
> single second level category is around 40 at the moment, although this
> is likely to be added to over time. The results again will not be
> normally distributed over the result set, but assuming for now that they
> were and I had 500,000 records, and drilled into the second tier
> category structure, I would have 100,000 records in this category. I
> would be doing 40 searches over 100,000 records.
>
> Q - What do you think will perform faster in this instance?

Impossible to say without testing. Both methods are pretty simple though, so I'd try both with a variety of search strings.

> I would love to have the time to build an x-dimensional memory-resident
> result (bucket set) that kept all the results parameterised for all the
> categories, built at the initial time of the search. Would be memory
> hungry but would make searching through categories and nodes and
> parameters in subsequent searches lightning fast.
> Would be a great addition, or am I missing something?

As far as I'm concerned this functionality is already there with the filter_proc parameter. Make it any less general than this and it isn't much use anymore. For example:

    require 'rubygems'
    require 'ferret'
    include Ferret

    index = I.new
    words = %w{one two three four five}
    100000.times do |i|
      index << {:id => "%05d" % i, :word => words[rand(words.size)]}
    end

    groups = {}
    filter_proc = lambda do |doc, score, searcher|
      word = searcher[doc][:word]
      (groups[word] ||= []) << doc
    end

    resultset = index.search("id:[09900 10000}", :limit => 1,
                             :filter_proc => filter_proc)
    puts resultset.total_hits
    puts groups.inspect
    puts groups["two"].size

I really can't see how you could make it any easier than that.

> I am really interested in the performance testing scenarios. As stated
> above, I only have one word "FISH" in my test data, with random made-up
> text before it, e.g. "sadssderssdaatg FISH" etc.
>
> Q - Would I be better off using more words in my test data?

See above.

> Also - I am interested in the round trip performance of search: the
> length of time it takes from when the user clicks on search until the
> results come back. I will do this on the production server in the
> production environment. My rule of thumb is that it should not take
> longer than 8 seconds to return the results or the user will refresh
> (even worse for performance). With one user on my test system with 6
> searches over 100,000 records it takes 5 seconds at the moment.

5 seconds seems like a long time. Try optimizing your index and see how you go then. The example above took 0.028109 seconds. Personally, I would be worried about anything taking over 1 second, which was the whole reason I wrote Ferret in C.

> I am expecting a large number of concurrent searches. I am
> defining concurrency as someone searching at the same time as another
> user is either searching or waiting for the results to be returned.
> Most testing tools that I can see only show you what is happening on
> the server. I am interested in the user's perspective.
>
> I had a thought of setting up a script that would open a number of
> browser sessions and do random searches concurrently, hammering the
> server to see when it 1) breaks search, 2) breaks something else, 3)
> search goes over the 8 second limit.
>
> Q - Does anyone have any experience in this area? Even better, does
> anyone have a script to do this? If not, and I do write a script to do
> this, would it be of value to the greater community?

If I were you, I'd test plain old search performance before I tested performance through a browser. And, again, it is pretty hard to generalize a script like this, since so many people have different search needs. In my opinion, Ruby makes it easy enough to write this from scratch each time.

> Sorry for the long winded post. My search page and category search are
> the most critical part of my site and I am anal about their performance,
> because if they do not work then my site will not work.
>
> Thanks once again for all your assistance. Sorry for any stupid or
> ignorant thoughts/remarks.
>
> Ferret rocks!

You're welcome,
Dave

> Clare
>
> --
> Posted via http://www.ruby-forum.com/.
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
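A quick sanity check of the skew in David's random_sentence generator above: tallying many draws of the same WORDS[Math.sqrt(rand(WORDS.size * WORDS.size))] expression should show later words dominating. This tally script is an illustration only, not part of Ferret:

```ruby
# Mirrors the word-picking expression in David's random_sentence.
WORDS = %w{one two three}

tally = Hash.new(0)
10_000.times do
  # Array#[] truncates the Float index, which skews picks toward
  # later positions in WORDS.
  tally[WORDS[Math.sqrt(rand(WORDS.size * WORDS.size))]] += 1
end

puts tally.inspect
```

With three words the truncated sqrt maps 1 of the 9 possible rand values to index 0, 3 to index 1, and 5 to index 2, so the expected proportions are roughly 11%, 33%, and 56%.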