Wow, thanks for taking the time to put that together Dave. That looks
very promising.
I appreciate it. If I have a chance to do performance tests, I will
report back to this list.
Tom
On 2/28/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> On 2/28/06, Tom Davies <atomgiant at gmail.com> wrote:
> > Hi,
> >
> > I have an index where each document contains an untokenized 'url'
> > field. I would like to query the index for the most popular urls. In
> > SQL I would do this via a Group By clause. Is there anything in
> > Ferret that will do something similar?
> >
> > I found this discussion that proposed a solution involving TermEnums:
> >
> > http://www.gossamer-threads.com/lists/lucene/java-user/32272#32272
> >
> > But I noticed the IndexReader.terms and IndexReader.term_docs are not
> > implemented. Is that solution the way to go? Would an index-only
> > solution perform a lot faster than a pure database solution using a
> > group by clause?
>
> Hi Tom,
>
> Those methods are implemented, just not in IndexReader. They're
> implemented in SegmentReader and MultiReader. IndexReader is an
> abstract class. Whenever you call IndexReader#open you'll get either a
> SegmentReader or a MultiReader.
>
> Anyway, if you want to run searches on all documents with the url
> field you could use a filter like this:
>
> module Ferret::Search
>   # A Filter that restricts search results to only those documents with a
>   # certain field called @group_name.
>   class GroupFilter < Filter
>     include Ferret::Index
>
>     def initialize(group_name)
>       @group_name = group_name
>     end
>
>     # Returns a BitVector with true for documents which should be permitted in
>     # search results, and false for those that should not.
>     def bits(reader)
>       bits = Ferret::Utils::BitVector.new()
>       term_enum = reader.terms_from(Term.new(@group_name, ""))
>
>       begin
>         if (term_enum.term() == nil)
>           return bits
>         end
>         term_docs = reader.term_docs
>         begin
>           begin
>             term = term_enum.term()
>             break if (term.nil? or term.field != @group_name)
>
>             term_docs.seek(term_enum)
>             while term_docs.next?
>               bits.set(term_docs.doc)
>             end
>           end while term_enum.next?
>         ensure
>           term_docs.close()
>         end
>       ensure
>         term_enum.close()
>       end
>
>       return bits
>     end
>   end
> end
>
> Or perhaps you only want the 10 most popular urls and you'd like to
> create the filter like this:
>
> filter = GroupFilter.new("url", ["url1", "url2", ..., "url10"])
>
> This filter might look something like this:
>
> module Ferret::Search
>   # A Filter that restricts search results to only those documents with a
>   # certain field called @field_name with values in the @values array.
>   class GroupFilter < Filter
>     include Ferret::Index
>
>     def initialize(field_name, values)
>       @field_name = field_name
>       @values = values
>     end
>
>     # Returns a BitVector with true for documents which should be permitted in
>     # search results, and false for those that should not.
>     def bits(reader)
>       bits = Ferret::Utils::BitVector.new()
>       term_enum = reader.terms_from(Term.new(@field_name, ""))
>
>       begin
>         if (term_enum.term() == nil)
>           return bits
>         end
>         term_docs = reader.term_docs
>         begin
>           begin
>             term = term_enum.term()
>             break if (term.nil? or term.field != @field_name)
>
>             if @values.index(term.text)
>               term_docs.seek(term_enum)
>               while term_docs.next?
>                 bits.set(term_docs.doc)
>               end
>             end
>           end while term_enum.next?
>         ensure
>           term_docs.close()
>         end
>       ensure
>         term_enum.close()
>       end
>
>       return bits
>     end
>   end
> end
>
> WARNING: I haven't tested any of this code. Also, I don't know how it
> would perform compared to using a group_by on the database itself,
> although I'd be happy to hear about any performance tests you might
> do. I hope this helps.
>
> Cheers,
> Dave
>
> >
> > Any feedback is appreciated.
> >
> > Tom
> >
> > _______________________________________________
> > Ferret-talk mailing list
> > Ferret-talk at rubyforge.org
> > http://rubyforge.org/mailman/listinfo/ferret-talk
> >
>
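
For the "most popular urls" part of the question, the grouping step itself can be sketched in plain Ruby, independent of Ferret. This is only an illustration: `top_urls` is a hypothetical helper, and the `docs` array stands in for the url values stored in the index. In Ferret the counts would presumably come from walking the term enumeration rather than from an array.

```ruby
# Hypothetical helper: counts how often each url occurs and returns the
# `limit` most frequent ones as [url, count] pairs.
def top_urls(urls, limit = 10)
  counts = Hash.new(0)
  urls.each { |url| counts[url] += 1 }
  # Sort by descending count, then by url so that ties have a stable order.
  counts.sort_by { |url, count| [-count, url] }.first(limit)
end

# Stand-in data: one untokenized url value per document (assumed, not real).
docs = [
  "http://a.example/", "http://b.example/", "http://a.example/",
  "http://c.example/", "http://a.example/", "http://b.example/"
]

top = top_urls(docs, 2)
# Just the url strings, e.g. to pass as the values array of the second filter.
values = top.map(&:first)
```

The resulting `values` array has the same shape as the list of urls passed to the second filter above.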