thr3ads.net - Ferret talk - [Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's [Jan 2006]

Finn Smith

2006-Jan-02 23:20 UTC

[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's

Recently I've been revisiting some of my search code. With a greater
understanding of how Java Lucene implements its search methods, I
realized that one level of abstraction is not present in the Ferret
classes/methods. Here are the relevant method signatures:

Ferret's search methods:

in Ferret::Index::Index:
search(query, options = {}) -> returns a TopDocs
search_each(query, options = {}) {|doc, score| ...} -> yields to
context w/ doc and score for each hit

in Ferret::Search::IndexSearcher:
search(query, options = {}) -> returns a TopDocs
search_each(query, filter = nil) {|doc, score| ...} -> yields to
context w/ doc and score for each hit


Lucene's search methods:

in the interface Searchable:
public void search(Query query, Filter filter, HitCollector results)
public TopDocs search(Query query, Filter filter, int n)
public TopFieldDocs search(Query query, Filter filter, int n, Sort sort)

in org.apache.lucene.search.Searcher (which implements Searchable):
public final Hits search(Query query)
public Hits search(Query query, Filter filter)
public Hits search(Query query, Sort sort)
public Hits search(Query query, Filter filter, Sort sort)


I was wondering if there were plans to implement the Hits class in
Ferret. (Or if someone were to write a patch implementing it, would
David integrate it into the source?) It seems like a useful
abstraction, since TopDocs does not allow you to access its hits by
index, only by the .each() method call.

Some questions:
 * Will changing these methods break people's existing code?
 * Where is the proper place to put these methods? Move the methods
that return TopDocs to a module, which is more or less the same as a
Java interface, and implement the methods that return Hits directly in
the class? What is a good way to do this that feels Rubyish and takes
advantage of Ruby's strengths and idioms?
 * The options to limit the search (first_doc and num_doc) in
Search::IndexSearcher and the code that implements them should
probably be moved out of Search::IndexSearcher into Index::Index
 * Are there lower level issues I am not aware of that would make any
of this a bad idea?

Am I missing something here? Are there reasons not to have Ferret's
implementation of these methods and classes follow Java Lucene's as
closely as possible? I'd appreciate hearing your thoughts.

-F

David Balmain

2006-Jan-12 01:43 UTC

On 1/3/06, Finn Smith <fcsmith at gmail.com> wrote:
> I was wondering if there were plans to implement the Hits class in
> Ferret. (Or if someone were to write a patch implementing them, would
> David integrate it into the source?)

I'd be happy to integrate it if someone sends me a patch. Having said
that...
> It seems like it is a useful
> abstraction since TopDocs does not allow you to access its hits by
> index, only by the .each() method call.

Actually, you can access the hits by index like this:

    hit_three = topdocs.score_docs[2]

The reason I didn't bother implementing the Hits class is that I can't
see that it adds anything useful. Really, it all just seems a matter of
notation: what is easiest for people to understand and remember.
Adding the Hits class might just make everything a little more
complicated. Please refer to Martin Fowler's discussion of the Humane
Interface:

http://www.martinfowler.com/bliki/HumaneInterface.html

While Java likes to have multiple different implementations of simple
interfaces and a separate class for each data structure, in Ruby you
can use an array for many different jobs: stack, list, queue, etc. I
feel it would be better to do the same thing with TopDocs. Rather than
adding the Hits class, I feel it would be better to add the desired
functionality to TopDocs. I'm happy to listen to other points of view.
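
As a rough illustration of the "add it to TopDocs" idea, here is a minimal sketch of a single result class that supports both block iteration and array-style access. The names SketchTopDocs and SketchScoreDoc are made up for this example and are not Ferret's actual classes; only score_docs and total_hits echo names used in this thread.

```ruby
# Hypothetical stand-ins for illustration; these are not Ferret's actual classes.
SketchScoreDoc = Struct.new(:doc, :score)

class SketchTopDocs
  include Enumerable

  attr_reader :total_hits, :score_docs

  def initialize(total_hits, score_docs)
    @total_hits = total_hits
    @score_docs = score_docs
  end

  # Yield (doc, score) for each hit, mirroring the search_each block style.
  def each
    return enum_for(:each) unless block_given?
    @score_docs.each { |sd| yield sd.doc, sd.score }
  end

  # Array-style access by rank, delegating to the underlying array.
  def [](index)
    @score_docs[index]
  end

  def size
    @score_docs.size
  end
end

top_docs = SketchTopDocs.new(42, [
  SketchScoreDoc.new(7, 0.9),
  SketchScoreDoc.new(2, 0.5),
  SketchScoreDoc.new(4, 0.1)
])

hit_three = top_docs[2]                        # third-ranked hit
doc_ids   = top_docs.map { |doc, _score| doc } # Enumerable methods come from #each
```

Including Enumerable means one class covers iteration, mapping, and indexing, which is the "one array, many jobs" argument applied to search results.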
> Some questions:
>  * Will changing these methods break people's existing code?

Perhaps. It depends on what we change. Ferret is still beta, though, so I
think it's open to non-backwards-compatible changes if necessary,
although we should avoid this if possible.
>  * Where is the proper place to put these methods? Move the methods
> that return TopDocs to a module, which is more or less the same as a
> Java interface, and implement the methods that return Hits directly in
> the class? What is a good way to do this that feels Rubyish and takes
> advantage of its strengths and idioms?

I think I answered this already. I'd like to keep TopDocs as a class
and add the desired functionality to it.
>  * The options to limit the search (first_doc and num_doc) in
> Search::IndexSearcher and the code that implements them should
> probably be moved out of Search::IndexSearcher into Index::Index

I think this needs to stay in IndexSearcher, as it limits the amount of
memory used by a search. Even the Java version allows you to specify
nDocs.

Hope this helps. Feedback is welcome.

Cheers,
Dave

Erik Hatcher

2006-Jan-12 13:24 UTC

On Jan 11, 2006, at 8:43 PM, David Balmain wrote:
>> I was wondering if there were plans to implement the Hits class in
>> Ferret. (Or if someone were to write a patch implementing them, would
>> David integrate it into the source?)
>
> I'd be happy to integrate it if someone sends me a patch. Having
> said that...
>
>> It seems like it is a useful
>> abstraction since TopDocs does not allow you to access its hits by
>> index, only by the .each() method call.
>
> Actually you can access the hits by index like this;
>
>     hit_three = topdocs.score_docs[2]
>
> The reason I didn't bother implementing the hits class is that I can't
> see that it adds anything useful. Really it all just seems a matter of
> notation.
It's more than just notation. Hits performs some caching of Document
objects and also provides a means to iterate through the hits
without having to manually re-search, as it does that under the covers.
Sure, it's perhaps a mere convenience, but a handy abstraction
nonetheless.
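
To make the caching and re-searching concrete, here is a minimal sketch under stated assumptions: LazyHits, CountingSearcher, search_top, and the window-doubling policy are all invented for this example and are not taken from the Ferret or Lucene sources. It only illustrates the general idea of fetching a small window of hits, transparently re-running the query when a later hit is requested, and caching loaded documents.

```ruby
# Hypothetical sketch; class and method names are assumptions, not Ferret or Lucene APIs.
LazyTopDocs  = Struct.new(:total_hits, :score_docs)
LazyScoreDoc = Struct.new(:doc, :score)

# Stand-in searcher that counts how often it searches or loads a document.
class CountingSearcher
  attr_reader :searches, :loads

  def initialize(doc_ids)
    @doc_ids = doc_ids
    @searches = 0
    @loads = 0
  end

  # Return the top n hits plus the total number of matches.
  def search_top(_query, n)
    @searches += 1
    hits = @doc_ids.first(n).each_with_index.map { |id, i| LazyScoreDoc.new(id, 1.0 / (i + 1)) }
    LazyTopDocs.new(@doc_ids.size, hits)
  end

  def doc(id)
    @loads += 1
    "document-#{id}"
  end
end

class LazyHits
  include Enumerable

  attr_reader :length

  def initialize(searcher, query, initial_size = 2)
    @searcher = searcher
    @query = query
    @score_docs = []
    @doc_cache = {}
    fetch_more(initial_size)
  end

  # Return the stored document for the n-th ranked hit.
  def doc(n)
    raise IndexError, "hit #{n} out of range" unless n >= 0 && n < @length
    fetch_more(n + 1) if n >= @score_docs.size  # re-search under the covers
    id = @score_docs[n].doc
    @doc_cache[id] ||= @searcher.doc(id)        # load each document at most once
  end

  def each
    return enum_for(:each) unless block_given?
    @length.times { |n| yield doc(n) }
  end

  private

  # Re-run the query asking for at least `min` hits, doubling the window each time.
  def fetch_more(min)
    n = [min, @score_docs.size * 2].max
    top = @searcher.search_top(@query, n)
    @length = top.total_hits
    @score_docs = top.score_docs
  end
end

searcher = CountingSearcher.new([10, 11, 12, 13, 14])
hits = LazyHits.new(searcher, "ruby")
first_doc  = hits.doc(0)  # served from the initial window
fourth_doc = hits.doc(3)  # outside the window, so the query is re-run
hits.doc(3)               # second access comes from the document cache
```

The caller never sees the second search or the cache; that hidden bookkeeping is the part a plain TopDocs array does not provide on its own.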
> What is easiest for people to understand and remember.
> Adding the hits class might just make everything a little more
> complicated. Please refer to Martin Fowler's discussion on the Humane
> interface;
>
> http://www.martinfowler.com/bliki/HumaneInterface.html
>
> While Java likes to have multiple different implementations of simple
> interfaces and a separate class for each data structure, in Ruby you
> can use an array for many different jobs; stack, list queue etc. I
> feel it would be better to do the same thing with TopDocs. Rather than
> adding the Hits class I feel it would be better to add the desired
> functionality to TopDocs. I'm happy to listen to other points of view.
I think not having Hits makes it more complicated for those coming
from Java Lucene at least, but it is also a conceptual abstraction.
One thinks of getting "hits" back from a search, not "top docs". So
in that sense, the semantics of having Hits is powerful. Part of
Fowler's argument is to have redundancy, aliases, and conveniences
for the humane interface, and I think Hits would qualify in that regard.

	Erik

David Balmain

2006-Jan-12 23:52 UTC

On 1/12/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> I think not having Hits makes it more complicated for those coming
> from Java Lucene at least, but it is also a conceptual abstraction.
> One thinks of getting "hits" back from a search, not "top docs".  So
> in that sense, the semantics of having Hits is powerful.  Part of
> Fowler's argument is to have redundancy, aliases, and conveniences
> for the humane interface, and I think Hits would qualify in that regard.
>
>         Erik

I'm not arguing that TopDocs is a better name than Hits. Rather, I'm
arguing that having search methods return two different classes is
unnecessary and not "The Ruby Way". My goal is to make Ferret easy for Ruby
programmers to use, not Java programmers. So what I'd like to hear is
an argument as to why having two separate classes - TopDocs and Hits -
is superior to combining the functionality of both into one class. My
personal feeling is that this is where the difference lies between
Java and Ruby, but I could easily be swayed.

Dave

Erik Hatcher

2006-Jan-13 02:12 UTC

On Jan 12, 2006, at 6:52 PM, David Balmain wrote:
> I'm not arguing that TopDocs is a better name than Hits. Rather that
> having search methods return two different classes is unnecessary and
> not "The Ruby Way". My goal is to make Ferret easy for Ruby
> programmers to use, not Java programmers. So what I'd like to hear is
> an argument as to why having two separate classes - TopDocs and Hits -
> is superior to combining the functionality of both into one class. My
> personal feeling is that this is where the difference lies between
> Java and Ruby but I could easily be swayed.

It seems an injustice to Java in this regard. Surely Hits and
TopDocs could have their functionality blended together into a single
class. There was an intentional separation, not some constraint that
Java the language imposed.

I'm being a bit defensive of the Lucene API here and don't want to
see Ferret diverge too much from it for no real benefit. What's one
more class in Ruby in this situation to maintain consistency across
languages for the finest search engine available? Seems a small
sacrifice of Ruby "purity" to make for the noble cause :) Just my
$0.02.

Practically no one in Java Lucene uses TopDocs - you'll notice that
all of those search methods are marked as "Expert". Hits is the most
common way to access search results, allowing them to automatically
be paged through and have a bit of caching along with it.

	Erik

Vamsee Kanakala

2006-Jan-13 05:18 UTC

David Balmain wrote:
> My personal feeling is that this is where the difference lies between
> Java and Ruby but I could easily be swayed.

Hi Dave & Erik,

I don't intend to hurt anybody's feelings, but let me speak up on
something: I did some 2 years of Java programming, and was never really
comfortable with its verbosity, though I liked it for other things. I
felt Ferret's API is already a bit un-Rubyish, if you know what I mean.
It almost feels like I'm back to using Java libraries again.

Learning Rails took a complete break from the way I was doing J2EE
programming. But I totally love it for its refreshing simplicity, and
thereby came to love Ruby too. So what I mean to say is, don't be afraid
to break compatibility if it makes the programmer's life easier. I agree
with Dave's sentiments that Ferret should be better than Lucene. I'm
guilty of not understanding what it really takes, but I think we should
put 'making a developer's job easy' before anything else. If it means
breaking compatibility with Lucene, I don't really mind.

Of course, I speak from a very selfish point of view - I don't have any
Lucene apps to run or port. Just my 2 cents.

Regards,
Vamsee.

David Balmain

2006-Jan-13 10:13 UTC

On 1/13/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> It seems an injustice to Java in this regard.  Surely Hits and
> TopDocs could have their functionality blended together into single
> class.  There was an intentional separation, not some constraint that
> Java the language imposed.
I was never implying there was some constraint imposed by the language
itself. I'm talking about the way things are done in Ruby versus the
way things are done in Java. There was an intentional separation of
ArrayList, Vector, Stack, etc. in Java too, but it doesn't mean we have
to do the same thing in Ruby. I'm not saying one way is better than
the other. But Ferret is a Ruby library, so I'd like to do it the Ruby
way where possible.
> I'm being a bit defensive of the Lucene API here and don't want to
> see Ferret diverge too much from it for no real benefit.  What's one
> more class in Ruby in this situation to maintain consistency across
> languages for the finest search engine available?  Seems a small
> sacrifice of Ruby "purity" to make for the noble cause :)  Just my
> $0.02.
>
> Practically no one in Java Lucene uses TopDocs - you'll notice that
> all of those search methods are marked as "Expert".  Hits is the most
> common way to access search results, allowing them to automatically
> be paged through and have a bit of caching along with it.

If this is the case then maybe I should just return a Hits object,
roll the TopDocs functionality into it and be done with it. If
practically no one is using TopDocs then practically no one will miss
it. ;-)

What I was really looking for (and still hope to see) was an argument
discussing the pros and cons of having the two separate classes (and
"That's what they did in Java" doesn't count :-). Nevertheless, I've
revisited the Hits class in Lucene and thought more about the
issue at hand, and Hits will be coming in the next release of Ferret. I
haven't decided exactly how I'm going to do it yet. There will
probably still be some differences from the Lucene API. For example,
search_each() is here to stay. I'll probably bring it up for
discussion again when I come to it. I still have a fair bit of work in
cFerret before I get to that stage.

Cheers,
Dave

Erik Hatcher

2006-Jan-13 10:28 UTC

Vamsee,

No feelings hurt here, and I completely understand your sentiment.

There are folks using Java Lucene that have expressed similar  
sentiment about its API, which is more C-like than Java-like in many  
ways.  But, let's focus on the heart of this thing... a high-powered
full-text search engine.  The goal is speed and efficient use of  
resources.  An elegant API is desirable but of a secondary nature.   
Hits is a pretty elegant way to navigate search results.  I hope this  
simple class can find its way into Ferret and that the IndexSearcher  
API be made reasonably similar to Java Lucene.

Dave has done a great job with the Index class in Ferret, which has  
features that Java Lucene does not - that of being able to flush and  
see changes right away (which is harder to manage in Java Lucene) and  
also of having keys to documents and managing an "update".  In Java  
Lucene there is no concept of an update - there is only a remove and  
an add.

I'm all for a slick Ruby API, but I would very much like to see it
built on top of a Lucene-compatible index format for
interoperability. That interoperability is important to me and will
very likely be important to others. Consider Nutch, for example. It
is an incredibly scalable web crawler and indexer. With index
compatibility you could use Nutch to crawl the web and use Ferret for
searching. Further, the HTML parsers I've used in Ruby are lousy
compared to what is available in Java. Indexing in Java makes a lot
of sense in many circumstances, but fronting an application with
Rails and Ferret also makes a lot of sense.

Dave is, of course, the creator and driver of Ferret. I encourage
him to consider keeping index file compatibility, keeping the basic API
intact for the classes that are direct ports of Lucene, and innovating
on top of them rather than changing them. He certainly may choose to
do otherwise, but doing so would likely drive me (and perhaps
others?) to other solutions.

	Erik

Erik Hatcher

2006-Jan-13 10:36 UTC

On Jan 13, 2006, at 5:13 AM, David Balmain wrote:
>> I'm being a bit defensive of the Lucene API here and don't want to
>> see Ferret diverge too much from it for no real benefit.  What's one
>> more class in Ruby in this situation to maintain consistency across
>> languages for the finest search engine available?  Seems a small
>> sacrifice of Ruby "purity" to make for the noble cause :)  Just my
>> $0.02.
>>
>> Practically no one in Java Lucene uses TopDocs - you'll notice that
>> all of those search methods are marked as "Expert".  Hits is the most
>> common way to access search results, allowing them to automatically
>> be paged through and have a bit of caching along with it.
>
> If this is the case then maybe I should just return a Hits object,
> roll the TopDocs functionality into it and be done with it. If
> practically no one is using TopDocs then practically no one will miss
> it. ;-)

My argument is mainly on why there is an issue with one more internal
class. You've ported quite faithfully the underlying Lucene class
structure and API. Why is this one more item a big deal? TopDocs is
useful, don't get me wrong. It is just not used by the general
Lucene consuming public, but many expert level folks do use it.

I hope that Hits and TopDocs can stay, and whether it makes sense for
them to be separate classes or not is really immaterial, but at least
keep the most public and useful part of Lucene's API, IndexSearcher,
as compatible as possible.
> What I was really looking for (and still hope to see) was an argument
> discussing the pros and cons of having the two separate classes (and
> "That's what they did in Java" doesn't count :-).

Keeping a consistent IndexSearcher API between Java Lucene and Ferret
is definitely an argument that counts for me personally. Innovating
"Ruby Way" features alongside that is also greatly desirable for sure!

> Nevertheless, I've
> revisited the Hits class in Lucene and I've thought more about the
> issue at hand and Hits will be coming in the next release of Ferret.

Yay!!!

> I haven't decided exactly how I'm going to do it yet. There will
> probably still be some differences from the Lucene API. For example,
> search_each() is here to stay. I'll probably bring it up for
> discussion again when I come to it. I still have a fair bit of work in
> cFerret before I get to that stage.

Adding conveniences with block iteration and such makes me extremely
happy! PyLucene did the same thing.

	Erik

p.s. whispering *GCJ... SWIG....*  :)

Finn Smith

2006-Jan-13 19:05 UTC

On 1/11/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> >  * The options to limit the search (first_doc and num_doc) in
> > Search::IndexSearcher and the code that implements them should
> > probably be moved out of Search::IndexSearcher into Index::Index
>
> I think this needs to stay in IndexSearcher as it limits the amount of
> memory used by a search. Even the java version allows you to specify
> nDocs.

Reviewing the code again, and taking another look at the Java code, I
think you're right about this. If there is a more general search
method exposed that returns Hits, I'll be happy.

-F

Finn Smith

2006-Jan-13 19:42 UTC

On 1/13/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> On 1/13/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> > Practically no one in Java Lucene uses TopDocs - you'll notice that
> > all of those search methods are marked as "Expert".  Hits is the most
> > common way to access search results, allowing them to automatically
> > be paged through and have a bit of caching along with it.
>
> If this is the case then maybe I should just return a Hits object,
> roll the TopDocs functionality into it and be done with it. If
> practically no one is using TopDocs then practically no one will miss
> it. ;-)
>
> What I was really looking for (and still hope to see) was an argument
> discussing the pros and cons of having the two separate classes (and
> "That's what they did in Java" doesn't count :-). Nevertheless, I've
> revisited the Hits class in Lucene and I've thought more about the
> issue at hand and Hits will be coming in the next release of Ferret. I
> haven't decided exactly how I'm going to do it yet. There will
> probably still be some differences from the Lucene API. For example,
> search_each() is here to stay. I'll probably bring it up for
> discussion again when I come to it. I still have a fair bit of work in
> cFerret before I get to that stage.
I was curious how this problem was addressed in other languages that
are not as strongly typed as Java so I took a look at the Plucene
implementation.

In Plucene there is an abstract base class Searcher which
IndexSearcher inherits from. Searcher has the method search which
instantiates a Hits object and passes "self" in as the searcher
argument before returning the newly created Hits object. The abstract
method search_top is implemented in IndexSearcher and returns TopDocs.
The search_top method is used internally by Hits when retrieving
results.

This follows the Java implementation pretty closely while still having
some of the advantages of more dynamic languages. A method isn't
defined for each possible combination of arguments. Rather, methods
are identified by their functionality as reflected in their name. This
is in contrast to Java, where a bunch of methods with the same name
("search") are told apart only by their argument lists (the return
type alone cannot distinguish overloads).
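
As a small illustration of the dynamic-language side of that contrast, the sketch below folds the filter/sort/count variations into a single Ruby method driven by an options hash. It is a toy under stated assumptions: TinySearcher, the substring matching, and the option keys :filter, :sort and :num_docs are invented for the example and are not claimed to match Ferret's or Plucene's actual API.

```ruby
# Hypothetical sketch: one search method, with behaviour selected by options
# instead of by a family of overloads. All names here are made up.
class TinySearcher
  def initialize(docs)
    @docs = docs
  end

  def search(query, options = {})
    filter   = options[:filter]               # callable returning true to keep a doc, or nil
    sort     = options[:sort]                 # callable returning a sort key, or nil
    num_docs = options.fetch(:num_docs, 10)   # maximum number of results

    hits = @docs.select { |doc| doc.include?(query) }
    hits = hits.select { |doc| filter.call(doc) } if filter
    hits = hits.sort_by { |doc| sort.call(doc) } if sort
    hits.first(num_docs)
  end
end

searcher = TinySearcher.new(["lucene in action", "ferret", "plucene", "lucene"])
short_only = lambda { |doc| doc.size < 10 }
by_length  = lambda { |doc| doc.size }

p searcher.search("lucene")
p searcher.search("lucene", :filter => short_only)
p searcher.search("lucene", :sort => by_length, :num_docs => 1)
```

Each call here corresponds roughly to a separate overload in the Java API; in Ruby the combinations do not need their own method definitions.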

I don't know if it will be any help, but it might be worth glancing
through the Plucene code for another perspective on how to organize
the various objects and their interactions.

-F

David Balmain

2006-Jan-14 10:20 UTC

On 1/14/06, Finn Smith <fcsmith at gmail.com> wrote:
> On 1/13/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> > On 1/13/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> > > Practically no one in Java Lucene uses TopDocs - you'll notice that
> > > all of those search methods are marked as "Expert". Hits is the most
> > > common way to access search results, allowing them to automatically
> > > be paged through and have a bit of caching along with it.
> >
> > If this is the case then maybe I should just return a Hits object,
> > roll the TopDocs functionality into it and be done with it. If
> > practically no one is using TopDocs then practically no one will miss
> > it. ;-)
> >
> > What I was really looking for (and still hope to see) was an argument
> > discussing the pros and cons of having the two separate classes (and
> > "That's what they did in Java" doesn't count :-). Nevertheless, I've
> > revisited the Hits class in Lucene and I've thought more about the
> > issue at hand and Hits will be coming in the next release of Ferret. I
> > haven't decided exactly how I'm going to do it yet. There will
> > probably still be some differences from the Lucene API. For example,
> > search_each() is here to stay. I'll probably bring it up for
> > discussion again when I come to it. I still have a fair bit of work in
> > cFerret before I get to that stage.
>
> I was curious how this problem was addressed in other languages that
> are not as strongly typed as Java so I took a look at the Plucene
> implementation.
>
> In Plucene there is an abstract base class Searcher which
> IndexSearcher inherits from. Searcher has the method search which
> instantiates a Hits object and passes "self" in as the searcher
> argument before returning the newly created Hits object. The abstract
> method search_top is implemented in IndexSearcher and returns TopDocs.
> The search_top method is used internally by Hits when retrieving
> results.
>
> This follows the Java implementation pretty closely while still having
> some of the advantages of more dynamic languages. A method isn't
> defined for each possible combination of arguments. Rather, methods
> are identified by their functionality as reflected in their name. This
> is in contrast to Java, where a bunch of methods with the same name
> ("search") are told apart only by their argument lists (the return
> type alone cannot distinguish overloads).
>
> I don't know if it will be any help, but it might be worth glancing
> through the Plucene code for another perspective on how to organize
> the various objects and their interactions.
>
> -F

Thanks Finn. I have downloaded PyLucene, Plucene, Lupy etc. and I have
been using all of them to solve various problems. I will certainly
study all of their search APIs.

Cheers,
Dave
