Finn Smith
2006-Jan-02 23:20 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
Recently I've been revisiting some of my search code. With a greater understanding of how Java Lucene implements its search methods, I realized that one level of abstraction is not present in the Ferret classes/methods. Here are the relevant method signatures:

Ferret's search methods:

in Ferret::Index::Index:
  search(query, options = {}) -> returns a TopDocs
  search_each(query, options = {}) {|doc, score| ...} -> yields to context w/ doc and score for each hit

in Ferret::Search::IndexSearcher:
  search(query, options = {}) -> returns a TopDocs
  search_each(query, filter = nil) {|doc, score| ...} -> yields to context w/ doc and score for each hit

Lucene's search methods:

in the interface Searchable:
  public void search(Query query, Filter filter, HitCollector results)
  public TopDocs search(Query query, Filter filter, int n)
  public TopFieldDocs search(Query query, Filter filter, int n, Sort sort)

in org.apache.lucene.search.Searcher (which implements Searchable):
  public final Hits search(Query query)
  public Hits search(Query query, Filter filter)
  public Hits search(Query query, Sort sort)
  public Hits search(Query query, Filter filter, Sort sort)

I was wondering if there were plans to implement the Hits class in Ferret. (Or if someone were to write a patch implementing it, would David integrate it into the source?) It seems like it is a useful abstraction since TopDocs does not allow you to access its hits by index, only by the .each() method call.

Some questions:

* Will changing these methods break people's existing code?
* Where is the proper place to put these methods? Move the methods that return TopDocs to a module, which is more or less the same as a Java interface, and implement the methods that return Hits directly in the class? What is a good way to do this that feels Rubyish and takes advantage of its strengths and idioms?
* The options to limit the search (first_doc and num_doc) in Search::IndexSearcher and the code that implements them should probably be moved out of Search::IndexSearcher into Index::Index.
* Are there lower level issues I am not aware of that would make any of this a bad idea?

Am I missing something here? Are there reasons not to have Ferret's implementation of these methods and classes follow Java Lucene's as closely as possible? I'd appreciate hearing your thoughts.

-F
David Balmain
2006-Jan-12 01:43 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On 1/3/06, Finn Smith <fcsmith at gmail.com> wrote:
> I was wondering if there were plans to implement the Hits class in
> Ferret. (Or if someone were to write a patch implementing them, would
> David integrate it into the source?)

I'd be happy to integrate it if someone sends me a patch. Having said that...

> It seems like it is a useful
> abstraction since TopDocs does not allow you to access its hits by
> index, only by the .each() method call.

Actually you can access the hits by index like this:

    hit_three = topdocs.score_docs[2]

The reason I didn't bother implementing the Hits class is that I can't see that it adds anything useful. Really it all just seems a matter of notation: what is easiest for people to understand and remember. Adding the Hits class might just make everything a little more complicated. Please refer to Martin Fowler's discussion of the humane interface:

http://www.martinfowler.com/bliki/HumaneInterface.html

While Java likes to have multiple different implementations of simple interfaces and a separate class for each data structure, in Ruby you can use an array for many different jobs: stack, list, queue, etc. I feel it would be better to do the same thing with TopDocs. Rather than adding the Hits class, I feel it would be better to add the desired functionality to TopDocs. I'm happy to listen to other points of view.

> Some questions:
> * Will changing these methods break people's existing code?

Perhaps. It depends on what we change. Ferret is still beta, though, so I think it's open to non-backwards-compatible changes if necessary, although we should avoid this if possible.

> * Where is the proper place to put these methods? Move the methods
> that return TopDocs to a module, which is more or less the same as a
> Java interface, and implement the methods that return Hits directly in
> the class? What is a good way to do this that feels Rubyish and takes
> advantage of its strengths and idioms?

I think I answered this already. I'd like to keep TopDocs as a class and add the desired functionality to it.

> * The options to limit the search (first_doc and num_doc) in
> Search::IndexSearcher and the code that implements them should
> probably be moved out of Search::IndexSearcher into Index::Index

I think this needs to stay in IndexSearcher as it limits the amount of memory used by a search. Even the Java version allows you to specify nDocs.

Hope this helps. Feedback is welcome.

Cheers,
Dave
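A minimal sketch of the "add the functionality to TopDocs" idea from this message, for readers who want to see what it could look like. The TopDocs and ScoreDoc classes below are hypothetical stand-ins written for illustration, not Ferret's actual classes; only the score_docs accessor name comes from the thread.

```ruby
# Hypothetical stand-ins for illustration; not Ferret's real classes.
ScoreDoc = Struct.new(:doc, :score)

class TopDocs
  include Enumerable

  attr_reader :total_hits, :score_docs

  def initialize(total_hits, score_docs)
    @total_hits = total_hits
    @score_docs = score_docs
  end

  # Indexed access, delegating to the underlying array of ScoreDocs.
  def [](index)
    @score_docs[index]
  end

  # Yields doc id and score for each hit; Enumerable builds on this.
  def each
    return enum_for(:each) unless block_given?
    @score_docs.each { |sd| yield sd.doc, sd.score }
  end

  # Number of hits actually held (may be less than total_hits).
  def size
    @score_docs.size
  end
end

top_docs = TopDocs.new(42, [ScoreDoc.new(7, 0.9), ScoreDoc.new(3, 0.5), ScoreDoc.new(11, 0.2)])
hit_three = top_docs[2]                            # same hit as top_docs.score_docs[2]
doc_ids   = top_docs.map { |doc, _score| doc }     # Enumerable methods come for free
```

With Enumerable mixed in, a single class covers both indexed access and iteration, which is roughly the array-like flexibility the message argues for.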
Erik Hatcher
2006-Jan-12 13:24 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On Jan 11, 2006, at 8:43 PM, David Balmain wrote:
>> It seems like it is a useful
>> abstraction since TopDocs does not allow you to access its hits by
>> index, only by the .each() method call.
>
> Actually you can access the hits by index like this;
>
> hit_three = topdocs.score_docs[2]
>
> The reason I didn't bother implementing the hits class is that I can't
> see that it adds anything useful. Really it all just seems a matter of
> notation.

It's more than just notation. Hits performs some caching of Document objects and provides a means to iterate through the hits without having to manually re-search, as it does that under the covers. Sure, it's perhaps a mere convenience, but a handy abstraction nonetheless.

> What is easiest for people to understand and remember.
> Adding the hits class might just make everything a little more
> complicated. Please refer to Martin Fowler's discussion on the Humane
> interface;
>
> http://www.martinfowler.com/bliki/HumaneInterface.html
>
> While Java likes to have multiple different implementations of simple
> interfaces and a separate class for each data structure, in Ruby you
> can use an array for many different jobs; stack, list queue etc. I
> feel it would be better to do the same thing with TopDocs. Rather than
> adding the Hits class I feel it would be better to add the desired
> functionality to TopDocs. I'm happy to listen to other points of view.

I think not having Hits makes it more complicated for those coming from Java Lucene at least, but it is also a conceptual abstraction. One thinks of getting "hits" back from a search, not "top docs". So in that sense, the semantics of having Hits is powerful. Part of Fowler's argument is to have redundancy, aliases, and conveniences for the humane interface, and I think Hits would qualify in that regard.

Erik
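The two behaviours attributed to Hits above (caching loaded documents, and re-running the search behind the scenes when a deeper hit is requested) can be sketched as follows. This is an illustration under stated assumptions, not Lucene's or Ferret's code: ToySearcher, search_top, load_doc, and the doubling fetch policy are all hypothetical names and choices made for the example.

```ruby
# Hypothetical stand-in searcher: ranks an in-memory list instead of an index.
class ToySearcher
  attr_reader :search_count, :load_count

  def initialize(docs)
    @docs = docs            # array of [doc_id, score, stored_text]
    @search_count = 0
    @load_count = 0
  end

  # Returns [total_hits, top n [doc_id, score] pairs], best score first.
  def search_top(_query, n)
    @search_count += 1
    ranked = @docs.sort_by { |_id, score, _text| -score }
    [ranked.size, ranked.first(n).map { |id, score, _text| [id, score] }]
  end

  # Simulates the (relatively expensive) loading of a stored document.
  def load_doc(doc_id)
    @load_count += 1
    @docs.find { |id, _score, _text| id == doc_id }[2]
  end
end

class Hits
  attr_reader :length       # total number of matching documents

  def initialize(searcher, query, initial_fetch = 2)
    @searcher = searcher
    @query = query
    @doc_cache = {}
    fetch(initial_fetch)
  end

  # Stored document for the n-th hit, loaded once and then cached.
  def doc(n)
    id = hit(n)[0]
    @doc_cache[id] ||= @searcher.load_doc(id)
  end

  def score(n)
    hit(n)[1]
  end

  private

  # Re-searches with a larger window when n lies beyond what was fetched.
  def hit(n)
    raise IndexError, "no hit #{n}" if n < 0 || n >= @length
    fetch((n + 1) * 2) if n >= @hits.size
    @hits[n]
  end

  def fetch(n)
    @length, @hits = @searcher.search_top(@query, n)
  end
end
```

The caller only ever asks for hit n; whether that triggers a new search or a document load is hidden inside Hits, which is the convenience being described.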
David Balmain
2006-Jan-12 23:52 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On 1/12/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> It's more than just notation. Hits performs some caching of Document
> objects as well as providing a means to iterate through the hits
> without having to manually re-search as it does it under the covers.
> Sure, it's perhaps a mere convenience, but a handy abstraction
> nonetheless.
>
> I think not having Hits makes it more complicated for those coming
> from Java Lucene at least, but it is also a conceptual abstraction.
> One thinks of getting "hits" back from a search, not "top docs". So
> in that sense, the semantics of having Hits is powerful. Part of
> Fowler's argument is to have redundancy, aliases, and conveniences
> for the humane interface, and I think Hits would qualify in that regard.
>
> Erik

I'm not arguing that TopDocs is a better name than Hits. Rather, I'm arguing that having search methods return two different classes is unnecessary and not "The Ruby Way". My goal is to make Ferret easy for Ruby programmers to use, not Java programmers. So what I'd like to hear is an argument as to why having two separate classes - TopDocs and Hits - is superior to combining the functionality of both into one class. My personal feeling is that this is where the difference lies between Java and Ruby, but I could easily be swayed.

Dave
Erik Hatcher
2006-Jan-13 02:12 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On Jan 12, 2006, at 6:52 PM, David Balmain wrote:
> I'm not arguing that TopDocs is a better name than Hits. Rather that
> having search methods return two different classes is unnecessary and
> not "The Ruby Way". My goal is to make Ferret easy for Ruby
> programmers to use, not Java programmers. So what I'd like to hear is
> an argument as to why having two separate classes - TopDocs and Hits -
> is superior to combining the functionality of both into one class. My
> personal feeling is that this is where the difference lies between
> Java and Ruby but I could easily be swayed.

That seems an injustice to Java. Surely Hits and TopDocs could have had their functionality blended together into a single class. The separation was intentional, not some constraint that Java the language imposed.

I'm being a bit defensive of the Lucene API here and don't want to see Ferret diverge too much from it for no real benefit. What's one more class in Ruby in this situation to maintain consistency across languages for the finest search engine available? Seems a small sacrifice of Ruby "purity" to make for the noble cause :) Just my $0.02.

Practically no one in Java Lucene uses TopDocs - you'll notice that all of those search methods are marked as "Expert". Hits is the most common way to access search results, allowing them to automatically be paged through and have a bit of caching along with it.

Erik
Vamsee Kanakala
2006-Jan-13 05:18 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
David Balmain wrote:
> My personal feeling is that this is where the difference lies between
> Java and Ruby but I could easily be swayed.

Hi Dave & Erik,

I don't intend to hurt anybody's feelings, but let me speak up on something: I did some 2 years of Java programming, and was never really comfortable with its verbosity, though I liked it for other things. I felt Ferret's API is already a bit un-Rubyish, if you know what I mean. It almost feels like I'm back to using Java libraries again.

Learning Rails took a complete break from the way I was doing J2EE programming. But I totally love it for its refreshing simplicity, and thereby came to love Ruby too. So what I mean to say is, don't be afraid to break compatibility if it makes the programmer's life easier. I agree with Dave's sentiment that Ferret should be better than Lucene. I'm guilty of not understanding what it really takes, but I think we should put "making a developer's job easy" before anything else. If it means breaking compatibility with Lucene, I don't really mind.

Of course, I speak from a very selfish point of view - I don't have any running Lucene apps to run or port. Just my 2 cents.

Regards,
Vamsee.
David Balmain
2006-Jan-13 10:13 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On 1/13/06, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> It seems an injustice to Java in this regard. Surely Hits and
> TopDocs could have their functionality blended together into single
> class. There was an intentional separation, not some constraint that
> Java the language imposed.

I was never implying there was some constraint imposed by the language itself. I'm talking about the way things are done in Ruby versus the way things are done in Java. There was an intentional separation of ArrayList, Vector, Stack, etc. in Java too, but it doesn't mean we have to do the same thing in Ruby. I'm not saying one way is better than the other. But Ferret is a Ruby library, so I'd like to do it the Ruby way where possible.

> I'm being a bit defensive of the Lucene API here and don't want to
> see Ferret diverge too much from it for no real benefit. What's one
> more class in Ruby in this situation to maintain consistency across
> languages for the finest search engine available? Seems a small
> sacrifice of Ruby "purity" to make for the noble cause :) Just my
> $0.02.
>
> Practically no one in Java Lucene uses TopDocs - you'll notice that
> all of those search methods are marked as "Expert". Hits is the most
> common way to access search results, allowing them to automatically
> be paged through and have a bit of caching along with it.

If this is the case then maybe I should just return a Hits object, roll the TopDocs functionality into it and be done with it. If practically no one is using TopDocs then practically no one will miss it. ;-)

What I was really looking for (and still hope to see) was an argument discussing the pros and cons of having the two separate classes (and "That's what they did in Java" doesn't count :-). Nevertheless, I've revisited the Hits class in Lucene and thought more about the issue at hand, and Hits will be coming in the next release of Ferret. I haven't decided exactly how I'm going to do it yet. There will probably still be some differences from the Lucene API. For example, search_each() is here to stay. I'll probably bring it up for discussion again when I come to it. I still have a fair bit of work in cFerret before I get to that stage.

Cheers,
Dave
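For readers unfamiliar with the block-style API mentioned above, here is a rough sketch of how a search_each method can sit on top of a search method that returns ranked results. ToyIndexSearcher is a hypothetical stand-in, and the option keys :first_doc and :num_docs are assumptions modelled on the first_doc and num_doc options named earlier in the thread; Ferret's real implementation may differ.

```ruby
# Hypothetical stand-in for illustration; not Ferret's IndexSearcher.
class ToyIndexSearcher
  def initialize(scores)
    @scores = scores        # { doc_id => score }
  end

  # Returns up to :num_docs [doc_id, score] pairs, starting at :first_doc,
  # ordered by descending score. Limiting here bounds the memory a search uses.
  def search(_query, options = {})
    first_doc = options.fetch(:first_doc, 0)
    num_docs  = options.fetch(:num_docs, 10)
    ranked = @scores.sort_by { |_doc, score| -score }
    ranked[first_doc, num_docs] || []
  end

  # Block-style convenience: yields doc id and score for each hit.
  def search_each(query, options = {})
    return enum_for(:search_each, query, options) unless block_given?
    search(query, options).each { |doc, score| yield doc, score }
  end
end

searcher = ToyIndexSearcher.new({ 10 => 0.2, 11 => 0.8, 12 => 0.5 })
searcher.search_each("any query", num_docs: 2) do |doc, score|
  puts "doc #{doc} scored #{score}"
end
```

Because search_each is only a thin layer over search, it can coexist with whatever result class search ends up returning.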
Erik Hatcher
2006-Jan-13 10:28 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
Vamsee,

No feelings hurt here, and I completely understand your sentiment. There are folks using Java Lucene who have expressed similar sentiments about its API, which is more C-like than Java-like in many ways.

But let's focus on the heart of this thing: a high-powered full-text search engine. The goal is speed and efficient use of resources. An elegant API is desirable but of a secondary nature. Hits is a pretty elegant way to navigate search results. I hope this simple class can find its way into Ferret and that the IndexSearcher API be made reasonably similar to Java Lucene's.

Dave has done a great job with the Index class in Ferret, which has features that Java Lucene does not: being able to flush and see changes right away (which is harder to manage in Java Lucene), and having keys to documents and managing an "update". In Java Lucene there is no concept of an update - there is only a remove and an add.

I'm all for a slick Ruby API, but I would very much like to see it built on top of a Lucene-compatible index format for interoperability. That interoperability is important to me and will very likely be important to others. Consider Nutch, for example. It is an incredibly scalable web crawler and indexer. With index compatibility you could use Nutch to crawl the web and use Ferret for searching. Further, the HTML parsers I've used in Ruby are lousy compared to what is available in Java. Indexing in Java makes a lot of sense in many circumstances, but fronting an application with Rails and Ferret also makes a lot of sense.

Dave is, of course, the creator and driver of Ferret. I encourage him to consider keeping index file compatibility, to keep the basic API intact for the classes that are direct ports of Lucene, and to innovate on top of them rather than change them. He certainly may choose to do otherwise, but doing so would likely drive me (and perhaps others?) to other solutions.

Erik
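The point about updates (a keyed "update" being a remove followed by an add) can be illustrated with a toy in-memory index. ToyIndex and its method names are hypothetical and written only for this sketch; they are not taken from Ferret's Index class or from Lucene.

```ruby
# Hypothetical in-memory stand-in for an index with a key field.
class ToyIndex
  def initialize(key)
    @key = key
    @docs = []              # append-only store; nil marks a deleted slot
  end

  # Appends a document and returns its document number.
  def add(doc)
    @docs << doc
    @docs.size - 1
  end

  # Marks every document whose key field equals key_value as deleted.
  def delete(key_value)
    @docs.each_index do |i|
      @docs[i] = nil if @docs[i] && @docs[i][@key] == key_value
    end
  end

  # "Update" is just: remove anything with the same key, then add the new doc.
  def update(doc)
    delete(doc[@key])
    add(doc)
  end

  def live_docs
    @docs.compact
  end
end

index = ToyIndex.new(:id)
index.add({ id: "a", title: "old title" })
index.update({ id: "a", title: "new title" })
```

Note that the updated document gets a new document number; the old slot is only marked deleted, which mirrors the remove-then-add behaviour described above.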
Erik Hatcher
2006-Jan-13 10:36 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On Jan 13, 2006, at 5:13 AM, David Balmain wrote:
>> I'm being a bit defensive of the Lucene API here and don't want to
>> see Ferret diverge too much from it for no real benefit. What's one
>> more class in Ruby in this situation to maintain consistency across
>> languages for the finest search engine available? Seems a small
>> sacrifice of Ruby "purity" to make for the noble cause :) Just my
>> $0.02.
>>
>> Practically no one in Java Lucene uses TopDocs - you'll notice that
>> all of those search methods are marked as "Expert". Hits is the most
>> common way to access search results, allowing them to automatically
>> be paged through and have a bit of caching along with it.
>
> If this is the case then maybe I should just return a Hits object,
> roll the TopDocs functionality into it and be done with it. If
> practically no one is using TopDocs then practically no one will miss
> it. ;-)

My argument is mainly about why there is an issue with one more internal class. You've ported the underlying Lucene class structure and API quite faithfully. Why is this one more item a big deal?

TopDocs is useful, don't get me wrong. It is just not used by the general Lucene-consuming public, but many expert-level folks do use it. I hope that Hits and TopDocs can stay. Whether it makes sense for them to be separate classes or not is really immaterial, but at least keep the most public and useful part of Lucene's API, IndexSearcher, as compatible as possible.

> What I was really looking for (and still hope to see) was an argument
> discussing the pros and cons of having the two separate classes (and
> "That's what they did in Java" doesn't count :-).

Keeping a consistent IndexSearcher API between Java Lucene and Ferret is definitely an argument that counts for me personally. Innovating "Ruby Way" features alongside that is also greatly desirable for sure!

> Nevertheless, I've
> revisited the Hits class in Lucene and I've thought more about the
> issue at hand and Hits will be coming in the next release of Ferret.

Yay!!!

> I
> haven't decided exactly how I'm going to do it yet. There will
> probably still be some differences from the Lucene API. For example,
> search_each() is here to stay. I'll probably bring it up for
> discussion again when I come to it. I still have a fair bit of work in
> cFerret before I get to that stage.

Adding conveniences with block iteration and such makes me extremely happy! PyLucene did the same thing.

Erik

p.s. whispering *GCJ... SWIG....* :)
Finn Smith
2006-Jan-13 19:05 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On 1/11/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> > * The options to limit the search (first_doc and num_doc) in
> > Search::IndexSearcher and the code that implements them should
> > probably be moved out of Search::IndexSearcher into Index::Index
>
> I think this needs to stay in IndexSearcher as it limits the amount of
> memory used by a search. Even the java version allows you to specify
> nDocs.

Reviewing the code again, and taking another look at the Java code, I think you're right about this. If there is a more general search method exposed that returns Hits, I'll be happy.

-F
Finn Smith
2006-Jan-13 19:42 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On 1/13/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> What I was really looking for (and still hope to see) was an argument
> discussing the pros and cons of having the two separate classes (and
> "That's what they did in Java" doesn't count :-). Nevertheless, I've
> revisited the Hits class in Lucene and I've thought more about the
> issue at hand and Hits will be coming in the next release of Ferret. I
> haven't decided exactly how I'm going to do it yet. There will
> probably still be some differences from the Lucene API. For example,
> search_each() is here to stay. I'll probably bring it up for
> discussion again when I come to it. I still have a fair bit of work in
> cFerret before I get to that stage.

I was curious how this problem was addressed in other languages that are not as strongly typed as Java, so I took a look at the Plucene implementation.

In Plucene there is an abstract base class, Searcher, which IndexSearcher inherits from. Searcher has the method search, which instantiates a Hits object and passes "self" in as the searcher argument before returning the newly created Hits object. The abstract method search_top is implemented in IndexSearcher and returns TopDocs. The search_top method is used internally by Hits when retrieving results.

This follows the Java implementation pretty closely while still having some of the advantages of more dynamic languages. A method isn't defined for each possible combination of arguments. Rather, methods are identified by their functionality as reflected in their name. This is in contrast to Java, where a bunch of overloaded methods with the same name ("search") are told apart by their signatures, that is, by the number and types of their arguments.

I don't know if it will be any help, but it might be worth glancing through the Plucene code for another perspective on how to organize the various objects and their interactions.

-F
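The organization described above (a base Searcher whose search method returns a Hits bound to the searcher, with a subclass supplying search_top) might translate into Ruby roughly as follows. This is a sketch written from the description in this message, not from the Plucene source; the PluceneStyle namespace, the options hash, and the :filter and :num_docs keys are assumptions made for the example.

```ruby
module PluceneStyle
  # Lazy view over search results; asks the searcher for top docs on demand.
  class Hits
    include Enumerable

    def initialize(searcher, query, options = {})
      @searcher = searcher
      @query = query
      @options = options
    end

    # Yields [doc_id, score] pairs obtained through the expert-level search_top.
    def each(&block)
      return enum_for(:each) unless block
      top = @searcher.search_top(@query, @options.fetch(:num_docs, 10), @options)
      top.each(&block)
    end
  end

  class Searcher
    # Convenience entry point: returns a Hits bound to this searcher.
    # One method with an options hash replaces Java's family of overloads.
    def search(query, options = {})
      Hits.new(self, query, options)
    end

    # Expert entry point; concrete searchers must implement it.
    def search_top(_query, _num_docs, _options = {})
      raise NotImplementedError, "#{self.class} must implement search_top"
    end
  end

  # Toy concrete searcher over an in-memory { doc_id => score } table.
  class IndexSearcher < Searcher
    def initialize(scores)
      @scores = scores
    end

    def search_top(_query, num_docs, options = {})
      candidates = @scores
      if (filter = options[:filter])
        candidates = candidates.select { |doc, _score| filter.call(doc) }
      end
      candidates.sort_by { |_doc, score| -score }.first(num_docs)
    end
  end
end

searcher = PluceneStyle::IndexSearcher.new({ 1 => 0.4, 2 => 0.9, 3 => 0.6 })
searcher.search("any query", filter: ->(doc) { doc.odd? }).each do |doc, score|
  puts "doc #{doc} scored #{score}"
end
```

Here the two names, search and search_top, carry the distinction that Java expresses through overloaded signatures.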
David Balmain
2006-Jan-14 10:20 UTC
[Ferret-talk] aligning Ferret's IndexSearcher.search API with Lucene's
On 1/14/06, Finn Smith <fcsmith at gmail.com> wrote:
> I was curious how this problem was addressed in other languages that
> are not as strongly typed as Java so I took a look at the Plucene
> implementation.
>
> In Plucene there is an abstract base class Searcher which
> IndexSearcher inherits from. Searcher has the method search which
> instantiates a Hits object and passes "self" in as the searcher
> argument before returning the newly created Hits object. The abstract
> method search_top is implemented in IndexSearcher and returns TopDocs.
> The search_top method is used internally by Hits when retrieving
> results.
>
> I don't know if it will be any help, but it might be worth glancing
> through the Plucene code for another perspective on how to organize
> the various objects and their interactions.
>
> -F

Thanks Finn. I have downloaded PyLucene, Plucene, Lupy, etc. and I have been using all of them to solve various problems. I will certainly study all of their search APIs.

Cheers,
Dave