I've run into an issue that I'm not sure how to address. Basically, I'm building queries with occur_default Search::BooleanClause::Occur::MUST, and using the StandardAnalyzer, which does stop filtering. The stop filtering is working beautifully on the indexing side. The problem is that when the query parser parses a query with a stop word in it, say "the oregon trail", it builds a query that looks something like this:

MUST title: <blank>
MUST title: oregon
MUST title: trail

This unfortunately fails when searching for the previously indexed "The Oregon Trail", because the indexed document doesn't have a blank title term in it.

Is there a good way to deal with this issue besides filtering stop words before handing the query string off to the parser?

Thanks!

Nathaniel

P.S. I'm using the pure Ruby part of Ferret 0.9.0 on Ruby 1.8.4.

--
Posted via http://www.ruby-forum.com/.
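For reference, the workaround mentioned above (filtering stop words before the query string reaches the parser) might look something like the following sketch. The STOP_WORDS list and the strip_stop_words helper are hypothetical names made up for illustration; they are not part of Ferret, and the list is not StandardAnalyzer's actual stop list. The sketch only splits on whitespace, so it would not cope with field prefixes such as title:the or with quoted phrases.

```ruby
# Hypothetical stop list for illustration only; the analyzer's real
# stop list may differ.
STOP_WORDS = %w[a an and of the to].freeze

# Drop stop words from a raw query string before it is handed to a
# query parser. Comparison is case-insensitive; the surviving words
# keep their original case.
def strip_stop_words(query, stop_words = STOP_WORDS)
  query.split.reject { |word| stop_words.include?(word.downcase) }.join(" ")
end

strip_stop_words("the oregon trail")  # => "oregon trail"
```

The cleaned string would then be passed to the parser in place of the original, so no blank MUST clause is ever generated.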
Hi Nathaniel,

This is a bug. I might get around to fixing it but I can't promise anything. I'm focusing entirely on the C extension version of Ferret (which doesn't have this bug).

Cheers,
Dave

PS: Sorry for the slow reply. It's been a tough few weeks here.

On 4/3/06, Nathaniel Talbott <nathaniel at talbott.ws> wrote:
> Is there a good way to deal with this issue besides filtering stop words
> before handing the query string off to the parser?
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
David Balmain wrote:
> This is a bug. I might get around to fixing it but I can't promise
> anything. I'm focusing entirely on the C extension version of Ferret
> (which doesn't have this bug).

Bummer... I've got a hack to avoid the problem for the time being, but it's _really_ ugly :-/

This brings up another issue, though, that I'll go ahead and broach...

I'm kind of sad that you've made the jump to C so soon. Ferret is brimming with potential, but it still feels a lot like, well, Java. The API is still pretty heavy, and when I dig into the underlying code it feels over-designed. I'm guessing a lot of that is due to the straight translation from Java. That's a good first step, but it's not surprising that it would initially result in a library that feels pretty alien.

While I understand the performance reasons for using C, doing so also makes it much harder to refactor and refine the API, and my feeling is that for most problems, the pure-Ruby performance isn't a show-stopper. Putting everything in C also makes it harder for folks such as myself, who don't do much C, to hack on the internals ourselves and push patches back up to you.

I hope this comes off the right way. It's open source, and you're of course free to take the project where you will. I'm also extremely grateful for the project; it's helping me out a lot. I just have doubts about the long-term viability of Ferret within the Ruby community when an API (and underlying code) that I spend a lot of time fighting is getting set in stone so early. I'd hate to see you spend a lot of time on it only to have it be a prototype for a more Ruby-ish library that comes along later. I want Ferret to be the standard by which other indexing tools are measured, in Ruby and elsewhere, and I don't think that raw benchmarks are going to drive that.

> PS: Sorry for the slow reply. It's been a tough few weeks here.

No problem! So what exactly do you do? Are you a student? Freelancer? Employee? Astronaut?

Thanks a ton for the great library,

Nathaniel
On 4/14/06, Nathaniel Talbott <nathaniel at talbott.ws> wrote:
> While I understand the performance reasons for using C, doing so also
> makes it much harder to refactor and refine the API, and my feeling is
> that for most problems, the pure-Ruby performance isn't a show-stopper.
> Putting everything in C also makes it harder for folks such as myself,
> who don't do much C, to hack on the internals ourselves and push patches
> back up to you.

The pure Ruby version is still there and I'd love for someone to take over from me. I completely agree with you on the advantages of having a pure Ruby version. I personally want the performance, which is why I have taken the C route. And there is a huge difference: somewhere around a factor of 100. There are people out there who were still using Java Lucene for indexing because of performance issues, so I wasn't the only one concerned about performance.

As for refactoring the API, I understand it is very difficult for some Ruby programmers to get around the C code, but you don't need to send me a patch. Just let me know what you think needs to be changed.

> I want Ferret to be the standard by which other
> indexing tools are measured, in Ruby and elsewhere, and I don't think
> that raw benchmarks are going to drive that.

I want the same thing too. The other advantage to having the C version is that it won't be too much work to make Ferret available in Perl, Python, Tcl, etc.

> No problem! So what exactly do you do? Are you a student? Freelancer?
> Employee? Astronaut?

I'm currently an athlete. I'm practicing Judo in Japan and working on Ferret whenever I have time.
David Balmain wrote:
> The pure ruby version is still there and I'd love for someone to take
> over from me. I completely agree with you on the advantages of having
> a pure ruby version. I personally want the performance which is why I
> have taken the C route. And there is a huge difference. Somewhere
> around 100 times. There are people out there who were still using Java
> Lucene for indexing because of performance issues so I wasn't the only
> one concerned about the performance.

Understood, and I do look forward to improved performance.

> As for refactoring the API, I
> understand it is very difficult for some Ruby programmers to get
> around the C code but you don't need to send me a patch. Just let me
> know what you think needs to be changed.

My big suggestion would be to cut down on the surface area of the API. It's almost overly flexible, and feels over-designed (probably due to the port from Java). Fewer (documented) classes, simplified options, etc. Basically, it's a bit overwhelming to someone coming at it for the first time, and I don't think that's strictly (or even mostly) a documentation issue. As I use it more I'll try to come up with specific examples.

My small suggestion would be to use symbols (and booleans) for configuration instead of the constants currently being used. For instance:

Ferret::Document::Field::Store::YES      -> true
Ferret::Document::Field::Store::NO       -> false
Ferret::Document::Field::Store::COMPRESS -> :compress

and

Ferret::Document::Field::Index::NO          -> false
Ferret::Document::Field::Index::TOKENIZED   -> :tokenized
Ferret::Document::Field::Index::UNTOKENIZED -> :untokenized

I think this would help Ferret configuration feel much more Rubyish.

>> that raw benchmarks are going to drive that.
> I want the same thing too. The other advantage to having the C version
> is that it won't be too much work to Ferret in Perl, Python, Tcl etc.

But why share? (just kidding ;-)

> I'm currently an athlete. I'm practicing Judo in Japan and working on
> Ferret whenever I have time.

Fascinating (and very cool). Best of luck with it!

Thanks again for Ferret,

Nathaniel Talbott
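To make the symbols-and-booleans suggestion above a little more concrete, here is a rough sketch of how such a mapping layer could sit in front of the existing constants. Everything in it is an assumption for illustration: StandInConstants is a placeholder module so the example runs without Ferret installed (its values are not Ferret's), and resolve_field_options is a hypothetical helper, not an existing Ferret method.

```ruby
# Placeholder for the Store/Index constants named in the post. The
# values are stand-ins so the sketch runs on its own; they are not
# the values Ferret uses.
module StandInConstants
  module Store
    YES      = :store_yes
    NO       = :store_no
    COMPRESS = :store_compress
  end

  module Index
    NO          = :index_no
    TOKENIZED   = :index_tokenized
    UNTOKENIZED = :index_untokenized
  end
end

# Friendly option => underlying constant, following the proposed mapping.
STORE_OPTIONS = {
  true      => StandInConstants::Store::YES,
  false     => StandInConstants::Store::NO,
  :compress => StandInConstants::Store::COMPRESS
}.freeze

INDEX_OPTIONS = {
  false        => StandInConstants::Index::NO,
  :tokenized   => StandInConstants::Index::TOKENIZED,
  :untokenized => StandInConstants::Index::UNTOKENIZED
}.freeze

# Hypothetical helper: translate friendly options into the constants,
# raising ArgumentError on anything unrecognised.
def resolve_field_options(store, index)
  store_const = STORE_OPTIONS.fetch(store) do
    raise ArgumentError, "unknown store option: #{store.inspect}"
  end
  index_const = INDEX_OPTIONS.fetch(index) do
    raise ArgumentError, "unknown index option: #{index.inspect}"
  end
  [store_const, index_const]
end

resolve_field_options(true, :tokenized)
```

A thin layer like this would let callers write true or :compress while the constants keep working underneath, so the friendlier options would not have to break existing code.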