I'm trying to use Ferret to do fuzzy searches. If I use fuzzy search for just one word, it works fine:

    index.search('name:gogle~0.4')

However, if I try to use a phrase, it doesn't work:

    index.search('name:"gogle search engine"~0.4')

On the other hand, I could do:

    index.search('name:gogle~0.4 AND name:search~0.4 AND name:engine~0.4')

This isn't exactly the same as fuzzy search on a phrase, though ... is there a better way?

Thanks!
Jen
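The AND-of-fuzzy-terms workaround in the message above can be generated from a phrase instead of typed by hand. This is a minimal sketch in plain Ruby; `fuzzy_and_query` is a hypothetical helper, not part of Ferret, and it only builds the query string that would then be passed to `index.search`.

```ruby
# Hypothetical helper (not part of Ferret): turns a phrase into the
# AND-of-fuzzy-terms query string used in the message above.
def fuzzy_and_query(field, phrase, min_similarity = 0.4)
  phrase.split.map { |word| "#{field}:#{word}~#{min_similarity}" }.join(' AND ')
end

query = fuzzy_and_query('name', 'gogle search engine')
puts query
```

Like the hand-written version, this ignores word order and distance between the words, so it is still not a fuzzy phrase search.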
On Dec 14, 2005, at 6:52 PM, jennyw wrote:
> I'm trying to use Ferret to do fuzzy searches. If I use fuzzy search for
> just one word, it works fine:
>
> index.search('name:gogle~0.4')
>
> However, if I try to use a phrase, it doesn't work:
>
> index.search('name:"gogle search engine"~0.4')

Lucene doesn't support this type of query. I'm still not deep enough into Ferret to know the query parser behavior, but in Java Lucene a "phrase with quotes"~10 makes a sloppy phrase query. The words still have to be spelled correctly, but they can be separated in the original text by other words (think of it as proximity: google must be near search and near engine).

> On the other hand, I could do:
>
> index.search('name:gogle~0.4 AND name:search~0.4 AND name:engine~0.4')
>
> This isn't exactly the same as fuzzy search on a phrase, though ... is
> there a better way?

There really isn't an easy better way unless you get into using the SpanQuery infrastructure and create (or use, if Ferret implements it) the SpanRegexQuery, and nest it within a SpanNearQuery. Query parser support for such a beast would be tricky at best, and Java Lucene currently does not implement this sort of thing. Though for a consulting client I created (Span)RegexQuery for just these sorts of queries. So it is possible, but not without going deeper into creating a query using the API or making your own query parser that could do this.

Dave can fill us in on any details I've missed and probably demonstrate just how easy it is and I don't even know it :)

Erik
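To make the distinction above concrete: on a single term, the number after `~` is a minimum similarity, while on a quoted phrase it is a slop (a distance in positions). The sketch below illustrates only the term case in plain Ruby. It assumes similarity is computed as 1 minus the edit distance divided by the length of the shorter term, which is roughly how classic Lucene scores fuzzy matches; the exact formula in Ferret may differ, and `edit_distance` and `similarity` are names made up for this example.

```ruby
# Plain Levenshtein edit distance between two strings.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed similarity measure: 1 - distance / length of the shorter term.
def similarity(query_term, indexed_term)
  shorter = [query_term.length, indexed_term.length].min
  return 0.0 if shorter.zero?
  1.0 - edit_distance(query_term, indexed_term).to_f / shorter
end

# Compare the misspelled query term against a few indexed terms.
%w[google goggles engine].each do |term|
  puts format('%-8s %.2f', term, similarity('gogle', term))
end
```

Under this assumption, a term matches `gogle~0.4` when its similarity is at least 0.4, so "google" would match and "engine" would not.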
Erik Hatcher wrote:
> There really isn't an easy better way unless you get into using the
> SpanQuery infrastructure and create (or use if Ferret implements it)
> the SpanRegexQuery, and nest it within a SpanNearQuery. Query parser
> support for such a beast would be tricky at best and Java Lucene
> currently does not implement this sort of thing. Though for a
> consulting client I created (Span)RegexQuery for just these sorts of
> queries. So it is possible, but not without going deeper into
> creating a query using the API or making your own query parser that
> could do this.

I was afraid the answer would be something like that! I think that'd be a bit beyond the scope of my current project (I haven't yet gotten deep enough into Ferret to know what a SpanNearQuery is).

> Dave can fill us in on any details I've missed and probably
> demonstrate just how easy it is and I don't even know it :)

That'd be great if so. Here's to optimism! ;-)

Jen
On Dec 14, 2005, at 10:13 PM, jennyw wrote:
> I was afraid the answer would be something like that! I think that'd be
> a bit beyond the scope of my current project (I haven't yet gotten deep
> into Ferret enough to know what a SpanNearQuery is).

You can get some nuggets of info here:

http://www.lucenebook.com/search?query=SpanQuery

Ferret *is* Lucene, so by knowing what Java Lucene can do, you've got a good handle on what Ferret can do. Dave has made some handy conveniences on top of the API, so for general uses you can get by without knowing the guts, but it always helps to know the capabilities under the covers when a more sophisticated requirement is encountered.

In brief, SpanQuery is the base of a family of subclasses (SpanNearQuery, SpanFirstQuery, SpanOrQuery, and now SpanRegexQuery). These allow for matching on positional ranges of the indexed terms, like a window of words. A PhraseQuery (what you created using "words in quotes") matches on words together as well, but it doesn't have the capability to do any refinement beyond just a proximity setting. By nesting SpanQuery instances, matches could be made on a query like this, for example:

    "some phrase" within 10 positions of "another phrase" but "excluded phrase" cannot be in the middle

I was actually a bit wrong about the SpanRegexQuery being useful in your example, unless the query was something like "go*gle" where it was a regular expression that could match various terms in the index. What you'd need is a SpanFuzzyQuery, which would not be hard to create, but is certainly an advanced thing to do with Lucene.

>> Dave can fill us in on any details I've missed and probably
>> demonstrate just how easy it is and I don't even know it :)
>
> That'd be great if so. Here's to optimism! ;-)

Dave's pretty darn good, but I'll be amazed if he pulls this one out of his hat without coding up a SpanFuzzyQuery and adding support in his query parser for it.

Erik
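As a rough illustration of what a fuzzy term query nested inside a span-near query would have to compute, here is a sketch in plain Ruby. It is not Ferret or Lucene code: it works on an array of tokens (array index standing in for term position), reuses the assumed similarity measure of 1 minus edit distance over the shorter term length, and requires the query words to appear in order. `fuzzy_match?` and `fuzzy_near?` are hypothetical names, and real span queries have richer semantics than this.

```ruby
# Plain Levenshtein edit distance between two strings.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed fuzzy test: 1 - distance / shorter length must reach the threshold.
def fuzzy_match?(query_term, token, min_similarity)
  shorter = [query_term.length, token.length].min
  return false if shorter.zero?
  1.0 - edit_distance(query_term, token).to_f / shorter >= min_similarity
end

# True if every word fuzzily matches a token, in order, with at most
# `slop` non-matching positions in between the matched tokens.
def fuzzy_near?(tokens, words, min_similarity: 0.4, slop: 0)
  return false if words.empty?
  tokens.each_index do |start|
    next unless fuzzy_match?(words.first, tokens[start], min_similarity)
    pos = start
    matched_all = words.drop(1).all? do |word|
      # Earliest later position whose token fuzzily matches this word.
      nxt = ((pos + 1)...tokens.length).find do |i|
        fuzzy_match?(word, tokens[i], min_similarity)
      end
      pos = nxt if nxt
      !nxt.nil?
    end
    next unless matched_all
    gaps = (pos - start) - (words.length - 1)
    return true if gaps <= slop
  end
  false
end

adjacent = %w[the google search engine is fast]
spread   = %w[google runs a fast search engine]
query    = %w[gogle search engine]

puts fuzzy_near?(adjacent, query)          # words adjacent, misspelling tolerated
puts fuzzy_near?(spread, query)            # words too far apart for slop 0
puts fuzzy_near?(spread, query, slop: 3)   # allowed once slop covers the gap
```

The design choice here is to take the earliest match for each following word from a fixed starting position, which gives the tightest in-order window for that start; a real implementation inside the index would walk term positions rather than scan tokens.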
On 12/15/05, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> I was actually a bit wrong about the SpanRegexQuery being useful in
> your example, unless the query was something like "go*gle" where it
> was a regular expression that could match various terms in the
> index. What you'd need is a SpanFuzzyQuery, which would not be hard
> to create, but is certainly an advanced thing to do with Lucene.
>
> Dave's pretty darn good, but I'll be amazed if he pulls this one out
> of his hat without coding up a SpanFuzzyQuery and adding support in
> his query parser for it.

Thanks for the compliment. :-) SpanFuzzyQuery definitely sounds like the solution to this problem. I'll be implementing span queries in C in the next couple of days, so I'll keep it in mind.

Dave