Is there any somewhat standard way to remove or otherwise handle special or syntax characters from a user's search, such as a colon?

I was thinking maybe there was something akin to Ferret::Analysis::FULL_ENGLISH_STOP_WORDS, like Ferret::Analysis::FERRET_SYNTAX_CHARS, but no such luck.

How are other folks dealing with filtering user input?

John
On 2007-01-17, at 10:24, John Bachir wrote:
> Is there any somewhat standard way to remove or otherwise handle
> special or syntax characters from a user's search, such as a colon?
>
> I was thinking maybe there was something akin to
> Ferret::Analysis::FULL_ENGLISH_STOP_WORDS, like
> Ferret::Analysis::FERRET_SYNTAX_CHARS, but no such luck.
>
> How are other folks dealing with filtering user input?

Hey John,

i guess that would be a nice addition to have a const defined.. i'll do it manually ..

    # One regex per character that has special meaning in a Ferret query.
    # The pipe is escaped; a bare /|/ would match the empty string.
    if not defined?(FERRET_SPECIAL_CHARS)
      FERRET_SPECIAL_CHARS = [ /:/, /\(/, /\)/, /\[/, /\]/, /!/, /\+/, /"/, /~/, /\^/,
                               /-/, /\|/, />/, /</, /=/, /\*/, /\?/, /\./, /&/ ]
    end

Ben
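A list like this could be applied to a query string roughly as follows. This is only a sketch: it keeps the characters as plain strings instead of regexes, and the names SPECIAL_CHARS, SPECIAL_CHARS_RE and strip_special are made up for illustration and are not part of Ferret.

```ruby
# Hypothetical helper, not part of Ferret. The characters are the ones
# from the list in this message, kept as plain strings.
SPECIAL_CHARS = [':', '(', ')', '[', ']', '!', '+', '"', '~', '^', '-',
                 '|', '>', '<', '=', '*', '?', '.', '&'].freeze

# Regexp.union escapes each string, so '|' and '.' match literally.
SPECIAL_CHARS_RE = Regexp.union(SPECIAL_CHARS)

# Replace every special character with a space, then tidy the whitespace.
def strip_special(query)
  query.gsub(SPECIAL_CHARS_RE, ' ').squeeze(' ').strip
end
```

Replacing with a space rather than an empty string means that one-two becomes "one two" instead of "onetwo". Whether that is the better choice depends on how the index was tokenized.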
Excerpts from John Bachir's message of Wed Jan 17 13:24:48 -0800 2007:
> Is there any somewhat standard way to remove or otherwise handle
> special or syntax characters from a user's search, such as a colon?

If you want to allow them the full syntax, just use QueryParser#parse (and handle the QueryParseException). If you want to disallow anything special, you could split on whitespace and turn each token into a TermQuery, then throw them all into a BooleanQuery.

Anything in between (e.g. allow phrase queries, but disallow everything else) will be more complicated. But I can't think of many good reasons to disallow the full syntax in the first place.

--
William <wmorgan-ferret at masanjin.net>
On Jan 17, 2007, at 5:26 PM, Benjamin Krause wrote:
> i guess that would be a nice addition to have a const defined..
> i'll do it manually ..
>
> if not defined?(FERRET_SPECIAL_CHARS)
>   FERRET_SPECIAL_CHARS = [ /:/, /\(/, /\)/, /\[/, /\]/, /!/, /\+/, /"/, /~/, /\^/,
>                            /-/, /\|/, />/, /</, /=/, /\*/, /\?/, /\./, /&/ ]
> end

Thanks Benjamin!

On Jan 17, 2007, at 6:46 PM, William Morgan wrote:
> If you want to allow them the full syntax, just use QueryParser#parse
> (and handle the QueryParseException). If you want to disallow anything
> special, you could split on whitespace and turn each token into a
> TermQuery, then throw them all into a BooleanQuery.
>
> Anything in between (e.g. allow phrase queries, but disallow everything
> else) will be more complicated. But I can't think of many good reasons
> to disallow the full syntax in the first place.

William-

I agree. If it was up to me, I would allow the full syntax. Unfortunately, one of the things that the client has asked for is

    one two three

to be transformed to

    *one* *two* *three*

And also to be able to transparently search FOR the special characters themselves. Which means I will actually not be filtering, but escaping the special characters. (I'm assuming Ferret has some facility for searching for special characters, although I admit I haven't looked into it much yet.)

Cheers,
John
Excerpts from John Bachir's message of Wed Jan 17 16:14:47 -0800 2007:
> Unfortunately, one of the things that the client has asked for is
>
>     one two three
>
> to be transformed to
>
>     *one* *two* *three*

Ok. Then I don't think you really need to worry about escaping anything. You can split on whitespace, and wrap each token in a WildcardQuery, prefixed and suffixed with a star. Unless you're supporting phrase queries surrounded by quotes, in which case "split on whitespace" becomes something more complicated. Or unless you want to disallow wildcards from the user, in which case you'll need to escape * and ?.

> And also to be able to transparently search FOR the special characters
> themselves. Which means I will actually not be filtering, but escaping
> the special characters. (I'm assuming Ferret has some facility for
> searching for special characters, although I admit I haven't looked
> into it much yet).

Yep, as long as your tokenizer doesn't discard them, you're fine. Basically if you're avoiding QueryParser and building Query objects directly from the strings, then none of these characters have special semantics (except for * and ? with WildcardQuery).

--
William <wmorgan-ferret at masanjin.net>
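The string-handling half of that suggestion might look something like the sketch below. It only builds the pattern strings; each one would then be handed to a WildcardQuery for whichever field is being searched. The method name wildcard_patterns is invented here, and the use of a backslash to escape * and ? is an assumption that should be checked against the Ferret version in use.

```ruby
# Hypothetical helper: turns "one two three" into
# ["*one*", "*two*", "*three*"].
def wildcard_patterns(input)
  input.split.map do |token|
    # Assumption: a backslash makes * and ? literal in a wildcard pattern.
    escaped = token.gsub(/[*?]/) { |ch| "\\#{ch}" }
    "*#{escaped}*"
  end
end
```

This version does not treat quoted phrases specially, so a phrase in quotes is simply split on whitespace like everything else.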
On Jan 17, 2007, at 8:23 PM, William Morgan wrote:
> You can split on whitespace, and wrap each token in a WildcardQuery,
> prefixed and suffixed with a star. Unless you're supporting phrase
> queries surrounded by quotes, in which case "split on whitespace"
> becomes something more complicated. Or unless you want to disallow
> wildcards from the user, in which case you'll need to escape * and ?.

Yes, I want to do all of the above :-D

Thanks for all the tips William, I'm going to look into this in the future when I make a more refined solution. In the meantime, I am just going to strip out all special/syntax chars from the queries, which I believe will have the behavior I desire.

I want a search for

    one-two

to pull up results with

    one two
    one-two
    onetwo

John
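For the more refined solution, the "something more complicated" than splitting on whitespace could start from a tokenizer like this one. It is a rough sketch in plain Ruby with an invented method name (tokenize); it only separates quoted phrases from single words and leaves the construction of the actual query objects to the caller.

```ruby
# Hypothetical helper: splits user input into [:phrase, text] and
# [:word, text] pairs. A phrase is anything between two double quotes.
def tokenize(input)
  input.scan(/"([^"]*)"|(\S+)/).map do |phrase, word|
    # Exactly one of the two capture groups is set for each match.
    phrase ? [:phrase, phrase] : [:word, word]
  end
end
```

An unbalanced quote is not treated as the start of a phrase; the quote character just stays attached to the word it precedes, so it would still need to be stripped or escaped afterwards.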
On Jan 17, 2007, at 5:26 PM, Benjamin Krause wrote:
> FERRET_SPECIAL_CHARS = [ /:/, /\(/, /\)/, /\[/, /\]/, /!/, /\+/, /"/, /~/, /\^/,
>                          /-/, /\|/, />/, /</, /=/, /\*/, /\?/, /\./, /&/ ]

1. Should $ be in the list?

2. Here is the solution I came up with (nothing mind shattering but I thought some folks on the list might appreciate seeing it):

    query = (query.split('') - (FERRET_SPECIAL_CHARS - CONFIG[:allowed_ferret_syntax])).join()

CONFIG[:allowed_ferret_syntax] contains the characters we are allowing, right now only double quote.

Unless I am missing something, we are now successfully allowing no ferret syntax other than phrases. Whoo hoo!

John
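Array subtraction compares elements by equality, so a one-liner like this only removes anything if both lists hold single-character strings rather than the Regexp objects from Benjamin's version. A self-contained sketch under that assumption follows; SYNTAX_CHARS, ALLOWED_SYNTAX and filter_query are illustrative names, with ALLOWED_SYNTAX standing in for CONFIG[:allowed_ferret_syntax].

```ruby
# Assumption: the special characters are stored as strings, not regexes.
SYNTAX_CHARS = [':', '(', ')', '[', ']', '!', '+', '"', '~', '^', '-',
                '|', '>', '<', '=', '*', '?', '.', '&'].freeze

# Stand-in for CONFIG[:allowed_ferret_syntax]; only the double quote
# is allowed through, so phrase queries keep working.
ALLOWED_SYNTAX = ['"'].freeze

# Drop every character that is special and not explicitly allowed.
def filter_query(query)
  (query.split('') - (SYNTAX_CHARS - ALLOWED_SYNTAX)).join
end
```

Note that characters are removed without leaving a space behind, so one-two comes out as onetwo.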
Excerpts from John Bachir's message of Fri Jan 19 15:57:35 -0800 2007:
> On Jan 17, 2007, at 5:26 PM, Benjamin Krause wrote:
>
> > FERRET_SPECIAL_CHARS = [ /:/, /\(/, /\)/, /\[/, /\]/, /!/, /\+/, /"/, /~/, /\^/,
> >                          /-/, /\|/, />/, /</, /=/, /\*/, /\?/, /\./, /&/ ]
>
> 1. Should $ be in the list?

There's a list at http://ferret.davebalmain.com/api/classes/Ferret/QueryParser.html and $ doesn't seem to be on it. (Neither does & or .)

> 2. Here is the solution I came up with, (nothing mind shattering but
> I thought some folks on the list might appreciate seeing it):
>
>     query = (query.split('') - (FERRET_SPECIAL_CHARS - CONFIG[:allowed_ferret_syntax])).join()

Doesn't this also eliminate escaped versions of the special characters? (Might not be a problem, depending on the specifics of the corpus.)

--
William <wmorgan-ferret at masanjin.net>
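If escaped characters do need to survive, one possible approach is to match a backslash together with the character after it and leave that pair alone, removing only the unescaped special characters. The sketch below assumes the backslash is the escape character and uses invented names (SYNTAX, ESCAPED_OR_SPECIAL, strip_unescaped); it is not something Ferret provides.

```ruby
# Assumption: special characters are plain strings and a backslash
# escapes the character that follows it.
SYNTAX = [':', '(', ')', '[', ']', '!', '+', '"', '~', '^', '-',
          '|', '>', '<', '=', '*', '?', '.', '&'].freeze

# Either a backslash plus one character, or a bare special character.
# The escaped pair is listed first so it wins when both could match.
ESCAPED_OR_SPECIAL = /\\.|#{Regexp.union(SYNTAX)}/

# Keep escaped pairs untouched and delete unescaped special characters.
def strip_unescaped(query)
  query.gsub(ESCAPED_OR_SPECIAL) { |m| m.start_with?('\\') ? m : '' }
end
```

With this, an escaped colon is kept together with its backslash while a bare colon is dropped.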