Hey guys,
This might help. The following code takes the input:
"I used to live in NSW but now I live in the A.C.T."
and produces:
token["used":2:6:2]
token["live":10:14:2]
token["nsw":18:21:2]
token["NSW":18:21:0]
token["now":26:29:2]
token["live":32:36:2]
token["act":44:49:3]
token["ACT":44:49:0]
As you can see, the Australian state abbreviations have been emitted
twice, the second time in upper case. The first two numbers are the
start and end offsets, i.e. how many bytes from the start of the text.
The third number is the position increment. So "live" occurs two words
after "used" (the StandardAnalyzer dropped the stop word in between),
while "NSW" has an increment of 0, so it sits in the same position as
"nsw". Now you just have to make sure your query parser uses the same
analyzer; there's a sketch of that after the code.
Hope this helps.
Dave
require 'ferret'

module Ferret::Analysis

  # Base class for filters that wrap another TokenStream.
  class TokenFilter < TokenStream
    protected

    # Construct a token stream filtering the given input.
    def initialize(input)
      @input = input
    end
  end

  # Emits an extra upper-case token (position increment 0) whenever an
  # Australian state abbreviation passes through the stream.
  class StateFilter < TokenFilter
    # Abbreviation => full name (the full names aren't used below).
    STATES = {
      "nsw" => "new south wales",
      "vic" => "victoria",
      "qld" => "queensland",
      "tas" => "tasmania",
      "sa"  => "south australia",
      "wa"  => "western australia",
      "nt"  => "northern territory",
      "act" => "australian capital territory"
    }

    def next
      # Return the buffered upper-case token left over from the last call.
      if @state
        t = @state
        @state = nil
        return t
      end
      t = @input.next
      return nil if t.nil?
      if STATES[t.text]
        # Buffer an upper-case copy at the same position (increment 0).
        @state = Token.new(t.text.upcase, t.start_offset, t.end_offset, 0)
      end
      return t
    end
  end

  # StandardAnalyzer's token stream with StateFilter bolted on the end.
  class StateAnalyzer < StandardAnalyzer
    def token_stream(field, text)
      StateFilter.new(super)
    end
  end

  analyzer = StateAnalyzer.new
  ts = analyzer.token_stream(nil,
    "I used to live in NSW but now I live in the A.C.T.")

  while t = ts.next
    puts t
  end
end
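To wire the same analyzer into searching, something like this should do
it (a sketch only; I'm using the hash-style options of the newer Ferret
APIs and a made-up :content field, so check it against your version):

analyzer = Ferret::Analysis::StateAnalyzer.new

# Index with the custom analyzer so "NSW" is stored both ways.
index = Ferret::Index::Index.new(:analyzer => analyzer)
index << {:content => "I used to live in NSW but now I live in the A.C.T."}

# Give the query parser the same analyzer so queries are tokenized
# the same way as the indexed text.
parser = Ferret::QueryParser.new(:fields => [:content],
                                 :analyzer => analyzer)
query = parser.parse('content:NSW')
index.search_each(query) { |doc_id, score| puts "#{doc_id}: #{score}" }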
On 4/23/06, Jens Kraemer <kraemer at webit.de> wrote:
> Hi Carl,
>
> On Tue, Apr 18, 2006 at 11:32:36PM -0600, Carl Youngblood wrote:
> > Forgive me if this topic has already been discussed on the list. I
> > googled but couldn't find much. I'd like to search through text for
> > US state abbreviations that are written in capitals. What is the best
> > way to do this? I read somewhere that tokenized fields are stored in
> > the index in lowercase, so I am concerned that I will lose precision.
> > What is the best way to store a field so that normal searches are
> > case-insensitive but case-sensitive searches can still be made?
>
> Are you sure this is a problem, i.e. do you get wrong hits because
> the lowercase variant of an abbreviation is used in another context ?
> I don't know what those abbrevs look like...
>
> To run case-sensitive and case-insensitive searches you'd need two
> fields, a tokenized one for normal case-insensitive searches, and an
> untokenized one for looking up the abbreviations.
>
> To reduce overhead in the index, you could filter the text for the
> known set of abbreviations at indexing time and only store those
> values in the untokenized field. Possibly this could be done in a
> custom analyzer.
>
> regards,
> Jens
>
>
> --
> webit! Gesellschaft für neue Medien mbH www.webit.de
> Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de
> Schnorrstraße 76 Tel +49 351 46766 0
> D-01069 Dresden Fax +49 351 46766 66
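For reference, Jens's two-field idea could look roughly like this (a
sketch; the FieldInfos API shown is from the later Ferret releases, and
the :content/:state field names are made up):

require 'ferret'
include Ferret

# Two fields: :content is tokenized (case-insensitive), :state is
# untokenized, holding a recognised abbreviation exactly as written.
field_infos = Index::FieldInfos.new
field_infos.add_field(:content)
field_infos.add_field(:state, :index => :untokenized)

index = Index::Index.new(:field_infos => field_infos)
index << {:content => "I used to live in NSW", :state => "NSW"}

# Normal search goes through the analyzer (case-insensitive)...
puts index.search('content:nsw').total_hits

# ...while a TermQuery against :state matches the exact upper-case term,
# bypassing the query parser's analysis.
puts index.search(Search::TermQuery.new(:state, "NSW")).total_hits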