It's my understanding that the tokens in a token_stream consist of text along with start/stop positions that represent the byte positions of the text within the corresponding document field. The documentation I've been reading (i.e., O'Reilly - Ferret - page 67) suggests that these byte positions represent positions within the entire field, but based on my testing it appears that the byte positions are with respect to the line that contains the corresponding text within the field. I read my fields following Brian McCallister:

    index.add_document :file => path,
                       :content => file.readlines

Hence, if I have a file that contains carriage returns, the token positions will be reset with each new line. For example, the following file contents (File A)

    this is a sentence

will result in a token for the text "sentence" with start position equal to 10 (assume "this" starts in position 0), while a file with a carriage return

    this is a
    sentence

will result in a token for the text "sentence" with start position equal to 0. I get the same results for my custom tokenizer as well as StandardTokenizer. The above does not seem consistent with the documentation, but more importantly, it seems that global positions are more useful than line-based positions (e.g., for highlighting).

Digging a little deeper, it seems that the tokenizer's initialize method is called each time the token_stream method of the containing analyzer is called:

    class CustomAnalyzer
      def token_stream(field, str)
        ts = StandardTokenizer.new(str)
      end
    end

Am I missing something here? Are the start/stop byte positions intended to be with respect to the line? Is there a way for token_stream to only be called once for an entire string sequence (even if carriage returns are contained)?

Thanks,
John
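[Editor's note: the offset reset described above can be reproduced without Ferret at all. The naive scanner below is only a stand-in for a tokenizer, not Ferret's API; it shows that when the field value is an array from readlines, each element is analyzed on its own, so byte offsets restart at 0 per line, whereas a single string from read yields global offsets.]

```ruby
require "stringio"

# Stand-in for a tokenizer: returns [text, start, stop] triples, where
# start/stop are byte offsets within the string that was passed in
# (analogous to a token's start/end positions).
def tokenize(str)
  str.scan(/\S+/).each_with_object([]) do |word, toks|
    start = str.index(word, toks.empty? ? 0 : toks.last[2])
    toks << [word, start, start + word.bytesize]
  end
end

file = StringIO.new("this is a\nsentence\n")

# :content => file.readlines -- each line is tokenized separately,
# so "sentence" starts at byte 0 of its own line.
per_line = file.readlines.flat_map { |line| tokenize(line) }
p per_line.last   # => ["sentence", 0, 8]

file.rewind

# :content => file.read -- one string, so offsets are global and
# "sentence" starts at byte 10 of the field.
whole = tokenize(file.read)
p whole.last      # => ["sentence", 10, 18]
```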
Hi,

File.readlines returns an array, which I think is the root cause of the problem. Just using File.read instead should solve your problem.

Cheers,
Jens

On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> It's my understanding that the tokens in a token_stream consist of text
> along with start/stop positions that represent the byte positions of the
> text within the corresponding document field. [...]

_______________________________________________
Ferret-talk mailing list
Ferret-talk at rubyforge.org
http://rubyforge.org/mailman/listinfo/ferret-talk

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
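[Editor's note: the type difference Jens points out is easy to verify in irb; the snippet below uses StringIO in place of a real file so it is self-contained.]

```ruby
require "stringio"

file = StringIO.new("this is a\nsentence\n")

lines = file.readlines   # Array of per-line strings
file.rewind
whole = file.read        # one String covering the entire field

p lines.class   # => Array  -- each element gets analyzed on its own
p whole.class   # => String -- byte offsets are global across the field

# "sentence" sits at byte 10 of the whole string...
p whole.index("sentence")     # => 10
# ...but at byte 0 of its own line.
p lines[1].index("sentence")  # => 0
```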
That was it. Stupid mistake on my part. Thanks!

John

On Mon, Apr 28, 2008 at 6:37 AM, Jens Kraemer <jk at jkraemer.net> wrote:
> Hi,
>
> File.readlines returns an array which I think is the root cause of the
> problem. Just using File.read instead should solve your problem.
>
> Cheers,
> Jens [...]