It's my understanding that the tokens in a token_stream consist of text along with start/stop positions that represent the byte positions of the text within the corresponding document field. The documentation I've been reading (i.e., O'Reilly - Ferret - page 67) suggests that these byte positions represent positions within the entire field, but based on my testing it appears that the byte positions are with respect to the line that contains the corresponding text within the field. I read my fields following Brian McCallister:

    index.add_document :file => path,
                       :content => file.readlines

Hence, if I have a file that contains carriage returns, the token positions will be reset with each new line. For example, the following file contents (File A)

    this is a sentence

will result in a token for the text "sentence" with start position equal to 10 (assume "this" starts in position 0), while a file with a carriage return

    this is a
    sentence

will result in a token for the text "sentence" with start position equal to 0. I get the same results for my custom tokenizer as well as StandardTokenizer. The above does not seem consistent with the documentation, but more importantly, it seems that global positions are more useful than line-based positions (e.g., for highlighting).

Digging a little deeper, it seems that the tokenizer's initialize method is called each time the token_stream method of the containing analyzer is called:

    class CustomAnalyzer
      def token_stream(field, str)
        ts = StandardTokenizer.new(str)
      end
    end

Am I missing something here? Are the start/stop byte positions intended to be with respect to the line? Is there a way for token_stream to only be called once for an entire string sequence (even if carriage returns are contained)?

Thanks,
John
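[Editor's note: the offset reset described above can be reproduced without Ferret at all. The naive scanner below is only a stand-in for a tokenizer, not Ferret's API; it shows that when the field value is an array from readlines, each element is analyzed on its own, so byte offsets restart at 0 per line, whereas a single string from read yields global offsets.]

```ruby
require "stringio"

# Stand-in for a tokenizer: returns [text, start, stop] triples, where
# start/stop are byte offsets within the string that was passed in
# (analogous to a token's start/end positions).
def tokenize(str)
  str.scan(/\S+/).each_with_object([]) do |word, toks|
    start = str.index(word, toks.empty? ? 0 : toks.last[2])
    toks << [word, start, start + word.bytesize]
  end
end

file = StringIO.new("this is a\nsentence\n")

# :content => file.readlines -- each line is tokenized separately,
# so "sentence" starts at byte 0 of its own line.
per_line = file.readlines.flat_map { |line| tokenize(line) }
p per_line.last   # => ["sentence", 0, 8]

file.rewind

# :content => file.read -- one string, so offsets are global and
# "sentence" starts at byte 10 of the field.
whole = tokenize(file.read)
p whole.last      # => ["sentence", 10, 18]
```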
Hi,

File.readlines returns an array, which I think is the root cause of the problem. Just using File.read instead should solve your problem.

Cheers,
Jens

On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> It's my understanding that the tokens in a token_stream consist of text
> along with start/stop positions that represent the byte positions of the
> text within the corresponding document field. [...]

_______________________________________________
Ferret-talk mailing list
Ferret-talk at rubyforge.org
http://rubyforge.org/mailman/listinfo/ferret-talk

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
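[Editor's note: the type difference Jens points out is easy to verify in irb; the snippet below uses StringIO in place of a real file so it is self-contained.]

```ruby
require "stringio"

file = StringIO.new("this is a\nsentence\n")

lines = file.readlines   # Array of per-line strings
file.rewind
whole = file.read        # one String covering the entire field

p lines.class   # => Array  -- each element gets analyzed on its own
p whole.class   # => String -- byte offsets are global across the field

# "sentence" sits at byte 10 of the whole string...
p whole.index("sentence")     # => 10
# ...but at byte 0 of its own line.
p lines[1].index("sentence")  # => 0
```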
That was it. Stupid mistake on my part. Thanks!

John

On Mon, Apr 28, 2008 at 6:37 AM, Jens Kraemer <jk at jkraemer.net> wrote:
> Hi,
>
> File.readlines returns an array which I think is the root cause of the
> problem. Just using File.read instead should solve your problem.
>
> Cheers,
> Jens [...]