S D
2008-Apr-23 04:50 UTC
[Ferret-talk] Problem if method is called during Analyzer.token_stream operation
I've written a tokenizer/analyzer that parses a file, extracting tokens, and I run this analyzer/tokenizer on ASCII data consisting of XML files (the tokenizer skips over XML elements but maintains relative positioning). I've written many unit tests to check the produced token stream and was confident that the tokenizer was working properly. Then I noticed two problems:

1. StopFilter (using English stop words) does not properly filter the token stream output from my tokenizer. If I explicitly pass an array of stop words to the StopFilter, it still doesn't work. If I simply switch my tokenizer to a StandardTokenizer, the stop words are appropriately filtered (of course the XML tags are treated differently).

2. When I try a simple search, no results come up. I can see that my tokenizer is adding files to the index, but a simple search (using Ferret::Index::Index.search_each) produces no results.

I'm now trying to track down the above problems, which seems to have led me to another (though possibly related) problem for which I am seeking an answer. Below is the token_stream() method of my analyzer (XMLAnalyzer). Note that I've commented out my custom tokenizer (XMLTokenizer) so that the StandardTokenizer is used within my custom analyzer:

    def token_stream(field, str)
      # ts = XMLTokenizer.new(str)
      ts = StandardTokenizer.new(str)
      # test_token_stream(ts)
      ts
    end

I've also commented out the call to test_token_stream(), a method taken from Balmain's Ferret book (O'Reilly, p. 68) that simply prints out the tokens contained within a stream:

    def test_token_stream(token_stream)
      puts "\033[32mStart | End | PosInc | Text\033[m"
      while tkn = token_stream.next
        puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
      end
    end

If I keep test_token_stream() commented out, then indexing and search work fine (using StandardTokenizer).
However, if I do not comment out the call to test_token_stream(), then creating the index appears to work fine but a search produces no results. I haven't been able to track this down, but thought it might be related to the problems I was having with XMLTokenizer.

Note that I create my index with Ferret::Index::Index:

    index = Index::Index.new(:analyzer => XMLAnalyzer.new(),
                             :path => options.indexLocation,
                             :create_if_missing => true)

and I perform searches using Ferret::Search::Searcher.

Any thoughts would be appreciated.

Regards,
John
aka sd.codewarrior
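For background on the StopFilter symptom above: Ferret filters wrap a tokenizer and pull tokens from it one at a time via next(), which returns nil once the stream is exhausted. One possible reason stop-word filtering can appear to do nothing is that the text comparison is exact, so tokens are normally lowercased before the stop filter sees them. The sketch below is a pure-Ruby illustration of that TokenStream-style protocol; the class names here (ArrayTokenizer, LowerCaseFilter, StopFilter) are illustrative stand-ins, not Ferret's actual implementations.

```ruby
# Minimal stand-in for the TokenStream protocol: #next returns the
# next token, or nil when the stream is exhausted.
STOP_WORDS = %w[the a an and or of].freeze

class ArrayTokenizer
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def next
    @tokens.shift
  end
end

# Lowercases each token pulled from the wrapped stream.
class LowerCaseFilter
  def initialize(input)
    @input = input
  end

  def next
    tkn = @input.next
    tkn && tkn.downcase
  end
end

# Skips tokens whose text matches a stop word exactly.
class StopFilter
  def initialize(input, stop_words)
    @input = input
    @stop  = stop_words
  end

  def next
    while tkn = @input.next
      return tkn unless @stop.include?(tkn)
    end
    nil
  end
end

ts = StopFilter.new(
  LowerCaseFilter.new(ArrayTokenizer.new(%w[The Quick Fox])),
  STOP_WORDS
)
out = []
while tkn = ts.next
  out << tkn
end
p out  # => ["quick", "fox"]
```

Without the LowerCaseFilter in the chain, "The" would not match the lowercase stop word "the" and would pass through unfiltered, which is one thing worth checking in a custom tokenizer.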
Jens Kraemer
2008-Apr-23 08:13 UTC
[Ferret-talk] Problem if method is called during Analyzer.token_stream operation
Hi!

First guess: the test_token_stream method removes items from the stream by calling next(), so the stream is empty by the time you return it, and Ferret has nothing left to index.

Cheers,
Jens

On Wed, Apr 23, 2008 at 12:50:25AM -0400, S D wrote:
> [original message quoted above]

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
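Jens's diagnosis can be demonstrated with a minimal stand-in for a token stream. ToyStream below is a hypothetical class used only for illustration; like a real Ferret TokenStream, its next() returns nil once the stream is exhausted, so a debugging loop that drains the stream leaves nothing for the indexer.

```ruby
# Hypothetical minimal stand-in for a Ferret TokenStream:
# #next returns the next token, or nil once the stream is exhausted.
class ToyStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def next
    @tokens.shift
  end
end

# The same consuming loop test_token_stream uses: call #next until nil.
def drain(stream)
  out = []
  while tkn = stream.next
    out << tkn
  end
  out
end

ts = ToyStream.new(%w[the quick fox])
drained  = drain(ts)   # the debugging pass consumes every token
leftover = ts.next     # => nil -- nothing left for the indexer

# Fix: debug on a throwaway stream, return a fresh one for indexing.
drain(ToyStream.new(%w[the quick fox]))
fresh = ToyStream.new(%w[the quick fox])
first = fresh.next     # => "the" -- the indexer sees tokens again
```

Applied to the token_stream method in the original post, the equivalent fix would be to call test_token_stream on a throwaway StandardTokenizer.new(str) and then return a second, freshly constructed tokenizer (assuming it is acceptable to tokenize the string twice while debugging).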