First, thanks to Jens K. for pointing out a stupid error on my part regarding the use of test_token_stream().

My current problem: a custom tokenizer I've written in Ruby does not properly create an index (or at least searches on the index don't work). Using test_token_stream() I have verified that my tokenizer properly creates the token stream; certainly each Token's attributes are set properly. Nevertheless, simple searches return zero results.

The essence of my tokenizer is to skip beyond XML tags in a file, break up the text components and return them as tokens. I use this approach rather than an Hpricot approach because I need to keep track of the location of the text with respect to the XML tags: after a search for a phrase I'll want to extract the nearby XML tags, as they contain important context.

My tokenizer (XMLTokenizer) contains the obligatory initialize, next and text methods (shown below) as well as a lot of parsing methods that are called at the top level by the method XMLTokenizer.get_next_token, which is the primary action within next. I haven't included the details of get_next_token, as I'm assuming that if each token it produces has the proper attributes then it shouldn't be the cause of the problem. What more should I be looking for? I've been looking for a custom tokenizer written in Ruby to model mine after; any suggestions?

  def initialize(xmlText)
    @xmlText = xmlText.gsub(/[;,!]/, ' ')
    @currPtr = 0
    @currWordStart = nil
    @currTextStart = 0
    @nextTagStart = 0
    @startOfTextRegion = 0

    @currTextStart = \
      XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)
    @nextTagStart = \
      XMLTokenizer.skip_beyond_current_text(@currTextStart, @xmlText)
    @currPtr = @currTextStart
    @startOfTextRegion = 1
  end

  def next
    tkn = get_next_token
    if tkn != nil
      puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
    end
    return tkn
  end

  def text=(text)
    initialize(text)
    @xmlText
  end

Below is text from a previous, related message that shows that StopFiltering is not working:

> I've written a tokenizer/analyzer that parses a file extracting tokens, and
> I operate this analyzer/tokenizer on ASCII data consisting of XML files (the
> tokenizer skips over XML elements but maintains relative positioning). I've
> written many unit tests to check the produced token stream and was
> confident that the tokenizer was working properly. Then I noticed two
> problems:
>
> 1. StopFilter (using English stop words) does not properly filter the
> token stream output from my tokenizer. If I explicitly pass an array of stop
> words to the stop filter it still doesn't work. If I simply switch my
> tokenizer to a StandardTokenizer the stop words are appropriately filtered
> (of course the XML tags are treated differently).
>
> 2. When I try a simple search no results come up. I can see that my
> tokenizer is adding files to the index but a simple search (using
> Ferret::Index::Index.search_each) produces no results.

Any suggestions are appreciated.

John

Hi!

On Wed, Apr 23, 2008 at 12:18:12PM -0400, S D wrote:
[..]
> My current problem, a custom tokenizer I've written in Ruby does not
> properly create an index (or at least searches on the index don't work).
> Using test_token_stream() I have verified that my tokenizer properly creates
> the token_stream; certainly each Token's attributes are set properly.
> Nevertheless, simple searches return zero results.

Could you have a look at your index with the ferret-browser utility? It lets you check exactly what has been indexed, which may lead you to the root of your problem.

What does your analyzer, where you use the Tokenizer, look like? Is your next() method being called and working correctly when you test drive your analyzer, e.g. in irb?

Cheers,
Jens

[..]
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
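
One way to test drive a token stream by hand, as suggested above, is to pull tokens until next returns nil and look at what comes out. The helper below is a minimal sketch of that loop. dump_token_stream, StubToken and ArrayStream are hypothetical names introduced here; the stub stream exists only so the helper can be run without Ferret or an index. With Ferret loaded, the object passed in would presumably be whatever the analyzer's token_stream(field, str) method returns.

```ruby
# Collects [text, start, end, pos_inc] rows from any object whose `next`
# method returns a token, or nil at the end of input.
def dump_token_stream(stream)
  rows = []
  while (tok = stream.next)
    rows << [tok.text, tok.start, tok.end, tok.pos_inc]
  end
  rows
end

# Hypothetical stand-ins so the helper can be tried in isolation.
StubToken = Struct.new(:text, :start, :end, :pos_inc)

class ArrayStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Hands out the stored tokens one at a time, then nil.
  def next
    @tokens.shift
  end
end
```

For example, dump_token_stream(ArrayStream.new([StubToken.new("hello", 0, 5, 1)])) returns a single row for "hello". An empty result from the real tokenizer at this point would suggest the problem lies in the tokenizer itself rather than in indexing.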

[Unfortunately I receive my messages as a batched digest, so I'm forced to respond in a new thread. I've asked the administrator to change my config so that I receive each message on this list individually. Sorry for any inconvenience.]

Thanks for the response. Here is XMLAnalyzer (currently I'm not using the stop or lower-case filter):

  class XMLAnalyzer < Ferret::Analysis::Analyzer
    def initialize(synonym_engine = nil,
                   stop_words = FULL_ENGLISH_STOP_WORDS,
                   lower = true)
      @synonym_engine = synonym_engine
      @lower = lower
      @stop_words = stop_words
    end

    def token_stream(field, str)
      # ts = XMLTokenizer.new(str)
      ts = StandardTokenizer.new(str)
      # test_token_stream(ts)
      return ts
    end
  end

I just tried running ferret-browser by pointing it at an index created with StandardTokenizer and got the error below in Firefox. Is any configuration necessary? Presumably the defaults should work.

John

  Internal Server Error
  No such file or directory - /usr/local/lib/site_ruby/1.8/ferret/browser/views/error/index.rhtml
  ------------------------------
  WEBrick/1.3.1 (Ruby/1.8.6/2007-06-07) at 127.0.0.1:3301
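
The analyzer above stores stop_words and lower but its token_stream does not use them yet. The general pattern for such filters is that each one wraps another stream and exposes the same next method. The following is a plain-Ruby sketch of that pattern under stated assumptions: WordStream, DowncaseFilter, StopWordFilter, FilterToken and build_stream are hypothetical stand-ins, not Ferret's classes (Ferret's own counterparts are LowerCaseFilter and StopFilter). Folding the position increments of dropped words into the next kept token is a choice made in this sketch, not a statement about how Ferret behaves.

```ruby
require 'strscan'

FilterToken = Struct.new(:text, :start, :end, :pos_inc)

# Stand-in tokenizer: splits on whitespace and records byte offsets.
class WordStream
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  def next
    @scanner.skip(/\s+/)
    word = @scanner.scan(/\S+/)
    return nil unless word
    FilterToken.new(word, @scanner.pos - word.bytesize, @scanner.pos, 1)
  end
end

# Lower-cases each token's text; offsets are left untouched.
class DowncaseFilter
  def initialize(input)
    @input = input
  end

  def next
    tok = @input.next
    tok.text = tok.text.downcase if tok
    tok
  end
end

# Drops tokens whose text is in stop_words and adds their position
# increments to the next token that is kept.
class StopWordFilter
  def initialize(input, stop_words)
    @input = input
    @stop_words = stop_words
  end

  def next
    skipped = 0
    while (tok = @input.next)
      unless @stop_words.include?(tok.text)
        tok.pos_inc += skipped
        return tok
      end
      skipped += tok.pos_inc
    end
    nil
  end
end

# Wraps a tokenizer according to the same two options the analyzer stores.
def build_stream(tokenizer, stop_words: [], lower: true)
  stream = tokenizer
  stream = DowncaseFilter.new(stream) if lower
  stream = StopWordFilter.new(stream, stop_words) unless stop_words.empty?
  stream
end
```

Note that the stop filter compares token text exactly, so in this sketch it only removes "The" when the lower-casing filter runs before it.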

Hi!

On Wed, Apr 23, 2008 at 01:59:32PM -0400, S D wrote:
> [unfortunately I received my messages as a batched digest...hence, I'm
> forced to respond in a new thread. I've requested the administrator to
> change my config to receive each message on this list. Sorry for any
> inconvenience]
>
> Thanks for the response below. Here is XMLAnalyzer (currently I'm not using
> the stop or lower case filter):
>
> class XMLAnalyzer < Ferret::Analysis::Analyzer

Could you try whether not inheriting from Ferret's Analyzer changes anything? At least I usually don't do that in my analyzers.

[..]
> I just tried running ferret-browser by pointing to an index created with
> StandardTokenizer and got the error below in Firefox. Is there any
> configuration that is necessary? Presumably the defaults should work.
[..]
> Internal Server Error No such file or directory -
> /usr/local/lib/site_ruby/1.8/ferret/browser/views/error/index.rhtml
> ------------------------------
> WEBrick/1.3.1 (Ruby/1.8.6/2007-06-07) at 127.0.0.1:3301

Works just fine here (Ferret 0.11.6 / Ubuntu); I just tried it out. The location in the error message looks a bit strange to me. How did you install Ferret?

Cheers,
Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold

I changed XMLAnalyzer so that it does not inherit from Ferret::Analysis::Analyzer. That seemed to have no effect.

I successfully ran ferret-browser. As shown below, I am using two fields, :file and :content. When I browse through the "file" terms everything appears fine; all of the filenames are found. The "content" field, on the other hand, is empty. Apparently I'm not getting any tokens into the index at all. One question I have is exactly what should happen in the Tokenizer#text method, and when will this method be called?

Thanks,
John

====
  index = Index::Index.new(:analyzer => XMLAnalyzer.new(),
                           :path => options.indexLocation,
                           :create => true)

  Find.find(options.searchPath) do |path|
    if FileTest.file? path
      File.open(path) do |file|
        puts "Adding file to index: " + path
        index.add_document :file => path, :content => file.readlines
      end
    end
  end
====
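
One detail of the indexing loop that may be worth checking: in Ruby, readlines returns an Array with one String per line, whereas read returns the whole contents as a single String, so the :content value passed to add_document above is an Array. The thread does not establish whether this has anything to do with the empty "content" field. The sketch below only demonstrates the difference between the two calls, using StringIO in place of a real file.

```ruby
require 'stringio'

# StringIO offers the same readlines/read calls as File, without touching disk.
io = StringIO.new("<doc>\n<p>first line</p>\n</doc>\n")

lines = io.readlines   # Array with one String per line, newlines included
io.rewind
whole = io.read        # a single String holding the entire contents

puts lines.class       # Array
puts whole.class       # String
```

A tokenizer that tracks offsets relative to the start of the document, as the XMLTokenizer in this thread does, would see each line separately if it were handed the elements of the Array one at a time, so offsets would restart on every line.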