Hi Steven,
sorry for replying so late - I'm quite busy at the moment.
The error you received was caused by an invalid mailto: link
which RDig failed to handle correctly.
I just uploaded RDig 0.3.1, fixing this bug.
While testing with your site I noticed that parsing the index page
takes quite a long time, so you might have to set
cfg.crawler.wait_before_leave
to a higher value (20 worked for me) to prevent RDig from exiting before
the parser has finished with the index page.
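As a rough sketch, this is how the setting would sit next to the lines from your config. The two Struct definitions are only a stand-in I made up so the snippet runs on its own; in config/rdig_config.rb you would use the real cfg object and just add the last line.

```ruby
# Hypothetical stand-in for the cfg object RDig hands to the config file.
# These Structs are NOT part of RDig; they only make the snippet runnable.
CrawlerSettings = Struct.new(:start_urls, :include_hosts, :wait_before_leave)
Settings = Struct.new(:crawler)
cfg = Settings.new(CrawlerSettings.new)

# Existing settings, taken from the config in the original message.
cfg.crawler.start_urls    = ['http://www.defensetech.org']
cfg.crawler.include_hosts = ['www.defensetech.org']

# New setting: 20 is the value that worked in my test with this site.
cfg.crawler.wait_before_leave = 20
```

If the crawler still exits before the index page is parsed, try raising the value further.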
RDig's parsing speed is really bad for big pages (your
index page weighs around 62kB). I'd happily accept a patch adding a
faster HTML content extraction mechanism for RDig users to choose from ;-)
Maybe even a special Ferret analyzer that just strips out any HTML tags
would do.
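To illustrate the tag-stripping idea, here is a minimal sketch in plain Ruby. It is neither a Ferret analyzer nor RDig's actual extractor, and strip_html is a made-up name; it simply drops script/style blocks and comments, removes the remaining tags with regular expressions, and collapses whitespace. A regex approach like this will trip over malformed markup and does not decode entities, so treat it as a starting point only.

```ruby
# Crude HTML-to-text conversion using regular expressions only.
# strip_html is a hypothetical helper, not part of RDig or Ferret.
def strip_html(html)
  # Drop script and style elements together with their contents.
  text = html.gsub(/<(script|style)\b[^>]*>.*?<\/\1>/mi, ' ')
  # Drop HTML comments.
  text = text.gsub(/<!--.*?-->/m, ' ')
  # Replace every remaining tag with a space so words do not run together.
  text = text.gsub(/<[^>]*>/, ' ')
  # Collapse runs of whitespace and trim the ends.
  text.gsub(/\s+/, ' ').strip
end

puts strip_html('<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>')
# prints "Title Some bold text."
```

Something along these lines could be wrapped in an extractor or analyzer for people who prefer speed over accuracy.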
Regards,
Jens
On Tue, Jul 25, 2006 at 11:28:13AM +0200, Steven Shingler wrote:
> Hi all,
>
> Am having problems using RDig:
>
> With this rdig config...
>
> cfg.crawler.start_urls = ['http://www.defensetech.org']
> cfg.crawler.include_hosts = ['www.defensetech.org']
> cfg.index.path = '/my/path/to/index'
> cfg.verbose = true
>
> ...I get this output:
>
> $ rdig -c config/rdig_config.rb
> /usr/local/lib/site_ruby/1.8/ferret/index/term.rb:45: warning: method
> redefined; discarding old text
> /usr/local/lib/site_ruby/1.8/ferret/search/sort_field.rb:69: warning:
> instance variable @name not initialized
> /usr/local/lib/site_ruby/1.8/ferret/search/sort_field.rb:69: warning:
> instance variable @name not initialized
> lib/ferret/query_parser/query_parser.y:128: warning: method redefined;
> discarding old initialize
> lib/ferret/query_parser/query_parser.y:157: warning: method redefined;
> discarding old parse
> lib/ferret/query_parser/query_parser.y:216: warning: method redefined;
> discarding old clean_string
> /usr/lib/ruby/gems/1.8/gems/rubyful_soup-1.0.4/lib/rubyful_soup.rb:230:
> warning: method redefined; discarding old attrs
> discovered content extractor class:
> RDig::ContentExtractors::PdfContentExtractor
> discovered content extractor class:
> RDig::ContentExtractors::WordContentExtractor
> discovered content extractor class:
> RDig::ContentExtractors::HtmlContentExtractor
> using Ferret 0.9.0
> /usr/local/lib/site_ruby/1.8/rdig/url_filters.rb:116: warning: instance
> variable @patterns not initialized
> /usr/local/lib/site_ruby/1.8/rdig/url_filters.rb:105: warning: instance
> variable @patterns not initialized
> added url http://www.defensetech.org
> fetching http://www.defensetech.org
> waiting for threads to finish...
> /usr/local/lib/site_ruby/1.8/rdig/url_filters.rb:116: warning: instance
> variable @patterns not initialized
> /usr/local/lib/site_ruby/1.8/rdig/url_filters.rb:105: warning: instance
> variable @patterns not initialized
> added url http://www.defensetech.org
> error processing document http://www.defensetech.org/: undefined local
> variable or method `url' for #<RDig::HttpDocument:0xb7a7fbb4>
> Trace: /usr/local/lib/site_ruby/1.8/rdig/documents.rb:35:in `initialize'
> /usr/local/lib/site_ruby/1.8/rdig/documents.rb:107:in `initialize'
> /usr/local/lib/site_ruby/1.8/rdig/documents.rb:15:in `create'
> /usr/local/lib/site_ruby/1.8/rdig/crawler.rb:68:in `add_url'
> /usr/local/lib/site_ruby/1.8/rdig/crawler.rb:51:in `process_document'
> /usr/local/lib/site_ruby/1.8/rdig/crawler.rb:50:in `process_document'
> /usr/local/lib/site_ruby/1.8/rdig/crawler.rb:28:in `run'
> /usr/local/lib/site_ruby/1.8/rdig/crawler.rb:25:in `run'
> /usr/local/lib/site_ruby/1.8/rdig/crawler.rb:24:in `run'
> /usr/local/lib/site_ruby/1.8/rdig.rb:258:in `run'
> /usr/bin/rdig:14
>
> If anyone could tell me why @patterns and url aren't being set, I'd
> really appreciate it.
>
> Am on Ubuntu 6.06, ruby 1.8.4, gems: rdig 0.3.0, rubyful_soup 1.0.4,
> ferret 0.9.4
>
> Many Thanks,
> Steven
>
--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66