search for: content_extraction

Displaying 3 results from an estimated 3 matches for "content_extraction".

2007 Sep 27
2
Problem getting "extract" from RDig
...following code in my /config/rdig_config.rb

    RDig.configuration do |cfg|
      cfg.crawler.start_urls = [ 'http://localhost:3000/login/index' ]
      cfg.index.path = "C:/rails/managedsupport/index/development/rdig-index"
      cfg.verbose = true
      cfg.content_extraction = OpenStruct.new(
        :hpricot => OpenStruct.new(
          :title_tag_selector => 'title',
          :content_tag_selector => 'body'
        )
      )

    end

I have created the index file using the code

    rdig -c config/...
2007 Sep 18
4
basic rdig setup
...like they have next to nothing in them. Both rdig_config.rb files look like:

    cfg.crawler.start_urls = [ 'http://domain.tpl/' ]
    cfg.crawler.include_hosts = [ 'domain.tpl/' ]
    cfg.index.path = './rdig_index'
    cfg.verbose = true
    cfg.content_extraction = OpenStruct.new(
      :hpricot => OpenStruct.new(
        :title_tag_selector => 'title',
        :content_tag_selector => 'body'
      )
    )

Both environment.rb files have:

    require 'acts_as_ferret'
    require 'rdig'
    require...
2007 Jan 23
3
Someone getting RDig work for Linux?
    ...th = '/home/myaccount/index'

    ##################################################################
    # options you might want to set, the given values are the defaults

    # set to true to get stack traces on errors
    cfg.verbose = true

    # content extraction options
    cfg.content_extraction = OpenStruct.new(
      # HPRICOT configuration
      # this is the html parser used by default from RDig 0.3.3 upwards.
      # Hpricot by far outperforms Rubyful Soup, and is at least as flexible when
      # it comes to selection of portions of the html documents.
      :hpricot => OpenStruct.new(...
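
All three results set cfg.content_extraction to the same nested OpenStruct. The sketch below only illustrates that shape and how the selectors are read back. It assumes a plain OpenStruct as a stand-in for the cfg object that RDig yields, so it runs without the rdig gem; the method name sketch_config and the stand-in itself are hypothetical and are not part of RDig's API.

```ruby
require 'ostruct'

# Hypothetical helper: builds a stand-in for RDig's cfg object so the
# nesting of content_extraction can be inspected without the rdig gem.
def sketch_config
  # Assumption: cfg exposes crawler and index sub-objects, as the posts suggest.
  cfg = OpenStruct.new(crawler: OpenStruct.new, index: OpenStruct.new)

  # Values taken from the quoted posts.
  cfg.crawler.start_urls = ['http://localhost:3000/login/index']
  cfg.index.path = './rdig_index'
  cfg.verbose = true

  # Outer OpenStruct holds one entry per parser; the inner one holds
  # the selectors for the title and the content of each page.
  cfg.content_extraction = OpenStruct.new(
    :hpricot => OpenStruct.new(
      :title_tag_selector   => 'title',
      :content_tag_selector => 'body'
    )
  )
  cfg
end

cfg = sketch_config
puts cfg.content_extraction.hpricot.title_tag_selector
puts cfg.content_extraction.hpricot.content_tag_selector
```

Note that both OpenStruct.new calls need their own closing parenthesis; if the outer one is missing, Ruby cannot parse the file.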