Displaying 3 results from an estimated 3 matches for "content_extraction".
2007 Sep 27
2
Problem getting "extract" from RDig
...following code in my /config/rdig_config.rb
RDig.configuration do |cfg|
  cfg.crawler.start_urls = [ 'http://localhost:3000/login/index' ]
  cfg.index.path = "C:/rails/managedsupport/index/development/rdig-index"
  cfg.verbose = true
  cfg.content_extraction = OpenStruct.new(
    :hpricot => OpenStruct.new(
      :title_tag_selector => 'title',
      :content_tag_selector => 'body'
    )
  )
end
I have created the index file using the code
rdig -c config/...
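The content_extraction setting quoted in this post is a nested OpenStruct, so the selectors are read back through chained accessor calls. The following is only a minimal sketch of that nesting using Ruby's standard library: the `cfg` object here is a hypothetical stand-in, not RDig's real configuration object, and `titel_tag_selector` is a deliberately misspelled name used for illustration.

```ruby
require 'ostruct'

# Hypothetical stand-in for the cfg object that RDig yields to the
# configuration block; only the nesting from the quoted config is shown.
cfg = OpenStruct.new
cfg.content_extraction = OpenStruct.new(
  :hpricot => OpenStruct.new(
    :title_tag_selector   => 'title',
    :content_tag_selector => 'body'
  )
)

# Nested values are read back through plain accessor calls.
title_selector   = cfg.content_extraction.hpricot.title_tag_selector
content_selector = cfg.content_extraction.hpricot.content_tag_selector

# An OpenStruct returns nil for a key that was never set, so a
# mistyped option name does not raise an error.
missing = cfg.content_extraction.hpricot.titel_tag_selector

puts title_selector
puts content_selector
puts missing.inspect
```

Because an unset key silently yields nil, it may be worth double-checking the spelling of each option name when extraction returns nothing.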
2007 Sep 18
4
basic rdig setup
...like they have next to nothing
in them.
Both rdig_config.rb files look like:
cfg.crawler.start_urls = [ 'http://domain.tpl/' ]
cfg.crawler.include_hosts = [ 'domain.tpl/' ]
cfg.index.path = './rdig_index'
cfg.verbose = true
cfg.content_extraction = OpenStruct.new(
  :hpricot => OpenStruct.new(
    :title_tag_selector => 'title',
    :content_tag_selector => 'body'
  )
Both environment.rb files have:
require 'acts_as_ferret'
require 'rdig'
require...
2007 Jan 23
3
Someone getting RDig work for Linux?
...th = '/home/myaccount/index'
##################################################################
# options you might want to set, the given values are the defaults
# set to true to get stack traces on errors
cfg.verbose = true
# content extraction options
cfg.content_extraction = OpenStruct.new(
# HPRICOT configuration
# this is the html parser used by default from RDig 0.3.3 upwards.
# Hpricot by far outperforms Rubyful Soup, and is at least as flexible when
# it comes to selection of portions of the html documents.
:hpricot => OpenStruct.new(...