This is the output I got:
root@linux:~# rdig -c configfile
RDig version 0.3.4
using Ferret 0.10.14
added url file:///home/myaccount/documents/
waiting for threads to finish...
root@linux:~# rdig -c configfile -q "Ruby"
RDig version 0.3.4
using Ferret 0.10.14
executing query >Ruby<
Query:
total results: 0
root@linux:~#
My config file is below.
I renamed the block variable from config to cfg, in case I had mistyped it somewhere.
cfg.index.create = false
RDig.configuration do |cfg|
##################################################################
# options you really should set
# provide one or more URLs for the crawler to start from
cfg.crawler.start_urls = [ 'http://www.example.com/' ]
# use something like this for crawling a file system:
cfg.crawler.start_urls = [ 'file:///home/myaccount/documents/' ]
# beware, mixing file and http crawling is not possible and might result in
# unpredictable results.
# limit the crawl to these hosts. The crawler will never
# follow any links pointing to hosts other than those given here.
# ignored for file system crawling
cfg.crawler.include_hosts = [ 'www.example.com' ]
# this is the path where the index will be stored
# caution, existing contents of this directory will be deleted!
cfg.index.path = '/home/myaccount/index'
##################################################################
# options you might want to set, the given values are the defaults
# set to true to get stack traces on errors
cfg.verbose = true
# content extraction options
cfg.content_extraction = OpenStruct.new(
# HPRICOT configuration
# this is the html parser used by default from RDig 0.3.3 upwards.
# Hpricot by far outperforms Rubyful Soup, and is at least as flexible when
# it comes to selection of portions of the html documents.
:hpricot => OpenStruct.new(
# css selector for the element containing the page title
:title_tag_selector => 'title',
# might also be a proc returning either an element or a string:
# :title_tag_selector => lambda { |hpricot_doc| ... }
:content_tag_selector => 'body'
# might also be a proc returning either an element or a string:
# :content_tag_selector => lambda { |hpricot_doc| ... }
)
# RUBYFUL SOUP
# This is a powerful, but somewhat slow, ruby-only html parsing lib which was
# RDig's default html parser up to version 0.3.2. To use it, comment the
# hpricot config above, and uncomment the following:
#
# :rubyful_soup => OpenStruct.new(
# # provide a method that returns the title of an html document
# # this method may either return a tag to extract the title from,
# # or a ready-to-index string.
# :content_tag_selector => lambda { |tagsoup|
# tagsoup.html.body
# },
# # provide a method that selects the tag containing the page content you
# # want to index. Useful to avoid indexing common elements like navigation
# # and page footers for every page.
# :title_tag_selector => lambda { |tagsoup|
# tagsoup.html.head.title
# }
# )
)
# crawler options
# Notice: for file system crawling the include/exclude_document patterns are
# applied to the full path of _files_ only (like /home/bob/test.pdf),
# for http to full URIs (like http://example.com/index.html).
# nil (include all documents) or an array of Regexps
# matching the URLs you want to index.
cfg.crawler.include_documents = nil
# nil (no documents excluded) or an array of Regexps
# matching URLs not to index.
# this filter is used after the one above, so you only need
# to exclude documents here that aren't wanted but would be
# included by the inclusion patterns.
# cfg.crawler.exclude_documents = nil
# number of document fetching threads to use. Should be raised only if
# your CPU has idle time when indexing.
# cfg.crawler.num_threads = 2
# suggested setting for file system crawling:
cfg.crawler.num_threads = 1
# maximum number of http redirections to follow
# cfg.crawler.max_redirects = 5
# number of seconds to wait with an empty url queue before
# finishing the crawl. Set to a higher number when experiencing incomplete
# crawls on slow sites. Don't set to 0, even when crawling a local fs.
cfg.crawler.wait_before_leave = 10
# indexer options
# create a new index on each run. Will append to the index if false. Use when
# building a single index from multiple runs, e.g. one across a website and the
# other a tree in a local file system
cfg.index.create = false
# rewrite document uris before indexing them. This is useful if you're
# indexing on disk, but the documents should be accessible via http, e.g. from
# a web based search application. By default, no rewriting takes place.
# example:
# cfg.index.rewrite_uri = lambda { |uri|
# uri.path.gsub!(/^\/base\//, '/virtual_dir/')
# uri.scheme = 'http'
# uri.host = 'www.mydomain.com'
# }
end
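
For reference, this is how I read the include/exclude comments in the config above, written as a plain Ruby sketch. The helper name indexable?, the sample patterns and the sample paths are made up for illustration only; this is not RDig's own code, and RDig may implement the filtering differently.

```ruby
# Hypothetical helper, NOT part of RDig: mimics the filter order described in
# the config comments (inclusion patterns first, then exclusion patterns).
# A nil pattern list means "no restriction", as in the config comments.
def indexable?(path, include_documents, exclude_documents)
  included = include_documents.nil? || include_documents.any? { |re| re =~ path }
  excluded = !exclude_documents.nil? && exclude_documents.any? { |re| re =~ path }
  included && !excluded
end

# Sample patterns (assumptions, not defaults of any tool)
include_documents = [ /\.pdf$/, /\.html?$/ ]
exclude_documents = [ /\/drafts\// ]

# For file system crawling the patterns are matched against full file paths
[
  '/home/myaccount/documents/report.pdf',
  '/home/myaccount/documents/drafts/notes.html',
  '/home/myaccount/documents/image.png'
].each do |path|
  puts "#{path}: #{indexable?(path, include_documents, exclude_documents)}"
end
```

With include_documents left at nil, as in my config, every file path would pass the inclusion step under this reading.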
--
Posted via http://www.ruby-forum.com/.