thr3ads.net - Ferret talk - [Ferret-talk] Someone getting RDig work for Linux? [Jan 2007]

ngoc

2007-Jan-23 14:55 UTC

[Ferret-talk] Someone getting RDig work for Linux?

I got this

root@linux:~# rdig -c configfile
RDig version 0.3.4
using Ferret 0.10.14
added url file:///home/myaccount/documents/
waiting for threads to finish...
root@linux:~# rdig -c configfile -q "Ruby"
RDig version 0.3.4
using Ferret 0.10.14
executing query >Ruby<
Query:
total results: 0
root@linux:~#



My configfile is below. I changed config to cfg in case I had mistyped
something, and set cfg.index.create = false.

RDig.configuration do |cfg|

  ##################################################################
  # options you really should set

  # provide one or more URLs for the crawler to start from
  cfg.crawler.start_urls = [ 'http://www.example.com/' ]

  # use something like this for crawling a file system:
   cfg.crawler.start_urls = [ 'file:///home/myaccount/documents/' ]
  # beware, mixing file and http crawling is not possible and might
  # result in unpredictable results.

  # limit the crawl to these hosts. The crawler will never
  # follow any links pointing to hosts other than those given here.
  # ignored for file system crawling
  cfg.crawler.include_hosts = [ 'www.example.com' ]

  # this is the path where the index will be stored
  # caution, existing contents of this directory will be deleted!
  cfg.index.path        = '/home/myaccount/index'

  ##################################################################
  # options you might want to set, the given values are the defaults

  # set to true to get stack traces on errors
   cfg.verbose = true

  # content extraction options
  cfg.content_extraction = OpenStruct.new(

  # HPRICOT configuration
  # this is the html parser used by default from RDig 0.3.3 upwards.
  # Hpricot by far outperforms Rubyful Soup, and is at least as flexible
  # when it comes to selection of portions of the html documents.
    :hpricot      => OpenStruct.new(
      # css selector for the element containing the page title
      :title_tag_selector => 'title',
      # might also be a proc returning either an element or a string:
      # :title_tag_selector => lambda { |hpricot_doc| ... }
      :content_tag_selector => 'body'
      # might also be a proc returning either an element or a string:
      # :content_tag_selector => lambda { |hpricot_doc| ... }
    )

  # RUBYFUL SOUP
  # This is a powerful, but somewhat slow, ruby-only html parsing lib
  # which was RDig's default html parser up to version 0.3.2. To use it,
  # comment the hpricot config above, and uncomment the following:
  #
  #  :rubyful_soup => OpenStruct.new(
  #    # provide a method that returns the title of an html document
  #    # this method may either return a tag to extract the title from,
  #    # or a ready-to-index string.
  #    :content_tag_selector => lambda { |tagsoup|
  #      tagsoup.html.body
  #    },
  #    # provide a method that selects the tag containing the page
  #    # content you want to index. Useful to avoid indexing common
  #    # elements like navigation and page footers for every page.
  #    :title_tag_selector         => lambda { |tagsoup|
  #      tagsoup.html.head.title
  #    }
  #  )
  )

  # crawler options

  # Notice: for file system crawling the include/exclude_document
  # patterns are applied to the full path of _files_ only (like
  # /home/bob/test.pdf), for http to full URIs (like
  # http://example.com/index.html).

  # nil (include all documents) or an array of Regexps
  # matching the URLs you want to index.
   cfg.crawler.include_documents = nil

  # nil (no documents excluded) or an array of Regexps
  # matching URLs not to index.
  # this filter is used after the one above, so you only need
  # to exclude documents here that aren't wanted but would be
  # included by the inclusion patterns.
  # cfg.crawler.exclude_documents = nil

  # number of document fetching threads to use. Should be raised only if
  # your CPU has idle time when indexing.
  # cfg.crawler.num_threads = 2
  # suggested setting for file system crawling:
   cfg.crawler.num_threads = 1

  # maximum number of http redirections to follow
  # cfg.crawler.max_redirects = 5

  # number of seconds to wait with an empty url queue before
  # finishing the crawl. Set to a higher number when experiencing
  # incomplete crawls on slow sites. Don't set to 0, even when
  # crawling a local fs.
   cfg.crawler.wait_before_leave = 10

  # indexer options

  # create a new index on each run. Will append to the index if false.
  # Use when building a single index from multiple runs, e.g. one
  # across a website and the other a tree in a local file system
   cfg.index.create = false

  # rewrite document uris before indexing them. This is useful if you're
  # indexing on disk, but the documents should be accessible via http,
  # e.g. from a web based search application. By default, no rewriting
  # takes place.
  # example:
  # cfg.index.rewrite_uri = lambda { |uri|
  #   uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  #   uri.scheme = 'http'
  #   uri.host = 'www.mydomain.com'
  # }

end
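
The commented-out rewrite_uri example in the config can be tried on its own, outside RDig. This is only a minimal sketch: it assumes the lambda is handed a Ruby URI object, as the config comments suggest. The constant name REWRITE_URI and the sample path are made up for illustration, and whether RDig uses the lambda's return value is not confirmed here.

```ruby
require 'uri'

# Stand-alone version of the rewrite lambda from the config comments.
# REWRITE_URI is a hypothetical name used only for this illustration.
REWRITE_URI = lambda { |uri|
  # map the on-disk prefix to the directory exposed by the web server
  uri.path = uri.path.sub(%r{^/base/}, '/virtual_dir/')
  uri.scheme = 'http'
  uri.host = 'www.mydomain.com'
  uri # returned so the result can be inspected; RDig may not need this
}
```

Calling REWRITE_URI.call(URI.parse('file:///base/docs/page.html')) and printing the result shows what a rewritten document URI would look like before putting the lambda into the real config.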

-- 
Posted via http://www.ruby-forum.com/.

Jens Kraemer

2007-Jan-23 16:43 UTC

On Tue, Jan 23, 2007 at 03:55:06PM +0100, ngoc wrote:
> I got this
> 
> root@linux:~# rdig -c configfile
> RDig version 0.3.4
> using Ferret 0.10.14
> added url file:///home/myaccount/documents/
> waiting for threads to finish...
> root@linux:~# rdig -c configfile -q "Ruby"
> RDig version 0.3.4
> using Ferret 0.10.14
> executing query >Ruby<
> Query:
> total results: 0
> root@linux:~#
strange. I cut'n'pasted your config and only changed the start_urls and
index location, and it worked like a charm. What is in the documents
directory: only files, or subdirectories too? Any strange file names
(spaces and such)? There's a known bug concerning spaces in
file/directory names, maybe that's the problem?
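
One way to check a documents directory for such names before crawling is a short Ruby script. This is only a sketch and not part of RDig; the method name paths_with_whitespace is made up for illustration.

```ruby
require 'find'

# Collect every file or directory below +root+ whose own name
# contains whitespace. Hypothetical helper, not part of RDig.
def paths_with_whitespace(root)
  hits = []
  Find.find(root) do |path|
    next if path == root # the start directory itself is not checked
    hits << path if File.basename(path) =~ /\s/
  end
  hits.sort
end
```

For example, puts paths_with_whitespace('/home/myaccount/documents') would list the entries worth renaming.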

Jens

-- 
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66

ngoc

2007-Jan-23 17:48 UTC

> and such)? There's a known bug concerning spaces in file/directory
> names, maybe that's the problem?

Hi Jens,
I had stored only one file in the directory, and its name had a space in
it and no extension. I renamed it to a single word ending in .html, and
now it works.

I realise that I need to work with it more before putting it to use. It
is very Linux-oriented. Now I have to read it line by line to learn how
it works inside, which will take a long time.

Thanks Jens

ngoc

-- 
Posted via http://www.ruby-forum.com/.

Jens Kraemer

2007-Jan-24 09:07 UTC

Hi!

On Tue, Jan 23, 2007 at 06:48:03PM +0100, ngoc wrote:
> > and such)? There's a known bug concerning spaces in file/directory
> > names, maybe that's the problem?
> Hi Jens
> I stored only one file in the catalogue. And it has space in file name 
> without ending. So I correct it with connected name and ending html -> 
> It works.
ah ok. The filename ending is needed, since there is no other (easy) way
to get an idea what kind of content extractor to use. On *nix systems
the 'file' command might be of use here, but that would tie RDig even
more to Linux and friends...
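
To illustrate why a missing ending leaves nothing to go on, here is a rough sketch of choosing a content type from the file extension alone. The mapping table and the method name guess_type are made up for this example and are not RDig's actual extractor lookup.

```ruby
# Hypothetical mapping from file extension to a content type label.
# The table is an illustration only, not RDig's list of extractors.
EXTENSION_TYPES = {
  '.html' => 'text/html',
  '.htm'  => 'text/html',
  '.txt'  => 'text/plain',
  '.pdf'  => 'application/pdf'
}.freeze

# Returns a content type label for +path+, or nil when the name has no
# extension (or an unknown one) and the type cannot be guessed.
def guess_type(path)
  EXTENSION_TYPES[File.extname(path).downcase]
end
```

A file named test.pdf maps to a type, while a name without any ending gives nil, so a crawler working this way would have no extractor to pick for it.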
> I recognise that I need to work more with it before taking in use. It is
> so linux oriented. Now I have to read line by line to learn more how it 
> works inside. It will take long time.
sorry for the inconvenience, but I only rarely get to use anything other
than Linux. I'll happily apply any fixes to make RDig work on Windows,
though, and I'll fix the problem with spaces in filenames by the end of
the week.

cheers,
Jens


-- 
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
