Displaying 20 results from an estimated 124 matches for "crawlers".
2007 Jul 27
3
Is mechanize thread safe?
Hello all,
I was just wondering if anybody knew whether mechanize is supposed to
be thread-safe or not? I didn't really find any information about it
anywhere. I've been getting a strange error in protocol.rb when I run
a script that uses mechanize in a multi-threaded fashion, but not with
a single thread.
I'm trying to write a spider that does multiple gets in
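Mechanize has generally not been documented as thread-safe, so one common precaution (an assumption here, not an answer taken from this thread) is to give every thread its own agent instead of sharing a single instance. A minimal sketch with placeholder URLs; older versions expose the class as WWW::Mechanize rather than Mechanize:

require 'rubygems'
require 'mechanize'

url_groups = [
  ['http://example.com/a', 'http://example.com/b'],
  ['http://example.com/c', 'http://example.com/d']
]

threads = url_groups.map do |urls|
  Thread.new(urls) do |batch|
    agent = Mechanize.new            # per-thread agent, never shared
    batch.each do |url|
      page = agent.get(url)
      puts "#{url}: #{page.title}"
    end
  end
end
threads.each(&:join)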
2011 Mar 03
6
Developing a web crawler
Hi,
I wish to develop a web crawler in R. I have been using the functionality
available in the RCurl package.
I am able to extract the HTML content of the site, but I don't know how to go
about analyzing the HTML-formatted document.
I wish to know the frequency of a word in the document. I am only acquainted
with analyzing data sets.
So how should I go about analyzing data that is not
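The thread is about R and RCurl, whose details are not shown here; purely to illustrate the step being asked about (counting how often a word appears in fetched HTML), here is a small sketch in Ruby, the language used by most of the other threads on this page. The URL, the word, and the use of open-uri and Nokogiri are all illustrative assumptions:

require 'open-uri'
require 'nokogiri'

html = URI.open('http://example.com').read
text = Nokogiri::HTML(html).text.downcase   # strip the markup, keep the text

counts = Hash.new(0)
text.scan(/[a-z']+/) { |word| counts[word] += 1 }

puts counts['crawler']                      # frequency of one word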
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using :post => true on a link_to() was supposed
to prevent search engine crawlers from triggering the link. However, this
does not seem to be working for me. Is there something else that I should
be/can be doing to accomplish this? Thanks.
-Matt
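In Rails of that era, :post => true only wrapped the link in JavaScript that builds and submits a form; a client that does not run JavaScript (which includes most crawlers) simply follows the href as a plain GET. A hedged sketch of the usual defence, not taken from this thread, is to refuse GET on the destructive action (ItemsController and Item are placeholder names):

class ItemsController < ApplicationController
  # Reject plain GETs so a crawler following the bare href cannot
  # trigger the action.
  verify :method => :post, :only => [:destroy],
         :redirect_to => { :action => :index }

  def destroy
    Item.find(params[:id]).destroy
    redirect_to :action => :index
  end
end

Using button_to in the view has a similar effect, since it renders a real form and therefore POSTs even without JavaScript.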
2006 Jul 25
1
RDig document processing error
Hi all,
Am having problems using RDig:
With this rdig config...
cfg.crawler.start_urls = ['http://www.defensetech.org']
cfg.crawler.include_hosts = ['www.defensetech.org']
cfg.index.path = '/my/path/to/index'
cfg.verbose = true
...I get this output:
$ rdig -c config/rdig_config.rb
/usr/local/lib/site_ruby/1.8/ferret/index/term.rb:45:
2009 Sep 13
0
regexp_crawler -- a crawler which uses regular expressions to extract data from websites
RegexpCrawler is a crawler that uses regular expressions to extract data
from websites. It is easy to use and needs little code if you are familiar
with regular expressions.
The project site is: http://github.com/flyerhzm/regexp_crawler/tree
Here is an example: a script to synchronize your GitHub projects, excluding
forked projects; please check example/github_projects.rb
require 'rubygems'
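Without reproducing the gem's own API, the underlying idea (fetch a page, then pull fields out with regular-expression captures) can be sketched in plain Ruby; the URL and the pattern below are placeholders, not anything from regexp_crawler itself:

require 'net/http'
require 'uri'

html = Net::HTTP.get(URI.parse('http://example.com/projects'))

# Capture a name and a description from repeated markup.
html.scan(%r{<h3>(.*?)</h3>\s*<p class="desc">(.*?)</p>}m) do |name, desc|
  puts "#{name}: #{desc}"
end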
2006 Mar 17
1
omega crawler: ht://dig or wget?
At wiki page: http://wiki.xapian.org/Omega
I added a comment that ht://Dig looks dead.
Does anybody really use it?
From a brief glance at the docs, I had a feeling it is not easy to configure.
Maybe a better crawler would be GNU wget? It's mature, stable, and maintained.
--
Peter Masiar
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler
using the garbage collection features of LLVM.
https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ
I'm interested in any feedback, particularly on:
- Explanations that aren't clear.
- Spelling errors.
- Technical errors.
- Suggestions for ways in which things could be
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi!
RDig is a small tool to build a Ferret index for the contents of a
website or intranet. It contains a simple HTTP crawler and some support
for extracting textual content from the fetched pages.
I built this to implement a site-wide search for a recent project
that combined a Rails application with lots of static html files
generated by a CMS.
Any feedback is very welcome!
Rubyforge
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet
connection and is running a crawler which downloads a lot of files to
disk. mass has a lot of disk space. I want to move the files from
speed to mass after they're done downloading. Ideally, I'd just run:
$ rsync --remove-source-files speed:/var/crawldir .
but I worry that rsync will unlink a source file that hasn't
2007 Jan 23
3
Someone getting RDig work for Linux?
I got this
root@linux:~# rdig -c configfile
RDig version 0.3.4
using Ferret 0.10.14
added url file:///home/myaccount/documents/
waiting for threads to finish...
root@linux:~# rdig -c configfile -q "Ruby"
RDig version 0.3.4
using Ferret 0.10.14
executing query >Ruby<
Query:
total results: 0
root@linux:~#
my configfile
I changed from config to cfg, because of maybe
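The config itself is cut off above. For reference, a file-based setup that uses only the option names appearing elsewhere on this page would look roughly like the following; the paths are placeholders and anything beyond these three options is an assumption:

cfg.crawler.start_urls = ['file:///home/myaccount/documents/']
cfg.index.path = '/home/myaccount/rdig-index'
cfg.verbose = true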
2011 Apr 02
0
Is there an option for Rails sessions to exclude web crawlers and bots?
I'm interested in knowing whether a session is created by pages
requested by web crawlers and bots. I am using MySQL as the session
store and would like to prevent requests by web crawlers and bots from
creating unnecessary session entries.
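One hedged approach (an assumption, not an answer from this thread): detect bots by User-Agent in a before_filter and ask Rack to skip the session for those requests, so nothing is written to the MySQL-backed store. The pattern and the reliance on the Rack :skip session option are both assumptions:

class ApplicationController < ActionController::Base
  BOT_UA = /bot|crawler|spider|slurp|archiver/i   # crude heuristic

  before_filter :skip_session_for_bots

  private

  def skip_session_for_bots
    if request.user_agent.to_s =~ BOT_UA
      # Rack session option; assumes Rails 3.x with a Rack-based store.
      request.session_options[:skip] = true
    end
  end
end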
2006 Apr 03
3
Read Only Error Since 1.1?
Since I upgraded to 1.1, I am getting 'ActiveRecord::ReadOnlyRecord'
exceptions when trying to save a specific record.
I read up on ActiveRecord::Base.readonly? but I don't think the condition
there (objects pulled in from a certain JOIN type) applies.
Here's my code that is throwing the exception:
@company = session[:company]
@company.bytes_used =
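A common workaround in that situation (an assumption, not a confirmed diagnosis of this post) is to save a freshly loaded, writable record rather than the object stashed in the session; Company and compute_bytes_used are inferred or placeholder names:

@company = Company.find(session[:company].id)   # reload a non-readonly copy
@company.bytes_used = compute_bytes_used
@company.save

Storing only the id in the session and loading the record on each request avoids the problem entirely.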
2010 Sep 22
3
[LLVMdev] Patch to allow llvm.gcroot to work with non-pointer allocas.
I'm moving this thread to llvm-dev in the hopes of reaching a wider
audience.
This patch relaxes the restriction on llvm.gcroot so that it can work with
non-pointer allocas. The only changes are to Verifier.cpp - it appears from
my testing that llvm.gcroot always worked fine with non-pointer allocas,
except that the verifier wouldn't allow it. I've used this patch to build an
2012 Jun 01
4
Is there an FTP crawler in Ruby on Rails?
Hi,
I'm a newbie to RoR. I wanted to write some code which can help me to
list and then index all the paths on a remote server. Is there an FTP
server crawler in Ruby?
Thanks,
Narayana
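Rails itself does not ship an FTP crawler, but Ruby's standard library includes Net::FTP, which is enough for a minimal recursive path lister. The host, the anonymous login, and the "chdir failure means it is a plain file" heuristic are all assumptions:

require 'net/ftp'

def collect_paths(ftp, dir, acc = [])
  ftp.chdir(dir)
  ftp.nlst.each do |entry|
    next if entry == '.' || entry == '..'
    path = File.join(dir, entry)
    acc << path
    begin
      collect_paths(ftp, path, acc)   # recurse into subdirectories
    rescue Net::FTPPermError
      # 550: not a directory (or not permitted); treat it as a file
    end
  end
  acc
end

Net::FTP.open('ftp.example.com') do |ftp|
  ftp.login                           # anonymous login
  puts collect_paths(ftp, '/')
end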
2008 Jul 18
0
Web crawler - spider and Amazon Web Servces (AWS)
I need to create a web crawler and the closest thing to a tutorial
I've found so far is this article:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182
which I think I found via RubyInside or some blog.
It uses some Amazon Web Services, mainly SQS, but this would be my
first time outsourcing a process to a third party and I would like to
know if someone in the
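The linked article predates the current SDK; purely as a sketch of the same queue-based pattern with today's aws-sdk-sqs gem (the queue URL, region, and credentials handling are placeholders), the producer/worker split looks roughly like this:

require 'aws-sdk-sqs'
require 'open-uri'

sqs = Aws::SQS::Client.new(region: 'us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue'

# Producer: enqueue a URL to crawl.
sqs.send_message(queue_url: queue_url, message_body: 'http://example.com/')

# Worker: pull a URL, fetch it, then delete the message.
resp = sqs.receive_message(queue_url: queue_url, max_number_of_messages: 1)
resp.messages.each do |msg|
  body = URI.open(msg.body).read
  puts "fetched #{msg.body} (#{body.size} bytes)"
  sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
end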
2010 Dec 31
6
HTTP Accept header wildcard breaks rails app
The Thunderstone crawler (http://search.thunderstone.com/texis/
websearch/about.html) sends the following HTTP Accept header when
requesting pages:
Accept: text/*, application/javascript, application/x-javascript
This results in a "Missing template" exception
text/* is valid. How do I tell my rails app to treat this as rhtml by
default instead of returning a 500?
Missing template
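One hedged workaround (an assumption, not the thread's accepted answer) is to normalise wildcard Accept headers in a before_filter so the default HTML templates are used:

class ApplicationController < ActionController::Base
  before_filter :default_format_for_wildcard_accept

  private

  def default_format_for_wildcard_accept
    # Headers such as "text/*" confuse format negotiation; fall back to HTML.
    request.format = :html if request.headers['Accept'].to_s.include?('/*')
  end
end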
2010 Oct 02
2
[LLVMdev] Function inlining creates uninitialized stack roots
I'm still putting the final touches on my stack crawler, and I've run into a
problem having to do with function inlining and local stack roots.
As you know, all local roots must be initialized before you can make any
call to a function which might crawl the stack. My compiler ensures that all
local variables of a function are allocated, declared as root, and
initialized in the first
2006 Oct 23
3
Design Dilemma - Please Help
Hi, I'm new. ;-)
I'm creating a little Rails app that will crawl the web on a regular
basis and then show the results.
The crawling will be scheduled, likely as a cron job.
I can't wrap my head around where to put my crawler. It doesn't seem
to fit.
An example:
Model - News Story
Controllers - Grabs a story from the DB, Sort the Stories, Search the
Stories etc.
View -
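One common arrangement (an assumption, not the thread's conclusion) is to keep the crawler out of the MVC triad entirely: a plain Ruby class in lib/, driven by a rake task that cron invokes, writing its results through the model. StoryCrawler is an illustrative name; NewsStory matches the model mentioned above:

# lib/story_crawler.rb
class StoryCrawler
  def run
    # fetch pages, extract stories, then persist them via the model:
    # NewsStory.create!(:title => ..., :url => ...)
  end
end

# lib/tasks/crawl.rake
namespace :crawl do
  desc 'Crawl the web and store new stories'
  task :run => :environment do
    StoryCrawler.new.run
  end
end

# crontab: 0 * * * * cd /path/to/app && rake crawl:run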
2008 Mar 25
0
Questions about backgroundrb
...er and then query the data back using ask_status method of a
worker.
>
> In one of your posts, you mention:
> " When you are processing too many tasks from rails, you should use inbuilt
> thread pool, rather than firing new workers"
> ...We are planning to have 100s of web crawlers being initiated and thus
> periodically scheduled to run. I'm assuming I should use the inbuilt thread
> pool. But does this mean that the workers are running in parallel as
> threads no matter the worker type? Or that the instances of each worker are
> run in parallel for o...
2010 Oct 14
1
[LLVMdev] llvm.org robots.txt prevents crawling by Google code search?
...many (possibly
> > out-of-date) mirrors, rather than the up-to-date llvm.org version. This
> is
> > sad.
> This is intentional. The workload of the server was pretty huge w/o this.
>
Could we at least add a rule allowing the codesearch crawler, rather than
opening it up to all crawlers? The user agent string is
SVN/1.5.4/GoogleCodeSearch.
>
> --
> With best regards, Anton Korobeynikov
> Faculty of Mathematics and Mechanics, Saint Petersburg State University
>
--
-- Talin
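If the crawler matches on the GoogleCodeSearch token in that user-agent string (an assumption, not a tested configuration), the rule being asked for would look something like this in robots.txt:

# Let the Code Search crawler in, keep everything else out.
User-agent: GoogleCodeSearch
Disallow:

User-agent: *
Disallow: /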