Displaying 20 results from an estimated 124 matches for "crawler".
2007 Jul 27
3
Is mechanize thread safe?
Hello all,
I was just wondering if anybody knew whether mechanize is supposed to
be thread-safe or not? I didn't really find any information about it
anywhere. I've been getting a strange error in protocol.rb when I run
a script that uses mechanize in a multi-threaded fashion, but not with
a single thread.
I'm trying to write a spider that does multiple gets in
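Mechanize has historically not been considered thread-safe, and sharing one agent across threads is the usual culprit for errors like this. A minimal sketch of the common workaround, assuming each thread gets its own Mechanize instance:

require 'mechanize'

urls = %w[http://example.com/a http://example.com/b http://example.com/c]

threads = urls.map do |url|
  Thread.new(url) do |u|
    agent = Mechanize.new            # one agent per thread; never shared
    page  = agent.get(u)
    puts "#{u}: #{page.body.length} bytes"
  end
end
threads.each(&:join)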
2011 Mar 03
6
Developing a web crawler
Hi,
I wish to develop a web crawler in R. I have been using the functionality
available in the RCurl package.
I am able to extract the HTML content of the site, but I don't know how to
go about analyzing the HTML-formatted document.
I wish to know the frequency of a word in the document. I am only acquainted
with analyzing da...
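The counting step asked about here is independent of R; as a rough illustration in Ruby (the language most threads in this digest use), assuming a crude regex tag strip is acceptable:

require 'open-uri'

html = URI.open('http://www.example.com/').read   # URI.open needs Ruby 2.5+; use open() on older Rubies
text = html.gsub(/<[^>]+>/, ' ')                   # crude tag stripping, not a real HTML parser
freq = Hash.new(0)
text.downcase.scan(/[a-z']+/) { |w| freq[w] += 1 }
puts freq['crawler']                               # frequency of a single word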
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using :post => true on a link_to() was supposed
to prevent search engine crawlers from triggering the link. However, this
does not seem to be working for me. Is there something else that I should
be/can be doing to accomplish this? Thanks.
-Matt
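The reason :post => true does not stop crawlers is that the generated link still carries an ordinary href; the POST is produced by JavaScript, which crawlers skip. A belt-and-braces sketch for Rails of that era (the verify macro existed in Rails 1.x/2.x and was removed in Rails 3):

class ItemsController < ApplicationController
  # Reject GET on destructive actions, so a crawler following the bare
  # href is redirected instead of triggering the action.
  verify :method => :post, :only => [:destroy],
         :redirect_to => { :action => :index }

  def destroy
    Item.find(params[:id]).destroy
    redirect_to :action => :index
  end
end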
2006 Jul 25
1
RDig document processing error
Hi all,
Am having problems using RDig:
With this rdig config...
cfg.crawler.start_urls = ['http://www.defensetech.org']
cfg.crawler.include_hosts = ['www.defensetech.org']
cfg.index.path = '/my/path/to/index'
cfg.verbose = true
...I get this output:
$ rdig -c config/rdig_config.rb
/usr/local/lib/site_ruby/1.8/ferret/i...
2009 Sep 13
0
regexp_crawler -- a crawler which uses regular expressions to capture data from websites
RegexpCrawler is a crawler that uses regular expressions to capture data
from websites. It is easy to use and needs little code if you are familiar
with regular expressions.
The project site is: http://github.com/flyerhzm/regexp_crawler/tree
One example: a script to synchronize your github projects except
fork projec...
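For readers unfamiliar with the approach, the general pattern the gem automates looks roughly like this; a generic sketch, not RegexpCrawler's actual API:

require 'net/http'
require 'uri'

html = Net::HTTP.get(URI('http://www.example.com/'))

# Capture data with a regular expression instead of a full HTML parser.
html.scan(%r{<a href="([^"]+)">([^<]+)</a>}) do |href, text|
  puts "#{text}: #{href}"
end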
2006 Mar 17
1
omega crawler: ht://dig or wget?
At wiki page: http://wiki.xapian.org/Omega
I added a comment that ht://Dig looks dead.
Does anybody really use it?
From a brief glance at the docs, I had the feeling it is not easy to configure.
Would GNU wget be a better crawler? It is mature, stable, and maintained.
--
Peter Masiar
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler
using the garbage collection features of LLVM.
https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ
I'm interested in any feedback, particularly on:
- Explanations that aren't clear.
- Spelling errors.
- Technical errors.
- Suggestions for ways...
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi!
RDig is a small tool to build a Ferret index for the contents of a
website or intranet. It contains a simple HTTP crawler and some support
for extracting textual content from the fetched pages.
I built this to implement a site-wide search for a recent project
that combined a Rails application with lots of static html files
generated by a CMS.
Any feedback is very welcome!
Rubyforge project page: http://rubyforge....
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet
connection and is running a crawler which downloads a lot of files to
disk. mass has a lot of disk space. I want to move the files from
speed to mass after they're done downloading. Ideally, I'd just run:
$ rsync --remove-source-files speed:/var/crawldir .
but I worry that rsync will unlink a source file that hasn't...
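One standard answer is to make the crawler write to a name rsync is told to skip, renaming only once the download is complete, so --remove-source-files never sees a half-written file. A sketch with hypothetical paths:

# Crawler side: write to a ".part" name first, then rename;
# File.rename is atomic on the same filesystem.
def save_download(data, final_path)
  tmp = final_path + '.part'
  File.open(tmp, 'wb') { |f| f.write(data) }
  File.rename(tmp, final_path)
end

# Receiving side:
#   rsync --remove-source-files --exclude='*.part' speed:/var/crawldir .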
2007 Jan 23
3
Has anyone gotten RDig to work on Linux?
...l results: 0
root@linux:~#
My config file
(I changed config to cfg, in case I had mistyped something.)
cfg.index.create = false
RDig.configuration do |cfg|
##################################################################
# options you really should set
# provide one or more URLs for the crawler to start from
cfg.crawler.start_urls = [ 'http://www.example.com/' ]
# use something like this for crawling a file system:
cfg.crawler.start_urls = [ 'file:///home/myaccount/documents/' ]
# beware, mixing file and http crawling is not possible and might...
2011 Apr 02
0
Is there an option for Rails sessions to exclude web crawlers and bots?
I'm interested in knowing whether a session is created for pages
requested by web crawlers and bots. I am using MySQL as the session
store and would like to prevent requests by web crawlers and bots from
creating unnecessary session entries.
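With lazy sessions (Rails 2.3 and later), a session row is only written when the request actually touches the session, so one hedged sketch is to guard session writes behind a user-agent check; the bot pattern here is illustrative only:

class ApplicationController < ActionController::Base
  BOT_UA = /bot|crawler|spider|slurp/i   # illustrative pattern; tune against real traffic

  def bot_request?
    request.user_agent.to_s =~ BOT_UA
  end

  # Only touch the session for human visitors; an untouched lazy
  # session is never persisted to the MySQL session store.
  def remember_last_seen
    session[:last_seen] = Time.now unless bot_request?
  end
end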
2006 Apr 03
3
Read Only Error Since 1.1?
Since I upgraded to 1.1, I am getting 'ActiveRecord::ReadOnlyRecord'
exceptions when trying to save a specific record.
I read up on ActiveRecord::Base.readonly? but I don't think the condition
there (objects pulled in from a certain JOIN type) applies.
Here's my code that is throwing the exception:
@company = session[:company]
@company.bytes_used =
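The usual escape hatch, assuming the JOIN-induced readonly flag really is the cause (the snippet is cut off) and your Rails version supports the :readonly find option, is to re-fetch the record instead of saving the copy cached in the session. compute_bytes_used is a hypothetical stand-in for the elided right-hand side:

@company = Company.find(session[:company].id, :readonly => false)
@company.bytes_used = compute_bytes_used   # hypothetical helper
@company.save!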
2010 Sep 22
3
[LLVMdev] Patch to allow llvm.gcroot to work with non-pointer allocas.
...striction on llvm.gcroot so that it can work with
non-pointer allocas. The only changes are to Verifier.cpp - it appears from
my testing that llvm.gcroot always worked fine with non-pointer allocas,
except that the verifier wouldn't allow it. I've used this patch to build an
efficient stack crawler (an alternative to shadow-stack that uses only
static constant data structures.)
Here's a deal: If you accept this patch, I'll write up an extensive tutorial
on how to write a stack crawler like mine. (Actually, it's already written,
however without this patch the tutorial doesn't...
2012 Jun 01
4
Is there an FTP crawler in Ruby on Rails?
Hi,
I'm a newbie to RoR. I wanted to write some code that can help me
list and index all the paths on a remote server. Is there an FTP
server crawler in Ruby?
Thanks,
Narayana
--
Posted via http://www.ruby-forum.com/.
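There is no FTP crawler built into Rails, but Ruby's standard library ships net/ftp, which is enough for a simple recursive path lister. A sketch using the common chdir-to-test-for-directory idiom; details like passive mode and absolute-vs-relative listings vary by server:

require 'net/ftp'

def list_paths(ftp, dir, out = [])
  ftp.chdir(dir)
  ftp.nlst.each do |name|
    path = File.join(dir, name)
    begin
      ftp.chdir(path)            # succeeds only for directories
      list_paths(ftp, path, out)
      ftp.chdir(dir)
    rescue Net::FTPPermError
      out << path                # a plain file: record its path
    end
  end
  out
end

Net::FTP.open('ftp.example.com') do |ftp|
  ftp.login                      # anonymous login
  puts list_paths(ftp, '/')
end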
2008 Jul 18
0
Web crawler - spider and Amazon Web Servces (AWS)
I need to create a web crawler, and the closest thing to a tutorial
I've found so far is this article:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182
which I think I found via RubyInside or some blog.
It uses some of the Amazon Web Services, mainly SQS, but this would be my
first time outsourcing...
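The heart of that architecture is a queue of URLs decoupling schedulers from fetchers. With today's aws-sdk-sqs gem (which postdates the article, so treat this as an assumption-laden sketch), the moving parts look like:

require 'aws-sdk-sqs'

sqs = Aws::SQS::Client.new(region: 'us-east-1')
queue_url = sqs.get_queue_url(queue_name: 'crawl-frontier').queue_url

# Scheduler side: enqueue a URL to crawl.
sqs.send_message(queue_url: queue_url, message_body: 'http://example.com/')

# Worker side: pull a batch of URLs, fetch them, then delete each message
# so it is not redelivered.
resp = sqs.receive_message(queue_url: queue_url, max_number_of_messages: 10)
resp.messages.each do |msg|
  # fetch msg.body here ...
  sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
end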
2010 Dec 31
6
HTTP Accept header wildcard breaks rails app
The Thunderstone crawler (http://search.thunderstone.com/texis/websearch/about.html)
sends the following HTTP Accept header when requesting pages:
Accept: text/*, application/javascript, application/x-javascript
This results in a "Missing template" exception.
text/* is valid. How do I tell my rails app to tre...
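One common workaround, sketched here for Rails 2/3 (before_filter became before_action in Rails 4), is to normalize the wildcard Accept header to HTML before template lookup runs:

class ApplicationController < ActionController::Base
  before_filter :normalize_wildcard_accept

  private

  # Crawlers sending "Accept: text/*" confuse template lookup;
  # treat a text/* wildcard as a plain HTML request.
  def normalize_wildcard_accept
    request.format = :html if request.headers['HTTP_ACCEPT'].to_s =~ %r{text/\*}
  end
end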
2010 Oct 02
2
[LLVMdev] Function inlining creates uninitialized stack roots
I'm still putting the final touches on my stack crawler, and I've run into a
problem having to do with function inlining and local stack roots.
As you know, all local roots must be initialized before you can make any
call to a function which might crawl the stack. My compiler ensures that all
local variables of a function are allocated, declared as...
2006 Oct 23
3
Design Dilemma - Please Help
Hi, I'm new. ;-)
I'm creating a little Rails app that will crawl the web on a regular
basis and then show the results.
The crawling will be scheduled, likely as a cron job.
I can't wrap my head around where to put my crawler. It doesn't seem
to fit.
An example:
Model - News Story
Controllers - Grab a story from the DB, sort the stories, search the
stories, etc.
View - HTML News Story, RSS Story etc.
Then I have a news crawler that will crawl some feeds for new
stories, then insert them into the db. Wh...
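A common answer: the crawler is domain logic rather than request handling, so it lives in lib/ (or as a plain class next to the models) and is driven by a cron-invoked rake task. A sketch with hypothetical names:

# lib/news_crawler.rb -- plain Ruby, independent of controllers and views.
class NewsCrawler
  def run
    fetch_entries.each do |title, url|
      Story.create!(:title => title, :url => url)   # Story is the existing model
    end
  end

  private

  def fetch_entries
    []   # stub: the real feed-fetching logic goes here
  end
end

# lib/tasks/crawl.rake -- cron then simply runs: rake crawl
task :crawl => :environment do
  NewsCrawler.new.run
end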
2008 Mar 25
0
Questions about backgroundrb
...d to incorporating
> it into my site.
>
> I had several questions regarding implementing some features on my site
> using backgroundrb. If you could help guide me in any way with any of
> these, that would be great!
>
> Background: I'm trying to write a series of web crawler tasks. This is my
> first time writing a robust web crawler.
>
> A new web crawler task is initiated whenever a user decides to track
> information from a new site. Upon initialization by the user, that web
> crawler is supposed to run using backgroundrb and then (1) save the
>...
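For context, a worker in backgroundrb of that era looked roughly like the following; a sketch based on backgroundrb 1.x conventions, with hypothetical names:

# lib/workers/crawler_worker.rb
class CrawlerWorker < BackgrounDRb::MetaWorker
  set_worker_name :crawler_worker

  def crawl_site(url)
    # fetch pages starting from url, then persist the results (step (1) in the quoted mail)
  end
end

# From a controller, kick off an asynchronous crawl:
MiddleMan.worker(:crawler_worker).async_crawl_site(:arg => 'http://example.com/')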
2010 Oct 14
1
[LLVMdev] llvm.org robots.txt prevents crawling by Google code search?
...symbol in code search, you get one of the many (possibly
> > out-of-date) mirrors, rather than the up-to-date llvm.org version. This
> is
> > sad.
> This is intentional. The workload of the server was pretty huge w/o this.
>
Could we at least add a rule allowing the codesearch crawler, rather than
opening it up to all crawlers? The user agent string is
SVN/1.5.4/GoogleCodeSearch.
>
> --
> With best regards, Anton Korobeynikov
> Faculty of Mathematics and Mechanics, Saint Petersburg State University
>
--
-- Talin
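For reference, a robots.txt that admits only that crawler while still blocking everything else would look roughly like this (hedged: robots.txt rules match on a product token, so it is worth verifying that the SVN/1.5.4/GoogleCodeSearch agent honours a GoogleCodeSearch rule):

User-agent: GoogleCodeSearch
Disallow:

User-agent: *
Disallow: /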