Displaying 20 results from an estimated 124 matches for "crawlers".
2007 Jul 27
3
Is mechanize thread safe?
Hello all,
I was just wondering if anybody knew whether mechanize is supposed to
be thread-safe or not? I didn't really find any information about it
anywhere. I've been getting a strange error in protocol.rb when I run
a script that uses mechanize in a multi-threaded fashion, but not with
a single thread.
I'm trying to write a spider that does multiple gets in
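Mechanize has generally not been documented as thread-safe, so one common precaution (an assumption here, not an answer taken from this thread) is to give every thread its own agent instead of sharing a single instance. A minimal sketch with placeholder URLs; older versions expose the class as WWW::Mechanize rather than Mechanize:

require 'rubygems'
require 'mechanize'

url_groups = [
  ['http://example.com/a', 'http://example.com/b'],
  ['http://example.com/c', 'http://example.com/d']
]

threads = url_groups.map do |urls|
  Thread.new(urls) do |batch|
    agent = Mechanize.new            # per-thread agent, never shared
    batch.each do |url|
      page = agent.get(url)
      puts "#{url}: #{page.title}"
    end
  end
end
threads.each(&:join)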
2011 Mar 03
6
Developing a web crawler
Hi,
I wish to develop a web crawler in R. I have been using the functionality
available in the RCurl package.
I am able to extract the HTML content of the site, but I don't know how to go
about analyzing the HTML-formatted document.
I wish to know the frequency of a word in the document. I am only acquainted
with analyzing data sets.
So how should I go about analyzing data that is not
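The thread is about R and RCurl, whose details are not shown here; purely to illustrate the step being asked about (counting how often a word appears in fetched HTML), here is a small sketch in Ruby, the language used by most of the other threads on this page. The URL, the word, and the use of open-uri and Nokogiri are all illustrative assumptions:

require 'open-uri'
require 'nokogiri'

html = URI.open('http://example.com').read
text = Nokogiri::HTML(html).text.downcase   # strip the markup, keep the text

counts = Hash.new(0)
text.scan(/[a-z']+/) { |word| counts[word] += 1 }

puts counts['crawler']                      # frequency of one word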
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using :post => true on a link_to() was supposed
to prevent search engine crawlers from triggering the link. However, this
does not seem to be working for me. Is there something else that I should
be/can be doing to accomplish this? Thanks.
-Matt
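In Rails of that era, :post => true only wrapped the link in JavaScript that builds and submits a form; a client that does not run JavaScript (which includes most crawlers) simply follows the href as a plain GET. A hedged sketch of the usual defence, not taken from this thread, is to refuse GET on the destructive action (ItemsController and Item are placeholder names):

class ItemsController < ApplicationController
  # Reject plain GETs so a crawler following the bare href cannot
  # trigger the action.
  verify :method => :post, :only => [:destroy],
         :redirect_to => { :action => :index }

  def destroy
    Item.find(params[:id]).destroy
    redirect_to :action => :index
  end
end

Using button_to in the view has a similar effect, since it renders a real form and therefore POSTs even without JavaScript.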
2006 Jul 25
1
RDig document processing error
Hi all,
Am having problems using RDig:
With this rdig config...
cfg.crawler.start_urls = ['http://www.defensetech.org']
cfg.crawler.include_hosts = ['www.defensetech.org']
cfg.index.path = '/my/path/to/index'
cfg.verbose = true
...I get this output:
$ rdig -c config/rdig_config.rb
/usr/local/lib/site_ruby/1.8/ferret/index/term.rb:45:
2009 Sep 13
0
regexp_crawler -- a crawler which uses regular expressions to extract data from websites
RegexpCrawler is a crawler that uses regular expressions to extract data
from websites. It is easy to use and needs little code if you are familiar
with regular expressions.
The project site is: http://github.com/flyerhzm/regexp_crawler/tree
Here is an example: a script to synchronize your GitHub projects, excluding
forked projects; please check example/github_projects.rb
require 'rubygems'
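Without reproducing the gem's own API, the underlying idea (fetch a page, then pull fields out with regular-expression captures) can be sketched in plain Ruby; the URL and the pattern below are placeholders, not anything from regexp_crawler itself:

require 'net/http'
require 'uri'

html = Net::HTTP.get(URI.parse('http://example.com/projects'))

# Capture a name and a description from repeated markup.
html.scan(%r{<h3>(.*?)</h3>\s*<p class="desc">(.*?)</p>}m) do |name, desc|
  puts "#{name}: #{desc}"
end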
2006 Mar 17
1
omega crawler: ht://dig or wget?
At wiki page: http://wiki.xapian.org/Omega
I added a comment that ht://Dig looks dead.
Does anybody really use it?
From a brief glance at the docs, I had a feeling it is not easy to configure.
Maybe a better crawler would be GNU wget? It's mature, stable, and maintained.
--
Peter Masiar
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler
using the garbage collection features of LLVM.
https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ
I'm interested in any feedback, particularly on:
- Explanations that aren't clear.
- Spelling errors.
- Technical errors.
- Suggestions for ways in which things could be
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi!
RDig is a small tool to build a Ferret index for the contents of a
website or intranet. It contains a simple HTTP crawler and some support
for extracting textual content from the fetched pages.
I built this to implement a site-wide search for a recent project
that combined a Rails application with lots of static html files
generated by a CMS.
Any feedback is very welcome!
Rubyforge
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet
connection and is running a crawler which downloads a lot of files to
disk. mass has a lot of disk space. I want to move the files from
speed to mass after they're done downloading. Ideally, I'd just run:
$ rsync --remove-source-files speed:/var/crawldir .
but I worry that rsync will unlink a source file that hasn't
2007 Jan 23
3
Someone getting RDig work for Linux?
I got this
root@linux:~# rdig -c configfile
RDig version 0.3.4
using Ferret 0.10.14
added url file:///home/myaccount/documents/
waiting for threads to finish...
root@linux:~# rdig -c configfile -q "Ruby"
RDig version 0.3.4
using Ferret 0.10.14
executing query >Ruby<
Query:
total results: 0
root@linux:~#
my configfile
I changed from config to cfg, because of maybe
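The config itself is cut off above. For reference, a file-based setup that uses only the option names appearing elsewhere on this page would look roughly like the following; the paths are placeholders and anything beyond these three options is an assumption:

cfg.crawler.start_urls = ['file:///home/myaccount/documents/']
cfg.index.path = '/home/myaccount/rdig-index'
cfg.verbose = true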
2011 Apr 02
0
Is there an option for Rails sessions to exclude web crawlers and bots?
I'm interested in knowing whether a session is created by pages
requested by web crawlers and bots. I am using MySQL as the session
store and would like to prevent requests by web crawlers and bots from
creating unnecessary session entries.
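One hedged approach (an assumption, not an answer from this thread): detect bots by User-Agent in a before_filter and ask Rack to skip the session for those requests, so nothing is written to the MySQL-backed store. The pattern and the reliance on the Rack :skip session option are both assumptions:

class ApplicationController < ActionController::Base
  BOT_UA = /bot|crawler|spider|slurp|archiver/i   # crude heuristic

  before_filter :skip_session_for_bots

  private

  def skip_session_for_bots
    if request.user_agent.to_s =~ BOT_UA
      # Rack session option; assumes Rails 3.x with a Rack-based store.
      request.session_options[:skip] = true
    end
  end
end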
2006 Apr 03
3
Read Only Error Since 1.1?
Since I upgraded to 1.1, I am getting 'ActiveRecord::ReadOnlyRecord'
exceptions when trying to save a specific record.
I read up on ActiveRecord::Base.readonly? but I don't think the condition
there (objects pulled in from a certain JOIN type) applies.
Here's my code that is throwing the exception:
@company = session[:company]
@company.bytes_used =
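A common workaround in that situation (an assumption, not a confirmed diagnosis of this post) is to save a freshly loaded, writable record rather than the object stashed in the session; Company and compute_bytes_used are inferred or placeholder names:

@company = Company.find(session[:company].id)   # reload a non-readonly copy
@company.bytes_used = compute_bytes_used
@company.save

Storing only the id in the session and loading the record on each request avoids the problem entirely.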
2010 Sep 22
3
[LLVMdev] Patch to allow llvm.gcroot to work with non-pointer allocas.
I'm moving this thread to llvm-dev in the hopes of reaching a wider
audience.
This patch relaxes the restriction on llvm.gcroot so that it can work with
non-pointer allocas. The only changes are to Verifier.cpp - it appears from
my testing that llvm.gcroot always worked fine with non-pointer allocas,
except that the verifier wouldn't allow it. I've used this patch to build an
2012 Jun 01
4
Is there an FTP crawler in Ruby on Rails?
Hi,
I'm a newbie to RoR. I wanted to write some code which can help me to
list and then index all the paths on a remote server. Is there an FTP
server crawler in Ruby?
Thanks,
Narayana
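Rails itself does not ship an FTP crawler, but Ruby's standard library includes Net::FTP, which is enough for a minimal recursive path lister. The host, the anonymous login, and the "chdir failure means it is a plain file" heuristic are all assumptions:

require 'net/ftp'

def collect_paths(ftp, dir, acc = [])
  ftp.chdir(dir)
  ftp.nlst.each do |entry|
    next if entry == '.' || entry == '..'
    path = File.join(dir, entry)
    acc << path
    begin
      collect_paths(ftp, path, acc)   # recurse into subdirectories
    rescue Net::FTPPermError
      # 550: not a directory (or not permitted); treat it as a file
    end
  end
  acc
end

Net::FTP.open('ftp.example.com') do |ftp|
  ftp.login                           # anonymous login
  puts collect_paths(ftp, '/')
end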
2008 Jul 18
0
Web crawler - spider and Amazon Web Servces (AWS)
I need to create a web crawler and the closest thing to a tutorial
I've found so far is this article:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182
which I think I found via RubyInside or some blog.
It uses some Amazon Web Services, mainly SQS, but this would be my
first time outsourcing a process to a third party and I would like to
know if someone in the
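The linked article predates the current SDK; purely as a sketch of the same queue-based pattern with today's aws-sdk-sqs gem (the queue URL, region, and credentials handling are placeholders), the producer/worker split looks roughly like this:

require 'aws-sdk-sqs'
require 'open-uri'

sqs = Aws::SQS::Client.new(region: 'us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue'

# Producer: enqueue a URL to crawl.
sqs.send_message(queue_url: queue_url, message_body: 'http://example.com/')

# Worker: pull a URL, fetch it, then delete the message.
resp = sqs.receive_message(queue_url: queue_url, max_number_of_messages: 1)
resp.messages.each do |msg|
  body = URI.open(msg.body).read
  puts "fetched #{msg.body} (#{body.size} bytes)"
  sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
end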
2010 Dec 31
6
HTTP Accept header wildcard breaks rails app
The Thunderstone crawler (http://search.thunderstone.com/texis/
websearch/about.html) sends the following HTTP Accept header when
requesting pages:
Accept: text/*, application/javascript, application/x-javascript
This results in a "Missing template" exception
text/* is valid. How do I tell my rails app to treat this as rhtml by
default instead of returning a 500?
Missing template
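One hedged workaround (an assumption, not the thread's accepted answer) is to normalise wildcard Accept headers in a before_filter so the default HTML templates are used:

class ApplicationController < ActionController::Base
  before_filter :default_format_for_wildcard_accept

  private

  def default_format_for_wildcard_accept
    # Headers such as "text/*" confuse format negotiation; fall back to HTML.
    request.format = :html if request.headers['Accept'].to_s.include?('/*')
  end
end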
2010 Oct 02
2
[LLVMdev] Function inlining creates uninitialized stack roots
I'm still putting the final touches on my stack crawler, and I've run into a
problem having to do with function inlining and local stack roots.
As you know, all local roots must be initialized before you can make any
call to a function which might crawl the stack. My compiler ensures that all
local variables of a function are allocated, declared as root, and
initialized in the first
2006 Oct 23
3
Design Dilemma - Please Help
Hi, I'm new. ;-)
I'm creating a little Rails app that will crawl the web on a regular
basis and then show the results.
The crawling will be scheduled, likely as a cron job.
I can't wrap my head around where to put my crawler. It doesn't seem
to fit.
An example:
Model - News Story
Controllers - Grabs a story from the DB, Sort the Stories, Search the
Stories etc.
View -
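One common arrangement (an assumption, not the thread's conclusion) is to keep the crawler out of the MVC triad entirely: a plain Ruby class in lib/, driven by a rake task that cron invokes, writing its results through the model. StoryCrawler is an illustrative name; NewsStory matches the model mentioned above:

# lib/story_crawler.rb
class StoryCrawler
  def run
    # fetch pages, extract stories, then persist them via the model:
    # NewsStory.create!(:title => ..., :url => ...)
  end
end

# lib/tasks/crawl.rake
namespace :crawl do
  desc 'Crawl the web and store new stories'
  task :run => :environment do
    StoryCrawler.new.run
  end
end

# crontab: 0 * * * * cd /path/to/app && rake crawl:run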
2008 Mar 25
0
Questions about backgroundrb
...er and then query the data back using ask_status method of a
worker.
>
> In one of your posts, you mention:
> " When you are processing too many tasks from rails, you should use inbuilt
> thread pool, rather than firing new workers"
> ...We are planning to have 100s of web crawlers being initiated and thus
> periodically scheduled to run. I'm assuming I should use the inbuilt thread
> pool. But does this mean that the workers are running in parallel as
> threads no matter the worker type? Or that the instances of each worker are
> run in parallel for o...
2010 Oct 14
1
[LLVMdev] llvm.org robots.txt prevents crawling by Google code search?
...many (possibly
> > out-of-date) mirrors, rather than the up-to-date llvm.org version. This
> is
> > sad.
> This is intentional. The workload of the server was pretty huge w/o this.
>
Could we at least add a rule allowing the codesearch crawler, rather than
opening it up to all crawlers? The user agent string is
SVN/1.5.4/GoogleCodeSearch.
>
> --
> With best regards, Anton Korobeynikov
> Faculty of Mathematics and Mechanics, Saint Petersburg State University
>
--
-- Talin
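If the crawler matches on the GoogleCodeSearch token in that user-agent string (an assumption, not a tested configuration), the rule being asked for would look something like this in robots.txt:

# Let the Code Search crawler in, keep everything else out.
User-agent: GoogleCodeSearch
Disallow:

User-agent: *
Disallow: /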