search for: crawler

Displaying 20 results from an estimated 124 matches for "crawler".

2007 Jul 27
3
Is mechanize thread safe?
Hello all, I was just wondering if anybody knew whether mechanize is supposed to be thread-safe or not? I didn't really find any information about it anywhere. I've been getting a strange error in protocol.rb when I run a script that uses mechanize in a multi-threaded fashion, but not with a single thread. I'm trying to write a spider that does multiple gets in
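Nothing in the Mechanize docs of that era promised thread safety, so the common workaround was one agent per thread rather than a shared instance. A minimal sketch of that pattern, assuming modern Mechanize (the 2007-era class was WWW::Mechanize) and hypothetical URLs:

    require 'mechanize'

    urls = ['http://example.com/a', 'http://example.com/b']  # hypothetical

    threads = urls.map do |url|
      Thread.new(url) do |u|
        agent = Mechanize.new        # one agent per thread; never shared
        puts "#{u}: #{agent.get(u).title}"
      end
    end
    threads.each(&:join)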
2011 Mar 03
6
Developing a web crawler
Hi, I wish to develop a web crawler in R. I have been using the functionalities available under the RCurl package. I am able to extract the HTML content of the site but I don't know how to go about analyzing the HTML-formatted document. I wish to know the frequency of a word in the document. I am only acquainted with analyzing da...
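The thread is about R, so the sketch below only transliterates the technique being asked about into Ruby, the language used elsewhere in these results: strip tags, tokenize, tally. It assumes the HTML has already been fetched into a string:

    # 'html' holds the already-fetched page source (fetch step omitted)
    text = html.gsub(/<[^>]+>/, ' ')   # crude tag stripping; a real
                                       # parser is safer on messy HTML
    freq = Hash.new(0)
    text.downcase.scan(/[a-z']+/) { |word| freq[word] += 1 }

    puts freq['crawler']               # occurrences of a single word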
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using the :post=>true on a link_to() was supposed to prevent search engine crawlers from triggering the link. However, this does not seem to be working for me. Is there something else that I should be/can be doing to accomplish this? Thanks. -Matt
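For context, a hedged sketch of the usual advice: :post => true only wraps the link in JavaScript, and some crawlers still follow the underlying href. The old-Rails button_to helper renders a real POST form, which well-behaved crawlers will not submit; action and parameter names here are illustrative.

    <%# link_to with :post => true still emits an <a> tag: %>
    <%= link_to 'Delete', { :action => 'destroy', :id => item }, :post => true %>

    <%# button_to emits a <form method="post"> instead: %>
    <%= button_to 'Delete', :action => 'destroy', :id => item %>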
2006 Jul 25
1
RDig document processing error
Hi all, Am having problems using RDig: With this rdig config... cfg.crawler.start_urls = ['http://www.defensetech.org'] cfg.crawler.include_hosts = ['www.defensetech.org'] cfg.index.path = '/my/path/to/index' cfg.verbose = true ...I get this output: $ rdig -c config/rdig_config.rb /usr/local/lib/site_ruby/1.8/ferret/i...
2009 Sep 13
0
regexp_crawler -- a crawler which uses regular expressions to extract data from websites
RegexpCrawler is a crawler which uses regular expressions to extract data from websites. It is easy to use and requires less code if you are familiar with regular expressions. The project site is: http://github.com/flyerhzm/regexp_crawler/tree As an example: a script to synchronize your github projects except fork projec...
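The snippet doesn't show RegexpCrawler's actual API, so the sketch below is only a generic illustration of the technique it describes, using net/http and a made-up pattern:

    require 'net/http'
    require 'uri'

    # hypothetical target page and pattern, for illustration only
    body = Net::HTTP.get(URI.parse('http://example.com/projects'))

    body.scan(/<h2 class="title">(.*?)<\/h2>/m) do |title,|
      puts title.strip
    end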
2006 Mar 17
1
omega crawler: ht://dig or wget?
At the wiki page http://wiki.xapian.org/Omega I added a comment that ht://Dig looks dead. Does anybody really use it? From a brief glance at the docs I had a feeling it is not easy to configure. Maybe a better crawler is GNU wget? Mature, stable, maintained? -- Peter Masiar
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler using the garbage collection features of LLVM. https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ I'm interested in any feedback, particularly on: - Explanations that aren't clear. - Spelling errors. - Technical errors. - Suggestions for ways...
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi! RDig is a small tool to build a Ferret index for the contents of a website or intranet. It contains a simple HTTP crawler and some support for extracting textual content from the fetched pages. I built this to implement a site-wide search for a recent project that combined a Rails application with lots of static html files generated by a CMS. Any feedback is very welcome! Rubyforge project page: http://rubyforge....
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run: $ rsync --remove-source-files speed:/var/crawldir . but I worry that rsync will unlink a source file that hasn't...
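One hedged answer in the thread's own terms: rsync cannot tell a finished download from one still in progress, so have the crawler write to a temporary name and rename on completion (rename is atomic within one filesystem), then exclude the temporary names. The *.part suffix is an assumption about the crawler, not an rsync feature:

    $ rsync --remove-source-files --exclude='*.part' speed:/var/crawldir/ .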
2007 Jan 23
3
Someone getting RDig work for Linux?
...l results: 0 root at linux:~# In my config file I changed config to cfg, in case it had been mistyped, and set cfg.index.create = false. RDig.configuration do |cfg| ################################################################## # options you really should set # provide one or more URLs for the crawler to start from cfg.crawler.start_urls = [ 'http://www.example.com/' ] # use something like this for crawling a file system: cfg.crawler.start_urls = [ 'file:///home/myaccount/documents/' ] # beware, mixing file and http crawling is not possible and might...
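For reference, a minimal rdig_config.rb assembled only from option names visible in these snippets (start_urls, include_hosts, index.path, verbose); host and path values are placeholders:

    RDig.configuration do |cfg|
      # crawl one site over HTTP
      cfg.crawler.start_urls    = [ 'http://www.example.com/' ]
      cfg.crawler.include_hosts = [ 'www.example.com' ]

      # where the Ferret index is written
      cfg.index.path = '/my/path/to/index'

      cfg.verbose = true
    end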
2011 Apr 02
0
Is there an option for Rails sessions to exclude web crawlers and bots?
I'm interested in knowing whether a session is created by pages requested by web crawlers and bots. I am using MySQL as the session store and would like to prevent requests by web crawlers and bots from creating unnecessary session entries.
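A sketch of one era-appropriate approach, hedged because the API depends on the Rails version: Rails 2.2 and earlier had a session :off controller macro that accepts an :if condition, so session creation can be gated on the User-Agent (the bot pattern is illustrative). From Rails 2.3 on, sessions are lazy, and simply never touching the session in a request avoids creating a store row.

    class ApplicationController < ActionController::Base
      # skip session creation when the User-Agent looks like a bot
      session :off, :if => Proc.new { |request|
        request.user_agent =~ /bot|crawler|spider|slurp/i
      }
    end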
2006 Apr 03
3
Read Only Error Since 1.1?
Since I upgraded to 1.1, I am getting 'ActiveRecord::ReadOnlyRecord' exceptions when trying to save a specific record. I read up on ActiveRecord::Base.readonly? but I don't think the condition there (objects pulled in from a certain JOIN type) applies. Here's my code that is throwing the exception: @company = session[:company] @company.bytes_used =
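A hedged workaround sketch: in Rails 1.1 records fetched through certain JOINs come back flagged readonly, and an object revived from the session keeps its flags, so re-fetching a fresh instance by id before saving sidesteps the exception. new_usage and some_id are placeholders.

    # re-fetch a writable copy instead of saving the session-cached object
    company = Company.find(session[:company].id)
    company.bytes_used = new_usage   # new_usage is hypothetical
    company.save!

    # when a JOIN is the cause, the old finder API can override the flag:
    Company.find(:first, :conditions => ['id = ?', some_id], :readonly => false)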
2010 Sep 22
3
[LLVMdev] Patch to allow llvm.gcroot to work with non-pointer allocas.
...striction on llvm.gcroot so that it can work with non-pointer allocas. The only changes are to Verifier.cpp - it appears from my testing that llvm.gcroot always worked fine with non-pointer allocas, except that the verifier wouldn't allow it. I've used this patch to build an efficient stack crawler (an alternative to shadow-stack that uses only static constant data structures). Here's a deal: If you accept this patch, I'll write up an extensive tutorial on how to write a stack crawler like mine. (Actually, it's already written; however, without this patch the tutorial doesn't...
2012 Jun 01
4
Is there an FTP crawler in Ruby on Rails?
Hi, I'm a newbie to ROR. I wanted to write some code which can help me to list and then index all the paths on a remote server. Is there an FTP server crawler in Ruby? Thanks, Narayana
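There is no FTP crawler in Rails itself, but Ruby's standard library ships Net::FTP, which is enough to walk a remote tree. A recursive sketch; the host, anonymous login, and the chdir-based directory test are assumptions:

    require 'net/ftp'

    def list_paths(ftp, dir, acc = [])
      ftp.chdir(dir)
      ftp.nlst.each do |entry|
        path = File.join(dir, entry)
        begin
          ftp.chdir(path)              # succeeds only for directories
          list_paths(ftp, path, acc)
        rescue Net::FTPPermError
          acc << path                  # a plain file
        end
      end
      acc
    end

    ftp = Net::FTP.new('ftp.example.com')  # hypothetical host
    ftp.login                              # anonymous
    puts list_paths(ftp, '/')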
2008 Jul 18
0
Web crawler - spider and Amazon Web Servces (AWS)
I need to create a web crawler and the closest thing to a tutorial I've found so far is this article: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182 which I think I found via RubyInside or some blog. It uses some of Amazon Web Services, mainly SQS, but this would be my first time outsourcing...
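The architecture in that article, sketched locally: a queue decouples discovering URLs from fetching them. Ruby's thread-safe Queue stands in for SQS here, a deliberate substitution so the sketch stays self-contained; seeds and worker count are arbitrary.

    require 'open-uri'

    queue = Queue.new
    queue << 'http://example.com/'               # hypothetical seed URL

    workers = 4.times.map do
      Thread.new do
        while (url = (queue.pop(true) rescue nil))   # nil when drained
          html = URI.open(url).read
          # ...extract links here and push them back onto the queue;
          # de-duplication and politeness delays omitted...
          puts "fetched #{url} (#{html.size} bytes)"
        end
      end
    end
    workers.each(&:join)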
2010 Dec 31
6
HTTP Accept header wildcard breaks rails app
The thunderstone crawler (http://search.thunderstone.com/texis/websearch/about.html) sends the following HTTP Accept header when requesting pages: Accept: text/*, application/javascript, application/x-javascript This results in a "Missing template" exception. text/* is valid. How do I tell my Rails app to tre...
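A sketch of the usual fix, hedged for Rails-version differences: force the request format when the Accept header contains a wildcard, so template lookup never sees text/*. The regexp and filter name are illustrative.

    class ApplicationController < ActionController::Base
      before_filter :normalize_wildcard_accept

      private

      # treat wildcard Accept headers (e.g. "text/*") as plain HTML
      def normalize_wildcard_accept
        request.format = :html if request.headers['HTTP_ACCEPT'].to_s =~ %r{/\*}
      end
    end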
2010 Oct 02
2
[LLVMdev] Function inlining creates uninitialized stack roots
I'm still putting the final touches on my stack crawler, and I've run into a problem having to do with function inlining and local stack roots. As you know, all local roots must be initialized before you can make any call to a function which might crawl the stack. My compiler ensures that all local variables of a function are allocated, declared as...
2006 Oct 23
3
Design Dilemma - Please Help
Hi, I'm new. ;-) I'm creating a little Rails app that will crawl the web on a regular basis and then show the results. The crawling will be scheduled, likely a cron job. I can't wrap my head around where to put my crawler. It doesn't seem to fit. An example: Model - News Story Controllers - Grabs a story from the DB, Sort the Stories, Search the Stories etc. View - HTML News Story, RSS Story etc. Then I have a news crawler, that will go crawl some feeds for new stories, then insert them into the db. Wh...
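One conventional answer, sketched: the crawler is not a controller concern at all. Put it in lib/, write through the model so validations apply, and let cron invoke a rake task. File and method names below are illustrative:

    # lib/news_crawler.rb
    class NewsCrawler
      def run
        # fetch_feeds (not shown) would pull and parse the feeds
        fetch_feeds.each do |story_attrs|
          NewsStory.create(story_attrs)   # model from the post
        end
      end
    end

    # lib/tasks/crawl.rake -- cron runs: cd /app && rake crawl
    task :crawl => :environment do
      NewsCrawler.new.run
    end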
2008 Mar 25
0
Questions about backgroundrb
...d to incorporating > it into my site. > > I had several questions regarding implementing some features on my site > using backgroundrb. If you could help guide me in any way with any of > these, that would be great! > > Background: I'm trying to write a series of web crawler tasks. This is my > first time writing a robust web crawler. > > A new web crawler task is initiated whenever a user decides to track > information from a new site. Upon initialization by the user, that web > crawler is supposed to run using backgroundrb and then (1) save the >...
2010 Oct 14
1
[LLVMdev] llvm.org robots.txt prevents crawling by Google code search?
> > ...symbol in code search, you get one of the many (possibly out-of-date) mirrors, rather than the up-to-date llvm.org version. This is sad. > This is intentional. The workload of the server was pretty huge w/o this. Could we at least add a rule allowing the codesearch crawler, rather than opening it up to all crawlers? The user agent string is SVN/1.5.4/GoogleCodeSearch. > -- > With best regards, Anton Korobeynikov > Faculty of Mathematics and Mechanics, Saint Petersburg State University -- Talin
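What such a rule could look like, hedged: robots.txt matches User-agent tokens by substring and Allow is a widely-honored extension rather than part of the original standard, so this depends on the crawler respecting both.

    User-agent: GoogleCodeSearch
    Allow: /

    User-agent: *
    Disallow: /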