search for: crawler

Displaying 20 results from an estimated 124 matches for "crawler".

2007 Jul 27
3
Is mechanize thread safe?
Hello all, I was just wondering if anybody knew whether mechanize is supposed to be thread-safe or not? I didn't really find any information about it anywhere. I've been getting a strange error in protocol.rb when I run a script that uses mechanize in a multi-threaded fashion, but not with a single thread. I'm trying to write a spider that does multiple gets in
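Nothing in the Mechanize docs of that era promised thread safety, so the common workaround was one agent per thread rather than a shared instance. A minimal sketch of that pattern, assuming modern Mechanize (the 2007-era class was WWW::Mechanize) and hypothetical URLs:

    require 'mechanize'

    urls = ['http://example.com/a', 'http://example.com/b']  # hypothetical

    threads = urls.map do |url|
      Thread.new(url) do |u|
        agent = Mechanize.new        # one agent per thread; never shared
        puts "#{u}: #{agent.get(u).title}"
      end
    end
    threads.each(&:join)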
2011 Mar 03
6
Developing a web crawler
Hi, I wish to develop a web crawler in R. I have been using the functionalities available under the RCurl package. I am able to extract the HTML content of the site but I don't know how to go about analyzing the HTML-formatted document. I wish to know the frequency of a word in the document. I am only acquainted with analyzing da...
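The thread is about R, so the sketch below only transliterates the technique being asked about into Ruby, the language used elsewhere in these results: strip tags, tokenize, tally. It assumes the HTML has already been fetched into a string:

    # 'html' holds the already-fetched page source (fetch step omitted)
    text = html.gsub(/<[^>]+>/, ' ')   # crude tag stripping; a real
                                       # parser is safer on messy HTML
    freq = Hash.new(0)
    text.downcase.scan(/[a-z']+/) { |word| freq[word] += 1 }

    puts freq['crawler']               # occurrences of a single word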
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using the :post=>true on a link_to() was supposed to prevent search engine crawlers from triggering the link. However, this does not seem to be working for me. Is there something else that I should be/can be doing to accomplish this? Thanks. -Matt
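For context, a hedged sketch of the usual advice: :post => true only wraps the link in JavaScript, and some crawlers still follow the underlying href. The old-Rails button_to helper renders a real POST form, which well-behaved crawlers will not submit; action and parameter names here are illustrative.

    <%# link_to with :post => true still emits an <a> tag: %>
    <%= link_to 'Delete', { :action => 'destroy', :id => item }, :post => true %>

    <%# button_to emits a <form method="post"> instead: %>
    <%= button_to 'Delete', :action => 'destroy', :id => item %>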
2006 Jul 25
1
RDig document processing error
Hi all, Am having problems using RDig: With this rdig config... cfg.crawler.start_urls = ['http://www.defensetech.org'] cfg.crawler.include_hosts = ['www.defensetech.org'] cfg.index.path = '/my/path/to/index' cfg.verbose = true ...I get this output: $ rdig -c config/rdig_config.rb /usr/local/lib/site_ruby/1.8/ferret/i...
2009 Sep 13
0
regexp_crawler -- a crawler which uses regular expressions to extract data from websites
RegexpCrawler is a crawler which uses regular expressions to extract data from websites. It is easy to use and requires less code if you are familiar with regular expressions. The project site is: http://github.com/flyerhzm/regexp_crawler/tree As an example: a script to synchronize your github projects except fork projec...
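The snippet doesn't show RegexpCrawler's actual API, so the sketch below is only a generic illustration of the technique it describes, using net/http and a made-up pattern:

    require 'net/http'
    require 'uri'

    # hypothetical target page and pattern, for illustration only
    body = Net::HTTP.get(URI.parse('http://example.com/projects'))

    body.scan(/<h2 class="title">(.*?)<\/h2>/m) do |title,|
      puts title.strip
    end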
2006 Mar 17
1
omega crawler: ht://dig or wget?
At the wiki page http://wiki.xapian.org/Omega I added a comment that ht://Dig looks dead. Does anybody really use it? From a brief glance at the docs I had a feeling it is not easy to configure. Maybe a better crawler is GNU wget? Mature, stable, maintained? -- Peter Masiar
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler using the garbage collection features of LLVM. https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ I'm interested in any feedback, particularly on: - Explanations that aren't clear. - Spelling errors. - Technical errors. - Suggestions for ways...
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi! RDig is a small tool to build a Ferret index for the contents of a website or intranet. It contains a simple HTTP crawler and some support for extracting textual content from the fetched pages. I built this to implement a site-wide search for a recent project that combined a Rails application with lots of static html files generated by a CMS. Any feedback is very welcome! Rubyforge project page: http://rubyforge....
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run: $ rsync --remove-source-files speed:/var/crawldir . but I worry that rsync will unlink a source file that hasn't...
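One hedged answer in the thread's own terms: rsync cannot tell a finished download from one still in progress, so have the crawler write to a temporary name and rename on completion (rename is atomic within one filesystem), then exclude the temporary names. The *.part suffix is an assumption about the crawler, not an rsync feature:

    $ rsync --remove-source-files --exclude='*.part' speed:/var/crawldir/ .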
2007 Jan 23
3
Someone getting RDig work for Linux?
...l results: 0 root at linux:~# In my config file I changed config to cfg, in case it had been mistyped, and set cfg.index.create = false. RDig.configuration do |cfg| ################################################################## # options you really should set # provide one or more URLs for the crawler to start from cfg.crawler.start_urls = [ 'http://www.example.com/' ] # use something like this for crawling a file system: cfg.crawler.start_urls = [ 'file:///home/myaccount/documents/' ] # beware, mixing file and http crawling is not possible and might...
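For reference, a minimal rdig_config.rb assembled only from option names visible in these snippets (start_urls, include_hosts, index.path, verbose); host and path values are placeholders:

    RDig.configuration do |cfg|
      # crawl one site over HTTP
      cfg.crawler.start_urls    = [ 'http://www.example.com/' ]
      cfg.crawler.include_hosts = [ 'www.example.com' ]

      # where the Ferret index is written
      cfg.index.path = '/my/path/to/index'

      cfg.verbose = true
    end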
2011 Apr 02
0
Is there an option for Rails sessions to exclude web crawlers and bots?
I'm interested in knowing whether a session is created by pages requested by web crawlers and bots. I am using MySQL as the session store and would like to prevent requests by web crawlers and bots from creating unnecessary session entries.
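A sketch of one era-appropriate approach, hedged because the API depends on the Rails version: Rails 2.2 and earlier had a session :off controller macro that accepts an :if condition, so session creation can be gated on the User-Agent (the bot pattern is illustrative). From Rails 2.3 on, sessions are lazy, and simply never touching the session in a request avoids creating a store row.

    class ApplicationController < ActionController::Base
      # skip session creation when the User-Agent looks like a bot
      session :off, :if => Proc.new { |request|
        request.user_agent =~ /bot|crawler|spider|slurp/i
      }
    end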
2006 Apr 03
3
Read Only Error Since 1.1?
Since I upgraded to 1.1, I am getting 'ActiveRecord::ReadOnlyRecord' exceptions when trying to save a specific record. I read up on ActiveRecord::Base.readonly? but I don't think the condition there (objects pulled in from a certain JOIN type) applies. Here's my code that is throwing the exception: @company = session[:company] @company.bytes_used =
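A hedged workaround sketch: in Rails 1.1 records fetched through certain JOINs come back flagged readonly, and an object revived from the session keeps its flags, so re-fetching a fresh instance by id before saving sidesteps the exception. new_usage and some_id are placeholders.

    # re-fetch a writable copy instead of saving the session-cached object
    company = Company.find(session[:company].id)
    company.bytes_used = new_usage   # new_usage is hypothetical
    company.save!

    # when a JOIN is the cause, the old finder API can override the flag:
    Company.find(:first, :conditions => ['id = ?', some_id], :readonly => false)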
2010 Sep 22
3
[LLVMdev] Patch to allow llvm.gcroot to work with non-pointer allocas.
...striction on llvm.gcroot so that it can work with non-pointer allocas. The only changes are to Verifier.cpp - it appears from my testing that llvm.gcroot always worked fine with non-pointer allocas, except that the verifier wouldn't allow it. I've used this patch to build an efficient stack crawler (an alternative to shadow-stack that uses only static constant data structures). Here's a deal: If you accept this patch, I'll write up an extensive tutorial on how to write a stack crawler like mine. (Actually, it's already written; however, without this patch the tutorial doesn't...
2012 Jun 01
4
Is there an FTP crawler in Ruby on Rails?
Hi, I'm a newbie to ROR. I wanted to write some code which can help me to list and then index all the paths on a remote server. Is there an FTP server crawler in Ruby? Thanks, Narayana
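There is no FTP crawler in Rails itself, but Ruby's standard library ships Net::FTP, which is enough to walk a remote tree. A recursive sketch; the host, anonymous login, and the chdir-based directory test are assumptions:

    require 'net/ftp'

    def list_paths(ftp, dir, acc = [])
      ftp.chdir(dir)
      ftp.nlst.each do |entry|
        path = File.join(dir, entry)
        begin
          ftp.chdir(path)              # succeeds only for directories
          list_paths(ftp, path, acc)
        rescue Net::FTPPermError
          acc << path                  # a plain file
        end
      end
      acc
    end

    ftp = Net::FTP.new('ftp.example.com')  # hypothetical host
    ftp.login                              # anonymous
    puts list_paths(ftp, '/')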
2008 Jul 18
0
Web crawler - spider and Amazon Web Servces (AWS)
I need to create a web crawler and the closest thing to a tutorial I've found so far is this article: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182 which I think I found via RubyInside or some blog. It uses some of Amazon Web Services, mainly SQS, but this would be my first time outsourcing...
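The architecture in that article, sketched locally: a queue decouples discovering URLs from fetching them. Ruby's thread-safe Queue stands in for SQS here, a deliberate substitution so the sketch stays self-contained; seeds and worker count are arbitrary.

    require 'open-uri'

    queue = Queue.new
    queue << 'http://example.com/'               # hypothetical seed URL

    workers = 4.times.map do
      Thread.new do
        while (url = (queue.pop(true) rescue nil))   # nil when drained
          html = URI.open(url).read
          # ...extract links here and push them back onto the queue;
          # de-duplication and politeness delays omitted...
          puts "fetched #{url} (#{html.size} bytes)"
        end
      end
    end
    workers.each(&:join)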
2010 Dec 31
6
HTTP Accept header wildcard breaks rails app
The thunderstone crawler (http://search.thunderstone.com/texis/websearch/about.html) sends the following HTTP Accept header when requesting pages: Accept: text/*, application/javascript, application/x-javascript This results in a "Missing template" exception. text/* is valid. How do I tell my Rails app to tre...
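A sketch of the usual fix, hedged for Rails-version differences: force the request format when the Accept header contains a wildcard, so template lookup never sees text/*. The regexp and filter name are illustrative.

    class ApplicationController < ActionController::Base
      before_filter :normalize_wildcard_accept

      private

      # treat wildcard Accept headers (e.g. "text/*") as plain HTML
      def normalize_wildcard_accept
        request.format = :html if request.headers['HTTP_ACCEPT'].to_s =~ %r{/\*}
      end
    end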
2010 Oct 02
2
[LLVMdev] Function inlining creates uninitialized stack roots
I'm still putting the final touches on my stack crawler, and I've run into a problem having to do with function inlining and local stack roots. As you know, all local roots must be initialized before you can make any call to a function which might crawl the stack. My compiler ensures that all local variables of a function are allocated, declared as...
2006 Oct 23
3
Design Dilemma - Please Help
Hi, I'm new. ;-) I'm creating a little Rails app that will crawl the web on a regular basis and then show the results. The crawling will be scheduled, likely a cron job. I can't wrap my head around where to put my crawler. It doesn't seem to fit. An example: Model - News Story Controllers - Grabs a story from the DB, Sort the Stories, Search the Stories etc. View - HTML News Story, RSS Story etc. Then I have a news crawler, that will go crawl some feeds for new stories, then insert them into the db. Wh...
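One conventional answer, sketched: the crawler is not a controller concern at all. Put it in lib/, write through the model so validations apply, and let cron invoke a rake task. File and method names below are illustrative:

    # lib/news_crawler.rb
    class NewsCrawler
      def run
        # fetch_feeds (not shown) would pull and parse the feeds
        fetch_feeds.each do |story_attrs|
          NewsStory.create(story_attrs)   # model from the post
        end
      end
    end

    # lib/tasks/crawl.rake -- cron runs: cd /app && rake crawl
    task :crawl => :environment do
      NewsCrawler.new.run
    end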
2008 Mar 25
0
Questions about backgroundrb
...d to incorporating > it into my site. > > I had several questions regarding implementing some features on my site > using backgroundrb. If you could help guide me in any way with any of > these, that would be great! > > Background: I'm trying to write a series of web crawler tasks. This is my > first time writing a robust web crawler. > > A new web crawler task is initiated whenever a user decides to track > information from a new site. Upon initialization by the user, that web > crawler is supposed to run using backgroundrb and then (1) save the >...
2010 Oct 14
1
[LLVMdev] llvm.org robots.txt prevents crawling by Google code search?
> > ...symbol in code search, you get one of the many (possibly out-of-date) mirrors, rather than the up-to-date llvm.org version. This is sad. > This is intentional. The workload of the server was pretty huge w/o this. Could we at least add a rule allowing the codesearch crawler, rather than opening it up to all crawlers? The user agent string is SVN/1.5.4/GoogleCodeSearch. > -- > With best regards, Anton Korobeynikov > Faculty of Mathematics and Mechanics, Saint Petersburg State University -- Talin
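What such a rule could look like, hedged: robots.txt matches User-agent tokens by substring and Allow is a widely-honored extension rather than part of the original standard, so this depends on the crawler respecting both.

    User-agent: GoogleCodeSearch
    Allow: /

    User-agent: *
    Disallow: /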