similar to: Developing a web crawler

Displaying 20 results from an estimated 800 matches similar to: "Developing a web crawler"

2011 Mar 29
2
Scrap JavaScript and styles from an HTML document
Hi, I am working on developing a web crawler in R and I need some help removing JavaScript and style sheets from the HTML document of a web page. I tried the XML package, specifically the function xpathApply: library(XML) txt = xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)]", xmlValue) The output comes out as text lines, without any html
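For readers coming at the same problem from Ruby (the language of several crawler threads below), a hedged equivalent of the XPath idea above, using the Nokogiri gem; the gem choice and the input filename are assumptions, not the poster's setup:

# Remove <script> and <style> subtrees before extracting the body text.
require 'nokogiri'

html = File.read('page.html')          # placeholder input file
doc  = Nokogiri::HTML(html)

doc.xpath('//script | //style').each(&:remove)
text = doc.at_xpath('//body').text.strip

puts text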
2006 Mar 17
1
omega crawler: ht://dig or wget?
At the wiki page http://wiki.xapian.org/Omega I added a comment that ht://Dig looks dead. Does anybody really use it? From a brief glance at the docs I had the feeling it is not easy to configure. Maybe a better crawler is GNU wget? Mature, stable, maintained? -- Peter Masiar
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler using the garbage collection features of LLVM. https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ I'm interested in any feedback, particularly on: - Explanations that aren't clear. - Spelling errors. - Technical errors. - Suggestions for ways in which things could be
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi! RDig is a small tool to build a Ferret index for the contents of a website or intranet. It contains a simple HTTP crawler and some support for extracting textual content from the fetched pages. I built this to implement a site-wide search for a recent project that combined a Rails application with lots of static html files generated by a CMS. Any feedback is very welcome! Rubyforge
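For reference, a minimal RDig configuration sketch; the field names match those shown in the RDig thread further down this listing, while the URLs and index path here are placeholders:

RDig.configuration do |cfg|
  # Crawl one host and index it into a local Ferret index.
  cfg.crawler.start_urls    = ['http://www.example.com/']
  cfg.crawler.include_hosts = ['www.example.com']
  cfg.index.path            = '/path/to/index'
  cfg.verbose               = true
end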
2012 Jun 01
4
Is there an FTP crawler in Ruby on Rails?
Hi, I'm a newbie to Ruby on Rails. I wanted to write some code that can help me list and then index all the paths on a remote server. Is there an FTP server crawler in Ruby? Thanks, Narayana
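Rails itself has no FTP crawler, but Ruby's standard library ships Net::FTP, which is enough for a minimal recursive path lister. A sketch, with host and credentials as placeholders:

require 'net/ftp'

# Recursively collect every path under dir; a directory is detected by
# whether chdir succeeds. Minimal error handling, no symlink guard.
def list_paths(ftp, dir, paths = [])
  ftp.chdir(dir)
  ftp.nlst.each do |entry|
    path = File.join(dir, entry)
    begin
      ftp.chdir(path)              # only succeeds for directories
      list_paths(ftp, path, paths)
    rescue Net::FTPPermError
      paths << path                # a plain file: record it
    end
  end
  paths
end

Net::FTP.open('ftp.example.com') do |ftp|   # placeholder host
  ftp.login('user', 'password')
  puts list_paths(ftp, '/')
end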
2009 Sep 13
0
regexp_crawler -- a crawler which uses regular expressions to extract data from websites
RegexpCrawler is a crawler which uses regular expressions to extract data from websites. It is easy to use and needs little code if you are familiar with regular expressions. The project site is: http://github.com/flyerhzm/regexp_crawler/tree I'll give an example: a script to synchronize your GitHub projects, excluding forked projects; please check example/github_projects.rb require 'rubygems'
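The gem's exact API is best taken from the project page above; as a generic illustration of the technique (fetch a page, extract data with a regular expression), here is a plain net/http sketch, not the gem's own interface:

require 'net/http'
require 'uri'

# URL and pattern are illustrative assumptions about the page's markup.
body = Net::HTTP.get(URI.parse('http://example.com/projects'))

# Capture project names from anchor tags.
projects = body.scan(%r{<a href="/projects/([^"]+)">}).flatten
puts projects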
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using :post => true on a link_to() was supposed to prevent search engine crawlers from triggering the link. However, this does not seem to be working for me. Is there something else I should be/can be doing to accomplish this? Thanks. -Matt
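:post => true only wraps the link in JavaScript that converts the click into a POST; a crawler that ignores JavaScript still follows the plain href in the markup. A real form (button_to) plus a server-side method check closes the gap. A hedged sketch against the Rails 1.x API, controller and names hypothetical:

# In the view: button_to renders a real <form method="post">, so a
# crawler issuing plain GETs cannot trigger the action:
#   <%= button_to 'Delete', :action => 'destroy', :id => item %>

# In the controller: reject non-POST requests as a second line of defense.
class ItemsController < ApplicationController
  verify :method => :post, :only => [:destroy],
         :redirect_to => { :action => 'index' }
end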
2008 Jul 18
0
Web crawler - spider and Amazon Web Services (AWS)
I need to create a web crawler, and the closest thing to a tutorial I've found so far is this article: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182 which I think I found via RubyInside or some blog. It uses some Amazon Web Services, mainly SQS, but this would be my first time outsourcing a process to a third party, and I would like to know if someone in the
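Independent of SQS, the queue half of a crawler is small; a minimal single-process sketch with Ruby's thread-safe Queue from the standard library, where swapping in SQS would only change where push and pop go (the seed URL is a placeholder):

require 'thread'
require 'net/http'
require 'uri'

# In-memory stand-in for SQS: a thread-safe queue of URLs to fetch.
queue = Queue.new
queue << 'http://example.com/'       # seed URL (placeholder)

workers = (1..4).map do
  Thread.new do
    loop do
      url = begin
        queue.pop(true)              # non-blocking; raises when empty
      rescue ThreadError
        break                        # queue drained, worker done
      end
      body = Net::HTTP.get(URI.parse(url))
      # ... parse body, push newly discovered URLs back onto the queue ...
      puts "fetched #{url} (#{body.length} bytes)"
    end
  end
end

workers.each(&:join)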
2006 Sep 21
2
Command area in SciViews 0.8.9 - second try
Dear all, I am writing again with a question I posted a few weeks ago (to no avail). I have a problem with SciViews for R. It's probably a slightly stupid question, but I cannot find a solution to a very elementary problem. I am using SciViews 0.8.9 with R 2.3.1pat on a Windows XP Home machine. R is set to SDI mode; I start R, enter "library(svGUI)", SciViews starts properly, I can
2006 Jul 25
1
RDig document processing error
Hi all, I am having problems using RDig. With this rdig config... cfg.crawler.start_urls = ['http://www.defensetech.org'] cfg.crawler.include_hosts = ['www.defensetech.org'] cfg.index.path = '/my/path/to/index' cfg.verbose = true ...I get this output: $ rdig -c config/rdig_config.rb /usr/local/lib/site_ruby/1.8/ferret/index/term.rb:45:
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run: $ rsync --remove-source-files speed:/var/crawldir . but I worry that rsync will unlink a source file that hasn't
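A common workaround, assuming the crawler can be modified: download into a scratch directory on the same filesystem and rename into /var/crawldir only once the file is complete. rename(2) is atomic, so rsync --remove-source-files never sees a partial file. A sketch of the crawler side (in Ruby here purely to match the other sketches in this listing; paths are placeholders):

require 'fileutils'

TMP   = '/var/crawldir.tmp'   # must be on the same filesystem as FINAL
FINAL = '/var/crawldir'       # the directory rsync watches
FileUtils.mkdir_p([TMP, FINAL])

# Write the whole file to the scratch dir, then rename into place.
def store(name, data)
  tmp_path = File.join(TMP, name)
  File.open(tmp_path, 'wb') { |f| f.write(data) }
  File.rename(tmp_path, File.join(FINAL, name))   # atomic on same fs
end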
2009 Jan 27
1
rJava in R 2.8.1 on Ubuntu 8.10
Hi all, I have problems installing rJava on my system. ######## My system: > R.version # at R prompt
platform        x86_64-pc-linux-gnu
arch            x86_64
os              linux-gnu
system          x86_64, linux-gnu
status
major           2
minor           8.1
year            2008
month           12
day             22
svn rev         47281
language        R
version.string  R version 2.8.1
2006 Apr 03
3
Read Only Error Since 1.1?
Since I upgraded to 1.1, I am getting 'ActiveRecord::ReadOnlyRecord' exceptions when trying to save a specific record. I read up on ActiveRecord::Base.readonly? but I don't think the condition there (objects pulled in from a certain JOIN type) applies. Here's the code that is throwing the exception: @company = session[:company] @company.bytes_used =
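Objects marshalled out of the session can carry a read-only flag picked up from whatever query originally loaded them. A common workaround is to keep only the id in the session and re-find the record before saving; the names below are placeholders, not the poster's code:

# Re-finding yields a fresh object with no read-only flag.
@company = Company.find(session[:company_id])
@company.bytes_used = computed_bytes   # stand-in for the real calculation
@company.save!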
2012 Sep 20
3
(no subject)
From my book on corpus linguistics with R: # (10) Imagine you have two vectors a and b such that a <- c("d", "d", "j", "f", "e", "g", "f", "f", "i", "g") b <- c("a", "g", "d", "f", "g", "a", "f", "a",
2010 Sep 22
3
[LLVMdev] Patch to allow llvm.gcroot to work with non-pointer allocas.
I'm moving this thread to llvm-dev in the hopes of reaching a wider audience. This patch relaxes the restriction on llvm.gcroot so that it can work with non-pointer allocas. The only changes are to Verifier.cpp - it appears from my testing that llvm.gcroot always worked fine with non-pointer allocas, except that the verifier wouldn't allow it. I've used this patch to build an
2011 Mar 05
1
pvclust crashing R on Ubuntu 10.10
Hi all, I am writing to you with a question regarding the pvclust package. And yes, before the usual people produce their usual contact-the-package-maintainers line: yes, I tried that, but the emails one can find on the web either bounce or go unanswered. Also, yes, this error has already been reported as a bug but was shot down as not reproducible
2010 Dec 31
6
HTTP Accept header wildcard breaks rails app
The Thunderstone crawler (http://search.thunderstone.com/texis/websearch/about.html) sends the following HTTP Accept header when requesting pages: Accept: text/*, application/javascript, application/x-javascript This results in a "Missing template" exception, yet text/* is valid. How do I tell my Rails app to treat this as rhtml by default instead of returning a 500? Missing template
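A common Rails-side defense is to normalize unmappable Accept headers before the renderer sees them. A hedged sketch against the Rails 2-era API:

class ApplicationController < ActionController::Base
  before_filter :normalize_accept_header

  private

  # Wildcard Accept values such as "text/*" fall back to plain HTML so
  # the renderer can always resolve a template.
  def normalize_accept_header
    request.format = :html if request.env['HTTP_ACCEPT'].to_s.include?('text/*')
  end
end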
2018 May 09
3
NAs produced by integer overflow, but only some time ...
I have a problem with integer overflow that I cannot understand. I have a character vector curr.lemmas with the following properties: length(curr.lemmas) # 61224 length(unique(curr.lemmas)) # 2652 That vector is the input to the following function: yules.k1 <- function(input) { m1 <- length(input); temp <- table(table(input)) m2 <- sum("*"(temp,
2010 Oct 02
2
[LLVMdev] Function inlining creates uninitialized stack roots
I'm still putting the final touches on my stack crawler, and I've run into a problem having to do with function inlining and local stack roots. As you know, all local roots must be initialized before you can make any call to a function which might crawl the stack. My compiler ensures that all local variables of a function are allocated, declared as root, and initialized in the first
2006 Oct 23
3
Design Dilemma - Please Help
Hi, I''m new. ;-) I creating a little rails app, that will crawl the web on a regular basis and then show the results. The crawling will be scheduled, likely a cron job. I can''t wrap my head around where to put my crawler. It doesn''t seem to fit. An example: Model - News Story Controllers - Grabs a story from the DB, Sort the Stories, Search the Stories etc. View -