Displaying 20 results from an estimated 800 matches similar to: "Developing a web crawler"
2011 Mar 29
2
Scrap java scripts and styles from an html document
Hi,
I am working on developing a web crawler in R and I need some help
removing JavaScript and style sheets from the HTML document of
a web page.
I tried using the XML package and its xpathApply function:
library(XML)
txt = xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)]", xmlValue)
The output comes out as text lines, without any html
2006 Mar 17
1
omega crawler: ht://dig or wget?
At wiki page: http://wiki.xapian.org/Omega
I added a comment that ht://Dig looks dead.
Does anybody really use it?
From a brief glance at the docs, I had a feeling it is not easy to configure.
Maybe GNU wget is a better crawler? Mature, stable, maintained?
--
Peter Masiar
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler
using the garbage collection features of LLVM.
https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ
I'm interested in any feedback, particularly on:
- Explanations that aren't clear.
- Spelling errors.
- Technical errors.
- Suggestions for ways in which things could be
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi!
RDig is a small tool to build a Ferret index for the contents of a
website or intranet. It contains a simple HTTP crawler and some support
for extracting textual content from the fetched pages.
I built this to implement a site-wide search for a recent project
that combined a Rails application with lots of static html files
generated by a CMS.
Any feedback is very welcome!
Rubyforge
2012 Jun 01
4
Is there a ftp crawler in ruby on rails?
Hi,
I'm a newbie to RoR. I want to write some code that can help me
list and then index all the paths on a remote server. Is there an FTP
server crawler in Ruby?
Thanks,
Narayana
2009 Sep 13
0
regrex_crawler -- a crawler which uses regular expression to catch data from website
RegexpCrawler is a crawler that uses regular expressions to extract data
from websites. It is easy to use and needs little code if you are familiar
with regular expressions.
The project site is: http://github.com/flyerhzm/regexp_crawler/tree
Here is an example: a script to synchronize your GitHub projects,
excluding forks. Please check example/github_projects.rb
require 'rubygems'
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using the :post=>true on a link_to() was supposed
to prevent search engine crawlers from triggering the link. However, this
does not seem to be working for me. Is there something else that I should
be/can be doing to accomplish this? Thanks.
-Matt
2008 Jul 18
0
Web crawler - spider and Amazon Web Servces (AWS)
I need to create a web crawler and the closest thing to a tutorial
I've found so far is this article:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182
which I think I found via RubyInside or some blog.
It uses some of Amazon Web Services, mainly SQS, but this would be my
first time outsourcing a process to a third party and I would like to
know if someone in the
2006 Sep 21
2
Command area in SciViews 0.8.9 - second try
Dear all
I am writing again with a question I posted a few weeks ago (to no avail). I have a problem with SciViews for R. It's probably a slightly stupid question, but I cannot find a solution to a very elementary problem. I am using SciViews 0.8.9 with R 2.3.1pat on a Windows XP Home machine. R is set to SDI mode, I start R, enter "library(svGUI)", SciViews starts properly, I can
2006 Jul 25
1
RDig document processing error
Hi all,
I am having problems using RDig.
With this rdig config...
cfg.crawler.start_urls = ['http://www.defensetech.org']
cfg.crawler.include_hosts = ['www.defensetech.org']
cfg.index.path = '/my/path/to/index'
cfg.verbose = true
...I get this output:
$ rdig -c config/rdig_config.rb
/usr/local/lib/site_ruby/1.8/ferret/index/term.rb:45:
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet
connection and is running a crawler which downloads a lot of files to
disk. mass has a lot of disk space. I want to move the files from
speed to mass after they're done downloading. Ideally, I'd just run:
$ rsync --remove-source-files speed:/var/crawldir .
but I worry that rsync will unlink a source file that hasn't
2009 Jan 27
1
rJava in R 2.8.1 on Ubuntu 8.10
Hi all
I have problems installing rJava on my system.
######## My system:
> R.version # at R prompt
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 8.1
year 2008
month 12
day 22
svn rev 47281
language R
version.string R version 2.8.1
2006 Apr 03
3
Read Only Error Since 1.1?
Since I upgraded to 1.1, I am getting 'ActiveRecord::ReadOnlyRecord'
exceptions when trying to save a specific record.
I read up on ActiveRecord::Base.readonly? but I don't think the condition
there (objects pulled in from a certain JOIN type) applies.
Here's my code that is throwing the exception:
@company = session[:company]
@company.bytes_used =
2012 Sep 20
3
(no subject)
From my book on corpus linguistics with R:
# (10) Imagine you have two vectors a and b such that
a<-c("d", "d", "j", "f", "e", "g", "f", "f", "i", "g")
b<-c("a", "g", "d", "f", "g", "a", "f", "a",
2010 Sep 22
3
[LLVMdev] Patch to allow llvm.gcroot to work with non-pointer allocas.
I'm moving this thread to llvm-dev in the hopes of reaching a wider
audience.
This patch relaxes the restriction on llvm.gcroot so that it can work with
non-pointer allocas. The only changes are to Verifier.cpp - it appears from
my testing that llvm.gcroot always worked fine with non-pointer allocas,
except that the verifier wouldn't allow it. I've used this patch to build an
2011 Mar 05
1
pvclust crashing R on Ubuntu 10.10
Hi all
I am writing to you with a question regarding the pvclust package. And
yes, before the usual people produce their usual
contact-the-package-maintainers line, yes, I tried that, but the emails
one can find on the web either bounce or are not responded to. Also,
yes, this error has already been reported as a bug but been shot down
as not reproducible
2010 Dec 31
6
HTTP Accept header wildcard breaks rails app
The Thunderstone crawler
(http://search.thunderstone.com/texis/websearch/about.html) sends the
following HTTP Accept header when requesting pages:
Accept: text/*, application/javascript, application/x-javascript
This results in a "Missing template" exception.
text/* is valid. How do I tell my Rails app to treat this as rhtml by
default instead of returning a 500?
Missing template
2018 May 09
3
NAs produced by integer overflow, but only some time ...
I have a problem with integer overflow that I cannot understand.
I have a character vector curr.lemmas with the following properties:
length(curr.lemmas) # 61224
length(unique(curr.lemmas)) # 2652
That vector is the input to the following function:
yules.k1 <- function(input) {
m1 <- length(input); temp <- table(table(input))
m2 <- sum("*"(temp,
2010 Oct 02
2
[LLVMdev] Function inlining creates uninitialized stack roots
I'm still putting the final touches on my stack crawler, and I've run into a
problem having to do with function inlining and local stack roots.
As you know, all local roots must be initialized before you can make any
call to a function which might crawl the stack. My compiler ensures that all
local variables of a function are allocated, declared as root, and
initialized in the first
2006 Oct 23
3
Design Dilemma - Please Help
Hi, I'm new. ;-)
I'm creating a little Rails app that will crawl the web on a regular
basis and then show the results.
The crawling will be scheduled, likely as a cron job.
I can't wrap my head around where to put my crawler. It doesn't seem
to fit.
An example:
Model - News Story
Controllers - Grabs a story from the DB, Sort the Stories, Search the
Stories etc.
View -