thr3ads.net - R help - [R] Extracting a website text content using R [Aug 2007]

Am Stat

2007-Aug-01 21:19 UTC

[R] Extracting a website text content using R

Dear useR,

Just wondering whether there is any function in R that could
let me get the text content of a certain website.

Thanks a lot!

Best,

Leon

Bert Gunter

2007-Aug-01 21:50 UTC

Yes, there are.

(Please see and follow the posting guide if you wish to obtain something
more specific)


Bert Gunter
Genentech Nonclinical Statistics


______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Saeed Abu Nimeh

2007-Aug-01 23:12 UTC

Work with it as text. For text mining, use:
1- http://wwwpeople.unil.ch/jean-pierre.mueller/
2- the tm package by Ingo F.

Steven McKinney

2007-Aug-02 00:53 UTC

Is this what you had in mind?
> foo <- scan(url("http://cran.r-project.org/"), what = "character")
Read 69 items
> paste(unlist(foo), collapse = " ")
[1] "<!DOCTYPE HTML PUBLIC -//IETF//DTD HTML//EN > <html>
<head> <title>The Comprehensive R Archive Network</title>
<link rel=\"icon\" href=\"favicon.ico\"
type=\"image/x-icon\"> <link rel=\"shortcut icon\"
href=\"favicon.ico\" type=\"image/x-icon\"> <link
rel=\"stylesheet\" type=\"text/css\"
href=\"R.css\"> </head> <FRAMESET cols=\"1*,
4*\" border=0> <FRAMESET rows=\"120, 1*\"> <FRAME
src=\"logo.html\" name=\"logo\" frameborder=0> <FRAME
src=\"navbar.html\" name=\"contents\" frameborder=0>
</FRAMESET> <FRAME src=\"banner.shtml\"
name=\"banner\" frameborder=0> <noframes> <h1>The
Comprehensive R Archive Network</h1> Your browser seems not to support
frames, here is the <A href=\"navbar.html\">contents
page</A> of CRAN. </noframes> </FRAMESET>"


Try the search phrase

cran scan url

in Google for more hits on info about R functions that can deal with URLs.

In R try

> apropos("URL")
 [1] "contourLines"   "URLdecode"      "URLencode"      "browseURL"      "contrib.url"    "main.help.url"  "url.show"
 [8] "loadURL"        "read.table.url" "scan.url"       "source.url"     "url"


SteveM
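
The scan() result above is still raw HTML. A minimal sketch of one way to reduce such markup to plain text with base R follows. The helper name html_to_text and the inline sample page are hypothetical, not from the thread, and regex-based tag stripping is only a rough approximation of real HTML parsing.

```r
# Hypothetical helper (not from the thread): reduce an HTML string to plain text.
# Regex-based tag stripping is a rough approximation, not a real HTML parser.
html_to_text <- function(html) {
  # Collapse a character vector of lines or tokens into one string
  txt <- paste(html, collapse = " ")
  # Drop script and style blocks, including their contents
  txt <- gsub("(?is)<(script|style)\\b.*?</\\1>", " ", txt, perl = TRUE)
  # Drop all remaining tags
  txt <- gsub("<[^>]*>", " ", txt, perl = TRUE)
  # Squeeze runs of whitespace and trim both ends
  txt <- gsub("[[:space:]]+", " ", txt)
  trimws(txt)
}

# Inline sample so the sketch runs without a network connection
page <- c("<html><head><title>Demo</title></head>",
          "<body><h1>Hello</h1> <p>Some <b>text</b> here.</p></body></html>")
html_to_text(page)
```

For a live page, something like html_to_text(readLines("http://cran.r-project.org/", warn = FALSE)) should work the same way, assuming the R build can open that URL.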

mtmorgan at fhcrc.org

2007-Aug-02 02:08 UTC

Perhaps more fun is
> library(XML)
> res = htmlTreeParse("http://www.omegahat.org/RSXML/", useInternalNodes=TRUE)
> xpathApply(res, "//h1", xmlValue)[[1]]
[1] "An XML package for the S language"

Martin
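
The same XPath approach can be tried without a network connection by parsing an HTML string. This is a small sketch, assuming the XML package is installed. The sample document and the variable names are made up for illustration.

```r
library(XML)  # assumes the XML package is installed

# Inline sample document (hypothetical) so no network access is needed
doc_text <- paste0("<html><body><h1>Title</h1>",
                   "<p>First para.</p><p>Second para.</p></body></html>")

# asText = TRUE tells the parser that the input is HTML content, not a file or URL
doc <- htmlParse(doc_text, asText = TRUE)

# Text of every <p> node, simplified to a character vector
paras <- xpathSApply(doc, "//p", xmlValue)
paras

# Text of the first <h1> node, as in the example above
heading <- xpathApply(doc, "//h1", xmlValue)[[1]]
heading
```

Changing the XPath expression (for example "//a" or "//title") selects other parts of the page. xmlValue returns the text inside each matched node with the tags removed.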

