Tony Breyal
2008-Oct-01 16:17 UTC
[R] changing 'https' to 'http' when using download.file(), any side effects or just use RCurl?
Dear R-Help,

From reading the help file, it is my understanding that the download.file() function does not support HTTPS connections. So, understandably, the following produces an error:

### R Code
> url <- "https://stat.ethz.ch/pipermail/r-help/2008-October/thread.html"
> destfile <- "//PFO-SBS001/Redirected/tonyb/Desktop/R_web_test/tmp.txt"
> download.file(url, destfile)
Error in download.file(url, destfile) : unsupported URL scheme

My question is: what if I remove the 's' from the 'https' URL? The download.file() function then seems to work fine (please see below). Did I just get lucky with the URL I used, or can I in general simply rewrite 'https' as 'http'? My long-term goal is to download hundreds of web pages and then somehow remove all of the HTML tags so that only the web page text remains. No private information is being sent or received for this task (no passwords etc. are used).

### R Code
> url <- "http://stat.ethz.ch/pipermail/r-help/2008-October/thread.html"
> destfile <- "//PFO-SBS001/Redirected/tonyb/Desktop/R_web_test/tmp.txt"
> download.file(url, destfile)
trying URL 'http://stat.ethz.ch/pipermail/r-help/2008-October/thread.html'
Content type 'text/html; charset=ISO-8859-1' length 13767 bytes (13 Kb)
opened URL
downloaded 13 Kb

A quick forum search shows that a package called RCurl (Omegahat Repository) does support HTTPS connections, but I got an error when using it and have no idea where the Omegahat mailing list is, which is why I'd like to know about removing the 's' in 'https'. If it turns out there is a good reason not to remove the 's', then I will repost on. God, I hope this post makes sense lol.

Many thanks for your valuable time,
Tony Breyal

PS. This is my first posting, so please be kind! :-)
PPS. Sorry this post was so long.
PPPS. For anyone interested, this is what happens when using RCurl:

### R Code
> library(RCurl)
> txt = getURL("https://stat.ethz.ch/pipermail/r-help/2008-October/thread.html")
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

OS: Windows Vista Ultimate
R version: 2.7.2 (2008-08-25)
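The "certificate verify failed" error above usually means libcurl could not find a CA bundle to check the server's certificate against. What follows is only a sketch of one common workaround, not a confirmed fix for this particular setup: it passes a CA bundle to getURL() through the cainfo option. The helper name fetch_https is made up for illustration, and the sketch assumes the installed RCurl ships a bundle at CurlSSL/cacert.pem.

```r
# Sketch only: fetch an HTTPS page with RCurl, telling libcurl which CA
# bundle to use when verifying the server certificate.
# Assumptions: RCurl is installed and ships a bundle at CurlSSL/cacert.pem;
# fetch_https is a hypothetical helper name, not part of RCurl.
fetch_https <- function(url,
                        cainfo = system.file("CurlSSL", "cacert.pem",
                                             package = "RCurl")) {
  if (!requireNamespace("RCurl", quietly = TRUE)) {
    stop("the RCurl package is required for this sketch")
  }
  # system.file() returns "" when the file (or the package) is not found.
  if (!nzchar(cainfo) || !file.exists(cainfo)) {
    stop("no CA bundle found; pass the path to a PEM file via 'cainfo'")
  }
  # Extra named arguments to getURL() are treated as curl options.
  RCurl::getURL(url, cainfo = cainfo)
}

# Usage (needs network access):
# txt <- fetch_https("https://stat.ethz.ch/pipermail/r-help/2008-October/thread.html")
```

Keeping certificate verification on and supplying a bundle is generally preferable to turning verification off, even when no private data is involved.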
Tony Breyal
2008-Oct-01 16:29 UTC
[R] changing 'https' to 'http' when using download.file(), any side effects or just use RCurl?
Dear R-help,

I have just been informed that I must not rewrite 'https' as 'http', as some web pages may not download (I think I just got lucky with the ones I've tried thus far). Therefore I would ask if some kind individual could let me know where I can post questions about the RCurl package (Omegahat Repository), because I honestly can't find the mailing list for it. I also ask forgiveness for my earlier post :-)

Many thanks,
Tony
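For the longer-term goal mentioned in the first post (removing all HTML tags so that only the page text remains), a rough sketch in base R might look like the following. The function name strip_html is hypothetical, and the approach is a crude regular-expression pass rather than a real HTML parser: it does not decode entities such as &amp; and can be fooled by unusual markup.

```r
# Sketch only: crude regex-based tag removal, not a real HTML parser.
# strip_html is a hypothetical helper name introduced for illustration.
strip_html <- function(html) {
  # Accept the character vector returned by readLines() as well as a
  # single string.
  html <- paste(html, collapse = "\n")
  # Drop <script> and <style> elements together with their contents.
  html <- gsub("(?is)<(script|style)\\b.*?</\\1\\s*>", " ", html, perl = TRUE)
  # Replace every remaining tag with a space.
  text <- gsub("<[^>]*>", " ", html, perl = TRUE)
  # Collapse runs of whitespace and trim both ends.
  text <- gsub("\\s+", " ", text, perl = TRUE)
  trimws(text)
}

# Usage, assuming 'destfile' is a page already saved by download.file():
# page_text <- strip_html(readLines(destfile, warn = FALSE))
```

For hundreds of pages, the same function could be applied to each downloaded file in turn, for example with lapply() over a vector of file paths.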