Hi!

I've performed a Google Scholar Search using a query, let's say "Frank Harrell", and parsed the links to the EndNote references from the resulting HTML code. Now I'd like to download all the references automatically. For this, I have tried to use RCurl, but I can't seem to get it working: I always get error code "403 Forbidden" from the web server.

Initially I tried to do this without using cookies:

    library(RCurl)
    getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")

or

    getURLContent("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
    Error: Forbidden

and then with cookies:

    getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0",
           .opts=list(cookiejar="cookiejar.txt"))

But they both consistently fail the same way. What am I doing wrong?

    sessionInfo()
    R version 2.9.0 (2009-04-17)
    i386-pc-mingw32

    locale:
    LC_COLLATE=Finnish_Finland.1252;LC_CTYPE=Finnish_Finland.1252;LC_MONETARY=Finnish_Finland.1252;LC_NUMERIC=C;LC_TIME=Finnish_Finland.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] RCurl_0.98-1 bitops_1.0-4.1

Thanks!
Jarno
Duncan Temple Lang
2009-Sep-18 04:39 UTC
[R] RCurl and Google Scholar's EndNote references
Hi Jarno

You've only told us half the story. You didn't show how you

  i) performed the original query
  ii) retrieved the URL you used in subsequent queries

But I can suggest two possible problems.

a) Specifying the cookiejar option tells libcurl where to write the cookies that the particular curl handle has collected during its life. These are written when the curl handle is destroyed. So that wouldn't change the getURL() operation, just change what happens when the curl handle is destroyed.

b) You probably mean to use cookiefile rather than cookiejar so that the curl request would read existing cookies from a file. But in that case, how did that file get created with the correct cookies?

c) libcurl will collect cookies in a curl handle as it receives them from a server as part of a response. And it will use these in subsequent requests to that server. But you must be using the same curl handle. Different curl handles are entirely independent (unless one is copied from another).

So a possible solution may be that you need to do the initial query with the same curl handle. So I would try something like

    library(RCurl)
    library(XML)   # htmlParse() and getNodeSet() come from the XML package

    curl = getCurlHandle()
    z = getForm("http://scholar.google.com/scholar",
                q = 'Frank Harrell', hl = 'en', btnG = 'Search',
                .opts = list(verbose = TRUE), curl = curl)
    dd = htmlParse(z, asText = TRUE)   # z holds the page text, not a file name
    links = getNodeSet(dd, "//a[@href]")
    # do something to identify the link you want;
    # linkIWant is a placeholder for that URL
    tmp = getURL(linkIWant, curl = curl)

Note that we are using the same curl object in both requests.

This may not do what you want, but if you let us know the details about how you are doing the preceding steps, we should be able to sort things out.

 D.
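The points about cookiefile, cookiejar and reusing one curl handle can be pulled together in a rough sketch like the one below. It is an illustration under assumptions, not a tested recipe: pick_endnote_links() and fetch_with_session() are hypothetical helper names, the "scholar.enw" pattern is simply taken from the URL in the original question, the links are assumed to be absolute URLs, and the server may still refuse the requests for reasons unrelated to cookies.

```r
# Hypothetical helper: keep only the hrefs that look like EndNote export links.
# The "scholar.enw" pattern is an assumption based on the URL in the question.
pick_endnote_links <- function(hrefs) {
  grep("scholar.enw", hrefs, fixed = TRUE, value = TRUE)
}

# Hypothetical helper: run the search and the downloads on ONE curl handle.
# Requires the RCurl and XML packages to be installed.
fetch_with_session <- function(query, cookie_path = "cookies.txt") {
  # cookiefile = "" turns on libcurl's cookie handling without reading a file;
  # cookiejar names the file the cookies are written to when the handle is
  # destroyed.
  curl <- RCurl::getCurlHandle(cookiefile = "", cookiejar = cookie_path)

  # First request: the search itself. Cookies sent back by the server are
  # stored in the handle.
  page <- RCurl::getForm("http://scholar.google.com/scholar",
                         q = query, hl = "en", btnG = "Search",
                         curl = curl)

  # Pull every href out of the result page and filter for EndNote links.
  doc   <- XML::htmlParse(page, asText = TRUE)
  hrefs <- XML::xpathSApply(doc, "//a[@href]", XML::xmlGetAttr, "href")
  links <- pick_endnote_links(as.character(hrefs))

  # Later requests reuse the same handle, so the stored cookies go with them.
  vapply(links, function(u) RCurl::getURL(u, curl = curl), character(1))
}
```

Calling fetch_with_session("Frank Harrell") would then perform the search and the downloads on the same handle, which is the key difference from issuing each getURL() call with a fresh, cookie-less handle.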