Hi!
I've performed a Google Scholar search using a query, let's say
"Frank Harrell", and parsed the links to the EndNote references from the
resulting HTML code. Now I'd like to download all the references
automatically. For this, I have tried to use RCurl, but I can't seem to get
it working: I always get error code "403 Forbidden" from the web server.
Initially I tried to do this without using cookies:
library(RCurl)
getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
or
getURLContent("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
Error: Forbidden
and then with cookies:
getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0",
       .opts=list(cookiejar="cookiejar.txt"))
But they both consistently fail the same way. What am I doing wrong?
sessionInfo()
R version 2.9.0 (2009-04-17)
i386-pc-mingw32
locale:
LC_COLLATE=Finnish_Finland.1252;LC_CTYPE=Finnish_Finland.1252;LC_MONETARY=Finnish_Finland.1252;LC_NUMERIC=C;LC_TIME=Finnish_Finland.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_0.98-1 bitops_1.0-4.1
Thanks!
Jarno
Duncan Temple Lang
2009-Sep-18 04:39 UTC
[R] RCurl and Google Scholar's EndNote references
Hi Jarno
You've only told us half the story. You didn't show how you
i) performed the original query
ii) retrieved the URL you used in subsequent queries
But I can suggest a few possible problems.
a) specifying the cookiejar option tells libcurl where to write the
cookies that the particular curl handle has collected during its life.
These are written when the curl handle is destroyed.
So that wouldn't change the getURL() operation, just change what happens
when the curl handle is destroyed.
b) You probably mean to use cookiefile rather than cookiejar so that
the curl request would read existing cookies from a file.
But in that case, how did that file get created with the correct cookies?
c) libcurl will collect cookies in a curl handle as it receives them from a
server as part of a response. And it will use these in subsequent requests
to that server. But you must be using the same curl handle. Different curl
handles are entirely independent (unless one is copied from another).
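To make the difference between the options in (a)-(c) concrete, here is a rough sketch. It makes no request at all; it only sets up three handles. The jar path is a made-up temporary file, and the cookiefile = "" idiom relies on libcurl's documented behaviour of enabling cookie handling without reading any file, so treat it as an assumption to check against your libcurl version.

```r
library(RCurl)

# Hypothetical location for the cookie file (not from the original post)
jar <- file.path(tempdir(), "cookiejar.txt")

# (a) cookiejar: cookies collected by this handle are written to 'jar'
#     only when the handle is destroyed; the request itself is unchanged
writer <- getCurlHandle(cookiejar = jar)

# (b) cookiefile: cookies are read from an existing file before a request;
#     a missing file is ignored, so this only helps if the file
#     already holds the right cookies
reader <- getCurlHandle(cookiefile = jar)

# (c) cookiefile = "" reads nothing, but lets the handle keep cookies in
#     memory and send them back on later requests made with the same handle
session <- getCurlHandle(cookiefile = "")

# Dropping the last reference and running the garbage collector should
# destroy the handle, which is when libcurl writes the jar
rm(writer)
invisible(gc())
```

Only the third handle is needed if the search and the download happen in one R session; the jar matters when cookies have to survive between sessions.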
So a possible solution may be that you need to do the initial query with the
same curl handle.
So I would try something like
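library(RCurl)
library(XML)   # for htmlParse() and getNodeSet()
# cookiefile = "" turns on libcurl's cookie handling for this handle
curl = getCurlHandle(cookiefile = "")
# Run the search; cookies sent back by the server stay with the handle
z = getForm("http://scholar.google.com/scholar", q = 'Frank Harrell',
            hl = 'en', btnG = 'Search',
            .opts = list(verbose = TRUE), curl = curl)
dd = htmlParse(z, asText = TRUE)
links = getNodeSet(dd, "//a[@href]")
# linkIWant is a placeholder: do something to identify the link you want
tmp = getURL(linkIWant, curl = curl)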
Note that we are using the same curl object in both requests.
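For the "identify the link you want" step, one possibility is to pull out the href attributes and keep those that mention scholar.enw, the path that appears in the URL from the original question. The HTML below is a toy stand-in I made up so the example runs without touching the network; the real result page may be laid out differently, and the host prepended at the end is likewise an assumption.

```r
library(XML)

# Toy stand-in for the search result page (hypothetical markup)
html <- '<html><body>
  <a href="/scholar?q=related:abc">Related articles</a>
  <a href="/scholar.enw?q=info:abc&amp;output=citation">Import into EndNote</a>
</body></html>'
dd <- htmlParse(html, asText = TRUE)

# Collect the href attribute of every anchor that has one
hrefs <- xpathSApply(dd, "//a[@href]", xmlGetAttr, "href")

# Keep only the links that point at the EndNote export
enw <- grep("scholar.enw", hrefs, fixed = TRUE, value = TRUE)

# Relative links need a host in front before they can go to getURL()
enw_urls <- paste0("http://scholar.google.com", enw)
```

Each element of enw_urls could then be passed to getURL() together with the same curl handle that performed the search.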
This may not do what you want, but if you let us know the details
about how you are doing the preceding steps, we should be able to sort
things out.
D.
Jarno Tuimala wrote:
> [...]