thr3ads.net - R help - [R] Opening or activating a URL to access data, alternative to browseURL [Oct 2016]

Bob Rudis

2016-Sep-29 21:09 UTC

[R] Opening or activating a URL to access data, alternative to browseURL

The rvest/httr/curl trio can do the cookie management pretty well. Make the
initial connection via rvest::html_session(); you should then be able to use
other rvest function calls, and curl and httr calls will use the cached
in-memory handle info seamlessly. You'd need to store and retrieve cookies
yourself if you need them preserved between R sessions.
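
A minimal sketch of the in-memory handle idea, assuming the site only needs a cookie (or simply a prior visit) before it serves the data file. It uses the curl package directly; `fetch_with_handle` is a hypothetical helper name, not something from rvest, httr, or curl, and whether this particular site relies on cookies at all is not confirmed anywhere in this thread.

```r
# Hypothetical helper: request `trigger_url` first, then `data_url`,
# reusing one curl handle so that any cookies set by the first response
# are sent along with the second request.
fetch_with_handle <- function(trigger_url, data_url) {
  h <- curl::new_handle()

  # First request: "activates" the page, standing in for browseURL().
  first <- curl::curl_fetch_memory(trigger_url, handle = h)
  if (first$status_code >= 400) {
    stop("trigger request failed with status ", first$status_code)
  }

  # Second request: same handle, so the cookie state is carried over.
  second <- curl::curl_fetch_memory(data_url, handle = h)
  if (second$status_code >= 400) {
    stop("data request failed with status ", second$status_code)
  }

  # Parse the tab-delimited body into a data frame.
  read.delim(text = rawToChar(second$content))
}
```

Inside the helper, `curl::handle_cookies(h)` could be called after the first request to inspect which cookies, if any, the server actually set. That is one way to find out whether cookie handling is needed here in the first place.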

Failing the above and assuming this would not need to be lightning fast,
use the phantomjs or firefox web driver (either with RSelenium or some new
stuff rOpenSci is cooking up) which will then do what browsers do best and
maintain all this state for you. You can still slurp the page contents up
with xml2::read_html() and use the super handy processing idioms in the
scraping tidyverse (it needs its own name).

A concrete example (assuming the URLs aren't sensitive) would enable me or
someone else to mock up something for you.


On Thu, Sep 29, 2016 at 4:59 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 29/09/2016 3:29 PM, Ryan Utz wrote:
>
>> Hi all,
>>
>> I've got a situation that involves activating a URL so that a link to some
>> data becomes available for download. I can easily use 'browseURL' to do so,
>> but I'm hoping to make this batch-process-able, and I would prefer to not
>> have 100s of browser windows open when I go to download multiple data sets.
>>
>> Here's the example:
>>
>> #1
>> browseURL('http://pick18.discoverlife.org/mp/20m?plot=2&kind=Hypoprepia+fucosa&site=33.9+-83.3&date1=2011,2012,2013&flags=build_txt:')
>> # This opens the URL and creates a link to machine-readable data on the
>> # page, which I can then download by simply doing this:
>>
>> #2
>> read.delim('http://pick18.discoverlife.org/tmp/Hypoprepia_fucosa_33.9_-83.3_2011,2012,2013.txt')
>>
>> However, I can only get the second line above to work if the thing in line
>> #1 has been opened in a browser already. Is there any way to allow me to
>> either 1) close the browser after it's been opened or 2) execute the line
>> #2 above without having to open a browser? We have hundreds of species that
>> you can see after the '&kind=' bit of the URL, so I'm trying to keep the
>> browsing situation sane.
>>
>> Thanks!
>> R
>>
>>
> You'll need to figure out what happens when you open the first page. Does
> it set a cookie?  Does it record your IP address?  Does it just build the
> file but record nothing about you?
>
> If it's one of the simpler versions, you can just read the first page,
> wait a bit, then read the second one.
>
> If you need to manage cookies, you'll need something more complicated. I
> don't know the easiest way to do that.
>
> Duncan Murdoch
>
>

Ryan Utz

2016-Oct-11 11:59 UTC

Bob/Duncan,

Thanks for writing. I think some of the things Bob mentioned might work,
but I'm still not quite getting there. Below is the example I'm working
with:

#1
browseURL('http://pick18.discoverlife.org/mp/20m?plot=2&kind=Hypoprepia+fucosa&site=33.9+-83.3&date1=2011,2012,2013&flags=build_txt:')
# This opens the URL and creates a link to machine-readable data on the
# page, which I can then download by simply doing this:

#2
read.delim('http://pick18.discoverlife.org/tmp/Hypoprepia_fucosa_33.9_-83.3_2011,2012,2013.txt')
# This is the data I need to read, but this URL only exists if the URL
# run above has been activated first.

So, for example, try running line #2 without the first line: it won't work.
Next, run #1 and then #2: it works fine.

See what I mean?


-- 

Ryan Utz, Ph.D.
Assistant professor of water resources
Chatham University
Home/Cell: (724) 272-7769


Duncan Murdoch

2016-Oct-11 13:21 UTC

On 11/10/2016 7:59 AM, Ryan Utz wrote:
> Bob/Duncan,
>
> Thanks for writing. I think some of the things Bob mentioned might work,
> but I'm still not quite getting there. Below is the example I'm working
> with:
>
It worked for me when I replaced the browseURL call with a readLines 
call, as I suggested the other day.  What went wrong for you?

Duncan Murdoch
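
The readLines replacement described above can be sketched as follows, extended to a batch of species. This is a sketch under assumptions: the helper names (`build_urls`, `fetch_species`), the default 2-second pause, and the naming pattern of the generated .txt file are inferred from the single example URL pair in this thread and are not documented behaviour of the site.

```r
# Build the "trigger" page URL and the matching data-file URL for one species.
# The URL layout is copied from the example in this thread; the file-name
# pattern is an assumption based on that one example.
build_urls <- function(species, site = c(33.9, -83.3), years = 2011:2013) {
  yrs <- paste(years, collapse = ",")
  list(
    trigger = paste0(
      "http://pick18.discoverlife.org/mp/20m?plot=2&kind=",
      gsub(" ", "+", species, fixed = TRUE),
      "&site=", site[1], "+", site[2],
      "&date1=", yrs, "&flags=build_txt:"
    ),
    data = paste0(
      "http://pick18.discoverlife.org/tmp/",
      gsub(" ", "_", species, fixed = TRUE),
      "_", site[1], "_", site[2], "_", yrs, ".txt"
    )
  )
}

# Read the first page (instead of opening it in a browser), wait a bit,
# then read the data file that the first request caused to be built.
fetch_species <- function(species, pause = 2, ...) {
  u <- build_urls(species, ...)
  invisible(readLines(u$trigger, warn = FALSE))
  Sys.sleep(pause)
  read.delim(u$data)
}

# Usage (needs network access, so left commented out):
# moths <- c("Hypoprepia fucosa")
# results <- lapply(moths, fetch_species)
```

Keeping the URL construction in its own function makes it easy to check the generated strings against a known-good pair before sending hundreds of requests.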
