I'm having trouble pulling down data from a website with my code below, as I keep encountering the same error, but the error occurs on different pages. My code loops through a website and grabs data from the HTML table on each page. The error appears on different pages at different times and I'm not sure of the root cause.

Error in readHTMLTable(readLines(url), which = 1, header = TRUE) :
  error in evaluating the argument 'doc' in selecting a method for
  function 'readHTMLTable'

library(XML)

for (i in 1:1000) {
  # build the URL for page i of the leaderboard
  url <- paste(paste('http://games.crossfit.com/scores/leaderboard.php?stage=5&sort=0&page=', i, sep = ''),
               '&division=1&region=0&numberperpage=100&competition=0&frontpage=0&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=0&showathleteac=1&=&is_mobile=0',
               sep = '')

  # grab the first table on the page
  tmp <- readHTMLTable(readLines(url), which = 1, header = TRUE)

  # strip newlines and whitespace from headers and cells
  names(tmp) <- gsub("\\n", "", names(tmp))
  names(tmp) <- gsub(" +", "", names(tmp))
  tmp[] <- lapply(tmp, function(x) gsub("\\n", "", x))

  if (i == 1) {
    dat <- tmp
  } else {
    dat <- rbind(dat, tmp)
  }
  cat('Grabbing data from page', i, '\n')
}
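To help narrow down the root cause, I suppose I could split the fetch from the parse to see which step is failing, e.g. (untested sketch):

page <- try(readLines(url), silent = TRUE)  # does the download itself fail?
if (inherits(page, "try-error")) {
  cat('Fetch failed on page', i, '\n')
} else {
  tmp <- readHTMLTable(page, which = 1, header = TRUE)  # or is it the parse?
}

Thanks,
Harold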
This somewhat simpler rvest code does the trick for me:

library(rvest)
library(dplyr)

i <- 1:10
urls <- paste0('http://games.crossfit.com/scores/leaderboard.php?stage=5',
               '&sort=0&division=1&region=0&numberperpage=100&competition=0&frontpage=0',
               '&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=0&showathleteac=1',
               '&is_mobile=0&page=', i)

results_table <- function(url) {
  url %>%
    html %>%
    html_table(fill = TRUE) %>%
    .[[1]]
}

results <- lapply(urls, results_table)
out <- results %>% bind_rows()

Hadley
--
http://had.co.nz/
Hadley, thanks. I ran into the same roadblock when I used your code and increased i to loop over all pages. I think the problem is that the website I'm scraping is getting hammered with users, and the error is just a timeout. I have provisionally solved my problem by wrapping try() statements around the appropriate calls, with some conditional if/else logic to skip a step when a timeout occurs. Not sure if this is elegant, but my sledgehammer approach is "working" now.
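Roughly, the pattern is something like this, reusing your results_table() (the retry limit and pause length are arbitrary choices on my part):

grab_page <- function(url, attempts = 3) {
  for (k in 1:attempts) {
    tmp <- try(results_table(url), silent = TRUE)  # may time out
    if (!inherits(tmp, "try-error")) return(tmp)   # success: stop retrying
    Sys.sleep(2)  # give the server a moment before the next attempt
  }
  NULL  # skip this page if every attempt failed
}

results <- lapply(urls, grab_page)
out <- bind_rows(results[!sapply(results, is.null)])

Harold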