Hi Bill et al.:

      That broke the dam: It gave me a character vector of length 1
consisting of 218 KB. I fed that to XML::readHTMLTable and purrr::map_chr,
both of which returned lists of 337 data.frames. The former retained names
for all the tables, absent from the latter. The columns of the former are
all character; that's not true for the latter.

      Sadly, it's not quite what I want: It's one table for each
office-party combination, but it's lost the office designation. However,
I'm confident I can figure out how to hack that.

      Thanks,
      Spencer Graves

On 2020-07-23 17:46, William Michels wrote:
> Hi Spencer,
>
> I tried the code below on an older R-installation, and it works fine.
> Not a full solution, but it's a start:
>
> > library(RCurl)
> Loading required package: bitops
> > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > M_sos <- getURL(url)
> > print(M_sos)
> [1] "\r\n<!DOCTYPE html>\r\n\r\n<html lang=\"en-us\">\r\n<head><title>\r\n\tSOS, Missouri - Elections: Offices Filed in Candidate Filing\r\n</title><meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" [...remainder truncated].
>
> HTH, Bill.
>
> W. Michels, Ph.D.
>
> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
> <spencer.graves at effectivedefense.org> wrote:
>> Hello, All:
>>
>> I've failed with multiple attempts to scrape the table of
>> candidates from the website of the Missouri Secretary of State:
>>
>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>
>> I've tried base::url, base::readLines, xml2::read_html, and
>> XML::readHTMLTable; see summary below.
>>
>> Suggestions?
>> Thanks,
>> Spencer Graves
>>
>> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>
>> str(baseURL <- base::url(sosURL))
>> # this might give me something, but I don't know what
>>
>> sosRead <- base::readLines(sosURL) # 404 Not Found
>> sosRb <- base::readLines(baseURL) # 404 Not Found
>>
>> sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.
>>
>> sosXML <- XML::readHTMLTable(sosURL)
>> # List of 0; does not seem to be XML
>>
>> sessionInfo()
>>
>> R version 4.0.2 (2020-06-22)
>> Platform: x86_64-apple-darwin17.0 (64-bit)
>> Running under: macOS Catalina 10.15.5
>>
>> Matrix products: default
>> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets
>> [6] methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] compiler_4.0.2 tools_4.0.2    curl_4.3
>> [4] xml2_1.3.2     XML_3.99-0.3
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
luke-tierney at uiowa.edu
2020-Jul-24 13:20 UTC
[R] [External] Re: help with web scraping
Maybe try something like this:

url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
h <- xml2::read_html(url)
tbl <- rvest::html_table(h)

Best,

luke

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
On 2020-07-24 08:20, luke-tierney at uiowa.edu wrote:
> Maybe try something like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)

Error in open.connection(x, "rb") : HTTP error 404.

      Thanks for the suggestion, but this failed for me on the platform
described in the sessionInfo() output in my original message.

> tbl <- rvest::html_table(h)

      As I previously noted, RCurl::getURL returned a single character
string of roughly 218 KB, from which I've so far gotten most but not all
of what I want. Unfortunately, when I fed that character vector to
rvest::html_table, I got:

Error in UseMethod("html_table") :
  no applicable method for 'html_table' applied to an object of class "character"

      I don't know for sure yet, but I believe I'll be able to get what I
want from the single character string using, e.g., gregexpr and other
functions.

      Thanks again,
      Spencer Graves

> Best,
>
> luke
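The "no applicable method" error above arises because rvest::html_table expects a parsed document or node, not a character string. A minimal sketch of one way around it, assuming the xml2 and rvest packages are installed: parse the string with xml2::read_html first, then extract the tables. The `page_text` string here is a made-up stand-in for the value returned by RCurl::getURL; it is not the real page.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the HTML string returned by RCurl::getURL();
# the real page is far larger and has a different layout.
page_text <- paste0(
  "<html><body>",
  "<table><tr><th>Name</th><th>Date Filed</th></tr>",
  "<tr><td>A. Candidate</td><td>2/25/2020</td></tr></table>",
  "<table><tr><th>Name</th><th>Date Filed</th></tr>",
  "<tr><td>B. Candidate</td><td>2/26/2020</td></tr>",
  "<tr><td>C. Candidate</td><td>3/1/2020</td></tr></table>",
  "</body></html>"
)

# read_html() accepts literal HTML as well as a path or URL, so the
# download step (RCurl) and the parsing step (xml2) can be kept separate.
doc <- xml2::read_html(page_text)

# html_table() on the parsed document returns one table per <table> element.
tbls <- rvest::html_table(doc)

length(tbls)      # number of tables found
nrow(tbls[[2]])   # rows in the second table
```

With the real download, the same two calls would be applied to the string from getURL instead of `page_text`.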
On 2020-07-24 08:20 -0500, luke-tierney at uiowa.edu wrote:
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > Sadly, it's not quite what I want: It's one table for each
> > office-party combination, but it's lost the office designation.
> > However, I'm confident I can figure out how to hack that.
>
> Maybe try something like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)
> tbl <- rvest::html_table(h)

Dear Spencer,

I unified the party tables after the first summary table like this:

url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
M_sos <- RCurl::getURL(url)
saveRDS(object=M_sos, file="dcp.rds")  # keep a copy of the raw HTML
dat <- XML::readHTMLTable(M_sos)
idx <- 2:length(dat)                   # skip the first (summary) table
cn <- unique(unlist(lapply(dat[idx], colnames)))
dat <- do.call(rbind,
  sapply(idx, function(i, dat, cn) {
    x <- dat[[i]]
    x[,cn[!(cn %in% colnames(x))]] <- NA  # add columns this table lacks
    x <- x[,cn]                           # same column order in every table
    x$Party <- names(dat)[i]
    return(list(x))
  }, dat=dat, cn=cn))
dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
write.table(dat, file="dcp.tsv", sep="\t", row.names=FALSE, quote=TRUE, na="N/A")

Best,
Rasmus
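The column-padding step in the code above can be hard to follow on 337 tables, so here is a small base-R sketch of the same idea on toy data. The list `tabs`, its element names, and the `City` column are invented for illustration; they only mimic a named list of tables whose columns do not all match.

```r
# Toy stand-ins for the scraped tables (hypothetical names and columns).
tabs <- list(
  Republican = data.frame(Name = c("A", "B"), City = c("X", "Y"),
                          stringsAsFactors = FALSE),
  Democratic = data.frame(Name = "C", stringsAsFactors = FALSE)
)

# Union of all column names across the tables.
all_cols <- unique(unlist(lapply(tabs, colnames)))

pad_table <- function(nm) {
  x <- tabs[[nm]]
  # Add any column this table lacks, filled with NA.
  for (col in setdiff(all_cols, colnames(x))) {
    x[[col]] <- NA
  }
  x <- x[all_cols]   # put columns in the same order everywhere
  x$Party <- nm      # record which table each row came from
  x
}

# rbind() now works because every piece has identical columns.
combined <- do.call(rbind, lapply(names(tabs), pad_table))
rownames(combined) <- NULL
print(combined)
```

Using lapply over the list names avoids the need to wrap each result in list(), which sapply requires to stop it from simplifying the data frames.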