Hello, All:

I've failed with multiple attempts to scrape the table of candidates from the website of the Missouri Secretary of State:

https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975

I've tried base::url, base::readLines, xml2::read_html, and XML::readHTMLTable; see summary below.

Suggestions?

Thanks,
Spencer Graves

sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"

str(baseURL <- base::url(sosURL))
# this might give me something, but I don't know what

sosRead <- base::readLines(sosURL) # 404 Not Found
sosRb <- base::readLines(baseURL) # 404 Not Found

sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.

sosXML <- XML::readHTMLTable(sosURL)
# List of 0; does not seem to be XML

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets
[6] methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2    curl_4.3
[4] xml2_1.3.2     XML_3.99-0.3
Hi Spencer,

I tried the code below on an older R installation, and it works fine. Not a full solution, but it's a start:

> library(RCurl)
Loading required package: bitops
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> M_sos <- getURL(url)
> print(M_sos)
[1] "\r\n<!DOCTYPE html>\r\n\r\n<html lang=\"en-us\">\r\n<head><title>\r\n\tSOS, Missouri - Elections: Offices Filed in Candidate Filing\r\n</title><meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" [...remainder truncated].

HTH, Bill.

W. Michels, Ph.D.

On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves <spencer.graves at effectivedefense.org> wrote:
> Hello, All:
>
> I've failed with multiple attempts to scrape the table of
> candidates from the website of the Missouri Secretary of State:
>
> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>
> I've tried base::url, base::readLines, xml2::read_html, and
> XML::readHTMLTable; see summary below.
>
> Suggestions?
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
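
The string that getURL() returns can be handed to an HTML parser instead of being printed. The following is only a minimal sketch, assuming the XML package listed in the session info is installed. The markup in html_txt is an invented stand-in for the downloaded page, not a copy of it, and the names html_txt, doc, and tabs are made up for illustration.

```r
library(XML)

# Stand-in for the character string returned by RCurl::getURL().
# This markup is invented for illustration; it is NOT the real SOS page.
html_txt <- paste0(
  "<html><body>",
  "<table id='t1'><tr><th>Name</th><th>City</th></tr>",
  "<tr><td>Jane Doe</td><td>Springfield</td></tr>",
  "<tr><td>John Roe</td><td>Columbia</td></tr></table>",
  "</body></html>")

# asText = TRUE says the argument is markup itself, not a file name or URL.
doc <- htmlParse(html_txt, asText = TRUE)

# readHTMLTable() on a parsed document returns a list with one element
# per <table> node it finds.
tabs <- readHTMLTable(doc, stringsAsFactors = FALSE)
str(tabs)
```

With the real page, M_sos from the session above would take the place of html_txt. Whether the tables come back with usable names and column headers depends on the page's markup, which this sketch does not reproduce.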
Hi Bill et al.:

That broke the dam: It gave me a character vector of length 1 consisting of 218 KB. I fed that to XML::readHTMLTable and purrr::map_chr, both of which returned lists of 337 data.frames. The former retained names for all the tables, absent from the latter. The columns of the former are all character; that's not true for the latter.

Sadly, it's not quite what I want: It's one table for each office-party combination, but it's lost the office designation. However, I'm confident I can figure out how to hack that.

Thanks,
Spencer Graves

On 2020-07-23 17:46, William Michels wrote:
> Hi Spencer,
>
> I tried the code below on an older R-installation, and it works fine.
> Not a full solution, but it's a start:
>
>> library(RCurl)
>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> M_sos <- getURL(url)
>
> HTH, Bill.
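
One possible way to keep the office designation is to look backwards from each table to the nearest heading before it. This is only a sketch under loud assumptions: it supposes that each office appears in an h2 element and each party in an h3 element ahead of its table, which has not been checked against the real page. The sample markup, the candidate names, and the helper one_table() are all hypothetical.

```r
library(xml2)

# Invented markup: the real page's structure is NOT known here. The sketch
# assumes office names sit in <h2> and party names in <h3> before each table.
html_txt <- paste0(
  "<html><body>",
  "<h2>Governor</h2>",
  "<h3>Republican</h3>",
  "<table><tr><th>Name</th><th>City</th></tr>",
  "<tr><td>A. Example</td><td>Columbia</td></tr></table>",
  "<h3>Democratic</h3>",
  "<table><tr><th>Name</th><th>City</th></tr>",
  "<tr><td>B. Example</td><td>Joplin</td></tr></table>",
  "<h2>Lieutenant Governor</h2>",
  "<h3>Republican</h3>",
  "<table><tr><th>Name</th><th>City</th></tr>",
  "<tr><td>C. Example</td><td>Rolla</td></tr></table>",
  "</body></html>")

doc <- read_html(html_txt)        # read_html() accepts literal HTML text
tables <- xml_find_all(doc, "//table")

# Hypothetical helper: turn one <table> node into a data frame and attach
# the closest preceding office and party headings as extra columns.
one_table <- function(tbl) {
  office <- xml_text(xml_find_first(tbl, "preceding::h2[1]"))
  party  <- xml_text(xml_find_first(tbl, "preceding::h3[1]"))
  header <- xml_text(xml_find_all(tbl, ".//th"))
  rows   <- xml_find_all(tbl, ".//tr[td]")       # data rows only
  cells  <- lapply(rows, function(r) xml_text(xml_find_all(r, "./td")))
  df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
  names(df) <- header
  data.frame(Office = office, Party = party, df, stringsAsFactors = FALSE)
}

# Stack the per-table data frames into one table of candidates.
candidates <- do.call(rbind, lapply(tables, one_table))
print(candidates)
```

If the real page marks offices differently (for example with a span or a table caption), only the two XPath expressions in one_table() would need to change.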