On 2020-07-24 08:20 -0500, luke-tierney at uiowa.edu wrote:
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > On 2020-07-23 17:46, William Michels wrote:
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
> > > <spencer.graves at effectivedefense.org> wrote:
> > > > Hello, All:
> > > >
> > > > I've failed with multiple attempts to scrape the table of
> > > > candidates from the website of the Missouri Secretary of State:
> > > >
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > >
> > > Hi Spencer,
> > >
> > > I tried the code below on an older R installation, and it works
> > > fine.  Not a full solution, but it's a start:
> > >
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > M_sos <- getURL(url)
> >
> > Hi Bill et al.:
> >
> > That broke the dam: it gave me a character vector of length 1
> > consisting of 218 KB.  I fed that to XML::readHTMLTable and
> > purrr::map_chr, both of which returned lists of 337 data.frames.
> > The former retained names for all the tables, absent from the
> > latter.  The columns of the former are all character; that's not
> > true for the latter.
> >
> > Sadly, it's not quite what I want: it's one table for each
> > office-party combination, but it's lost the office designation.
> > However, I'm confident I can figure out how to hack that.
> > Maybe try something like this:
> >
> > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > h <- xml2::read_html(url)
> > tbl <- rvest::html_table(h)

Dear Spencer,

I unified the party tables after the first summary table like this:

	url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
	M_sos <- RCurl::getURL(url)
	saveRDS(object=M_sos, file="dcp.rds")
	dat <- XML::readHTMLTable(M_sos)
	idx <- 2:length(dat)
	cn <- unique(unlist(lapply(dat[idx], colnames)))
	dat <- do.call(rbind,
	  sapply(idx, function(i, dat, cn) {
	    x <- dat[[i]]
	    x[,cn[!(cn %in% colnames(x))]] <- NA
	    x <- x[,cn]
	    x$Party <- names(dat)[i]
	    return(list(x))
	  }, dat=dat, cn=cn))
	dat[,"Date Filed"] <-
	  as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
	write.table(dat, file="dcp.tsv", sep="\t",
	  row.names=FALSE, quote=TRUE, na="N/A")

Best,
Rasmus
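[Editor's note: the column-alignment step in the code above (padding each table with NA for the columns it lacks, then stacking with rbind) can be demonstrated on toy data. The data frames and party names below are made up for illustration; only the technique matches the code above.]

```r
# Two tables with partially overlapping columns, as readHTMLTable
# might return for two parties.
a <- data.frame(Name = c("X", "Y"), Votes = c(1, 2))
b <- data.frame(Name = "Z", District = "4")
dfs <- list(Dem = a, Rep = b)

# Union of all column names across the tables.
cn <- unique(unlist(lapply(dfs, colnames)))

out <- do.call(rbind, lapply(names(dfs), function(nm) {
  x <- dfs[[nm]]
  x[, cn[!(cn %in% colnames(x))]] <- NA  # add missing columns as NA
  x <- x[, cn]                           # put columns in a common order
  x$Party <- nm                          # record which table it came from
  x
}))
# out is a single 3-row data.frame with columns Name, Votes,
# District, Party.
```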
Dear Rasmus:

On 2020-07-24 09:16, Rasmus Liland wrote:
> On 2020-07-24 08:20 -0500, luke-tierney at uiowa.edu wrote:
>> On Fri, 24 Jul 2020, Spencer Graves wrote:
>>> [...]
>>> Sadly, it's not quite what I want: it's one table for each
>>> office-party combination, but it's lost the office designation.
>>> However, I'm confident I can figure out how to hack that.
>> Maybe try something like this:
>>
>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> h <- xml2::read_html(url)
>> tbl <- rvest::html_table(h)

> Dear Spencer,
>
> I unified the party tables after the first summary table like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> M_sos <- RCurl::getURL(url)
> saveRDS(object=M_sos, file="dcp.rds")
> dat <- XML::readHTMLTable(M_sos)
> idx <- 2:length(dat)
> cn <- unique(unlist(lapply(dat[idx], colnames)))

      This is useful for this application.

> dat <- do.call(rbind,
>   sapply(idx, function(i, dat, cn) {
>     x <- dat[[i]]
>     x[,cn[!(cn %in% colnames(x))]] <- NA
>     x <- x[,cn]
>     x$Party <- names(dat)[i]
>     return(list(x))
>   }, dat=dat, cn=cn))
> dat[,"Date Filed"] <-
>   as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")

      This misses something extremely important for this application:
the political office.  That's buried in the HTML or whatever it is.
I'm using something like the following to find that:

	str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

      After I figure this out, I will use something like your code to
combine it all into separate tables for each office, and then probably
combine those into one table for the offices I'm interested in.  For my
present purposes, I don't want all the offices in Missouri, only the
executive positions and those representing parts of the Kansas City
metro area in the Missouri legislature.

      Thanks again,
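[Editor's note: the gregexpr idea above — locating office headings by their character offsets in the raw HTML — can be sketched on a made-up fragment standing in for M_sos. The `<h3>` tag and the fragment below are assumptions for illustration; the real page's markup may differ.]

```r
# A toy stand-in for the downloaded page source.
html <- paste0("<h3>Governor</h3><table>...</table>",
               "<h3>Lieutenant Governor</h3><table>...</table>")

# Character offsets of every heading match.
m <- gregexpr("<h3>([^<]+)</h3>", html)[[1]]

# The matched heading text, with the tags stripped off.
offices <- regmatches(html, gregexpr("<h3>([^<]+)</h3>", html))[[1]]
offices <- gsub("</?h3>", "", offices)

# m holds the offsets in ascending order: any table whose offset falls
# after m[i] and before m[i + 1] belongs to offices[i].
```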
      Spencer Graves

> write.table(dat, file="dcp.tsv", sep="\t",
>   row.names=FALSE, quote=TRUE, na="N/A")
>
> Best,
> Rasmus

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> Dear Rasmus:
>
> > I unified the party tables after the first summary table like this:
> >
> > [...]
> > cn <- unique(unlist(lapply(dat[idx], colnames)))
>
> This is useful for this application.
>
> > dat <- do.call(rbind, [...])
> > dat[,"Date Filed"] <-
> >   as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
>
> This misses something extremely important for this application: the
> political office.  That's buried in the HTML or whatever it is.  I'm
> using something like the following to find that:
>
> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

Dear Spencer,

I came up with a solution, but it is not very elegant.  Instead of
showing you the solution, hoping you understand everything in it, I
instead want to give you some emphatic hints to see if you can come up
with a solution on your own.
- XML::htmlTreeParse(M_sos)
- *Gandalf voice*: climb the tree until you find the content you are
  looking for, flat out at the level of "The Children of the Div",
  *uuuUUU*
- you only want to keep the table and header tags at this level
- Use XML::xmlValue to extract the values of all the headers (the
  political positions)
- Observe that all the tables on the page you were able to extract
  previously using XML::readHTMLTable are at this level, shuffled in
  between the political-position header tags.  This means you can
  extract the political position and party affiliation by using a for
  loop, if statements, typeof, names, and [] and [[]] to grab
  different things from the list (the content or the bag itself).
  XML::readHTMLTable strips away the line break tags from the mailing
  address, so if you find a better way of extracting the tables, tell
  me; e.g., you get

	8805 HUNTER AVEKANSAS CITY MO 64138

  and not

	8805 HUNTER AVE<br/>KANSAS CITY MO 64138

When you've completed this "programming quest", you're back at the
level of the previous email, i.e. you have the same tables, but with
political position and party affiliation added to them.

Best,
Rasmus
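[Editor's note: the tree walk hinted at above — pairing each table with the most recent header in document order — can be sketched with xml2 on an inline stand-in for the real page. The element names and candidate values below are assumptions; the real sos.mo.gov page's structure may differ, and this is not the solution Rasmus is withholding.]

```r
library(xml2)

# Inline stand-in for the downloaded page.
doc <- read_html(paste0(
  "<div>",
  "<h3>Governor</h3><table><tr><td>Smith</td></tr></table>",
  "<h3>Auditor</h3><table><tr><td>Jones</td></tr></table>",
  "</div>"))

# Select headers and tables together, in document order.
nodes <- xml_find_all(doc, "//h3 | //table")

office <- NA_character_
out <- list()
for (n in nodes) {
  if (xml_name(n) == "h3") {
    office <- xml_text(n)   # remember the most recent heading
  } else {
    # a table: attach the heading seen most recently before it
    out[[length(out) + 1]] <- c(office = office,
                                candidate = xml_text(n))
  }
}
# out now pairs each table's text with its political position.
```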