On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> Dear Rasmus:
>
> > Dear Spencer,
> >
> > I unified the party tables after the
> > first summary table like this:
> >
> > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > M_sos <- RCurl::getURL(url)
> > saveRDS(object=M_sos, file="dcp.rds")
> > dat <- XML::readHTMLTable(M_sos)
> > idx <- 2:length(dat)
> > cn <- unique(unlist(lapply(dat[idx], colnames)))
>
> This is useful for this application.
>
> > dat <- do.call(rbind,
> >   sapply(idx, function(i, dat, cn) {
> >     x <- dat[[i]]
> >     x[,cn[!(cn %in% colnames(x))]] <- NA
> >     x <- x[,cn]
> >     x$Party <- names(dat)[i]
> >     return(list(x))
> >   }, dat=dat, cn=cn))
> > dat[,"Date Filed"] <-
> >   as.Date(x=dat[,"Date Filed"],
> >           format="%m/%d/%Y")
>
> This misses something extremely
> important for this application: The
> political office. That's buried in
> the HTML or whatever it is. I'm using
> something like the following to find
> that:
>
> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

Dear Spencer,

I came up with a solution, but it is not
very elegant. Instead of showing you
the solution, hoping you understand
everything in it, I instead want to give
you some emphatic hints to see if you
can come up with a solution on your own.
- XML::htmlTreeParse(M_sos)
- *Gandalf voice*: climb the tree
  until you find the content you are
  looking for, flat out at the level of
  "The Children of the Div", *uuuUUU*
- you only want to keep the table and
  header tags at this level
- Use XML::xmlValue to extract the
  values of all the headers (the
  political positions)
- Observe that all the tables on the
  page you were able to extract
  previously using XML::readHTMLTable
  are at this level, shuffled in between
  the political position header tags;
  this means you can extract the political
  position and party affiliation by
  using a for loop, if statements,
  typeof, names, and [] and [[]] to grab
  different things from the list
  (the content or the bag itself).

XML::readHTMLTable strips away the
line break tags from the Mailing
Address, so if you find a better way
of extracting the tables, tell me;
e.g. you get

  8805 HUNTER AVEKANSAS CITY MO 64138

and not

  8805 HUNTER AVE<br/>KANSAS CITY MO 64138

When you've completed this "programming
quest", you're back at the level of the
previous email, i.e. you have the
same tables, but with political position
and party affiliation added to them.

Best,
Rasmus
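A rough skeleton of the kind of tree-climbing hinted at above, in case it helps (a sketch only, not the withheld solution; treating the position headers as h3 tags is an assumption, and nothing here has been run against the live page):

```r
library(XML)

# M_sos is the raw HTML fetched with RCurl::getURL(url), as in the previous message
doc <- htmlTreeParse(M_sos, useInternalNodes = TRUE)

# climb down to the level where the position headers and the party tables
# sit side by side as children of the div
nodes <- getNodeSet(doc, "//div//h3 | //div//table")

office <- NA
for (node in nodes) {
  if (xmlName(node) == "h3") {
    office <- xmlValue(node)       # a political position
  } else {
    tbl <- readHTMLTable(node)     # one party's table of candidates
    # ... attach `office` (and the party, from the table caption) here
  }
}
```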
Dear Rasmus et al.:

On 2020-07-25 04:10, Rasmus Liland wrote:
> On 2020-07-24 10:28 -0500, Spencer Graves wrote:
>> Dear Rasmus:
>>
>>> Dear Spencer,
>>>
>>> I unified the party tables after the
>>> first summary table like this:
>>>
>>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> M_sos <- RCurl::getURL(url)
>>> saveRDS(object=M_sos, file="dcp.rds")
>>> dat <- XML::readHTMLTable(M_sos)
>>> idx <- 2:length(dat)
>>> cn <- unique(unlist(lapply(dat[idx], colnames)))
>> This is useful for this application.
>>
>>> dat <- do.call(rbind,
>>>   sapply(idx, function(i, dat, cn) {
>>>     x <- dat[[i]]
>>>     x[,cn[!(cn %in% colnames(x))]] <- NA
>>>     x <- x[,cn]
>>>     x$Party <- names(dat)[i]
>>>     return(list(x))
>>>   }, dat=dat, cn=cn))
>>> dat[,"Date Filed"] <-
>>>   as.Date(x=dat[,"Date Filed"],
>>>           format="%m/%d/%Y")
>> This misses something extremely
>> important for this application: The
>> political office. That's buried in
>> the HTML or whatever it is. I'm using
>> something like the following to find
>> that:
>>
>> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
> Dear Spencer,
>
> I came up with a solution, but it is not
> very elegant. Instead of showing you
> the solution, hoping you understand
> everything in it, I instead want to give
> you some emphatic hints to see if you
> can come up with a solution on your own.
>
> - XML::htmlTreeParse(M_sos)
> - *Gandalf voice*: climb the tree
>   until you find the content you are
>   looking for, flat out at the level of
>   "The Children of the Div", *uuuUUU*
> - you only want to keep the table and
>   header tags at this level
> - Use XML::xmlValue to extract the
>   values of all the headers (the
>   political positions)
> - Observe that all the tables on the
>   page you were able to extract
>   previously using XML::readHTMLTable
>   are at this level, shuffled in between
>   the political position header tags;
>   this means you can extract the political
>   position and party affiliation by
>   using a for loop, if statements,
>   typeof, names, and [] and [[]] to grab
>   different things from the list
>   (the content or the bag itself).
>
> XML::readHTMLTable strips away the
> line break tags from the Mailing
> Address, so if you find a better way
> of extracting the tables, tell me;
> e.g. you get
>
>   8805 HUNTER AVEKANSAS CITY MO 64138
>
> and not
>
>   8805 HUNTER AVE<br/>KANSAS CITY MO 64138
>
> When you've completed this "programming
> quest", you're back at the level of the
> previous email, i.e. you have the
> same tables, but with political position
> and party affiliation added to them.

Please excuse: Before my last post, I had written code to do all
that. In brief, the political offices are "h3" tags. I used "strsplit"
to split the string at "<h3>". I then wrote a function to find "</h3>",
extract the political office, and pass the rest to "XML::readHTMLTable",
adding columns for party and political office.

However, this suppressed "<br/>" everywhere. I thought there should
be an option with something like "XML::readHTMLTable" that would not
delete "<br/>" everywhere, but I couldn't find it. If you aren't aware
of one, I can gsub("<br/>", "\n", ...) on the string for each political
office before passing it to "XML::readHTMLTable". I just tested this:
It works.

I have other web scraping problems in my work plan for the next few
days. I will definitely try XML::htmlTreeParse, etc., as you suggest.

Thanks again.
Spencer Graves

> Best,
> Rasmus
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
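The strsplit/gsub approach described above might be sketched roughly as follows (reconstructed from the description, not the actual code; the helper name addOffice is invented):

```r
library(XML)

# split the raw page at each office header; drop the part before the first one
chunks <- strsplit(M_sos, "<h3>", fixed = TRUE)[[1]][-1]

addOffice <- function(chunk) {
  pos <- regexpr("</h3>", chunk, fixed = TRUE)
  office <- substr(chunk, 1, pos - 1)               # the political office
  rest <- substring(chunk, pos + nchar("</h3>"))
  rest <- gsub("<br/>", "\n", rest, fixed = TRUE)   # keep address line breaks
  tbls <- readHTMLTable(rest)
  # the list names carry the party; add it and the office as columns
  lapply(names(tbls), function(party)
    cbind(tbls[[party]], Party = party, Office = office))
}

allTables <- unlist(lapply(chunks, addOffice), recursive = FALSE)
```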
On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:

It is LILAND et al., is it not? I do not belong to a large Confucian
family structure (putting the hunter-gatherer horse-rider tribe name
first in all-caps in the email), else it's customary to put a comma in
there, isn't it? ... right, moving on:

On 2020-07-25 04:10, Rasmus Liland wrote:
> > ?????

It might be a better idea to write the reply in plain-text utf-8, or
at least Western or Eastern-European ISO euro encoding, instead of
us-ascii (maybe KOI8, ¯\_(ツ)_/¯) ... something in your email got
string-replaced by "?????" and also "?" got replaced by "?". Please
research using Thunderbird, Claws Mail, or some other sane e-mail
client; they are great, I promise.

> Please excuse: Before my last post, I
> had written code to do all that.

Good!

> In brief, the political offices are
> "h3" tags.

Yes, some type of header element at least, in-between the various
tables, everything children of the div in the element tree.

> I used "strsplit" to split the string
> at "<h3>". I then wrote a
> function to find "</h3>", extract the
> political office, and pass the rest to
> "XML::readHTMLTable", adding columns
> for party and political office.

Yes, doing that for the political office is also possible, but the
party is inside the table's caption tag, which ends up as the name of
the table in the XML::readHTMLTable list ...

> However, this suppressed "<br/>"
> everywhere.

Why is that, please explain.

> I thought there should be
> an option with something like
> "XML::readHTMLTable" that would not
> delete "<br/>" everywhere, but I
> couldn't find it.

No, there is not, AFAIK. Please, if anyone else knows, please say so
*echoes in the forest*

> If you aren't aware of one, I can
> gsub("<br/>", "\n", ...) on the string
> for each political office before
> passing it to "XML::readHTMLTable". I
> just tested this: It works.

Such a great hack!
IMHO, this is much more flexible than using xml2::read_html,
rvest::html_table, dplyr::mutate like here [1].

> I have other web scraping problems in
> my work plan for the next few days.

Maybe, idk ...

> I will definitely try
> XML::htmlTreeParse, etc., as you
> suggest.

I wish you good luck,
Rasmus

[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
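To illustrate the caption point above: the party names that XML::readHTMLTable uses as list names come from each table's caption tag, which can also be pulled out directly (a small sketch, untested against the live page):

```r
library(XML)

doc <- htmlTreeParse(M_sos, useInternalNodes = TRUE)
captions <- sapply(getNodeSet(doc, "//table/caption"), xmlValue)
# these should match names(readHTMLTable(M_sos)) for the party tables
```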
Dear Spencer Graves (and Rasmus Liland),

I've had some luck just using gsub() to alter the offending "<br/>"
tags, appending a "___" marker at each instance of "<br/>" (first I
checked the text to make sure it didn't contain any pre-existing
instances of "___"). See the output snippet below:

> library(RCurl)
> library(XML)
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- getURL(sosURL)
> sosChars2 <- gsub("<br/>", "<br/>___", sosChars)
> MOcan <- readHTMLTable(sosChars2)
> MOcan[[2]]
                  Name                          Mailing Address Random Number Date Filed
1       Raleigh Ritter      4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
2          Mike Parson         1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
3 James W. (Jim) Neely            PO BOX 343___CAMERON MO 64429           477  2/25/2020
4     Saundra McDowell 3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020
>

It's true, there's one 'section' of MOcan output that contains
odd-looking characters (see the "Total" line of MOcan[[1]]). But my
guess is you'll be deleting this 'line' anyway--and recalculating
totals in R. Now that you have a comprehensive list object, you should
be able to pull out districts/races of interest.

You might want to take a look at the "rlist" package, to see if it can
make your work a little easier:

https://CRAN.R-project.org/package=rlist
https://renkun-ken.github.io/rlist-tutorial/index.html

HTH, Bill W. Michels, Ph.D.
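As a small follow-on to the snippet above (my addition, assuming MOcan as constructed there): the "___" markers can later be split back into separate address lines with strsplit:

```r
# one Mailing Address cell from MOcan[[2]], as produced above
addr <- "4476 FIVE MILE RD___SENECA MO 64865"
strsplit(addr, "___", fixed = TRUE)[[1]]
# -> c("4476 FIVE MILE RD", "SENECA MO 64865")
```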
On Sat, Jul 25, 2020 at 7:56 AM Spencer Graves
<spencer.graves at effectivedefense.org> wrote:
> [...]