On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:

It is LILAND et al., is it not? I do not belong to a large Confucian
family structure (putting the hunter-gatherer horse-rider tribe name
first in all-caps in the email), else it's customary to put a comma in
there, isn't it? ... right, moving on:

On 2020-07-25 04:10, Rasmus Liland wrote:
> > ?????

It might be a better idea to write the reply in plain-text utf-8 or at
least Western or Eastern-European ISO euro encoding instead of us-ascii
(maybe KOI8, ¯\_(ツ)_/¯) ... something in your email got string-replaced
by "?????" and also "?" got replaced by "?". Please research using
Thunderbird, Claws mail, or some other sane e-mail client; they are
great, I promise.

> Please excuse: Before my last post, I
> had written code to do all that.

Good!

> In brief, the political offices are
> "h3" tags.

Yes, some type of header element at least, in between the various
tables, all of them children of the div in the element tree.

> I used "strsplit" to split the string
> at "<h3>". I then wrote a
> function to find "</h3>", extract the
> political office and pass the rest to
> "XML::readHTMLTable", adding columns
> for party and political office.

Yes, doing that for the political office is also possible, but the party
is inside the table's caption tag, which ends up as the name of the
table in the XML::readHTMLTable list ...

> However, this suppressed "<br/>"
> everywhere.

Why is that? Please explain.

> I thought there should be
> an option with something like
> "XML::readHTMLTable" that would not
> delete "<br/>" everywhere, but I
> couldn't find it.

No, there is not, AFAIK. If anyone else knows, please say so *echoes in
the forest*

> If you aren't aware of one, I can
> gsub("<br/>", "\n", ...) on the string
> for each political office before
> passing it to "XML::readHTMLTable". I
> just tested this: It works.

Such a great hack! IMHO, this is much more flexible than using
xml2::read_html, rvest::html_table, dplyr::mutate like here[1]

> I have other web scraping problems in
> my work plan for the next few days.

Maybe, idk ...

> I will definitely try
> XML::htmlTreeParse, etc., as you
> suggest.

I wish you good luck,
Rasmus

[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
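
The strsplit approach quoted above might look roughly like this in base
R. This is only a sketch: the `page` string is a made-up fragment (not
the real Missouri page), and `split_office` is a hypothetical helper
name, not code from the thread.

```r
# Hypothetical fragment: two offices, each an <h3> followed by a table.
page <- paste0(
  "<div>",
  "<h3>Governor</h3><table><tr><td>A</td></tr></table>",
  "<h3>Secretary of State</h3><table><tr><td>B</td></tr></table>",
  "</div>"
)

# Split at every "<h3>"; the first piece is whatever precedes the first
# header, so drop it.
chunks <- strsplit(page, "<h3>", fixed = TRUE)[[1]][-1]

# Hypothetical helper: separate the office name from the remaining HTML.
split_office <- function(chunk) {
  end <- as.integer(regexpr("</h3>", chunk, fixed = TRUE))
  list(
    office = substr(chunk, 1, end - 1),
    rest   = substr(chunk, end + nchar("</h3>"), nchar(chunk))
  )
}

parts   <- lapply(chunks, split_office)
offices <- vapply(parts, `[[`, character(1), "office")
print(offices)
```

Each `rest` element is then the string one would hand on to the table
reader, with `office` added as a column afterwards.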
Dear Rasmus Liland et al.:

On 2020-07-25 11:30, Rasmus Liland wrote:
> On 2020-07-25 09:56 -0500, Spencer Graves wrote:
>> Dear Rasmus et al.:
>
> It is LILAND et al., is it not? ... else it's customary to
> put a comma in there, isn't it? ...

The APA Style recommends "Sharp et al., 2007":

https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html

Regarding Confucius, I'm confused.

> right, moving on:
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
>>
<snip>
>
> Please research using Thunderbird, Claws
> mail, or some other sane e-mail client;
> they are great, I promise.

Thanks. I researched it and turned off HTML. Please excuse: I noticed it
was a problem, but hadn't prioritized time to research and fix it until
your comment. Thanks.

>
>> Please excuse: Before my last post, I
>> had written code to do all that.
>
> Good!
>
>> In brief, the political offices are
>> "h3" tags.
>
> Yes, some type of header element at
> least, in-between the various tables,
> everything children of the div in the
> element tree.
>
>> I used "strsplit" to split the string
>> at "<h3>". I then wrote a
>> function to find "</h3>", extract the
>> political office and pass the rest to
>> "XML::readHTMLTable", adding columns
>> for party and political office.
>
> Yes, doing that for the political office
> is also possible, but the party is
> inside the table's caption tag, which
> ends up as the name of the table in the
> XML::readHTMLTable list ...
>
>> However, this suppressed "<br/>"
>> everywhere.
>
> Why is that? Please explain.
>

I don't know why the Missouri Secretary of State's web site includes
"<br/>" to signal a new line, but it does. I also don't know why
XML::readHTMLTable suppressed "<br/>" everywhere it occurred, but it did
that. After I used gsub to replace "<br/>" with "\n", I found that
XML::readHTMLTable did not replace "\n", so I got what I wanted.

>> I thought there should be
>> an option with something like
>> "XML::readHTMLTable" that would not
>> delete "<br/>" everywhere, but I
>> couldn't find it.
>
> No, there is not, AFAIK. If
> anyone else knows, please say so *echoes
> in the forest*
>
>> If you aren't aware of one, I can
>> gsub("<br/>", "\n", ...) on the string
>> for each political office before
>> passing it to "XML::readHTMLTable". I
>> just tested this: It works.
>
> Such a great hack! IMHO, this is much
> more flexible than using
> xml2::read_html, rvest::html_table,
> dplyr::mutate like here[1]
>
>> I have other web scraping problems in
>> my work plan for the next few days.
>
> Maybe, idk ...
>
>> I will definitely try
>> XML::htmlTreeParse, etc., as you
>> suggest.
>
> I wish you good luck,
> Rasmus
>
> [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells

And I added my solution to this problem to this Stackoverflow thread.

Thanks again,
Spencer

>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
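
For readers who want to try the workaround described in this message, a
minimal sketch follows. It assumes the XML package is installed; the
`html` string is an invented one-table example, not the actual page, and
the column names are made up for illustration.

```r
library(XML)  # provides htmlParse() and readHTMLTable()

# Hypothetical fragment with a line break inside one cell.
html <- paste0(
  "<table>",
  "<tr><th>Candidate</th><th>Address</th></tr>",
  "<tr><td>Jane Doe</td><td>1 Main St<br/>Springfield, MO</td></tr>",
  "</table>"
)

# Replace every literal "<br/>" with a newline before parsing, so the
# break survives as ordinary text inside the cell.
html_nl <- gsub("<br/>", "\n", html, fixed = TRUE)

# Parse the string explicitly as text, then read the first table.
doc <- htmlParse(html_nl, asText = TRUE)
tab <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

print(tab)
```

Using `fixed = TRUE` treats "<br/>" as a plain string rather than a
regular expression, which matches the gsub call quoted in the thread.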
Dear GRAVES et al.,

On 2020-07-25 12:43 -0500, Spencer Graves wrote:
> Dear Rasmus Liland et al.:
>
> On 2020-07-25 11:30, Rasmus Liland wrote:
> > On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> > > Dear Rasmus et al.:
> >
> > It is LILAND et al., is it not? ... else it's customary to
> > put a comma in there, isn't it? ...
>
> The APA Style recommends "Sharp et al., 2007":
>
> https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html

If "Sharp et al., 2007" is an APA citation of this book[*], Sharp is
John A Sharp's surname, Liland is my surname. Q.E.D. I have not used APA
before (as I am not a Psychiatrist), as the minimalism of IEEE[**]
always seemed more desirable.

> Regarding Confucius, I'm confused.

Never mind, just fooling around, that's all.

> > On 2020-07-25 04:10, Rasmus Liland wrote:
> > >
> > > However, this suppressed "<br/>"
> > > everywhere.
> >
> > Why is that? Please explain.
>
> I don't know why the Missouri
> Secretary of State's web site includes
> "<br/>" to signal a new line, but it
> does.

Me neither! On top of that, <br /> is actually[***] the XHTML spelling
of the tag; plain HTML writes <br>.

> I also don't know why
> XML::readHTMLTable suppressed "<br/>"
> everywhere it occurred, but it did
> that.

Yes, I know, I also observed this. But now we swiftly solved this by
gsubbing it with the newline character, "\n", which carries no meaning
for HTML parsers anyway.

> > > If you aren't aware of one, I can
> > > gsub("<br/>", "\n", ...) on the string
> > > for each political office before
> > > passing it to "XML::readHTMLTable". I
> > > just tested this: It works.
> >
> > Such a great hack! IMHO, this is much
> > more flexible than using
> > xml2::read_html, rvest::html_table,
> > dplyr::mutate like here[1]
> >
> > [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
>
> And I added my solution to this
> problem to this Stackoverflow thread.

I wish you many upvotes; alas, the political competition is obviously
not tough there, as the other guy just got one downvote.

[*] https://www.amazon.co.uk/Management-Student-Research-Project/dp/0566084902
[**] https://pitt.libguides.com/citationhelp/ieee
[***] https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br
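
Since the tag can be spelled <br>, <br/> or <br /> depending on the
page, the gsub step could be loosened to accept all three. The sketch
below is an assumption about what a more tolerant pattern might look
like, not something tested in the thread; the input string `x` is
invented.

```r
# Hypothetical input mixing the HTML and XHTML spellings of the tag.
x <- "a<br>b<br/>c<br />d<BR>e"

# "<br", optional whitespace, optional "/", then ">"; case-insensitive
# so that "<BR>" is matched as well.
y <- gsub("<br\\s*/?>", "\n", x, ignore.case = TRUE, perl = TRUE)

cat(y, "\n")
```

This stays a plain string substitution done before parsing, so it only
makes sense when the literal text "<br" cannot occur inside cell content
for other reasons.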