thr3ads.net - R help - [R] [External] Re: help with web scraping [Jul 2020]

Rasmus Liland

2020-Jul-25 16:30 UTC

[R] [External] Re: help with web scraping

On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:
It is LILAND et al., is it not?  I do 
not belong to a large Confucian family 
structure (putting the hunter-gatherer 
horse-rider tribe name first in all-caps 
in the email), else it's customary to 
put a comma in there, isn't it? ... 
right, moving on:

On 2020-07-25 04:10, Rasmus Liland wrote:
>
>  ????? 
It might be a better idea to write the 
reply in plain-text utf-8 or at least 
Western or Eastern-European ISO euro 
encoding instead of us-ascii (maybe 
KOI8, ¯\_(ツ)_/¯) ...  something in your 
email got string-replaced by "?????" and 
also "?" got replaced by "?".

Please research using Thunderbird, Claws 
mail, or some other sane e-mail client; 
they are great, I promise.
> Please excuse:? Before my last post, I 
> had written code to do all that.? 
Good!
> In brief, the political offices are 
> "h3" tags.?
Yes, some type of header element at 
least, in-between the various tables, 
everything children of the div in the 
element tree.
> I used "strsplit" to split the string 
> at "<h3>".? I then wrote a 
> function to find "</h3>", extract the 
> political office and pass the rest to 
> "XML::readHTMLTable", adding columns 
> for party and political office.
Yes, doing that for the political office 
is also possible, but the party is 
inside the table's caption tag, which 
ends up as the name of the table in the 
XML::readHTMLTable list ...
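
The split-at-"<h3>" approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the HTML fragment, the helper name split_office, and the candidate names are made up here and are not the real Missouri Secretary of State markup or Spencer's actual function. The parsing step at the end is skipped when the XML package is not installed.

```r
# Hypothetical page fragment; the real page's markup may differ.
html <- paste0(
  "<div>",
  "<h3>Governor</h3>",
  "<table><caption>Republican</caption>",
  "<tr><th>Name</th></tr><tr><td>A. Example</td></tr></table>",
  "<h3>Attorney General</h3>",
  "<table><caption>Democratic</caption>",
  "<tr><th>Name</th></tr><tr><td>B. Example</td></tr></table>",
  "</div>"
)

# Split at each opening <h3>; the first piece precedes any office, so drop it.
pieces <- strsplit(html, "<h3>", fixed = TRUE)[[1]][-1]

# Hypothetical helper: text before "</h3>" is the office,
# everything after it is the table markup for that office.
split_office <- function(piece) {
  end <- as.integer(regexpr("</h3>", piece, fixed = TRUE))
  list(
    office = substr(piece, 1, end - 1),
    rest   = substr(piece, end + nchar("</h3>"), nchar(piece))
  )
}

sections <- lapply(pieces, split_office)
offices  <- vapply(sections, function(s) s$office, character(1))
print(offices)

# Optional parsing step, only if the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  for (s in sections) {
    doc  <- XML::htmlParse(s$rest, asText = TRUE)
    tabs <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)
    # Per the thread, a table's <caption> (here the party) shows up as
    # the name of its entry in the returned list.
    cat(s$office, ":", names(tabs), "\n")
  }
}
```

Keeping the office next to its markup in one list makes it straightforward to add an office column to each parsed table afterwards.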
> However, this suppressed "<br/>" 
> everywhere.?
Why is that, please explain.
> I thought there should be 
> an option with something like 
> "XML::readHTMLTable" that would not 
> delete "<br/>" everywhere, but I 
> couldn't find it.?
No, there is not, AFAIK.  Please, if 
anyone else knows, please say so *echoes 
in the forest*
> If you aren't aware of one, I can 
> gsub("<br/>", "\n", ...) on the string 
> for each political office before 
> passing it to "XML::readHTMLTable".? I 
> just tested this:? It works.
Such a great hack!  IMHO, this is much 
more flexible than using 
xml2::read_html, rvest::html_table, 
dplyr::mutate like here[1]
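
A minimal sketch of the gsub hack discussed above, assuming a made-up one-cell table rather than the real page. The variable names html and patched are illustrative only, and the parsing step runs only when the XML package is installed.

```r
# Hypothetical table with a line break inside a cell.
html <- paste0(
  "<table><tr><th>Candidate</th></tr>",
  "<tr><td>Jane Doe<br/>123 Main St</td></tr></table>"
)

# Replace the literal "<br/>" tag with a newline before parsing,
# so the break is kept as text instead of being dropped with the tag.
patched <- gsub("<br/>", "\n", html, fixed = TRUE)
cat(patched, "\n")

# Optional parsing step, only if the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(patched, asText = TRUE)
  tab <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]
  # Per the thread, the newline is expected to survive in the cell text.
  print(tab)
}
```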
> I have other web scraping problems in 
> my work plan for the next few days.?
Maybe, idk ... 
> I will definitely try 
> XML::htmlTreeParse, etc., as you 
> suggest.
I wish you good luck,
Rasmus

[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells


Spencer Graves

2020-Jul-25 17:43 UTC

[R] [External] Re: help with web scraping

Dear Rasmus Liland et al.:


On 2020-07-25 11:30, Rasmus Liland wrote:
> On 2020-07-25 09:56 -0500, Spencer Graves wrote:
>> Dear Rasmus et al.:
> 
> It is LILAND et al., is it not?  ... else it's customary to
> put a comma in there, isn't it? ...

The APA Style recommends "Sharp et al., 2007":


https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html


	  Regarding Confucius, I'm confused.


> right, moving on:
> 
> On 2020-07-25 04:10, Rasmus Liland wrote:
>>
<snip>
> 
> Please research using Thunderbird, Claws
> mail, or some other sane e-mail client;
> they are great, I promise.

Thanks.  I researched it and turned off HTML.  Please excuse:  I noticed 
it was a problem, but hadn't prioritized time to research and fix it 
until your comment.  Thanks.
> 
>> Please excuse:? Before my last post, I
>> had written code to do all that.?
> 
> Good!
> 
>> In brief, the political offices are
>> "h3" tags.?
> 
> Yes, some type of header element at
> least, in-between the various tables,
> everything children of the div in the
> element tree.
> 
>> I used "strsplit" to split the string
>> at "<h3>".? I then wrote a
>> function to find "</h3>", extract the
>> political office and pass the rest to
>> "XML::readHTMLTable", adding columns
>> for party and political office.
> 
> Yes, doing that for the political office
> is also possible, but the party is
> inside the table's caption tag, which
> ends up as the name of the table in the
> XML::readHTMLTable list ...
> 
>> However, this suppressed "<br/>"
>> everywhere.?
> 
> Why is that, please explain.
> 
	  I don't know why the Missouri Secretary of State's web site includes 
"<br/>" to signal a new line, but it does.  I also don't know why 
XML::readHTMLTable suppressed "<br/>" everywhere it occurred, but it did 
that.  After I used gsub to replace "<br/>" with "\n", I found that 
XML::readHTMLTable did not replace "\n", so I got what I wanted.

>> I thought there should be
>> an option with something like
>> "XML::readHTMLTable" that would not
>> delete "<br/>" everywhere, but I
>> couldn't find it.?
> 
> No, there is not, AFAIK.  Please, if
> anyone else knows, please say so *echoes
> in the forest*
> 
>> If you aren't aware of one, I can
>> gsub("<br/>", "\n", ...) on the string
>> for each political office before
>> passing it to "XML::readHTMLTable".? I
>> just tested this:? It works.
> 
> Such a great hack!  IMHO, this is much
> more flexible than using
> xml2::read_html, rvest::html_table,
> dplyr::mutate like here[1]
> 
>> I have other web scraping problems in
>> my work plan for the next few days.?
> 
> Maybe, idk ...
> 
>> I will definitely try
>> XML::htmlTreeParse, etc., as you
>> suggest.
> 
> I wish you good luck,
> Rasmus
> 
> [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells

	  And I added my solution to this problem to this Stackoverflow thread.


	  Thanks again,
	  Spencer

Rasmus Liland

2020-Jul-26 15:43 UTC

[R] [External] Re: help with web scraping

Dear GRAVES et al.,

On 2020-07-25 12:43 -0500, Spencer Graves wrote:
> Dear Rasmus Liland et al.:
> 
> On 2020-07-25 11:30, Rasmus Liland wrote:
> > On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> > > Dear Rasmus et al.:
> > 
> > It is LILAND et al., is it not?  ... else it's customary to
> > put a comma in there, isn't it? ...
> 
> The APA Style recommends "Sharp et al., 2007":
> 
> https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html
If "Sharp et al., 2007" is an APA 
citation of this book[*], Sharp is John A 
Sharp's surname, Liland is my surname.  
Q.E.D.

I have not used APA before (as I am not 
a Psychiatrist), as the minimalism of 
IEEE[**] always seemed more desirable.  
> Regarding Confucius, I'm confused.
Never mind, just fooling around, that's 
all.
> > On 2020-07-25 04:10, Rasmus Liland wrote:
> > > 
> > > However, this suppressed "<br/>"
> > > everywhere.?
> > 
> > Why is that, please explain.
> 
> I don't know why the Missouri 
> Secretary of State's web site includes 
> "<br/>" to signal a new line, but it 
> does.
Me neither!  On top of that, the 
self-closing <br /> is actually[***] 
XHTML syntax; plain HTML writes <br>, 
though HTML5 accepts both.
> I also don't know why 
> XML::readHTMLTable suppressed "<br/>" 
> everywhere it occurred, but it did 
> that.
Yes, I know, I also observed this.  But 
now we swiftly solved this by gsubbing it 
with the newline char, "\n", which 
carries no meaning for HTML parsers 
anyway. 
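
Since the tag can be written as <br>, <br/> or <br />, a slightly more tolerant version of the substitution might look like the sketch below. The function name normalize_br is hypothetical, and the pattern deliberately ignores tags that carry attributes (e.g. <br class="x">); it is an illustration, not something taken from the thread.

```r
# Hypothetical helper: turn any attribute-free <br> variant into a newline.
# "\\s*/?" allows optional whitespace and an optional slash before ">".
normalize_br <- function(x) {
  gsub("<br\\s*/?>", "\n", x, ignore.case = TRUE, perl = TRUE)
}

cell <- "Line one<br>Line two<br/>Line three<br />Line four<BR>Line five"
cleaned <- normalize_br(cell)
cat(cleaned, "\n")
```

Using a regular expression instead of a fixed string means the same call covers pages that mix HTML and XHTML spellings of the tag.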
> > > If you aren't aware of one, I can
> > > gsub("<br/>", "\n", ...) on the string
> > > for each political office before
> > > passing it to "XML::readHTMLTable".? I
> > > just tested this:? It works.
> > 
> > Such a great hack!  IMHO, this is much
> > more flexible than using
> > xml2::read_html, rvest::html_table,
> > dplyr::mutate like here[1]
> > 
> > [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
> 
> And I added my solution to this 
> problem to this Stackoverflow thread.
I wish you many upvotes, alas the 
political competition is obviously not 
tough there, as the other guy just got 
one down vote.

[*] https://www.amazon.co.uk/Management-Student-Research-Project/dp/0566084902 
[**] https://pitt.libguides.com/citationhelp/ieee
[***] https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br

