thr3ads.net - R help - [R] rvest and the not css selector [Oct 2015]

James Toll

2015-Oct-15 21:05 UTC

[R] rvest and the not css selector

Hi,

I'm trying to use rvest to scrape a page, and I am having difficulty
excluding superscript child elements via a CSS selector.  For example, here
I've read the HTML and selected the nodes.


library(rvest)

p <- read_html(targetUrl)
p %>% html_nodes("td.xyz")


The result looks something like this:

{xml_nodeset (20)}
 [1] <td class="xyz" width="50%">Foo<font size="-1"><sup>9</sup></font>:</td>
 [2] <td class="xyz" width="50%">Bar<font size="-1"><sup>3</sup></font>:</td>
[...]


I would like to extract the words "Foo" and "Bar" without the
superscripts by passing the result along to html_text().  I thought something
like this would work, but it returns just the superscripts.

p %>%
  html_nodes("td.xyz") %>%
  html_nodes(":not(sup)") %>%
  html_text()


Perhaps I'm using the not selector improperly.  Any suggestions on how to get
this to work properly?  Thanks.


James
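
For illustration only, here is a minimal sketch of one possible way to get at
the labels, assuming the cells really have the structure shown in the output
above.  The inline HTML string is a hypothetical stand-in for the real page
(targetUrl is not reproduced here), and the names cells and labels are made up
for this example.  The idea is that ":not(sup)" is matched against the
elements inside each td, so it can pick up the <font> wrapper, whose text is
the superscript; an XPath text() step instead asks for the text that sits
directly inside the td.

```r
library(rvest)

# Hypothetical stand-in for the scraped page, built from the two rows
# shown in the question.
html <- '<table><tr>
  <td class="xyz" width="50%">Foo<font size="-1"><sup>9</sup></font>:</td>
  <td class="xyz" width="50%">Bar<font size="-1"><sup>3</sup></font>:</td>
</tr></table>'
p <- read_html(html)

cells <- p %>% html_nodes("td.xyz")

# "./text()[1]" selects the first text node that is a direct child of
# each td, i.e. the text before the <font> element.
labels <- cells %>%
  html_nodes(xpath = "./text()[1]") %>%
  html_text()

print(labels)
```

This relies on the label being the first piece of text in each cell; if the
real page has leading whitespace or other markup before the label, the XPath
expression would need adjusting.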
