Hi,
I'm trying to use rvest to scrape a page, and I am having difficulty
excluding superscripts in child elements via a CSS selector. For example,
here I've read the HTML and selected the nodes:
library(rvest)
p <- read_html(targetUrl)  # targetUrl holds the URL of the page
p %>% html_nodes("td.xyz")
The result looks something like this:
{xml_nodeset (20)}
[1] <td class="xyz" width="50%">Foo<font size="-1"><sup>9</sup></font>:</td>
[2] <td class="xyz" width="50%">Bar<font size="-1"><sup>3</sup></font>:</td>
[...]
I would like to extract just the words "Foo" and "Bar", without the
superscripts, using html_text(). I thought something like this would
work, but it returns just the superscripts:
p %>%
html_nodes("td.xyz") %>%
html_nodes(":not(sup)") %>%
html_text()
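
In case it helps to reproduce this without the real page, here is a minimal sketch that builds the same kind of markup inline. The HTML string and the names snippet, doc, matched and txt are made up for illustration, based on the two cells shown above; the real page may differ.

```r
library(rvest)

# Hypothetical stand-in for the real page: two cells copied from the
# node set printed above, wrapped in a table so the parser keeps them.
snippet <- paste0(
  '<table><tr>',
  '<td class="xyz" width="50%">Foo<font size="-1"><sup>9</sup></font>:</td>',
  '<td class="xyz" width="50%">Bar<font size="-1"><sup>3</sup></font>:</td>',
  '</tr></table>'
)
doc <- read_html(snippet)

# Same pipeline as above, split in two so the matched nodes can be inspected.
matched <- doc %>%
  html_nodes("td.xyz") %>%
  html_nodes(":not(sup)")

# Which elements did ":not(sup)" actually pick up?
print(html_name(matched))

# The text that comes back is the superscripts, not "Foo" and "Bar".
txt <- html_text(matched)
print(txt)
```

Printing html_name(matched) suggests the selector is matching the <font> elements inside each cell, which wrap the superscripts, rather than the cell text itself.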
Perhaps I'm using the :not() selector improperly. Any suggestions on how to
get this to work properly? Thanks.
James