thr3ads.net - R help - [R] Using Rvest to scrape pages [Jul 2020]


Tiffany Adekola

2020-Jul-12 17:42 UTC

[R] Using Rvest to scrape pages

Dear All,

I am just learning to program in R. I want to extract the reviews
from a page, then loop until I have extracted them from all pages:

#specify the first page URL
fpURL <- 'https://wordpress.org/support/plugin/easyrecipe/reviews/'

#read the HTML contents in the first page URL
contentfpURL <- read_html(fpURL)

#identify the anchor tags in the first page URL
fpAnchors <- html_nodes(contentfpURL, css='a.bbp-topic-permalink')

#extract the HREF attribute value of each anchor tag
fpHREF <- html_attr(fpAnchors, 'href')

#create empty lists to store titles & contents found in the HREF
#attribute value of each anchor tag
titles = c()
contents = c()

#loop the following actions for each HREF found firstpage
for (u in fpHREF) {

   #read the HTML content of the review page
   fpURL = read_html(u)

  #identify the title anchor and read the title text
  fpreviewT = html_text(html_nodes(fpURL, css='h1.page-title'))

  #identify the content anchor and read the content text
  fpreviewC = html_text(html_nodes(fpURL, css='div.bbp-topic-content'))

  #store the review titles and contents in the previous lists
  titles = c(titles, fpreviewT)
  contents = c(contents, fpreviewC)
}
#identify the anchor tag pointing to the next summary page
npAnchor <- html_text(html_node(contentfpURL, css='a.next page-numbers'))

#extract the HREF attribute value of the anchor tag pointing to the
#next summary page
npHREF <- html_attr(npAnchor, 'href')

#loop the following actions for every next summary page HREF attribute
for (u in npHREF) {

  #specify the URL of the summary page
  spURL <- read_html('npHREF')

  #identify all the anchor tags on that summary page
  spAnchors <- html_nodes(spURL, css='a.bbp-topic-permalink')

  #extract the HREF attribute value of each anchor tag
  spHREF <- html_attr(spAnchors, 'href')

  #loop the following actions for each HREF found on that summarypage

   for (u in fpHREF) {
     #read the HTML contents of the review page
     spURL = read_html(u)

      #identify the title anchor and read the title text
      spreviewT = html_text(html_nodes(spURL, css='h1.page-title'))

      #identify the content anchor and read the content text
      spreviewC = html_text(html_nodes(spURL, css='div.bbp-topic-content'))

      #store the review titles and contents in the previous lists
      titles = c(titles, spreviewT)
      contents = c(contents, spreviewC)
      }
}

I got stuck at the step to extract the HREF attribute value of the
anchor tag pointing to the next summary page with the error:

Error in UseMethod("xml_attr") :
  no applicable method for 'xml_attr' applied to an object of class "character"

I would appreciate any help with this task.
Thanks in advance.

---Tiffany

David Winsemius

2020-Jul-12 23:28 UTC


On 7/12/20 10:42 AM, Tiffany Adekola wrote:
> Dear All,
>
> I am just learning how to use R programming. I want to extract reviews
> from a page and loop till I extract for all pages:
>
> #specify the first page URL
> fpURL <- 'https://wordpress.org/support/plugin/easyrecipe/reviews/'
>
> #read the HTML contents in the first page URL
> contentfpURL <- read_html(fpURL)
>
> #identify the anchor tags in the first page URL
> fpAnchors <- html_nodes(contentfpURL, css='a.bbp-topic-permalink')
>
> #extract the HREF attribute value of each anchor tag
> fpHREF <- html_attr(fpAnchors, 'href')
>
> #create empty lists to store titles & contents found in the HREF
> attribute value of each anchor tag
> titles = c()
> contents = c()
>
> #loop the following actions for each HREF found firstpage
> for (u in fpHREF) {
>
>     #read the HTML content of the review page
>     fpURL = read_html(u)
>
>    #identify the title anchor and read the title text
>    fpreviewT = html_text(html_nodes(fpURL, css='h1.page-title'))
>
>    #identify the content anchor and read the content text
>    fpreviewC = html_text(html_nodes(fpURL, css='div.bbp-topic-content'))
>
>    #store the review titles and contents in the previous lists
>    titles = c(titles, fpreviewT)
>    contents = c(contents, fpreviewC)
> }
> #identify the anchor tag pointing to the next summary page
> npAnchor <- html_text(html_node(contentfpURL, css='a.next page-numbers'))
>
> #extract the HREF attribute value of the anchor tag pointing to the
> next summary page
> npHREF <- html_attr(npAnchor, 'href')

The error is raised by the line above, but if you look at the argument
passed to `html_attr` you will see that the problem starts higher up:

str(npAnchor)
# chr NA
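To illustrate why `html_attr` complains, here is a minimal sketch that reproduces the same error offline. The one-link page and the `a.next` selector are made up for the illustration and are not taken from the WordPress site.

```r
library(rvest)

# Hypothetical one-link page, used only for illustration.
page <- read_html('<html><body><a class="next" href="/page/2/">Next</a></body></html>')

node <- html_node(page, css = 'a.next')
class(node)             # a node object, which html_attr() accepts
class(html_text(node))  # a character vector: only the link text is left

# Passing that text to html_attr() reproduces the error from the question.
msg <- tryCatch(html_attr(html_text(node), 'href'),
                error = function(e) conditionMessage(e))
msg
```

Once `html_text` has been applied, the result is plain text with no attributes left to read, so the `href` has to be taken from the node itself.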

Perhaps the problem occurs here:

html_node(contentfpURL, css='a.next page-numbers')
#{xml_missing}
#<NA>
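One likely reading of that result: in a CSS selector a space is the descendant combinator, so 'a.next page-numbers' asks for a <page-numbers> element inside an <a class="next">, which does not exist. An anchor that carries both classes is matched by chaining them with dots. The sketch below assumes the pagination link looks like the made-up markup shown (the real page may differ), and it also drops the `html_text` call so that `html_attr` receives a node.

```r
library(rvest)

# Hypothetical stand-in for the pagination markup; the real page may differ.
page <- read_html('<html><body>
  <a class="next page-numbers" href="https://example.org/reviews/page/2/">Next</a>
</body></html>')

# With a space, the selector looks for a <page-numbers> element nested
# inside <a class="next">, so nothing matches.
missingNode <- html_node(page, css = 'a.next page-numbers')

# With a dot, the selector matches one <a> that has both classes.
npAnchor <- html_node(page, css = 'a.next.page-numbers')

# html_attr() is applied to the node, not to the text from html_text().
npHREF <- html_attr(npAnchor, 'href')
npHREF
```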

-- 

David.
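The second loop in the question has two further issues that are not covered above: `read_html('npHREF')` passes the literal string 'npHREF' rather than the variable, and the inner loop iterates over `fpHREF` instead of `spHREF`. It also follows only one "next" link. A rough sketch of one way to walk every summary page is given below. The helper `collect_review_links`, its `fetch` argument, and the in-memory `pages` list are all hypothetical and exist only so the example runs without network access; the selectors are the ones from the question, with the pagination selector written as 'a.next.page-numbers' on the assumption that the link carries both classes.

```r
library(rvest)

# Hypothetical in-memory "site"; the list names play the role of URLs.
pages <- list(
  p1 = '<html><body><a class="bbp-topic-permalink" href="r1">A</a>
        <a class="next page-numbers" href="p2">Next</a></body></html>',
  p2 = '<html><body><a class="bbp-topic-permalink" href="r2">B</a>
        <a class="bbp-topic-permalink" href="r3">C</a></body></html>'
)

# Hypothetical helper: `fetch` turns a URL into a parsed page.
collect_review_links <- function(start, fetch = read_html) {
  links <- character()
  url <- start
  while (!is.na(url)) {
    page <- fetch(url)
    # gather the review links on this summary page
    anchors <- html_nodes(page, css = 'a.bbp-topic-permalink')
    links <- c(links, html_attr(anchors, 'href'))
    # html_attr() returns NA when there is no "next" link, which ends the loop
    url <- html_attr(html_node(page, css = 'a.next.page-numbers'), 'href')
  }
  links
}

reviewLinks <- collect_review_links('p1',
                                    fetch = function(u) read_html(pages[[u]]))
reviewLinks
```

Against the real site the default `fetch = read_html` would be used with the first summary page URL, and each collected link could then be read in a single loop like the first one in the question.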
