Nick McClure
2013-Jul-26 16:43 UTC
[R] Externalptr class to character class from (web) scrape
I'm hitting a wall. When I use the 'scrape' function from the package 'scrapeR' to get the pagesource from a web page, I do the following: (as an example)

website.doc = parse("http://www.google.com")

When I look at it, it seems fine:

website.doc[[1]]

This seems to have the information I need. Then when I try to get it into a character vector,

character.website = as.character(website.doc[[1]])

I get the error:

Error in as.vector(x, "character") :
  cannot coerce type 'externalptr' to vector of type 'character'

I'm trying very very hard to wrap my head around how to get this external pointer to a character, but after reading many help files, I cannot understand how to do this. Any ideas?
Duncan Murdoch
2013-Jul-26 17:26 UTC
[R] Externalptr class to character class from (web) scrape
On 26/07/2013 12:43 PM, Nick McClure wrote:
> I'm hitting a wall. When I use the 'scrape' function from the package
> 'scrapeR' to get the pagesource from a web page, I do the following:
> (as an example)
>
> website.doc = parse("http://www.google.com")
>
> When I look at it, it seems fine:
>
> website.doc[[1]]
>
> This seems to have the information I need. Then when I try to get it
> into a character vector,
>
> character.website = as.character(website.doc[[1]])
>
> I get the error:
>
> Error in as.vector(x, "character") :
> cannot coerce type 'externalptr' to vector of type 'character'
>
> I'm trying very very hard to wrap my head around how to get this
> external pointer to a character, but after reading many help files, I
> cannot understand how to do this. Any ideas?

You should use str() in cases like this. When I look at str(website.doc[[1]]) (after producing website.doc with scrape(), not parse()), I see

> str(website.doc[[1]])
Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
 - attr(*, "headers")= Named chr [1:2] "<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>"| __truncated__ "</BODY></HTML>"
  ..- attr(*, "names")= chr [1:2] "<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>"| __truncated__ "</BODY></HTML>"

So it is an external pointer with a number of classes. One or more of those will have a print method. methods(print) will list all the print methods, and I see there's a (hidden) print.XMLInternalDocument method somewhere. Then

> getAnywhere("print.XMLInternalDocument")
A single object matching 'print.XMLInternalDocument' was found
It was found in the following places
  registered S3 method for print from namespace XML
  namespace:XML
with value

function (x, ...)
{
    cat(as(x, "character"), "\n")
}
<environment: namespace:XML>

shows that the as() generic should work, even though as.character() doesn't, and indeed

as(website.doc[[1]], "character")

does display something.

Duncan Murdoch
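
For anyone who wants to try the conversion without hitting the network, here is a minimal sketch. It assumes the XML package is installed; the HTML string and the names html, doc, txt1 and txt2 are made up for illustration and are not from the thread. It parses an inline document with htmlParse(), which should give the same kind of external-pointer object discussed above, and then converts it to a character string with as() and, as an alternative not mentioned in the thread, with saveXML().

```r
# Sketch only: assumes the XML package is installed.
# The HTML below is an invented stand-in for a scraped page.
library(XML)

html <- "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"

# Parse from a string instead of a URL; the result is an
# HTMLInternalDocument, i.e. a classed external pointer.
doc <- htmlParse(html, asText = TRUE)
class(doc)
typeof(doc)   # the underlying type is "externalptr"

# Convert the document to a single character string.
txt1 <- as(doc, "character")   # the coercion used by print.XMLInternalDocument
txt2 <- saveXML(doc)           # assumed alternative: serialise the document

is.character(txt1)
cat(txt1, "\n")
```

With a real scrape() result, the same calls would presumably be applied to website.doc[[1]] rather than to doc.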