Hi Simon
Unfortunately, it works for me on my OS X machine, so I can't reproduce the
problem.
I'd be curious to know which version of libxml2 you are using. That might
be the cause of the problem.
You can find this with

  library(XML)
  libxmlVersion()

You might install a more recent version (e.g. libxml2 >= 2.07.0).
You can send the info to me off list and we can try to resolve the problem.
htmlParse() returns a reference to the internal C-level XML tree/document.
When you print the value of the variable .x, we then serialize that C-level
data structure to a string.
htmlTreeParse(), by default, converts that C-level XML tree/document into
regular R objects. So it traverses the tree and creates those R list()s
before it returns, and then throws the C-level tree away.
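As a rough sketch of that difference, and of how the links could be pulled out once htmlParse() works on your machine: the inline HTML string and the example.org URLs below are made up for illustration and only stand in for the real search page.

```r
library(XML)

# Hypothetical stand-in for the search page: two links in a tiny document.
html <- paste0(
  '<html><body>',
  '<a href="http://example.org/a">A</a>',
  '<a href="http://example.org/b">B</a>',
  '</body></html>'
)

# htmlParse() keeps the C-level tree and returns a reference to it.
doc <- htmlParse(html, asText = TRUE)

# XPath queries run directly against that C-level tree.
links <- unname(xpathSApply(doc, "//a/@href"))

# htmlTreeParse(), by default, converts the tree into regular R objects
# and discards the C-level tree before returning.
tree <- htmlTreeParse(html, asText = TRUE)
root <- xmlRoot(tree)
```

Here links should hold the two href values, while tree is an ordinary R list structure that XPath functions cannot query.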
D.
On 9/5/11 2:48 PM, Simon Kiss wrote:
> Dear colleagues,
> Each time I use htmlParse, R crashes or hangs. The URL I'd like to
> parse is included below, as are the results of a series of basic commands
> that describe what I'm experiencing. The results of sessionInfo() are
> attached at the bottom of the message.
> The thing is, htmlTreeParse appears to work just fine, although it
> doesn't appear to contain the information I need (the URLs of the articles
> linked to on this search page). Regardless, I'd still like to understand
> why htmlParse doesn't work.
> Thank you for any insight.
> Yours,
> Simon Kiss
>
>
>
> myurl<-c("http://timesofindia.indiatimes.com/searchresult.cms?sortorder=score&searchtype=2&maxrow=10&startdate=2001-01-01&enddate=2011-08-25&article=2&pagenumber=1&isphrase=no&query=IIM&searchfield=&section=&kdaterange=30&date1mm=01&date1dd=01&date1yyyy=2001&date2mm=08&date2dd=25&date2yyyy=2011")
>
> .x<-htmlParse(myurl)
>
> class(.x)
> #returns "HTMLInternalDocument" "XMLInternalDocument"
>
> .x
> #returns
> *** caught segfault ***
> address 0x1398754, cause 'memory not mapped'
>
> Traceback:
> 1: .Call("RS_XML_dumpHTMLDoc", doc, as.integer(indent), as.character(encoding), as.logical(indent), PACKAGE = "XML")
> 2: saveXML(from)
> 3: saveXML(from)
> 4: asMethod(object)
> 5: as(x, "character")
> 6: cat(as(x, "character"), "\n")
> 7: print.XMLInternalDocument(<pointer: 0x11656d3e0>)
> 8: print(<pointer: 0x11656d3e0>)
>
> Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace
>
> sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] XML_3.4-0 RCurl_1.5-0 bitops_1.0-4.1
> *********************************
> Simon J. Kiss, PhD
> Assistant Professor, Wilfrid Laurier University
> 73 George Street
> Brantford, Ontario, Canada
> N3T 2C9
> Cell: +1 905 746 7606
>