Hello,

I was wondering if there is a function in R that imports tables directly from an HTML document. I know there are functions (say, getURL() from {RCurl}) that download the entire page source, but here I mean something like the importHTML() function in Google Docs (if you don't know this function, go check it out; it's very useful).

Anyway, if someone knows of something that does this job, I'd be very grateful if you could let me know. Otherwise, here's a suggestion for R developers: please do write something inspired by Google's importHTML() function.

Many thanks,
Dimitri
Dimitri Szerman-2 wrote:
> Hello,
> I was wondering if there is a function in R that imports tables directly
> from a HTML document.

The XML package can do this:

http://markmail.org/message/cyicoa3htme4gei2

Duncan Temple Lang:

The htmlParse() and htmlTreeParse() functions in the XML package use the non-strict HTML parser in libxml2 and so the HTML document can be malformed.

Dieter
Dieter Menne wrote:
> Dimitri Szerman-2 wrote:
>> Hello,
>> I was wondering if there is a function in R that imports tables directly
>> from a HTML document.
>
> The XML package can do this:
>
> http://markmail.org/message/cyicoa3htme4gei2
>
> Duncan Temple Lang:
>
> The htmlParse() and htmlTreeParse() functions in the XML package use the
> non-strict HTML parser in libxml2 and so the HTML document can be malformed.
>
> Dieter

Indeed. Thanks, Dieter.

htmlParse() reads the document; getNodeSet() lets us easily find the table or tables of interest. We can also find the th and td entries easily using XPath.

The less automated part is how to process the content meaningfully. That is where a human should be involved: deciding whether to trim white space, how to convert text to values, and how to deal with missing cells. We can do a lot by default, but ...

There is a relatively simple function at http://www.omegahat.org/ParseXML/readHTMLTable.R that provides something resembling read.table. It is not well tested because, in the past, I have simply used XPath directly: once you know XPath, extracting content from HTML/XML is very straightforward.

D.
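For anyone reading the archive, here is a minimal sketch of the XPath approach described above, assuming the XML package is installed. The inline HTML string, the column names, and the "n/a" placeholder for a missing cell are all made up for illustration; the trimming and conversion choices are exactly the kind of decisions the message says a human should make, so treat them as one possible set of defaults rather than the way to do it.

```r
library(XML)

# Hypothetical inline HTML so the sketch needs no network access.
html <- "<html><body><table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td> alice </td><td>10</td></tr>
  <tr><td>bob</td><td>n/a</td></tr>
</table></body></html>"

# Parse with libxml2's lenient HTML parser.
doc <- htmlParse(html, asText = TRUE)

# Find the table(s) of interest with XPath and take the first one.
tables <- getNodeSet(doc, "//table")
tbl <- tables[[1]]

# Header cells, then the rows that contain data cells, relative to the table.
header <- xpathSApply(tbl, ".//th", xmlValue)
rows <- getNodeSet(tbl, ".//tr[td]")

# Extract the td text of each row and trim surrounding white space.
cells <- lapply(rows, function(r) trimws(xpathSApply(r, "./td", xmlValue)))

# Assemble a data frame; every column is still character at this point.
df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(df) <- header

# Assumed convention: "n/a" marks a missing value in the score column.
df$score[df$score == "n/a"] <- NA
df$score <- as.numeric(df$score)
```

For a real page, the file name or URL would be passed to htmlParse() in place of the string and asText = TRUE, and the XPath expression would probably need to be narrowed to pick out the one table you want.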