thr3ads.net - R help - [R] import HTML tables [May 2009]

Dimitri Szerman

2009-May-12 14:57 UTC

[R] import HTML tables

Hello,
I was wondering if there is a function in R that imports tables directly
from an HTML document. I know there are functions (say, getURL() from {RCurl})
that download the entire page source, but here I refer to something like
Google Docs' importHTML() function (if you don't know this function, go
check it out, it's very useful). Anyway, if someone knows of something that
does this job, I'd be very grateful if you could let me know. Otherwise,
here's a suggestion for R developers: please do write something inspired by
Google's importHTML() function.

Many thanks,
Dimitri

Dieter Menne

2009-May-13 08:10 UTC

Dimitri Szerman-2 wrote:
> Hello,
> I was wondering if there is a function in R that imports tables directly
> from a HTML document.

The XML package can do this:

http://markmail.org/message/cyicoa3htme4gei2

Duncan Temple Lang:

The htmlParse() and htmlTreeParse() functions in the XML package use the
non-strict HTML parser in libxml2 and so the HTML document can be malformed. 

Dieter

Duncan Temple Lang

2009-May-13 13:55 UTC

Dieter Menne wrote:
> Dimitri Szerman-2 wrote:
>> Hello,
>> I was wondering if there is a function in R that imports tables directly
>> from a HTML document.
>
> The XML package can do this:
>
> http://markmail.org/message/cyicoa3htme4gei2
>
> Duncan Temple Lang:
>
> The htmlParse() and htmlTreeParse() functions in the XML package use the
> non-strict HTML parser in libxml2 and so the HTML document can be malformed.

Indeed. Thanks Dieter.

htmlParse() reads the document; getNodeSet() lets us easily find the
table or tables of interest. We can also find the th and td entries
easily using XPath.

The less automated part is how to meaningfully process the content.
That is where a human should be involved, deciding whether to trim
white space, how to convert text to values, and how to deal with
missing cells. We can do a lot by default, but ...
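
The steps described above could be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the sample HTML, the table id "scores", and the "n/a" marker for a missing value are all made up for the example, and it assumes the XML package is installed.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical sample page, invented for this sketch.
html <- "<html><body><table id='scores'>
<tr><th>name</th><th>score</th></tr>
<tr><td> Ann </td><td>12</td></tr>
<tr><td> Bob </td><td>n/a</td></tr>
</table></body></html>"

# htmlParse() reads the document (here from a string rather than a URL).
doc <- htmlParse(html, asText = TRUE)

# getNodeSet() finds the table of interest via XPath.
tbl <- getNodeSet(doc, "//table[@id='scores']")[[1]]

# Header cells (th) and data rows (tr elements that contain td cells).
header <- xpathSApply(tbl, ".//th", xmlValue)
rows <- getNodeSet(tbl, ".//tr[td]")

# Human decision 1: trim white space around each cell's text.
cells <- lapply(rows, function(r) trimws(xpathSApply(r, "./td", xmlValue)))

df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(df) <- header

# Human decisions 2 and 3: treat "n/a" as missing, then convert to numbers.
df$score[df$score == "n/a"] <- NA
df$score <- as.numeric(df$score)

print(df)
```

Each of the three commented "human decisions" is a choice that fits this particular table; a different page may well need different rules.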

There is a relatively simple function at

   http://www.omegahat.org/ParseXML/readHTMLTable.R

that provides something resembling read.table().
It is not well tested because, in the past, I have just used XPath
directly: once you know XPath, extracting content from HTML/XML is
very straightforward.
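
For comparison, a read.table-like call might look like the sketch below. It assumes the installed version of the XML package exports a readHTMLTable() function (current CRAN releases do); whether it behaves exactly like the script linked above is not something this thread confirms. The sample HTML is again invented for illustration.

```r
library(XML)  # assumes a version of XML that exports readHTMLTable()

# Hypothetical sample page, invented for this sketch.
html <- "<html><body><table><tr><th>name</th><th>score</th></tr><tr><td>Ann</td><td>12</td></tr><tr><td>Bob</td><td>7</td></tr></table></body></html>"

doc <- htmlParse(html, asText = TRUE)

# which = 1 asks for the first table in the document as a data frame;
# stringsAsFactors = FALSE keeps cell text as character strings.
tab <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

print(tab)
```

The columns come back as text unless told otherwise, so type conversion remains a separate, deliberate step.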

   D.

