thr3ads.net - R help - [R] Read HTML table [Nov 2007]

f.jamitzky

2007-Nov-18 23:38 UTC

[R] Read HTML table

You can use htmlTreeParse and xpathApply from the XML package.
Something like:

library(XML)
xpathApply(htmlTreeParse("http://blabla", useInternalNodes = TRUE),
           "//td", function(x) xmlValue(x))

should do it.
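
If the page has already been pulled in with readLines(), as in the question, the same XPath query can probably be run on the text directly. A minimal sketch, assuming the XML package is installed; td_values is a hypothetical helper name and the sample rows are made up for illustration:

```r
# Hypothetical helper: return the text of every <td> cell in document order.
td_values <- function(html_lines) {
  # asText = TRUE tells the parser the input is HTML source, not a file or URL
  doc <- XML::htmlTreeParse(paste(html_lines, collapse = "\n"),
                            asText = TRUE, useInternalNodes = TRUE)
  # xpathSApply simplifies the per-node results to a character vector
  XML::xpathSApply(doc, "//td", XML::xmlValue)
}

# Made-up sample in the shape of the table from the question
page <- c("<html><body><table>",
          "<tr><td>SEQUENCE</td><td>PAPER</td><td>LAST</td></tr>",
          "<tr><td>184311995</td><td>SX50PI</td><td>947.5900</td></tr>",
          "</table></body></html>")

# Only run the example when the XML package is available
if (requireNamespace("XML", quietly = TRUE)) {
  cells <- td_values(page)
  print(cells)
}
```

The result is one flat character vector, so the row structure still has to be rebuilt afterwards from the known number of columns.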



Gamma wrote:
>
> Would anyone care to explain how to read an HTML table? It's streaming data
> (updated every second) and I am looking for a suitable function.
>
> The imported HTML table looks like this:
>
> [1] "<body><html><table>"
> [2] "<tr><td>SEQUENCE</td> <td>EXCHANGE</td> <td>BOARD</td> <td>TIME</td>
> <td>PAPER</td> <td>BID</td> <td>BID-DEPTH</td> <td>BID-DEPTH-TOTAL</td>
> <td>BID-NUMBER</td> <td>OFFER</td> <td>OFFER-DEPTH</td>
> <td>OFFER-DEPTH-TOTAL</td> <td>OFFER-NUMBER</td> <td>OPEN</td>
> <td>HIGH</td> <td>LOW</td> <td>LAST</td> <td>CHANGE</td>
> <td>CHANGE-PERCENT</td> <td>VOLUME</td> <td>VALUE</td> <td>TRADES</td>
> <td>STATUS</td></tr>"
> [3] "<tr><td>184311995</td><td>ST</td><td></td><td>174336</td><td>SX50PI</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>953.9600</td><td>937.9800</td><td>947.5900</td><td>2.6000</td><td>0.2751</td><td></td><td></td><td></td><td></td></tr>"
> and so on to the table closing brackets.
>
> [15] "</table></html></body>"
>
> I tried a few commands but I only get HTML code back, like above:
> readLines(url("")), socketConnection() and url(). Nothing seemingly
> useful comes up with apropos("html") either.
>
> Regards
>
-- 
View this message in context:
http://www.nabble.com/Read-HTML-table-tf4832010.html#a13825471
Sent from the R help mailing list archive at Nabble.com.

theta

2007-Nov-19 01:33 UTC


f.jamitzky wrote:
>
> You can use htmlTreeParse and xpathApply from the XML library.
> something like:
>
> xpathApply( htmlTreeParse("http://blabla", useInt=T),
> "//td", function(x)
> xmlValue(x))
>
> should do it.
>
Thank you. Any further ideas on how to transform the result into a matrix,
something that R could easily search for values? I want to use the
imported data in various calculations (Rmetrics) and hope to automate the
process somewhat.

Another thing: htmlTreeParse takes a while to complete. For a 15-row
table it takes about 10-15 seconds. Considering that I am planning to use
this method on multiple (15-20) tables with up to 1000 rows, it might not
be the ideal solution?

f.jamitzky

2007-Nov-19 09:56 UTC


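For a fixed number of columns you can use

data.frame(matrix(unlist(data), ncol = ncol, byrow = TRUE))

to arrange the extracted cell values into a table, where data is the
result of the xpathApply call and ncol is the number of columns. The
byrow = TRUE is needed because the cells come back row by row, while
matrix() fills column by column by default.

htmlTreeParse should be rather quick, but in case it is too slow you could
use curl to download the data and xmlstarlet to transform it to XML. Then
you can use xmlTreeParse, or even read.csv, to read the file into R.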
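
The reshaping step can be sketched in base R alone. This assumes the cell values arrive as one flat vector in document order, with the header row first; the sample values and the names cells, n_col and quotes are made up for illustration:

```r
# Cell values as they might come back from the "//td" query, in document order.
# (Hypothetical sample data modelled on the table in the question.)
cells <- c("SEQUENCE",  "PAPER",  "LAST",
           "184311995", "SX50PI", "947.5900",
           "184311996", "SX50PI", "947.6100")
n_col <- 3  # number of columns, assumed known and fixed

# byrow = TRUE because the cells are listed row by row
m <- matrix(cells, ncol = n_col, byrow = TRUE)

# First row holds the column names, the rest is data
quotes <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(quotes) <- m[1, ]

# Everything is text at this point; convert the columns used in calculations
quotes$LAST <- as.numeric(quotes$LAST)

# Look up values by column name
print(quotes[quotes$PAPER == "SX50PI", "LAST"])
```

Empty cells would come through as empty strings and turn into NA on conversion, which is probably what is wanted for missing bids and offers.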



theta

2007-Nov-19 20:41 UTC


f.jamitzky wrote:
>
> For fixed numbers of columns you can use 
> 
> data.frame(matrix(data, nrow, ncol)) 
> 
> in order to parse the XML data.
> 
> htmlTreeParse should be rather quick, but in case it is too slow you could
> use curl for downloading
> the data and xmlstarlet for transformation to XML. Then you can use
> xmlTreeParse or even read.csv to read the file into R.
> 
Reading real-time data into R for further computation is a side project. I
guess I am just curious whether it is possible at all. I know there are
full-fledged trading clients coded in Matlab, for example.

Thank you for helping, much appreciated.


Duncan Temple Lang

2007-Nov-20 05:55 UTC

theta wrote:
>
> f.jamitzky wrote:
>> You can use htmlTreeParse and xpathApply from the XML library.
>> something like:
>>
>> xpathApply( htmlTreeParse("http://blabla", useInt=T),
>> "//td", function(x)
>> xmlValue(x))
>>
>> should do it.
>>
>
> Thank you, any further ideas how to transform the result into a matrix,
> something that R easily could search and find values, i want to use the
> imported data in various calculations (Rmetrics) and hope to automate the
> process somewhat.
>
> Another thing, the htmlTreeParse takes a while to complete, for a 15 row
> table it takes about 10-15 seconds, considering i am planning to use this
> method on multiple (15-20) tables with up to 1000 rows it might not be the
> ideal solution?

I doubt the parsing is taking very long at all.
On a Linux box running virtually on my Mac, I can parse a 4566-line
HTML file in 0.3 seconds.

If you pass a URL rather than a local file, then you have to separate
the download time and the parsing time to figure out where the time
is consumed.
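
One way to separate the two might look like the sketch below. time_steps is a hypothetical helper, not part of any package; it reads the source with base R, then hands the text to whatever parser you pass in, and times each step on its own:

```r
# Hypothetical helper: time the fetch and the parse as two separate steps.
# src      : a file path or URL that readLines() can open
# parse_fn : a function taking the fetched lines and returning the parsed result
time_steps <- function(src, parse_fn) {
  t_fetch <- system.time(txt <- readLines(src, warn = FALSE))[["elapsed"]]
  t_parse <- system.time(res <- parse_fn(txt))[["elapsed"]]
  list(fetch = t_fetch, parse = t_parse, result = res)
}

# Local demonstration with a temporary file and a trivial stand-in "parser"
tmp <- tempfile(fileext = ".html")
writeLines(c("<table><tr><td>1</td></tr>", "</table>"), tmp)
timing <- time_steps(tmp, function(txt) length(txt))
print(timing[c("fetch", "parse")])
```

For the real case, src would be the URL and parse_fn something like function(txt) XML::htmlTreeParse(paste(txt, collapse = "\n"), asText = TRUE, useInternalNodes = TRUE), assuming the XML package is installed. If fetch dominates, the parser is not the bottleneck.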

And if you are going to download multiple tables from the same server in
rapid succession, then you might want to use some advanced features of
HTTP such as persistent connections or multiple interleaved requests.
These can all be done via the RCurl package and the results fed to
htmlTreeParse().  There is a paper on the RCurl web site that describes
some of these advanced features.
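
As a rough sketch of the persistent-connection idea, assuming the RCurl and XML packages are installed: fetch_tables is a hypothetical helper name, and the URLs you pass in are your own. Reusing one curl handle for all requests lets libcurl keep the connection to the server open between downloads where the server allows it:

```r
# Hypothetical helper: download several pages over one reused curl handle
# and parse each one. Returns a list of parsed documents, one per URL.
fetch_tables <- function(urls) {
  handle <- RCurl::getCurlHandle()  # one handle shared by all requests
  pages <- vapply(urls,
                  function(u) RCurl::getURL(u, curl = handle),
                  character(1))
  lapply(pages, function(p)
    XML::htmlTreeParse(p, asText = TRUE, useInternalNodes = TRUE))
}

# Usage (not run here, needs network access and real URLs):
# docs  <- fetch_tables(c("http://blabla/table1", "http://blabla/table2"))
# cells <- lapply(docs, function(d) XML::xpathSApply(d, "//td", XML::xmlValue))
```

For interleaved requests, RCurl also provides getURIAsynchronous(), which takes a vector of URLs and downloads them concurrently; whether that helps depends on the server.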

 D.