Is there any package similar to the XML package that is able to "extract" relevant information from HTML files? Specifically, I'm interested in reading data that is represented as an HTML table into some R-type structure.

Thank you.

-- 
Luis Torgo
FEP/LIACC, University of Porto      Phone : (+351) 22 607 88 30
Machine Learning Group              Fax   : (+351) 22 600 36 54
R. Campo Alegre, 823                email : ltorgo at liacc.up.pt
4150 PORTO - PORTUGAL               WWW   : http://www.liacc.up.pt/~ltorgo

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
If my memory serves me correctly, Dan Veillard's libxml library provides an adaptation of its XML parser that handles HTML. If so, I can add something to the XML package that gives us access to the HTML parser, so that the same interface can be used for both XML and HTML from within R. I'll take a look and see whether this is relatively easy to do.

-- 
_______________________________________________________________
Duncan Temple Lang                duncan at research.bell-labs.com
Bell Labs, Lucent Technologies    office: (908)582-3217
700 Mountain Avenue, Room 2C-259  fax:    (908)582-3340
Murray Hill, NJ 07974-2070        http://cm.bell-labs.com/stat/duncan
Hi Luis,

I just uploaded the latest version of the XML package to the Omegahat web site at

  http://www.omegahat.org/RSXML/XML_0.7-0.tar.gz

and this now has support for parsing HTML. I only added support for the DOM style of parsing, i.e. reading the entire tree and then applying R functions to convert it. Hopefully that will be enough to suit your needs.

Please let me know if there are any problems with the package. Thanks for the suggestion to include HTML support.

Duncan.
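
For readers who want to try the DOM-style approach described above, here is a minimal sketch of reading an HTML table into a data frame. It assumes a current release of the XML package from CRAN, not the 0.7-0 tarball; the functions htmlParse(), getNodeSet(), xpathSApply() and xmlValue() come from later releases and are not mentioned in this thread, and the inline HTML is made-up sample data.

```r
library(XML)

# Hypothetical sample document; in practice this would be a file or URL.
html <- paste0(
  "<html><body><table>",
  "<tr><th>name</th><th>score</th></tr>",
  "<tr><td>a</td><td>1</td></tr>",
  "<tr><td>b</td><td>2</td></tr>",
  "</table></body></html>"
)

# Read the entire document into a tree (DOM style).
doc <- htmlParse(html, asText = TRUE)

# Collect every table row, then the text of each header or data cell in it.
rows  <- getNodeSet(doc, "//table//tr")
cells <- lapply(rows, function(tr) xpathSApply(tr, "./th | ./td", xmlValue))

# Treat the first row as column names and the remaining rows as data.
header <- cells[[1]]
tab <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(tab) <- header

# Cell contents arrive as character strings; convert where appropriate.
tab$score <- as.numeric(tab$score)

print(tab)
```

The sketch assumes a simple rectangular table whose first row is the header. Tables with merged cells (rowspan or colspan) or nested tables would need extra handling.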