Yihui Xie
2010-Mar-11 22:35 UTC
[R] parse an HTML page with verbose error message (using XML)
I'm using the function htmlParse() in the XML package, and I need a little help with error handling while parsing an HTML page. So far I can use either the default way:

# error = xmlErrorCumulator(), by default
library(XML)
doc = htmlParse("http://www.public.iastate.edu/~pdixon/stat500/")
# the error message is:
# htmlParseStartTag: invalid element name

or the tryCatch() approach:

# error = NULL, errors to be caught by tryCatch()
tryCatch({
    doc = htmlParse("http://www.public.iastate.edu/~pdixon/stat500/",
        error = NULL)
}, XMLError = function(e) {
    cat("There was an error in the XML at line", e$line, "column",
        e$col, "\n", e$message, "\n")
})
# verbose error message as:
# There was an error in the XML at line 90 column 2
# htmlParseStartTag: invalid element name

I wish to get the verbose error messages without really stopping the parsing process; the first approach cannot return detailed error messages, while the second one will stop the program...

Thanks!

Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: 515-294-6609 Web: http://yihui.name
Department of Statistics, Iowa State University
3211 Snedecor Hall, Ames, IA
Duncan Temple Lang
2010-Mar-12 04:43 UTC
[R] parse an HTML page with verbose error message (using XML)
Hi Yihui

It took me a moment to see the error message, as the latest development version of the XML package suppresses them by default for htmlParse().

You can provide your own function via the error parameter. If you just want to see more detailed error messages on the console, you can use a function like the following:

fullInfoErrorHandler =
function(msg, code, domain, line, col, level, file)
{
    # level tells how significant the error is.
    # These are 0, 1, 2, 3 for NONE, WARNING, ERROR, FATAL;
    # the last three mean simple warning, recoverable error
    # and fatal/unrecoverable error.
    # See XML:::xmlErrorLevel
    #
    # code is an error code. See the values in XML:::xmlParserErrors,
    # e.g. XML_HTML_UNKNOWN_TAG, XML_ERR_DOCUMENT_EMPTY
    #
    # domain tells what part of the library raised this error.
    # See XML:::xmlErrorDomain

    codeMsg = switch(level, "warning", "recoverable error", "fatal error")
    cat("There was a", codeMsg, "in the", file, "at line", line,
        "column", col, "\n", msg, "\n")
}

doc = htmlParse("~/htmlErrors.html", error = fullInfoErrorHandler)

And of course you can mimic xmlErrorCumulator() to form a closure that collects the different details of each message into an object.

If you look in the error.R and xmlErrorEnums.R files within the R code of the XML package, you'll find some additional functions that give further support for working with errors in the XML/HTML parsers.

Best,
D.
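
The closure idea mentioned in the reply (mimicking xmlErrorCumulator() so that the details of each message are collected into an object) might be sketched roughly as follows. This is only an illustration under stated assumptions: makeErrorCollector() is a hypothetical helper, not part of the XML package; the handler is assumed to receive the same seven arguments as fullInfoErrorHandler; and an empty msg is assumed to mean there is nothing to record. The calls at the end use made-up arguments in place of a real parse.

```r
# Hypothetical helper, not part of the XML package: returns a handler that
# stores each parser message, plus an accessor for what was collected.
makeErrorCollector <- function() {
  records <- list()
  # Labels indexed by level + 1 (0 = none, 1 = warning, 2 = error, 3 = fatal)
  labels <- c("none", "warning", "recoverable error", "fatal error")
  # First element converted to the wanted type; NA if the argument is empty
  first <- function(x, convert) if (length(x)) convert(x[[1]]) else convert(NA)

  handler <- function(msg, code, domain, line, col, level, file) {
    # Assumption: an empty message carries no error, so nothing is stored
    if (length(msg) == 0L) return(invisible(NULL))
    lev <- first(level, as.integer)
    # Append one row per message to the list kept in the enclosing environment
    records[[length(records) + 1L]] <<- data.frame(
      file = first(file, as.character),
      line = first(line, as.integer),
      col = first(col, as.integer),
      level = labels[lev + 1L],
      code = first(code, as.integer),
      message = sub("[[:space:]]+$", "", first(msg, as.character)),
      stringsAsFactors = FALSE
    )
    invisible(NULL)
  }

  # Combine the stored rows into a single data frame (empty if none)
  results <- function() {
    if (length(records) == 0L) return(data.frame())
    do.call(rbind, records)
  }

  list(handler = handler, results = results)
}

collector <- makeErrorCollector()
# Simulated handler calls with made-up values, standing in for the parser
collector$handler("htmlParseStartTag: invalid element name\n",
                  68L, 5L, 90L, 2L, 2L, "stat500.html")
collector$handler("a second, made-up message\n",
                  1L, 5L, 120L, 5L, 1L, "stat500.html")
print(collector$results())
```

If htmlParse() does call the handler this way, it could presumably be plugged in as doc = htmlParse(file, error = collector$handler), with collector$results() inspected once parsing has finished. Because the handler only records and never calls stop(), it should not interrupt the parse by itself.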