Hi Pratt --
ppatel3026 <pratik.patel at us.rothschild.com> writes:
> Could someone provide a link or examples of parsing XML document in R? Few
> specific questions below:
It's always helpful to know what software you're using; here's mine:
> library(XML)
> sessionInfo()
R version 2.8.0 Under development (unstable) (2008-06-09 r45889)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics utils datasets grDevices methods base
other attached packages:
[1] XML_1.95-2
loaded via a namespace (and not attached):
[1] tools_2.8.0
> For instance I can retrieve specific nodes using this:
> node <- xpathApply(xml, "//" %+% xtag, xmlValue)
>
> 1) I want to be able to retrieve parent node for this node, how can I do
> this? getParentNode() does not seem to cut it.
I've found it easier to use xpath and the 'internal' representation
> library(XML)
> f <- system.file("exampleData", "mtcars.xml",
+                  package="XML")
> xml <- xmlTreeParse(f, useInternalNodes=TRUE)
> q <- "//record[@id='AMC Javelin']"
> nodes <- xpathApply(xml, q)
nodes is a list of length one. Here's the parent of the first (and
only) element
> parent <- xmlParent(nodes[[1]])
or
> xpathApply(xml, paste(q, "/.."))[[1]]
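A minimal, self-contained sketch of both routes to the parent. The inline document 'txt' is a hypothetical stand-in loosely modelled on mtcars.xml (it is not the real file), so the example does not depend on the package's example data.

```r
library(XML)

## Hypothetical stand-in document, loosely modelled on mtcars.xml
txt <- paste('<dataset name="demo">',
             '<variables count="2">',
             '<variable>mpg</variable><variable>cyl</variable>',
             '</variables>',
             '<record id="AMC Javelin">15.2 8</record>',
             '<record id="Fiat 128">32.4 4</record>',
             '</dataset>', sep="")
doc <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

q <- "//record[@id='AMC Javelin']"
node <- xpathApply(doc, q)[[1]]

## Route 1: navigate from the node itself
p1 <- xmlParent(node)
## Route 2: ask XPath for the parent axis ('..')
p2 <- xpathApply(doc, paste(q, "/..", sep=""))[[1]]

xmlName(p1)              # element name of the parent
xmlName(p2)              # same element, found via XPath
xmlGetAttr(p1, "name")   # an attribute of the parent
```

Both routes should land on the same element; xmlParent is handy when you already hold a node, the XPath form when you only have a query string.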
> 2) How can I retrieve children nodes for a particular node?
> xmlChildren(parent)
or, for a parent identified by the XPath expression pq <- "dataset":
> xpathApply(xml, paste(pq, "/*"))
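Here is a small sketch of listing children both ways. It again assumes a hypothetical inline document rather than mtcars.xml, written without whitespace between elements so that no blank text nodes can appear among the children.

```r
library(XML)

## Hypothetical stand-in document; no whitespace between elements
txt <- paste('<dataset>',
             '<variables><variable>mpg</variable></variables>',
             '<record id="a">21.0</record>',
             '<record id="b">22.8</record>',
             '</dataset>', sep="")
doc <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)
root <- xmlRoot(doc)

## Route 1: children of a node you already hold
kids <- xmlChildren(root)
kidNames <- unname(sapply(kids, xmlName))

## Route 2: XPath, selecting every element child of /dataset
xpNames <- unname(xpathSApply(doc, "/dataset/*", xmlName))

kidNames
xpNames
xmlSize(root)   # number of children of the root
```

Note that '/*' selects element children only, whereas xmlChildren returns whatever child nodes the parsed tree contains, so the two can differ on documents that hold text or comment nodes at that level.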
> 3) How can I create an iterator to iterate through the whole tree?
For true event parsing I think you want xmlEventParse, which traverses
the tree and invokes the functions in the argument 'handlers' on each
node. 'handlers' is a named list of functions, the name either
signifying a general type of position in the tree (e.g.,
'startElement') or the name of a node (e.g., 'record'). So
> handler <- list(startElement=function(name, atts, ...) {
+     cat("starting", name, "\n")
+ })
> xmlEventParse(f, handler)
starting dataset
starting variables
starting variable
[etc]
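To make the order of events concrete, here is a minimal sketch that records both start and end events. The inline document and the names 'txt' and 'traceEnv' are assumptions for illustration; the handlers use only the generic 'startElement' and 'endElement' names.

```r
library(XML)

## Hypothetical stand-in document
txt <- paste('<dataset>',
             '<variables><variable>mpg</variable></variables>',
             '<record id="a">21.0</record>',
             '</dataset>', sep="")

## An environment to collect events; environments are modified in place
traceEnv <- new.env()
traceEnv$events <- character()

handlers <- list(
    startElement=function(name, atts, ...) {
        traceEnv$events <- c(traceEnv$events, paste("start", name))
    },
    endElement=function(name, ...) {
        traceEnv$events <- c(traceEnv$events, paste("end", name))
    })

invisible(xmlEventParse(txt, handlers, asText=TRUE))
traceEnv$events
```

The events arrive in document order, with each element's end event following the events of everything nested inside it, which is what lets a handler track depth or context if it needs to.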
The usual 'trick' is to use R's lexical scope to provide a context
where results can be stored, e.g., defining a factory to produce
handlers
handlerFactory <- function() {
    ## 'local' store visible to functions defined inside
    ## handlerFactory
    counts <- new.env(parent=emptyenv())
    ## return value -- list of functions
    list(startElement=function(name, atts, ...) {
        ## lexical scope often requires use of <<- rather than <-;
        ## plain <- suffices here because 'counts' is an environment,
        ## which is modified in place
        if (!exists(name, counts))
            counts[[name]] <- 1
        else
            counts[[name]] <- counts[[name]] + 1
    }, getCounts=function() {
        ## for retrieving results
        as.list(counts)
    })
}
Then invoke xmlEventParse with an instance of the
handler. xmlEventParse actually returns the handler, which by the end
of xmlEventParse has 'counts' modified appropriately. We access the
results by invoking our getCounts function.
> xmlEventParse(f, handlerFactory())$getCounts()
$record
[1] 32
$variable
[1] 11
[etc]
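If the document fits in memory, the same tally can be cross-checked without event parsing. This is a different technique (XPath on the internal tree, not SAX-style handlers), sketched here on a hypothetical inline document rather than mtcars.xml.

```r
library(XML)

## Hypothetical stand-in document
txt <- paste('<dataset>',
             '<variables>',
             '<variable>mpg</variable><variable>cyl</variable>',
             '</variables>',
             '<record id="a">21.0 6</record>',
             '<record id="b">22.8 4</record>',
             '</dataset>', sep="")
doc <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

## '//*' selects every element in the document; tabulate their names
elementNames <- xpathSApply(doc, "//*", xmlName)
counts <- table(elementNames)
counts
```

The trade-off is memory: this builds the whole tree first, whereas the event-parsing approach only ever sees one node at a time.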
If the use of lexical scope is a bit mysterious, there is a 'bank
account' example in the Introduction to R manual (section 10.7) and a
paper by Ross Ihaka and Robert Gentleman on lexical scope (referenced
at http://www.r-project.org/doc/bib/R-other.html) that might help.
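For a feel of the idea without any XML involved, here is a tiny closure sketch in the same spirit. 'makeCounter' is a made-up name for illustration and is not taken from the manual.

```r
## Each call to makeCounter() creates a fresh environment holding 'n';
## the returned functions share that environment.
makeCounter <- function() {
    n <- 0
    list(increment=function() {
        ## <<- assigns to 'n' in the enclosing environment, not a local copy
        n <<- n + 1
        invisible(n)
    }, get=function() {
        n
    })
}

a <- makeCounter()
b <- makeCounter()
a$increment()
a$increment()
b$increment()
a$get()   # 2
b$get()   # 1
```

The two counters do not interfere because each has its own 'n', which is the same reason every call to a handler factory yields an independent set of results.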
I don't usually use event parsing, so the above may not be accurate.
Martin
> Thank you,
> Pratt
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793