Hi,

I am trying to parse XML files and read them into R as a data frame, but have been unable to find examples which I could apply successfully.

I'm afraid I don't know much about XML, which makes this all the more difficult. If someone could point me in the right direction to a resource (preferably with an example or two), it would be greatly appreciated.

Here is a snippet from one of the XML files that I am looking to read. I am aiming to get it into a data frame with columns N, T, A, B, C, as in the 2nd level of the hierarchy.

<?xml version="1.0" encoding="utf-8" ?>
- <C S="UnitA" D="1/3/2007" C="24745" F="24648">
  <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
  <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
  <T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
  <T N="4" T="9:31:05 AM" A="29.86" B="29.86" C="29.87" />
  <T N="5" T="9:31:05 AM" A="29.89" B="29.86" C="29.87" />
  <T N="6" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
  <T N="7" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
  <T N="8" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
  </C>

Thanks for any help or direction anyone can provide.

As a point of reference, I am using R 2.8.1 and have loaded the XML package.

Brigid Mooney wrote:
> Hi,
>
> I am trying to parse XML files and read them into R as a data frame,
> but have been unable to find examples which I could apply
> successfully.
>
> I'm afraid I don't know much about XML, which makes this all the more
> difficult. If someone could point me in the right direction to a
> resource (preferably with an example or two), it would be greatly
> appreciated.
>
> Here is a snippet from one of the XML files that I am looking to read,
> and I am aiming to be able to get it into a data frame with columns N,
> T, A, B, C as in the 2nd level of the heirarchy.
>

There might be a simpler approach, but this seems to work:

    library(XML)

    input = xmlParse(
    '<?xml version="1.0" encoding="utf-8" ?>
    <C S="UnitA" D="1/3/2007" C="24745" F="24648">
    <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
    <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
    <T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
    <T N="4" T="9:31:05 AM" A="29.86" B="29.86" C="29.87" />
    <T N="5" T="9:31:05 AM" A="29.89" B="29.86" C="29.87" />
    <T N="6" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
    <T N="7" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
    <T N="8" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
    </C>')

    (output = data.frame(t(xpathSApply(input, '//T', xpathSApply, '@*'))))
    #   N          T     A     B     C
    # 1 1 9:30:13 AM 30.05 29.85 30.05
    # 2 2 9:31:05 AM 29.89 29.78 30.05
    # 3 3 9:31:05 AM  29.9 29.86 29.87
    # 4 4 9:31:05 AM 29.86 29.86 29.87
    # 5 5 9:31:05 AM 29.89 29.86 29.87
    # 6 6 9:31:06 AM 29.89 29.85 29.86
    # 7 7 9:31:06 AM 29.89 29.85 29.86
    # 8 8 9:31:06 AM 29.89 29.85 29.86

    output$N
    # [1] 1 2 3 4 5 6 7 8
    # Levels: 1 2 3 4 5 6 7 8

You may need to reformat the columns.

vQ

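As a rough illustration of the "reformat the columns" step, here is a minimal sketch. It does not need the XML package: the small data frame below is a hand-built stand-in for the first rows of `output`, and the helper name `to_num` is made up for this example. It assumes the columns arrive as factors, as in the printout above; in R 4.0.0 and later `data.frame()` no longer creates factors by default, in which case `as.numeric()` alone would be enough.

```r
# Hand-built stand-in for the first rows of `output`; stringsAsFactors = TRUE
# mimics the factor columns shown in the printout above.
output <- data.frame(
  N = c("1", "2", "3"),
  T = c("9:30:13 AM", "9:31:05 AM", "9:31:05 AM"),
  A = c("30.05", "29.89", "29.9"),
  B = c("29.85", "29.78", "29.86"),
  C = c("30.05", "30.05", "29.87"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: go through as.character() first, because
# as.numeric() on a factor returns the internal level codes,
# not the printed values.
to_num <- function(x) as.numeric(as.character(x))

output$N <- as.integer(as.character(output$N))
output$T <- as.character(output$T)   # left as text in this sketch
num_cols <- c("A", "B", "C")
output[num_cols] <- lapply(output[num_cols], to_num)

str(output)
```

The T column is left as text here; turning it into a date-time would also need the date from the D attribute of the enclosing <C> element and a decision on how to read "1/3/2007".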
Hi Brigid.

Here are a few commands that should do what you want:

    library(XML)
    bri = xmlParse("myDataFile.xml")
    # one column of attributes per <T> child; transpose, then drop column 1 (N)
    tmp = t(xmlSApply(xmlRoot(bri), xmlAttrs))[, -1]
    dd = as.data.frame(tmp, stringsAsFactors = FALSE,
                       row.names = 1:nrow(tmp))

And then you can convert the columns to whatever types you want using regular R commands.

The basic idea is that for each of the child nodes of C, i.e. the <T>'s, we want the character vector of attributes, which we can get with xmlAttrs(). Then we stack them together into a matrix, drop the "N" column (leave off the [, -1] if you want to keep it), and convert the result to a data frame, avoiding the duplicate row names, which are all "T".

(BTW, make certain the '-' on the second line is not in the XML content. I assume that came from bringing the text into mail.)

HTH

D.

Brigid Mooney wrote:
> Hi,
>
> I am trying to parse XML files and read them into R as a data frame,
> but have been unable to find examples which I could apply
> successfully.
>
> I'm afraid I don't know much about XML, which makes this all the more
> difficult. If someone could point me in the right direction to a
> resource (preferably with an example or two), it would be greatly
> appreciated.
>
> Here is a snippet from one of the XML files that I am looking to read,
> and I am aiming to be able to get it into a data frame with columns N,
> T, A, B, C as in the 2nd level of the heirarchy.
>
> <?xml version="1.0" encoding="utf-8" ?>
> - <C S="UnitA" D="1/3/2007" C="24745" F="24648">
>   <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
>   <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
>   <T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
>   <T N="4" T="9:31:05 AM" A="29.86" B="29.86" C="29.87" />
>   <T N="5" T="9:31:05 AM" A="29.89" B="29.86" C="29.87" />
>   <T N="6" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
>   <T N="7" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
>   <T N="8" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
>   </C>
>
> Thanks for any help or direction anyone can provide.
>
> As a point of reference, I am using R 2.8.1 and have loaded the XML package.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
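
The idea described in the reply above (take the attribute vector of each <T> node, stack the vectors into a matrix, convert to a data frame) can also be tried without a file on disk. The following is only a sketch of a variant, not the code from either reply: it parses a shortened inline copy of the snippet with `asText = TRUE`, selects the <T> elements with `getNodeSet()`, and keeps the N column. The variable names are made up for illustration, and it assumes every <T> carries the same attributes in the same order.

```r
library(XML)

# Shortened inline copy of the snippet from the question,
# without the stray '-' in front of <C>.
xml_text <- '<?xml version="1.0" encoding="utf-8" ?>
<C S="UnitA" D="1/3/2007" C="24745" F="24648">
  <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
  <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
  <T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
</C>'

doc <- xmlParse(xml_text, asText = TRUE)

# One named character vector of attributes per <T> element.
attrs <- lapply(getNodeSet(doc, "/C/T"), xmlAttrs)

# Stack the vectors into a character matrix (one row per <T>).
# Assumption: all <T> elements have the same attributes in the same order.
mat <- do.call(rbind, attrs)
dd <- as.data.frame(mat, stringsAsFactors = FALSE)

# Attribute values arrive as text; convert the numeric columns explicitly.
dd$N <- as.integer(dd$N)
dd[c("A", "B", "C")] <- lapply(dd[c("A", "B", "C")], as.numeric)

str(dd)
```

Selecting the nodes by path means any other children of <C> are ignored, and because the stacked list is unnamed there are no duplicate "T" row names to work around.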