Don MacQueen
2009-Mar-02 03:42 UTC
[R] Need help extracting info from XML file using XML package
I have an XML file that has within it the coordinates of some polygons that I would like to extract and use in R. The polygons are nested rather deeply. For example, I found by trial and error that I can extract the coordinates of one of them using functions from the XML package:

    doc <- xmlInternalTreeParse('doc.kml')
    docroot <- xmlRoot(doc)
    pgon <- xmlValue(docroot[[52]][[3]][[7]][[3]][[3]][[1]][[1]])

but this is hardly general!

I'm hoping there is some relatively straightforward way to use functions from the XML package to recursively descend the structure and return the text strings representing the polygons into, say, a list with as many elements as there are polygons. I've been looking at several XML documentation files downloaded from http://www.omegahat.org/RSXML/ , but since my understanding of XML is weak at best, I'm having trouble. I can deal with converting the text strings to an R object suitable for plotting etc.

Here's a look at the structure of this file:

    graphics[5]% grep Polygon doc.kml
    <Polygon id="15342">
    </Polygon>
    <Polygon id="1073">
    </Polygon>
    <Polygon id="16508">
    </Polygon>
    <Polygon id="18665">
    </Polygon>
    <Polygon id="32903">
    </Polygon>
    <Polygon id="5232">
    </Polygon>

And each of the <Polygon> </Polygon> pairs has <coordinates> as per this example:

    <Polygon id="15342">
      <outerBoundaryIs>
        <LinearRing id="11467">
          <coordinates>
            -23.679835352296,30.263840290388,5.000000000000001
            -23.68138782285701,30.264740875186,5.000000000000001
            [snip]
            -23.679835352296,30.263840290388,5.000000000000001
            -23.679835352296,30.263840290388,5.000000000000001
          </coordinates>
        </LinearRing>
      </outerBoundaryIs>
    </Polygon>

Thanks!
-Don

p.s.
There is a lot of other stuff in this file, i.e., some points, and attributes of the points such as color, as well as a legend describing what the polygons mean, but I can get by without all that stuff, at least for now.

Note also that readOGR() would in principle work, but the underlying OGR libraries have some limitations that this file exceeds, per the info at http://www.gdal.org/ogr/drv_kml.html.

--
---------------------------------
Don MacQueen
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
macq at llnl.gov
David Winsemius
2009-Mar-02 04:02 UTC
[R] Need help extracting info from XML file using XML package
A bit over a year ago I got useful advice from Gabor Grothendieck and Duncan Temple Lang in this thread:

http://finzi.psych.upenn.edu/R/Rhelp02/archive/117140.html

If the coordinates are nested deeply, then it is probably safer to search for a specific tag or tags just above them. You probably want to search for the "LinearRing" tag and then store the coordinates along with its "id". Perhaps some of my mistakes can be avoided as you work on your methods.

--
David Winsemius
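A search by tag name rather than by position might look roughly like the following. This is only a sketch against a small stand-in document: the second polygon's LinearRing id ("ring2") and its coordinates are made up for illustration, and the file is assumed to declare no default namespace.

```r
library(XML)

# Stand-in for doc.kml, modelled on the fragment in the original post.
# The second polygon's ring id and coordinates are invented for the example.
kml <- '<Document>
  <Polygon id="15342"><outerBoundaryIs><LinearRing id="11467"><coordinates>
    -23.679835352296,30.263840290388,5.000000000000001
    -23.68138782285701,30.264740875186,5.000000000000001
  </coordinates></LinearRing></outerBoundaryIs></Polygon>
  <Polygon id="1073"><outerBoundaryIs><LinearRing id="ring2"><coordinates>
    1,2,3 4,5,6
  </coordinates></LinearRing></outerBoundaryIs></Polygon>
</Document>'

doc <- xmlParse(kml, asText = TRUE)

# Find every LinearRing wherever it sits in the tree.
rings <- getNodeSet(doc, "//LinearRing")

# For each ring, pull the text of its direct <coordinates> child.
coords <- lapply(rings, function(ring) {
  xpathSApply(ring, "./coordinates", xmlValue)
})

# Label each coordinate string with the ring's id attribute.
names(coords) <- sapply(rings, xmlGetAttr, "id")

str(coords)
```

The result is a named list of raw coordinate strings, one per LinearRing, which can then be converted to numbers separately.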
Wacek Kusnierczyk
2009-Mar-02 09:50 UTC
[R] Need help extracting info from XML file using XML package
Don MacQueen wrote:
> doc <- xmlInternalTreeParse('doc.kml')
> docroot <- xmlRoot(doc)
> pgon <- xmlValue(docroot[[52]][[3]][[7]][[3]][[3]][[1]][[1]])

try

    lapply(
        xpathSApply(doc, '//Polygon',
            xpathSApply, './/coordinates', function(node)
                strsplit(xmlValue(node), split=',|\\s+')),
        as.numeric)

which should find all polygon nodes, extract the coordinates node for each polygon separately, split the coordinates string on commas and whitespace, convert the pieces to a numeric vector, and then report a list of such vectors, one vector per polygon.

i've tried it on some dummy data made up from your example. the xpath patterns may need to be adjusted, depending on the actual structure of your xml file, as may the strsplit pattern.

vQ
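A slightly more spelled-out variant of the same idea is sketched below. It works polygon by polygon and returns one numeric matrix per polygon. The helper name parse_coords is hypothetical, the second polygon in the stand-in document is invented, and the sketch assumes every vertex has exactly three values (longitude, latitude, altitude) and that the file has no default namespace. A path beginning with // searches from the document root even when it is applied to a single node, so the inner path starts with .// to stay inside each polygon.

```r
library(XML)

# Stand-in for doc.kml; the second polygon's contents are invented.
kml <- '<Document>
  <Polygon id="15342"><outerBoundaryIs><LinearRing id="11467"><coordinates>
    -23.679835352296,30.263840290388,5.000000000000001
    -23.68138782285701,30.264740875186,5.000000000000001
  </coordinates></LinearRing></outerBoundaryIs></Polygon>
  <Polygon id="1073"><outerBoundaryIs><LinearRing id="ring2"><coordinates>
    1,2,3 4,5,6
  </coordinates></LinearRing></outerBoundaryIs></Polygon>
</Document>'

# Hypothetical helper: turn one coordinates string into a numeric matrix
# with one row per vertex. Assumes three values per vertex.
parse_coords <- function(txt) {
  fields <- strsplit(trimws(txt), "[,[:space:]]+")[[1]]
  matrix(as.numeric(fields), ncol = 3, byrow = TRUE,
         dimnames = list(NULL, c("lon", "lat", "alt")))
}

doc <- xmlParse(kml, asText = TRUE)
polys <- getNodeSet(doc, "//Polygon")

# Relative path (.//) keeps the search inside the current polygon.
pgons <- lapply(polys, function(p) {
  txt <- xpathSApply(p, ".//outerBoundaryIs//coordinates", xmlValue)
  parse_coords(txt[1])
})
names(pgons) <- sapply(polys, xmlGetAttr, "id")

pgons[["1073"]]
```

Each element can then be plotted directly, for example with plot(pgons[[1]][, c("lon", "lat")], type = "l"). If the real file declares a default namespace on its root element, the paths would need a prefix bound to that namespace URI through the namespaces argument of getNodeSet() and xpathSApply().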