Dear All,

I have a question regarding best practice in setting up an XML parser within R. Because my files are larger than 100 MB and I'm only interested in some of the values, I think a SAX-like parser using xmlEventParse() will be the best solution.

Unfortunately, the values I need in order to construct a higher-level "mass spectrum" object are distributed over different lines and elements, such as <spectrum id="2">, <mzArrayBinary>, <intenArrayBinary> and <... name="MassToChargeRatio" value="445.598999"/> (as one can see in the XML snippet).

I know the mechanism of using event handlers, as shown in the examples. What I'm looking for is how to use the "path information" mentioned for the "addContext" parameter of xmlEventParse(). Could somebody share an example using "addContext = TRUE" and point me to the variables I can use if I implement the "..." parameter in my handlers?

Do I have to implement a "state machine" using some variables within my handlers, or would one prefer to use the "state" parameter of xmlEventParse()?

I would appreciate any assistance very much!

Jan
Hi Jan,

On 20 Sep 2005, Hummel at mpimp-golm.mpg.de wrote:
> I have a question regarding best practice in setting up an XML parser
> within R.
[snip]
> value="445.598999"/> (as one can see in the XML snippet)

I missed the XML snippet, but I think I get the gist of your question.

> I know the mechanism of using event handlers, as shown in the
> examples. What I'm looking for is how to use the "path
> information" mentioned for the "addContext" parameter of
> xmlEventParse(). Could somebody share an example using "addContext =
> TRUE" and point me to the variables I can use if I implement the
> "..." parameter in my handlers?
>
> Do I have to implement a "state machine" using some variables
> within my handlers, or would one prefer to use the "state" parameter
> of xmlEventParse()?

I'm not familiar with the addContext arg and don't know whether it provides another solution to your problem.

I do know that you can do what you want by writing "state machine" code. I played a little with using the state arg for this purpose, but ran into some problems (sorry, no details in my memory banks).

There is an example of the state approach in Bioconductor's AnnBuilder package. See R/GO.R. It isn't the prettiest or best example, but maybe it will help get you going. The general approach is to use '<<-' to reach up a level and set the state variables from inside the tag handlers.

HTH,

+ seth
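The '<<-' approach Seth describes might look roughly like the following. This is only a minimal sketch, not the AnnBuilder code: makeSpectrumHandlers() is a hypothetical helper, the input is made up, and the element name "param" is a placeholder because the real element name is elided in Jan's mail. It assumes the default SAX 1 interface of xmlEventParse(), where the startElement handler receives the tag name and a named character vector of attributes.

```r
library(XML)

# Hypothetical generator: the handlers share `currentId`, `ids` and `mz`
# through the enclosing function environment and update them with `<<-`.
makeSpectrumHandlers <- function() {
  currentId <- NA_character_   # state: id of the <spectrum> being read
  ids <- character(0)
  mz  <- numeric(0)

  startElement <- function(name, attrs, ...) {
    if (name == "spectrum") {
      currentId <<- unname(attrs["id"])
    } else if (identical(unname(attrs["name"]), "MassToChargeRatio")) {
      # Matched by attribute, not by element name (the name is an assumption)
      ids <<- c(ids, currentId)
      mz  <<- c(mz, as.numeric(attrs["value"]))
    }
  }
  endElement <- function(name, ...) {
    if (name == "spectrum") currentId <<- NA_character_
  }

  list(handlers = list(startElement = startElement, endElement = endElement),
       result   = function() data.frame(id = ids, mz = mz,
                                        stringsAsFactors = FALSE))
}

# Made-up input; "run" and "param" are placeholder element names.
xmlText <- paste0(
  '<run>',
  '<spectrum id="2"><param name="MassToChargeRatio" value="445.598999"/></spectrum>',
  '<spectrum id="3"><param name="MassToChargeRatio" value="500.25"/></spectrum>',
  '</run>')

h <- makeSpectrumHandlers()
invisible(xmlEventParse(xmlText, asText = TRUE, handlers = h$handlers))
spectra <- h$result()
```

Keeping the accessor result() outside the handlers list means xmlEventParse() only ever sees the callbacks themselves.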
Jan Hummel wrote:
> Dear All,
>
> I have a question regarding best practice in setting up an XML parser
> within R.
> Because my files are larger than 100 MB and I'm only interested in
> some of the values, I think a SAX-like parser using xmlEventParse()
> will be the best solution.
> Unfortunately, the values I need in order to construct a higher-level
> "mass spectrum" object are distributed over different lines and
> elements, such as <spectrum id="2">, <mzArrayBinary>,
> <intenArrayBinary> and <... name="MassToChargeRatio"
> value="445.598999"/> (as one can see in the XML snippet).
>
> I know the mechanism of using event handlers, as shown in the examples.
> What I'm looking for is how to use the "path information" mentioned
> for the "addContext" parameter of xmlEventParse(). Could somebody
> share an example using "addContext = TRUE" and point me to the
> variables I can use if I implement the "..." parameter in my
> handlers?
>

The addContext argument was an attempt to provide contextual information, but it is not obvious how to do this efficiently, and of course efficiency is the name of the game with the SAX model. If we wanted to know path information for the node, we would have to build it, and that would slow things down. There are no nodes in the SAX world, as we don't build the tree in any way. So addContext currently doesn't do much. It is there as a hook that we can use in the future if we want. But you can do anything you need in the R code.

> Do I have to implement a "state machine" using some variables within my
> handlers, or would one prefer to use the "state" parameter of
> xmlEventParse()?
>

As Seth mentioned in his reply, you can use state in your R handler functions to determine where you are. You can maintain a "stack" that gives the exact path of the current "node": push the element name in the startElement() handler and pop it in the endElement() handler.
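The stack idea can be sketched as follows. This is an illustration under assumptions rather than code from the XML package: makePathHandlers() is a hypothetical helper, the input string is made up, and the handlers rely on the default handler names (startElement, endElement, text) of xmlEventParse().

```r
library(XML)

# Hypothetical helper: the handlers share a stack of open element names.
makePathHandlers <- function() {
  path  <- character(0)   # stack of currently open element names
  found <- character(0)   # path recorded for each non-blank text chunk

  handlers <- list(
    startElement = function(name, attrs, ...) {
      path <<- c(path, name)              # push on an opening tag
    },
    endElement = function(name, ...) {
      path <<- path[-length(path)]        # pop on a closing tag
    },
    text = function(x, ...) {
      if (nzchar(trimws(x)))
        found <<- c(found, paste(path, collapse = "/"))
    }
  )
  list(handlers = handlers, found = function() found)
}

p <- makePathHandlers()
invisible(xmlEventParse(
  '<spectrum id="2"><mzArrayBinary><data>abc</data></mzArrayBinary></spectrum>',
  asText = TRUE, handlers = p$handlers))
paths <- p$found()   # path of each text chunk, e.g. "spectrum/mzArrayBinary/data"
```

A handler can then compare the current path against the locations it cares about instead of tracking a separate flag per element.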
The difference between maintaining state via environments/local persistent scope (using <<-, as in Seth's mail) and using the state argument is more a matter of personal preference in R. The state argument was added for S-Plus, since it does not support environments. Using the state argument might save an epsilon amount of time, but that is hopefully negligible.

BTW, do you have a schema for the XML document you are working on?

> I would appreciate any assistance very much!
> Jan
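For comparison with the '<<-' style, here is a small sketch of the state argument, based on my reading of the xmlEventParse() documentation: when state is supplied, each handler is passed the current value in an argument named .state and must return the (possibly updated) value, and the final value is what xmlEventParse() returns. The handler name countSpectra and the input string are made up for illustration.

```r
library(XML)

# Made-up input with two <spectrum> elements and one unrelated element.
xmlText <- '<run><spectrum id="1"/><spectrum id="2"/><other/></run>'

# Hypothetical handler: receives the running count as `.state`
# and returns the updated count.
countSpectra <- function(name, attrs, .state) {
  if (name == "spectrum") .state <- .state + 1
  .state
}

# The final state is returned by xmlEventParse() when `state` is given.
n <- xmlEventParse(xmlText, asText = TRUE,
                   handlers = list(startElement = countSpectra),
                   state = 0)
```

No environment or '<<-' is involved here; every handler has to remember to return the state, which is the main practical difference between the two styles.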
Thank you Seth and Duncan for your input!

> BTW, do you have a schema for the XML document you are working on?

Yes, a schema is available here:
http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.xsd
Information about the mzData XML format is available here:
http://psidev.sourceforge.net/ms/#mzdata

My next question: is there a way to validate XML against a schema or a DTD while parsing with xmlEventParse()?

cheers
Jan
Jan Hummel wrote:
> Thank you Seth and Duncan for your input!
>
>>BTW, do you have a schema for the XML document you are working on?
>
> Yes, a schema is available here:
> http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.xsd
> Information about the mzData XML format is available here:
> http://psidev.sourceforge.net/ms/#mzdata
>

Thanks.

> My next question: is there a way to validate XML against a schema
> or a DTD while parsing with xmlEventParse()?
>

I dug around in the libxml code and the Web to verify that validation is indeed only possible in libxml when one uses DOM (i.e. xmlTreeParse()).

Do you really need to validate the input? Given the size of the source, it must be created automatically, so I tend to think it is either correct or not, and that any errors will be found in the creation mechanism.

BTW, there is a new version of the XML package on the Omegahat web site. It has several new features, including a function to find nodes via XPath expressions, SAX2 support, and recursive support for xmlElementsByTagName().

> cheers
> Jan

--
Duncan Temple Lang                duncan at wald.ucdavis.edu
Department of Statistics          work:  (530) 752-4782
371 Kerr Hall                     fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
Hi Duncan,

thanks again for your comments.

> I dug around in the libxml code and the Web to verify that
> validation is indeed only possible in libxml when one uses
> DOM (i.e. xmlTreeParse()).

Using DOM is not an option for me, so I need to "validate" the XML parts I'm interested in within my creation mechanism. That's OK, but not the best solution from a design point of view.

> BTW, there is a new version of the XML package on the
> Omegahat web site.

I'll be using it extensively over the next few days, and unfortunately I already have a question/problem pending. Take the following R function:

test <- function() {
  sep <- ""
  xmlText <- ""
  xmlText <- paste(xmlText, "<spectrum id=\"3257\">", sep = sep)
  xmlText <- paste(xmlText, "<mzArrayBinary>", sep = sep)
  xmlText <- paste(xmlText, "<data>Monday</data>", sep = sep)
  xmlText <- paste(xmlText, "</mzArrayBinary>", sep = sep)
  xmlText <- paste(xmlText, "<intenArrayBinary>", sep = sep)
  xmlText <- paste(xmlText, "<data>Tuesday</data>", sep = sep)
  xmlText <- paste(xmlText, "</intenArrayBinary>", sep = sep)
  # Uncommenting the next two lines splits the input into two <spectrum> elements:
  # xmlText <- paste(xmlText, "</spectrum>", sep = sep)
  # xmlText <- paste(xmlText, "<spectrum id=\"3259\">", sep = sep)
  xmlText <- paste(xmlText, "<mzArrayBinary>", sep = sep)
  xmlText <- paste(xmlText, "<data>Wednesday</data>", sep = sep)
  xmlText <- paste(xmlText, "</mzArrayBinary>", sep = sep)
  xmlText <- paste(xmlText, "<intenArrayBinary>", sep = sep)
  xmlText <- paste(xmlText, "<data>Thursday</data>", sep = sep)
  xmlText <- paste(xmlText, "</intenArrayBinary>", sep = sep)
  xmlText <- paste(xmlText, "</spectrum>", sep = sep)
  # Print the length and content of every text chunk the parser reports
  xmlEventParse(xmlText, asText = TRUE,
                handlers = list(text = function(x, ...) {
                  cat(nchar(x), x, "\n")
                }))
  return(invisible(NULL))
}

Using this function in the given form works fine: xmlEventParse(), with the simplest handler I can imagine, finds all 4 text nodes within the <spectrum> tag and prints them out. But if one uncomments both lines in the middle, introducing 2 <spectrum> tags with different IDs, xmlEventParse() returns with an exception. Of course, the weekdays within <data> are arbitrary values used here.
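A likely explanation for the exception, which the thread itself does not confirm: with both lines uncommented, the text contains two top-level <spectrum> elements, and a well-formed XML document must have exactly one root element. The sketch below wraps the two spectra in a single enclosing element so that the same kind of text handler runs over both. The wrapper name "spectrumList" is chosen for illustration only, and the input is shortened.

```r
library(XML)

# Two complete <spectrum> elements, shortened from the example above.
fragments <- c(
  '<spectrum id="3257"><mzArrayBinary><data>Monday</data></mzArrayBinary></spectrum>',
  '<spectrum id="3259"><mzArrayBinary><data>Wednesday</data></mzArrayBinary></spectrum>'
)

# Wrap them in one root element; "spectrumList" is just an illustrative name.
xmlText <- paste0("<spectrumList>",
                  paste(fragments, collapse = ""),
                  "</spectrumList>")

# Collect text chunks in an environment instead of printing them.
acc <- new.env()
acc$found <- character(0)

invisible(xmlEventParse(xmlText, asText = TRUE,
  handlers = list(text = function(x, ...) {
    acc$found <- c(acc$found, x)
  })))

allText <- paste(acc$found, collapse = "")
```

In a real mzData file the spectra already sit inside an enclosing document element, so this only matters for hand-built test strings like the one in the function above.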
Further, using another input file, I could see that for one and the same <data> node the handler for "text" nodes was invoked twice: once for the first part of the content and once for the rest. Both invocations together gave me exactly the content of the <data> node.

So, am I on the wrong track? Or is this buggy behaviour? I appreciate any help and assistance!

Jan
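SAX-style parsers are generally free to report the character data of one element in several chunks, so the split delivery described above is probably expected rather than a bug. A common way to cope is to buffer the chunks and assemble them when the closing tag arrives. The following is a minimal sketch of that pattern; makeDataCollector() is a hypothetical helper and the input is made up.

```r
library(XML)

# Hypothetical helper: buffers text chunks per <data> element and
# joins them when the element is closed.
makeDataCollector <- function() {
  buffer <- character(0)   # chunks of the <data> element being read
  inData <- FALSE          # TRUE while inside a <data> element
  values <- character(0)   # one complete string per <data> element

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "data") {
        inData <<- TRUE
        buffer <<- character(0)
      }
    },
    text = function(x, ...) {
      if (inData) buffer <<- c(buffer, x)   # may be called more than once
    },
    endElement = function(name, ...) {
      if (name == "data") {
        values <<- c(values, paste(buffer, collapse = ""))
        inData <<- FALSE
      }
    }
  )
  list(handlers = handlers, values = function() values)
}

collector <- makeDataCollector()
invisible(xmlEventParse(
  '<spectrum id="1"><data>Mon</data><data>Tue</data></spectrum>',
  asText = TRUE, handlers = collector$handlers))
dataValues <- collector$values()
```

The result is the same whether the parser delivers each <data> content in one call or in several. xmlEventParse() also has a trim argument; if whitespace at chunk boundaries matters for your data, it may be worth checking how trimming interacts with chunked text.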