thr3ads.net - R help - [R] SAX Parser best practise [Sep 2005]

Jan Hummel

2005-Sep-21 06:43 UTC

[R] SAX Parser best practise

Dear All,

I have a question regarding best practice in setting up an XML parser
within R.
Because I have files of more than 100 MB and I'm only interested in
some of the values, I think a SAX-like parser using xmlEventParse() will
be the best solution.
Unfortunately, the values I need in order to construct a higher-level
"mass spectrum" are distributed over different lines: <spectrum id="2">,
<mzArrayBinary>, <intenArrayBinary>,
<... name="MassToChargeRatio" value="445.598999"/>
(as one can see in the XML snippet).

I know the mechanism of using event handlers, as shown in the examples.
What I'm looking for is how to use the "path information" mentioned
for the "addContext" parameter of xmlEventParse(). Could somebody share
an example using "addContext = TRUE" and point me to the variables I
may use if I implement the "..." parameter within my handlers?

Do I have to implement a "status machine" using some variables within
my handlers, or would one prefer to use the "state" parameter of
xmlEventParse()?

I would appreciate any assistance very much!
	Jan

Seth Falcon

2005-Sep-21 14:26 UTC

[R] SAX Parser best practise

Hi Jan,

On 20 Sep 2005, Hummel at mpimp-golm.mpg.de wrote:
> I have a question regarding best practise in setting up a XML parser
> within R. [snip]
> value="445.598999"/> (as one can see in the xml snip set)

I missed the xml snip, but I think I get the gist of your question.

> I know the mechanism of using Event Handlers, as shown in the
> examples.  But what I'm looking for is, how can I use some "path
> information" as mentioned in "addContext" parameter of
> xmlEventParse()? May somebody share a example using "addContext =
> TRUE" and pointing me to the variables I may use if I implement the
> "..." parameter within my handlers.
>
> Do I have to implement a "status machine" using some variables
> within my handlers, or would one prefer to use the "state" parameter
> of xmlEventParse()?

I'm not familiar with the addContext arg and don't know whether or not
that provides another solution to your problem.

I do know that you can do what you want by writing "state machine"
code.  I played a little with using the state arg for this purpose,
but ran into some problems (sorry, no details in my memory banks).

There is an example of the state approach in Bioconductor's AnnBuilder
package.  See R/GO.R.  It isn't the prettiest or best example, but
maybe it will help get you going.

The general approach is to use '<<-' to reach up a level and set the
state variables from inside the tag handlers.
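
A minimal sketch of that closure-based approach, assuming the XML package
is installed. The sample document, the <spectrumList> and <param> tag
names, and the makeHandlers() helper are made up for illustration; only
the id and name="MassToChargeRatio" attributes come from the question.

```r
library(XML)

# Hypothetical sample input; <spectrumList> and <param> are placeholder tags.
xml <- paste0(
  '<spectrumList>',
  '<spectrum id="2"><param name="MassToChargeRatio" value="445.598999"/></spectrum>',
  '<spectrum id="3"><param name="MassToChargeRatio" value="512.25"/></spectrum>',
  '</spectrumList>')

makeHandlers <- function() {
  currentId <- NA_character_  # state: id of the <spectrum> we are inside
  mz <- numeric(0)            # result: one value per spectrum id

  list(
    startElement = function(name, attrs, ...) {
      if (name == "spectrum") {
        # Remember which spectrum the following elements belong to.
        currentId <<- attrs[["id"]]
      } else if (!is.null(attrs) &&
                 identical(unname(attrs["name"]), "MassToChargeRatio")) {
        # Match on the attribute, whatever the element is called.
        mz[currentId] <<- as.numeric(attrs[["value"]])
      }
    },
    endElement = function(name, ...) {
      if (name == "spectrum") currentId <<- NA_character_
    },
    result = function() mz    # accessor, not an event handler
  )
}

h <- makeHandlers()
xmlEventParse(xml, asText = TRUE, handlers = h)
h$result()
```

The handlers share one enclosing environment, so the values survive after
xmlEventParse() returns and can be read back through the accessor.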

HTH,

+ seth

Duncan Temple Lang

2005-Sep-21 15:43 UTC

[R] SAX Parser best practise

Jan Hummel wrote:
> Dear All,
> 
> I have a question regarding best practise in setting up a XML parser
> within R. 
> Because I have files with more than 100 MB and I'm only interested in
> some values I think a SAX-like parser using xmlEventParse() will be the
> best solution.
> Unfortunately the values I'm looking for, to construct some higher "mass
> spectrum", are distributed over different lines: as <spectrum id="2">,
> <mzArrayBinary>, <intenArrayBinary> <... name="MassToChargeRatio"
> value="445.598999"/> (as one can see in the xml snip set)
> 
> I know the mechanism of using Event Handlers, as shown in the examples.
> But what I'm looking for is, how can I use some "path information" as
> mentioned in "addContext" parameter of xmlEventParse()? May somebody
> share a example using "addContext = TRUE" and pointing me to the
> variables I may use if I implement the "..." parameter within my
> handlers.
> 
> 
The addContext was an attempt to provide contextual information,
but it is not obvious how to do this efficiently. And of course
efficiency is the name of the game with the SAX model.
If we wanted to know path information for the node, we would have
to build this and that would slow things down. There are no nodes
in the SAX world as we don't build the tree in any way.
So the addContext currently doesn't do much. It is there
as a hook that we can use if we want in the future.
But you can do anything you need in the R code.

> Do I have to implement a "status machine" using some variables within my
> handlers, or would one prefer to use the "state" parameter of
> xmlEventParse()?
> 

As Seth mentioned in his reply, you can use state in your R handler 
functions to determine where you are.  You can maintain a "stack"
to determine the exact path of the current "node" in the
startElement() handler and pop the name in the endElement() handler.
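
A rough sketch of such a stack, assuming the XML package is installed.
The makePathTracker() helper and the sample document are illustrative
only; the path is simply the tag names joined with "/".

```r
library(XML)

# Hypothetical sample input using tag names from the question.
xml <- paste0(
  '<spectrum id="2">',
  '<mzArrayBinary><data>Monday</data></mzArrayBinary>',
  '<intenArrayBinary><data>Tuesday</data></intenArrayBinary>',
  '</spectrum>')

makePathTracker <- function() {
  stack <- character(0)  # names of the currently open elements
  found <- list()        # text content, keyed by element path

  list(
    startElement = function(name, attrs, ...) {
      stack <<- c(stack, name)          # push on entry
    },
    endElement = function(name, ...) {
      stack <<- stack[-length(stack)]   # pop on exit
    },
    text = function(x, ...) {
      key <- paste(stack, collapse = "/")
      # Append, since one element's text may arrive in several pieces.
      found[[key]] <<- paste0(found[[key]], x)
    },
    result = function() found           # accessor, not an event handler
  )
}

tracker <- makePathTracker()
xmlEventParse(xml, asText = TRUE, handlers = tracker)
str(tracker$result())
```

Keying on the full path is what lets the two <data> elements be told
apart, even though they share a tag name.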

The difference between maintaining state via environments/local
persistent scope (using <<- as in Seth's mail) and using the state
argument is more of a personal preference in R.  The state argument was
added for S-Plus since it does not support environments.  Using the state
argument might save an epsilon amount of time, but it is hopefully
negligible.
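
For comparison, a small sketch of the state-argument style, assuming the
XML package is installed. As far as the xmlEventParse() documentation
describes it, each handler then receives the current state as an extra
.state argument and must return the (possibly updated) state, and the
final state is what the call returns. The countSpectra() handler and the
sample document are made up for illustration.

```r
library(XML)

# Hypothetical sample input; <spectrumList> is a placeholder root tag.
xml <- paste0(
  '<spectrumList>',
  '<spectrum id="2"><data>Monday</data></spectrum>',
  '<spectrum id="3"><data>Tuesday</data></spectrum>',
  '</spectrumList>')

# The handler takes the state in and hands the new state back.
countSpectra <- function(name, attrs, .state) {
  if (name == "spectrum") .state <- .state + 1
  .state
}

n <- xmlEventParse(xml, asText = TRUE,
                   handlers = list(startElement = countSpectra),
                   state = 0)
n
```

No '<<-' is needed here, at the cost of every handler having to pass the
state through.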

BTW, do you have a schema for the XML document you are working on?




> I would appreciate any assistance very much!
> 	Jan
> 

Jan Hummel

2005-Sep-22 09:04 UTC

[R] SAX Parser best practise

Thank you Seth  and Duncan for your input! 
> BTW, do you have a schema for the XML document you are working on?
Yes, a schema is available here
http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.xsd
Information about the mzData XML format is available here
http://psidev.sourceforge.net/ms/#mzdata

The next question I want to raise: is there a way to validate XML
against a schema or a DTD while parsing with xmlEventParse()?

cheers
	Jan

Duncan Temple Lang

2005-Sep-24 02:09 UTC

[R] SAX Parser best practise



Jan Hummel wrote:
> Thank you Seth  and Duncan for your input! 
> 
>> BTW, do you have a schema for the XML document you are working on?
> 
> Yes, a schema is available here
> http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.xsd
> Informations around mzData xml format are available here
> http://psidev.sourceforge.net/ms/#mzdata
> 
Thanks.

> Next question I want to come up with: is there a way to validate xml
> again a schema or a dtd while parsing using xmlEventParse()?
> 
I dug around in the libxml code and the Web to verify that validation
is indeed only possible in libxml when one uses DOM (i.e. xmlTreeParse()).
Do you really need to validate the input? Given the size of the source,
it must be created automatically and so I tend to think it is either
correct or not, but that errors will be found with the creation
mechanism.


BTW, there is a new version of the XML package on the Omegahat web site.
It has several new features, including a function to find nodes via
XPath expressions, SAX2 support, and recursive support for
xmlElementsByTagName().

> cheers
> 	Jan
- --
Duncan Temple Lang                duncan at wald.ucdavis.edu
Department of Statistics          work:  (530) 752-4782
371 Kerr Hall                     fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA

Jan Hummel

2005-Sep-26 08:13 UTC

[R] SAX Parser best practise

Hi Duncan,

thanks again for your comments.
> I dug around in the libxml code and the Web to verify that 
> validation is indeed only possible in libxml when one uses 
> DOM (i.e. xmlTreeParse()).

Using DOM is not an option for me, so I need to "validate" the XML parts
I'm interested in within my creation mechanism. That's OK, but not the
best solution from a design point of view.

> BTW, there is a new version of the XML package on the 
> Omegahat web site.

I'll be using it extensively over the coming days, and unfortunately I
already have a question/problem pending:

Taking the following R function:

library(XML)

test <- function() {
	sep <- ""
	xmlText <- ""
	xmlText <- paste(xmlText, "<spectrum id=\"3257\">", sep = sep)
	xmlText <- paste(xmlText, "<mzArrayBinary>", sep = sep)
	xmlText <- paste(xmlText, "<data>Monday</data>", sep = sep)
	xmlText <- paste(xmlText, "</mzArrayBinary>", sep = sep)
	xmlText <- paste(xmlText, "<intenArrayBinary>", sep = sep)
	xmlText <- paste(xmlText, "<data>Tuesday</data>", sep = sep)
	xmlText <- paste(xmlText, "</intenArrayBinary>", sep = sep)
	# Uncommenting the next two lines yields two <spectrum> elements.
#	xmlText <- paste(xmlText, "</spectrum>", sep = sep)
#	xmlText <- paste(xmlText, "<spectrum id=\"3259\">", sep = sep)
	xmlText <- paste(xmlText, "<mzArrayBinary>", sep = sep)
	xmlText <- paste(xmlText, "<data>Wednesday</data>", sep = sep)
	xmlText <- paste(xmlText, "</mzArrayBinary>", sep = sep)
	xmlText <- paste(xmlText, "<intenArrayBinary>", sep = sep)
	xmlText <- paste(xmlText, "<data>Thursday</data>", sep = sep)
	xmlText <- paste(xmlText, "</intenArrayBinary>", sep = sep)
	xmlText <- paste(xmlText, "</spectrum>", sep = sep)

	# Print the length and content of every text event.
	xmlEventParse(xmlText, asText = TRUE,
		handlers = list(text = function(x, ...) {
			cat(nchar(x), x, "\n")
		}))
	return(invisible(NULL))
}

Using this function in the given form works fine. xmlEventParse() with
the simplest handler I can imagine finds all 4 text nodes within the
<spectrum> tag and prints them out. But if one uncomments both lines in
the middle, introducing 2 <spectrum> tags with different ids,
xmlEventParse() returns with an exception. Of course, the weekdays within
<data> are arbitrary values used here. Further, using another input
file, I could see that for one and the same <data> node the handler for
"text" nodes was invoked twice: once for the first part of the
content and once for the rest of the content. Both invocations
together gave me exactly the content of the <data> node. 

So, am I on the wrong track? Or is this some buggy behaviour? 

I appreciate any help and assistance!

Jan
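
A note on the two observations above, offered as a sketch rather than a
definitive answer. A well-formed XML document has exactly one root
element, so two top-level <spectrum> elements are expected to fail; and
SAX-style parsers are generally free to deliver one element's text in
several pieces, so the pieces have to be joined by the handlers. The code
below assumes the XML package is installed; the <spectrumList> wrapper
and the makeTextCollector() helper are made up for illustration.

```r
library(XML)

# Both spectra wrapped in a single (placeholder) root element.
xml <- paste0(
  '<spectrumList>',
  '<spectrum id="3257">',
  '<mzArrayBinary><data>Monday</data></mzArrayBinary>',
  '<intenArrayBinary><data>Tuesday</data></intenArrayBinary>',
  '</spectrum>',
  '<spectrum id="3259">',
  '<mzArrayBinary><data>Wednesday</data></mzArrayBinary>',
  '<intenArrayBinary><data>Thursday</data></intenArrayBinary>',
  '</spectrum>',
  '</spectrumList>')

makeTextCollector <- function() {
  buffer <- character(0)  # text pieces of the <data> element being read
  values <- character(0)  # one complete string per <data> element

  list(
    startElement = function(name, attrs, ...) {
      if (name == "data") buffer <<- character(0)  # start a fresh buffer
    },
    text = function(x, ...) {
      buffer <<- c(buffer, x)                      # may fire more than once
    },
    endElement = function(name, ...) {
      # Only at the closing tag is the content known to be complete.
      if (name == "data") values <<- c(values, paste(buffer, collapse = ""))
    },
    result = function() values                     # accessor, not a handler
  )
}

collector <- makeTextCollector()
# trim = FALSE keeps whitespace that might sit at a piece boundary.
xmlEventParse(xml, asText = TRUE, handlers = collector, trim = FALSE)
collector$result()
```

Joining the pieces in endElement() gives the same result whether the
text handler fires once or several times per element.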
