Frederic Fournier
2012-Oct-26 16:49 UTC
[R] Parsing very large xml datafiles with SAX: How to profile <anonymous> functions?
Hello everyone,
I'm trying to parse a very large XML file using SAX with the XML package
(i.e., mainly the xmlEventParse function). This function takes as an
argument a list of other functions (handlers) that will be called to handle
particular XML nodes.
When I use Rprof(), all the handler functions are lumped together under
the <Anonymous> label, and I get something like this:
$by.total
                           total.time total.pct self.time self.pct
"system.time"                  151.22     99.99      0.00     0.00
"MyParsingFunction"            149.38     98.77      0.00     0.00
"xmlEventParse"                149.38     98.77      0.00     0.00
".Call"                        149.32     98.73      3.04     2.01
"<Anonymous>"                  146.74     97.02    141.26    93.40  <--- !!
"xmlValue"                       3.04      2.01      0.46     0.30
"xmlValue.XMLInternalNode"       2.58      1.71      0.14     0.09
"standardGeneric"                2.12      1.40      0.50     0.33
"gc"                             1.86      1.23      1.86     1.23
...
Is there a way to make Rprof() identify the different handler functions, so
I can know which one might be a bottleneck? Is there another profiling tool
that would be more appropriate in a case like this?
Thank you very much for your help!
Frederic
Duncan Temple Lang
2012-Oct-27 00:01 UTC
[R] Parsing very large xml datafiles with SAX: How to profile <anonymous> functions?
Hi Frederic
Perhaps the simplest way to profile the individual functions in your
handlers is to write the individual handlers as regular named functions,
i.e. assigned to a variable in your workspace (or function body), and then
to write the handler functions as wrapper functions that call these by name:
startElement = function(name, attr, ...) {
# code you want to run when we encounter the start of an XML element
}
myText = function(...) {
# code
}
Now, when calling xmlEventParse():
xmlEventParse(filename,
              handlers = list(.startElement = function(...) startElement(...),
                              .text = function(...) myText(...)))
Then the profiler will see the calls to startElement and myText.
There is a small overhead from the extra layers, but you will get the profile
information.
D.
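For reference, the wrapper approach described above can be sketched end to end roughly as follows. This is a minimal sketch, not code from the thread: it assumes the XML package is installed, and the in-memory document, the `counts` environment, the handler bodies and the `prof_file` name are all illustrative assumptions. Only the handler names startElement and myText and the wrapper pattern come from the reply.

```r
library(XML)  # assumption: the XML package is installed

# Illustrative state shared by the handlers (not part of the original post).
counts <- new.env()
counts$elements <- 0L
counts$chars <- 0L

# Named handlers, so that the profiler can attribute time to them by name.
startElement <- function(name, attrs, ...) {
  counts$elements <- counts$elements + 1L
}

myText <- function(content, ...) {
  counts$chars <- counts$chars + nchar(content)
}

# Small in-memory document so the sketch is self-contained;
# a real run would pass the name of the large file instead.
doc <- paste0("<root>",
              paste(rep("<item>abc</item>", 1000), collapse = ""),
              "</root>")

prof_file <- tempfile()
Rprof(prof_file)   # start profiling
xmlEventParse(doc, asText = TRUE,
              handlers = list(.startElement = function(...) startElement(...),
                              .text = function(...) myText(...)))
Rprof(NULL)        # stop profiling
```

Afterwards, summaryRprof(prof_file)$by.total should list startElement and myText under their own names rather than only under <Anonymous>, provided the parse runs long enough for the profiler to record samples. On a document as small as this one there may be too few samples to be useful.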