Frederic Fournier
2012-Oct-26 16:49 UTC
[R] Parsing very large xml datafiles with SAX: How to profile <anonymous> functions?
Hello everyone,

I'm trying to parse a very large XML file using SAX with the XML package (i.e., mainly the xmlEventParse function). This function takes as an argument a list of other functions (handlers) that will be called to handle particular XML nodes.

When I use Rprof(), all the handler functions are lumped together under the <Anonymous> label, and I get something like this:

$by.total
                           total.time total.pct self.time self.pct
"system.time"                  151.22     99.99      0.00     0.00
"MyParsingFunction"            149.38     98.77      0.00     0.00
"xmlEventParse"                149.38     98.77      0.00     0.00
".Call"                        149.32     98.73      3.04     2.01
"<Anonymous>"                  146.74     97.02    141.26    93.40   <--- !!
"xmlValue"                       3.04      2.01      0.46     0.30
"xmlValue.XMLInternalNode"       2.58      1.71      0.14     0.09
"standardGeneric"                2.12      1.40      0.50     0.33
"gc"                             1.86      1.23      1.86     1.23
...

Is there a way to make Rprof() identify the different handler functions, so I can know which one might be a bottleneck? Is there another profiling tool that would be more appropriate in a case like this?

Thank you very much for your help!

Frederic
Duncan Temple Lang
2012-Oct-27 00:01 UTC
[R] Parsing very large xml datafiles with SAX: How to profile <anonymous> functions?
Hi Frederic,

Perhaps the simplest way to profile the individual functions in your handlers is to write them as regular named functions, i.e. assigned to a variable in your workspace (or function body), and then to write the handlers you pass to xmlEventParse() as wrapper functions that call these by name:

    startElement = function(name, attr, ...) {
      # code you want to run when we encounter the start of an XML element
    }

    myText = function(...) {
      # code
    }

Now, when calling xmlEventParse():

    xmlEventParse(filename,
                  handlers = list(.startElement = function(...) startElement(...),
                                  .text = function(...) myText(...)))

Then the profiler will see the calls to startElement and myText. There is a small overhead from the extra layers, but you will get the profile information.

D.
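The wrapper approach described above can be tried end to end with a sketch like the following. It assumes the XML package is installed. The inline document, the `counts` environment, and the handler bodies are made up for illustration and are not taken from the original parsing code.

```r
library(XML)

# Hypothetical sample document, inlined so the example is self-contained.
xml_text <- "<root><item id='1'>alpha</item><item id='2'>beta</item></root>"

# Illustrative counters; an environment lets the handlers update them in place.
counts <- new.env()
counts$elements <- 0L
counts$chars <- 0L

# Named handlers: these are the names the profiler can report.
startElement <- function(name, attrs, ...) {
  counts$elements <- counts$elements + 1L
}

myText <- function(content, ...) {
  counts$chars <- counts$chars + nchar(content)
}

# Profile the parse. The anonymous wrappers only forward to the named functions.
prof_file <- tempfile()
Rprof(prof_file)
xmlEventParse(xml_text, asText = TRUE,
              handlers = list(.startElement = function(...) startElement(...),
                              .text = function(...) myText(...)))
Rprof(NULL)

# On a real, large file, inspect the result with:
#   summaryRprof(prof_file)$by.total
# (this may fail on a run too short to collect any samples, as is likely here)
```

On a document this small the profiler is unlikely to collect samples, so the sketch only shows the wiring. With a large file, startElement and myText should appear as separate rows in the summary, next to the remaining <Anonymous> wrapper entries.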