Hi Stavros
xmlToDataFrame() is very generic, so it doesn't know anything about the
particulars of the XML it is processing. If you know something about the
structure of the XML, you should be able to leverage that for performance.
xmlToDataFrame() is also not optimized; it is just a convenience routine for
people who want to work with XML without much effort.
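For example, if the file is essentially a flat list of records, something
along these lines is usually much faster (just a rough sketch; the node and
field names below are made up, so substitute the ones from your document):

    library(XML)

    doc <- xmlParse("records.xml")

    ## Pull each field out with a dedicated XPath query instead of letting
    ## xmlToDataFrame() discover the columns node by node.
    df <- data.frame(
      id    = xpathSApply(doc, "//record/id",    xmlValue),
      value = xpathSApply(doc, "//record/value", xmlValue),
      stringsAsFactors = FALSE
    )

    free(doc)   # release the C-level document when you are done with it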
If you send me the file and the code you are using to read it, I'll take a
look at it.
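For reference, the split-into-pieces workaround you describe below would look
roughly like this (the part file names are just placeholders for however you
split the document):

    library(XML)
    library(plyr)

    ## Read each piece with xmlToDataFrame() and stack the results.
    parts  <- sprintf("part%02d.xml", 1:10)
    pieces <- lapply(parts, xmlToDataFrame)
    result <- rbind.fill(pieces)

rbind.fill() is used rather than rbind() so that pieces missing a column
still combine, with the gaps filled by NA.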
D.
On 7/30/13 11:10 AM, Stavros Macrakis wrote:
> I have a modest-size XML file (52MB) in a format suited to xmlToDataFrame
> (package XML).
>
> I have successfully read it into R by splitting the file 10 ways, then
> running xmlToDataFrame on each part, then rbind.fill (package plyr) on the
> result. This takes about 530 s total, and results in a data.frame with 71k
> rows and object.size of 21MB.
>
> But trying to run xmlToDataFrame on the whole file takes forever (>10000 s
> so far). xmlParse of this file takes only 0.8 s.
>
> I tried running xmlToDataFrame on the first 10% of the file, then the first
> 10% repeated twice, then three times (with the outer tags adjusted, of
> course). Timings:
>
> 1 copy:    111 s  =  111 s per copy
> 2 copies:  311 s  =  155 s per copy
> 3 copies:  626 s  =  209 s per copy
>
> The runtime is superlinear. What is going on here? Is there a better
> approach?
>
> Thanks,
>
> -s
>