Frederic Fournier
2012-Oct-26  17:00 UTC
[R] Parsing very large xml datafiles with SAX (XML package): What data structure should I favor?
Hello again,

I have another question related to parsing a very large XML file with SAX: what kind of data structure should I favor? Unlike the DOM functions, which can return lists of relevant nodes and let me use the various versions of 'apply', SAX parsing hands me one thing at a time.

I first tried the simple solution of appending to lists as I get the data, but I very soon realized that this is way too slow.
Then I tried pre-declaring large data.frames of NA and populating them with [[<-.data.frame, but this is quite slow too.
I then tried pre-declaring a large matrix of NA and populating it with [<-. This is better, but still unmanageable as the XML files become large.
I also tried using an environment as a hash structure, but realized that while this is simple for the programmer, it stalls the parsing.
I then tried to

[[alternative HTML version deleted]]
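For reference, one way to picture the "pre-declare and populate" approach described above is a closure that keeps plain column vectors and doubles their capacity when they fill up. This is only a sketch under assumptions: make_collector, add and result are hypothetical names, and the "record" element with "id" and "value" attributes is invented for illustration. How fast it is in practice will depend on the R version and the data.

```r
# Hypothetical collector: plain vectors that double in capacity when full,
# so the number of reallocations grows only logarithmically with the data.
make_collector <- function(initial_capacity = 1024L) {
  stopifnot(initial_capacity >= 1L)
  ids <- character(initial_capacity)
  values <- numeric(initial_capacity)
  n <- 0L  # number of slots actually filled

  add <- function(id, value) {
    if (n == length(ids)) {
      # Out of room: double both columns.
      ids <<- c(ids, character(length(ids)))
      values <<- c(values, numeric(length(values)))
    }
    n <<- n + 1L
    ids[n] <<- id
    values[n] <<- value
    invisible(NULL)
  }

  # Build the data.frame once, at the end, from the filled part only.
  result <- function() {
    data.frame(id = ids[seq_len(n)], value = values[seq_len(n)],
               stringsAsFactors = FALSE)
  }

  list(add = add, result = result)
}

collector <- make_collector(initial_capacity = 2L)

# A SAX-style start-element handler for a hypothetical <record id=".." value="..">.
start_element <- function(name, attrs, ...) {
  if (name == "record") {
    collector$add(attrs[["id"]], as.numeric(attrs[["value"]]))
  }
}

# Simulated events standing in for what a SAX parser would deliver.
for (i in 1:5) {
  start_element("record", c(id = paste0("r", i), value = as.character(i * 1.5)))
}
parsed <- collector$result()
```

With the XML package, a handler of this shape could presumably be supplied as handlers = list(startElement = start_element) to xmlEventParse; the loop above only simulates the events so that the sketch runs without an input file.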
R. Michael Weylandt <michael.weylandt@gmail.com>
2012-Oct-26  21:02 UTC
[R] Parsing very large xml datafiles with SAX (XML package): What data structure should I favor?
I'd look into the data.table package.

Cheers,
RMW
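As a rough illustration of that suggestion (not necessarily what was meant), data.table's set() assigns into a pre-allocated table by reference, which avoids the overhead of [<-.data.frame on each record. The column names, the capacity of 1000 rows and the add_record helper are assumptions made up for this sketch.

```r
library(data.table)

capacity <- 1000L  # assumed upper bound on the number of records
dt <- data.table(id = rep(NA_character_, capacity),
                 value = rep(NA_real_, capacity))
n <- 0L  # rows filled so far

# Hypothetical helper, to be called from a SAX handler for each record.
add_record <- function(id, value) {
  n <<- n + 1L
  # set() modifies dt in place; it errors if n exceeds the allocated rows.
  set(dt, i = n, j = "id", value = id)
  set(dt, i = n, j = "value", value = value)
  invisible(NULL)
}

for (k in 1:3) add_record(paste0("r", k), k / 2)

# Keep only the rows that were actually filled.
records <- dt[seq_len(n)]
```

If the number of records is not known in advance, the table would have to be grown in chunks when it fills up; that part is left out here.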