Frederic Fournier
2012-Aug-10 22:46 UTC
[R] Parsing large XML documents in R - how to optimize the speed?
Hello everyone,
I would like to parse very large XML files from MS/MS experiments and
create R objects from their content. (By very large, I mean up to
5-10 GB, although I am using a 'small' 40 MB file to test my code.)
My first attempt at parsing the 40 MB file, using the XML package, took more
than 2200 seconds and left me quite disappointed.
I managed to cut that down to around 40 seconds by:
    -using the 'useInternalNodes' option of the XML package when parsing
     the xml tree;
    -vectorizing the parsing (i.e., replacing loops like
     "for(node in group.of.nodes) {...}" with
     "sapply(group.of.nodes, function(node) {...})").
I gained another 5 seconds by making small changes to the functions used
(like replacing 'getNodeSet' with 'xmlElementsByTagName' when I don't need
to navigate to the children nodes).
Now I am blocked at around 35 seconds, and I would still like to cut this
time by 5x, but I have no clue how to achieve this gain. I'll try to
describe as briefly as possible the relevant structure of the xml file I
am parsing, the structure of the R object I want to create, and the type of
functions I am using to do it. I hope that one of you will be able to point
me towards a better and quicker way of doing the parsing!
Here is the (simplified) structure of the relevant nodes of the xml file:
<model> (many many nodes)
<protein> (a couple of proteins per model node)
<peptide> (1 per protein node)
<domain> (1 or more per peptide node)
<aa> (0 or more per domain node)
</aa>
</domain>
</peptide>
</protein>
</model>
Here is the basic structure of the R object that I want to create:
'result' object that contains:
    -various attributes
    -a list of 'protein' objects, each of which contains:
        -various attributes
        -a list of 'peptide' objects, each of which contains:
            -various attributes
            -a list of 'aa' objects, each of which consists of a couple
             of attributes.
Here is the basic structure of the code:

xml.doc <- xmlTreeParse("file", getDTD=FALSE, useInternalNodes=TRUE)
result <- new('S4_result_class')
result@proteins <- xpathApply(xml.doc, "//model/protein",
    function(protein.node) {
        protein <- new('S4_protein_class')
        ## fill in a couple of attributes of the protein object using
        ## xmlValue and xmlAttrs(protein.node)
        protein@peptides <- xpathApply(protein.node, "./peptide",
            function(peptide.node) {
                peptide <- new('S4_peptide_class')
                ## fill in a couple of attributes of the peptide object
                ## using xmlValue and xmlAttrs(peptide.node)
                peptide@aas <- sapply(xmlElementsByTagName(peptide.node, name="aa"),
                    function(aa.node) {
                        aa <- new('S4_aa_class')
                        ## fill in a couple of attributes of the 'aa' object
                        ## using xmlValue and xmlAttrs(aa.node)
                        aa
                    })
                peptide
            })
        protein
    })
free(xml.doc)
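One direction I have been wondering about (a sketch only, not tested on the real data, and assuming the attributes of interest are XML attributes rather than text content) is to pull whole columns of attributes in a single vectorized XPath call per level, and assemble the nested S4 objects afterwards, instead of calling new() inside every nested callback:

```r
library(XML)

## Sketch: one vectorized call per level of the hierarchy.
## xpathSApply returns the attribute sets for all matching nodes at once,
## so no S4 construction happens inside the traversal.
xml.doc <- xmlParse("file", getDTD = FALSE)

protein.attrs <- xpathSApply(xml.doc, "//model/protein", xmlAttrs)
peptide.attrs <- xpathSApply(xml.doc, "//model/protein/peptide", xmlAttrs)
aa.attrs      <- xpathSApply(xml.doc, "//model/protein/peptide/domain/aa",
                             xmlAttrs)

## Counts needed to re-nest the flat results afterwards, e.g. how many
## <aa> nodes belong to each peptide:
aa.per.peptide <- xpathSApply(xml.doc, "//model/protein/peptide",
                              function(n) length(getNodeSet(n, ".//aa")))

free(xml.doc)
```

Whether rebuilding the nesting from the flat vectors plus the counts is faster than the nested xpathApply calls would have to be checked by profiling; it trades many small XPath queries for a few large ones.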
Does anyone know a better and quicker way of doing this?
Sorry for the very long message and thank you very much for your time and
help!
Frederic
Martin Morgan
2012-Aug-11 00:17 UTC
[R] Parsing large XML documents in R - how to optimize the speed?
On 08/10/2012 03:46 PM, Frederic Fournier wrote [quoted message omitted]:
I'm not 100% sure of its relevance, but:
http://bioconductor.org/packages/2.10/bioc/html/MSnbase.html
There is a vignette here, for instance:
http://bioconductor.org/packages/2.10/bioc/vignettes/MSnbase/inst/doc/MSnbase-io.pdf
If this is useful, then further questions might be directed to the
Bioconductor mailing list:
http://bioconductor.org/help/mailing-list/
Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
Duncan Temple Lang
2012-Aug-11 14:30 UTC
[R] Parsing large XML documents in R - how to optimize the speed?
Hi Frederic
You definitely want to be using xmlParse() (or equivalently
xmlTreeParse( , useInternalNodes = TRUE)).
This then allows use of getNodeSet()
I would suggest you use Rprof() to find out where the bottlenecks arise,
e.g. in the XML functions or in S4 code, or in your code that assembles the
R objects from the XML.
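Duncan's profiling suggestion can be tried as follows (a minimal sketch; the output file name is arbitrary, and parse.my.file stands in for whatever top-level function runs the parsing):

```r
## Profile the parsing run to see where the time actually goes:
## XML-package functions, S4 dispatch/new(), or the assembly code.
Rprof("parse-profile.out")
result <- parse.my.file("file")   # hypothetical: your parsing routine
Rprof(NULL)

## Summarize time spent in each function; the "by.self" table
## shows which functions are the hotspots themselves.
summaryRprof("parse-profile.out")$by.self
```

If most of the self-time sits in new() or slot assignment rather than in the XML functions, the S4 object construction, not the parsing, is the bottleneck.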
I'm happy to take a look at speeding it up if you can make the test file
available and show me your code.
D.
Erdal Karaca
2012-Aug-11 22:12 UTC
[R] Parsing large XML documents in R - how to optimize the speed?
If this is an option for you: an XML database can handle (very) huge XML
files and lets you query nodes very efficiently. Then you could query the
XML database from R (using REST) to do your statistics. There are some open
source XQuery/XML databases available.
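Erdal's suggestion could look roughly like this (a hypothetical sketch: it assumes an XQuery/XML database such as BaseX is running locally with the document already loaded, and the URL, database name, and port are placeholders that depend entirely on the server's configuration):

```r
library(RCurl)
library(XML)

## Hypothetical: ask the database to evaluate the XPath/XQuery expression
## server-side, so R only ever sees the (much smaller) result fragment.
query <- "//model/protein/peptide/domain/aa"
res <- getURL(paste0("http://localhost:8984/rest/msms?query=",
                     curlEscape(query)))

## Parse just the returned fragment instead of the full 5-10 GB document.
aa.doc <- xmlParse(res, asText = TRUE)
```

The win is that indexing and node selection happen in the database, and R memory use is bounded by the result set rather than the whole file.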