Dear R experts,
I am trying to extract certain child nodes from an XML document and to
construct a table in which the parent node names are the columns and the
child id values, joined in a list, are the cell contents.
If I first apply an XPath query to extract all of the above parent nodes,
then iterate over those nodes and apply another XPath query to select their
child nodes, I get *ALL* matching child nodes of the whole document, *not*
just those of the currently queried parent.
I know this is because I prefix my XPath query with // and apparently any
given XMLNode "knows" of its whole document,
but I have not been able to find a proper solution.
So, my question is:
How do I restrict a call of getNodeSet to just a single XMLNode and not the
whole document it was retrieved from?
I use the XML and RCurl packages. The document I speak of is downloaded
from uniprot.org, a protein knowledge server well known to biologists.
The lamentably somewhat lengthy code follows:
library(XML)
library(RCurl)
getEntries <- function( uniprot.xml, uniprot.error.msg.regex='^ERROR' ) {
  # Uniprot's dbfetch can be asked to return several entry tags in the same
  # XML document. This function uses XPath queries to extract all complete
  # uniprot tags.
  #
  # Args:
  #  uniprot.xml             : The result of a web fetch to Uniprot, e.g.
  #                            using getURL.
  #  uniprot.error.msg.regex : A regular expression to avoid parsing an
  #                            error returned from Uniprot.
  #
  # Returns: A list of extracted uniprot-entry-tags as returned by function
  # 'getNodeSet'.
  #
  if ( ! is.null( uniprot.xml ) && '' != uniprot.xml &&
       ! grepl( uniprot.error.msg.regex, uniprot.xml )
  ) {
    ns <- c( xmlns="http://uniprot.org/uniprot" )
    getNodeSet(
      xmlInternalTreeParse( uniprot.xml ),
      "//xmlns:entry", namespaces=ns
    )
  } else {
    NULL
  }
}
extractExperimentallyVerifiedGoAnnos <- function( doc ) {
  # Uses XPath to extract those GO annotations that are experimentally
  # verified. Note that warnings generated by calls to the XML library are
  # suppressed so as not to confuse the user when no experimentally verified
  # GO annotations can be found.
  #
  # Args:
  #  doc : An XML tag of type entry as returned e.g. by function
  #        'getEntries'
  #
  # Returns: A character vector of the extracted experimentally verified GO
  # annotations, or NULL, if none can be found.
  #
  block <- function() {
    ns <- c( xmlns="http://uniprot.org/uniprot" )
    ndst <- suppressWarnings(
      getNodeSet( doc,
        "//xmlns:dbReference[@type='GO']//xmlns:property[@type='evidence'
         and ( contains(@value, 'EXP') or contains(@value, 'IDA') or
         contains(@value, 'IPI') or contains(@value, 'IMP') or
         contains(@value, 'IGI') or contains(@value, 'IEP') ) ]/..",
        namespaces=ns
      )
    )
    if ( ! is.null( ndst ) && length( ndst ) > 0 )
      vapply( ndst, xmlGetAttr, vector( mode='character', length=1 ), 'id' )
    else
      NULL
  }
  tryCatch( block(), error=function( err ) {
    warning( err, " caused by document ", doc )
  })
}
uniprotkb.url <- function( accession, frmt='xml' ) {
  # Returns valid URL to access Uniprot's RESTful Web-Service to download
  # data about the Protein as referenced by the argument 'accession'.
  # Note, that the accession is URL encoded before being pasted into the
  # Uniprot URL template.
  #
  # Args:
  #  accession : The Protein's Uniprot accession.
  #  frmt      : The format of the downloaded Uniprot Entry. Default is
  #              'xml'.
  #
  # Returns: The Uniprot URL for the argument accession.
  #
  paste(
    'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb/',
    URLencode( accession ),
    '/', frmt, sep=''
  )
}
retrieveExperimentallyVerifiedGOAnnotations <- function( uniprot.accessions ) {
  # Downloads and parses XML documents from Uniprot for each accession in
  # argument. Extracts all experimentally verified GO annotations.
  #
  # Args:
  #  uniprot.accessions : A character vector of Uniprot accessions.
  #
  # Returns: A matrix with row 'GO' and one column for each Uniprot
  # accession. Each cell is either NULL or a character vector holding all
  # experimentally verified GO annotations. NULL annotations are excluded,
  # so the returned matrix can be of zero columns and a single row.
  #
  fetch.url <- uniprotkb.url( paste( uniprot.accessions,
                                     collapse=",",
                                     sep="" ) )
  uniprot.entries <- getEntries( getURL( fetch.url ) )
  if ( ! is.null( uniprot.entries ) && length( uniprot.entries ) > 0 ) {
    annos <- do.call( 'cbind',
      lapply( uniprot.entries, function( d ) {
        list( 'GO'=extractExperimentallyVerifiedGoAnnos( d ) )
      })
    )
    colnames( annos ) <- uniprot.accessions
    # Exclude NULL columns:
    annos[ , as.character( annos[ 'GO', ] ) != 'NULL', drop=FALSE ]
  }
}
as.data.frame( retrieveExperimentallyVerifiedGOAnnotations(c("A0AEI7",
"Q9ZZX1")) )
This returns:
   A0AEI7                 Q9ZZX1
GO GO:0004519, GO:0006316 GO:0004519, GO:0006316
But the result should only have a single column, because A0AEI7 does not
have any experimentally verified Gene Ontology annotations.
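The behaviour described above can be reproduced without Uniprot on a toy document. The sketch below is only an illustration: the element names 'entry' and 'ref' are made up and are not Uniprot's, and the variant with a leading dot assumes that getNodeSet evaluates a relative XPath with the given node as its context node.

```r
library(XML)

# Toy document (hypothetical, not Uniprot data): two parent nodes,
# the first with two children, the second with one.
doc <- xmlInternalTreeParse(
  '<root>
     <entry name="a"><ref id="1"/><ref id="2"/></entry>
     <entry name="b"><ref id="3"/></entry>
   </root>',
  asText = TRUE
)

entries <- getNodeSet( doc, "//entry" )

# Path starting with //: evaluated from the document root, so each
# parent reports every 'ref' node in the whole document.
abs.counts <- sapply( entries, function( e ) {
  length( getNodeSet( e, "//ref" ) )
})

# Path starting with .//: the leading dot anchors the search at the
# node that was passed in, so only its own descendants are matched.
rel.counts <- sapply( entries, function( e ) {
  length( getNodeSet( e, ".//ref" ) )
})

print( abs.counts )
print( rel.counts )
```

On this toy input abs.counts is 3 for both parents, while rel.counts is 2 and 1. If the same holds for namespaced queries, changing the query in extractExperimentallyVerifiedGoAnnos to start with ".//xmlns:dbReference" may be enough, but I have not verified that against the Uniprot documents.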
Thank you very much in advance for your kind help!