Dear R experts,

I am trying to extract certain child nodes from an XML document and construct a table in which the parent node names are the columns and the child id values, joined in a list, are the cell content.

If I first apply an XPath query to extract all of the parent nodes, then iterate over those nodes and apply a second XPath query to select their child nodes, I get *ALL* matching child nodes of the whole document, *not* just those of the currently queried parent. I know this is because I prefix my XPath query with //, and apparently any given XMLNode "knows" about its whole document, but I have not been able to find a proper solution.

So, my question is: how do I restrict a call of getNodeSet to a single XMLNode instead of the whole document it was retrieved from?

I use the XML and RCurl packages. The document in question is downloaded from uniprot.org, a protein knowledge server well known to biologists.

The unfortunately somewhat lengthy code follows:

library(XML)
library(RCurl)

getEntries <- function( uniprot.xml, uniprot.error.msg.regex='^ERROR' ) {
  # Uniprot's dbfetch can be asked to return several entry tags in the same XML
  # document. This function uses XPath queries to extract all complete uniprot
  # tags.
  #
  # Args:
  #  uniprot.xml : The result of a web fetch to Uniprot, e.g. using
  #                getURL.
  #  uniprot.error.msg.regex : A regular expression to avoid parsing an error
  #                            returned from Uniprot.
  #
  # Returns: A list of extracted uniprot-entry-tags as returned by function
  # 'getNodeSet'.
  #
  if ( ! is.null( uniprot.xml ) &&
       '' != uniprot.xml &&
       ! grepl( uniprot.error.msg.regex, uniprot.xml ) ) {
    ns <- c( xmlns="http://uniprot.org/uniprot" )
    getNodeSet( xmlInternalTreeParse( uniprot.xml ),
      "//xmlns:entry", namespaces=ns )
  } else {
    NULL
  }
}

extractExperimentallyVerifiedGoAnnos <- function( doc ) {
  # Uses XPath to extract those GO annotations that are experimentally
  # verified. Note that warnings generated by calls to the XML library are
  # suppressed so as not to confuse the user when no experimentally verified
  # GO annotations can be found.
  #
  # Args:
  #  doc : An XML tag of type entry as returned, e.g., by function 'getEntries'
  #
  # Returns: A character vector of the extracted experimentally verified GO
  # annotations, or NULL, if none can be found.
  #
  block <- function() {
    ns <- c( xmlns="http://uniprot.org/uniprot" )
    ndst <- suppressWarnings(
      getNodeSet( doc,
        "//xmlns:dbReference[@type='GO']//xmlns:property[@type='evidence' and
          ( contains(@value, 'EXP') or contains(@value, 'IDA') or
            contains(@value, 'IPI') or contains(@value, 'IMP') or
            contains(@value, 'IGI') or contains(@value, 'IEP') ) ]/..",
        namespaces=ns )
    )
    if ( ! is.null( ndst ) && length( ndst ) > 0 )
      vapply( ndst, xmlGetAttr, vector( mode='character', length=1 ), 'id' )
    else
      NULL
  }
  tryCatch( block(), error=function( err ) {
    warning( err, " caused by document ", doc )
  })
}

uniprotkb.url <- function( accession, frmt='xml' ) {
  # Returns a valid URL to access Uniprot's RESTful Web-Service to download
  # data about the Protein referenced by the argument 'accession'.
  # Note that the accession is URL encoded before being pasted into the
  # Uniprot URL template.
  #
  # Args:
  #  accession : The Protein's Uniprot accession.
  #  frmt      : The format of the downloaded Uniprot Entry. Default is 'xml'.
  #
  # Returns: The Uniprot URL for the argument accession.
  #
  paste(
    'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb/',
    URLencode( accession ), '/', frmt,
    sep=''
  )
}

retrieveExperimentallyVerifiedGOAnnotations <- function( uniprot.accessions ) {
  # Downloads and parses XML documents from Uniprot for each accession in
  # argument. Extracts all experimentally verified GO annotations.
  #
  # Args:
  #  uniprot.accessions : A character vector of Uniprot accessions.
  #
  # Returns: A matrix with row 'GO' and one column for each Uniprot accession.
  # Each cell is either NULL or a character vector holding all experimentally
  # verified GO annotations. NULL annotations are excluded, so the returned
  # matrix can be of zero columns and a single row.
  #
  fetch.url <- uniprotkb.url(
    paste( uniprot.accessions, collapse=",", sep="" )
  )
  uniprot.entries <- getEntries( getURL( fetch.url ) )
  if ( ! is.null(uniprot.entries) && length( uniprot.entries ) > 0 ) {
    annos <- do.call( 'cbind',
      lapply( uniprot.entries , function( d ) {
        list( 'GO'=extractExperimentallyVerifiedGoAnnos( d ) )
      })
    )
    colnames( annos ) <- uniprot.accessions
    # Exclude NULL columns:
    annos[ , as.character( annos[ 'GO', ] ) != 'NULL' , drop=F ]
  }
}

as.data.frame(
  retrieveExperimentallyVerifiedGOAnnotations(c("A0AEI7", "Q9ZZX1"))
)

Returns:

   A0AEI7                 Q9ZZX1
GO GO:0004519, GO:0006316 GO:0004519, GO:0006316

But it should only have a single column, because A0AEI7 does not have any experimentally verified Gene Ontology annotations.

Thank you very much in advance for your kind help!
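
P.S. In case the full script is too long to read through, here is a much smaller sketch of the same behaviour that needs no download. The two-entry document is made up for illustration and is not real Uniprot output; only the namespace URI is taken from my code above. The second query, with a leading dot, is only my guess at what a node-relative XPath would look like, on the assumption that getNodeSet evaluates the expression with the given node as context.

```r
library(XML)

# Hypothetical miniature document: entry A has no GO reference, entry B has one.
txt <- '<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><accession>A</accession></entry>
  <entry><accession>B</accession>
    <dbReference type="GO" id="GO:0004519"/>
  </entry>
</uniprot>'

ns  <- c( xmlns="http://uniprot.org/uniprot" )
doc <- xmlInternalTreeParse( txt, asText=TRUE )
entries <- getNodeSet( doc, "//xmlns:entry", namespaces=ns )

# Number of dbReference nodes found per entry when the path starts with '//'.
# This is the form used in my functions above; it searches from the document
# root, so every entry reports the matches of the whole document.
abs.hits <- sapply( entries, function( e ) {
  length( getNodeSet( e, "//xmlns:dbReference", namespaces=ns ) )
})

# Same count with a path starting with './/', i.e. descendants of the
# context node only (assumption: the context node is the entry passed in).
rel.hits <- sapply( entries, function( e ) {
  length( getNodeSet( e, ".//xmlns:dbReference", namespaces=ns ) )
})

print( abs.hits )
print( rel.hits )
```

If the './/' form really does restrict the search to the current entry, the first entry should report zero matches in rel.hits, which is what I would need for A0AEI7 in the table above.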