thr3ads.net - R help - [R] read xml [Apr 2010]

Alex Campos
2010-Apr-16 18:05 UTC
[R] read xml

Hi,
I am trying to read selected fields from an XML file in R using the XML
package. So far I have learned the basics of this package by going
through the manual, examples, and tutorial (www.omegahat.org/RSXML).
The problem is that I get stuck when it comes to more complex XML
files. I am a novice in both R and XML, and was wondering if someone
could help me out here.

Here is my XML file. I am only interested in the <protein_group> nodes.
Therefore, I have omitted most of the information from the two
preceding nodes (protein_summary_header, proteinprophet_details).

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="http://localhost/ISB/data/interact-LFA1_C18_PME5R1.prot.xsl"?>
<protein_summary
xmlns="http://regis-web.systemsbiology.net/protXML"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://sashimi.sourceforge.net/schema_revision/protXML/protXML_v6.xsd"
summary_xml="interact-LFA1_C18_PME5R1.prot.xml">
<protein_summary_header
reference_database="EColi_decoy_v3.0.fasta">
<program_details analysis="proteinprophet">
<proteinprophet_details  occam_flag="Y"
run_options="XML">
<protein_group group_number="1" probability="1.0000">
       <protein protein_name="sp|P00004|CYC_HORSE"
n_indistinguishable_proteins="1" probability="1.0000"
percent_coverage="46.7"
unique_stripped_peptides="EDLIAYLK+EETLMEYLENPK+KTGQAPGFTYTDANK+TEREDLIAYLK+TGPNLHGLFGR+TGQAPGFTYTDANK"
group_sibling_id="a" total_number_peptides="226"
pct_spectrum_ids="2.54" confidence="1.00">
          <parameter name="prot_length" value="107"/>
          <annotation protein_description="Cytochrome c OS=Equus caballus GN=CYCS PE=1 SV=2"/>
          <peptide peptide_sequence="KTGQAPGFTYTDANK"
charge="2"
initial_probability="0.9989"
nsp_adjusted_probability="0.9998"
peptide_group_designator="a" weight="1.00"  
is_nondegenerate_evidence="Y" n_enzymatic_termini="2"  
n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
n_instances="10"
exp_tot_instances="9.94" is_contributing_evidence="Y"  
calc_neutral_pep_mass="1597.7737">
          </peptide>
          <peptide peptide_sequence="TGQAPGFTYTDANK"
charge="2"
initial_probability="0.9989"
nsp_adjusted_probability="0.9998"
weight="1.00" is_nondegenerate_evidence="Y"
n_enzymatic_termini="2"
n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
n_instances="90"
exp_tot_instances="89.82" is_contributing_evidence="Y"  
calc_neutral_pep_mass="1469.6786">
          </peptide>
          <peptide peptide_sequence="KTGQAPGFTYTDANK"
charge="3"
initial_probability="0.9990"
nsp_adjusted_probability="0.9998"
peptide_group_designator="a" weight="1.00"  
is_nondegenerate_evidence="Y" n_enzymatic_termini="2"  
n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
n_instances="10"
exp_tot_instances="9.89" is_contributing_evidence="Y"  
calc_neutral_pep_mass="1597.7737">
          </peptide>
       </protein>
</protein_group>
<protein_group group_number="2" probability="1.0000">
       <protein protein_name="sp|P00350|6PGD_ECOLI"
n_indistinguishable_proteins="1" probability="1.0000"
percent_coverage="32.1"
unique_stripped_peptides="AGAGTDAAIDSLKPYLDK+EAYELVAPILTK+EFVESLETPR+EKTEEVIAENPGK+GDIIIDGGNTFFQDTIR+GPSIMPGGQK+GYTVSIFNR+IAAVAEDGEPCVTYIGADGAGHYVK+IVSYAQGFSQLR+QIADDYQQALR+TEEVIAENPGK+VLSGPQAQPAGDK"
group_sibling_id="a" total_number_peptides="32" pct_spectrum_ids="0.36"
confidence="1.00">
          <parameter name="prot_length" value="474"/>
          <annotation protein_description="6-phosphogluconate deh ...


I did the following:
 > doc <- xmlRoot(xmlTreeParse("myfile.xml"))
 > xmlApply(doc, names)
$protein_summary_header
   program_details
"program_details"

$dataset_derivation
list()

$protein_group
   protein
"protein"

$protein_group
   protein
"protein"

[In fact, $protein_group appears a couple hundred times.]

So, I want to create a data frame comprising selected information
from each $protein_group, as follows:

group_number  protein_name          probability  peptide_sequence  initial_probability  n_instances
1             sp|P00004|CYC_HORSE   1.0000       KTGQAPGFTYTDANK   0.9989               10
1             sp|P00004|CYC_HORSE   1.0000       TGQAPGFTYTDANK    0.9989               90
1             sp|P00004|CYC_HORSE   1.0000       KTGQAPGFTYTDANK   0.9990               10
2             sp|P00350|6PGD_ECOLI  1.0000       NAPGTYCMR         0.9349               8
2             sp|P00350|6PGD_ECOLI  1.0000       TGAHPGPMK         0.9124               2

As I understand it, the variables in columns 4, 5, and 6 come from
nodes nested within protein_group. For each $protein_group, I need to
retrieve some of its children.
I would greatly appreciate any help.
Thank you very much,
Alex
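
One possible approach, offered as a minimal sketch and not a definitive answer: parse the file into an internal (C-level) document with xmlParse(), select every peptide node with an XPath query, and walk up from each peptide to its protein and protein_group to collect the attributes. XPath queries need internal nodes, which the plain xmlTreeParse() call in the post does not produce. The inline document `txt` is a hypothetical, trimmed stand-in for the real file, the prefix `p` is an arbitrary label bound to the default namespace declared on protein_summary, and taking `probability` from the protein node (not from protein_group) is an assumption, since both carry that attribute.

```r
library(XML)

# Hypothetical stand-in for the real file, trimmed to the attributes of interest.
txt <- '<protein_summary xmlns="http://regis-web.systemsbiology.net/protXML">
  <protein_group group_number="1" probability="1.0000">
    <protein protein_name="sp|P00004|CYC_HORSE" probability="1.0000">
      <peptide peptide_sequence="KTGQAPGFTYTDANK" initial_probability="0.9989" n_instances="10"/>
      <peptide peptide_sequence="TGQAPGFTYTDANK" initial_probability="0.9989" n_instances="90"/>
    </protein>
  </protein_group>
</protein_summary>'

# xmlParse() builds an internal document, which getNodeSet() can query with XPath.
# For a file on disk, xmlParse("myfile.xml") would be used instead.
doc <- xmlParse(txt, asText = TRUE)

# The document declares a default namespace, so the XPath query must bind
# a prefix to it; "p" is an arbitrary choice.
ns <- c(p = "http://regis-web.systemsbiology.net/protXML")
peptides <- getNodeSet(doc, "//p:protein_group/p:protein/p:peptide",
                       namespaces = ns)

# One row per peptide; parent attributes are fetched by walking up the tree.
rows <- lapply(peptides, function(pep) {
  prot <- xmlParent(pep)
  grp  <- xmlParent(prot)
  data.frame(
    group_number        = as.integer(xmlGetAttr(grp, "group_number")),
    protein_name        = xmlGetAttr(prot, "protein_name"),
    probability         = as.numeric(xmlGetAttr(prot, "probability")),  # assumption: protein-level value
    peptide_sequence    = xmlGetAttr(pep, "peptide_sequence"),
    initial_probability = as.numeric(xmlGetAttr(pep, "initial_probability")),
    n_instances         = as.integer(xmlGetAttr(pep, "n_instances")),
    stringsAsFactors    = FALSE
  )
})

result <- do.call(rbind, rows)
print(result)
```

Selecting the peptide nodes directly and climbing to their ancestors avoids nested loops over groups and proteins. With a couple hundred protein_group nodes, building one small data frame per peptide should still be fast enough, though collecting the attributes into vectors first would scale better.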