Hi,
I am trying to read selected fields from an XML file with R using the XML
package. So far I have learned the basics of this package by going
through the manual, examples, and tutorial (www.omegahat.org/RSXML).
The problem is that I get stuck when it comes to more
complex XML files. I am a novice in both R and XML, and was wondering if
someone could help me out here.
Here is my XML file. I am only interested in the <protein_group> nodes.
Therefore, I have omitted most of the information from the two
preceding nodes (protein_summary_header, proteinprophet_details).
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl"
href="http://localhost/ISB/data/interact-LFA1_C18_PME5R1.prot.xsl
"?>
<protein_summary
xmlns="http://regis-web.systemsbiology.net/protXML"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://sashimi.sourceforge.net/schema_revision/protXML/protXML_v6.xsd
" summary_xml="interact-LFA1_C18_PME5R1.prot.xml">
<protein_summary_header
reference_database="EColi_decoy_v3.0.fasta">
<program_details analysis="proteinprophet">
<proteinprophet_details occam_flag="Y"
run_options="XML">
<protein_group group_number="1" probability="1.0000">
<protein protein_name="sp|P00004|CYC_HORSE"
n_indistinguishable_proteins="1" probability="1.0000"
percent_coverage="46.7"
unique_stripped_peptides="EDLIAYLK+EETLMEYLENPK
+KTGQAPGFTYTDANK+TEREDLIAYLK+TGPNLHGLFGR+TGQAPGFTYTDANK"
group_sibling_id="a" total_number_peptides="226"
pct_spectrum_ids="2.54" confidence="1.00">
<parameter name="prot_length" value="107"/>
<annotation protein_description="Cytochrome c OS=Equus
caballus GN=CYCS PE=1 SV=2"/>
<peptide peptide_sequence="KTGQAPGFTYTDANK"
charge="2"
initial_probability="0.9989"
nsp_adjusted_probability="0.9998"
peptide_group_designator="a" weight="1.00"
is_nondegenerate_evidence="Y" n_enzymatic_termini="2"
n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
n_instances="10"
exp_tot_instances="9.94" is_contributing_evidence="Y"
calc_neutral_pep_mass="1597.7737">
</peptide>
<peptide peptide_sequence="TGQAPGFTYTDANK"
charge="2"
initial_probability="0.9989"
nsp_adjusted_probability="0.9998"
weight="1.00" is_nondegenerate_evidence="Y"
n_enzymatic_termini="2"
n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
n_instances="90"
exp_tot_instances="89.82" is_contributing_evidence="Y"
calc_neutral_pep_mass="1469.6786">
</peptide>
<peptide peptide_sequence="KTGQAPGFTYTDANK"
charge="3"
initial_probability="0.9990"
nsp_adjusted_probability="0.9998"
peptide_group_designator="a" weight="1.00"
is_nondegenerate_evidence="Y" n_enzymatic_termini="2"
n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
n_instances="10"
exp_tot_instances="9.89" is_contributing_evidence="Y"
calc_neutral_pep_mass="1597.7737">
</peptide>
</protein>
</protein_group>
<protein_group group_number="2" probability="1.0000">
<protein protein_name="sp|P00350|6PGD_ECOLI"
n_indistinguishable_proteins="1" probability="1.0000"
percent_coverage="32.1"
unique_stripped_peptides="AGAGTDAAIDSLKPYLDK
+EAYELVAPILTK+EFVESLETPR+EKTEEVIAENPGK+GDIIIDGGNTFFQDTIR+GPSIMPGGQK
+GYTVSIFNR+IAAVAEDGEPCVTYIGADGAGHYVK+IVSYAQGFSQLR+QIADDYQQALR
+TEEVIAENPGK+VLSGPQAQPAGDK" group_sibling_id="a"
total_number_peptides="32" pct_spectrum_ids="0.36"
confidence="1.00">
<parameter name="prot_length" value="474"/>
<annotation protein_description="6-phosphogluconate deh ...
I did the following:
> doc <- xmlRoot(xmlTreeParse("myfile.xml"))
> xmlApply(doc, names)
$protein_summary_header
program_details
"program_details"
$dataset_derivation
list()
$protein_group
protein
"protein"
$protein_group
protein
"protein"
[In fact, $protein_group appears a couple hundred times.]
So, I want to create a data frame comprising selected information
from each $protein_group, as follows:
group_number  protein_name          probability  peptide_sequence  initial_probability  n_instances
1             sp|P00004|CYC_HORSE   1.0000       KTGQAPGFTYTDANK   0.9989               10
1             sp|P00004|CYC_HORSE   1.0000       TGQAPGFTYTDANK    0.9989               90
1             sp|P00004|CYC_HORSE   1.0000       KTGQAPGFTYTDANK   0.9990               10
2             sp|P00350|6PGD_ECOLI  1.0000       NAPGTYCMR         0.9349               8
2             sp|P00350|6PGD_ECOLI  1.0000       TGAHPGPMK         0.9124               2
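As I understand it, the variables in columns 4, 5 and 6 are attributes
of the <peptide> nodes, which sit inside <protein> within each
protein_group. So for each $protein_group, I need to retrieve some
of its descendants along with attributes of the group itself.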
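
To make the target concrete, here is a rough sketch of one way this might be
done. It is not a tested solution for the full file. It assumes the XML package's
XPath support (xmlParse, getNodeSet, xmlParent, xmlGetAttr), an arbitrary
namespace prefix "p" bound to the default namespace shown in the file, and
that the "probability" column should come from the <protein> node rather
than from the <protein_group> node. The XML string is a trimmed copy of the
file above with only two peptides and a few attributes.

```r
library(XML)

# Trimmed copy of the protXML from this post, inlined so the sketch runs on
# its own. With the real file, xmlParse("myfile.xml") would be used instead.
xml_text <- '<protein_summary xmlns="http://regis-web.systemsbiology.net/protXML">
  <protein_group group_number="1" probability="1.0000">
    <protein protein_name="sp|P00004|CYC_HORSE" probability="1.0000">
      <peptide peptide_sequence="KTGQAPGFTYTDANK" initial_probability="0.9989" n_instances="10"/>
      <peptide peptide_sequence="TGQAPGFTYTDANK" initial_probability="0.9989" n_instances="90"/>
    </protein>
  </protein_group>
</protein_summary>'
doc <- xmlParse(xml_text, asText = TRUE)

# The file declares a default namespace, so XPath needs a prefix for it.
# The prefix "p" is an arbitrary choice made for this sketch.
ns <- c(p = "http://regis-web.systemsbiology.net/protXML")
peptides <- getNodeSet(doc, "//p:protein_group/p:protein/p:peptide",
                       namespaces = ns)

# Build one row per peptide, walking up to its protein and protein_group.
rows <- lapply(peptides, function(pep) {
  protein <- xmlParent(pep)
  group <- xmlParent(protein)
  data.frame(
    group_number = as.integer(xmlGetAttr(group, "group_number")),
    protein_name = xmlGetAttr(protein, "protein_name"),
    # Assumption: the protein-level probability is the one wanted.
    probability = as.numeric(xmlGetAttr(protein, "probability")),
    peptide_sequence = xmlGetAttr(pep, "peptide_sequence"),
    initial_probability = as.numeric(xmlGetAttr(pep, "initial_probability")),
    n_instances = as.integer(xmlGetAttr(pep, "n_instances")),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, rows)
print(result)
```

Starting from the peptide nodes and walking upward keeps one row per peptide,
which matches the layout of the table above.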
I would greatly appreciate any help.
Thank you very much,
Alex