Hi,

I am trying to read selected fields from an XML file in R using the XML package. So far I have learned the basics of the package by going through the manual, examples, and tutorial (www.omegahat.org/RSXML). The problem is that I get stuck with more complex XML files. I am a novice in both R and XML, and was wondering if someone could help me out here.

Here is my XML file. I am only interested in the <protein_group> nodes, so I have omitted most of the information from the two preceding nodes (protein_summary_header, proteinprophet_details).

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="http://localhost/ISB/data/interact-LFA1_C18_PME5R1.prot.xsl"?>
    <protein_summary xmlns="http://regis-web.systemsbiology.net/protXML"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://sashimi.sourceforge.net/schema_revision/protXML/protXML_v6.xsd"
        summary_xml="interact-LFA1_C18_PME5R1.prot.xml">
    <protein_summary_header reference_database="EColi_decoy_v3.0.fasta">
    <program_details analysis="proteinprophet">
    <proteinprophet_details occam_flag="Y" run_options="XML">
    <protein_group group_number="1" probability="1.0000">
      <protein protein_name="sp|P00004|CYC_HORSE" n_indistinguishable_proteins="1"
          probability="1.0000" percent_coverage="46.7"
          unique_stripped_peptides="EDLIAYLK+EETLMEYLENPK+KTGQAPGFTYTDANK+TEREDLIAYLK+TGPNLHGLFGR+TGQAPGFTYTDANK"
          group_sibling_id="a" total_number_peptides="226" pct_spectrum_ids="2.54"
          confidence="1.00">
        <parameter name="prot_length" value="107"/>
        <annotation protein_description="Cytochrome c OS=Equus caballus GN=CYCS PE=1 SV=2"/>
        <peptide peptide_sequence="KTGQAPGFTYTDANK" charge="2"
            initial_probability="0.9989" nsp_adjusted_probability="0.9998"
            peptide_group_designator="a" weight="1.00" is_nondegenerate_evidence="Y"
            n_enzymatic_termini="2" n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
            n_instances="10" exp_tot_instances="9.94" is_contributing_evidence="Y"
            calc_neutral_pep_mass="1597.7737">
        </peptide>
        <peptide peptide_sequence="TGQAPGFTYTDANK" charge="2"
            initial_probability="0.9989" nsp_adjusted_probability="0.9998"
            weight="1.00" is_nondegenerate_evidence="Y"
            n_enzymatic_termini="2" n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
            n_instances="90" exp_tot_instances="89.82" is_contributing_evidence="Y"
            calc_neutral_pep_mass="1469.6786">
        </peptide>
        <peptide peptide_sequence="KTGQAPGFTYTDANK" charge="3"
            initial_probability="0.9990" nsp_adjusted_probability="0.9998"
            peptide_group_designator="a" weight="1.00" is_nondegenerate_evidence="Y"
            n_enzymatic_termini="2" n_sibling_peptides="8.50" n_sibling_peptides_bin="6"
            n_instances="10" exp_tot_instances="9.89" is_contributing_evidence="Y"
            calc_neutral_pep_mass="1597.7737">
        </peptide>
      </protein>
    </protein_group>
    <protein_group group_number="2" probability="1.0000">
      <protein protein_name="sp|P00350|6PGD_ECOLI" n_indistinguishable_proteins="1"
          probability="1.0000" percent_coverage="32.1"
          unique_stripped_peptides="AGAGTDAAIDSLKPYLDK+EAYELVAPILTK+EFVESLETPR+EKTEEVIAENPGK+GDIIIDGGNTFFQDTIR+GPSIMPGGQK+GYTVSIFNR+IAAVAEDGEPCVTYIGADGAGHYVK+IVSYAQGFSQLR+QIADDYQQALR+TEEVIAENPGK+VLSGPQAQPAGDK"
          group_sibling_id="a" total_number_peptides="32" pct_spectrum_ids="0.36"
          confidence="1.00">
        <parameter name="prot_length" value="474"/>
        <annotation protein_description="6-phosphogluconate deh ...

I did the following:

    > doc <- xmlRoot(xmlTreeParse("myfile.xml"))
    > xmlApply(doc, names)
    $protein_summary_header
      program_details
    "program_details"

    $dataset_derivation
    list()

    $protein_group
      protein
    "protein"

    $protein_group
      protein
    "protein"

    [In fact, $protein_group appears a couple hundred times.]

So, I want to create a data frame consisting of selected information from each $protein_group, as follows:

    group_number  protein_name          probability  peptide_sequence  initial_probability  n_instances
    1             sp|P00004|CYC_HORSE   1.0000       KTGQAPGFTYTDANK   0.9989               10
    1             sp|P00004|CYC_HORSE   1.0000       TGQAPGFTYTDANK    0.9989               90
    1             sp|P00004|CYC_HORSE   1.0000       KTGQAPGFTYTDANK   0.9990               10
    2             sp|P00350|6PGD_ECOLI  1.0000       NAPGTYCMR         0.9349               8
    2             sp|P00350|6PGD_ECOLI  1.0000       TGAHPGPMK         0.9124               2

As I understand it, the variables in columns 4, 5 and 6 are attributes of the peptide nodes nested inside each protein_group. So for each $protein_group, I need to retrieve some of its children and their attributes.

I would greatly appreciate any help.

Thank you very much,
Alex
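
For reference, one way the extraction described above might be approached is with XPath through the XML package's internal-node parser. The following is only a minimal sketch under stated assumptions, not a verified solution for the full file: the inline `xml_text` is a hypothetical, abbreviated stand-in for the real document, `attr_or_na` is a helper name made up for illustration, and the `probability` column is assumed to come from the protein node rather than the protein_group node (both are 1.0000 in the sample). Because the file declares a default namespace, the XPath expression binds an arbitrary prefix (`d`) to it.

```r
library(XML)

# Hypothetical, abbreviated stand-in for the file in the question:
# only the attributes needed for the target table are kept.
xml_text <- '<protein_summary xmlns="http://regis-web.systemsbiology.net/protXML">
  <protein_group group_number="1" probability="1.0000">
    <protein protein_name="sp|P00004|CYC_HORSE" probability="1.0000">
      <peptide peptide_sequence="KTGQAPGFTYTDANK" initial_probability="0.9989" n_instances="10"/>
      <peptide peptide_sequence="TGQAPGFTYTDANK" initial_probability="0.9989" n_instances="90"/>
      <peptide peptide_sequence="KTGQAPGFTYTDANK" initial_probability="0.9990" n_instances="10"/>
    </protein>
  </protein_group>
</protein_summary>'

# For the real file this would be xmlParse("myfile.xml").
doc <- xmlParse(xml_text, asText = TRUE)

# The document uses a default namespace, so bind a prefix to it for XPath.
ns <- c(d = "http://regis-web.systemsbiology.net/protXML")

# One node per output row: every peptide under a protein under a protein_group.
peptides <- getNodeSet(doc, "//d:protein_group/d:protein/d:peptide",
                       namespaces = ns)

# Hypothetical helper: attribute value, or NA when the attribute is absent.
attr_or_na <- function(node, name) {
  xmlGetAttr(node, name, default = NA_character_)
}

rows <- lapply(peptides, function(pep) {
  prot <- xmlParent(pep)   # enclosing <protein>
  grp  <- xmlParent(prot)  # enclosing <protein_group>
  data.frame(
    group_number        = attr_or_na(grp,  "group_number"),
    protein_name        = attr_or_na(prot, "protein_name"),
    probability         = attr_or_na(prot, "probability"),  # assumption: protein-level value
    peptide_sequence    = attr_or_na(pep,  "peptide_sequence"),
    initial_probability = attr_or_na(pep,  "initial_probability"),
    n_instances         = attr_or_na(pep,  "n_instances"),
    stringsAsFactors    = FALSE
  )
})

# Stack the one-row data frames into the final table.
result <- do.call(rbind, rows)
print(result)
```

All columns come back as character strings in this sketch; columns such as `initial_probability` or `n_instances` could be converted afterwards with `as.numeric()` if numeric values are needed.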