sowmiyan
2016-Jan-24 17:27 UTC
[R] Extracting complete information from XML data file using R-Nested Lists
I am working with an XML file, which can be found at the link Sample XML file
<https://www.dropbox.com/s/8kn9g8xev2u5n8o/Dummy.xml?dl=0&preview=Dummy.xml>
I am trying to extract the information in every field to a CSV file. My
required output is below:
*Total of 20 columns and 2 rows*
DateCreated DateModified Creator.UserAccountName Creator.PersonName
Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
Modifier..attrs.referenceNumber AdditionalEmailStr AdditionalComment
DateIssued DocumentaryInstructions NominationParcel.attr.Referencenumber
NominationParcel.SecondContractNumber
NominationParcel.Coordinator.RefernceNumber
NominationParcel.Coordinator.Username NominationParcel.Coordinator.Email
NominationParcel.Coordinator.Office.Name
NominationParcel.Coordinator.Office.Email
NominationParcel.Coordinator.Office.attrs.referenceNumber
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Good work 7 sam
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Nicely Performed 10 107 102
But I am not able to get my output in the required format. I have tried
two different ways.
1 Below is my first code. The problem with it is that my NULL fields are
not captured correctly, so data spills over into neighbouring columns. I am
also not able to capture all the fields of the nested lists in the XML.
*Code 1*
library(XML)
doc <- xmlParse("Dummy.xml")
lst <- xmlToList(doc)
# Note: the body uses the global `cols`, not the argument `col`
f <- function(col) do.call(rbind, lapply(lst, function(x) unlist(x[cols])))
cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
          "AdditionalEmailStr", "AdditionalComment", "DateIssued",
          "DocumentaryInstructions", "NominationParcel")
res <- setNames(lapply(cols, f), cols)
list2env(res, .GlobalEnv)
*Output 1*
DateCreated DateModified Creator.UserAccountName Creator.PersonName
Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
Modifier..attrs.referenceNumber AdditionalComment
NominationParcel.Coordinator.UserAccountName
NominationParcel.Coordinator.Office..attrs.referenceNumber
NominationParcel.Coordinator..attrs.referenceNumber
NominationParcel..attrs.referenceNumber
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Good Work sam 7
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Nicely performed 102 107 10
2007-11-25T17:18:01
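The spillover described above can be reproduced without the XML file. The sketch below uses two hand-built records shaped like xmlToList() output; the field values, including the e-mail address, are made up for illustration. It shows that unlist() drops NULL entries, so rbind() recycles the shorter row, and that padding every record onto a fixed set of field names keeps the columns aligned. The helper name pad is hypothetical.

```r
# Two hand-built records; the values are illustrative, not taken from Dummy.xml.
rec1 <- list(DateCreated = "2007-11-25T17:01:32",
             AdditionalEmailStr = NULL,          # empty element -> NULL
             AdditionalComment = "Good work")
rec2 <- list(DateCreated = "2007-11-25T17:18:01",
             AdditionalEmailStr = "someone@example.com",
             AdditionalComment = "Nicely Performed")
recs <- list(rec1, rec2)

# unlist() silently drops the NULL, so row 1 has 2 values and row 2 has 3.
# rbind() recycles the short row (R normally warns about this).
spilled <- suppressWarnings(do.call(rbind, lapply(recs, unlist)))

# Fix: start every row as all-NA over a fixed field list, then fill by name.
fields <- c("DateCreated", "AdditionalEmailStr", "AdditionalComment")
pad <- function(rec) {
  vals <- unlist(rec)
  row <- setNames(rep(NA_character_, length(fields)), fields)
  row[names(vals)] <- vals
  row
}
aligned <- do.call(rbind, lapply(recs, pad))

print(spilled)
print(aligned)
```

In spilled, the comment of the first record lands under AdditionalEmailStr; in aligned, that cell is NA and the comment stays in its own column.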
2 To avoid the spillover of information from one cell to another caused by
NULL, I used for loops to replace the NULL cells with NA. With this I was
able to capture the correct data, but I could not get the information from
all the fields present in the XML.
*Code 2*
library(XML)
doc <- xmlParse("Dummy.xml")
lstsub <- xmlToList(doc)
# Replace NULL entries with NA, up to three levels below each record
for (i in 1:length(lstsub))
{
  for (j in 1:length(lstsub[[i]]))
  {
    lstsub[[i]][[j]] <- ifelse(is.null(lstsub[[i]][[j]]), NA, lstsub[[i]][[j]])
    if (length(lstsub[[i]][[j]]) > 1)
    {
      for (k in 1:length(lstsub[[i]][[j]]))
      {
        lstsub[[i]][[j]][[k]] <-
          ifelse(is.null(lstsub[[i]][[j]][[k]]), NA, lstsub[[i]][[j]][[k]])
        if (length(lstsub[[i]][[j]][[k]]) > 1)
        {
          for (l in 1:length(lstsub[[i]][[j]][[k]]))
          {
            lstsub[[i]][[j]][[k]][[l]] <-
              ifelse(is.null(lstsub[[i]][[j]][[k]][[l]]), NA, lstsub[[i]][[j]][[k]][[l]])
          }
        }
      }
    }
  }
}
f <- function(col) do.call(rbind, lapply(lstsub, function(x) unlist(x[cols])))
cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
          "AdditionalEmailStr", "AdditionalComment", "DateIssued",
          "DocumentaryInstructions", "NominationParcel")
res <- setNames(lapply(cols, f), cols)
list2env(res, .GlobalEnv)
write.csv(Creator, "dummy_2.csv")
*Output 2*
           DateCreated         DateModified        Creator Modifier AdditionalEmailStr AdditionalComment DateIssued DocumentaryInstructions
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker mkolker  NA                 Good Work         NA         NA
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker mkolker  NA                 Nicely performed  NA         NA
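One likely reason the nested fields disappear in Output 2 is that ifelse() returns a result the same length as its test. The test here is a single TRUE/FALSE, so a nested list is cut down to its first element. The sketch below shows this on a hand-built record (field names are from the question, the nesting is assumed) and contrasts it with a small recursive replacement. The helper name null_to_na is hypothetical.

```r
# Hand-built record; nesting is an assumption about what xmlToList() returns.
rec <- list(
  Creator = list(UserAccountName = "mkolker",
                 PersonName = "Merryn Kolker",
                 .attrs = c(referenceNumber = "15351")),
  AdditionalEmailStr = NULL
)

# ifelse() with a length-1 test keeps only the first element of Creator.
truncated <- ifelse(is.null(rec$Creator), NA, rec$Creator)

# Recursive alternative: swap NULL for NA at any depth, keep everything else.
null_to_na <- function(x) {
  if (is.null(x)) return(NA_character_)
  if (is.list(x)) return(lapply(x, null_to_na))
  x
}
flat <- unlist(null_to_na(rec))

print(unlist(truncated))
print(names(flat))
```

Because the recursion does not depend on a fixed number of loop levels, it also reaches fields nested deeper than three levels, such as those under NominationParcel.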
Could somebody please help me get the required output?
I have posted the same question on Stack Overflow; the link is here (it
might give a clearer picture):
http://stackoverflow.com/questions/34963724/extracting-complete-information-from-nested-lists-in-xml-to-a-data-frame-using-r/34963821#34963821
Regards,
Sowmiyan
Oliver Keyes
2016-Jan-24 20:19 UTC
[R] Extracting complete information from XML data file using R-Nested Lists
Hey Sowmiyan,
I would recommend taking a look at the xml2 package, rather than the XML package, for a start. It's a lot more structured, and traversing between elements is far easier :)
--
Oliver Keyes
Count Logula
Wikimedia Foundation
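A minimal sketch of what the xml2 route could look like, assuming the xml2 package is installed. The inline document is invented for illustration: its element names follow the column names in the question, but the real layout of Dummy.xml (including the root element name Nominations) is an assumption. The helper name field is hypothetical.

```r
library(xml2)

# Invented sample; the real Dummy.xml may be structured differently.
sample_xml <- '<Nominations>
  <Nomination>
    <DateCreated>2007-11-25T17:01:32</DateCreated>
    <Creator referenceNumber="15351">
      <UserAccountName>mkolker</UserAccountName>
      <PersonName>Merryn Kolker</PersonName>
    </Creator>
    <AdditionalComment>Good work</AdditionalComment>
  </Nomination>
  <Nomination>
    <DateCreated>2007-11-25T17:18:01</DateCreated>
    <Creator referenceNumber="15351">
      <UserAccountName>mkolker</UserAccountName>
      <PersonName>Merryn Kolker</PersonName>
    </Creator>
    <AdditionalComment>Nicely Performed</AdditionalComment>
  </Nomination>
</Nominations>'

doc <- read_xml(sample_xml)
records <- xml_find_all(doc, "/Nominations/Nomination")

# xml_find_first() returns one result per record, so a missing element
# becomes NA instead of shifting later values into the wrong column.
field <- function(path) xml_text(xml_find_first(records, path))

out <- data.frame(
  DateCreated             = field("DateCreated"),
  Creator.UserAccountName = field("Creator/UserAccountName"),
  Creator.PersonName      = field("Creator/PersonName"),
  Creator.referenceNumber = xml_attr(xml_find_first(records, "Creator"),
                                     "referenceNumber"),
  AdditionalEmailStr      = field("AdditionalEmailStr"),
  AdditionalComment       = field("AdditionalComment"),
  stringsAsFactors = FALSE
)
print(out)
```

Each column is one XPath expression relative to the record node, so the deeper NominationParcel fields could be added the same way, and the finished data frame can be passed to write.csv().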