sowmiyan
2016-Jan-24 17:27 UTC
[R] Extracting complete information from XML data file using R-Nested Lists
I am working with an XML file, which can be found at the link Sample XML file <dropbox.com/s/8kn9g8xev2u5n8o/Dummy.xml?dl=0&preview=Dummy.xml>

I am trying to extract the information in every field to a CSV file. I want my output to be as below.

Required output:
*Total of 20 columns and 2 rows*

Columns:
DateCreated DateModified Creator.UserAccountName Creator.PersonName
Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
Modifier..attrs.referenceNumber AdditionalEmailStr AdditionalComment
DateIssued DocumentaryInstructions NominationParcel.attr.Referencenumber
NominationParcel.SecondContractNumber
NominationParcel.Coordinator.RefernceNumber
NominationParcel.Coordinator.Username NominationParcel.Coordinator.Email
NominationParcel.Coordinator.Office.Name
NominationParcel.Coordinator.Office.Email
NominationParcel.Coordinator.Office.attrs.referenceNumber

Rows (empty cells are not shown):
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker 15351 mkolker Merryn Kolker 15351 Good work 7 sam
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker 15351 mkolker Merryn Kolker 15351 Nicely Performed 10 107 102

But I am not able to get my output in the required format. I have tried two different ways.

1. Below is my first code. The problem with it is that my NULL fields are not captured correctly and data spills over into the wrong columns.
Also, I am not able to capture all the fields of the nested lists in the XML.

*Code 1*

    library(XML)

    doc <- xmlParse("Dummy.xml")
    lst <- xmlToList(doc)
    # Bind the unlisted fields of every record into one matrix.
    # Note: f() indexes with `cols`, not with its `col` argument.
    f <- function(col) do.call(rbind, lapply(lst, function(x) unlist(x[cols])))
    cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
              "AdditionalEmailStr", "AdditionalComment", "DateIssued",
              "DocumentaryInstructions", "NominationParcel")
    res <- setNames(lapply(cols, f), cols)
    list2env(res, .GlobalEnv)

*Output 1*

Columns:
DateCreated DateModified Creator.UserAccountName Creator.PersonName
Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
Modifier..attrs.referenceNumber AdditionalComment
NominationParcel.Coordinator.UserAccountName
NominationParcel.Coordinator.Office..attrs.referenceNumber
NominationParcel.Coordinator..attrs.referenceNumber
NominationParcel..attrs.referenceNumber

Rows:
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker 15351 mkolker Merryn Kolker 15351 Good Work sam 7
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker 15351 mkolker Merryn Kolker 15351 Nicely performed 102 107 10 2007-11-25T17:18:01

2. To avoid information spilling from one cell into another because of "NULL", I have used a for loop to replace the NULL cells with NA.
By doing this I was able to capture the correct data, but I could not get all the fields present in the XML.

*Code 2*

    library(XML)

    doc <- xmlParse("Dummy.xml")
    lstsub <- xmlToList(doc)
    # Replace NULL entries with NA, level by level (up to four levels deep)
    for (i in seq_along(lstsub)) {
      for (j in seq_along(lstsub[[i]])) {
        lstsub[[i]][[j]] <- ifelse(is.null(lstsub[[i]][[j]]), NA, lstsub[[i]][[j]])
        if (length(lstsub[[i]][[j]]) > 1) {
          for (k in seq_along(lstsub[[i]][[j]])) {
            lstsub[[i]][[j]][[k]] <- ifelse(is.null(lstsub[[i]][[j]][[k]]), NA, lstsub[[i]][[j]][[k]])
            if (length(lstsub[[i]][[j]][[k]]) > 1) {
              for (l in seq_along(lstsub[[i]][[j]][[k]])) {
                lstsub[[i]][[j]][[k]][[l]] <- ifelse(is.null(lstsub[[i]][[j]][[k]][[l]]), NA, lstsub[[i]][[j]][[k]][[l]])
              }
            }
          }
        }
      }
    }
    f <- function(col) do.call(rbind, lapply(lstsub, function(x) unlist(x[cols])))
    cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
              "AdditionalEmailStr", "AdditionalComment", "DateIssued",
              "DocumentaryInstructions", "NominationParcel")
    res <- setNames(lapply(cols, f), cols)
    list2env(res, .GlobalEnv)
    write.csv(Creator, "dummy_2.csv")

*Output 2*

               DateCreated         DateModified        Creator Modifier
    Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker mkolker
    Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker mkolker

               AdditionalEmailStr AdditionalComment DateIssued DocumentaryInstructions
    Nomination NA                 Good Work         NA         NA
    Nomination NA                 Nicely performed  NA         NA

Could somebody please help me get the required output?

I have posted the same question on Stack Overflow; the link is here (it might give a clearer picture):

stackoverflow.com/questions/34963724/extracting-complete-information-from-nested-lists-in-xml-to-a-data-frame-using-r/34963821#34963821

Regards,
Sowmiyan
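The two symptoms described in this post can be reproduced on toy data, which may help when reading the code above. This is only a sketch, not the poster's code: the records, the e-mail address, and the helper name `null_to_na` are all made up for illustration, and the real Dummy.xml may be shaped differently.

```r
# Toy records standing in for xmlToList() output (values are illustrative).
rec1 <- list(DateCreated = "2007-11-25T17:01:32",
             AdditionalEmailStr = NULL,
             AdditionalComment = "Good work")
rec2 <- list(DateCreated = "2007-11-25T17:18:01",
             AdditionalEmailStr = "someone@example.org",
             AdditionalComment = "Nicely Performed")

# Symptom 1: unlist() silently drops NULL entries, so the rows end up with
# different lengths and rbind() recycles values into the wrong columns.
lengths_raw <- c(length(unlist(rec1)), length(unlist(rec2)))  # 2 and 3

# Symptom 2: ifelse() returns a result as long as its test (length 1 here),
# so a nested record is cut down to its first element.
creator <- list(UserAccountName = "mkolker", PersonName = "Merryn Kolker")
truncated <- ifelse(is.null(creator), NA, creator)            # length 1

# Hypothetical helper: recursively swap NULL for NA, keeping nested fields.
null_to_na <- function(x) {
  if (is.null(x)) return(NA_character_)
  if (is.list(x)) return(lapply(x, null_to_na))
  x
}

# After the replacement every row has the same fields, so rbind() lines up.
rows <- lapply(list(rec1, rec2), function(r) unlist(null_to_na(r)))
aligned <- do.call(rbind, rows)
```

A recursive helper of this kind avoids hard-coding the nesting depth, which the four nested for loops above have to do by hand.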
Oliver Keyes
2016-Jan-24 20:19 UTC
[R] Extracting complete information from XML data file using R-Nested Lists
Hey Sowmiyan,

I would recommend taking a look at the xml2 package, rather than the XML package, for a start. It is a lot more structured, and traversing between elements is far easier :)

--
Oliver Keyes
Count Logula
Wikimedia Foundation
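For readers who want to try the xml2 route suggested in this reply, here is a minimal sketch. The inline XML is a hypothetical stand-in for Dummy.xml (its element and attribute layout is assumed, not taken from the real file), and the output column names are chosen for illustration. It relies on `xml_find_first()` returning a missing node when nothing matches, which `xml_text()` and `xml_attr()` turn into NA, so absent fields stay in their own column.

```r
library(xml2)

# Hypothetical stand-in for Dummy.xml; the real file's layout may differ.
xml_txt <- '<Nominations>
  <Nomination>
    <DateCreated>2007-11-25T17:01:32</DateCreated>
    <Creator referenceNumber="15351">
      <UserAccountName>mkolker</UserAccountName>
      <PersonName>Merryn Kolker</PersonName>
    </Creator>
    <AdditionalComment>Good work</AdditionalComment>
  </Nomination>
  <Nomination>
    <DateCreated>2007-11-25T17:18:01</DateCreated>
    <Creator referenceNumber="15351">
      <UserAccountName>mkolker</UserAccountName>
      <PersonName>Merryn Kolker</PersonName>
    </Creator>
    <AdditionalComment>Nicely Performed</AdditionalComment>
    <NominationParcel referenceNumber="10"/>
  </Nomination>
</Nominations>'

doc  <- read_xml(xml_txt)
noms <- xml_find_all(doc, "//Nomination")

# One XPath per text column, relative to each <Nomination> node.
paths <- c(DateCreated             = "./DateCreated",
           Creator.UserAccountName = "./Creator/UserAccountName",
           Creator.PersonName      = "./Creator/PersonName",
           AdditionalComment       = "./AdditionalComment")

# xml_find_first() yields one (possibly missing) node per record, so every
# column has exactly one value per <Nomination>.
df <- as.data.frame(
  lapply(paths, function(p) xml_text(xml_find_first(noms, p))),
  stringsAsFactors = FALSE
)

# Attributes are read with xml_attr(); a missing parent node gives NA.
df$Creator.referenceNumber <-
  xml_attr(xml_find_first(noms, "./Creator"), "referenceNumber")
df$NominationParcel.referenceNumber <-
  xml_attr(xml_find_first(noms, "./NominationParcel"), "referenceNumber")
```

The resulting data frame could then be written out with `write.csv(df, "dummy.csv", row.names = FALSE)`; the remaining columns from the required output would need their own XPath entries once the real structure is known.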