Juliet Hannah
2011-Aug-31 17:51 UTC
[R] formatting a 6 million row data set; creating a censoring variable
List, Consider the following data. gender mygroup id 1 F A 1 2 F B 2 3 F B 2 4 F B 2 5 F C 2 6 F C 2 7 F C 2 8 F D 2 9 F D 2 10 F D 2 11 F D 2 12 F D 2 13 F D 2 14 M A 3 15 M A 3 16 M A 3 17 M B 3 18 M B 3 19 M B 3 20 M A 4 Here is the reshaping I am seeking (explanation below). id mygroup mytime censor [1,] 1 A 1 1 [2,] 2 B 3 0 [3,] 2 C 3 0 [4,] 2 D 6 1 [5,] 3 A 3 0 [6,] 3 B 3 1 [7,] 4 A 1 1 I need to create 2 variables. The first one is a time variable. Observe that for id=2, the variable mygroup=B was observed 3 times. In the solution we see in row 2 that id=2 has a mytime variable of 3. Next, I need to create a censoring variable. Notice id=2 goes through has values of B, C, D for mygroup. This means the change from B to C and C to D is observed. There is no change from D. I need to indicate this with a 'censoring' variable. So B and C would have values 0, and D would have a value of 1. As another example, id=1 never changes, so I assign it censor= 1. Overall, if a change is observed, 0 should be assigned, and if a change is not observed 1 should be assigned. One potential challenge is that the original data set has over 5 million rows. I have ideas, but I'm still getting used the the data.table and plyr syntax. I also seek a base R solution. I'll post the timings on the real data set shortly. Thanks for your help.> sessionInfo()R version 2.13.1 (2011-07-08) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base # Here is a simplified data set myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M" ), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup", "id"), class = "data.frame", row.names = c(NA, -20L)) # here is plyr solution with idata.frame library(plyr) imyData <- idata.frame(myData) timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize, mytime = length(mygroup))) makeCensor <- function(x) { myvec <- rep(0,length(x)) lastInd <- length(myvec) myvec[lastInd] = 1 myvec } plyrSolution <- ddply(timeData, "id", transform, censor = makeCensor(mygroup)) # here is a data table solution # use makeCensor function from above library(data.table) mydt <- data.table(myData) setkey(mydt,id,mygroup) timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)] makeCensor <- function(x) { myvec <- rep(0,length(x)) lastInd <- length(myvec) myvec[lastInd] = 1 myvec } mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id] datatableSolution <- cbind(timeData,mycensor[,list(censor)])
William Dunlap
2011-Aug-31 19:32 UTC
[R] formatting a 6 million row data set; creating a censoring variable
I'll assume that all of an individual's data rows are contiguous and that an individual always passes through the groups in order (or, least, the individual never leaves a group and then reenters it), so we can find everything we need to know by comparing each row with the previous row. You can use rle() to quickly make the time column: > rle(paste(d$mygroup, d$id))$lengths [1] 1 3 3 6 3 3 1 For the censor column it is probably easiest to consider what rle() must do internally and use a modification of that. E.g., isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)]) isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE) outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id) output <- d[outputRows, ] output$mytime <- diff(c(0, which(outputRows))) output$censor <- as.integer(isLastInRun(e$id)) which gives you > output gender mygroup id mytimes censor 1 F A 1 1 1 4 F B 2 3 0 7 F C 2 3 0 13 F D 2 6 1 16 M A 3 3 0 19 M B 3 3 1 20 M A 4 1 1 You showed a rearrangment of the columns > output[, c("id", "mygroup", "mytime", "censor")] id mygroup mytime censor 1 1 A 1 1 4 2 B 3 0 7 2 C 3 0 13 2 D 6 1 16 3 A 3 0 19 3 B 3 1 20 4 A 1 1 This ought to be quicker than plyr, but data.table may do similar run-oriented operations. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Juliet Hannah > Sent: Wednesday, August 31, 2011 10:51 AM > To: r-help at r-project.org > Subject: [R] formatting a 6 million row data set; creating a censoring variable > > List, > > Consider the following data. > > gender mygroup id > 1 F A 1 > 2 F B 2 > 3 F B 2 > 4 F B 2 > 5 F C 2 > 6 F C 2 > 7 F C 2 > 8 F D 2 > 9 F D 2 > 10 F D 2 > 11 F D 2 > 12 F D 2 > 13 F D 2 > 14 M A 3 > 15 M A 3 > 16 M A 3 > 17 M B 3 > 18 M B 3 > 19 M B 3 > 20 M A 4 > > Here is the reshaping I am seeking (explanation below). > > id mygroup mytime censor > [1,] 1 A 1 1 > [2,] 2 B 3 0 > [3,] 2 C 3 0 > [4,] 2 D 6 1 > [5,] 3 A 3 0 > [6,] 3 B 3 1 > [7,] 4 A 1 1 > > I need to create 2 variables. The first one is a time variable. > Observe that for id=2, the variable mygroup=B was observed 3 times. In > the solution we see in row 2 that id=2 has a mytime variable of 3. > > Next, I need to create a censoring variable. > > Notice id=2 goes through has values of B, C, D for mygroup. This means > the change from B to C and C to D is observed. There is no change > from D. I need to indicate this with a 'censoring' variable. So B and > C would have values 0, and D would have a value of 1. As another > example, id=1 never changes, so I assign it censor= 1. Overall, if a > change is observed, 0 should be assigned, and if a change is not > observed 1 should be assigned. > > One potential challenge is that the original data set has over 5 > million rows. I have ideas, but I'm still getting used the the > data.table and plyr syntax. I also seek a base R solution. I'll post > the timings on the real data set shortly. > > Thanks for your help. > > > sessionInfo() > R version 2.13.1 (2011-07-08) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > # Here is a simplified data set > > myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F", > "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M" > ), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D", > "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1", > "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3", > "3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup", > "id"), class = "data.frame", row.names = c(NA, -20L)) > > > # here is plyr solution with idata.frame > > library(plyr) > imyData <- idata.frame(myData) > timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize, > mytime = length(mygroup))) > > makeCensor <- function(x) { > myvec <- rep(0,length(x)) > lastInd <- length(myvec) > myvec[lastInd] = 1 > myvec > } > > > plyrSolution <- ddply(timeData, "id", transform, censor = makeCensor(mygroup)) > > > # here is a data table solution > # use makeCensor function from above > > library(data.table) > mydt <- data.table(myData) > setkey(mydt,id,mygroup) > > timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)] > makeCensor <- function(x) { > myvec <- rep(0,length(x)) > lastInd <- length(myvec) > myvec[lastInd] = 1 > myvec > } > > mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id] > datatableSolution <- cbind(timeData,mycensor[,list(censor)]) > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.