thr3ads.net - R help - [R] formatting a 6 million row data set; creating a censoring variable [Aug 2011]

If this information is useful, please help other people find it:
Share via:

Juliet Hannah

2011-Aug-31 17:51 UTC

[R] formatting a 6 million row data set; creating a censoring variable

List,

Consider the following data.

   gender mygroup id
1       F       A  1
2       F       B  2
3       F       B  2
4       F       B  2
5       F       C  2
6       F       C  2
7       F       C  2
8       F       D  2
9       F       D  2
10      F       D  2
11      F       D  2
12      F       D  2
13      F       D  2
14      M       A  3
15      M       A  3
16      M       A  3
17      M       B  3
18      M       B  3
19      M       B  3
20      M       A  4

Here is the reshaping I am seeking (explanation below).

     id mygroup mytime censor
[1,]  1       A      1      1
[2,]  2       B      3      0
[3,]  2       C      3      0
[4,]  2       D      6      1
[5,]  3       A      3      0
[6,]  3       B      3      1
[7,]  4       A      1      1

I need to create 2 variables. The first one is a time variable.
Observe that for id=2, the variable mygroup=B was observed 3 times. In
the solution we see in row 2 that id=2 has a mytime variable of 3.

Next, I need to create a censoring variable.

Notice id=2 goes through has values of B, C, D for mygroup. This means
the change from B to C and C to D is observed.  There is no change
from D. I need to indicate this with a 'censoring' variable. So B and
C would have values 0, and D would have a value of 1. As another
example, id=1 never changes, so I assign it  censor= 1. Overall, if a
change is observed, 0 should be assigned, and if a change is not
observed 1 should be assigned.

One potential challenge is that the original data set has over 5
million rows. I have ideas, but I'm still getting used the the
data.table and plyr syntax.  I also seek a base R solution. I'll post
the timings on the real data set shortly.

Thanks for your help.
> sessionInfo()R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

# Here is a simplified data set

myData <- structure(list(gender = c("F", "F",
"F", "F", "F", "F", "F",
"F", "F", "F", "F", "F",
"F", "M", "M", "M", "M",
"M", "M", "M"
), mygroup = c("A", "B", "B", "B",
"C", "C", "C", "D", "D",
"D",
"D", "D", "D", "A", "A",
"A", "B", "B", "B", "A"), id =
c("1",
"2", "2", "2", "2", "2",
"2", "2", "2", "2", "2",
"2", "2", "3",
"3", "3", "3", "3", "3",
"4")), .Names = c("gender", "mygroup",
"id"), class = "data.frame", row.names = c(NA, -20L))


# here is plyr solution with  idata.frame

library(plyr)
imyData <-  idata.frame(myData)
timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize,
mytime = length(mygroup)))

makeCensor <- function(x) {
   myvec <- rep(0,length(x))
   lastInd <- length(myvec)
   myvec[lastInd] = 1
   myvec
}


plyrSolution <- ddply(timeData, "id", transform, censor =
makeCensor(mygroup))


# here is a data table solution
# use makeCensor function from above

library(data.table)
mydt <- data.table(myData)
setkey(mydt,id,mygroup)

timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)]
makeCensor <- function(x) {
   myvec <- rep(0,length(x))
   lastInd <- length(myvec)
   myvec[lastInd] = 1
   myvec
}

mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id]
datatableSolution <- cbind(timeData,mycensor[,list(censor)])

William Dunlap

2011-Aug-31 19:32 UTC

head link

[R] formatting a 6 million row data set; creating a censoring variable

I'll assume that all of an individual's data rows
are contiguous and that an individual always passes through
the groups in order (or, least, the individual
never leaves a group and then reenters it), so we
can find everything we need to know by comparing each
row with the previous row.

You can use rle() to quickly make the time
column: 
  > rle(paste(d$mygroup, d$id))$lengths
  [1] 1 3 3 6 3 3 1

For the censor column it is probably easiest to consider
what rle() must do internally and use a modification of that.
E.g.,
  isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
  isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
  outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id)
  output <- d[outputRows, ]
  output$mytime <- diff(c(0, which(outputRows)))
  output$censor <- as.integer(isLastInRun(e$id))
which gives you
  > output
     gender mygroup id mytimes censor
  1       F       A  1       1      1
  4       F       B  2       3      0
  7       F       C  2       3      0
  13      F       D  2       6      1
  16      M       A  3       3      0
  19      M       B  3       3      1
  20      M       A  4       1      1
You showed a rearrangment of the columns
  > output[, c("id", "mygroup", "mytime",
"censor")]
     id mygroup mytime censor
  1   1       A      1      1
  4   2       B      3      0
  7   2       C      3      0
  13  2       D      6      1
  16  3       A      3      0
  19  3       B      3      1
  20  4       A      1      1
This ought to be quicker than plyr, but data.table
may do similar run-oriented operations.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at
r-project.org] On Behalf Of Juliet Hannah
> Sent: Wednesday, August 31, 2011 10:51 AM
> To: r-help at r-project.org
> Subject: [R] formatting a 6 million row data set; creating a censoring
variable
> 
> List,
> 
> Consider the following data.
> 
>    gender mygroup id
> 1       F       A  1
> 2       F       B  2
> 3       F       B  2
> 4       F       B  2
> 5       F       C  2
> 6       F       C  2
> 7       F       C  2
> 8       F       D  2
> 9       F       D  2
> 10      F       D  2
> 11      F       D  2
> 12      F       D  2
> 13      F       D  2
> 14      M       A  3
> 15      M       A  3
> 16      M       A  3
> 17      M       B  3
> 18      M       B  3
> 19      M       B  3
> 20      M       A  4
> 
> Here is the reshaping I am seeking (explanation below).
> 
>      id mygroup mytime censor
> [1,]  1       A      1      1
> [2,]  2       B      3      0
> [3,]  2       C      3      0
> [4,]  2       D      6      1
> [5,]  3       A      3      0
> [6,]  3       B      3      1
> [7,]  4       A      1      1
> 
> I need to create 2 variables. The first one is a time variable.
> Observe that for id=2, the variable mygroup=B was observed 3 times. In
> the solution we see in row 2 that id=2 has a mytime variable of 3.
> 
> Next, I need to create a censoring variable.
> 
> Notice id=2 goes through has values of B, C, D for mygroup. This means
> the change from B to C and C to D is observed.  There is no change
> from D. I need to indicate this with a 'censoring' variable. So B
and
> C would have values 0, and D would have a value of 1. As another
> example, id=1 never changes, so I assign it  censor= 1. Overall, if a
> change is observed, 0 should be assigned, and if a change is not
> observed 1 should be assigned.
> 
> One potential challenge is that the original data set has over 5
> million rows. I have ideas, but I'm still getting used the the
> data.table and plyr syntax.  I also seek a base R solution. I'll post
> the timings on the real data set shortly.
> 
> Thanks for your help.
> 
> > sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> # Here is a simplified data set
> 
> myData <- structure(list(gender = c("F", "F",
"F", "F", "F", "F", "F",
> "F", "F", "F", "F", "F",
"F", "M", "M", "M", "M",
"M", "M", "M"
> ), mygroup = c("A", "B", "B", "B",
"C", "C", "C", "D", "D",
"D",
> "D", "D", "D", "A", "A",
"A", "B", "B", "B", "A"), id =
c("1",
> "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2",
"2", "2", "3",
> "3", "3", "3", "3", "3",
"4")), .Names = c("gender", "mygroup",
> "id"), class = "data.frame", row.names = c(NA, -20L))
> 
> 
> # here is plyr solution with  idata.frame
> 
> library(plyr)
> imyData <-  idata.frame(myData)
> timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize,
> mytime = length(mygroup)))
> 
> makeCensor <- function(x) {
>    myvec <- rep(0,length(x))
>    lastInd <- length(myvec)
>    myvec[lastInd] = 1
>    myvec
> }
> 
> 
> plyrSolution <- ddply(timeData, "id", transform, censor =
makeCensor(mygroup))
> 
> 
> # here is a data table solution
> # use makeCensor function from above
> 
> library(data.table)
> mydt <- data.table(myData)
> setkey(mydt,id,mygroup)
> 
> timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)]
> makeCensor <- function(x) {
>    myvec <- rep(0,length(x))
>    lastInd <- length(myvec)
>    myvec[lastInd] = 1
>    myvec
> }
> 
> mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id]
> datatableSolution <- cbind(timeData,mycensor[,list(censor)])
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Reasonably Related Threads

Search for more maybe matching threads

R help - Aug 2011 - formatting a 6 million row data set; creating a censoring variable

[R] formatting a 6 million row data set; creating a censoring variable

[R] formatting a 6 million row data set; creating a censoring variable

Reasonably Related Threads