thr3ads.net - R help - [R] Generating a "conditional time" variable [May 2009]

Vincent Arel-Bundock

2009-May-09 17:40 UTC

[R] Generating a "conditional time" variable

Hi everyone,

Please forgive me if my question is simple and my code terrible, I'm new to
R. I am not looking for a ready-made answer, but I would really appreciate
it if someone could share conceptual hints for programming, or point me
toward an R function/package that could speed up my processing time.

Thanks a lot for your help!

##

My data frame includes the variables 'year', 'id', and 'eif', and has
about 1.9 million id-year observations.

I would like to do 2 things:

-1- I want to create a 'conditional_time' variable, which increases in
increments of 1 every year, but which resets during year(t) if event 'eif'
occurred for this 'id' at year(t-1). It should also reset when we switch to
a new 'id'. For example:

dataframe = test
 year        id         eif  conditional_time

1990       1010          0    1
1991       1010          0    2
1992       1010          1    3
1993       1010          0    1
1994       1010          0    2
1995       1010          0    3
1996       1010          0    4
1997       1010          1    5
1998       1010          0    1
1999       1010          0    2
2000       1010          0    3
2001       1010          0    4
2002       1010          0    5
2003       1010          0    6
1990       2010          0    1
1991       2010          0    2
1992       2010          0    3
1993       2010          0    4
1994       2010          0    5
1995       2010          0    6
1996       2010          0    7
1997       2010          0    8
1998       2010          0    9
1999       2010          0    10
2000       2010          0    11
2001       2010          1    12
2002       2010          0    1
2003       2010          0    2

-2- In a copy of the original dataframe, drop all id-year rows that
correspond to years after a given id has experienced his first 'eif'
event.

I have written the code below to take care of -1-, but it is incredibly
inefficient. Given the size of my database, and considering how slow my
computer is, I don't think it's practical to use it. Also, it depends on
correct sorting of the dataframe, which might generate errors.

##

for (i in 1:nrow(test)) {
    if (i == 1) {                            # If first id-year
        cond_time <- 1
        test[i, 4] <- cond_time

    } else if (test[i-1, 2] != test[i, 2]) { # If new id (id is column 2)
        cond_time <- 1
        test[i, 4] <- cond_time
    } else {                                 # Same id as previous row
        if (test[i, 3] == 0) {
            test[i, 4] <- sum(cond_time, 1)
            cond_time <- test[i, 4]          # carry the count forward
        } else {
            test[i, 4] <- sum(cond_time, 1)
            cond_time <- 0                   # event: next row restarts at 1
        }
    }
}

-- 
Vincent Arel
M.A. Student, McGill University

Finak Greg

2009-May-09 19:11 UTC

[R] Generating a "conditional time" variable

Assuming the year column has complete data and doesn't skip a year, the
following should take care of 1)

# Simulated data frame: years 1990 to 2003 for 5 different ids, each having
# one or two eif "events"
years <- 1990:2003
test <- data.frame(year = rep(years, 5),
                   id   = gl(5, length(years)),
                   eif  = as.vector(sapply(1:5, function(z) {
                       a <- rep(0, length(years))
                       a[sample(seq_along(years), sample(1:2, 1))] <- 1
                       a
                   })))

# Generate the "cond_time" column: count up within each id and restart at 1
# on the row after an event.
test <- do.call("rbind", lapply(split(test, test$id), function(z) {
    s <- 0
    data.frame(z, cond_time = sapply(z$eif, function(e) {
        s <<- s + 1            # advance the counter for this row
        out <- s
        if (e == 1) s <<- 0    # event: the next row restarts at 1
        out
    }))
}))

sapply, lapply, and apply are usually tidier than "for" loops, and often
faster than a loop that modifies a data frame one cell at a time.
split() will split your data frame by the $id column (second argument). lapply()
loops through the resulting list and generates the cond_time variable,
incrementing the count on every row and restarting it on the row after
eif==1, much as you have in your code.
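
For 1.9 million rows, a fully vectorized variant may be worth trying as well. The following is only a sketch under stated assumptions, not the method posted above: it uses base R's ave() on a small made-up data frame, and the helper names prev_eif and spell are invented for illustration. It assumes that sorting the rows by id and year is acceptable.

```r
# Made-up example data: two ids, six years each
test <- data.frame(
    year = rep(1990:1995, 2),
    id   = rep(c(1010, 2010), each = 6),
    eif  = c(0, 0, 1, 0, 0, 0,   0, 1, 0, 0, 1, 0)
)

# Make the row order explicit instead of relying on it
test <- test[order(test$id, test$year), ]

# prev_eif is 1 on the row that follows an event within the same id
prev_eif <- ave(test$eif, test$id, FUN = function(e) c(0, head(e, -1)))

# spell numbers the stretches between events within each id
spell <- ave(prev_eif, test$id, FUN = cumsum)

# Count rows within each (id, spell) combination
test$conditional_time <- ave(test$eif, test$id, spell, FUN = seq_along)

test$conditional_time
# 1 2 3 1 2 3 1 2 1 2 3 1
```

The idea is the same as in the split()/lapply() version: every event closes a spell, and the counter is simply the row position inside the current spell.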


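If I understand 2) correctly, the following should do the trick:
# Build the reduced copy; the original data frame is left untouched
test2 <- do.call("rbind", lapply(split(test, test$id), function(z) {
    first <- which(z$eif == 1)[1]    # row of the first event, NA if none
    if (is.na(first)) z else z[seq_len(first), ]
}))

Similar to the former, but subsetting the rows of the data frame up to and
including the first event, for each id. An id with no event keeps all its rows.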
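
The same filter can also be written without split(). This is a hedged sketch on made-up data, assuming rows can be sorted by id and year; the name events_before is invented for illustration.

```r
# Made-up example data: id 3010 never experiences an event
test <- data.frame(
    year = rep(1990:1993, 3),
    id   = rep(c(1010, 2010, 3010), each = 4),
    eif  = c(0, 1, 0, 1,   0, 0, 0, 1,   0, 0, 0, 0)
)
test <- test[order(test$id, test$year), ]

# Number of events that occurred in earlier years for the same id
events_before <- ave(test$eif, test$id,
                     FUN = function(e) cumsum(c(0, head(e, -1))))

# Keep each id's rows up to and including its first event
test2 <- test[events_before == 0, ]

nrow(test2)
# 10
```

Here id 1010 keeps 1990 and 1991, while ids 2010 and 3010 keep all four years.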

If the above is all you need, then 1) and 2) could be combined in a single call.

Others will likely have a different approach.

Cheers,

--
Greg Finak
Post-Doctoral Research Associate
Computational Biology Unit
Institut des Recherches Cliniques de Montreal
Montreal, QC.


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

William Dunlap

2009-May-09 21:53 UTC

[R] Generating a "conditional time" variable

You might try the following function.  First it identifies the last element in
each run, then the length of each run, then calls sequence() to generate the
within-run sequence numbers.  my.sequence is a version of sequence that is more
efficient (less time, less memory) than sequence when there are lots of short
runs (sequence() calls lapply, which makes a memory consuming list, and then
unlists it, and my.sequence avoids the big intermediate list).

For your data, f(data) produces the same thing as data$conditional_time.

f <- function(data, use.my.sequence = FALSE) {
   n <- nrow(data)
   # A run ends on an event row or on the last row of an id
   lastInRun <- with(data, eif | c(id[-1] != id[-n], TRUE))
   # Run lengths are the gaps between successive run ends
   runLengths <- diff(c(0L, which(lastInRun)))
   if (use.my.sequence) {
      my.sequence <- function(nvec)
         seq_len(sum(nvec)) - rep.int(c(0L, cumsum(nvec[-length(nvec)])), nvec)
      my.sequence(runLengths)
   } else {
      sequence(runLengths)
   }
}
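
To make the intermediate steps concrete, here is a small worked trace of the same three steps on made-up vectors (the values of id and eif are invented for illustration, and the rows are assumed to be sorted by id and year):

```r
id  <- c(1, 1, 1, 1, 1, 2, 2, 2)
eif <- c(0, 0, 1, 0, 0, 0, 1, 0)
n   <- length(id)

# Step 1: a run ends on an event row or on the last row of an id
lastInRun <- eif | c(id[-1] != id[-n], TRUE)
which(lastInRun)
# 3 5 7 8

# Step 2: run lengths are the gaps between run ends
runLengths <- diff(c(0L, which(lastInRun)))
runLengths
# 3 2 2 1

# Step 3: count 1..length inside each run
cond_time <- sequence(runLengths)
cond_time
# 1 2 3 1 2 1 2 1
```

No per-row loop or per-id split is needed, which is why this approach scales well to long data frames.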

Bill Dunlap, Spotfire Division, TIBCO Software Inc.

jim holtman

2009-May-09 22:29 UTC

[R] Generating a "conditional time" variable

Here is yet another way of doing it (always the case in R):

# Simulated data frame: years 1990 to 2003 for 5 different ids, each
# having one or two eif "events"
test<-data.frame(year=rep(1990:2003,5),id=gl(5,length(1990:2003)),
    eif=as.vector(sapply(1:5,function(z){
        a<-rep(0,length(1990:2003))
        a[sample(1:length(1990:2003),sample(1:2,1))]<-1
        a
    })))

# partition by 'id' and then by 'eif' changes
test.new <- do.call(rbind, lapply(split(test, test$id), function(.id){
    # now by 'eif' changes
    do.call(rbind, lapply(split(.id, cumsum(.id$eif)), function(.eif){
        # create new dataframe with column
        cbind(.eif, conditional_time=seq(nrow(.eif)))
    }))
}))




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


jim holtman

2009-May-09 22:34 UTC

[R] Generating a "conditional time" variable

Corrected version.  I forgot that the count had to change 'after' eif==1:

# Simulated data frame: years 1990 to 2003 for 5 different ids, each
# having one or two eif "events"
test<-data.frame(year=rep(1990:2003,5),id=gl(5,length(1990:2003)),
    eif=as.vector(sapply(1:5,function(z){
        a<-rep(0,length(1990:2003))
        a[sample(1:length(1990:2003),sample(1:2,1))]<-1
        a
    })))
# partition by 'id' and then by 'eif' changes
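test.new <- do.call(rbind, lapply(split(test, test$id), function(.id){
    # now by 'eif' changes: start a new group on the row after each event
    # (lagged eif rather than diff(), so back-to-back events also restart)
    do.call(rbind, lapply(split(.id, cumsum(c(0, head(.id$eif, -1)))),
                          function(.eif){
        cbind(.eif, conditional_time=seq(nrow(.eif)))
    }))
}))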
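
To see how a cumulative-sum key partitions one id's rows into spells, the sketch below builds two candidate keys on a made-up eif vector: one from diff(eif) == -1 and one from lagged eif. The names key_diff, key_lag, time_diff, and time_lag are invented for illustration. The two keys agree unless events fall in consecutive years, in which case only the lagged key restarts the count after each event.

```r
# Made-up eif values for a single id, with events in two consecutive years
eif <- c(0, 1, 1, 0, 0)

# Key that changes where eif drops from 1 to 0
key_diff <- cumsum(c(0, diff(eif) == -1))
key_diff
# 0 0 0 1 1

# Key that changes on every row following an event
key_lag <- cumsum(c(0, head(eif, -1)))
key_lag
# 0 0 1 2 2

# Row position within each group gives the counter
time_diff <- ave(eif, key_diff, FUN = seq_along)
time_lag  <- ave(eif, key_lag,  FUN = seq_along)

time_diff
# 1 2 3 1 2
time_lag
# 1 2 1 1 2
```

Under the rule "reset in year t if an event occurred in year t-1", the lagged key gives the intended result for this vector.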




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

