I have the following data

ID   x   y  time
 1  10  20     0
 1  10  30     1
 1  10  40     2
 2  12  23     0
 2  12  25     1
 2  12  28     2
 2  12  38     3
 3   5  10     0
 3   5  15     2
.....

x is time invariant, ID is the subject id number, y is changing over time.

I want to find out the difference between the first and last observed y value for each subject and get a table like

ID   x   y
 1  10  20
 2  12  15
 3   5   5
......

Is there any easy way to generate the data set?
Hi

r-help-bounces at r-project.org wrote on 02.01.2009 10:20:23:

> I want to find out the difference between the first and last observed y
> value for each subject and get a table like

    sapply(split(test$y, test$ID), function(x) tail(x, 1) - head(x, 1))

I am leaving the formatting of the resulting table to you. Hint: aggregate

Best regards
Petr
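One way the hint could be followed up is sketched below. This is only an illustration, not Petr's own code: the data frame is typed in from the example in the question, the name `test` follows his reply, and the names `ydiff` and `result` are made up here. It assumes that "first" and "last" mean smallest and largest time within each subject.

```r
# Example data from the original post; the name `test` follows the reply above.
test <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)

# Sort by subject and time so that head()/tail() really pick the first
# and last observation of each subject.
test <- test[order(test$ID, test$time), ]

# Last y minus first y per subject (a named vector, names are the IDs).
ydiff <- sapply(split(test$y, test$ID), function(v) tail(v, 1) - head(v, 1))

# One row per subject for the time-invariant x, then attach the difference.
result <- aggregate(x ~ ID, data = test, FUN = function(v) v[1])
result$y <- unname(ydiff[as.character(result$ID)])
result
```

Matching `ydiff` to `result` by name rather than by position avoids relying on both steps returning the subjects in the same order.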
Carlos J. Gil Bellosta
2009-Jan-02 10:00 UTC
[R] the first and last observation for each subject
Hello,

First, order your data by ID and time.

The columns you want in your output data frame are then

    unique(ID),
    tapply(x, ID, function(z) z[1])

and

    tapply(y, ID, function(z) z[length(z)] - z[1])

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com
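Put together, those pieces might look like the sketch below. The data frame name `mydata` and the result name `out` are hypothetical, and the rows are deliberately reversed first to show why the ordering step matters. After sorting, unique(ID) and the tapply() results come out in the same (ascending ID) order, which is what makes it safe to bind them side by side.

```r
# Example data from the question, with rows reversed to simulate unsorted input.
mydata <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)
mydata <- mydata[nrow(mydata):1, ]

# Step 1: order by ID and time.
mydata <- mydata[order(mydata$ID, mydata$time), ]

# Step 2: build the three columns and combine them.
out <- with(mydata, data.frame(
  ID = unique(ID),
  x  = as.vector(tapply(x, ID, function(z) z[1])),
  y  = as.vector(tapply(y, ID, function(z) z[length(z)] - z[1]))
))
out
```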
Jorge Ivan Velez
2009-Jan-02 10:47 UTC
[R] the first and last observation for each subject
Dear Gallon,

Assuming that your data is called "mydata", something like this should do the job:

    newdf <- data.frame(
      ID = unique(mydata$ID),
      # note: unique() on x only lines up with ID if no two subjects share an x value
      x = unique(mydata$x),
      y = with(mydata, tapply(y, ID, function(m) tail(m, 1) - head(m, 1)))
    )
    newdf

HTH,

Jorge
Gabor Grothendieck
2009-Jan-02 10:56 UTC
[R] the first and last observation for each subject
Try this:

    > Lines <- "ID x y time
    + 1 10 20 0
    + 1 10 30 1
    + 1 10 40 2
    + 2 12 23 0
    + 2 12 25 1
    + 2 12 28 2
    + 2 12 38 3
    + 3 5 10 0
    + 3 5 15 2"
    > DF <- read.table(textConnection(Lines), header = TRUE)
    > aggregate(DF[3], DF[1:2], function(x) tail(x, 1) - head(x, 1))
      ID  x  y
    1  3  5  5
    2  1 10 20
    3  2 12 15
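For readers who want to paste this into a script rather than a console, a sketch of the same steps follows. It swaps textConnection() for the `text` argument of read.table(), which current versions of R provide, and adds an explicit sort by ID, since the row order of the aggregate() result need not follow ID. The name `res` is made up for this example.

```r
Lines <- "ID x y time
1 10 20 0
1 10 30 1
1 10 40 2
2 12 23 0
2 12 25 1
2 12 28 2
2 12 38 3
3 5 10 0
3 5 15 2"

# Read the inline text directly into a data frame.
DF <- read.table(text = Lines, header = TRUE)

# Group by ID and x, and take last y minus first y within each group.
res <- aggregate(DF[3], DF[1:2], function(x) tail(x, 1) - head(x, 1))

# Put the rows in ID order, whatever order aggregate() returned them in.
res <- res[order(res$ID), ]
res
```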
One approach is to use the plyr package, as documented at http://had.co.nz/plyr. The basic idea is that your problem is easy to solve if you have a subset for a single subject value:

    one <- subset(DF, ID == 1)
    with(one, y[length(y)] - y[1])

The difficulty is splitting up the original dataset into subjects, applying the solution to each piece and then joining all the results back together. This is what the plyr package does for you:

    library(plyr)

    # ddply is for splitting up data frames and combining the results
    # into a data frame. .(ID) says to split up the data frame by the subject
    # variable
    ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))

    # if you want a more informative variable name in the result
    # return a named vector:
    ddply(DF, .(ID), function(one) c(diff = with(one, y[length(y)] - y[1])))

    # plyr takes care of labelling the result for you.

You don't say why you want to include x, or what to do if x is not invariant, but here are a couple of options:

    # Split up by ID and x
    ddply(DF, .(ID, x), function(one) c(diff = with(one, y[length(y)] - y[1])))

    # Return the first x value
    ddply(DF, .(ID), function(one) {
      with(one, c(
        x = x[1],
        diff = y[length(y)] - y[1]
      ))
    })

    # Throw an error if x is not unique
    ddply(DF, .(ID), function(one) {
      stopifnot(length(unique(one$x)) == 1)
      with(one, c(
        x = x[1],
        diff = y[length(y)] - y[1]
      ))
    })

Regards,

Hadley

--
http://had.co.nz/
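To make the split, apply and combine steps concrete without installing anything, here is a rough base R sketch of the same idea. It is not how plyr is implemented, only an illustration of the three steps; the names `pieces`, `per_subject` and `result` are made up for this example, and DF is the example data from the question.

```r
DF <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)

# Split: one data frame per subject.
pieces <- split(DF, DF$ID)

# Apply: solve the single-subject problem on each piece.
per_subject <- lapply(pieces, function(one) {
  stopifnot(length(unique(one$x)) == 1)  # x must be invariant within a subject
  data.frame(ID = one$ID[1], x = one$x[1],
             diff = one$y[nrow(one)] - one$y[1])
})

# Combine: stack the one-row results back into a single data frame.
result <- do.call(rbind, per_subject)
rownames(result) <- NULL
result
```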
Stavros Macrakis
2009-Jan-02 18:16 UTC
[R] the first and last observation for each subject
I think there's a pretty simple solution here, though probably not the most efficient:

    t(sapply(split(a, a$ID),
             function(q) with(q, c(ID = unique(ID), x = unique(x), y = max(y) - min(y)))))

Using 'unique' instead of min or [[1]] has the advantage that if x is in fact not time-invariant, this gives an error rather than silently ignoring inconsistencies.

Trying to package up this idiom into a function leads to:

    select <- function(df, groupby, selection) {
      pf <- parent.frame()
      fields <- substitute(selection)
      t(sapply(split(df, eval(substitute(groupby), df, enclos = pf)),
               function(q) eval(fields, q, enclos = pf)))
    }

which I admit is rather ugly (and does no error-checking), but it does work:

    > select(a,ID,list(min(ID),unique(x),max(y)-min(y)))
      [,1] [,2] [,3]
    1 1    10   20
    2 2    12   15
    3 3    5    5

Perhaps some of the more experienced people on the list could show me how to write this more cleanly.

-s
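One thing worth keeping in mind with max(y) - min(y): it matches "last minus first" on the example data only because y increases over time within every subject there. A small sketch with a made-up subject whose y rises and then falls shows the two quantities diverging; the vector and variable names are hypothetical.

```r
# Hypothetical subject: y goes up and then comes back down.
y <- c(10, 30, 15)

range_diff       <- max(y) - min(y)        # spread of the values
last_minus_first <- y[length(y)] - y[1]    # change from first to last observation

c(range_diff = range_diff, last_minus_first = last_minus_first)
```

If y is not guaranteed to be monotone, the y[length(y)] - y[1] form (on data ordered by time) is the one that answers the original question.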
> [R] the first and last observation for each subject
> hadley wickham h.wickham at gmail.com
> Fri Jan 2 14:52:42 CET 2009
>
> library(plyr)
>
> # ddply is for splitting up data frames and combining the results
> # into a data frame. .(ID) says to split up the data frame by the subject
> # variable
> ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
> ...

The above is much quicker than the versions based on aggregate and easy to understand. Another approach is more specialized but useful when you have lots of IDs (e.g., millions) and speed is very important. It computes where the first and last entries for each ID are in a vectorized computation, akin to the computation that rle() uses:

    f0 <- function(DF){
      # assumes the rows of each ID are contiguous and ordered by time
      changes <- DF$ID[-1] != DF$ID[-length(DF$ID)]
      first <- c(TRUE, changes)
      last <- c(changes, TRUE)
      ydiff <- DF$y[last] - DF$y[first]
      DF <- DF[first,]
      DF$y <- ydiff
      DF
    }

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
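To see what the change-detection trick does, here is a small sketch that walks through it on the ID and y columns of the example data. The vectors are typed in by hand for illustration, and the approach relies on all rows of a subject sitting next to each other, already ordered by time.

```r
ID <- c(1, 1, 1, 2, 2, 2, 2, 3, 3)
y  <- c(20, 30, 40, 23, 25, 28, 38, 10, 15)

# TRUE wherever the ID differs from the one on the previous row.
changes <- ID[-1] != ID[-length(ID)]

# A row starts a subject if it is row 1 or follows a change;
# it ends a subject if it precedes a change or is the final row.
first <- c(TRUE, changes)
last  <- c(changes, TRUE)

which(first)   # rows where each subject begins
which(last)    # rows where each subject ends

# Last y minus first y, one value per subject, with no explicit loop.
ydiff <- y[last] - y[first]
ydiff
```

Because `first` and `last` each contain exactly one TRUE per subject, the two subsets of y line up element by element.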