Suppose I have a dataframe that looks like the following: n=2 mydata <- data.frame(exp = rep(1:5,each=n), rslt = c(12,15,7,8,24,28,33,15,22,11)) mydata exp rslt 1 1 12 2 1 15 3 2 7 4 2 8 5 3 24 6 3 28 7 4 33 8 4 15 9 5 22 10 5 11 The variable 'exp' (for experiment') occurs in pairs over consecutive rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is the 'control', and the second is a 'treatment'. The rslt column is the result. What I'm trying to do is create a subset of this dataframe that consists of the exp number, and the lagged difference between the 'control' and 'treatment' result. So, for exp=1, the difference is (15-12)=3. For exp=2, the difference is (8-7)=1, and so on. What I'm hoping to do is take mydata (above), and turn it into exp diff 1 1 3 2 2 1 3 3 4 4 4 -18 5 5 -11 The basic 'trick' I can't figure out is how to create a lagged variable between the second row (record) for a given level of exp, and the first row for that exp. This is easy to do in SAS (which I'm more familiar with), but I'm struggling with the equivalent in R. The brute force approach I thought of is to simply split the dataframe into to (one even rows, one odd rows), merge by exp, and then calculate a difference. But this seems to require renaming the rslt column in the two new dataframes so they are different in the merge (say, rslt_cont n the odd dataframe, and rslt_trt in the even dataframe), allowing me to calculate a difference between the two. While I suppose this would work, I'm wondering if I'm missing a more elegant 'in place' approach that doesn't require me to split the data frame and do every via a merge. Suggestions/pointers to the obvious welcome. I've tried playing with lag, and some approaches using lag in the zoo package, but haven't found the magic trick. The problem (meaning, what I can't figure out) seems to be conditioning the lag on the level of exp. Many thanks... mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y = c(6,17,26,37,44)) [[alternative HTML version deleted]]
Ulrik Stervbo
2017-Mar-17 16:58 UTC
[R] lagging over consecutive pairs of rows in dataframe
Hi Evan you can easily do this by applying diff() to each exp group. Either using dplyr: library(dplyr) mydata %>% group_by(exp) %>% summarise(difference = diff(rslt)) Or with base R aggregate(mydata, by = list(group = mydata$exp), FUN = diff) HTH Ulrik On Fri, 17 Mar 2017 at 17:34 Evan Cooch <evan.cooch at gmail.com> wrote:> Suppose I have a dataframe that looks like the following: > > n=2 > mydata <- data.frame(exp = rep(1:5,each=n), rslt > c(12,15,7,8,24,28,33,15,22,11)) > mydata > exp rslt > 1 1 12 > 2 1 15 > 3 2 7 > 4 2 8 > 5 3 24 > 6 3 28 > 7 4 33 > 8 4 15 > 9 5 22 > 10 5 11 > > The variable 'exp' (for experiment') occurs in pairs over consecutive > rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is > the 'control', and the second is a 'treatment'. The rslt column is the > result. > > What I'm trying to do is create a subset of this dataframe that consists > of the exp number, and the lagged difference between the 'control' and > 'treatment' result. So, for exp=1, the difference is (15-12)=3. For > exp=2, the difference is (8-7)=1, and so on. What I'm hoping to do is > take mydata (above), and turn it into > > exp diff > 1 1 3 > 2 2 1 > 3 3 4 > 4 4 -18 > 5 5 -11 > > The basic 'trick' I can't figure out is how to create a lagged variable > between the second row (record) for a given level of exp, and the first > row for that exp. This is easy to do in SAS (which I'm more familiar > with), but I'm struggling with the equivalent in R. The brute force > approach I thought of is to simply split the dataframe into to (one > even rows, one odd rows), merge by exp, and then calculate a difference. > But this seems to require renaming the rslt column in the two new > dataframes so they are different in the merge (say, rslt_cont n the odd > dataframe, and rslt_trt in the even dataframe), allowing me to calculate > a difference between the two. > > While I suppose this would work, I'm wondering if I'm missing a more > elegant 'in place' approach that doesn't require me to split the data > frame and do every via a merge. > > Suggestions/pointers to the obvious welcome. I've tried playing with > lag, and some approaches using lag in the zoo package, but haven't > found the magic trick. The problem (meaning, what I can't figure out) > seems to be conditioning the lag on the level of exp. > > Many thanks... > > > mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y > c(6,17,26,37,44)) > > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
On 3/17/2017 12:58 PM, Ulrik Stervbo wrote:> Hi Evan > > you can easily do this by applying diff() to each exp group. > > Either using dplyr: > library(dplyr) > mydata %>% > group_by(exp) %>% > summarise(difference = diff(rslt)) > > Or with base R > aggregate(mydata, by = list(group = mydata$exp), FUN = diff) > >Indeed -- thanks very much! [[alternative HTML version deleted]]
Evan: You misunderstand the concept of a lagged variable. Ulrik: Well, yes, that is certainly a general solution that works. However, given the *specific* structure described by the OP, an even more direct (maybe more efficient?) way to do it just uses (logical) subscripting: odds <- (seq_len(nrow(mydata)) %% 2) == 1 newdat <-data.frame(mydata[odds,1 ],mydata[!odds,2] - mydata[odds,2]) names(newdat) <- names(mydata) Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Fri, Mar 17, 2017 at 9:58 AM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:> Hi Evan > > you can easily do this by applying diff() to each exp group. > > Either using dplyr: > library(dplyr) > mydata %>% > group_by(exp) %>% > summarise(difference = diff(rslt)) > > Or with base R > aggregate(mydata, by = list(group = mydata$exp), FUN = diff) > > HTH > Ulrik > > > On Fri, 17 Mar 2017 at 17:34 Evan Cooch <evan.cooch at gmail.com> wrote: > >> Suppose I have a dataframe that looks like the following: >> >> n=2 >> mydata <- data.frame(exp = rep(1:5,each=n), rslt >> c(12,15,7,8,24,28,33,15,22,11)) >> mydata >> exp rslt >> 1 1 12 >> 2 1 15 >> 3 2 7 >> 4 2 8 >> 5 3 24 >> 6 3 28 >> 7 4 33 >> 8 4 15 >> 9 5 22 >> 10 5 11 >> >> The variable 'exp' (for experiment') occurs in pairs over consecutive >> rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is >> the 'control', and the second is a 'treatment'. The rslt column is the >> result. >> >> What I'm trying to do is create a subset of this dataframe that consists >> of the exp number, and the lagged difference between the 'control' and >> 'treatment' result. So, for exp=1, the difference is (15-12)=3. For >> exp=2, the difference is (8-7)=1, and so on. What I'm hoping to do is >> take mydata (above), and turn it into >> >> exp diff >> 1 1 3 >> 2 2 1 >> 3 3 4 >> 4 4 -18 >> 5 5 -11 >> >> The basic 'trick' I can't figure out is how to create a lagged variable >> between the second row (record) for a given level of exp, and the first >> row for that exp. This is easy to do in SAS (which I'm more familiar >> with), but I'm struggling with the equivalent in R. The brute force >> approach I thought of is to simply split the dataframe into to (one >> even rows, one odd rows), merge by exp, and then calculate a difference. >> But this seems to require renaming the rslt column in the two new >> dataframes so they are different in the merge (say, rslt_cont n the odd >> dataframe, and rslt_trt in the even dataframe), allowing me to calculate >> a difference between the two. >> >> While I suppose this would work, I'm wondering if I'm missing a more >> elegant 'in place' approach that doesn't require me to split the data >> frame and do every via a merge. >> >> Suggestions/pointers to the obvious welcome. I've tried playing with >> lag, and some approaches using lag in the zoo package, but haven't >> found the magic trick. The problem (meaning, what I can't figure out) >> seems to be conditioning the lag on the level of exp. >> >> Many thanks... >> >> >> mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y >> c(6,17,26,37,44)) >> >> >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Mathew Guilfoyle
2017-Mar-18 16:18 UTC
[R] lagging over consecutive pairs of rows in dataframe
If you are strict about your data formatting then the following is a fast way of calculating the differences, based on reshaping the data column: A = matrix(mydata$rslt, nrow=2) data.frame(exp=1:ncol(A), diff=A[2,]-A[1,]) alternatively, if the 'exp' values are not guaranteed to be sequential you can reshape an index: A = matrix(1:nrow(mydata), nrow=2) data.frame(exp=mydata$exp[A[1,]], diff=mydata$rslt[A[2,]]-mydata$rslt[A[1,]]) However, I would suggest that you have a further variable to label 'control' and 'treatment' groups and explicitly use this for the calculation. Otherwise, if at any time you sort or reorder the data you will run into problems or produce erroneous results (but more than likely won't generate any actual R errors to alert you): data.frame(exp = unique(mydata$exp), diff = as.vector(by(mydata, mydata$exp, function(x) x$rslt[x$type=='treatment']-x$rslt[x$type=='control']))) The efficiency of the various options that have been suggested in this thread piqued my interest so a quick benchmark seemed in order (see below, including a 'safer' method). Of course, this is probably only relevant if you have huge datasets that you are repeatedly performing this calculation on. library('microbenchmark') #create some example data similar to the OP ndata = 1000 mydata = data.frame(exp = cumsum(rep(c(1,0),ndata)),rslt=sample(1:50, size = ndata*2, replace = TRUE), type=rep(c('control','treatment'),ndata)) #various suggested options diff.BG = function(mydata) { evens <- (seq_len(nrow(mydata)) %% 2) == 0 data.frame(exp = mydata[evens,1 ], diff = diff(mydata[,2])[evens[-1]]) } diff.US = function(mydata) { aggregate(mydata$rslt, by = list(group = mydata$exp), FUN = diff) } diff.MG1 = function(mydata) { A = matrix(mydata$rslt, nrow=2) data.frame(exp=1:ncol(A), diff=A[2,]-A[1,]) } diff.MG2 = function(mydata) { A = matrix(1:nrow(mydata), nrow=2) data.frame(exp=mydata$exp[A[1,]], diff=mydata$rslt[A[2,]]-mydata$rslt[A[1,]]) } diff.safe = function(mydata) { data.frame(exp = unique(mydata$exp), diff = as.vector(by(mydata, mydata$exp, function(x) x$rslt[x$type=='treatment']-x$rslt[x$type=='control']))) } #benchmark microbenchmark(BG=diff.BG(mydata), US=diff.US(mydata), MG1=diff.MG1(mydata), MG2=diff.MG2(mydata), safe=diff.safe(my data)) Unit: microseconds expr min lq mean median uq max neval BG 273.837 299.0015 351.0377 316.7400 350.5220 2385.289 100 US 9872.457 10511.1065 11555.6048 11108.0790 12471.8060 17609.518 100 MG1 168.783 196.8635 229.9329 210.9370 249.4895 471.381 100 MG2 221.303 237.0480 265.5097 254.3895 280.7815 418.728 100 safe 97869.540 104164.5130 109579.9834 107199.7715 110315.8590 170028.377 100> On 17 Mar 2017, at 14:54, Evan Cooch <evan.cooch at gmail.com> wrote: > > Suppose I have a dataframe that looks like the following: > > n=2 > mydata <- data.frame(exp = rep(1:5,each=n), rslt = > c(12,15,7,8,24,28,33,15,22,11)) > mydata > exp rslt > 1 1 12 > 2 1 15 > 3 2 7 > 4 2 8 > 5 3 24 > 6 3 28 > 7 4 33 > 8 4 15 > 9 5 22 > 10 5 11 > > The variable 'exp' (for experiment') occurs in pairs over consecutive > rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is > the 'control', and the second is a 'treatment'. The rslt column is the > result. > > What I'm trying to do is create a subset of this dataframe that consists > of the exp number, and the lagged difference between the 'control' and > 'treatment' result. So, for exp=1, the difference is (15-12)=3. For > exp=2, the difference is (8-7)=1, and so on. What I'm hoping to do is > take mydata (above), and turn it into > > exp diff > 1 1 3 > 2 2 1 > 3 3 4 > 4 4 -18 > 5 5 -11 > > The basic 'trick' I can't figure out is how to create a lagged variable > between the second row (record) for a given level of exp, and the first > row for that exp. This is easy to do in SAS (which I'm more familiar > with), but I'm struggling with the equivalent in R. The brute force > approach I thought of is to simply split the dataframe into to (one > even rows, one odd rows), merge by exp, and then calculate a difference. > But this seems to require renaming the rslt column in the two new > dataframes so they are different in the merge (say, rslt_cont n the odd > dataframe, and rslt_trt in the even dataframe), allowing me to calculate > a difference between the two. > > While I suppose this would work, I'm wondering if I'm missing a more > elegant 'in place' approach that doesn't require me to split the data > frame and do every via a merge. > > Suggestions/pointers to the obvious welcome. I've tried playing with > lag, and some approaches using lag in the zoo package, but haven't > found the magic trick. The problem (meaning, what I can't figure out) > seems to be conditioning the lag on the level of exp. > > Many thanks... > > > mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y = c(6,17,26,37,44)) > > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.