Hi, I need to perform calculations on subsets of a data frame: DF = data.frame(read.table(textConnection(" A B C D E F 1 a 1995 0 4 1 2 a 1997 1 1 3 3 b 1995 3 7 0 4 b 1996 1 2 3 5 b 1997 1 2 3 6 b 1998 6 0 0 7 b 1999 3 7 0 8 c 1997 1 2 3 9 c 1998 1 2 3 10 c 1999 6 0 0 11 d 1999 3 7 0 12 e 1995 1 2 3 13 e 1998 1 2 3 14 e 1999 6 0 0"),head=TRUE,stringsAsFactors=FALSE)) I'd like to create new dataframes for each unique year in which for each value of A, the values of D, E and F are summed over the last 3 years (e.g. 1998 = 1998, 1997, 1996): Question 1: How do I go from DF to newDFyear? Examples: newDF1995 B D E F a 0 4 1 b 3 7 0 e 1 2 3 newDF1998 B D E F a 1 1 3 b 8 4 6 c 2 4 6 e 1 2 3 Then, for each new DF I need to generate a square matrix after doing the following: newDF1998$G<-newDF1998$D + newDF1998$E + newDF1998$F newDF1998$D<-newDF1998$D/newDF1998$G newDF1998$E<-newDF1998$E/newDF1998$G newDF1998$F<-newDF1998$F/newDF1998$G newDF1998<-NewDF1998[,c(-5)] newDF1998 B D E F a 0.2 0.2 0.6 b 0.4 0.2 0.3 c 0.2 0.3 0.5 e 0.2 0.3 0.5 Question 2: How do I go from newDF1998 to a matrix a b c e a b c e in which Cell ab = (0.2*0.4 + 0.2*0.2 + 0.6*0.3)/((0.2*0.2 + 0.2*0.2 + 0.6*0.6)^0.5) * ((0.4*0.4 + 0.2*0.2 + 0.3*0.3)^0.5) = 0.84 Thanks a lot for your help! -- View this message in context: http://r.789695.n4.nabble.com/Yearly-aggregates-and-matrices-tp3438140p3438140.html Sent from the R help mailing list archive at Nabble.com.
On Sat, Apr 9, 2011 at 5:14 AM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:> Hi, > > I need to perform calculations on subsets of a data frame: > > DF = data.frame(read.table(textConnection(" ? ?A ?B ?C ?D ?E ?F > 1 a ?1995 ?0 ?4 ?1 > 2 a ?1997 ?1 ?1 ?3 > 3 b ?1995 ?3 ?7 ?0 > 4 b ?1996 ?1 ?2 ?3 > 5 b ?1997 ?1 ?2 ?3 > 6 b ?1998 ?6 ?0 ?0 > 7 b ?1999 ?3 ?7 ?0 > 8 c ?1997 ?1 ?2 ?3 > 9 c ?1998 ?1 ?2 ?3 > 10 c ?1999 ?6 ?0 ?0 > 11 d ?1999 ?3 ?7 ?0 > 12 e ?1995 ?1 ?2 ?3 > 13 e ?1998 ?1 ?2 ?3 > 14 e ?1999 ?6 ?0 ?0"),head=TRUE,stringsAsFactors=FALSE)) > > I'd like to create new dataframes for each unique year in which for each > value of A, the values of D, E and F are summed over the last 3 years (e.g. > 1998 = 1998, 1997, 1996): > Question 1: How do I go from DF to newDFyear? > > Examples: > > newDF1995 > B ?D ?E ?F > a ?0 ?4 ?1 > b ?3 ?7 ?0 > e ?1 ?2 ?3 > > newDF1998 > B ?D ?E ?F > a ?1 ?1 ?3 > b ?8 ?4 ?6 > c ?2 ?4 ?6 > e ?1 ?2 ?3 > > Then, for each new DF I need to generate a square matrix after doing the > following: > > newDF1998$G<-newDF1998$D + newDF1998$E + newDF1998$F > newDF1998$D<-newDF1998$D/newDF1998$G > newDF1998$E<-newDF1998$E/newDF1998$G > newDF1998$F<-newDF1998$F/newDF1998$G > newDF1998<-NewDF1998[,c(-5)] > > newDF1998 > B ?D ?E ?F > a ?0.2 ?0.2 ?0.6 > b ?0.4 ?0.2 ?0.3 > c ?0.2 ?0.3 ?0.5 > e ?0.2 ?0.3 ?0.5 > > Question 2: How do I go from newDF1998 to a matrix > > ?a ?b ?c ?e > a > b > c > e > > in which Cell ab = (0.2*0.4 + 0.2*0.2 + 0.6*0.3)/((0.2*0.2 + 0.2*0.2 + > 0.6*0.6)^0.5) * ((0.4*0.4 + 0.2*0.2 + 0.3*0.3)^0.5) = 0.84First we use read.zoo to reform DF into a multivariate time series and use rollapply (where we have used the devel version of zoo since it supports the partial= argument on rollapply). We then reform each resulting row into a matrix converting each row of each matrix to proportions. Finally we form the desired scaled cross product. # devel version of zoo install.packages("zoo", repos = "http://r-forge.r-project.org") library(zoo) z <- read.zoo(DF, split = 2, index = 3, FUN = identity) sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE) newDF <- lapply(1:nrow(r), function(i) prop.table(na.omit(matrix(r[i,], nc = 4, byrow = TRUE, dimnames = list(unique(DF$B), names(DF)[-2:-3]))[, -1]), 1)) names(newDF) <- time(z) lapply(mats, function(mat) tcrossprod(mat / sqrt(rowSums(mat^2)))) -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
Solved the problem: I guess I was still using the main version of zoo. Thanks again! -- View this message in context: http://r.789695.n4.nabble.com/Yearly-aggregates-and-matrices-tp3438140p3441723.html Sent from the R help mailing list archive at Nabble.com.
Thanks, but it did not really improve the speed. Why is it that when I change the layout of the matrix (which does not give the required results), the speed increases tremendously? So: library(reshape2) library(zoo) z <- read.zoo(DF, split = 3, index = 2, FUN = identity) # Split on 3 and index on 2 instead of vice versa sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE) or mm <- melt(DF, id = c("B", "C")) aa <- acast(mm, B ~ C + variable, FUN = sum) # B ~ C instead of C ~ B sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA r <- rollapply(aa, 3, sum.na, align = "right", partial = TRUE) Thanks! Gabor Grothendieck wrote:> > On Wed, Apr 27, 2011 at 2:03 PM, mathijsdevaan > <mathijsdevaan at gmail.com> wrote: >> Hi, >> >> Is there an alternative to "z <- read.zoo(DF, split = 2, index = 3, FUN >> identity)" and "r <- rollapply(z, 3, ?sum.na, align = "right", partial >> TRUE)"? I am trying to use the following script in which the split data >> (B) >> contains about 300000 unique cases and obviously I am getting an >> allocation >> error. Thanks! >> > > You could test the speed of this to see if its faster: > > library(reshape2) > library(zoo) > mm <- melt(DF, id = c("B", "C")) > aa <- acast(mm, C ~ B + variable, FUN = sum) > sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA > r <- rollapply(aa, 3, sum.na, align = "right", partial = TRUE) > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- View this message in context: http://r.789695.n4.nabble.com/Yearly-aggregates-and-matrices-tp3438140p3482174.html Sent from the R help mailing list archive at Nabble.com.
On Thu, Apr 28, 2011 at 4:49 PM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:> Thanks, but it did not really improve the speed. Why is it that when I change > the layout of the matrix (which does not give the required results), the > speed increases tremendously? So: > > library(reshape2) > library(zoo) > z <- read.zoo(DF, split = 3, index = 2, FUN = identity) # Split on 3 and > index on 2 instead of vice versa > sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA > r <- rollapply(z, 3, ?sum.na, align = "right", partial = TRUE) > > or > > mm <- melt(DF, id = c("B", "C")) > aa <- acast(mm, B ~ C + variable, FUN = sum) # B ~ C instead of C ~ B > sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA > r <- rollapply(aa, 3, ?sum.na, align = "right", partial = TRUE) >For me it makes little difference:> system.time(for(i in 1:100) read.zoo(DF, split = 3, index = 2, FUN = identity))user system elapsed 1.72 0.00 1.71> system.time(for(i in 1:100) read.zoo(DF, split = 2, index = 3, FUN = identity))user system elapsed 1.75 0.00 1.74 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
On Thu, Apr 28, 2011 at 10:13 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:> On Thu, Apr 28, 2011 at 4:49 PM, mathijsdevaan <mathijsdevaan at gmail.com> wrote: >> Thanks, but it did not really improve the speed. Why is it that when I change >> the layout of the matrix (which does not give the required results), the >> speed increases tremendously? So: >> >> library(reshape2) >> library(zoo) >> z <- read.zoo(DF, split = 3, index = 2, FUN = identity) # Split on 3 and >> index on 2 instead of vice versa >> sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA >> r <- rollapply(z, 3, ?sum.na, align = "right", partial = TRUE) >> >> or >> >> mm <- melt(DF, id = c("B", "C")) >> aa <- acast(mm, B ~ C + variable, FUN = sum) # B ~ C instead of C ~ B >> sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA >> r <- rollapply(aa, 3, ?sum.na, align = "right", partial = TRUE) >> > > > For me it makes little difference: > >> system.time(for(i in 1:100) read.zoo(DF, split = 3, index = 2, FUN = identity)) > ? user ?system elapsed > ? 1.72 ? ?0.00 ? ?1.71 >> system.time(for(i in 1:100) read.zoo(DF, split = 2, index = 3, FUN = identity)) > ? user ?system elapsed > ? 1.75 ? ?0.00 ? ?1.74and here it is with rollapply included:> system.time(for(i in 1:100) { read.zoo(DF, split = 3, index = 2, FUN = identity)+ r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE)}) user system elapsed 14.74 0.00 14.76 .> system.time(for(i in 1:100) {read.zoo(DF, split = 2, index = 3, FUN = identity)+ r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE)}) user system elapsed 14.72 0.00 14.74 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com