Maya Joshi
2011-Sep-03 03:18 UTC
[R] problem in applying function in data subset (with a level) - using plyr or other alternative are also welcome
Dear R experts. I might be missing something obvious. I have been trying to fix this problem for some weeks. Please help. #data ped <- c(rep(1, 4), rep(2, 3), rep(3, 3)) y <- rnorm(10, 8, 2) # variable set 1 M1a <- sample (c(1, 2,3), 10, replace= T) M1b <- sample (c(1, 2,3), 10, replace= T) M1aP1 <- sample (c(1, 2,3), 10, replace= T) M1bP2 <- sample (c(1, 2,3), 10, replace= T) # variable set 2 M2a <- sample (c(1, 2,3), 10, replace= T) M2b <- sample (c(1, 2,3), 10, replace= T) M2aP1 <- sample (c(1, 2,3), 10, replace= T) M2bP2 <- sample (c(1, 2,3), 10, replace= T) # variable set 3 M3a <- sample (c(1, 2,3), 10, replace= T) M3b <- sample (c(1, 2,3), 10, replace= T) M3aP1 <- sample (c(1, 2,3), 10, replace= T) M3bP2 <- sample (c(1, 2,3), 10, replace= T) mydf <- data.frame (ped, M1a,M1b,M1aP1,M1bP2, M2a,M2b,M2aP1,M2bP2, M3a,M3b,M3aP1,M3bP2, y) # functions and further calculations mmat <- matrix (c("M1a","M2a","M3a","M1b","M2b","M3b","M1aP1","M2aP1","M3aP1", "M1bP2","M2bP2","M3bP2"), ncol = 4) # first function myfun <- function(x) { x<- as.vector(x) ot1 <- ifelse(mydf[x[1]] == mydf[x[3]], 1, -1) ot2 <- ifelse(mydf[x[2]] == mydf[x[4]], 1, -1) qt <- ot1 + ot2 return(qt) } qt <- apply(mmat, 1, myfun) ydv <- c((y - mean(y))^2) qtd <- data.frame(ped, ydv, qt) # second function myfun2 <- function(dataframe) { vydv <- sum(ydv)*0.25 sumD <- sum(ydv * qt) Rt <- vydv / sumD return(Rt) } # using plyr require(plyr) dfsumd1 <- ddply(mydf,.(mydf$ped),myfun2) Here are 2 issues: (1) The output just one, I need the output for all three set of variables (as listed above) (2) all three values of dfsumd is returning to same for all level of ped: 1,2, 3 Means that the function is applied to whole dataset but only replicated in output !!! I tried with plyr not being lazy but due to my limited R knowledge, If you have a different suggestion, you are welcome too. Thank you in advance... Maya [[alternative HTML version deleted]]
David Winsemius
2011-Sep-03 04:28 UTC
[R] problem in applying function in data subset (with a level) - using plyr or other alternative are also welcome
On Sep 2, 2011, at 11:18 PM, Maya Joshi wrote:> Dear R experts. > > I might be missing something obvious. I have been trying to fix this > problem > for some weeks. Please help. > > #data > ped <- c(rep(1, 4), rep(2, 3), rep(3, 3)) > y <- rnorm(10, 8, 2) > > # variable set 1 > M1a <- sample (c(1, 2,3), 10, replace= T) > M1b <- sample (c(1, 2,3), 10, replace= T) > M1aP1 <- sample (c(1, 2,3), 10, replace= T) > M1bP2 <- sample (c(1, 2,3), 10, replace= T) > > # variable set 2 > M2a <- sample (c(1, 2,3), 10, replace= T) > M2b <- sample (c(1, 2,3), 10, replace= T) > M2aP1 <- sample (c(1, 2,3), 10, replace= T) > M2bP2 <- sample (c(1, 2,3), 10, replace= T) > > # variable set 3 > M3a <- sample (c(1, 2,3), 10, replace= T) > M3b <- sample (c(1, 2,3), 10, replace= T) > M3aP1 <- sample (c(1, 2,3), 10, replace= T) > M3bP2 <- sample (c(1, 2,3), 10, replace= T) > > mydf <- data.frame (ped, M1a,M1b,M1aP1,M1bP2, M2a,M2b,M2aP1,M2bP2, > M3a,M3b,M3aP1,M3bP2, y) > > # functions and further calculations > > mmat <- matrix > (c("M1a","M2a","M3a","M1b","M2b","M3b","M1aP1","M2aP1","M3aP1", > "M1bP2","M2bP2","M3bP2"), ncol = 4) > > # first function > myfun <- function(x) { > x<- as.vector(x) > ot1 <- ifelse(mydf[x[1]] == mydf[x[3]], 1, -1)You really ought to explain what you are trying to do. This code will compare two lists. The list from mydf[x2]] will be the column mydf["M2b"] which will have as its first element the four element vector assigned above. I am guessing that was not what you wanted. Notice this simple case using the "[" function as you are attempting throws an error: > list(M2b = c(1,2,3)) == list(M2a = c(1,2,3)) Error in list(M2b = c(1, 2, 3)) == list(M2a = c(1, 2, 3)) : comparison of these types is not implemented' So ... now that your code has failed to explain what you wanted why don't your try some natural language explanations. Notice that the "[[" function is generally what people want when extracting from lists: > list(M2b = c(1,2,3))[["M2b"]] == list(M2a = c(1,2,3))[["M2a"]] [1] TRUE TRUE TRUE> ot2 <- ifelse(mydf[x[2]] == mydf[x[4]], 1, -1) > qt <- ot1 + ot2 > return(qt) > } > qt <- apply(mmat, 1, myfun) > ydv <- c((y - mean(y))^2) > qtd <- data.frame(ped, ydv, qt) > > # second function > myfun2 <- function(dataframe) { > vydv <- sum(ydv)*0.25 > sumD <- sum(ydv * qt) > Rt <- vydv / sumD > return(Rt) > } > > # using plyr > require(plyr) > dfsumd1 <- ddply(mydf,.(mydf$ped),myfun2) > > Here are 2 issues: > (1) The output just one, I need the output for all three set of > variables > (as listed above)An incredibly vague description.> > (2) all three values of dfsumd is returning to same for all level > of ped: > 1,2, 3I sympathize with those forced to adopt the English language, but that is the standard this decade. So givne your apparent difficulties, you need to exert more effort at making explicit what is _supposed_ to be returned.> Means that the function is applied to whole dataset but only > replicated in > output !!! > > I tried with plyr not being lazy but due to my limited R knowledge, > If you > have a different suggestion, you are welcome too. > > Thank you in advance... > > Maya > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius, MD West Hartford, CT
Dennis Murphy
2011-Sep-03 08:07 UTC
[R] problem in applying function in data subset (with a level) - using plyr or other alternative are also welcome
Hi: I tried to figure out what you were doing...some of it I think I grasped, other parts not so much. On Fri, Sep 2, 2011 at 8:18 PM, Maya Joshi <maya.d.joshi at gmail.com> wrote:> Dear R experts. > > I might be missing something obvious. I have been trying to fix this problem > for some weeks. Please help.Let's start by reducing some of your code: ped <- rep(1:3, c(4, 3, 3)) y <- rnorm(10, 8, 2) # This replaces all of your sample() statements, and is equivalent: smat <- matrix(sample(1:3, 120, replace = TRUE), ncol = 12) colnames(smat) <- c('M1a', 'M1b', 'M1aP1', 'M1bP2', 'M2a', 'M2b', 'M2aP1', 'M2bP2', 'M3a', 'M3b', 'M3aP1', 'M3bP2') mydf <- as.data.frame(cbind(ped, y, smat))> > #data > ped <- c(rep(1, 4), rep(2, 3), rep(3, 3)) > y <- rnorm(10, 8, 2) > > # variable set 1 > M1a <- sample (c(1, 2,3), 10, replace= T) > M1b <- sample (c(1, 2,3), 10, replace= T) > M1aP1 <- sample (c(1, 2,3), 10, replace= T) > M1bP2 <- sample (c(1, 2,3), 10, replace= T) > > # variable set 2 > M2a <- sample (c(1, 2,3), 10, replace= T) > M2b <- sample (c(1, 2,3), 10, replace= T) > M2aP1 <- sample (c(1, 2,3), 10, replace= T) > M2bP2 <- sample (c(1, 2,3), 10, replace= T) > > # variable set 3 > M3a <- sample (c(1, 2,3), 10, replace= T) > M3b <- sample (c(1, 2,3), 10, replace= T) > M3aP1 <- sample (c(1, 2,3), 10, replace= T) > M3bP2 <- sample (c(1, 2,3), 10, replace= T) > > mydf <- data.frame (ped, M1a,M1b,M1aP1,M1bP2, M2a,M2b,M2aP1,M2bP2, > M3a,M3b,M3aP1,M3bP2, y) > > # functions and further calculations > > mmat <- matrix > (c("M1a","M2a","M3a","M1b","M2b","M3b","M1aP1","M2aP1","M3aP1", > "M1bP2","M2bP2","M3bP2"), ncol = 4)As far as I can tell, you want to compare a with aP1 and b with bP1 for all three M values. Given the way your data are arranged, each row of the matrix smat contains all 12 values. apply() on index 1 generally takes a row vector as its input source. Your function to pass to apply should respect that. My version of myfun() takes the row vector as input, divides it into two subgroups of six, uses the ifelse() function to do the comparisons and reshapes it into a 3 x 2 matrix, and then finally takes the row sums of the matrix, which is the return value from the function. myfun <- function(x) { # Indices of the input vector to be compared u <- c(1, 2, 5, 6, 9, 10) v <- c(3, 4, 7, 8, 11, 12) ot <- matrix(ifelse(x[u] == x[v], 1, -1), ncol = 2, byrow = TRUE) rowSums(ot) } # Apply myfun() to the matrix of samples (smat) qt <- t(apply(smat, 1, myfun)) colnames(qt) <- paste('M', 1L:3L, sep = '') mydf2 <- data.frame(ped, y, as.data.frame(qt), Msum = rowSums(qt)) I wasn't sure what you wanted to do with M1-M3, so I added a row sum variable just in case.> > # first function > myfun <- function(x) { > x<- as.vector(x) > ot1 <- ifelse(mydf[x[1]] == mydf[x[3]], 1, -1) > ot2 <- ifelse(mydf[x[2]] == mydf[x[4]], 1, -1) > qt <- ot1 + ot2 > return(qt) > } > qt <- apply(mmat, 1, myfun)# Mean deviation vector for y - OK.> ydv <- c((y - mean(y))^2) > qtd <- data.frame(ped, ydv, qt) >Here's the point where things get more confusing. I'm not sure what you intend from sumD; as it stands, it will, in each row, multiply y by the values in qt, which will generate an n x 3 matrix, n representing the number of rows in the sub-data frame indexed by ped. sumD will then sum the entire n x 3 matrix. Is that what you wanted? More generally, when you write a function to pass to ddply(), the input argument should be a data frame and the output should be a data frame, especially if you intend to return more than one value. Part of the problem with your function is that the variables inside the body have no reference to the input data frame. Moreover, your mean squared deviation function always uses 4 as the denominator rather than the number of rows of the input data frame. (We'll ignore the bias in the estimator since we're concentrating on the code.) I have absolutely no idea if the following is what you intended, but I'm showing this to illustrate how to create a function for input into ddply() that can be applied groupwise. myfun2 <- function(d) { # input argument d is a data frame # vector of squared mean deviations of y ydv <- with(d, (y - mean(y))^2) vydv <- mean(ydv) # multiply the squared deviations vector by the sum # of M1-M3 and then add the cross-products sumD <- sum(ydv * d$Msum) # or equivalently, as an inner product: # sumD <- ydv %*% d$Msum # Return the ratio as a data frame data.frame(Rt = vydv / sumD) } ddply(mydf2, .(ped), myfun2) ## My result:> ddply(mydf2, .(ped), myfun2)ped Rt 1 1 0.35787388 2 2 -0.17739049 3 3 -0.09958257 I'm reasonably sure that myfun() is OK, but myfun2() contains more than a little guesswork. Assuming it's wrong, what the code can tell you is how to pass the variables in from the input data frame and to output a data frame when called within ddply(). HTH, Dennis> # second function > myfun2 <- function(dataframe) { > vydv <- sum(ydv)*0.25 > sumD <- sum(ydv * qt) > Rt <- vydv / sumD > return(Rt) > } > > # using plyr > require(plyr) > dfsumd1 <- ddply(mydf,.(mydf$ped),myfun2) > > Here are 2 issues: > (1) The output just one, I need the output for all three set of variables > (as listed above) > > (2) ?all three values of dfsumd is returning to same for all level of ped: > 1,2, 3 > Means that the function is applied to whole dataset but only replicated in > output !!! > > I tried with plyr not being lazy but due to my limited R knowledge, If you > have a different suggestion, you are welcome too. > > Thank you in advance... > > Maya > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >