Emmanuel Levy
2006-Sep-13 16:38 UTC
[R] group bunch of lines in a data.frame, an additional requirement
Thanks for pointing me out "aggregate", that works fine! There is one complication though: I have mixed types (numerical and character), So the matrix is of the form: A 1.0 200 ID1 A 3.0 800 ID1 A 2.0 200 ID1 B 0.5 20 ID2 B 0.9 50 ID2 C 5.0 70 ID1 One letter always has the same ID but one ID can be shared by many letters (like ID1) I just want to keep track of the ID, and get a matrix like: A 2.0 400 ID1 B 0.7 35 ID2 C 5.0 70 ID1 Any idea on how to do that without a loop? Many thanks, Emmanuel On 9/12/06, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:> Hello, > > I'd like to group the lines of a matrix so that: > A 1.0 200 > A 3.0 800 > A 2.0 200 > B 0.5 20 > B 0.9 50 > C 5.0 70 > > Would give: > A 2.0 400 > B 0.7 35 > C 5.0 70 > > So all lines corresponding to a letter (level), become a single line > where all the values of each column are averaged. > > I've done that with a loop but it doesn't sound right (it is very > slow). I imagine there is a > sort of "apply" shortcut but I can't figure it out. > > Please note that it is not exactly a matrix I'm using, the function > "typeof" tells me it's a list, however I access to it like it was a > matrix. > > Could someone help me with the right function to use, a help topic or > a piece of code? > > Thanks, > > Emmanuel >
Marc Schwartz (via MN)
2006-Sep-13 17:33 UTC
[R] group bunch of lines in a data.frame, an additional requirement
Try something like this: # Initial data frame> DFV1 V2 V3 V4 1 A 1.0 200 ID1 2 A 3.0 800 ID1 3 A 2.0 200 ID1 4 B 0.5 20 ID2 5 B 0.9 50 ID2 6 C 5.0 70 ID1 # Now do the aggregation to get the means DF.1 <- aggregate(DF[, 2:3], list(V1 = DF$V1), mean)> DF.1V1 V2 V3 1 A 2.0 400 2 B 0.7 35 3 C 5.0 70 # Now get the unique combinations of letters and IDs in DF DF.U <- unique(DF[, c("V1", "V4")])> DF.UV1 V4 1 A ID1 4 B ID2 6 C ID1 # Now merge the two data frames together, matching the letters DF.NEW <- merge(DF.1, DF.U, by = "V1")> DF.NEWV1 V2 V3 V4 1 A 2.0 400 ID1 2 B 0.7 35 ID2 3 C 5.0 70 ID1 See ?unique and ?merge for more information. Also, for the sake of clarification, these are not matrices, but data frames. A matrix may contain only one data type, whereas data frames are specifically designed to contain multiple data types as you have here. HTH, Marc Schwartz On Wed, 2006-09-13 at 17:38 +0100, Emmanuel Levy wrote:> Thanks for pointing me out "aggregate", that works fine! > > There is one complication though: I have mixed types (numerical and character), > > So the matrix is of the form: > > A 1.0 200 ID1 > A 3.0 800 ID1 > A 2.0 200 ID1 > B 0.5 20 ID2 > B 0.9 50 ID2 > C 5.0 70 ID1 > > One letter always has the same ID but one ID can be shared by many > letters (like ID1) > > I just want to keep track of the ID, and get a matrix like: > > A 2.0 400 ID1 > B 0.7 35 ID2 > C 5.0 70 ID1 > > Any idea on how to do that without a loop? > > Many thanks, > > Emmanuel > > On 9/12/06, Emmanuel Levy <emmanuel.levy at gmail.com> wrote: > > Hello, > > > > I'd like to group the lines of a matrix so that: > > A 1.0 200 > > A 3.0 800 > > A 2.0 200 > > B 0.5 20 > > B 0.9 50 > > C 5.0 70 > > > > Would give: > > A 2.0 400 > > B 0.7 35 > > C 5.0 70 > > > > So all lines corresponding to a letter (level), become a single line > > where all the values of each column are averaged. > > > > I've done that with a loop but it doesn't sound right (it is very > > slow). I imagine there is a > > sort of "apply" shortcut but I can't figure it out. > > > > Please note that it is not exactly a matrix I'm using, the function > > "typeof" tells me it's a list, however I access to it like it was a > > matrix. > > > > Could someone help me with the right function to use, a help topic or > > a piece of code? > > > > Thanks, > > > > Emmanuel > >
Gabor Grothendieck
2006-Sep-13 17:44 UTC
[R] group bunch of lines in a data.frame, an additional requirement
See below. On 9/13/06, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:> Thanks for pointing me out "aggregate", that works fine! > > There is one complication though: I have mixed types (numerical and character), > > So the matrix is of the form: > > A 1.0 200 ID1 > A 3.0 800 ID1 > A 2.0 200 ID1 > B 0.5 20 ID2 > B 0.9 50 ID2 > C 5.0 70 ID1 > > One letter always has the same ID but one ID can be shared by many > letters (like ID1) > > I just want to keep track of the ID, and get a matrix like: > > A 2.0 400 ID1 > B 0.7 35 ID2 > C 5.0 70 ID1 > > Any idea on how to do that without a loop?If V4 is a function of V1 then you can aggregate by it too and it will appear but have no effect on the classification:> aggregate(DF[2:3], DF[c(1,4)], mean)V1 V4 V2 V3 1 A ID1 2.0 400 2 C ID1 5.0 70 3 B ID2 0.7 35> > Many thanks, > > Emmanuel > > On 9/12/06, Emmanuel Levy <emmanuel.levy at gmail.com> wrote: > > Hello, > > > > I'd like to group the lines of a matrix so that: > > A 1.0 200 > > A 3.0 800 > > A 2.0 200 > > B 0.5 20 > > B 0.9 50 > > C 5.0 70 > > > > Would give: > > A 2.0 400 > > B 0.7 35 > > C 5.0 70 > > > > So all lines corresponding to a letter (level), become a single line > > where all the values of each column are averaged. > > > > I've done that with a loop but it doesn't sound right (it is very > > slow). I imagine there is a > > sort of "apply" shortcut but I can't figure it out. > > > > Please note that it is not exactly a matrix I'm using, the function > > "typeof" tells me it's a list, however I access to it like it was a > > matrix. > > > > Could someone help me with the right function to use, a help topic or > > a piece of code? > > > > Thanks, > > > > Emmanuel > > > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Emmanuel Levy
2006-Sep-15 14:32 UTC
[R] group bunch of lines in a data.frame, an additional requirement
(re)-Hello I actually thought about another possibility with a "1" column, a sum (instead of a mean), and a division of the columns for which I want the mean:> DF = data.frame( V1=c("A","A","A","B","B","C") , V2=c(1,3,2,0.5,0.9,5.0), V3=c(200,800,200,20,50,70), V4=c("ID1","ID1","ID1","ID2","ID2","ID3")) > DF2 = cbind(DF,o=rep(1,length(DF[,1]))) > DF3 = aggregate(DF2[,c(2,3,5)], data.frame(code=DF2[,1],id=DF2[,4]), sum, na.rm=T) > DF3[,3:4]=DF3[,3:4]/DF3[,5] > DF3code id V2 V3 o 1 A ID1 2.0 400 3 2 B ID2 0.7 35 2 3 C ID3 5.0 70 1 Thanks again, Best, Emmanuel On 9/15/06, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:> Dear Mark, dear Gabor, > > Thanks again for your help! It is great to be able to get answers when > no-one (can you believe this?) uses R in your lab. > > The package doBy looks really convenient for many puposes. > > Best, > > Emmanuel > > On 9/15/06, Gabor Grothendieck <ggrothendieck at gmail.com> wrote: > > Here are three different ways to do it: > > > > # base R > > fb <- function(x) > > c(V1 = x$V1[1], V4 = x$V4[1], V2.mean = mean(x$V2), > > V3.mean = mean(x$V3), n = length(x$V1)) > > do.call(rbind, by(DF, DF[c(1,4)], fb)) > > > > # package doBy > > library(doBy) > > summaryBy(V2 + V3 ~ V1 + V4, DF, FUN = c(mean, length))[,-5] > > > > # package reshape > > library(reshape) > > f <- function(x) c(mean = mean(x), n = length(x)) > > cast(melt(DF, id = c(1,4)), V1 + V4 ~ variable, fun.aggregate = f)[,-6] > > > > ---- > > > > > # base R > > > fb <- function(x) > > + c(V1 = x$V1[1], V4 = x$V4[1], V2.mean = mean(x$V2), > > + V3.mean = mean(x$V3), n = length(x$V1)) > > > do.call(rbind, by(DF, DF[c(1,4)], fb)) > > V1 V4 V2.mean V3.mean n > > [1,] 1 1 2.0 400 3 > > [2,] 3 1 5.0 70 1 > > [3,] 2 2 0.7 35 2 > > > > > > # package doBy > > > library(doBy) > > > summaryBy(V2 + V3 ~ V1 + V4, DF, FUN = c(mean, length))[,-5] > > V1 V4 mean.V2 mean.V3 length.V3 > > 1 A ID1 2.0 400 3 > > 2 C ID1 5.0 70 1 > > 3 B ID2 0.7 35 2 > > > > > > # package reshape > > > library(reshape) > > > f <- function(x) c(mean = mean(x), n = length(x)) > > > cast(melt(DF, id = c(1,4)), V1 + V4 ~ variable, fun.aggregate = f)[,-6] > > V1 V4 V2_mean V2_n V3_mean > > 1 A ID1 2.0 3 400 > > 2 B ID2 0.7 2 35 > > 3 C ID1 5.0 1 70 > > > > > > > > > > > > > > --- > > > > > library(doBy) > > > summaryBy(V2 + V3 ~ V1 + V4, DF, FUN = c(mean, length))[,-5] > > V1 V4 mean.V2 mean.V3 length.V3 > > 1 A ID1 2.0 400 3 > > 2 C ID1 5.0 70 1 > > 3 B ID2 0.7 35 2 > > > > > > library(reshape) > > > f <- function(x) c(mean = mean(x), n = length(x)) > > > cast(melt(DF, id = c(1,4)), V1 + V4 ~ variable, fun.aggregate = f)[,-6] > > V1 V4 V2_mean V2_n V3_mean > > 1 A ID1 2.0 3 400 > > 2 B ID2 0.7 2 35 > > 3 C ID1 5.0 1 70 > > > > > > > > On 9/14/06, Emmanuel Levy <emmanuel.levy at gmail.com> wrote: > > > Thanks Gabor, that is much faster than using a loop! > > > > > > I've got a last question: > > > > > > Can you think of a fast way of keeping track of the number of > > > observations collapsed for each entry? > > > > > > i.e. I'd like to end up with: > > > > > > A 2.0 400 ID1 3 (3obs in the first matrix) > > > B 0.7 35 ID2 2 (2obs in the first matrix) > > > C 5.0 70 ID1 1 (1obs in the first matrix) > > > > > > Or is it required to use an temporary matrix that is merged later? (As > > > examplified by Mark in a previous email?) > > > > > > Thanks a lot for your help, > > > > > > Emmanuel > > > > > > On 9/13/06, Gabor Grothendieck <ggrothendieck at gmail.com> wrote: > > > > See below. > > > > > > > > On 9/13/06, Emmanuel Levy <emmanuel.levy at gmail.com> wrote: > > > > > Thanks for pointing me out "aggregate", that works fine! > > > > > > > > > > There is one complication though: I have mixed types (numerical and character), > > > > > > > > > > So the matrix is of the form: > > > > > > > > > > A 1.0 200 ID1 > > > > > A 3.0 800 ID1 > > > > > A 2.0 200 ID1 > > > > > B 0.5 20 ID2 > > > > > B 0.9 50 ID2 > > > > > C 5.0 70 ID1 > > > > > > > > > > One letter always has the same ID but one ID can be shared by many > > > > > letters (like ID1) > > > > > > > > > > I just want to keep track of the ID, and get a matrix like: > > > > > > > > > > A 2.0 400 ID1 > > > > > B 0.7 35 ID2 > > > > > C 5.0 70 ID1 > > > > > > > > > > Any idea on how to do that without a loop? > > > > > > > > If V4 is a function of V1 then you can aggregate by it too and it will > > > > appear but have no effect on the classification: > > > > > > > > > aggregate(DF[2:3], DF[c(1,4)], mean) > > > > V1 V4 V2 V3 > > > > 1 A ID1 2.0 400 > > > > 2 C ID1 5.0 70 > > > > 3 B ID2 0.7 35 > > > > > > > > > > > > > > > > > > Many thanks, > > > > > > > > > > Emmanuel > > > > > > > > > > On 9/12/06, Emmanuel Levy <emmanuel.levy at gmail.com> wrote: > > > > > > Hello, > > > > > > > > > > > > I'd like to group the lines of a matrix so that: > > > > > > A 1.0 200 > > > > > > A 3.0 800 > > > > > > A 2.0 200 > > > > > > B 0.5 20 > > > > > > B 0.9 50 > > > > > > C 5.0 70 > > > > > > > > > > > > Would give: > > > > > > A 2.0 400 > > > > > > B 0.7 35 > > > > > > C 5.0 70 > > > > > > > > > > > > So all lines corresponding to a letter (level), become a single line > > > > > > where all the values of each column are averaged. > > > > > > > > > > > > I've done that with a loop but it doesn't sound right (it is very > > > > > > slow). I imagine there is a > > > > > > sort of "apply" shortcut but I can't figure it out. > > > > > > > > > > > > Please note that it is not exactly a matrix I'm using, the function > > > > > > "typeof" tells me it's a list, however I access to it like it was a > > > > > > matrix. > > > > > > > > > > > > Could someone help me with the right function to use, a help topic or > > > > > > a piece of code? > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Emmanuel > > > > > > > > > > > > > > > > ______________________________________________ > > > > > R-help at stat.math.ethz.ch mailing list > > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > > > > and provide commented, minimal, self-contained, reproducible code. > > > > > > > > > > > > > > >