Hello,
I have a dataframe with 3 variables. I want to loop through it to get the mean value of the variable `z`, as follows:
```
df = data.frame(x = c(rep(1,5), rep(2,5), rep(3,5)),
                y = rep(letters[1:5],3),
                z = rnorm(15),
                stringsAsFactors = FALSE)
m = vector()
for (i in unique(df$y)) {
  s = df[df$y == i,]
  m = append(m, mean(s$z))
}
names(m) = unique(df$y)
> (m)
         a          b          c          d          e
-0.6355382 -0.4218053 -0.7256680 -0.8320783 -0.2587004
```
The problem is that I have one million `y` values, so the work takes almost a day. I understand that vectorization will speed up the procedure. But how shall I write the procedure in vectorial terms?
Thank you
If I follow what you are trying to do, you want the mean of `z` for each value of `y`:

    tapply(df$z, df$y, mean)

> On Nov 17, 2021, at 8:20 AM, Luigi Marongiu <marongiu.luigi at gmail.com> wrote:
>
> The problem is that I have one million `y` values, so the work takes
> almost a day. I understand that vectorization will speed up the
> procedure. But how shall I write the procedure in vectorial terms?
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Kevin E. Thorpe
Head of Biostatistics, Applied Health Research Centre (AHRC)
Li Ka Shing Knowledge Institute of St. Michael's Hospital
Assistant Professor, Dalla Lana School of Public Health
University of Toronto
email: kevin.thorpe at utoronto.ca Tel: 416.864.5776 Fax: 416.864.3016
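As a rough check that the `tapply` call gives the same numbers as the loop in the original post, here is a minimal sketch on the example data. The `set.seed()` call and the name `m_tapply` are additions made here for illustration; they are not part of the thread.

```r
# Example data from the original post; set.seed() is added only so
# that the random z values are reproducible.
set.seed(1)
df <- data.frame(x = c(rep(1, 5), rep(2, 5), rep(3, 5)),
                 y = rep(letters[1:5], 3),
                 z = rnorm(15),
                 stringsAsFactors = FALSE)

# Loop from the original post: one pass over the data per group.
m <- vector()
for (i in unique(df$y)) {
  s <- df[df$y == i, ]
  m <- append(m, mean(s$z))
}
names(m) <- unique(df$y)

# Grouped mean in a single call (m_tapply is a hypothetical name).
m_tapply <- tapply(df$z, df$y, mean)

# Compare the values only; tapply returns a 1-d array, not a plain vector.
all.equal(unname(m), as.vector(m_tapply))
```

Note that `tapply` orders its result by the sorted group levels, whereas the loop follows order of first appearance. The two coincide here because `y` already appears in alphabetical order.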
Have a look at the base functions `tapply` and `aggregate`. For example, see:

- https://cran.r-project.org/doc/manuals/r-release/R-intro.html#The-function-tapply_0028_0029-and-ragged-arrays
- https://online.stat.psu.edu/stat484/lesson/9/9.2
- or `?tapply` and `?aggregate`.

Also, your current code subsets the whole data frame on every iteration: `s = df[df$y == i,]` could be `s = df$z[df$y == i]` (with `mean(s)` in place of `mean(s$z)`), I think.

HTH,
Jan

On 17-11-2021 14:20, Luigi Marongiu wrote:
> The problem is that I have one million `y` values, so the work takes
> almost a day. I understand that vectorization will speed up the
> procedure. But how shall I write the procedure in vectorial terms?
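For the `aggregate` route mentioned above, a minimal sketch using the formula interface on the example data from the original post might look like this. The seed and the name `res` are assumptions added here for illustration.

```r
# Example data from the original post; set.seed() is added for reproducibility.
set.seed(1)
df <- data.frame(x = c(rep(1, 5), rep(2, 5), rep(3, 5)),
                 y = rep(letters[1:5], 3),
                 z = rnorm(15),
                 stringsAsFactors = FALSE)

# Mean of z within each value of y; the result is a data frame with
# one row per group and the columns y and z (res is a hypothetical name).
res <- aggregate(z ~ y, data = df, FUN = mean)
res

# The z column should match what tapply returns for the same groups.
all.equal(res$z, as.vector(tapply(df$z, df$y, mean)))
```

The practical difference is the shape of the result: `aggregate` gives a data frame that can be merged back onto other tables, while `tapply` gives a named array.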
Hi,

besides tapply and aggregate, split with *apply could be used:

    sapply(with(df, split(z, y)), mean)

Cheers
Petr

> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Luigi Marongiu
> Sent: Wednesday, November 17, 2021 2:21 PM
> To: r-help <r-help at r-project.org>
> Subject: [R] vectorization of loops in R
>
> The problem is that I have one million `y` values, so the work takes
> almost a day. I understand that vectorization will speed up the
> procedure. But how shall I write the procedure in vectorial terms?
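One likely reason the original loop is slow is that `df$y == i` scans the whole column once per group, so the work grows with rows times groups, whereas `split` groups the values in a single pass. The sketch below tries the split approach on simulated data with many groups. The data frame `big`, the size `n_groups`, and the other names are hypothetical, and the size is deliberately smaller than the one million groups in the question. Timings will vary by machine, so none are shown.

```r
# Simulated data with many groups (hypothetical sizes, for illustration only).
set.seed(1)
n_groups <- 1e4
big <- data.frame(y = rep(seq_len(n_groups), each = 3),
                  z = rnorm(3 * n_groups))

# Split z by y once, then take the mean of each piece.
by_split <- sapply(with(big, split(z, y)), mean)

# vapply does the same but also checks that each result is a single number.
by_vapply <- vapply(split(big$z, big$y), mean, numeric(1))

# tapply for comparison; it returns a 1-d array, so compare values only.
by_tapply <- tapply(big$z, big$y, mean)
all.equal(unname(by_split), as.vector(by_tapply))

# Rough timings of the two approaches on this machine.
system.time(sapply(split(big$z, big$y), mean))
system.time(tapply(big$z, big$y, mean))
```

Using `vapply` rather than `sapply` is a small safety choice: it fails loudly if some group returns something other than one number, instead of silently changing the shape of the result.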