Bert Gunter
2015-Jun-16 20:02 UTC
[R] dplyr - counting a number of specific values in each column - for all columns at once
... my bad! -- I failed to read carefully.

A base syntax version is:

dat <- data.frame(a = sample(1:5, 10, replace = TRUE),
                  b = sample(3:7, 10, replace = TRUE),
                  g = sample(7:9, 10, replace = TRUE))

dev <- sample(1:3, 10, replace = TRUE)

sapply(dat, function(x)
    tapply(x, dev, function(x) sum(x == 5, na.rm = TRUE)))

  a b g
1 2 0 0
2 1 3 0
3 2 1 0

I think, no matter what, that there are 2 loops here: an outer one by column and an inner one by device within each column.

Being both old and lazy, I have found it easier and more natural to stick with the basic functional syntax of the "apply" family of functions rather than to learn an alternative database type syntax (and semantics). My applications were never so large that the possible execution inefficiency mattered. However, it certainly might for others. And of course, what is "natural" for me might not be for others.

Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
   -- Clifford Stoll

On Tue, Jun 16, 2015 at 12:47 PM, Hadley Wickham <h.wickham at gmail.com> wrote:

> On Tue, Jun 16, 2015 at 12:24 PM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
> > Hello!
> >
> > I have a data frame:
> >
> > md <- data.frame(a = c(3,5,4,5,3,5), b = c(5,5,5,4,4,1),
> >                  c = c(1,3,4,3,5,5), device = c(1,1,2,2,3,3))
> > myvars = c("a", "b", "c")
> > md[2,3] <- NA
> > md[4,1] <- NA
> > md
> >
> > I want to count the number of 5s in each column - by device. I can do it like this:
> >
> > library(dplyr)
> > group_by(md, device) %>%
> >   summarise(counts.a = sum(a==5, na.rm = T),
> >             counts.b = sum(b==5, na.rm = T),
> >             counts.c = sum(c==5, na.rm = T))
> >
> > However, in real life I'll have tons of variables (the length of
> > 'myvars' can be very large) - so that I can't specify those counts.a,
> > counts.b, etc. manually - dozens of times.
> >
> > Does dplyr allow to run the count of 5s on all 'myvars' columns at once?
>
> md %>%
>   group_by(device) %>%
>   summarise_each(funs(sum(. == 5, na.rm = TRUE)))
>
> Hadley
>
> --
> http://had.co.nz/
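For readers on a current dplyr: summarise_each() and funs(), used in Hadley's reply above, have since been deprecated. The following is a minimal sketch of the same per-device count written with across(), assuming dplyr 1.0.0 or later is installed. It also restricts the count to the poster's 'myvars' columns via all_of(); the result name 'counts' is only illustrative and does not appear in the thread.

```r
library(dplyr)

# The poster's example data, with the two NAs introduced as in the question
md <- data.frame(a = c(3,5,4,5,3,5), b = c(5,5,5,4,4,1),
                 c = c(1,3,4,3,5,5), device = c(1,1,2,2,3,3))
myvars <- c("a", "b", "c")
md[2,3] <- NA
md[4,1] <- NA

# Count the 5s in every column named in 'myvars', separately for each device
# (assumes dplyr >= 1.0.0 for across())
counts <- md %>%
  group_by(device) %>%
  summarise(across(all_of(myvars), ~ sum(.x == 5, na.rm = TRUE)))

print(counts)
```

Because the columns are selected by name from a character vector, the same call should scale to a long 'myvars' without listing each count by hand, which is what the original question asked for.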
David L Carlson
2015-Jun-16 20:22 UTC
[R] dplyr - counting a number of specific values in each column - for all columns at once
Not in base, but in stats:

> aggregate(md[,-4]==5, list(device=md$device), sum, na.rm=TRUE)
  device a b c
1      1 1 2 0
2      2 0 1 0
3      3 1 0 2

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352
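The sapply()/tapply() version earlier in the thread was shown on random data, so its output cannot be compared directly with the aggregate() result. As a small sketch, both idioms can be run on the original poster's 'md' and checked against each other. The names 'by_apply' and 'by_aggregate' are illustrative only, and the columns are picked through 'myvars' rather than by dropping column 4.

```r
# The poster's example data, with the two NAs introduced as in the question
md <- data.frame(a = c(3,5,4,5,3,5), b = c(5,5,5,4,4,1),
                 c = c(1,3,4,3,5,5), device = c(1,1,2,2,3,3))
myvars <- c("a", "b", "c")
md[2,3] <- NA
md[4,1] <- NA

# apply-family idiom: outer loop over columns, inner loop over devices
by_apply <- sapply(md[myvars], function(x)
  tapply(x, md$device, function(v) sum(v == 5, na.rm = TRUE)))

# aggregate() idiom: sum the logical "== 5" indicators within each device
by_aggregate <- aggregate(md[myvars] == 5, list(device = md$device),
                          sum, na.rm = TRUE)

# Both approaches should agree cell by cell
stopifnot(all(by_apply == as.matrix(by_aggregate[myvars])))
print(by_apply)
```

The practical difference is the shape of the result: sapply() returns a matrix with the devices as row names, while aggregate() returns a data frame with an explicit 'device' column.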
Bert Gunter
2015-Jun-16 21:53 UTC
[R] dplyr - counting a number of specific values in each column - for all columns at once
Yes, indeed. Thanks, David.

But if you check, tapply(), aggregate(), by(), etc. are all basically wrappers around lapply(). So it's all a question of what syntax one feels most comfortable with. However, note that data.table, plyr stuff and perhaps others are different in that they re-implement the underlying engines, thereby gaining efficiencies that some folks may want, as well as new syntax.

Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
   -- Clifford Stoll
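To illustrate the "wrappers around lapply()" point, here is a rough sketch for a single column: the tapply() call gives the same counts as an explicit split() followed by lapply(). This is meant as an illustration of the idea, not a description of the exact internals of each function. The helper 'count5' and the vectors 'x' and 'dev' are made up for the example; 'x' mirrors column 'a' of the poster's data.

```r
# One column of values and its grouping variable (mirrors md$a and md$device)
x   <- c(3, 5, 4, NA, 3, 5)
dev <- c(1, 1, 2, 2, 3, 3)

# Hypothetical helper: number of 5s in a vector, ignoring NAs
count5 <- function(v) sum(v == 5, na.rm = TRUE)

# The wrapper ...
via_tapply <- tapply(x, dev, count5)

# ... and the split-then-lapply steps written out by hand
via_lapply <- unlist(lapply(split(x, dev), count5))

# Same counts per device either way
stopifnot(all(via_tapply == via_lapply))
print(via_lapply)
```

Seen this way, the choice among tapply(), aggregate() and by() is mostly about the shape of the returned object and the calling syntax, which is the point made in the message above.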