My dataframe has 113K rows split by a factor into 58 separate data.frames,
with a different numbers of rows (see error output below).

I cannot think of a way of proving a sample of data; if a sample for a MWE
is desired advice on produing one using dput() is needed.

To summarize each group within this dataframe I'm using by() and getting an
error because of the different number of rows:

> by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
+ mean.rain <- mean(rainfall_by_site[, 'prcp'])
+ })
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
  arguments imply differing number of rows: 4900, 1085, 1894, 2844, 3520,
  647, 239, 3652, 3701, 3063, 176, 4713, 4887, 119, 165, 1221, 3358, 1457,
  4896, 166, 690, 1110, 212, 1727, 227, 236, 1175, 1485, 186, 769, 139, 203,
  2727, 4357, 1035, 1329, 1454, 973, 4536, 208, 350, 125, 3437, 731, 4894,
  2598, 2419, 752, 427, 136, 685, 4849, 914, 171

My web searches have not found anything relevant; perhaps my search terms
(such as 'R: apply by() with different factor row numbers') can be improved.

The help pages found using apropos('by') appear the same: ?by,
?by.data.frame, ?by.default and provide no hint on how to work with unequal
rows per factor.

How can I apply by() on these data.frames?

Rich
> > by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
> + mean.rain <- mean(rainfall_by_site[, 'prcp'])
> + })

Note that you define a function of x which does not use x in it. Hence,
even if the function gave a value, it would give the same value for each
group.

To see what the 'x' in that function will be, use the identity function:

> d <- data.frame(X=2^(0:5), Y=2^(6:11), Group=c("A","B","C","A","B","A"))
> by(d[,1:2], d$Group, function(x)x)
d$Group: A
   X    Y
1  1   64
4  8  512
6 32 2048
------------------------------------------------------------
d$Group: B
   X    Y
2  2  128
5 16 1024
------------------------------------------------------------
d$Group: C
  X   Y
3 4 256

I suspect you want to use the aggregate function.

> aggregate(d[,1:2], list(Group=d$Group), sum)
  Group  X    Y
1     A 41 2624
2     B 18 1152
3     C  4  256

or the functions in the dplyr package:

> d %>% group_by(Group) %>% summarize(sumX=sum(X), meanY=mean(Y))
# A tibble: 3 x 3
  Group  sumX meanY
  <fct> <dbl> <dbl>
1 A        41  875.
2 B        18  576
3 C         4  256

Bill Dunlap
TIBCO Software
wdunlap tibco.com

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
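The point about the unused argument can be checked with a minimal sketch
that reuses Bill's data frame d. The object names same_everywhere, per_group
and agg are made up for this illustration, and the formula call to
aggregate() is an alternative spelling, not the one used in the message.

```r
# Bill's example data: three groups of unequal size (3, 2 and 1 rows).
d <- data.frame(X = 2^(0:5), Y = 2^(6:11),
                Group = c("A", "B", "C", "A", "B", "A"))

# The function ignores x and reads the whole column from d, so every
# group gets the overall mean of X.
same_everywhere <- by(d[, 1:2], d$Group, function(x) mean(d$X))

# The function uses x, the rows of one group, so each group gets its own mean.
per_group <- by(d[, 1:2], d$Group, function(x) mean(x$X))

# The same per-group means through aggregate()'s formula interface.
agg <- aggregate(cbind(X, Y) ~ Group, data = d, FUN = mean)

print(same_everywhere)
print(per_group)
print(agg)
```

Comparing the first two printouts should show one repeated number in the
first and three different numbers in the second.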
Try changing it to

by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
  mean.rain <- mean(x[, 'prcp'])
})

Inside the function, so to speak, the function sees an object named "x",
because that's how the function is defined: function(x). So you have to
operate on x inside the function.

For sure, the fact that the subgroups have different numbers of rows is not
the problem.

-Don

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
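A small sketch of the corrected call, under the assumption that the data
frame is unsplit and has columns 'name' and 'prcp' as in the original post.
The data frame rainfall, its site names and its values are invented here;
the group sizes differ on purpose.

```r
# Hypothetical unsplit data frame with groups of 4, 1 and 2 rows.
rainfall <- data.frame(
  name = c(rep("Site 1", 4), "Site 2", rep("Site 3", 2)),
  prcp = c(0, 2, 4, 6, 5, 1, 3)
)

# The function works on x, the rows belonging to one site.
site_means <- by(rainfall, rainfall[, "name"], function(x) mean(x[, "prcp"]))
print(site_means)

# Rows per site, to confirm that the groups really are unequal.
site_sizes <- table(rainfall$name)
print(site_sizes)
```

by() returns one value per site regardless of how many rows each site
contributes.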
I'm also going to guess that maybe your object rainfall_by_site has already
been split into separate data frames (because of its name). But by() does
the splitting internally, so you should be passing it the original unsplit
data frame.

You could supply example data by providing the first few rows of each of
the first few groups. That would be enough to test with.

-Don
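Don's guess can be tried out on made-up data. The sketch below assumes the
split object was created with split(), which the thread does not confirm;
all names and values are hypothetical. It also shows one possible way to
build a small sample for dput() from the first rows of each group.

```r
# Hypothetical unsplit data with groups of 3, 2 and 1 rows.
rainfall <- data.frame(
  name = c("Site 1", "Site 1", "Site 1", "Site 2", "Site 2", "Site 3"),
  prcp = c(0.0, 0.2, 0.4, 1.0, 3.0, 0.5)
)

# split() returns a list of data frames, one per site.
rainfall_by_site <- split(rainfall, rainfall$name)

# Coercing that list to a single data frame fails because the pieces have
# different numbers of rows; tryCatch() captures the message.
msg <- tryCatch(
  {
    as.data.frame(rainfall_by_site)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# If the data are already split, sapply() over the list gives group means.
means_from_list <- sapply(rainfall_by_site, function(x) mean(x$prcp))

# by() on the unsplit data frame gives the same numbers.
means_from_by <- by(rainfall, rainfall$name, function(x) mean(x$prcp))

# A small reproducible sample: the first two rows of each group.
sample_rows <- do.call(rbind, lapply(rainfall_by_site, head, n = 2))
dput(sample_rows)
```

The output of dput() can be pasted into a message so that others can
recreate sample_rows exactly.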
Inline.

Bert

On Mon, Sep 17, 2018 at 11:54 AM Rich Shepard <rshepard at appl-ecosys.com> wrote:

> My dataframe has 113K rows split by a factor into 58 separate data.frames,
> with a different numbers of rows (see error output below).
>
> I cannot think of a way of proving a sample of data; if a sample for a MWE
> is desired advice on produing one using dput() is needed.

This is gibberish. What does "proving a sample of data" mean? etc. Please
proofread and edit.

> To summarize each group within this dataframe I'm using by() and getting
> an error because of the different number of rows:
>
> > by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
> + mean.rain <- mean(rainfall_by_site[, 'prcp'])
> + })

You are misspecifying your function. It has argument x, but you do not use
x in your function. Also the assignment at the end is unnecessary and
probably wrong for your use case. Please go through a tutorial on how to
write functions in R.

You are probably also misusing by(), but as you did not provide sufficient
information -- head(your_data_frame) or similar would have told us its
structure, rather than having us guess -- nor a reproducible example, it's
hard (for me) to figure out your intent. **PLEASE** follow the posting
guide and provide such information. You have been requested to do this
several times already.
Here is the sort of thing I think you wanted to do:

> set.seed(54321) ## for reproducibility
> df <- data.frame(f = sample(LETTERS[1:3], 12, rep = TRUE), y = runif(12))
> df
   f          y
1  B 0.04529991
2  B 0.65272100
3  A 0.99406601
4  A 0.67763735
5  A 0.91854517
6  C 0.46244494
7  A 0.57141480
8  A 0.45193882
9  B 0.16770701
10 B 0.06826135
11 A 0.89691069
12 C 0.27383703
> by(df, df$f, function(x)mean(x$y))
df$f: A
[1] 0.7517521
------------------------------------------------------
df$f: B
[1] 0.2334973
------------------------------------------------------
df$f: C
[1] 0.368141

Note that you do not first break up the df into separate df's, which sounds
like what you tried to do.

However, note that if all you want to do is summarize a *single* numeric
column by a factor, you do not need to use by() at all, which is designed
to work on (several columns of) the whole data frame simultaneously. For a
single column, tapply() is all you need (or, as Bill noted, functionality
in the dplyr package).

> with(df,tapply(y,f,mean))
        A         B         C
0.7517521 0.2334973 0.3681410

Finally, if I have misunderstood your intent, my apologies. I tried.

-- Bert
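Two of Bert's remarks can be checked with a short sketch built on his
simulated data frame. The names by_means, tapply_means, with_assign and
without_assign are invented for this illustration. Newer R versions may
draw different random values than those printed in the message, so no
specific numbers are assumed here.

```r
# Simulated data as in the message; replace = TRUE spells out rep = TRUE.
set.seed(54321)
df <- data.frame(f = sample(LETTERS[1:3], 12, replace = TRUE), y = runif(12))

# by() on the whole data frame and tapply() on the single column
# compute the same per-group means.
by_means     <- by(df, df$f, function(x) mean(x$y))
tapply_means <- with(df, tapply(y, f, mean))
print(all.equal(as.numeric(by_means), as.numeric(tapply_means)))

# A function whose last expression is an assignment still returns the
# assigned value, only invisibly, so the assignment adds nothing.
with_assign    <- function(x) { m <- mean(x$y) }
without_assign <- function(x) mean(x$y)
print(identical(with_assign(df), without_assign(df)))
```

Both comparisons should print TRUE, which suggests that for one numeric
column the shorter tapply() call is enough.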
Rich Shepard
2018-Sep-17 19:56 UTC
[R] Applying by() when groups have different lengths [RESOLVED]
On Mon, 17 Sep 2018, MacQueen, Don wrote:

> I'm also going to guess that maybe your object rainfall_by_site has
> already been split into separate data frames (because of its name). But
> by() does the splitting internally, so you should be passing it the
> original unsplit data frame.

Don,

I did not pick up on by() doing the splitting for me when I read the help
file and a few web sites! Using the unsplit data.frame did the job; e.g.,

rainfall[, "name"]: Sandy 1.4 NE
[1] 0.1636066
------------------------------------------------------------
rainfall[, "name"]: Sandy 1.7 SSW
[1] 0.2021324
------------------------------------------------------------
rainfall[, "name"]: Sherwood 3.3 SE
[1] 0.1461752

Now I know how to properly apply by() to an unsplit dataframe.

Thanks for the insightful lesson.

Best regards,

Rich