thr3ads.net - R help - [R] Applying by() when groups have different lengths [Sep 2018]

Rich Shepard

2018-Sep-17 18:54 UTC

[R] Applying by() when groups have different lengths

My dataframe has 113K rows split by a factor into 58 separate data.frames,
with different numbers of rows (see error output below).

   I cannot think of a way of proving a sample of data; if a sample for a MWE
is desired, advice on producing one using dput() is needed.

   To summarize each group within this dataframe I'm using by() and getting
an error because of the different number of rows:
> by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
+ mean.rain <- mean(rainfall_by_site[, 'prcp'])
+ })
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE,  :
   arguments imply differing number of rows: 4900, 1085, 1894, 2844, 3520,
  647, 239, 3652, 3701, 3063, 176, 4713, 4887, 119, 165, 1221, 3358, 1457,
  4896, 166, 690, 1110, 212, 1727, 227, 236, 1175, 1485, 186, 769, 139, 203,
  2727, 4357, 1035, 1329, 1454, 973, 4536, 208, 350, 125, 3437, 731, 4894,
  2598, 2419, 752, 427, 136, 685, 4849, 914, 171

   My web searches have not found anything relevant; perhaps my search terms
(such as 'R: apply by() with different factor row numbers') can be
improved.

   The help pages found using apropos('by') appear the same: ?by,
?by.data.frame, ?by.default and provide no hint on how to work with unequal
rows per factor.

   How can I apply by() on these data.frames?

Rich

William Dunlap

2018-Sep-17 19:25 UTC

[R] Applying by() when groups have different lengths

> > by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
> + mean.rain <- mean(rainfall_by_site[, 'prcp'])
> + })

Note that you define a function of x which does not use x in it.
Hence, even if the function gave a value, it would give the same
value for each group.  To see what the 'x' in that function will
be, use the identity function:
> d <- data.frame(X=2^(0:5), Y=2^(6:11),
+                 Group=c("A","B","C","A","B","A"))
> by(d[,1:2], d$Group, function(x)x)
d$Group: A
   X    Y
1  1   64
4  8  512
6 32 2048
------------------------------------------------------------
d$Group: B
   X    Y
2  2  128
5 16 1024
------------------------------------------------------------
d$Group: C
  X   Y
3 4 256

I suspect you want to use the aggregate function.
> aggregate(d[,1:2], list(Group=d$Group), sum)
  Group  X    Y
1     A 41 2624
2     B 18 1152
3     C  4  256

or the functions in the dplyr package:
> d %>% group_by(Group) %>% summarize(sumX=sum(X), meanY=mean(Y))
# A tibble: 3 x 3
  Group  sumX meanY
  <fct> <dbl> <dbl>
1 A        41  875.
2 B        18  576
3 C         4  256
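
The point about the unused argument can be checked directly. The following is a minimal sketch that reuses the toy data frame d from this reply; the object names ignores_x and uses_x are made up for illustration. It compares a function that ignores x with one that uses it.

```r
# Toy data from the reply above: groups A, B and C have 3, 2 and 1 rows.
d <- data.frame(X = 2^(0:5), Y = 2^(6:11),
                Group = c("A", "B", "C", "A", "B", "A"))

# Ignores its argument: every group gets the mean of the whole X column.
ignores_x <- by(d, d$Group, function(x) mean(d$X))

# Uses its argument: every group gets the mean of its own rows only.
uses_x <- by(d, d$Group, function(x) mean(x$X))

print(as.vector(ignores_x))  # the same number repeated, once per group
print(as.vector(uses_x))     # one distinct mean per group
```

as.vector() is used here only to strip the "by" class so the two results are easy to compare side by side.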






Bill Dunlap
TIBCO Software
wdunlap tibco.com


MacQueen, Don

2018-Sep-17 19:26 UTC

[R] Applying by() when groups have different lengths

Try changing it to 

     by(rainfall_by_site, rainfall_by_site[, 'name'],
        function(x) {
          mean.rain <- mean(x[, 'prcp'])
        })

Inside the function, so to speak, the function sees an object named
"x", because that's how the function is defined:  function(x).
So you have to operate on x inside the function.  

For sure, the fact that the subgroups have different numbers of rows is not the
problem.
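
A small sketch of that last claim, using invented data: the column names 'name' and 'prcp' follow the thread, while the site labels, group sizes and values are assumptions made up for this example. The groups have 50, 7 and 1 rows, and by() still returns one mean per group.

```r
set.seed(1)  # arbitrary seed so the invented numbers are repeatable

# Hypothetical stand-in for the rainfall data, with very unequal groups.
rainfall <- data.frame(
  name = rep(c("Site A", "Site B", "Site C"), times = c(50, 7, 1)),
  prcp = runif(58)
)

# FUN operates on x, the sub-data-frame that by() passes in for each group.
site_means <- by(rainfall, rainfall[, "name"], function(x) mean(x[, "prcp"]))

print(table(rainfall$name))  # rows per group: 50, 7, 1
print(site_means)            # one mean per group
```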

-Don

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
 
 


MacQueen, Don

2018-Sep-17 19:35 UTC

[R] Applying by() when groups have different lengths

I'm also going to guess that maybe your object
   rainfall_by_site
has already been split into separate data frames (because of its name).

But by() does the splitting internally, so you should be passing it the original
unsplit data frame.
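
If that guess is right, the error can probably be reproduced on toy data. The sketch below assumes the object handed to by() was a list of data frames such as split() produces; whether that is what happened in the original session is not known. With a list, by() has to coerce its first argument to a single data frame, and that coercion is expected to fail when the pieces have different numbers of rows.

```r
d <- data.frame(X = 2^(0:5),
                Group = c("A", "B", "C", "A", "B", "A"))

# A list of three data frames with 3, 2 and 1 rows.
pieces <- split(d, d$Group)

# Pass the list instead of the data frame. tryCatch() captures the
# condition so the script keeps running and the message can be inspected.
res <- tryCatch(
  by(pieces, d$Group, function(x) mean(x$X)),
  error = function(e) e
)
print(inherits(res, "error"))
if (inherits(res, "error")) print(conditionMessage(res))

# Passing the original, unsplit data frame works.
ok <- by(d, d$Group, function(x) mean(x$X))
print(ok)
```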

You could supply example data by providing the first few rows of each of the
first few groups. That would be enough to test with.
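
One way to build such a sample and post it with dput() is sketched below. The data frame is an invented stand-in: the column names 'name' and 'prcp' come from the thread, and everything else (site labels, sizes, values, the helper names first_groups, kept and small) is hypothetical.

```r
# Invented stand-in for the real data.
rainfall <- data.frame(
  name = rep(c("Site A", "Site B", "Site C", "Site D"), times = c(10, 6, 8, 2)),
  prcp = round(seq(0, 2.5, length.out = 26), 2)
)

# Keep only the first three groups.
first_groups <- head(unique(rainfall$name), 3)
kept <- rainfall[rainfall$name %in% first_groups, ]

# Take the first three rows of each remaining group and stack them again.
small <- do.call(rbind, lapply(split(kept, kept$name), head, n = 3))
rownames(small) <- NULL

# dput() prints R code that recreates 'small'; paste that text into the post.
dput(small)
```

Readers can then assign the pasted structure(...) text to a name and work with the same nine rows.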

-Don

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
 
 


Bert Gunter

2018-Sep-17 19:38 UTC

[R] Applying by() when groups have different lengths

Inline.

Bert



On Mon, Sep 17, 2018 at 11:54 AM Rich Shepard <rshepard at appl-ecosys.com> wrote:
>    My dataframe has 113K rows split by a factor into 58 separate data.frames,
> with a different numbers of rows (see error output below).
>
>    I cannot think of a way of proving a sample of data; if a sample for a MWE
> is desired advice on produing one using dput() is needed.
>
This is gibberish. What does "proving a sample of data" mean? etc. Please
proofread and edit.
>
>    To summarize each group within this dataframe I'm using by() and getting
> an error because of the different number of rows:
>
> > by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
> + mean.rain <- mean(rainfall_by_site[, 'prcp'])
> + })
>
You are misspecifying your function. It has argument x, but you do not use
x in your function. Also the assignment at the end is unnecessary and
probably wrong for your use case. Please go through a tutorial on how to
write functions in R.

You are probably also misusing by(), but as you did not provide sufficient
information -- head(your_data_frame) or similar would have told us its
structure, rather than having us guess -- nor a reproducible example, it's
hard (for me) to figure out your intent. **PLEASE** follow the posting
guide and provide such information. You have been requested to do this
several times already.

Here is the sort of thing I think you wanted to do:
> set.seed(54321) ## for reproducibility
> df <- data.frame(f = sample(LETTERS[1:3], 12, rep = TRUE), y = runif(12))
> df
   f          y
1  B 0.04529991
2  B 0.65272100
3  A 0.99406601
4  A 0.67763735
5  A 0.91854517
6  C 0.46244494
7  A 0.57141480
8  A 0.45193882
9  B 0.16770701
10 B 0.06826135
11 A 0.89691069
12 C 0.27383703
> by(df, df$f, function(x)mean(x$y))
df$f: A
[1] 0.7517521
------------------------------------------------------
df$f: B
[1] 0.2334973
------------------------------------------------------
df$f: C
[1] 0.368141

Note that you do not first break up the df into separate df's, which sounds
like what you tried to do.

However, note that if all you want to do is summarize a *single* numeric
column by a factor, you do not need to use by() at all, which is designed
to work on (several columns of) the whole data frame simultaneously. For a
single column, tapply() is all you need (or, as Bill Dunlap noted,
functionality in the dplyr package).
> with(df,tapply(y,f,mean))
        A         B         C
0.7517521 0.2334973 0.3681410
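
The distinction between the two functions can be sketched on the small data frame from Bill Dunlap's reply (reused here as an assumption; the names mean_y and col_means are made up). tapply() covers the single-column case, while by() passes the whole sub-data-frame to the function, so one call can summarize several columns per group.

```r
d <- data.frame(X = 2^(0:5), Y = 2^(6:11),
                Group = c("A", "B", "C", "A", "B", "A"))

# One column, one number per group: tapply() is enough.
mean_y <- with(d, tapply(Y, Group, mean))

# Several columns at once: by() hands colMeans() the rows of each group.
col_means <- by(d[, c("X", "Y")], d$Group, colMeans)

print(mean_y)
print(col_means)
```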

Finally, if I have misunderstood your intent, my apologies. I tried.

-- Bert




Rich Shepard

2018-Sep-17 19:56 UTC

[R] Applying by() when groups have different lengths [RESOLVED]

On Mon, 17 Sep 2018, MacQueen, Don wrote:
> I'm also going to guess that maybe your object rainfall_by_site has
> already been split into separate data frames (because of its name). But
> by() does the splitting internally, so you should be passing it the
> original unsplit data frame.
Don,

   I did not pick up on by() doing the splitting for me when I read the help
file and a few web sites!

   Using the unsplit data.frame did the job; e.g.,

rainfall[, "name"]: Sandy 1.4 NE
[1] 0.1636066
------------------------------------------------------------ 
rainfall[, "name"]: Sandy 1.7 SSW
[1] 0.2021324
------------------------------------------------------------ 
rainfall[, "name"]: Sherwood 3.3 SE
[1] 0.1461752

   Now I know how to properly apply by() to an unsplit dataframe. Thanks for
the insightful lesson.

Best regards,

Rich

Rich Shepard

2018-Sep-17 20:10 UTC

[R] Applying by() when groups have different lengths

On Mon, 17 Sep 2018, Bert Gunter wrote:
>>    I cannot think of a way of proving a sample of data; if a sample for a
   Typo: s/proving/providing/

Rich
