thr3ads.net - R help - [R] aggregate formula - differing results [Sep 2023]

If this information is useful, please help other people find it:
Share via:

Ivan Calandra

2023-Sep-04 11:51 UTC

[R] aggregate formula - differing results

Thanks Rui for your help; that would be one possibility indeed.

But am I the only one who finds that behavior of aggregate() completely 
unexpected and confusing? Especially considering that dplyr::summarise() 
and doBy::summaryBy() deal with NAs differently, even though they all 
use mean(na.rm = TRUE) to calculate the group stats.

Best wishes,
Ivan

On 04/09/2023 13:46, Rui Barradas wrote:> ?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
>> Dear useRs,
>>
>> I have just stumbled across a behavior in aggregate() that I cannot 
>> explain. Any help would be appreciated!
>>
>> Sample data:
>> my_data <- structure(list(ID = c("FLINT-1",
"FLINT-10", "FLINT-100",
>> "FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102",
>> "HORN-103", "HORN-104"), EdgeLength = c(130.75,
168.77, 142.79,
>> 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = 
>> c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 
>> 1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869, 48.011, 
>> 54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982, 
>> 67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417, 
>> 35.297), PLATWIDTH = c(38.84, NA, 15.33, 30.37, 11.44, 14.88, 13.86, 
>> NA, NA, 26.71), PLATTHICK = c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 
>> 4.58, NA, NA, 9.35), EPA = c(78, NA, 78, 54, 72, 49, 56, NA, NA, 56), 
>> THICKNESS = c(10.97, NA, 9.36, 6.4, 5.89, 11.05, 4.9, NA, NA, 10.08), 
>> WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, 29.5, 4.5, NA, NA, 23), RAWMAT 
>> = c("FLINT", "FLINT", "FLINT",
"FLINT", "FLINT", "HORNFELS",
>> "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS")), row.names = c(1L,
>> 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
"data.frame")
>>
>> 1) Simple aggregation with 2 variables:
>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean, 
>> na.rm = TRUE)
>>
>> 2) Using the dot notation - different results:
>> aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
>>
>> 3) Using dplyr, I get the same results as #1:
>> group_by(my_data, RAWMAT) %>%
>> ?? summarise(across(c("Length", "Width"), ~
mean(.x, na.rm = TRUE)))
>>
>> 4) It gets weirder: using all columns in #1 give the same results as 
>> in #2 but different from #1 and #3
>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH, 
>> PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = 
>> mean, na.rm = TRUE)
>>
>> So it seems it is not only due to the notation (cbind() vs. dot). Is 
>> it a bug? A peculiar thing in my dataset? I tend to think this could 
>> be due to some variables (or their names) as all notations seem to 
>> agree when I remove some variables (although I haven't found out 
>> which variable(s) is (are) at fault), e.g.:
>>
>> my_data2 <- structure(list(ID = c("FLINT-1",
"FLINT-10", "FLINT-100",
>> "FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102",
>> "HORN-103", "HORN-104"), EdgeLength = c(130.75,
168.77, 142.79,
>> 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = 
>> c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 
>> 1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869, 48.011, 
>> 54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982, 
>> 67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417, 
>> 35.297), RAWMAT = c("FLINT", "FLINT",
"FLINT", "FLINT", "FLINT",
>> "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")),
>> row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), 
>> class = "data.frame")
>>
>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, 
>> data = my_data2, FUN = mean, na.rm = TRUE)
>>
>> aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
>>
>> group_by(my_data2, RAWMAT) %>%
>> ?? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
>>
>>
>> Thank you in advance for any hint.
>> Best wishes,
>> Ivan
>>
>>
>>
>>
>> ?????*LEIBNIZ-ZENTRUM*
>> *F?R ARCH?OLOGIE*
>>
>> *Dr. Ivan CALANDRA*
>> **Head of IMPALA (IMaging Platform At LeizA)
>>
>> *MONREPOS* Archaeological Research Centre, Schloss Monrepos
>> 56567 Neuwied, Germany
>>
>> T: +49 2631 9772 243
>> T: +49 6131 8885 543
>> ivan.calandra at leiza.de
>>
>> leiza.de <http://www.leiza.de/>
>> <http://www.leiza.de/>
>> ORCID <https://orcid.org/0000-0003-3816-6359>
>> ResearchGate
>> <https://www.researchgate.net/profile/Ivan_Calandra>
>>
>> LEIZA is a foundation under public law of the State of 
>> Rhineland-Palatinate and the City of Mainz. Its headquarters are in 
>> Mainz. Supervision is carried out by the Ministry of Science and 
>> Health of the State of Rhineland-Palatinate. LEIZA is a research 
>> museum of the Leibniz Association.
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> Hello,
>
> You can define a vector of the columns of interest and subset the data 
> with it. Then the default na.action = na.omit will no longer remove 
> the rows with NA vals in at least one column and the results are the 
> same.
>
> However, this will not give the mean values of the other numeric 
> columns, just of those two.
>
>
>
> # define a vector of columns of interest
> cols <- c("Length", "Width", "RAWMAT")
>
> # 1) Simple aggregation with 2 variables, select cols:
> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN = 
> mean, na.rm = TRUE)
>
> # 2) Using the dot notation - if cols are selected, equal results:
> aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
>
> # 3) Using dplyr, the results are now the same results as #1 and #2:
> my_data %>%
> ? select(all_of(cols)) %>%
> ? group_by(RAWMAT) %>%
> ? summarise(across(c("Length", "Width"), ~ mean(.x,
na.rm = TRUE)))
>
>
> Hope this helps,
>
> Rui Barradas
>

Rui Barradas

2023-Sep-04 11:56 UTC

head link

[R] aggregate formula - differing results

?s 12:51 de 04/09/2023, Ivan Calandra escreveu:> Thanks Rui for your help; that would be one possibility indeed.
> 
> But am I the only one who finds that behavior of aggregate() completely 
> unexpected and confusing? Especially considering that dplyr::summarise() 
> and doBy::summaryBy() deal with NAs differently, even though they all 
> use mean(na.rm = TRUE) to calculate the group stats.
> 
> Best wishes,
> Ivan
> 
> On 04/09/2023 13:46, Rui Barradas wrote:
>> ?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
>>> Dear useRs,
>>>
>>> I have just stumbled across a behavior in aggregate() that I cannot
>>> explain. Any help would be appreciated!
>>>
>>> Sample data:
>>> my_data <- structure(list(ID = c("FLINT-1",
"FLINT-10", "FLINT-100",
>>> "FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102",
>>> "HORN-103", "HORN-104"), EdgeLength = c(130.75,
168.77, 142.79,
>>> 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = 
>>> c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 
>>> 1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869,
48.011,
>>> 54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982, 
>>> 67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417, 
>>> 35.297), PLATWIDTH = c(38.84, NA, 15.33, 30.37, 11.44, 14.88,
13.86,
>>> NA, NA, 26.71), PLATTHICK = c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 
>>> 4.58, NA, NA, 9.35), EPA = c(78, NA, 78, 54, 72, 49, 56, NA, NA,
56),
>>> THICKNESS = c(10.97, NA, 9.36, 6.4, 5.89, 11.05, 4.9, NA, NA,
10.08),
>>> WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, 29.5, 4.5, NA, NA, 23),
RAWMAT
>>> = c("FLINT", "FLINT", "FLINT",
"FLINT", "FLINT", "HORNFELS",
>>> "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS")), row.names = c(1L,
>>> 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
"data.frame")
>>>
>>> 1) Simple aggregation with 2 variables:
>>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN =
mean,
>>> na.rm = TRUE)
>>>
>>> 2) Using the dot notation - different results:
>>> aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
>>>
>>> 3) Using dplyr, I get the same results as #1:
>>> group_by(my_data, RAWMAT) %>%
>>> ?? summarise(across(c("Length", "Width"), ~
mean(.x, na.rm = TRUE)))
>>>
>>> 4) It gets weirder: using all columns in #1 give the same results
as
>>> in #2 but different from #1 and #3
>>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH, 
>>> PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = 
>>> mean, na.rm = TRUE)
>>>
>>> So it seems it is not only due to the notation (cbind() vs. dot).
Is
>>> it a bug? A peculiar thing in my dataset? I tend to think this
could
>>> be due to some variables (or their names) as all notations seem to 
>>> agree when I remove some variables (although I haven't found
out
>>> which variable(s) is (are) at fault), e.g.:
>>>
>>> my_data2 <- structure(list(ID = c("FLINT-1",
"FLINT-10", "FLINT-100",
>>> "FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102",
>>> "HORN-103", "HORN-104"), EdgeLength = c(130.75,
168.77, 142.79,
>>> 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = 
>>> c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 
>>> 1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869,
48.011,
>>> 54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982, 
>>> 67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417, 
>>> 35.297), RAWMAT = c("FLINT", "FLINT",
"FLINT", "FLINT", "FLINT",
>>> "HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")),
>>> row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), 
>>> class = "data.frame")
>>>
>>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, 
>>> data = my_data2, FUN = mean, na.rm = TRUE)
>>>
>>> aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm =
TRUE)
>>>
>>> group_by(my_data2, RAWMAT) %>%
>>> ?? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
>>>
>>>
>>> Thank you in advance for any hint.
>>> Best wishes,
>>> Ivan
>>>
>>>
>>>
>>>
>>> ?????*LEIBNIZ-ZENTRUM*
>>> *F?R ARCH?OLOGIE*
>>>
>>> *Dr. Ivan CALANDRA*
>>> **Head of IMPALA (IMaging Platform At LeizA)
>>>
>>> *MONREPOS* Archaeological Research Centre, Schloss Monrepos
>>> 56567 Neuwied, Germany
>>>
>>> T: +49 2631 9772 243
>>> T: +49 6131 8885 543
>>> ivan.calandra at leiza.de
>>>
>>> leiza.de <http://www.leiza.de/>
>>> <http://www.leiza.de/>
>>> ORCID <https://orcid.org/0000-0003-3816-6359>
>>> ResearchGate
>>> <https://www.researchgate.net/profile/Ivan_Calandra>
>>>
>>> LEIZA is a foundation under public law of the State of 
>>> Rhineland-Palatinate and the City of Mainz. Its headquarters are in
>>> Mainz. Supervision is carried out by the Ministry of Science and 
>>> Health of the State of Rhineland-Palatinate. LEIZA is a research 
>>> museum of the Leibniz Association.
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> Hello,
>>
>> You can define a vector of the columns of interest and subset the data 
>> with it. Then the default na.action = na.omit will no longer remove 
>> the rows with NA vals in at least one column and the results are the 
>> same.
>>
>> However, this will not give the mean values of the other numeric 
>> columns, just of those two.
>>
>>
>>
>> # define a vector of columns of interest
>> cols <- c("Length", "Width", "RAWMAT")
>>
>> # 1) Simple aggregation with 2 variables, select cols:
>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN = 
>> mean, na.rm = TRUE)
>>
>> # 2) Using the dot notation - if cols are selected, equal results:
>> aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
>>
>> # 3) Using dplyr, the results are now the same results as #1 and #2:
>> my_data %>%
>> ? select(all_of(cols)) %>%
>> ? group_by(RAWMAT) %>%
>> ? summarise(across(c("Length", "Width"), ~ mean(.x,
na.rm = TRUE)))
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>Hello,

Puzzling at first yes, unexpected no, it's documented behavior.

This is the result of how the aggregate works, by first applying 
na.action to the data and only then applying the function to each column 
in the formula's LHS.

dplyr works column by column, so there is no na.action involved.

Hope this helps,

Rui Barradas

Achim Zeileis

2023-Sep-04 22:27 UTC

head link

[R] aggregate formula - differing results

On Mon, 4 Sep 2023, Ivan Calandra wrote:
> Thanks Rui for your help; that would be one possibility indeed.
>
> But am I the only one who finds that behavior of aggregate() completely 
> unexpected and confusing? Especially considering that dplyr::summarise()
and
> doBy::summaryBy() deal with NAs differently, even though they all use 
> mean(na.rm = TRUE) to calculate the group stats.
I agree with Rui that this behaves as documented but I also agree with 
Ivan that the behavior is potentially confusing. Not so much because other 
packages behave differently but mostly because the handling of missing 
values differs between the different aggregate() methods.

Based on my teaching experience, I feel that a default of 
na.action=na.pass would be less confusing, especially in the case with 
multivariate "response".

In the univeriate case the discrepancy can be surprising - in the default 
method you need na.rm=TRUE but in the formula method you get the same 
result without additional arguments (due to na.action=na.omit). But in the 
multivariate case the discrepancy is not obvious, especially for 
beginners, because the results in other variables without NAs are affected 
as well.

A minimal toy example is the following data with two groups (x = A vs. B) 
and two "responses" (y without NAs and z with NA):

d <- data.frame(x = c("A", "A", "B",
"B"), y = 1:4, z = c(1:3, NA))
d
##   x y  z
## 1 A 1  1
## 2 A 2  2
## 3 B 3  3
## 4 B 4 NA

Except for naming of the columns, both of the following summaries for y by 
x (without NAs) yield the same result:

aggregate(d$y, list(d$x), FUN = mean)
aggregate(y ~ x, data = d, FUN = mean)
##   x   y
## 1 A 1.5
## 2 B 3.5

For a single variable _with_ NAs, the default method needs the na.rm = 
TRUE argument, the fomula method does not. Again, except for naming of the 
columns:

aggregate(d$z, list(d$x), FUN = mean, na.rm = TRUE)
aggregate(z ~ x, data = d, FUN = mean)
##   x   z
## 1 A 1.5
## 2 B 3.0

Conversely, if you do want the NAs in the groups, the following two are 
the same (except for naming):

aggregate(d$z, list(d$x), FUN = mean)
aggregate(z ~ x, data = d, FUN = mean, na.action = na.pass)
##   x   z
## 1 A 1.5
## 2 B  NA

But in the multivariate case, it is not so obvious why the following two 
commands differ in their results for y (!), the variable without NAs, in 
group B:

aggregate(d[, c("y", "z")], list(d$x), FUN = mean, na.rm =
TRUE)
##   Group.1   y   z
## 1       A 1.5 1.5
## 2       B 3.5 3.0
              ^^^

aggregate(cbind(y, z) ~ x, data = d, FUN = mean)
##   x   y   z
## 1 A 1.5 1.5
## 2 B 3.0 3.0
        ^^^

Hence, in my programming courses I tell students to use na.action=na.pass 
in the formula method and to handle NAs in the FUN argument themselves.

I guess that this is not important enough to change the default in 
aggregate.formula. Or are there R core members who also find that this 
inconsistency between the different methods is worth addressing?

If not, maybe an explicit example could be added on the help page? Showing 
something like this might help:

## default: omit enitre row 4 where z=NA
aggregate(cbind(y, z) ~ x, data = d, FUN = mean)
##   x   y   z
## 1 A 1.5 1.5
## 2 B 3.0 3.0

## alternatively: omit row 4 only for z result but not for y result
aggregate(cbind(y, z) ~ x, data = d, FUN = mean, na.action = na.pass, na.rm =
TRUE)
##   x   y   z
## 1 A 1.5 1.5
## 2 B 3.5 3.0

Best wishes,
Achim
> On 04/09/2023 13:46, Rui Barradas wrote:
>> ?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
>>> Dear useRs,
>>> 
>>> I have just stumbled across a behavior in aggregate() that I cannot
>>> explain. Any help would be appreciated!
>>> 
>>> Sample data:
>>> my_data <- structure(list(ID = c("FLINT-1",
"FLINT-10", "FLINT-100",
>>> "FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102", "HORN-103",
>>> "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79,
130.1, 140.41, 121.37,
>>> 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46,
>>> 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2), Length
=
>>> c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075,
21.337,
>>> 35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427,
20.683,
>>> 62.783, 26.417, 35.297), PLATWIDTH = c(38.84, NA, 15.33, 30.37,
11.44,
>>> 14.88, 13.86, NA, NA, 26.71), PLATTHICK = c(8.67, NA, 7.99, 11.69,
3.3,
>>> 16.52, 4.58, NA, NA, 9.35), EPA = c(78, NA, 78, 54, 72, 49, 56, NA,
NA,
>>> 56), THICKNESS = c(10.97, NA, 9.36, 6.4, 5.89, 11.05, 4.9, NA, NA,
10.08),
>>> WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, 29.5, 4.5, NA, NA, 23),
RAWMAT =
>>> c("FLINT", "FLINT", "FLINT",
"FLINT", "FLINT", "HORNFELS",
"HORNFELS",
>>> "HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L,
>>> 111L, 112L, 113L, 114L, 115L), class = "data.frame")
>>> 
>>> 1) Simple aggregation with 2 variables:
>>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN =
mean, na.rm
>>> = TRUE)
>>> 
>>> 2) Using the dot notation - different results:
>>> aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
>>> 
>>> 3) Using dplyr, I get the same results as #1:
>>> group_by(my_data, RAWMAT) %>%
>>> ?? summarise(across(c("Length", "Width"), ~
mean(.x, na.rm = TRUE)))
>>> 
>>> 4) It gets weirder: using all columns in #1 give the same results
as in #2
>>> but different from #1 and #3
>>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH, 
>>> PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN =
mean,
>>> na.rm = TRUE)
>>> 
>>> So it seems it is not only due to the notation (cbind() vs. dot).
Is it a
>>> bug? A peculiar thing in my dataset? I tend to think this could be
due to
>>> some variables (or their names) as all notations seem to agree when
I
>>> remove some variables (although I haven't found out which
variable(s) is
>>> (are) at fault), e.g.:
>>> 
>>> my_data2 <- structure(list(ID = c("FLINT-1",
"FLINT-10", "FLINT-100",
>>> "FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102", "HORN-103",
>>> "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79,
130.1, 140.41, 121.37,
>>> 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46,
>>> 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2), Length
=
>>> c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075,
21.337,
>>> 35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427,
20.683,
>>> 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
"FLINT", "FLINT", "FLINT",
>>> "FLINT", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS", "HORNFELS")),
>>> row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L),
class =
>>> "data.frame")
>>> 
>>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT,
data =
>>> my_data2, FUN = mean, na.rm = TRUE)
>>> 
>>> aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm =
TRUE)
>>> 
>>> group_by(my_data2, RAWMAT) %>%
>>> ?? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
>>> 
>>> 
>>> Thank you in advance for any hint.
>>> Best wishes,
>>> Ivan
>>> 
>>> 
>>> 
>>> 
>>> ?????*LEIBNIZ-ZENTRUM*
>>> *F?R ARCH?OLOGIE*
>>> 
>>> *Dr. Ivan CALANDRA*
>>> **Head of IMPALA (IMaging Platform At LeizA)
>>> 
>>> *MONREPOS* Archaeological Research Centre, Schloss Monrepos
>>> 56567 Neuwied, Germany
>>> 
>>> T: +49 2631 9772 243
>>> T: +49 6131 8885 543
>>> ivan.calandra at leiza.de
>>> 
>>> leiza.de <http://www.leiza.de/>
>>> <http://www.leiza.de/>
>>> ORCID <https://orcid.org/0000-0003-3816-6359>
>>> ResearchGate
>>> <https://www.researchgate.net/profile/Ivan_Calandra>
>>> 
>>> LEIZA is a foundation under public law of the State of 
>>> Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz.
>>> Supervision is carried out by the Ministry of Science and Health of
the
>>> State of Rhineland-Palatinate. LEIZA is a research museum of the
Leibniz
>>> Association.
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> Hello,
>> 
>> You can define a vector of the columns of interest and subset the data
with
>> it. Then the default na.action = na.omit will no longer remove the rows
>> with NA vals in at least one column and the results are the same.
>> 
>> However, this will not give the mean values of the other numeric
columns,
>> just of those two.
>> 
>> 
>> 
>> # define a vector of columns of interest
>> cols <- c("Length", "Width", "RAWMAT")
>> 
>> # 1) Simple aggregation with 2 variables, select cols:
>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN =
mean,
>> na.rm = TRUE)
>> 
>> # 2) Using the dot notation - if cols are selected, equal results:
>> aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
>> 
>> # 3) Using dplyr, the results are now the same results as #1 and #2:
>> my_data %>%
>> ? select(all_of(cols)) %>%
>> ? group_by(RAWMAT) %>%
>> ? summarise(across(c("Length", "Width"), ~ mean(.x,
na.rm = TRUE)))
>> 
>> 
>> Hope this helps,
>> 
>> Rui Barradas
>> 
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

R help - Sep 2023 - aggregate formula - differing results

[R] aggregate formula - differing results

[R] aggregate formula - differing results

[R] aggregate formula - differing results