Haha, got it now, there is an na.action argument (which defaults to
na.omit) to aggregate() which is applied before calling mean(na.rm =
TRUE). Thank you Rui for pointing this out.
So running it with na.pass instead of na.omit gives the same results as
dplyr::group_by()+summarise():
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE,
na.action = na.pass)
Cheers,
Ivan
On 04/09/2023 13:56, Rui Barradas wrote:> ?s 12:51 de 04/09/2023, Ivan Calandra escreveu:
>> Thanks Rui for your help; that would be one possibility indeed.
>>
>> But am I the only one who finds that behavior of aggregate()
>> completely unexpected and confusing? Especially considering that
>> dplyr::summarise() and doBy::summaryBy() deal with NAs differently,
>> even though they all use mean(na.rm = TRUE) to calculate the group
>> stats.
>>
>> Best wishes,
>> Ivan
>>
>> On 04/09/2023 13:46, Rui Barradas wrote:
>>> ?s 10:44 de 04/09/2023, Ivan Calandra escreveu:
>>>> Dear useRs,
>>>>
>>>> I have just stumbled across a behavior in aggregate() that I
cannot
>>>> explain. Any help would be appreciated!
>>>>
>>>> Sample data:
>>>> my_data <- structure(list(ID = c("FLINT-1",
"FLINT-10",
>>>> "FLINT-100", "FLINT-101",
"FLINT-102", "HORN-10", "HORN-100",
>>>> "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77,
>>>> 142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
>>>> SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47,
>>>> 1169.26, 444.61, 1791.48, 461.15, 1127.2), Length = c(44.384,
>>>> 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
>>>> 35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149,
33.427,
>>>> 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84, NA,
15.33,
>>>> 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK = c(8.67,
NA,
>>>> 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78, NA,
78,
>>>> 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36,
6.4,
>>>> 5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5,
18.6,
>>>> 14.9, 29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT",
"FLINT", "FLINT",
>>>> "FLINT", "FLINT", "HORNFELS",
"HORNFELS", "HORNFELS", "HORNFELS",
>>>> "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L,
112L, 113L,
>>>> 114L, 115L), class = "data.frame")
>>>>
>>>> 1) Simple aggregation with 2 variables:
>>>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN =
>>>> mean, na.rm = TRUE)
>>>>
>>>> 2) Using the dot notation - different results:
>>>> aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm =
TRUE)
>>>>
>>>> 3) Using dplyr, I get the same results as #1:
>>>> group_by(my_data, RAWMAT) %>%
>>>> ?? summarise(across(c("Length", "Width"), ~
mean(.x, na.rm = TRUE)))
>>>>
>>>> 4) It gets weirder: using all columns in #1 give the same
results
>>>> as in #2 but different from #1 and #3
>>>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width,
PLATWIDTH,
>>>> PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data,
FUN =
>>>> mean, na.rm = TRUE)
>>>>
>>>> So it seems it is not only due to the notation (cbind() vs.
dot).
>>>> Is it a bug? A peculiar thing in my dataset? I tend to think
this
>>>> could be due to some variables (or their names) as all
notations
>>>> seem to agree when I remove some variables (although I
haven't
>>>> found out which variable(s) is (are) at fault), e.g.:
>>>>
>>>> my_data2 <- structure(list(ID = c("FLINT-1",
"FLINT-10",
>>>> "FLINT-100", "FLINT-101",
"FLINT-102", "HORN-10", "HORN-100",
>>>> "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77,
>>>> 142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
>>>> SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47,
>>>> 1169.26, 444.61, 1791.48, 461.15, 1127.2), Length = c(44.384,
>>>> 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
>>>> 35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149,
33.427,
>>>> 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
"FLINT",
>>>> "FLINT", "FLINT", "FLINT",
"HORNFELS", "HORNFELS", "HORNFELS",
>>>> "HORNFELS", "HORNFELS")), row.names = c(1L,
2L, 3L, 4L, 5L, 111L,
>>>> 112L, 113L, 114L, 115L), class = "data.frame")
>>>>
>>>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~
RAWMAT,
>>>> data = my_data2, FUN = mean, na.rm = TRUE)
>>>>
>>>> aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm =
TRUE)
>>>>
>>>> group_by(my_data2, RAWMAT) %>%
>>>> ?? summarise(across(where(is.numeric), ~ mean(.x, na.rm =
TRUE)))
>>>>
>>>>
>>>> Thank you in advance for any hint.
>>>> Best wishes,
>>>> Ivan
>>>>
>>>>
>>>>
>>>>
>>>> ?????*LEIBNIZ-ZENTRUM*
>>>> *F?R ARCH?OLOGIE*
>>>>
>>>> *Dr. Ivan CALANDRA*
>>>> **Head of IMPALA (IMaging Platform At LeizA)
>>>>
>>>> *MONREPOS* Archaeological Research Centre, Schloss Monrepos
>>>> 56567 Neuwied, Germany
>>>>
>>>> T: +49 2631 9772 243
>>>> T: +49 6131 8885 543
>>>> ivan.calandra at leiza.de
>>>>
>>>> leiza.de <http://www.leiza.de/>
>>>> <http://www.leiza.de/>
>>>> ORCID <https://orcid.org/0000-0003-3816-6359>
>>>> ResearchGate
>>>> <https://www.researchgate.net/profile/Ivan_Calandra>
>>>>
>>>> LEIZA is a foundation under public law of the State of
>>>> Rhineland-Palatinate and the City of Mainz. Its headquarters
are in
>>>> Mainz. Supervision is carried out by the Ministry of Science
and
>>>> Health of the State of Rhineland-Palatinate. LEIZA is a
research
>>>> museum of the Leibniz Association.
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible
code.
>>> Hello,
>>>
>>> You can define a vector of the columns of interest and subset the
>>> data with it. Then the default na.action = na.omit will no longer
>>> remove the rows with NA vals in at least one column and the results
>>> are the same.
>>>
>>> However, this will not give the mean values of the other numeric
>>> columns, just of those two.
>>>
>>>
>>>
>>> # define a vector of columns of interest
>>> cols <- c("Length", "Width",
"RAWMAT")
>>>
>>> # 1) Simple aggregation with 2 variables, select cols:
>>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN
=
>>> mean, na.rm = TRUE)
>>>
>>> # 2) Using the dot notation - if cols are selected, equal results:
>>> aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm =
TRUE)
>>>
>>> # 3) Using dplyr, the results are now the same results as #1 and
#2:
>>> my_data %>%
>>> ? select(all_of(cols)) %>%
>>> ? group_by(RAWMAT) %>%
>>> ? summarise(across(c("Length", "Width"), ~
mean(.x, na.rm = TRUE)))
>>>
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>>
> Hello,
>
> Puzzling at first yes, unexpected no, it's documented behavior.
>
> This is the result of how the aggregate works, by first applying
> na.action to the data and only then applying the function to each
> column in the formula's LHS.
>
> dplyr works column by column, so there is no na.action involved.
>
> Hope this helps,
>
> Rui Barradas
>
>
>