Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10",
"FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84,
NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK =
c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78,
NA, 78, 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9,
29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT",
"FLINT", "FLINT",
"FLINT", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
"data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm =
TRUE)))
4) It gets weirder: using all columns in #1 give the same results as in
#2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is it
a bug? A peculiar thing in my dataset? I tend to think this could be due
to some variables (or their names) as all notations seem to agree when I
remove some variables (although I haven't found out which variable(s) is
(are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10",
"FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
"FLINT", "FLINT", "FLINT", "FLINT",
"HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L,
111L, 112L,
113L, 114L, 115L), class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data =
my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and Health
of the State of Rhineland-Palatinate. LEIZA is a research museum of the
Leibniz Association.
It seems that the issue are the missings. If in #1 you use the dataset
na.omit(my_data) instead of my_data, you get the same output that in #2 and in
#4, where all observations with missing data are removed since you are including
all the variables.
The second dataset has no issue since it has no missing data.
Iago
________________________________
De: R-help <r-help-bounces at r-project.org> de part de Ivan Calandra
<ivan.calandra at leiza.de>
Enviat el: dilluns, 4 de setembre de 2023 11:44
Per a: R-help <r-help at r-project.org>
Tema: [R] aggregate formula - differing results
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot
explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(ID = c("FLINT-1", "FLINT-10",
"FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84,
NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK c(8.67, NA,
7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78,
NA, 78, 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9,
29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT",
"FLINT", "FLINT",
"FLINT", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS", "HORNFELS")),
row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class
"data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
3) Using dplyr, I get the same results as #1:
group_by(my_data, RAWMAT) %>%
summarise(across(c("Length", "Width"), ~ mean(.x, na.rm =
TRUE)))
4) It gets weirder: using all columns in #1 give the same results as in
#2 but different from #1 and #3
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean,
na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is it
a bug? A peculiar thing in my dataset? I tend to think this could be due
to some variables (or their names) as all notations seem to agree when I
remove some variables (although I haven't found out which variable(s) is
(are) at fault), e.g.:
my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10",
"FLINT-100",
"FLINT-101", "FLINT-102", "HORN-10",
"HORN-100", "HORN-102", "HORN-103",
"HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
25.149, 33.427, 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
"FLINT", "FLINT", "FLINT", "FLINT",
"HORNFELS", "HORNFELS", "HORNFELS",
"HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L,
111L, 112L,
113L, 114L, 115L), class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data my_data2,
FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
group_by(my_data2, RAWMAT) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Thank you in advance for any hint.
Best wishes,
Ivan
*LEIBNIZ-ZENTRUM*
*F?R ARCH?OLOGIE*
*Dr. Ivan CALANDRA*
**Head of IMPALA (IMaging Platform At LeizA)
*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany
T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calandra at leiza.de
leiza.de <http://www.leiza.de/>
<http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate
<https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of
Rhineland-Palatinate and the City of Mainz. Its headquarters are in
Mainz. Supervision is carried out by the Ministry of Science and Health
of the State of Rhineland-Palatinate. LEIZA is a research museum of the
Leibniz Association.
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
?s 10:44 de 04/09/2023, Ivan Calandra escreveu:> Dear useRs, > > I have just stumbled across a behavior in aggregate() that I cannot > explain. Any help would be appreciated! > > Sample data: > my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100", > "FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103", > "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41, > 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83, > 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2), > Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, > 32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42, > 25.149, 33.427, 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84, > NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK = > c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78, > NA, 78, 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4, > 5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, > 29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT", > "FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")), > row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class = > "data.frame") > > 1) Simple aggregation with 2 variables: > aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean, > na.rm = TRUE) > > 2) Using the dot notation - different results: > aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE) > > 3) Using dplyr, I get the same results as #1: > group_by(my_data, RAWMAT) %>% > ? summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE))) > > 4) It gets weirder: using all columns in #1 give the same results as in > #2 but different from #1 and #3 > aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH, > PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean, > na.rm = TRUE) > > So it seems it is not only due to the notation (cbind() vs. dot). Is it > a bug? A peculiar thing in my dataset? I tend to think this could be due > to some variables (or their names) as all notations seem to agree when I > remove some variables (although I haven't found out which variable(s) is > (are) at fault), e.g.: > > my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100", > "FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103", > "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41, > 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83, > 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2), > Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, > 32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42, > 25.149, 33.427, 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT", > "FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS", > "HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, > 113L, 114L, 115L), class = "data.frame") > > aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data = > my_data2, FUN = mean, na.rm = TRUE) > > aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE) > > group_by(my_data2, RAWMAT) %>% > ? summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) > > > Thank you in advance for any hint. > Best wishes, > Ivan > > > > > ????*LEIBNIZ-ZENTRUM* > *F?R ARCH?OLOGIE* > > *Dr. Ivan CALANDRA* > **Head of IMPALA (IMaging Platform At LeizA) > > *MONREPOS* Archaeological Research Centre, Schloss Monrepos > 56567 Neuwied, Germany > > T: +49 2631 9772 243 > T: +49 6131 8885 543 > ivan.calandra at leiza.de > > leiza.de <http://www.leiza.de/> > <http://www.leiza.de/> > ORCID <https://orcid.org/0000-0003-3816-6359> > ResearchGate > <https://www.researchgate.net/profile/Ivan_Calandra> > > LEIZA is a foundation under public law of the State of > Rhineland-Palatinate and the City of Mainz. Its headquarters are in > Mainz. Supervision is carried out by the Ministry of Science and Health > of the State of Rhineland-Palatinate. LEIZA is a research museum of the > Leibniz Association. > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.Hello, You can define a vector of the columns of interest and subset the data with it. Then the default na.action = na.omit will no longer remove the rows with NA vals in at least one column and the results are the same. However, this will not give the mean values of the other numeric columns, just of those two. # define a vector of columns of interest cols <- c("Length", "Width", "RAWMAT") # 1) Simple aggregation with 2 variables, select cols: aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE) # 2) Using the dot notation - if cols are selected, equal results: aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE) # 3) Using dplyr, the results are now the same results as #1 and #2: my_data %>% select(all_of(cols)) %>% group_by(RAWMAT) %>% summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE))) Hope this helps, Rui Barradas