Meyners, Michael
2012-Sep-04 11:24 UTC
[R] unexpected (?) behavior of sort=TRUE in merge function
All, I realize from the archive that the sort argument in merge has been subject to discussion before, though I couldn't find an explanation for this behavior. I tried to simplify this to (kind of) minimal code from a real example to the following (and I have no doubts that there are smart people around achieving the same with smarter code :-)). I'm running R 2.15.1 64bit under MS Windows 7, full session info below. I do have a list with two dataframes: test <- list(structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 11, 11, 11, 11, 11, 11), x = c(5.85714285714286, 5.9, 7.3, 5.85714285714286, 7.27272727272727, 4.375, 3.875, 2.5, 4.8, 3.625, 6.25, 4.71428571428571, 3.53571428571429, 4.63888888888889, 4.42424242424242, 4.78260869565217, 4.875, 3.80434782608696, 5.73170731707317, 5.41935483870968, 5.78125, 6.30188679245283, 6.87755102040816, 5.56603773584906)), .Names = c("product", "cong", "x"), row.names = c(NA, -24L), class = "data.frame"), structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0), x = c(3.04347826086957, 4.18181818181818, 3.75, 4.31578947368421, 4.5, 3.73913043478261, 4.8876404494382, 5.20792079207921, 5.68, 5.70526315789474, 6.38636363636364, 4.96703296703297)), .Names = c("product", "cong", "x"), row.names = c(NA, -12L), class = "data.frame")) The dataframes are pretty much the same but for the values in the x-column and the fact that the second one has only half as many observations, missing the second half of the expand.grid if you like. Now if I run lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE)) # sort=TRUE is the default, so could be omitted sorts the first dataframe according to the labels of factor "product", while for the second one the order is maintained from the first dataframes (x) to merge (which is the difference that I could not find being documented). Now I run the same code with sort=FALSE instead: lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=FALSE)) The results are at least consistent and fulfill my needs (this is, btw, not unexpected from the documentation). Note that I get exactly the same behavior if I apply merge subsequently to test[[1]] and test[[2]], so it is not an issue from lapply. (I realize that my dataframes are ordered by levels of product, but using test[[2]] <- test[[2]][sample(12),] and applying the same code as above reveals that indeed no sorting is done but the order is maintained from the first dataframe.) I have a working solution for myself, so I'm not after any advice on how to achieve the sorting -- I'd just like to better understand what's going on here and/or what I might have missed in the documentation or in the list archives. Thanks in advance, Michael Session info: R version 2.15.1 (2012-06-22) Platform: x86_64-pc-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.15.1
Rui Barradas
2012-Sep-04 12:00 UTC
[R] unexpected (?) behavior of sort=TRUE in merge function
Hello, Inline. Em 04-09-2012 12:24, Meyners, Michael escreveu:> All, > > I realize from the archive that the sort argument in merge has been subject to discussion before, though I couldn't find an explanation for this behavior. I tried to simplify this to (kind of) minimal code from a real example to the following (and I have no doubts that there are smart people around achieving the same with smarter code :-)). I'm running R 2.15.1 64bit under MS Windows 7, full session info below. > > I do have a list with two dataframes: > > test <- list(structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, > 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, > 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class = "factor"), > cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, > 1, 1, 1, 1, 11, 11, 11, 11, 11, 11), x = c(5.85714285714286, > 5.9, 7.3, 5.85714285714286, 7.27272727272727, 4.375, 3.875, > 2.5, 4.8, 3.625, 6.25, 4.71428571428571, 3.53571428571429, > 4.63888888888889, 4.42424242424242, 4.78260869565217, 4.875, > 3.80434782608696, 5.73170731707317, 5.41935483870968, 5.78125, > 6.30188679245283, 6.87755102040816, 5.56603773584906)), .Names = c("product", > "cong", "x"), row.names = c(NA, -24L), class = "data.frame"), > structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, > 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", > "F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1, > -1, -1, 0, 0, 0, 0, 0, 0), x = c(3.04347826086957, 4.18181818181818, > 3.75, 4.31578947368421, 4.5, 3.73913043478261, 4.8876404494382, > 5.20792079207921, 5.68, 5.70526315789474, 6.38636363636364, > 4.96703296703297)), .Names = c("product", "cong", "x"), row.names = c(NA, > -12L), class = "data.frame")) > > > The dataframes are pretty much the same but for the values in the x-column and the fact that the second one has only half as many observations, missing the second half of the expand.grid if you like. Now if I run > > lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE)) # sort=TRUE is the default, so could be omitted > > sorts the first dataframe according to the labels of factor "product"No, it doesn't. It sorts according to the columns, i.e., the values, not according to the labels. The help page clearly states that the argument 'sort' is "logical. Should the results be sorted on the |by| columns?" And "Y1" is coded as 1, "Y2" as 2, etc. The output is right. Try the following. test2 <- test test2[[1]]$product <- as.character(test[[1]]$product) test2[[2]]$product <- as.character(test[[2]]$product) # To make it more readable. grd <- expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)) lapply(test2, function(x) merge(x, grd, all=T, sort=TRUE)) And now 'product' sorts from "F" to "Y2", even if grd$product is still a factor with the same coding as in 'test'. Hope this helps, Rui Barradas> , while for the second one the order is maintained from the first dataframes (x) to merge (which is the difference that I could not find being documented). Now I run the same code with sort=FALSE instead: > > lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=FALSE)) > > The results are at least consistent and fulfill my needs (this is, btw, not unexpected from the documentation). Note that I get exactly the same behavior if I apply merge subsequently to test[[1]] and test[[2]], so it is not an issue from lapply. (I realize that my dataframes are ordered by levels of product, but using test[[2]] <- test[[2]][sample(12),] and applying the same code as above reveals that indeed no sorting is done but the order is maintained from the first dataframe.) > > I have a working solution for myself, so I'm not after any advice on how to achieve the sorting -- I'd just like to better understand what's going on here and/or what I might have missed in the documentation or in the list archives. > > Thanks in advance, > Michael > > > > Session info: > R version 2.15.1 (2012-06-22) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > loaded via a namespace (and not attached): > [1] tools_2.15.1 > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.[[alternative HTML version deleted]]
Hi, Try this: convert.type1 <- function(obj,types){ ??? for (i in 1:length(obj)){ ??????? FUN <- switch(types[i],character = as.character, ?????????????????????????????????? numeric = as.numeric, ?????????????????????????????????? factor = as.factor) ??????? obj[,i] <- FUN(obj[,i]) ??? } ??? obj } test1<-test ?test1[[1]]<-convert.type1(test1[[1]],c("character","numeric","numeric")) ?test1[[2]]<-convert.type1(test1[[2]],c("character","numeric","numeric")) lapply(test1, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE))?? ------ ------ [[2]] ?? product cong??????? x 1??????? F?? -1 4.315789 2??????? F??? 0 5.705263 3??????? F??? 1?????? NA 4??????? F?? 11?????? NA 5??????? G?? -1 3.750000 6??????? G??? 0 5.680000 7??????? G??? 1?????? NA 8??????? G?? 11?????? NA 9??????? K?? -1 3.739130 10?????? K??? 0 4.967033 11?????? K??? 1?????? NA 12?????? K?? 11?????? NA 13?????? L?? -1 4.500000 14?????? L??? 0 6.386364 15?????? L??? 1?????? NA 16?????? L?? 11?????? NA 17????? Y1?? -1 3.043478 18????? Y1??? 0 4.887640 19????? Y1??? 1?????? NA 20????? Y1?? 11?????? NA 21????? Y2?? -1 4.181818 22????? Y2??? 0 5.207921 23????? Y2??? 1?????? NA 24????? Y2?? 11?????? NA A.K. ----- Original Message ----- From: "Meyners, Michael" <meyners.m at pg.com> To: "r-help at r-project.org" <r-help at r-project.org> Cc: Sent: Tuesday, September 4, 2012 7:24 AM Subject: [R] unexpected (?) behavior of sort=TRUE in merge function All, I realize from the archive that the sort argument in merge has been subject to discussion before, though I couldn't find an explanation for this behavior. I tried to simplify this to (kind of) minimal code from a real example to the following (and I have no doubts that there are smart people around achieving the same with smarter code :-)). I'm running R 2.15.1 64bit under MS Windows 7, full session info below. ??? I do have a list with two dataframes: test <- list(structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class = "factor"), ? ? cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, ? ? 1, 1, 1, 1, 11, 11, 11, 11, 11, 11), x = c(5.85714285714286, ? ? 5.9, 7.3, 5.85714285714286, 7.27272727272727, 4.375, 3.875, ? ? 2.5, 4.8, 3.625, 6.25, 4.71428571428571, 3.53571428571429, ? ? 4.63888888888889, 4.42424242424242, 4.78260869565217, 4.875, ? ? 3.80434782608696, 5.73170731707317, 5.41935483870968, 5.78125, ? ? 6.30188679245283, 6.87755102040816, 5.56603773584906)), .Names = c("product", "cong", "x"), row.names = c(NA, -24L), class = "data.frame"), ? ? structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, ? ? 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", ? ? "F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1, ? ? -1, -1, 0, 0, 0, 0, 0, 0), x = c(3.04347826086957, 4.18181818181818, ? ? 3.75, 4.31578947368421, 4.5, 3.73913043478261, 4.8876404494382, ? ? 5.20792079207921, 5.68, 5.70526315789474, 6.38636363636364, ? ? 4.96703296703297)), .Names = c("product", "cong", "x"), row.names = c(NA, ? ? -12L), class = "data.frame")) The dataframes are pretty much the same but for the values in the x-column and the fact that the second one has only half as many observations, missing the second half of the expand.grid if you like. Now if I run lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE))??? ? # sort=TRUE is the default, so could be omitted sorts the first dataframe according to the labels of factor "product", while for the second one the order is maintained from the first dataframes (x) to merge (which is the difference that I could not find being documented). Now I run the same code with sort=FALSE instead: lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=FALSE)) The results are at least consistent and fulfill my needs (this is, btw, not unexpected from the documentation). Note that I get exactly the same behavior if I apply merge subsequently to test[[1]] and test[[2]], so it is not an issue from lapply. (I realize that my dataframes are ordered by levels of product, but using test[[2]] <- test[[2]][sample(12),] and applying the same code as above reveals that indeed no sorting is done but the order is maintained from the first dataframe.) I have a working solution for myself, so I'm not after any advice on how to achieve the sorting -- I'd just like to better understand what's going on here and/or what I might have missed in the documentation or in the list archives. Thanks in advance, Michael Session info: R version 2.15.1 (2012-06-22) Platform: x86_64-pc-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=German_Germany.1252? LC_CTYPE=German_Germany.1252? ? LC_MONETARY=German_Germany.1252 LC_NUMERIC=C? ? ? ? ? ? ? ? ? ? LC_TIME=German_Germany.1252? ? attached base packages: [1] stats? ? graphics? grDevices utils? ? datasets? methods? base? ? loaded via a namespace (and not attached): [1] tools_2.15.1 ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Meyners, Michael
2012-Sep-04 14:17 UTC
[R] unexpected (?) behavior of sort=TRUE in merge function
Rui, Thanks for looking into this. I apologize, I should've added my output, maybe it looks differently on my machine than on others. I also should have made my question more explicit: I'm not looking for a solution to get the sorting one way or another, I have that already. I rather want to understand why the same code behaves differently on two very similar datasets (one just having less rows, see below). The first call gives the following for me:> lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE))[[1]] product cong x 1 F -1 5.857143 2 F 0 3.625000 3 F 1 4.782609 4 F 11 6.301887 5 G -1 7.300000 6 G 0 4.800000 7 G 1 4.424242 8 G 11 5.781250 9 K -1 4.375000 10 K 0 4.714286 11 K 1 3.804348 12 K 11 5.566038 13 L -1 7.272727 14 L 0 6.250000 15 L 1 4.875000 16 L 11 6.877551 17 Y1 -1 5.857143 18 Y1 0 3.875000 19 Y1 1 3.535714 20 Y1 11 5.731707 21 Y2 -1 5.900000 22 Y2 0 2.500000 23 Y2 1 4.638889 24 Y2 11 5.419355 [[2]] product cong x 1 Y1 -1 3.043478 2 Y1 0 4.887640 3 Y1 1 NA 4 Y1 11 NA 5 Y2 -1 4.181818 6 Y2 0 5.207921 7 Y2 1 NA 8 Y2 11 NA 9 G -1 3.750000 10 G 0 5.680000 11 G 1 NA 12 G 11 NA 13 F -1 4.315789 14 F 0 5.705263 15 F 1 NA 16 F 11 NA 17 L -1 4.500000 18 L 0 6.386364 19 L 1 NA 20 L 11 NA 21 K -1 3.739130 22 K 0 4.967033 23 K 1 NA 24 K 11 NA So different from what you may have observed, here the first data set [[1]] is sorted by label of "product", not by value. As you correctly stated, Y1" is coded as 1, "Y2" as 2, etc., but the first rows are for F, followed by G etc. The second [[2]] is sorted by level (value). So I have different behavior on very similar looking data sets, and hence to me at least one of those cannot be "right" according to documentation (but I agree with you that the second is correct according to the help). In my larger example, it seems as if data sets which do not originally have all combinations of product and cong anyway are sorted like [[2]], and those that are complete (all 24 combinations occur) are sorted like [[1]] is, which to me is still "unexpected". Hope this clarifies my question. Any thoughts appreciated. Michael> -----Original Message----- > From: Rui Barradas [mailto:ruipbarradas at sapo.pt] > Sent: Dienstag, 4. September 2012 14:01 > To: Meyners, Michael > Cc: r-help > Subject: Re: [R] unexpected (?) behavior of sort=TRUE in merge function > > Hello, > > Inline. > Em 04-09-2012 12:24, Meyners, Michael escreveu: > > All, > > I realize from the archive that the sort argument in merge has been > subject to discussion before, though I couldn't find an explanation for > this behavior. I tried to simplify this to (kind of) minimal code from > a real example to the following (and I have no doubts that there are > smart people around achieving the same with smarter code :-)). I'm > running R 2.15.1 64bit under MS Windows 7, full session info below. > > I do have a list with two dataframes: > > test <- list(structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, > 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, > 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class > "factor"), > cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, > 1, 1, 1, 1, 11, 11, 11, 11, 11, 11), x = c(5.85714285714286, > 5.9, 7.3, 5.85714285714286, 7.27272727272727, 4.375, 3.875, > 2.5, 4.8, 3.625, 6.25, 4.71428571428571, 3.53571428571429, > 4.63888888888889, 4.42424242424242, 4.78260869565217, 4.875, > 3.80434782608696, 5.73170731707317, 5.41935483870968, 5.78125, > 6.30188679245283, 6.87755102040816, 5.56603773584906)), .Names > c("product", > "cong", "x"), row.names = c(NA, -24L), class = "data.frame"), > structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, > 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", > "F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1, > -1, -1, 0, 0, 0, 0, 0, 0), x = c(3.04347826086957, > 4.18181818181818, > 3.75, 4.31578947368421, 4.5, 3.73913043478261, 4.8876404494382, > 5.20792079207921, 5.68, 5.70526315789474, 6.38636363636364, > 4.96703296703297)), .Names = c("product", "cong", "x"), row.names > c(NA, > -12L), class = "data.frame")) > > > The dataframes are pretty much the same but for the values in the x- > column and the fact that the second one has only half as many > observations, missing the second half of the expand.grid if you like. > Now if I run > > lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", > "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=TRUE)) # > sort=TRUE is the default, so could be omitted > > sorts the first dataframe according to the labels of factor "product" > > No, it doesn't. It sorts according to the columns, i.e., the values, > not according to the labels. > The help page clearly states that the argument 'sort' is "logical. > Should the results be sorted on the by columns?" > > And "Y1" is coded as 1, "Y2" as 2, etc. The output is right. > > Try the following. > > test2 <- test > test2[[1]]$product <- as.character(test[[1]]$product) > test2[[2]]$product <- as.character(test[[2]]$product) > > # To make it more readable. > grd <- expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(- > 1,0,1,11)) > > lapply(test2, function(x) merge(x, grd, all=T, sort=TRUE)) > > And now 'product' sorts from "F" to "Y2", even if grd$product is still > a factor with the same coding as in 'test'. > > Hope this helps, > > Rui Barradas > > , while for the second one the order is maintained from the first > dataframes (x) to merge (which is the difference that I could not find > being documented). Now I run the same code with sort=FALSE instead: > > lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", > "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=T, sort=FALSE)) > > The results are at least consistent and fulfill my needs (this is, btw, > not unexpected from the documentation). Note that I get exactly the > same behavior if I apply merge subsequently to test[[1]] and test[[2]], > so it is not an issue from lapply. (I realize that my dataframes are > ordered by levels of product, but using test[[2]] <- > test[[2]][sample(12),] and applying the same code as above reveals that > indeed no sorting is done but the order is maintained from the first > dataframe.) > > I have a working solution for myself, so I'm not after any advice on > how to achieve the sorting -- I'd just like to better understand what's > going on here and/or what I might have missed in the documentation or > in the list archives. > > Thanks in advance, > Michael > > > > Session info: > R version 2.15.1 (2012-06-22) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 > LC_MONETARY=German_Germany.1252 LC_NUMERIC=C > LC_TIME=German_Germany.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > loaded via a namespace (and not attached): > [1] tools_2.15.1 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html > and provide commented, minimal, self-contained, reproducible code.