Base R has generic functions called any() and all() that I am having trouble using.

It works fine when I play with it in a base R context as in:

> all(any(TRUE, TRUE), any(TRUE, FALSE))
[1] TRUE
> all(any(TRUE, TRUE), any(FALSE, FALSE))
[1] FALSE

But in a tidyverse/dplyr environment, it returns wrong answers.

Consider this example. I have data I have joined together with pairs of columns representing a first generation and several other pairs representing additional generations. I want to consider any pair where at least one of the pair is not NA as a success. But in order to keep the entire row, I want all three pairs to have some valid data. This seems like a fairly common reasonable thing often needed when evaluating data.

So to make it very general, I chose to do something a bit like this:

    result <- filter(mydata,
                     all(
                       any(!is.na(first.a), !is.na(first.b)),
                       any(!is.na(second.a), !is.na(second.b)),
                       any(!is.na(third.a), !is.na(third.b))))

I apologize if the formatting is not seen properly. The above logically should work. And it should be extendable to scenarios where you want at least one of M columns to contain data as a group with N such groups of any size.

But since it did not work, I tried a plan that did work and feels silly. I used mutate() to make new columns such as:

    result <-
      mydata |>
      mutate(
        usable.1 = (!is.na(first.a) | !is.na(first.b)),
        usable.2 = (!is.na(second.a) | !is.na(second.b)),
        usable.3 = (!is.na(third.a) | !is.na(third.b)),
        usable = (usable.1 & usable.2 & usable.3)
      ) |>
      filter(usable == TRUE)

The above wastes time and effort making new columns so I can check the calculations then uses the combined columns to make a Boolean that can be used to filter the result.

I know this is not the place to discuss dplyr. I want to check first if I am doing anything wrong in how I use any/all. One guess is that the generic is messed with by dplyr or other packages I libraried.

And, of course, some aspects of delayed evaluation can interfere in subtle ways.

I note I have had other problems with these base R functions before and generally solved them by not using them, as shown above. I would much rather use them, or something similar.

Avi
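For illustration, here is a minimal base R sketch of the difference between the two approaches in the post. The vectors `first.a` and `first.b` are made-up stand-ins for two columns, not the poster's data. When given whole vectors, `any()` reduces all of its arguments to a single logical value, while `|` works element by element:

```r
# Hypothetical stand-ins for two columns of a data frame
first.a <- c(1, NA, NA)
first.b <- c(NA, 2, NA)

# any() collapses everything it is given into ONE value
collapsed <- any(!is.na(first.a), !is.na(first.b))
collapsed
# [1] TRUE

# `|` compares element by element and returns one value per row
per_row <- !is.na(first.a) | !is.na(first.b)
per_row
# [1]  TRUE  TRUE FALSE
```

The scalar examples at the top of the post cannot show this difference, because with length-one inputs both forms give the same answer.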
On 12/04/2024 3:52 p.m., avi.e.gross at gmail.com wrote:
> Base R has generic functions called any() and all() that I am having trouble
> using.
>
> It works fine when I play with it in a base R context as in:
>
>> all(any(TRUE, TRUE), any(TRUE, FALSE))
> [1] TRUE
>> all(any(TRUE, TRUE), any(FALSE, FALSE))
> [1] FALSE
>
> But in a tidyverse/dplyr environment, it returns wrong answers.
>
> Consider this example. I have data I have joined together with pairs of
> columns representing a first generation and several other pairs representing
> additional generations. I want to consider any pair where at least one of
> the pair is not NA as a success. But in order to keep the entire row, I want
> all three pairs to have some valid data. This seems like a fairly common
> reasonable thing often needed when evaluating data.
>
> So to make it very general, I chose to do something a bit like this:

We can't really help you without a reproducible example. It's not enough to show us something that doesn't run but is a bit like the real code.

Duncan Murdoch

>
> result <- filter(mydata,
>                  all(
>                    any(!is.na(first.a), !is.na(first.b)),
>                    any(!is.na(second.a), !is.na(second.b)),
>                    any(!is.na(third.a), !is.na(third.b))))
>
> I apologize if the formatting is not seen properly. The above logically
> should work. And it should be extendable to scenarios where you want at
> least one of M columns to contain data as a group with N such groups of any
> size.
>
> But since it did not work, I tried a plan that did work and feels silly. I
> used mutate() to make new columns such as:
>
> result <-
>   mydata |>
>   mutate(
>     usable.1 = (!is.na(first.a) | !is.na(first.b)),
>     usable.2 = (!is.na(second.a) | !is.na(second.b)),
>     usable.3 = (!is.na(third.a) | !is.na(third.b)),
>     usable = (usable.1 & usable.2 & usable.3)
>   ) |>
>   filter(usable == TRUE)
>
> The above wastes time and effort making new columns so I can check the
> calculations then uses the combined columns to make a Boolean that can be
> used to filter the result.
>
> I know this is not the place to discuss dplyr. I want to check first if I am
> doing anything wrong in how I use any/all. One guess is that the generic is
> messed with by dplyr or other packages I libraried.
>
> And, of course, some aspects of delayed evaluation can interfere in subtle
> ways.
>
> I note I have had other problems with these base R functions before and
> generally solved them by not using them, as shown above. I would much rather
> use them, or something similar.
>
> Avi
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
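A reproducible example along the lines requested might look like the sketch below. The tibble `mydata` and its values are invented for illustration (only the column names come from the original post), and the dplyr package is assumed to be installed. Because the nested `all()`/`any()` condition collapses to a single `TRUE`, which `filter()` recycles to every row, that version keeps every row, including the third one, whose first pair is entirely `NA`:

```r
library(dplyr)

# Invented data: row 3 has no usable value in the first pair
mydata <- tibble(
  first.a  = c(1, NA, NA),
  first.b  = c(NA, 2, NA),
  second.a = c(1, 1, 1),
  second.b = c(NA, 2, 2),
  third.a  = c(NA, 3, 3),
  third.b  = c(3, NA, NA)
)

# all()/any() reduce whole columns to one value, so this is all or nothing
collapsed <- filter(mydata, all(
  any(!is.na(first.a), !is.na(first.b)),
  any(!is.na(second.a), !is.na(second.b)),
  any(!is.na(third.a), !is.na(third.b))
))
nrow(collapsed)
# [1] 3

# The element-wise operators evaluate each row separately
per_row <- filter(
  mydata,
  (!is.na(first.a) | !is.na(first.b)) &
    (!is.na(second.a) | !is.na(second.b)) &
    (!is.na(third.a) | !is.na(third.b))
)
nrow(per_row)
# [1] 2
```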
Hi Avi,

As Dénes Tóth has rightly diagnosed, you are building an "all or nothing" filter. However, you do not need to explicitly spell out all columns that you want to filter for; the "tidy" way would be to use a helper function like `if_all()` or `if_any()`. Consider this example (I hope I understand your intentions correctly):

```
library(dplyr)

data <- tribble(
  ~first.a, ~first.b, ~first.c,
  1L,       1L,       0L,
  NA,       1L,       0L,
  1L,       0L,       NA,
  NA,       NA,       1L
)
```

Let's say we only want to keep rows that have a non-missing value for either `first.a` or `first.b` (or hypothetical later generations like `second.a` and `second.b` etc.):

```
data |>
  filter(if_any(ends_with(c(".a", ".b")), \(x) !is.na(x)))
```

So: `filter()` (keep observations) `if_any` of the columns ending with .a or .b is not `NA` (we have to wrap `!is.na` into an anonymous function for it to be a valid argument type). This would yield

```
# A tibble: 3 × 3
  first.a first.b first.c
    <int>   <int>   <int>
1       1       1       0
2      NA       1       0
3       1       0      NA
```

discarding only the row where both of them are missing. Another way of writing this would be

```
data |>
  filter(!if_all(ends_with(c(".a", ".b")), is.na))
```

i.e. don't keep rows where all columns ending in .a or .b are `NA`, which returns the same result.

Hope this helps,

Lennart Kasserra
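To get back to the original goal (at least one non-missing value in each of several groups of columns), one possible sketch is to give `filter()` one `if_any()` condition per group; multiple conditions passed to `filter()` are combined with `&`. The tibble `mydata` is invented for illustration, and selecting groups with `starts_with()` assumes that the column names share a prefix per generation, as they do in the original post. It also assumes a dplyr version that provides `if_any()` (1.0.4 or later) and R 4.1 or later for the `\(x)` syntax:

```r
library(dplyr)

# Invented data: row 3 has no usable value in the first pair
mydata <- tibble(
  first.a  = c(1, NA, NA),
  first.b  = c(NA, 2, NA),
  second.a = c(1, 1, 1),
  second.b = c(NA, 2, 2),
  third.a  = c(NA, 3, 3),
  third.b  = c(3, NA, NA)
)

# One if_any() per generation; filter() joins the conditions with `&`
result <- mydata |>
  filter(
    if_any(starts_with("first."),  \(x) !is.na(x)),
    if_any(starts_with("second."), \(x) !is.na(x)),
    if_any(starts_with("third."),  \(x) !is.na(x))
  )
nrow(result)
# [1] 2
```

This avoids the helper columns of the `mutate()` workaround, and a group can hold any number of columns as long as the selection helper picks them out.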