Jocelyn Ireson-Paine
2015-Mar-12 07:55 UTC
[R] How to filter data using sets generated by flattening with dcast, when I can't store those sets in a data frame
This is a fairly long question. It's about a problem that's easy to specify in terms of sets, but that I found hard to solve in R by using them, because of the strange design of R data structures. In explaining it, I'm going to touch on the reshape2 library, dcast, sets, and the non-orthogonality of R.

My problem stems from some drug-trial data that I've been analysing for the Oxford Pain Research Unit. Here's an example. Imagine a data frame representing patients in a trial of pain-relief drugs. The trial lasts for ten days. Each patient's pain is measured once a day, and the values are recorded in a data frame, one row per patient per day. Like this:

  ID  Day  Pain
   1    1    10
   1    2     9
   1    4     7
   1    7     2
   2    2     8
   2    3     7
   3    1    10
   3    3     6
   3    4     6
   3    8     2

Unfortunately, many patients have measurements missing. Thus, in the example above, patient 1 was only observed on days 1, 2, 4, and 7, rather than on the full ten days. But a patient's measurements are only useful to us if that patient has a certain minimum set of days, so I need to check for patients who lack those days. Let's assume that these days are numbers 1, 4, and 9.

Such a question is trivial to state in terms of sets. Let D(i) denote the set of days on which patient i was measured: then I want to find out which patients p, or how many patients p, have a D(p) that contains the set {1,4,9}.

The obvious way to solve this is to write a function that tells me whether one set is a superset of another. Then flatten my data frame so that it looks like this:

  ID  Days
   1  {1,2,4,7}
   2  {2,3}
   3  {1,3,4,8}

And finally, filter it by some R translation of

  flattened[ includes( flattened$Days, {1,4,9} ), ]

I started with the built-in functions that operate on sets represented as vectors. These are described in https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html , "Set Operations". For example:

  > union( c(1,2,3), c(2,4,6) )
  [1] 1 2 3 4 6
  > intersect( c(1,2,3), c(2,4,6) )
  [1] 2

So I first wrote a set-inclusion function:

  # True if vector a is a superset of vector b.
  #
  includes <- function( a, b )
  {
    return( setequal( union( a, b ), a ) )
  }

Here are some sample calls:

  > includes( c(1), c() )
  [1] TRUE
  > includes( c(1), c(1) )
  [1] TRUE
  > includes( c(1), c(1,2) )
  [1] FALSE
  > includes( c(2,1), c(1,2) )
  [1] TRUE
  > includes( c(2,1,3), c(1,2) )
  [1] TRUE
  > includes( c(2,1,3), c(4,1,2) )
  [1] FALSE

I then made myself a variable holding my sample data frame:

  df <- data.frame( ID  = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
                  , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
                  )

And I tried flattening it, using dcast and an aggregator function as described in (amongst many other places) http://seananderson.ca/2013/10/19/reshape.html , "An Introduction to reshape2" by Sean C. Anderson.

The idea behind this is that (for my data) dcast will call the aggregator function once per patient ID, passing it all the Day values for the patient. The aggregator must combine them in some way, and dcast puts its results into a new column. For example, here's an aggregator that merely sums its arguments:

  aggregator_making_sum <- function( ... )
  {
    return( sum( ... ) )
  }

If I call it, I get this:

  > dcast( df, ID~. , fun.aggregate=aggregator_making_sum )
  Using Day as value column: use value.var to override.
    ID  .
  1  1 14
  2  2  5
  3  3 16

And here's an aggregator that converts the argument list to a string:

  aggregator_making_string <- function( ... )
  {
    return( toString( ... ) )
  }

Calling it gives this:

  > dcast( df, ID~. , fun.aggregate=aggregator_making_string )
  Using Day as value column: use value.var to override.
    ID          .
  1  1 1, 2, 4, 7
  2  2       2, 3
  3  3 1, 3, 4, 8

In both of these, the three dots denote all arguments to the aggregator, as explained in Burns Statistics's http://www.burns-stat.com/the-three-dots-construct-in-r/ . My first aggregator sums them; my second converts them to a string. Both uses of dcast generate a data frame with a column named "." , which contains the aggregates. In the second data frame, that may not be so clear: the first column of numbers is row numbers; the second column of numbers are the IDs; and the remaining columns form the strings, belonging to "." .

But what I want is neither a sum nor a string but a set. Specifically, a set that's compatible with the R set operations I called in my 'includes' function. Since these sets are vectors, my aggregator should just pack its arguments into a vector:

  aggregator_making_set <- function( ... )
  {
    return( c( ... ) )
  }

But when I tried it, I got an error:

  > dcast( df, ID~. , fun.aggregate=aggregator_making_set )
  Using Day as value column: use value.var to override.
  Error in vapply(indices, fun, .default) : values must be length 0,
   but FUN(X[[1]]) result is length 4

It's not an informative error message, because it expects me to know how dcast is coded. And I'm surprised that values need to be length 0: length 1 would seem more appropriate. But perhaps it's trying to say that 'c' doesn't work on three-dots argument lists. Let's test that hypothesis:

  test_c_on_three_dots <- function( ... )
  {
    return( c( ... ) )
  }

  > test_c_on_three_dots( 1 )
  [1] 1
  > test_c_on_three_dots( 1, 2 )
  [1] 1 2
  > test_c_on_three_dots( 1, 2, 3 )
  [1] 1 2 3

So 'c' does indeed work on three-dots argument lists. The error must have been caused by something else. Let's try making a set and putting it into a data frame directly:

  > df <- data.frame( col1=c(1,2), col2=c(3,4) )
  > df
    col1 col2
  1    1    3
  2    2    4
  > set <- union( c(5,6), c(6,7) )
  > set
  [1] 5 6 7
  > df[ 1, ]$col1 <- set
  Error in `$<-.data.frame`(`*tmp*`, "col1", value = c(5, 6, 7)) :
    replacement has 3 rows, data has 1

So that's the problem. Already in 1968, there was a language named Algol68 which had arrays and, in order to make things easy for its programmers, allowed you to create arrays of every data type the language provided. You could have arrays of Booleans, arrays of integers, arrays of records, arrays of discriminated unions, arrays of procedures, arrays of I/O formats, arrays of pointers, and arrays of arrays. The idea was "orthogonality" (see for example http://stackoverflow.com/questions/1527393/what-is-orthogonality ): that the programmer does not have to think about unexpected interactions between the concept of array and the concept of the element type, because there are none. If you have a data type, you can make arrays of that type. Pop-2 (1970), Snobol4 (1966), and Lisp (1958) were similarly generous. But R (1993) isn't. It wants to make life hard by forcing me to use different kinds of container for different kinds of element. And by providing a nice implementation of sets and then not letting me store them.

So I thought about the kinds of data that I _can_ store in a data frame and generate by flattening. Strings! So I decided to use my aggregator_making_string function to make a string representation of the set of days, and to write a set-inclusion function that compared these sets against sets represented as vectors:

  includes2 <- function( a_as_string, b )
  {
    a <- as.numeric( unlist( strsplit( a_as_string, split="," ) ) )
    return( setequal( union( a, b ), a ) )
  }

Here are some example calls:

  > includes2( '1,2,3', c(1) )
  [1] TRUE
  > includes2( '1,2,3', c(1,2) )
  [1] TRUE
  > includes2( '1,2,3', c(1,2,4) )
  [1] FALSE
  > includes2( '1,2,3', c(3) )
  [1] TRUE
  > includes2( '1,2,3', c(0,3) )
  [1] FALSE

I then tried using it:

  df <- data.frame( ID  = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
                  , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
                  )

  aggregator_making_string <- function( ... )
  {
    return( toString( ... ) )
  }

  flattened <- dcast( df, ID~. , fun.aggregate=aggregator_making_string )

  # Which patients have a day 1?
  flattened[ includes2( flattened$. , c(1) ), ]

Unfortunately, that didn't work. The final statement selected every row of 'flattened'. I eventually realised that I had to vectorise 'includes2':

  includes3 <- Vectorize( includes2, "a_as_string" )

And that did work:

  > flattened[ includes3( flattened$. , c(1) ), ]
    ID          .
  1  1 1, 2, 4, 7
  3  3 1, 3, 4, 8
  > flattened[ includes3( flattened$. , c(1,2) ), ]
    ID          .
  1  1 1, 2, 4, 7
  > flattened[ includes3( flattened$. , c(1,3) ), ]
    ID          .
  3  3 1, 3, 4, 8
  > flattened[ includes3( flattened$. , c(2) ), ]
    ID          .
  1  1 1, 2, 4, 7
  2  2       2, 3

The moral of this email tale is that sets are really useful for filtering data, and dcast ought to be really useful for generating sets, but R refuses to let me store them in the data frame that dcast generates. I can fudge it by representing the sets as strings, but is there a cleaner way to solve the problem?

Cheers,

Jocelyn Ireson-Paine
07768 534 091
http://www.jocelyns-cartoons.uk
http://www.j-paine.org
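For what it's worth, a base-R data frame can hold a vector in each cell when the column is a list. The sketch below is only a minimal illustration under that assumption; the column name `sets` and the variable `row1_has_5_and_7` are made up here, and it says nothing about what dcast itself will accept.

```r
df <- data.frame( col1 = c( 1, 2 ), col2 = c( 3, 4 ) )
set <- union( c( 5, 6 ), c( 6, 7 ) )

# A list column has one list element per row, and each element may be
# a vector of any length. `sets` is a hypothetical column name.
df$sets <- list( set, c( 2, 3 ) )

# Replace the set held in row 2; double brackets address a single cell.
df$sets[[ 2 ]] <- c( 8, 9, 10 )

# The superset test from the message, applied to one stored set.
includes <- function( a, b ) setequal( union( a, b ), a )
row1_has_5_and_7 <- includes( df$sets[[ 1 ]], c( 5, 7 ) )
```

On the un-vectorised 'includes2' call: strsplit followed by unlist pools the days of every patient into one vector, so the test yields a single TRUE, and a single TRUE used as a row index is recycled to select every row.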
David Barron
2015-Mar-12 09:20 UTC
[R] How to filter data using sets generated by flattening with dcast, when I can't store those sets in a data frame
Most of this question is over my head, I'm afraid, but looking at what I think is the crux of your question, couldn't you achieve the results you want in two steps, like this:

  dta <- data.frame(ID=c(1,1,1,1,2,2,3,3,3,3),
                    Day=c(1,2,4,7,2,3,1,3,4,8),
                    Pain=c(10,9,7,2,8,7,10,6,6,2))

  l1 <- tapply(dta$Day, dta$ID, function(x) x)

  sapply(l1, function(x) all(c(1,4,8) %in% x ))

I'm not sure you really need to do it in two steps, but given you said you wanted a flattened data frame with the Days as a vector, this will give it to you. Actually, l1 is a list, but you can turn it into a data frame if you really want to. In the sapply call I changed the days required to 1, 4 and 8 to show that it does return TRUE if there is a patient that meets the required criterion.

David
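If the intermediate list of days is not needed, the tapply and sapply steps suggested in this reply can probably be collapsed into one. This is a sketch only; the names `required`, `by_id` and `ids` are introduced here for illustration and do not come from the thread.

```r
dta <- data.frame(ID   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                  Day  = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8),
                  Pain = c(10, 9, 7, 2, 8, 7, 10, 6, 6, 2))

required <- c(1, 4, 8)

# One logical per patient: TRUE if all required days were observed.
by_id <- tapply(dta$Day, dta$ID, function(days) all(required %in% days))

# IDs (as character labels) of the patients that qualify, and their count.
ids <- names(by_id)[as.vector(by_id)]
n_qualifying <- sum(by_id)
```

Because the function returns a single logical per group, tapply hands back a one-dimensional array named by ID rather than a list, which is why no second pass is needed.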
William Dunlap
2015-Mar-12 14:53 UTC
[R] How to filter data using sets generated by flattening with dcast, when I can't store those sets in a data frame
In base R you can do what I think you want with aggregate() and Filter(). E.g.,

  > a <- aggregate(df["Day"], df["ID"], function(x)x)
  > str(a)
  'data.frame':   3 obs. of  2 variables:
   $ ID : num  1 2 3
   $ Day:List of 3
    ..$ 1: num  1 2 4 7
    ..$ 5: num  2 3
    ..$ 7: num  1 3 4 8
  > i14 <- Filter(function(i){all(c(1,4) %in% a$Day[[i]])}, seq_len(nrow(a)))
  > a[i14,]
    ID        Day
  1  1 1, 2, 4, 7
  3  3 1, 3, 4, 8

Note that 'reshape2' is not 'R', it is a user-contributed package that runs in R.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
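The list column that aggregate() produces can also be filtered in the shape the original question asked for, `flattened[ includes( flattened$Days, ... ), ]`, by wrapping the 'includes' function from the first message. A rough sketch, in which `includes_each` is a hypothetical helper name and `a` is rebuilt from the sample data so that the example stands alone:

```r
df <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# Groups have different sizes, so Day comes back as a list column.
a <- aggregate(df["Day"], df["ID"], function(x) x)

# Superset test from the original message.
includes <- function(a, b) setequal(union(a, b), a)

# Apply the test to every stored set; returns one logical per row.
includes_each <- function(sets, b) {
  unname(vapply(sets, includes, logical(1), b = b))
}

has_1_and_4 <- a[includes_each(a$Day, c(1, 4)), ]
```

Returning a logical mask rather than row indices keeps the filter composable with other conditions through `&` and `|`.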
Jocelyn Ireson-Paine
2015-Mar-15 20:06 UTC
[R] How to filter data using sets generated by flattening with dcast, when I can't store those sets in a data frame
David, and also William Dunlap, thanks for taking the time to reply, with examples. Both your answers are very helpful. William noted that 'reshape2' is not 'R', but a user-contributed package that runs in R. I agree, and I'm not confusing one with the other. But what I don't like is that somewhere in the interaction between them, generality is lost. I contrast this with a means of aggregating data that I use when programming in Lisp, Prolog, and other "functional" languages. This is aggregation by "folding" a list of values. The idea is explained at http://wiki.tcl.tk/17983 , "Fold in functional programming" by "juef", amongst other places. He/she gives a common example: take a list of values, such as (1 2 3 4) and "fold" the + operation over it. Doing so runs + along the list forming intermediate sums and adding the next value to them, until all values have been summed. Here, 'fold' is analogous to dcast, with + being analogous to the function dcast takes for its fun.aggregate argument. But the good thing about 'fold' is that it does not restrict the type of result that its aggregation function can return. The result can be a number, a string, a list, a list of lists, an array, or any other type. I'd like dcast to be as general. Jocelyn Ireson-Paine 07768 534 091 http://www.jocelyns-cartoons.uk http://www.j-paine.org On Thu, 12 Mar 2015, David Barron wrote:> Most of this question is over my head, I'm afraid, but looking at what > I think is the crux of your question, couldn't you achieve the results > you want in two steps, like this: > > dta <- data.frame(ID=c(1,1,1,1,2,2,3,3,3,3), > Day=c(1,2,4,7,2,3,1,3,4,8),Pain=c(10,9,7,2,8,7,10,6,6,2)) > > l1 <- tapply(dta$Day, dta$ID, function(x) x) > > sapply(l1, function(x) all(c(1,4,8) %in% x )) > > I'm not sure you really need to do it in two steps, but given you said > you wanted a flattened data frame with the Days as a vector, this will > give it to you. 
> Actually, l1 is a list, but you can turn it into a
> data frame if you really want to. In the sapply call I changed the
> days required to 1, 4 and 8 to show that it does return TRUE if there
> is a patient that meets the required criterion.
>
> David
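For readers following the 'fold' comparison and the list returned by tapply above, here is a minimal sketch in base R. It is not taken from any message in the thread: it assumes that base R's Reduce() can stand in for 'fold', and that a list column is an acceptable way to keep one set per row. The names total, days_seen, days_by_id, flattened, required and keep are made up for illustration, and includes() is re-defined here with %in% rather than union/setequal.

```r
# Sample data from the thread.
dta <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                  Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# Reduce() is base R's fold. Folding + over a vector gives a number...
total <- Reduce(`+`, c(1, 2, 3, 4))

# ...and folding union() over a list of vectors gives a set (a vector),
# so the result type is whatever the folded function returns.
days_seen <- Reduce(union, list(c(1, 2), c(2, 4), c(4, 7)))

# split() groups Day by ID and returns a list of vectors, one per patient.
days_by_id <- split(dta$Day, dta$ID)

# A data frame column may itself be a list, so each row can hold a whole set.
flattened <- data.frame(ID = as.numeric(names(days_by_id)))
flattened$Days <- unname(days_by_id)

# Set inclusion: TRUE if every element of b is in a.
includes <- function(a, b) all(b %in% a)

# Days required, as in David's reply (the original question used 1, 4, 9).
required <- c(1, 4, 8)

# Apply the test to each row's set, then filter the data frame.
keep <- vapply(flattened$Days, includes, logical(1), b = required)
flattened[keep, ]
```

The list column is what allows the sets to sit beside the IDs. The row-wise vapply() does the job that Vectorize() did for the string version in the original post.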
Michael Lawrence
2015-Mar-17 03:01 UTC
[R] How to filter data using sets generated by flattening with dcast, when I can't store those sets in a data frame
The data structures you mention are complex, and too much of their complexity leaks into client code. Instead, aim to use higher-level code constructs on simpler data structures. In R, the most convenient operations are those that are statistical in nature. For example, one might solve your problem by saying: restrict to the subset of the important days (1, 4, 8), and count the occurrences of each patient in that subset to find those patients with a sufficient number of important measurements.

    criticalDays <- c(1, 4, 8)
    # Rows that fall on one of the critical days.
    criticalDf <- subset(df, Day %in% criticalDays)
    # TRUE for each patient who has a row for every critical day.
    keepPatients <- table(criticalDf$ID) == length(criticalDays)
    # Index by the factor ID to keep all rows of those patients.
    df[keepPatients[df$ID],]

Note that the code above assumes that the ID variable is a factor, which it should be.

Michael
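The counting approach in Michael's reply can also be sketched so that it does not depend on ID being a factor. This is a variation, not Michael's code: it assumes that matching on ID names is acceptable, it counts distinct critical days per patient in case a day is recorded twice, and the names nCritical, completeIds and complete are hypothetical.

```r
# Sample data from the thread, with ID left as a plain numeric column.
df <- data.frame(ID   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Day  = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8),
                 Pain = c(10, 9, 7, 2, 8, 7, 10, 6, 6, 2))
criticalDays <- c(1, 4, 8)

# Keep only the rows that fall on a critical day.
criticalDf <- subset(df, Day %in% criticalDays)

# Count distinct critical days per patient (guards against duplicate rows).
nCritical <- tapply(criticalDf$Day, criticalDf$ID,
                    function(d) length(unique(d)))

# IDs, as character names, of patients who have every critical day.
completeIds <- names(nCritical)[nCritical == length(criticalDays)]

# Match by name rather than by position, so ID need not be a factor.
complete <- df[as.character(df$ID) %in% completeIds, ]
complete
```

With the sample data, only patient 3 has measurements on days 1, 4 and 8, so only that patient's rows are kept.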