thr3ads.net - R help - [R] meaning of formula in aggregate function [Jan 2011]


Den

2011-Jan-22 12:44 UTC

[R] meaning of formula in aggregate function

Dear R community
Recently, dear Henrique Dallazuanna literally saved me by solving a
data transformation problem, which follows:

(n_, _n, j_, k_ signify numbers)

SOURCE DATA:   
id      cycle1  cycle2  cycle3  ...     cycle_n
1       c       c       c               c
1       m       m       m               m
1       f       f       f               f
2       m       m       m               NA
2       f       f       f               NA
2       c       c       c               NA
3       a       a       NA              NA
3       c       c       c               NA
3       f       f       f               NA
3       NA      NA      m               NA
...........................................


Q: How to transform source data to:
RESULT DATA:
id      cyc1    cyc2    cyc3    ...     cyc_n
1       cfm     cfm     cfm             cfm
2       cfm     cfm     cfm             
3       acf     acf     cfm             
...........................................

 

Henrique's solution is:

aggregate(. ~ id, lapply(df, as.character),
          FUN = function(x) paste(sort(x), collapse = ''),
          na.action = na.pass)


Could somebody EXPLAIN HOW IT WORKS?
I mean Henrique saved my investigation indeed.
However, considering the fact, that I am about to perform investigation
of cancer chemotherapy in 500 patients, it would be nice to know what 
I am actually doing.

1. All the help says about the LHS in formulas like '.~id' is that its
name is "dot notation", and not a single word more. Thus, I have no
clue what the dot in that formula really means.
2. help says:
 Note that 'paste()' coerces 'NA_character_', the character missing
value, to '"NA"'
And at the same time:
 'na.pass' returns the object unchanged.
I am happy, that I don't have NAs in mydata.  I just don't understand
how it happened.
3. I can't see the real difference between 'FUN = function(x) paste(x)'
and 'FUN = paste'. However, the former works perfectly while the latter
simply does not.


All I can follow from the code above is that R breaks the data into groups
with the same id, then tears each little 'cycle' piece into separate
characters, then sorts them and puts these characters together within the
same id for each 'cycle'. I miss how R puts all this mess back together
into a nice data frame of long format. NAs are also a question, as I said
before.

Could you please shed some light on it, if you don't mind answering these
naive questions.

P Ehlers

2011-Jan-22 15:36 UTC


[R] meaning of formula in aggregate function

Den wrote:
> Dear R community
> Recently, dear Henrique Dallazuanna literally saved me solving one
> problem on data transformation which follows:
> 
> (n_, _n, j_, k_ signify numbers)
> 
> SOURCE DATA:   
> id      cycle1  cycle2  cycle3  ...     cycle_n
> 1       c       c       c               c
> 1       m       m       m               m
> 1       f       f       f               f
> 2       m       m       m               NA
> 2       f       f       f               NA
> 2       c       c       c               NA
> 3       a       a       NA              NA
> 3       c       c       c               NA
> 3       f       f       f               NA
> 3       NA      NA      m               NA
> ...........................................
> 
> 
> Q: How to transform source data to:
> RESULT DATA:
> id      cyc1    cyc2    cyc3    ...     cyc_n
> 1       cfm     cfm     cfm             cfm
> 2       cfm     cfm     cfm             
> 3       acf     acf     cfm             
> ...........................................
> 
>  
> 
> Henrique's solution is:
> 
> aggregate(. ~ id, lapply(df, as.character),
>           FUN = function(x) paste(sort(x), collapse = ''),
>           na.action = na.pass)
> 
> 
> Could somebody EXPLAIN HOW IT WORKS?
> I mean Henrique saved my investigation indeed.
> However, considering the fact, that I am about to perform investigation
> of cancer chemotherapy in 500 patients, it would be nice to know what 
> I am actually doing.
> 
> 1. All help says about LHS in formulas like '.~id' is that it's
> name is "dot notation". And not a single word more. Thus, I have no
> clue, what dot in that formula really means.
Well, ?aggregate does (rather gently) point you to the
help page for _formula_ where you will find quite a few
words about the use of '.' in the Details section.
> 2. help says:
>  Note that 'paste()' coerces 'NA_character_', the character missing
> value, to '"NA"'
> And at the same time:
>  'na.pass' returns the object unchanged.
> I am happy, that I don't have NAs in mydata.  I just don't understand
> how it happened.
I don't understand what you're asking.
> 3. Can't see the real difference between 'FUN = function(x) paste(x)'
> and 'FUN = paste'. However, former works perfectly while latter simply
> do not.
That's not quite true. You're using paste(sort(x)) and not
just x in Henrique's solution. And that's precisely
the point: when a function is not 'simple', you need to
define it. Henrique is defining it 'on the fly'; you
could also define it separately before the aggregate()
call and then use it like this:

myfun <- function(x) paste(sort(x), collapse='')
aggregate(...., FUN = myfun, ....)
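
To see why the two choices of FUN behave differently, it may help to try them on a single group outside of aggregate(). The following is only a small sketch: the vector x is an illustrative stand-in for one cycle column within one id, using values from the sample data in the thread.

```r
# Illustrative values: one cycle column for one id (assumed, taken from the sample data)
x <- c("m", "f", "c")

# Plain paste() has nothing to combine with, so it returns the three
# strings unchanged; the group is not reduced to a single value.
pasted <- paste(x)

# The defined function sorts first and then collapses to one string per group.
myfun <- function(x) paste(sort(x), collapse = "")
collapsed <- myfun(x)

print(pasted)     # still three separate elements
print(collapsed)  # a single string
```

Because aggregate() is meant to summarise each group, a FUN that returns one value per group (as myfun does) gives the one-row-per-id layout asked for, while plain paste does not.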

Peter Ehlers
> 
> 
> All I can follow from code above is that R breaks data on groups with
> same id, then it tear each little 'cycle' piece in separate characters,
> then sorts them and put together these characters within same id on each
> 'cycle'. I miss how R put together all this mess back into nice data
> frame of long format. NAs is also a question, as I said before.
> 
> Could you please put some light on it if you don't mind to answer those
> naive  questions.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Den

2011-Jan-23 02:58 UTC


[R] meaning of formula in aggregate function

Dear Dennis
Thank you very much for your comprehensive reply and for the time you
spent dealing with my e-mail.
Your kind explanation made things clearer for me.
After your explanation it looks simple.
lapply with the chosen options takes a small part of cycle<n> with the same id
(e.g. df[df$id==3,"cycle2"]) and makes from it just a bunch of
characters.
The only thing I still don't get is how this code gets rid of the
NAs, but this is a rather minor technical issue. My main question was
about the formula. You helped me indeed.
Thank you again
Have a nice day
Denis
From bending but not broken Belarus

On Sat, 22/01/2011 at 17:55 -0800, Dennis Murphy wrote:
> Hi:
> 
> I wouldn't pretend to speak for Henrique, but I'll give it a shot.
> 
> On Sat, Jan 22, 2011 at 4:44 AM, Den <d.kazakiewicz at gmail.com> wrote:
>         Dear R community
>         Recently, dear Henrique Dallazuanna literally saved me solving
>         one
>         problem on data transformation which follows:
>         
>         (n_, _n, j_, k_ signify numbers)
>         
>         SOURCE DATA:
>         id      cycle1  cycle2  cycle3  ...     cycle_n
>         1       c       c       c               c
>         1       m       m       m               m
>         1       f       f       f               f
>         2       m       m       m               NA
>         2       f       f       f               NA
>         2       c       c       c               NA
>         3       a       a       NA              NA
>         3       c       c       c               NA
>         3       f       f       f               NA
>         3       NA      NA      m               NA
>         ...........................................
>         
>         
>         Q: How to transform source data to:
>         RESULT DATA:
>         id      cyc1    cyc2    cyc3    ...     cyc_n
>         1       cfm     cfm     cfm             cfm
>         2       cfm     cfm     cfm
>         3       acf     acf     cfm
>         ...........................................
>         
>         
>         
>         Henrique's solution is:
>         
>         aggregate(. ~ id, lapply(df, as.character),
>                   FUN = function(x) paste(sort(x), collapse = ''),
>                   na.action = na.pass)
> 
> The first part, . ~ id, is the formula. It's using every available
> variable in the input data on the left hand side of the formula except
> for id, which is the grouping variable.
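
One way to convince yourself of what the dot stands for is to write the left-hand side out by hand and compare. This is a minimal sketch, assuming a hypothetical two-cycle data frame df modelled on the sample data; the helper name collapse_sorted is made up for the illustration.

```r
# Hypothetical cut-down data (assumed), modelled on the thread's example
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2),
  cycle1 = c("c", "m", "f", "m", "f", "c"),
  cycle2 = c("c", "m", "f", "m", "f", "c"),
  stringsAsFactors = FALSE
)

# Illustrative helper: sort the values of a group, then glue them together
collapse_sorted <- function(x) paste(sort(x), collapse = "")

# Dot on the left-hand side: every column of df that is not on the right
with_dot <- aggregate(. ~ id, data = df, FUN = collapse_sorted)

# The same request with the left-hand side written out explicitly
spelled_out <- aggregate(cbind(cycle1, cycle2) ~ id, data = df,
                         FUN = collapse_sorted)

print(with_dot)
print(spelled_out)
```

Both calls should produce the same table, which is why the dot is convenient when the number of cycle columns is large or not known in advance.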
> 
> The data object is lapply(df, as.character), which is a list object
> that translates every element to character. As far as I can tell, it
> is a plain list with one component per column of df (id and each
> cycle variable), each component being a character vector with one
> entry per row. Nothing is grouped by id at this point; the formula
> method of aggregate() builds a model frame from this list and does
> the splitting by id itself.
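
The structure of the lapply() result can be checked directly rather than guessed. The sketch below assumes a hypothetical four-row data frame df holding only the rows for id 3 from the sample data.

```r
# Hypothetical data (assumed): the rows for id 3, two cycle columns only
df <- data.frame(
  id     = c(3, 3, 3, 3),
  cycle1 = c("a", "c", "f", NA),
  cycle3 = c(NA, "c", "f", "m")
)

# Apply as.character() to each column in turn
chars <- lapply(df, as.character)

# Inspect the result: a list whose components are the columns of df
str(chars)
```

The result is a flat list with one character vector per column, not a list of lists. Any splitting by id happens later, inside aggregate().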
> 
> The function to be applied to each id is described in FUN. As Peter
> mentioned, it's an 'anonymous' function, which means it is defined
> in-line. In this case, a generic input object x has its elements
> sorted in increasing order and then combined into a
> single string (the purpose of collapse = ); NA values are dropped by
> sort(), whose default is na.last = NA. Thus the characters in each
> cycle/id combination are first
> sorted and then combined into a single string, which is then output as
> the result. Because of the way Henrique wrote the formula, aggregate()
> will march through each cycle variable within id and execute
> the function, and then iterate the process over all ids.
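
The fate of the NAs can be followed on a single group. This is a small sketch using the cycle1 values for id 3 from the sample data; the variable names are made up for the illustration.

```r
# cycle1 values for id 3 in the sample data (one of them is missing)
x <- c("a", "c", "f", NA)

# sort() drops missing values by default (na.last = NA)
sorted <- sort(x)

# So the NA never reaches paste()
with_sort <- paste(sorted, collapse = "")

# Without sort(), paste() would turn the NA into the text "NA"
without_sort <- paste(x, collapse = "")

# A group with nothing but NAs collapses to an empty string
all_missing <- c(NA_character_, NA_character_)
empty_result <- paste(sort(all_missing), collapse = "")

print(with_sort)
print(without_sort)
print(empty_result)
```

So na.action = na.pass only keeps the rows containing NA from being discarded before FUN is called; it is sort() inside FUN that removes the NA entries afterwards.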
> 
>         
>         
>         Could somebody EXPLAIN HOW IT WORKS?
>         I mean Henrique saved my investigation indeed.
>         However, considering the fact, that I am about to perform
>         investigation
>         of cancer chemotherapy in 500 patients, it would be nice to
>         know what
>         I am actually doing.
> 
> Henrique's R knowledge is on a different level from most of us, so I
> understand your question :) 
> 
>         
>         1. All help says about LHS in formulas like '.~id' is that
>         it's name is "dot notation". And not a single word more.
>         Thus, I have no clue, what dot in that formula really means.
> 
> . is shorthand for 'everything not otherwise specified in the model
> formula'. In this case, it represents the entire set of cycle
> variables.
>  
> 
>         2. help says:
>          Note that 'paste()' coerces 'NA_character_', the character
>         missing
>         value, to '"NA"'
>         And at the same time:
>          'na.pass' returns the object unchanged.
>         I am happy, that I don't have NAs in mydata.  I just don't
>         understand
>         how it happened.
>         3. Can't see the real difference between 'FUN = function(x)
>         paste(x)'
>         and 'FUN = paste'. However, former works perfectly while
>         latter simply
>         do not.
>         
>         
>         All I can follow from code above is that R breaks data on
>         groups with
>         same id, then it tear each little 'cycle' piece in separate
>         characters,
>         then sorts them and put together these characters within same
>         id on each
>         'cycle'. I miss how R put together all this mess back into
>         nice data
>         frame of long format. NAs is also a question, as I said
>         before.
> 
> By default, aggregate() will try to return a data frame. For each id,
> it will output the id and the result of the function applied to each
> cycle variable, so there should be one row for each id, and n + 1
> columns for the n cycle variables + id.
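
Putting the pieces together, the whole call can be run on a small reconstruction of the sample data. This is a sketch under assumptions: df below is a hypothetical three-cycle version of the table in the original post, with the elided columns left out.

```r
# Hypothetical reconstruction (assumed) of the first three cycle columns
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", "m", "f", "c", NA, "c", "f", "m")
)

# The call from the thread: one row per id, one column per cycle variable
res <- aggregate(. ~ id, lapply(df, as.character),
                 FUN = function(x) paste(sort(x), collapse = ""),
                 na.action = na.pass)

print(res)
dim(res)   # rows = number of ids, columns = number of cycle variables + 1
```

With three ids and three cycle variables the result has three rows and four columns, matching the RESULT DATA layout requested in the original question.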
> 
> Does that help?
> 
> Cheers,
> Dennis 
> 
>         
>         Could you please put some light on it if you don't mind to
>         answer those
>         naive  questions.
>         
