thr3ads.net - R help - [R] Strange output daply with empty strata [Sep 2010]

Jan van der Laan

2010-Sep-09 09:43 UTC

[R] Strange output daply with empty strata

Dear list,

I get some strange results with daply from the plyr package. In the
example below, the average age per municipality is calculated for the
employed and the unemployed. If I do this using tapply (see code below) I
get the following result:

        no      yes
A       NA 36.94931
B 51.22505 34.24887
C 48.05759 51.00198

If I do this using daply:

municipality       no      yes
           A 36.94931 48.05759
           B 51.22505 51.00198
           C 34.24887       NA

daply generates the same numbers. However, these are not in the  
correct cells. For example, in municipality A everybody is employed.  
Therefore, the NA should be in the cell for unemployed in municipality  
A.

Am I using daply incorrectly or is there indeed something wrong with  
the output of daply?

Regards,

Jan


I am using version 1.1 of the plyr-package.


library(plyr)  # provides daply(), ddply() and .()

# Generate some test data
data.test <- data.frame(
   municipality=rep(LETTERS[1:3], each=10),
   employed=sample(c("yes", "no"), 30, replace=TRUE),
   age=runif(30,20,70))
# Make sure everybody is employed in municipality A
data.test$employed[data.test$municipality == "A"] <- "yes"

# Compare the output of tapply:
tapply(data.test$age, list(data.test$municipality, data.test$employed), mean)
# to that of daply:
daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
# results of ddply are the same as tapply
ddply(data.test, .(municipality, employed), function(d){mean(d$age)} )

Dennis Murphy

2010-Sep-09 10:32 UTC

Hi:

Here's what I tried:

# data frame versions (aggregate, ddply):

aggregate(age ~ municipality + employed, data = data.test, FUN = mean)
  municipality employed      age
1            B       no 55.57407
2            C       no 44.67463
3            A      yes 41.58759
4            B      yes 43.59330
5            C      yes 43.82545

ddply(data.test, .(municipality, employed), summarise, mean = mean(age))
  municipality employed     mean
1            A      yes 41.58759
2            B       no 55.57407
3            B      yes 43.59330
4            C       no 44.67463
5            C      yes 43.82545

It appears that aggregate() silently drops groups that contain no
observations, whereas ddply() has a .drop argument which, when set to FALSE,
returns NaN for the unemployed group in municipality A:

ddply(data.test, .(municipality, employed), summarise, avgage = mean(age),
      .drop = FALSE)
  municipality employed   avgage
1            A       no      NaN
2            A      yes 41.58759
3            B       no 55.57407
4            B      yes 43.59330
5            C       no 44.67463
6            C      yes 43.82545
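
For comparison, here is a minimal base R sketch of one way to keep the empty group when using aggregate(). It is an illustration, not something taken from either package's documentation: the seed and the names means, grid and full are arbitrary choices, and the empty cell comes out as NA rather than the NaN that ddply() reports.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
data.test <- data.frame(
  municipality = rep(LETTERS[1:3], each = 10),
  employed = sample(c("yes", "no"), 30, replace = TRUE),
  age = runif(30, 20, 70),
  stringsAsFactors = FALSE
)
# Make sure everybody is employed in municipality A
data.test$employed[data.test$municipality == "A"] <- "yes"

# aggregate() only returns groups that contain observations
means <- aggregate(age ~ municipality + employed, data = data.test, FUN = mean)

# Every municipality/employed combination, observed or not
grid <- expand.grid(
  municipality = sort(unique(data.test$municipality)),
  employed = c("no", "yes"),
  stringsAsFactors = FALSE
)

# Left join on the two shared columns: unobserved combinations get NA for age
full <- merge(grid, means, all.x = TRUE)
full <- full[order(full$municipality, full$employed), ]
print(full)
```

The left join is what keeps the empty cell: every row of grid survives, whether or not aggregate() produced a matching mean.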

#  tapply/daply

with(data.test, tapply(age, list(municipality, employed), mean))
        no      yes
A       NA 41.58759
B 55.57407 43.59330
C 44.67463 43.82545

daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
            employed
municipality       no      yes
           A 41.58759 44.67463
           B 55.57407 43.82545
           C 43.59330       NA

The .drop argument has a different meaning in daply. Some R functions have
an na.last argument, so perhaps a function call somewhere inside daply moves
all NAs to the end. The means are in the right order, but the NA sits in the
last cell instead of the first, where it belongs, so every value in the table
is shifted by one position. I've cc'ed Hadley on this.
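
The shift can be reproduced in base R without plyr. The sketch below takes the five group means from the ddply() output above and fills a 3 x 2 array in two ways. It only illustrates how this kind of offset can arise. It is not a confirmed description of what daply does internally, and the names long, dims, positional and by_name are made up for the example.

```r
# Group means in long format, copied from the ddply() output above
# (the empty A/no group is absent)
long <- data.frame(
  municipality = c("A", "B", "B", "C", "C"),
  employed     = c("yes", "no", "yes", "no", "yes"),
  mean         = c(41.58759, 55.57407, 43.59330, 44.67463, 43.82545),
  stringsAsFactors = FALSE
)

dims <- list(municipality = c("A", "B", "C"), employed = c("no", "yes"))

# Positional fill: pour the five values into the array in column-major
# order and pad the end with NA
positional <- array(c(long$mean, NA), dim = c(3, 2), dimnames = dims)

# Name-based fill: start from an all-NA array and place each value in the
# cell identified by its own labels
by_name <- array(NA_real_, dim = c(3, 2), dimnames = dims)
by_name[cbind(long$municipality, long$employed)] <- long$mean

print(positional)  # same layout as the daply output above
print(by_name)     # same layout as the tapply output above
```

Filling by label rather than by position does not depend on how many groups are empty or where they fall, which is why the second array matches tapply.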

HTH,
Dennis

