Hello R users,
I'm writing a brief tutorial of getting statistical measures by splitting
according strata and over columns. When I used plyr::ddply I got and
unexpected result, with NA/NaN for non existing cells. Below is a minimal
reproducible code with the result that I got. For comparison, the result of
aggregate is showed. Why this behaviour? What I can do to avoid it?
> require(plyr)
>
> hab <-
+
read.table("http://www.leg.ufpr.br/~walmes/data/ipea_habitacao.csv",
+ header=TRUE, sep=",", stringsAsFactors=FALSE,
quote="",
+ encoding="utf-8")>
> hab <- hab[,-ncol(hab)]
> names(hab) <- c("sig", "cod", "mun",
"agua", "ener", "tel", "carro",
+ "comp", "tot")> hab <- transform(hab, sig=factor(sig))
> hab$siz <- cut(hab$tot, breaks=c(-Inf, 5000, Inf),
+ labels=c("P","G"))> str(hab, ve.len=1)
'data.frame': 5596 obs. of 10 variables:
$ sig : Factor w/ 27 levels "AC","AL","AM",..: 1
1 1 1 1 1 1 1 1 1 ...
$ cod : int 1200013 1200054 1200104 1200138 1200179 1200203 1200252
1200302 1200328 1200336 ...
$ mun : chr "Acrel?ndia" "Assis Brasil"
"Brasil?ia" "Bujari" ...
$ agua : num 21.5 27.4 26.9 17.3 13.1 ...
$ ener : num 56.2 65.3 55.9 43.9 35.9 ...
$ tel : num 8.85 26.71 22.73 12.28 9.19 ...
$ carro: num 9.3 6.03 7.47 6.49 5.73 ...
$ comp : num 0.947 1.637 1.857 0.127 0.088 ...
$ tot : int 1878 807 4114 1365 1267 14368 2807 5268 740 2308 ...
$ siz : Factor w/ 2 levels "P","G": 1 1 1 1 1 2 1 2 1 1
...>
> xtabs(~siz+sig, hab)
sig
siz AC AL AM AP BA CE DF ES GO MA MG MS MT PA PB PE PI
PR RJ
P 17 72 47 13 266 103 0 43 187 152 671 52 100 80 195 97 199
310 27
G 5 29 15 3 149 81 1 34 55 65 182 25 26 63 28 88 22
89 64
sig
siz RN RO RR RS SC SE SP TO
P 147 34 14 355 236 56 396 129
G 19 18 1 112 57 19 249 10>
> x <- ddply(hab, ~sig+siz,
+ colwise(.fun=mean,
+ .cols=~agua+ener+tel+carro+comp,
na.rm=TRUE))> head(x)
sig siz agua ener tel carro comp
1 AC P 19.30229 51.30535 12.857118 5.395824 0.7028235
2 AC G 28.39300 65.95740 26.322800 8.942000 2.3806000
3 AL P 42.14337 81.20935 4.801500 7.069958 0.5332639
4 AL G 54.03966 87.47834 9.771724 9.428586 1.3583793
5 *AL <NA> NaN NaN NaN NaN NaN*
6 AM P 25.66202 61.12709 5.749596 1.980362
0.7629362>
> x <- ddply(hab, ~sig+siz,
+ colwise(.fun=sum,
+ .cols=~agua+ener+tel+carro+comp,
na.rm=TRUE))> head(x)
sig siz agua ener tel carro comp
1 AC P 328.139 872.191 218.571 91.729 11.948
2 AC G 141.965 329.787 131.614 44.710 11.903
3 AL P 3034.323 5847.073 345.708 509.037 38.395
4 AL G 1567.150 2536.872 283.380 273.429 39.393
5 *AL <NA> 0.000 0.000 0.000 0.000 0.000*
6 AM P 1206.115 2872.973 270.231 93.077 35.858>
> y <- aggregate(as.matrix(hab[,4:8])~sig+siz, data=hab, FUN=mean,
+ na.rm=TRUE)> head(y)
sig siz agua ener tel carro comp
1 AC P 19.30229 51.30535 12.857118 5.395824 0.7028235
2 AL P 42.14337 81.20935 4.801500 7.069958 0.5332639
3 AM P 25.66202 61.12709 5.749596 1.980362 0.7629362
4 AP P 37.20362 82.04515 14.929154 5.775923 0.7922308
5 BA P 41.23200 66.45516 4.524045 10.387203 0.8404624
6 CE P 36.78599 78.20176 6.339990 7.768981
0.6446990>
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=pt_BR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.8.1
loaded via a namespace (and not attached):
[1] compiler_3.1.1 Rcpp_0.11.2 tools_3.1.1>
Thanks in advance.
Walmes.
=========================================================================Walmes
Marques Zeviani
LEG (Laborat?rio de Estat?stica e Geoinforma??o, 25.450418 S, 49.231759 W)
Departamento de Estat?stica - Universidade Federal do Paran?
fone: (+55) 41 3361 3573
skype: walmeszeviani
homepage: http://www.leg.ufpr.br/~walmes
linux user number: 531218
=========================================================================
[[alternative HTML version deleted]]